microsoft-SkillOpt/tests
Yifan Yang 937bc1ec4d feat(sleep): real tool-loop replay for gbrain quick-answerer (tool_called judge)
The 4th gbrain seed (quick-answerer) is judged by tool_called=search: the agent
must ACTUALLY call a search tool. Add an honest tool loop:

  - Backend.attempt_with_tools(task, skill, memory, tools) -> (response, tools_called)
  - Claude: exposes a real ./search shell shim, runs with --allowedTools Bash in a
    clean cwd; detects the call from the shim's log (not a self-reported marker).
  - Codex: same shim under `exec --sandbox workspace-write`.
  - Mock: deterministic — "calls" a tool iff skill/memory instructs it (for CI).
  - replay_one routes tasks with a tool_called check through the tool loop and
    feeds detected calls to the rule judge; ReplayResult gains tools_called.
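
A minimal sketch of the log-based detection described above, not the repository's code. It assumes a POSIX `sh` is available; the helper names (`make_search_shim`, `detect_tools_called`, `demo`) and the `search.log` file name are hypothetical. The point it illustrates is that the call is read from what the shim wrote, not from anything the agent says about itself.

```python
import stat
import subprocess
import tempfile
from pathlib import Path


def make_search_shim(workdir: Path) -> Path:
    """Write a ./search shim into workdir; return the path of its call log."""
    log = workdir / "search.log"  # hypothetical log file name
    shim = workdir / "search"
    shim.write_text(
        "#!/bin/sh\n"
        # Record the invocation first, then print a stub result.
        f'echo "search $*" >> "{log}"\n'
        'echo "no results (stub)"\n'
    )
    shim.chmod(shim.stat().st_mode | stat.S_IXUSR)
    return log


def detect_tools_called(log: Path) -> list[str]:
    """Return the distinct tool names recorded in the shim log, in call order."""
    if not log.exists():
        return []
    names: list[str] = []
    for line in log.read_text().splitlines():
        parts = line.split()
        if parts and parts[0] not in names:
            names.append(parts[0])
    return names


def demo() -> tuple[list[str], list[str]]:
    """Check the log before and after one shim invocation in a clean directory."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        log = make_search_shim(workdir)
        before = detect_tools_called(log)
        # Stand-in for the agent running ./search from its working directory.
        subprocess.run(
            ["sh", str(workdir / "search"), "some query"],
            cwd=workdir,
            check=True,
            capture_output=True,
        )
        after = detect_tools_called(log)
    return before, after


if __name__ == "__main__":
    print(demo())
```

Because the log lives outside the model's response, a reply that merely claims to have searched leaves `tools_called` empty.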

Verified live (Claude haiku): deficient skill -> tools_called=[] hard=0;
learned "must run ./search" rule -> tools_called=['search'] hard=1.0.
20 tests pass.
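
The same deficient-versus-learned contrast can be illustrated offline with a mock-style backend. This is a sketch under stated assumptions: `attempt_with_tools`, `replay_one`, `ReplayResult` and `tools_called` are names from this commit message, but their bodies, the task dictionary layout, the `attempt` method, and `judge_tool_called` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayResult:
    response: str
    hard: float
    tools_called: list[str] = field(default_factory=list)


class MockBackend:
    """Deterministic stand-in: 'calls' a tool only if skill/memory names it."""

    def attempt(self, task: dict, skill: str, memory: str) -> str:
        return f"answer to: {task['prompt']}"

    def attempt_with_tools(self, task, skill, memory, tools):
        instructions = f"{skill}\n{memory}"
        # Assumption: an instruction mentions a tool as "./<name>".
        called = [t for t in tools if f"./{t}" in instructions]
        return self.attempt(task, skill, memory), called


def judge_tool_called(required: str, tools_called: list[str]) -> float:
    """Hypothetical rule judge: 1.0 iff the required tool was detected."""
    return 1.0 if required in tools_called else 0.0


def replay_one(backend, task: dict, skill: str, memory: str = "") -> ReplayResult:
    required = task.get("tool_called")
    if required is None:
        # No tool check in this sketch, so nothing can fail.
        return ReplayResult(backend.attempt(task, skill, memory), hard=1.0)
    # Tasks with a tool_called check go through the tool loop.
    response, called = backend.attempt_with_tools(task, skill, memory, [required])
    return ReplayResult(response, judge_tool_called(required, called), called)


TASK = {"prompt": "What is gbrain?", "tool_called": "search"}

if __name__ == "__main__":
    for skill in ("Answer quickly and concisely.",
                  "You must run ./search before answering."):
        result = replay_one(MockBackend(), TASK, skill)
        print(result.tools_called, result.hard)
    # prints: [] 0.0
    # prints: ['search'] 1.0
```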

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
2026-06-08 14:31:51 +00:00