mirror of
https://github.com/microsoft/SkillOpt.git
synced 2026-07-03 14:02:58 +08:00
The 4th gbrain seed (quick-answerer) is judged by tool_called=search: the agent
must ACTUALLY call a search tool. Add an honest tool loop:
- Backend.attempt_with_tools(task, skill, memory, tools) -> (response, tools_called)
- Claude: exposes a real ./search shell shim, runs with --allowedTools Bash in a
clean cwd; detects the call from the shim's log (not a self-reported marker).
- Codex: same shim under `exec --sandbox workspace-write`.
- Mock: deterministic — "calls" a tool iff skill/memory instructs it (for CI).
- replay_one routes tasks with a tool_called check through the tool loop and
feeds detected calls to the rule judge; ReplayResult gains tools_called.
Verified live (Claude haiku): deficient skill -> tools_called=[] hard=0;
learned "must run ./search" rule -> tools_called=['search'] hard=1.0.
20 tests pass.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>