microsoft-SkillOpt

mirror of https://github.com/microsoft/SkillOpt.git synced 2026-07-03 14:02:58 +08:00


Yifan Yang 937bc1ec4d feat(sleep): real tool-loop replay for gbrain quick-answerer (tool_called judge)

The fourth gbrain seed (quick-answerer) is judged by tool_called=search, so the
agent must actually call a search tool rather than merely claim to have done so.
This commit adds an honest tool loop:

  - Backend.attempt_with_tools(task, skill, memory, tools) -> (response, tools_called)
  - Claude: exposes a real ./search shell shim, runs with --allowedTools Bash in a
    clean cwd; detects the call from the shim's log (not a self-reported marker).
  - Codex: same shim under `exec --sandbox workspace-write`.
  - Mock: deterministic; "calls" a tool if and only if the skill or memory
    instructs it (for CI).
  - replay_one routes tasks with a tool_called check through the tool loop and
    feeds detected calls to the rule judge; ReplayResult gains tools_called.
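
As a rough illustration of the routing described above, the following sketch shows how a deterministic mock backend and a replay_one-style dispatcher could fit together. Only attempt_with_tools, replay_one, ReplayResult, and tools_called come from the commit message. The task dictionary layout, the "checks" key, the attempt method, the "./<tool>" substring match, and the scoring of tasks without a tool check are all assumptions made for this example, not the repository's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayResult:
    """Hypothetical result record; the real ReplayResult likely has more fields."""
    response: str
    hard: float
    tools_called: list = field(default_factory=list)


class MockBackend:
    """Deterministic stand-in: 'calls' a tool only if skill or memory names it."""

    def attempt(self, task, skill, memory):
        # Plain (tool-free) attempt; assumed method name.
        return f"answer to {task['prompt']}"

    def attempt_with_tools(self, task, skill, memory, tools):
        instructions = f"{skill}\n{memory}"
        # Assumption: a tool counts as called iff the instructions mention
        # its shim path, e.g. "./search".
        called = [t for t in tools if f"./{t}" in instructions]
        return f"answer to {task['prompt']}", called


def replay_one(backend, task, skill, memory, tools=("search",)):
    """Route tasks that carry a tool_called check through the tool loop."""
    required = task.get("checks", {}).get("tool_called")
    if required is None:
        # No tool check: nothing for this sketch's rule judge to fail.
        return ReplayResult(backend.attempt(task, skill, memory), hard=1.0)
    response, called = backend.attempt_with_tools(task, skill, memory, list(tools))
    # The rule judge sees the detected calls, not a self-reported marker.
    hard = 1.0 if required in called else 0.0
    return ReplayResult(response, hard, called)
```

With a task such as {"prompt": "q", "checks": {"tool_called": "search"}}, a skill that never mentions ./search yields an empty tools_called list and a hard score of 0.0, while a skill containing "You must run ./search first." yields ['search'] and 1.0.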

Verified live against Claude Haiku: with the deficient skill, tools_called=[]
and hard=0; with the learned "must run ./search" rule, tools_called=['search']
and hard=1.0. 20 tests pass.
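
The shim-log detection mentioned in the commit message can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names install_shim, detect_calls, and demo, the log file name, and the stub output are all hypothetical, and the real shim may log differently. The point is only that the call is inferred from a side effect of running the shim, not from anything the model says about itself.

```python
import stat
import subprocess
import tempfile
from pathlib import Path


def install_shim(workdir: Path, name: str = "search") -> Path:
    """Write an executable ./<name> shell shim that appends each call to a log."""
    log = workdir / f".{name}.log"  # hypothetical log location
    shim = workdir / name
    shim.write_text(
        "#!/bin/sh\n"
        f'echo "$*" >> "{log}"\n'      # record the arguments of every call
        'echo "no results (stub)"\n'   # placeholder tool output
    )
    shim.chmod(shim.stat().st_mode | stat.S_IXUSR)
    return log


def detect_calls(logs: dict) -> list:
    """Return the tools whose shim log exists, i.e. tools that actually ran."""
    return [name for name, log in logs.items() if log.exists()]


def demo():
    """Simulate one agent turn in a clean working directory."""
    with tempfile.TemporaryDirectory() as d:
        work = Path(d)
        logs = {"search": install_shim(work)}
        before = detect_calls(logs)
        # Stand-in for the agent's Bash tool running ./search; invoked via sh
        # so the sketch also works where the temp directory is mounted noexec.
        subprocess.run(
            ["sh", str(work / "search"), "some query"],
            cwd=work, check=True, capture_output=True,
        )
        return before, detect_calls(logs)
```

Calling demo() returns the detected tools before and after the simulated call, so a backend built this way would report an empty list whenever the agent only talks about searching without running the shim.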

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

2026-06-08 14:31:51 +00:00

__init__.py

test: add unit test suite for core utility modules

2026-06-01 02:04:22 +08:00

test_json_utils.py

test: add unit test suite for core utility modules

2026-06-01 02:04:22 +08:00

test_qwen_backend.py

fix(model): forward Qwen timeout and only set enable_thinking when true

2026-06-07 07:41:35 -07:00

test_scoring.py

test: add unit test suite for core utility modules

2026-06-01 02:04:22 +08:00

test_sleep_engine.py

feat(sleep): real tool-loop replay for gbrain quick-answerer (tool_called judge)

2026-06-08 14:31:51 +00:00

test_types.py

test: add unit test suite for core utility modules

2026-06-01 02:04:22 +08:00