Files
github-spec-kit/src
Huy Do 20f430686c feat(workflows): honor max_concurrency in fan-out via a bounded thread pool (#3224)
* feat(workflows): honor max_concurrency in fan-out via a bounded thread pool

* feat(workflows): address review — sliding-window fan-out, locked output, faithful halt

Address the reviewer feedback on the bounded fan-out concurrency:

- Sliding submission window: keep at most `workers` items in flight and stop
  launching new items once the run is halting, instead of submitting all items
  up front (which let the pool keep starting queued work after a halt).
- Faithful halt prefix: attribute a halt to the specific item whose own
  recorded result halted the run (replaying the sequential break condition,
  honoring continue_on_error/aborted), not the shared run status a later
  concurrent item may have flipped. The returned prefix now includes the actual
  halting item, matching the sequential path. An item that fails before
  recording a result (e.g. an unknown step type) is attributed too, since every
  item runs the same template.
- Lock the parent fan-out output mutation: route the post-fan-out
  step_results[...]['output'] update through a new RunState.set_step_output()
  under the run lock, so it cannot race a concurrent save().
- Docstring: describe int() coercion accurately (numeric strings / floats are
  honored; only non-coercible or <= 1 runs sequentially).

Tests: add concurrent halt-includes-halting-item, continue_on_error-does-not-
truncate, and unknown-template-type-matches-sequential coverage; make the
timing test use a monotonic clock with a looser threshold to avoid CI flakiness.

* feat(workflows): address second review pass — concurrency hardening

- append_log: serialize the log_entries append + log.jsonl write under a
  dedicated RunState._log_lock so concurrent fan-out workers can't interleave
  or corrupt log lines (kept separate from the state lock; never nested).
- _run_fan_out.run_item: read the item output back through the item_ctx it
  executed against rather than the outer context closure — clearer and robust
  if StepContext ever stops sharing the steps dict by reference.
- StepBase: document the thread-safety contract — STEP_REGISTRY holds one shared
  instance per type, so concurrent fan-out invokes execute() on the same object;
  implementations must be stateless/thread-safe (the built-ins already are).
- test_concurrency_is_real: prove parallelism deterministically with a
  threading.Barrier (sequential execution can't clear it) instead of a
  wall-clock timing assertion.

* feat(workflows): address review — stamp updated_at under lock, clarify cancel semantics

- RunState.save(): move the updated_at timestamp assignment inside the run lock
  so the timestamp matches the snapshot the thread serializes and concurrent
  savers don't race on it.
- _run_fan_out docstring: clarify that on a halt only not-yet-started items are
  cancelled; items already running finish but their outputs are ignored
  (Future.cancel() can't stop running work, and the pool joins on exit).

* feat(workflows): serialize on_step_start callback under a lock

The concurrent fan-out path invokes _execute_steps from worker threads, which
calls the engine's on_step_start callback (the CLI sets it to a console.print
lambda). Concurrent invocation could interleave/garble progress output. Guard
the call with a WorkflowEngine._callback_lock so callbacks are serialized;
the lock is uncontended for sequential runs.

* feat(workflows): re-raise worker exceptions in-place to preserve traceback

In _run_fan_out's concurrent path, a worker exception was stashed in first_exc
and re-raised after the loop. Re-raise it from within the except block with a
bare `raise` (after cancelling outstanding futures) so the original traceback is
preserved, and drop the now-unneeded first_exc variable. The ThreadPoolExecutor
__exit__ still joins any already-running workers before the exception escapes.

* feat(workflows): lock final fan-out status, drop redundant output write, bound workers

Address third review pass:

- Remove the unlocked `context.steps[step_id]["output"] = …` writes in the
  fan-out parent update. context.steps[step_id] is the same dict object that
  set_step_output() updates under the run lock, so the direct (unsynchronized)
  mutation was redundant.
- Preserve sequential halt semantics under concurrency: a later in-flight item
  could overwrite state.status after the halting item was identified. _run_fan_out
  now derives the halting item's run status (item_halt_status, replacing the bool
  item_halted) and restores it after the pool joins, so the final status is the
  first halting item's outcome.
- Bound the pool: workers = min(max_concurrency, len(items)) and early-return for
  empty items, so a user-controlled max_concurrency can't over-allocate threads.

Add coverage that an earlier PAUSED item's status wins over a later concurrent
FAILED item.

* feat(workflows): avoid unlocked context.steps writes when it aliases step_results

On a resume run, StepContext is built with steps=state.step_results, so the two
direct `context.steps[...] = ...` writes mutated the shared dict outside the run
lock and could race save(). Route both through a new _record_result helper that
mirrors into context.steps only when it is a distinct object (a fresh run) and
otherwise relies solely on record_step_result's locked write.
2026-06-30 08:23:27 -05:00
..