Summary
Open-Sleigh commission-first recovery can restart a WorkCommission at preflight after an execute/runtime disruption, even though Haft already records the WorkCommission as running.
This can produce an infinite retry loop.
Observed sequence
- WorkCommission starts in preflighting.
- Preflight passes.
- Haft records start_after_preflight, so the WorkCommission state becomes running.
- Execute stalls or the runtime is disrupted.
- Open-Sleigh re-enters the same commission via the local commission-first/intake path at workflow entry_phase=preflight.
- Replayed preflight writes:
  - record_run_event succeeds because it is allowed in running.
  - record_preflight / start_after_preflight previously failed because they were only allowed in preflighting.
- That lifecycle write failure becomes tool_execution_failed.
- Orchestrator retries, starts preflight again, and repeats.
Representative symptoms:
agent_failed: commission=<id> phase=execute reason=stalled
failed: commission=<id> phase=execute reason=stalled
started: commission=<id> phase=preflight
...
repeated preflight phase_outcome events
recent_failures: tool_execution_failed phase=preflight
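The pre-patch loop can be modelled with a minimal sketch. This is not Open-Sleigh or Haft code: the `ALLOWED_STATES` table, `replay_preflight`, and `simulate` are hypothetical names, and the table only encodes the state rules described in the sequence above (other states are left out).

```python
# Hypothetical model of the pre-patch lifecycle rules described above.
# Only the states named in this issue are listed.
ALLOWED_STATES = {
    "record_run_event": {"running"},
    "record_preflight": {"preflighting"},
    "start_after_preflight": {"preflighting"},
}

# Order in which the replayed preflight issues its writes.
REPLAYED_WRITES = ("record_run_event", "record_preflight", "start_after_preflight")


def replay_preflight(state: str) -> str:
    """Return "ok" if every replayed write is allowed in `state`,
    otherwise "tool_execution_failed" at the first rejected write."""
    for action in REPLAYED_WRITES:
        if state not in ALLOWED_STATES[action]:
            return "tool_execution_failed"
    return "ok"


def simulate(max_iterations: int) -> list:
    """Replay preflight against a commission that Haft already records as running."""
    state = "running"
    log = []
    for _ in range(max_iterations):
        log.append("started phase=preflight")
        outcome = replay_preflight(state)
        log.append(f"{outcome} phase=preflight")
        if outcome == "ok":
            break
    return log


if __name__ == "__main__":
    for line in simulate(3):
        print(line)
```

Under these assumptions every iteration fails the same way, so only the iteration cap ends the loop; nothing in the replay itself changes the outcome.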
Tactical patch already made
A fork patch exists on branch pr/open-sleigh-bugfixes:
3eb4783 fix(harness): tolerate replayed preflight lifecycle
The patch makes preflight pass lifecycle markers idempotent once a commission is already running:
- record_preflight with verdict=pass is accepted in running.
- start_after_preflight with verdict=pass is accepted in running.
- blocked/stale preflight replay is still rejected in running.
This breaks the immediate poison-pill loop, but it is not complete phase recovery.
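For readers who have not opened the commit, the guard it describes might look roughly like the sketch below. It is an illustration of the rule stated above, not the code in 3eb4783: `State`, `LifecycleError`, and `apply_lifecycle_write` are hypothetical names, and the handling of non-pass verdicts in preflighting is simplified.

```python
from enum import Enum


class State(Enum):
    PREFLIGHTING = "preflighting"
    RUNNING = "running"


class LifecycleError(Exception):
    """Raised when a lifecycle write is rejected in the current state."""


# The two preflight pass markers named in this issue.
PREFLIGHT_MARKERS = {"record_preflight", "start_after_preflight"}


def apply_lifecycle_write(state: State, action: str, verdict: str) -> State:
    """Return the next state for a preflight marker, or raise if rejected."""
    if action not in PREFLIGHT_MARKERS:
        raise LifecycleError(f"unsupported action in this sketch: {action}")

    if state is State.PREFLIGHTING:
        # Normal path: a passing start_after_preflight moves the commission to running.
        if action == "start_after_preflight" and verdict == "pass":
            return State.RUNNING
        # Simplification (assumption): other preflight writes leave the state unchanged.
        return state

    if state is State.RUNNING and verdict == "pass":
        # Idempotent replay: the commission already passed preflight, so accept as a no-op.
        return state

    # Blocked or stale verdicts must not be applied to a running commission.
    raise LifecycleError(f"{action} verdict={verdict} rejected in {state.value}")
```

The point of the no-op branch is that a replayed pass marker never moves the commission backwards; it only stops the write from failing.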
Why this is still an issue
Open-Sleigh appears to rely on in-memory workflow state for the current phase. If that state is lost/bypassed after a stalled execute, commission-first intake starts from the workflow entry phase (preflight) rather than reconstructing the durable phase from Haft lifecycle/runtime facts.
The tactical idempotency fix makes replay non-fatal, but Open-Sleigh should not have to rediscover progress by replaying preflight unless explicitly requeued/freshly started.
Expected behavior
For commission-first WorkCommissions after execute/runtime disruption:
- retry the phase that failed (execute) when retrying the same runtime attempt, or
- reconstruct resume phase from durable Haft lifecycle events, or
- terminalize/block the WorkCommission after a bounded retry budget.
Do not restart from preflight indefinitely for a WorkCommission already in running.
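One possible shape for this decision is sketched below. It assumes the durable lifecycle events are available as a list of event names and that failed execute attempts are counted somewhere durable; `Decision`, `decide_recovery`, and `MAX_EXECUTE_RETRIES` are hypothetical and do not exist in Open-Sleigh or Haft.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical retry budget; the real value would be a policy decision.
MAX_EXECUTE_RETRIES = 3


@dataclass(frozen=True)
class Decision:
    action: str  # "start", "resume", or "block"
    phase: str   # phase to enter; empty when blocking


def decide_recovery(
    lifecycle_events: List[str],
    failed_execute_attempts: int,
    max_retries: int = MAX_EXECUTE_RETRIES,
) -> Decision:
    """Pick the recovery action from durable facts instead of in-memory workflow state."""
    # A recorded start_after_preflight means the commission is already running.
    already_running = "start_after_preflight" in lifecycle_events
    if not already_running:
        # Fresh or explicitly requeued commission: preflight is the right entry phase.
        return Decision("start", "preflight")
    if failed_execute_attempts >= max_retries:
        # Budget exhausted: stop retrying and block/terminalize the commission.
        return Decision("block", "")
    # Already running and budget remains: resume at execute, not preflight.
    return Decision("resume", "execute")
```

For example, `decide_recovery(["record_preflight", "start_after_preflight"], 1)` returns a resume at execute, while the same events with three failed attempts return a block.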
Acceptance criteria
- Regression test covers a WorkCommission already in running with a local/intake replay or runtime restart.
- Open-Sleigh either resumes the correct phase from durable state or blocks/fails after a bounded retry policy.
- Preflight replay, if still possible, is explicitly idempotent and cannot block a running commission with stale preflight facts.
- Harness status/result should not report a cancelled/terminal WorkCommission as active due to stale runtime status projections.