Open-Sleigh commission-first recovery can restart running commission at preflight #84

@karabelaselias

Summary

Open-Sleigh commission-first recovery can restart a WorkCommission at preflight after an execute/runtime disruption, even though Haft already records the WorkCommission as running.

This can produce an infinite retry loop.

Observed sequence

  1. WorkCommission starts in preflighting.
  2. Preflight passes.
  3. Haft records start_after_preflight, so the WorkCommission state becomes running.
  4. Execute stalls or the runtime is disrupted.
  5. Open-Sleigh re-enters the same commission via the local commission-first/intake path at workflow entry_phase=preflight.
  6. Replayed preflight writes:
    • record_run_event succeeds because it is allowed in running.
    • record_preflight and start_after_preflight fail because, before the tactical patch, they were only allowed in preflighting.
  7. That lifecycle write failure becomes tool_execution_failed.
  8. Orchestrator retries, starts preflight again, and repeats.
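
The loop in steps 5 to 8 can be reproduced with a toy model. This is a minimal sketch, not Haft's or Open-Sleigh's actual code: the ALLOWED_STATES table, apply_lifecycle_write, replay_preflight, and the exception name are hypothetical, the order of the writes is assumed, and allowing record_run_event in preflighting is also an assumption. Only the state names, the action names, and the tool_execution_failed outcome come from this report.

```python
# Toy model of the pre-patch lifecycle guard (hypothetical; not Haft's code).
ALLOWED_STATES = {
    "record_run_event": {"preflighting", "running"},  # "preflighting" is assumed
    "record_preflight": {"preflighting"},
    "start_after_preflight": {"preflighting"},
}


class LifecycleWriteRejected(Exception):
    """Raised when a lifecycle write is not allowed in the current state."""


def apply_lifecycle_write(state, action):
    """Apply one lifecycle write and return the resulting commission state."""
    if state not in ALLOWED_STATES[action]:
        raise LifecycleWriteRejected(f"{action} not allowed in {state}")
    # Only start_after_preflight changes the state in this model.
    return "running" if action == "start_after_preflight" else state


def replay_preflight(state):
    """Run the preflight writes in order; return (state, outcome)."""
    try:
        for action in ("record_run_event", "record_preflight", "start_after_preflight"):
            state = apply_lifecycle_write(state, action)
    except LifecycleWriteRejected:
        return state, "tool_execution_failed"
    return state, "ok"


# First pass: preflighting -> running.
state, outcome = replay_preflight("preflighting")

# After an execute stall, each re-entry at preflight fails the same way.
attempts = [replay_preflight(state)[1] for _ in range(3)]
```

In this model every replay fails at record_preflight without changing the state, so an orchestrator that retries from the entry phase never makes progress.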

Representative symptoms:

agent_failed: commission=<id> phase=execute reason=stalled
failed: commission=<id> phase=execute reason=stalled
started: commission=<id> phase=preflight
...
repeated preflight phase_outcome events
recent_failures: tool_execution_failed phase=preflight

Tactical patch already made

A fork patch exists on branch pr/open-sleigh-bugfixes:

3eb4783 fix(harness): tolerate replayed preflight lifecycle

The patch makes the pass-verdict preflight lifecycle markers idempotent once a commission is already running:

  • record_preflight with verdict=pass is accepted in running.
  • start_after_preflight with verdict=pass is accepted in running.
  • blocked/stale preflight replay is still rejected in running.

This breaks the immediate poison-pill loop, but it is not complete phase recovery.
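
The acceptance rule described above can be sketched as a small guard. This is an illustration of the rule only, not the code in commit 3eb4783: the function name preflight_marker_allowed is hypothetical, "blocked" is used as an example of a non-pass verdict, and accepting every verdict in preflighting is an assumption.

```python
PREFLIGHT_MARKERS = {"record_preflight", "start_after_preflight"}


def preflight_marker_allowed(state, action, verdict):
    """Return True if a preflight lifecycle marker may be written.

    Hypothetical guard mirroring the three rules listed in this issue.
    """
    if action not in PREFLIGHT_MARKERS:
        raise ValueError(f"not a preflight marker: {action}")
    if state == "preflighting":
        # Normal path: assumed to accept any verdict.
        return True
    if state == "running":
        # Replay path: only a pass verdict is tolerated, so a stale or
        # blocked preflight cannot affect a commission that already started.
        return verdict == "pass"
    return False
```

With this guard a replayed pass is a no-op rather than a failure, while a replayed blocked verdict is still rejected in running.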

Why this is still an issue

Open-Sleigh appears to rely on in-memory workflow state for the current phase. If that state is lost or bypassed after a stalled execute, commission-first intake starts from the workflow entry phase (preflight) rather than reconstructing the durable phase from Haft lifecycle/runtime facts.

The tactical idempotency fix makes replay non-fatal, but Open-Sleigh should not have to rediscover progress by replaying preflight unless the commission was explicitly requeued or freshly started.

Expected behavior

For commission-first WorkCommissions after execute/runtime disruption:

  • retry the phase that failed (execute) when retrying the same runtime attempt, or
  • reconstruct resume phase from durable Haft lifecycle events, or
  • terminalize/block the WorkCommission after a bounded retry budget.

Do not restart from preflight indefinitely for a WorkCommission already in running.
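
One way to combine the second and third options is sketched below. It is a sketch under stated assumptions, not a proposed implementation: resume_phase, next_action, and MAX_EXECUTE_RETRIES are hypothetical names, the budget of 3 is arbitrary, and lifecycle events are modelled as a plain list of action names rather than whatever Haft actually stores.

```python
MAX_EXECUTE_RETRIES = 3  # hypothetical retry budget


def resume_phase(lifecycle_events):
    """Derive the phase to resume from durable lifecycle events."""
    # A recorded start_after_preflight means the commission reached running,
    # so preflight must not be replayed.
    if "start_after_preflight" in lifecycle_events:
        return "execute"
    return "preflight"


def next_action(lifecycle_events, failed_execute_attempts):
    """Return ("run", phase) or ("block", phase) for a recovered commission."""
    phase = resume_phase(lifecycle_events)
    if phase == "execute" and failed_execute_attempts >= MAX_EXECUTE_RETRIES:
        # Budget exhausted: terminalize/block instead of retrying forever.
        return ("block", phase)
    return ("run", phase)
```

For example, a commission whose events include start_after_preflight and that has stalled once would resume at execute, and after three stalls it would be blocked rather than restarted at preflight.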

Acceptance criteria

  • Regression test covers a WorkCommission already in running with a local/intake replay or runtime restart.
  • Open-Sleigh either resumes the correct phase from durable state or blocks/fails after a bounded retry policy.
  • Preflight replay, if still possible, is explicitly idempotent and cannot block a running commission with stale preflight facts.
  • Harness status/result should not report a cancelled/terminal WorkCommission as active due to stale runtime status projections.
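
The last criterion amounts to letting the durable WorkCommission state win over a runtime status projection whenever the durable state is terminal. A minimal sketch, assuming hypothetical names throughout: reported_status and TERMINAL_STATES are illustrative, and of the terminal states listed only "cancelled" is mentioned in this issue.

```python
# Only "cancelled" appears in this report; the other entries are assumed.
TERMINAL_STATES = {"cancelled", "completed", "failed"}


def reported_status(durable_state, runtime_projection):
    """Pick the status the harness should report for a WorkCommission."""
    if durable_state in TERMINAL_STATES:
        # A stale projection such as "active" must not mask a terminal state.
        return durable_state
    # For non-terminal commissions, prefer the projection when one exists.
    return runtime_projection or durable_state
```

A regression test for this criterion could assert that a cancelled commission with a leftover "active" projection is reported as cancelled.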
