Skip to content

Integrate retry into run.md and milestone.md #141

@snipcodeit

Description

@snipcodeit

Context

Phase 35 — Retry Architecture, integration. After #140 creates retry.cjs, this issue wires it into the two pipeline commands that handle failures. The goal: a transient GitHub API failure no longer immediately kills the pipeline; it retries up to 3 times with backoff. Permanent failures and needs-info cases still surface to the user immediately but with a failure class in the comment so they know whether to wait and retry or fix the issue.

What Already Exists

  • lib/retry.cjs — created by Create lib/retry.cjs with backoff and failure taxonomy #140; exports: classifyFailure, canRetry, incrementRetry, resetRetryState, getBackoffMs
  • commands/run.md (1282 lines) — failure handling: sets pipeline_stage to failed, applies pipeline-failed label via gh issue edit ${ISSUE_NUMBER} --add-label "pipeline-failed"; no retry logic; no failure classification in posted comments
  • commands/milestone.md (952 lines) — failed-issue recovery at ~line 907: shows Milestone NOT closed message with results table; no automated retry, no retry_count tracking, no backoff
  • .mgw/active/*.json — issue state files; will gain retry_count, last_failure_class, dead_letter fields after this issue
  • lib/state.cjs migrateProjectState() (lines 145-178) — will need a migration path to add retry_count: 0, dead_letter: false to existing active issue files

Description

Integrate lib/retry.cjs into commands/run.md and commands/milestone.md.

Technical Approach

  • run.md validate_and_load step: add --retry flag handler; if retry.dead_letter === true and --retry flag: call resetRetryState(), clear pipeline-failed label, re-queue
  • run.md failure handling: wrap the execute step in retry loop using canRetry() + getBackoffMs(); on failure call classifyFailure(); if transient and canRetry(): sleep(backoff), incrementRetry(), loop; if permanent or retry exhausted: set dead_letter=true, post comment with failure class
  • milestone.md recovery prompt: surface failure_class from active issue state in the results table; "Retry" option calls resetRetryState() then re-invokes run for that issue
  • migrateProjectState() in state.cjs: add migration step that sets retry_count: 0, dead_letter: false on active issues missing these fields

Done When

  • commands/run.md retries transient failures up to 3 times with exponential backoff before marking dead_letter=true
  • commands/run.md failure comment includes failure_class (transient/permanent/needs-info) from classifyFailure()
  • commands/milestone.md recovery display shows failure_class for each failed issue; Retry option calls resetRetryState()
  • lib/state.cjs migrateProjectState() sets retry_count: 0 and dead_letter: false on active issue files that lack these fields
  • After integration, --retry flag in run.md clears dead_letter state and re-queues the issue

GSD Route

Quick task or GSD Phase 35 plan.

Phase Context

Phase 35 of 36 — Retry Architecture. Issues: #140 (blocks this), #141 (this).

Depends on

#140 — lib/retry.cjs must exist before integration.

Metadata

Metadata

Assignees

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions