Description
Context
Phase 35 — Retry Architecture. Currently, a pipeline failure in run.md or milestone.md immediately marks the issue as failed with the pipeline-failed label. There is no distinction between "GitHub API returned 503" (retry in 30s) and "issue body is malformed" (human intervention needed). Users see every failure as permanent. This issue builds the infrastructure to classify and recover from transient failures.
What Already Exists
- commands/run.md (1282 lines): handles failures by setting pipeline_stage to failed and applying the pipeline-failed label (via gh issue edit ${ISSUE_NUMBER} --add-label "pipeline-failed"); no retry state, no backoff, no failure classification
- commands/milestone.md (952 lines): at line ~907, the "If some issues failed" block shows a results table with a Milestone NOT closed message; no Retry/Skip/Abort prompt with automated backoff
- .mgw/active/*.json: issue state files; will gain retry_count, last_failure_class, dead_letter fields
- lib/state.cjs (216 lines): writeProjectState/loadActiveIssue can persist retry state; migrateProjectState() can add default retry fields to existing issue files
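As a rough illustration of the migration step mentioned for lib/state.cjs, the sketch below fills in the three new retry fields on an existing issue state object. The helper name withRetryDefaults, the default values, and the issue_number field in the sample are assumptions made for this example; the issue only names the fields, and how migrateProjectState() would actually call such a helper is not specified.

```javascript
'use strict';

// Assumed defaults: the issue names these fields but not their initial values.
const RETRY_DEFAULTS = Object.freeze({
  retry_count: 0,
  last_failure_class: null,
  dead_letter: false,
});

// Hypothetical helper: returns a copy of an issue state with any missing
// retry fields filled in. Fields that already have a value are left untouched.
function withRetryDefaults(issueState) {
  const migrated = { ...issueState };
  for (const [field, value] of Object.entries(RETRY_DEFAULTS)) {
    if (migrated[field] === undefined) {
      migrated[field] = value;
    }
  }
  return migrated;
}

// Example: a legacy state file without retry fields (shape is illustrative).
const legacy = { issue_number: 140, pipeline_stage: 'failed' };
console.log(withRetryDefaults(legacy));
```

Returning a copy rather than mutating the input keeps the helper safe to call repeatedly, so running the migration twice on the same file would leave existing retry values intact.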
Description
Create lib/retry.cjs — formal retry, backoff, and failure-handling infrastructure.
Technical Approach
- Failure taxonomy: Transient (retry): rate limit, 5xx, network timeout, worktree lock, model overload. Permanent (don't retry): 404, non-rate-limit 403, GSD tools missing. NeedsInfo: issue body ambiguous, missing required fields, contradictory requirements.
- Exports: classifyFailure(error) → { class: 'transient'|'permanent'|'needs-info', reason }, canRetry(issueState) → boolean (checks retry_count < MAX, dead_letter not set), incrementRetry(issueState) → updated state, resetRetryState(issueState) → cleared state, getBackoffMs(retryCount) → exponential with jitter (base 5s, max 5min)
- Constants: MAX_RETRIES = 3, BACKOFF_BASE_MS = 5000, BACKOFF_MAX_MS = 300000
- Add to lib/index.cjs
Done When
- lib/retry.cjs exists with exports: classifyFailure, canRetry, incrementRetry, resetRetryState, getBackoffMs
- classifyFailure() correctly classifies rate-limit errors (status 429) as transient and 404 as permanent, verified with a node -e test
- canRetry() returns false when retry_count >= 3 or dead_letter === true
- getBackoffMs() returns increasing values for retry counts 0, 1, 2 with jitter (not identical on repeat calls)
- lib/index.cjs exports retry.cjs, and node -e "require('./lib/retry.cjs').classifyFailure({status:429})" returns {class:'transient',...}
GSD Route
Quick task or GSD Phase 35 plan.
Phase Context
Phase 35 of 36 — Retry Architecture. Depends on Phase 34 (retry integrates into adapter-using commands). Issues: #140 (this), #141 (blocked on #140).
Depends on
Phase 34 (#138, #139) should be complete — retry.cjs integrates into commands that use gsd-adapter.