Create lib/retry.cjs with backoff and failure taxonomy #140

@snipcodeit

Context

Phase 35 — Retry Architecture. Currently, a pipeline failure in run.md or milestone.md immediately marks the issue as failed with the pipeline-failed label. There is no distinction between "GitHub API returned 503" (retry in 30s) and "issue body is malformed" (human intervention needed). Users see every failure as permanent. This issue builds the infrastructure to classify and recover from transient failures.

What Already Exists

  • commands/run.md (1282 lines) — handles failures by setting pipeline_stage to failed and applying pipeline-failed label (via gh issue edit ${ISSUE_NUMBER} --add-label "pipeline-failed"); no retry state, no backoff, no failure classification
  • commands/milestone.md (952 lines) — line ~907: "If some issues failed" block shows results table with Milestone NOT closed message; no Retry/Skip/Abort prompt with automated backoff
  • .mgw/active/*.json — issue state files; will gain retry_count, last_failure_class, dead_letter fields
  • lib/state.cjs (216 lines) — writeProjectState/loadActiveIssue can persist retry state; migrateProjectState() can add default retry fields to existing issue files
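
As a rough illustration of the new state fields, the sketch below fills in defaults for an issue file that predates them. The helper name `withRetryDefaults`, the constant `RETRY_DEFAULTS`, the default values, and the `issue_number` field are assumptions made for this example. The issue does not specify how migrateProjectState() would apply the defaults.

```javascript
'use strict';

// Assumed defaults for the three fields this issue adds to .mgw/active/*.json.
// The field names come from the issue; the default values are a guess.
const RETRY_DEFAULTS = Object.freeze({
  retry_count: 0,
  last_failure_class: null,
  dead_letter: false,
});

// Hypothetical helper: returns a copy of the issue state with any missing
// retry fields filled in. Fields already present on the state are kept.
function withRetryDefaults(issueState) {
  return { ...RETRY_DEFAULTS, ...issueState };
}

// Example: a state file written before retry support existed.
const legacy = { issue_number: 140, pipeline_stage: 'failed' };
console.log(withRetryDefaults(legacy));
```

Returning a copy rather than mutating the input keeps the helper safe to call repeatedly during a migration pass.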

Description

Create lib/retry.cjs — formal retry, backoff, and failure-handling infrastructure.

Technical Approach

  • Failure taxonomy:
      • Transient (retry): rate limit, 5xx, network timeout, worktree lock, model overload.
      • Permanent (don't retry): 404, non-rate-limit 403, GSD tools missing.
      • NeedsInfo: issue body ambiguous, missing required fields, contradictory requirements.
  • Exports:
      • classifyFailure(error) → { class: 'transient'|'permanent'|'needs-info', reason }
      • canRetry(issueState) → boolean (checks retry_count < MAX, dead_letter not set)
      • incrementRetry(issueState) → updated state
      • resetRetryState(issueState) → cleared state
      • getBackoffMs(retryCount) → exponential with jitter (base 5s, max 5min)
  • Constants: MAX_RETRIES = 3, BACKOFF_BASE_MS = 5000, BACKOFF_MAX_MS = 300000
  • Add to lib/index.cjs

Done When

  • lib/retry.cjs exists with exports: classifyFailure, canRetry, incrementRetry, resetRetryState, getBackoffMs
  • classifyFailure() correctly classifies rate-limit errors (status 429) as transient and 404 as permanent — verified with node -e test
  • canRetry() returns false when retry_count >= 3 or dead_letter === true
  • getBackoffMs() returns increasing values for retry counts 0, 1, 2 with jitter (not identical on repeat calls)
  • lib/index.cjs exports retry.cjs and node -e "console.log(require('./lib/retry.cjs').classifyFailure({status:429}))" prints {class:'transient',...}

GSD Route

Quick task or GSD Phase 35 plan.

Phase Context

Phase 35 of 36 — Retry Architecture. Depends on Phase 34 (retry integrates into adapter-using commands). Issues: #140 (this), #141 (blocked on #140).

Depends on

Phase 34 (#138, #139) should be complete — retry.cjs integrates into commands that use gsd-adapter.
