Commit 7fc1229

Merge pull request #243 from snipcodeit/issue/234-implement-graceful-degradation-for-n
feat(reliability): graceful degradation for non-critical agent failures
2 parents 325c3a7 + 3c1e8e5 commit 7fc1229

4 files changed

Lines changed: 362 additions & 28 deletions

commands/run/execute.md

Lines changed: 115 additions & 28 deletions
@@ -77,6 +77,10 @@ ic.buildGSDPromptContext({
 " 2>/dev/null || echo "")
 ```
 
+<!-- mgw:criticality=critical spawn_point=planner -->
+<!-- Critical: plan creation is required for pipeline to proceed.
+On failure: retry via existing retry loop, then dead-letter. -->
+
 **Pre-spawn diagnostic hook (planner):**
 ```bash
 DIAG_PLANNER=$(node -e "
@@ -90,7 +94,6 @@ const id = dh.beforeAgentSpawn({
 process.stdout.write(id);
 " 2>/dev/null || echo "")
 ```
-
 ```
 Task(
 prompt="
@@ -176,6 +179,23 @@ PLAN_CHECK=$(node ~/.claude/get-shit-done/bin/gsd-tools.cjs verify plan-structur
 ```
 Parse the JSON result. If structural issues found, include them in the plan-checker prompt below so it has concrete problems to evaluate rather than searching from scratch.
 
+6. **(If --full) Spawn plan-checker, handle revision loop (max 2 iterations):**
+
+<!-- mgw:criticality=advisory spawn_point=plan-checker -->
+<!-- Advisory: plan checking is a quality gate, not a pipeline blocker.
+If this agent fails, log a warning and proceed with the unchecked plan.
+The plan was already verified structurally by gsd-tools verify plan-structure.
+
+Graceful degradation pattern:
+```
+PLAN_CHECK_RESULT=$(wrapAdvisoryAgent(Task(...), 'plan-checker', {
+issueNumber: ISSUE_NUMBER,
+fallback: '## VERIFICATION PASSED (plan-checker unavailable — structural check only)'
+}))
+# If fallback returned, skip the revision loop and proceed to execution
+```
+-->
+
 6. **(If --full) Spawn plan-checker (with diagnostic capture), handle revision loop (max 2 iterations):**
 
 **Pre-spawn diagnostic hook (plan-checker):**
@@ -191,7 +211,6 @@ const id = dh.beforeAgentSpawn({
 process.stdout.write(id);
 " 2>/dev/null || echo "")
 ```
-
 ```
 Task(
 prompt="
@@ -248,6 +267,13 @@ dh.afterAgentSpawn({
 If issues found and iteration < 2: spawn planner revision, then re-check.
 If iteration >= 2: offer force proceed or abort.
 
+7. **Spawn executor (task agent):**
+
+<!-- mgw:criticality=critical spawn_point=executor -->
+<!-- Critical: execution produces the code changes. Without it, there is
+nothing to commit or PR. On failure: retry via existing retry loop,
+then dead-letter. -->
+
 7. **Spawn executor (task agent, with diagnostic capture):**
 
 **Pre-spawn diagnostic hook (executor):**
@@ -263,7 +289,6 @@ const id = dh.beforeAgentSpawn({
 process.stdout.write(id);
 " 2>/dev/null || echo "")
 ```
-
 ```
 Task(
 prompt="
@@ -321,6 +346,25 @@ VERIFY_RESULT=$(node ~/.claude/get-shit-done/bin/gsd-tools.cjs verify-summary "$
 ```
 Parse JSON result. Use `passed` field for go/no-go. Checks summary existence, files created, and commits.
 
+9. **(If --full) Spawn verifier:**
+
+<!-- mgw:criticality=advisory spawn_point=verifier -->
+<!-- Advisory: verification is quality assurance after execution is complete.
+The code changes and commits already exist. If verification fails,
+log a warning and proceed to PR creation with a note that verification
+was skipped.
+
+Graceful degradation pattern:
+```
+VERIFY_RESULT=$(wrapAdvisoryAgent(Task(...), 'verifier', {
+issueNumber: ISSUE_NUMBER,
+fallback: null
+}))
+# If fallback returned (null), create a minimal VERIFICATION.md:
+# "## VERIFICATION SKIPPED\nVerifier agent unavailable. Manual review recommended."
+```
+-->
+
 9. **(If --full) Spawn verifier (with diagnostic capture):**
 
 **Pre-spawn diagnostic hook (verifier):**
@@ -336,7 +380,6 @@ const id = dh.beforeAgentSpawn({
 process.stdout.write(id);
 " 2>/dev/null || echo "")
 ```
-
 ```
 Task(
 prompt="
@@ -404,8 +447,13 @@ If any step above fails (executor or verifier agent returns error, summary missi
 
 ```bash
 # On failure — classify and decide whether to retry
+# CRITICALITY-AWARE: check if the failing agent is advisory first.
+# Advisory agent failures are logged and skipped (pipeline continues).
+# Critical agent failures follow the existing retry/dead-letter flow.
+
 FAILURE_CLASS=$(node -e "
 const { classifyFailure, canRetry, incrementRetry, getBackoffMs } = require('./lib/retry.cjs');
+const { isAdvisory } = require('./lib/agent-criticality.cjs');
 const { loadActiveIssue } = require('./lib/state.cjs');
 const fs = require('fs'), path = require('path');
 
@@ -415,29 +463,43 @@ const file = files.find(f => f.startsWith('${ISSUE_NUMBER}-') && f.endsWith('.js
 const filePath = path.join(activeDir, file);
 let issueState = JSON.parse(fs.readFileSync(filePath, 'utf-8'));
 
-// Classify the failure from the error context
-const error = { message: '${EXECUTION_ERROR_MESSAGE}' };
-const result = classifyFailure(error);
-console.error('Failure classified as: ' + result.class + ' — ' + result.reason);
-
-// Persist failure class to state
-issueState.last_failure_class = result.class;
-
-if (result.class === 'transient' && canRetry(issueState)) {
-const backoff = getBackoffMs(issueState.retry_count || 0);
-issueState = incrementRetry(issueState);
-fs.writeFileSync(filePath, JSON.stringify(issueState, null, 2));
-// Output: backoff ms so shell can sleep
-console.log('retry:' + backoff + ':' + result.class);
+// Check if the failing agent is advisory
+const failingSpawnPoint = '${FAILING_SPAWN_POINT}';
+if (failingSpawnPoint && isAdvisory(failingSpawnPoint)) {
+// Advisory agent failure — log warning and continue pipeline
+console.error('MGW WARNING: Advisory agent \"' + failingSpawnPoint + '\" failed. Continuing pipeline.');
+console.log('advisory_degraded:' + failingSpawnPoint);
 } else {
-// Permanent failure or retries exhausted — dead-letter
-issueState.dead_letter = true;
-fs.writeFileSync(filePath, JSON.stringify(issueState, null, 2));
-console.log('dead_letter:' + result.class);
+// Critical agent failure — classify and apply retry/dead-letter logic
+const error = { message: '${EXECUTION_ERROR_MESSAGE}' };
+const result = classifyFailure(error);
+console.error('Failure classified as: ' + result.class + ' — ' + result.reason);
+
+// Persist failure class to state
+issueState.last_failure_class = result.class;
+
+if (result.class === 'transient' && canRetry(issueState)) {
+const backoff = getBackoffMs(issueState.retry_count || 0);
+issueState = incrementRetry(issueState);
+fs.writeFileSync(filePath, JSON.stringify(issueState, null, 2));
+// Output: backoff ms so shell can sleep
+console.log('retry:' + backoff + ':' + result.class);
+} else {
+// Permanent failure or retries exhausted — dead-letter
+issueState.dead_letter = true;
+fs.writeFileSync(filePath, JSON.stringify(issueState, null, 2));
+console.log('dead_letter:' + result.class);
+}
 }
 ")
 
 case "$FAILURE_CLASS" in
+advisory_degraded:*)
+DEGRADED_AGENT=$(echo "$FAILURE_CLASS" | cut -d':' -f2)
+echo "MGW: Advisory agent '${DEGRADED_AGENT}' failed — gracefully degraded, continuing pipeline."
+# Do NOT set EXECUTION_SUCCEEDED=false — pipeline continues
+# Skip to next step (advisory output is optional)
+;;
 retry:*)
 BACKOFF_MS=$(echo "$FAILURE_CLASS" | cut -d':' -f2)
 BACKOFF_SEC=$(( (BACKOFF_MS + 999) / 1000 ))
@@ -642,7 +704,11 @@ fi
 " 2>/dev/null || echo "")
 ```
 
-**Pre-spawn diagnostic hook (milestone planner):**
+<!-- mgw:criticality=critical spawn_point=milestone-planner -->
+<!-- Critical: phase planning is required — cannot execute without a plan.
+On failure: retry via milestone retry loop, then dead-letter. -->
+
+**Pre-spawn diagnostic hook (milestone planner):**
 ```bash
 DIAG_MS_PLANNER=$(node -e "
 const dh = require('${REPO_ROOT}/lib/diagnostic-hooks.cjs');
@@ -655,7 +721,6 @@ fi
 process.stdout.write(id);
 " 2>/dev/null || echo "")
 ```
-
 ```
 Task(
 prompt="
@@ -733,7 +798,12 @@ fi
 # Parse EXEC_INIT JSON for: executor_model, verifier_model, phase_dir, plans, incomplete_plans, plan_count
 ```
 
-**Pre-spawn diagnostic hook (milestone executor):**
+<!-- mgw:criticality=critical spawn_point=milestone-executor -->
+<!-- Critical: phase execution produces the code changes. Without it,
+there is nothing to commit or PR. On failure: retry via milestone
+retry loop, then dead-letter. -->
+
+**Pre-spawn diagnostic hook (milestone executor):**
 ```bash
 DIAG_MS_EXECUTOR=$(node -e "
 const dh = require('${REPO_ROOT}/lib/diagnostic-hooks.cjs');
@@ -746,7 +816,6 @@ fi
 process.stdout.write(id);
 " 2>/dev/null || echo "")
 ```
-
 ```
 Task(
 prompt="
@@ -804,7 +873,26 @@ fi
 done
 ```
 
-**e. Spawn verifier agent (gsd:verify-phase, with diagnostic capture):**
+**e. Spawn verifier agent (gsd:verify-phase):**
+
+<!-- mgw:criticality=advisory spawn_point=milestone-verifier -->
+<!-- Advisory: phase verification is quality assurance after execution
+completes. The code changes and commits already exist. If verification
+fails, log a warning and proceed with a note that verification was
+skipped for this phase.
+
+Graceful degradation pattern:
+```
+VERIFY_RESULT=$(wrapAdvisoryAgent(Task(...), 'milestone-verifier', {
+issueNumber: ISSUE_NUMBER,
+fallback: null
+}))
+# If fallback returned, create minimal VERIFICATION.md:
+# "## VERIFICATION SKIPPED\nVerifier agent unavailable for phase ${PHASE_NUMBER}."
+```
+-->
+
+**e. Spawn verifier agent (gsd:verify-phase, with diagnostic capture):**
 
 **Pre-spawn diagnostic hook (milestone verifier):**
 ```bash
@@ -819,7 +907,6 @@ fi
 process.stdout.write(id);
 " 2>/dev/null || echo "")
 ```
-
 ```
 Task(
 prompt="
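
The `wrapAdvisoryAgent(Task(...), ...)` pattern in the comments above is pseudocode, and `lib/agent-criticality.cjs` itself is not shown in this diff. Below is a minimal sketch of what its two helpers might look like. It assumes that `Task(...)` yields a promise and that unknown spawn points default to critical. The `CRITICALITY` table name and the function bodies are hypothetical; the spawn points and their levels are copied from the `mgw:criticality` annotations in this commit.

```javascript
// Hypothetical sketch: the real lib/agent-criticality.cjs is not part of this diff.
// Spawn points and levels are copied from the mgw:criticality annotations in this commit.
const CRITICALITY = {
  'planner': 'critical',
  'plan-checker': 'advisory',
  'executor': 'critical',
  'verifier': 'advisory',
  'milestone-planner': 'critical',
  'milestone-executor': 'critical',
  'milestone-verifier': 'advisory',
  'pr-creator': 'critical',
  'comment-classifier': 'advisory',
};

// Assumption: anything not listed is treated as critical, so a misspelled
// spawn point can never cause a required step to be skipped silently.
function isAdvisory(spawnPoint) {
  return CRITICALITY[spawnPoint] === 'advisory';
}

// Await an agent promise. Advisory failures are logged and resolve to the
// fallback value; critical failures are rethrown for the retry/dead-letter flow.
async function wrapAdvisoryAgent(agentPromise, spawnPoint, { issueNumber, fallback = null } = {}) {
  try {
    return await agentPromise;
  } catch (err) {
    if (!isAdvisory(spawnPoint)) throw err;
    console.error(
      `MGW WARNING: Advisory agent "${spawnPoint}" failed for issue #${issueNumber}: ${err.message}`
    );
    return fallback;
  }
}
```

Under these assumptions, a `null` result from the `verifier` wrapper is the caller's signal to write the minimal VERIFICATION.md described in the comment, while a failing `executor` still throws into the existing retry loop.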

commands/run/pr-create.md

Lines changed: 6 additions & 0 deletions
@@ -105,6 +105,12 @@ ic.assembleIssueContext(${ISSUE_NUMBER})
 
 Read issue state for context.
 
+<!-- mgw:criticality=critical spawn_point=pr-creator -->
+<!-- Critical: PR creation is the pipeline's final output. Without it,
+the entire pipeline run produces no deliverable. On failure: the
+pipeline marks the issue as failed (no retry — PR creation errors
+are typically permanent: branch conflicts, permissions, etc.). -->
+
 **Pre-spawn diagnostic hook (PR creator):**
 ```bash
 DIAG_PR_CREATOR=$(node -e "
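
Every spawn point touched by this commit carries the same one-line annotation, `<!-- mgw:criticality=<level> spawn_point=<name> -->`. Whether any tooling actually reads these comments is not shown here. If it did, extraction could look roughly like the sketch below; `ANNOTATION_RE` and `parseCriticalityAnnotations` are hypothetical names, and the regex assumes spawn point names contain only letters, digits, and hyphens, as they do in this diff.

```javascript
// Matches the annotation shape used in this commit, for example:
// <!-- mgw:criticality=critical spawn_point=pr-creator -->
const ANNOTATION_RE =
  /<!--\s*mgw:criticality=(critical|advisory)\s+spawn_point=([A-Za-z0-9-]+)\s*-->/g;

// Hypothetical helper: list every annotated spawn point found in the text
// of a command file, in document order.
function parseCriticalityAnnotations(markdown) {
  return Array.from(markdown.matchAll(ANNOTATION_RE), (m) => ({
    spawnPoint: m[2],
    criticality: m[1],
  }));
}
```

A check like this could be used to confirm that the annotations in the command files agree with whatever table `isAdvisory` consults, so the two cannot drift apart unnoticed.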

commands/run/triage.md

Lines changed: 14 additions & 0 deletions
@@ -361,6 +361,20 @@ NEW_COMMENTS=$(gh issue view $ISSUE_NUMBER --json comments \
 
 2. **Spawn classification agent (with diagnostic capture):**
 
+<!-- mgw:criticality=advisory spawn_point=comment-classifier -->
+<!-- Advisory: comment classification failure does not block the pipeline.
+If this agent fails, log a warning and treat all new comments as
+informational (safe default — pipeline continues with stale data).
+
+Graceful degradation pattern:
+```
+CLASSIFICATION_RESULT=$(wrapAdvisoryAgent(Task(...), 'comment-classifier', {
+issueNumber: ISSUE_NUMBER,
+fallback: '{"classification":"informational","reasoning":"comment classifier unavailable","new_requirements":[],"blocking_reason":""}'
+}))
+```
+-->
+
 **Pre-spawn diagnostic hook:**
 ```bash
 CLASSIFIER_PROMPT="<full classifier prompt assembled above>"
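
Because the classifier fallback is a JSON string with the same four fields as a real result, downstream code can parse both the same way. The sketch below shows one way to do that. `parseClassification`, `classifierFallback`, and the extra `degraded` flag are hypothetical additions for illustration; only the fallback field values come from the annotation above.

```javascript
// Fallback values copied from the graceful degradation pattern in triage.md.
// A fresh object is returned each time so callers cannot mutate shared state.
function classifierFallback() {
  return {
    classification: 'informational',
    reasoning: 'comment classifier unavailable',
    new_requirements: [],
    blocking_reason: '',
  };
}

// Hypothetical helper: parse raw classifier output and degrade to the fallback
// when the output is empty, not valid JSON, or lacks a classification string.
// The `degraded` flag is an assumption added so callers can log the difference.
function parseClassification(raw) {
  try {
    const parsed = JSON.parse(raw);
    if (parsed && typeof parsed.classification === 'string') {
      return { ...parsed, degraded: false };
    }
  } catch (_) {
    // Invalid JSON: fall through to the fallback below.
  }
  return { ...classifierFallback(), degraded: true };
}
```

Treating malformed output the same as a failed agent keeps the advisory step from blocking the pipeline in either case.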
