feat(retry): configurable retry policy engine for agent failures #241

Merged
snipcodeit merged 1 commit into main from issue/232-implement-configurable-retry-policy-en
Mar 6, 2026
@snipcodeit
Owner

Summary

  • Implement lib/retry-policy.cjs — a configurable retry policy engine for GSD agent failures with per-failure-type retry limits, exponential backoff with jitter, and per-agent-type overrides
  • Add lib/agent-errors.cjs, the agent failure taxonomy (a dependency from issue #229 and PR #238, "define agent failure taxonomy with structured error codes"), which provides structured failure classification for timeout, malformed-output, partial-completion, hallucination, and permission-denied errors
  • The engine wraps Task() calls via executeWithPolicy() and automatically retries on transient failures, returning the original error when retries are exhausted
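
The retry behavior summarized above can be sketched as a small loop. This is a minimal illustration, not the code in lib/retry-policy.cjs: runWithPolicy, classify, and the message-matching rule are hypothetical stand-ins, and only the retry limits are taken from this PR.

```javascript
// Hypothetical sketch of a policy-driven retry loop (not the module's actual code).
const MAX_RETRIES = {
  timeout: 2,
  'malformed-output': 1,
  'partial-completion': 1,
  hallucination: 0,
  'permission-denied': 0,
};

// Stand-in classifier (assumption): the real taxonomy lives in lib/agent-errors.cjs.
function classify(err) {
  return /timed out/i.test(err.message) ? 'timeout' : 'hallucination';
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runWithPolicy(fn, { backoffMs = () => 0, onRetry = () => {} } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const failureType = classify(err);
      const limit = MAX_RETRIES[failureType] ?? 0;
      // Non-retryable or exhausted: surface the original error unchanged.
      if (attempt >= limit) throw err;
      const delay = backoffMs(attempt);
      onRetry({ attempt, failureType, backoffMs: delay });
      await sleep(delay);
    }
  }
}

// Two timeouts followed by a success: one initial call plus two retries.
let calls = 0;
runWithPolicy(async () => {
  if (++calls < 3) throw new Error('agent timed out');
  return 'recovered';
}).then((result) => console.log(result, 'after', calls, 'calls')); // recovered after 3 calls
```

Rethrowing the caught error object, rather than wrapping it, is one way to satisfy the "returns the original error" behavior described in the summary; callers can then inspect the failure exactly as the agent raised it.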

Closes #232

Milestone Context

  • Milestone: v8 — Agent Reliability & Failure Recovery
  • Phase: 45 — Retry Policies & Fallback Strategies
  • Issue: 4 of 9 in milestone

Changes

New: lib/retry-policy.cjs

  • RetryPolicyEngine class with configurable policies
  • DEFAULT_RETRY_POLICIES — timeout: 2, malformed-output: 1, partial-completion: 1, hallucination: 0, permission-denied: 0
  • DEFAULT_BACKOFF_CONFIG — 5s base, 300s max, 2x multiplier, full jitter
  • executeWithPolicy(fn, opts) — async wrapper with retry, backoff, abort support
  • loadConfig() — reads per-agent-type overrides from .mgw/config.json
  • RetryPolicyError error class extending MgwError
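
As a rough illustration of the backoff defaults listed above, the delay for a zero-based attempt can be computed as follows. The field names baseMs, maxMs, and jitter appear in this PR's test snippets; multiplier and the backoffMs helper are assumed names, and treating "full jitter" as a uniform draw in [0, ceiling) is an assumption about the implementation.

```javascript
// Hypothetical sketch; not the module's actual getBackoffMs() implementation.
const BACKOFF = { baseMs: 5000, maxMs: 300000, multiplier: 2, jitter: true };

// attempt is zero-based: 0 -> base, 1 -> base * multiplier, and so on.
function backoffMs(attempt, cfg = BACKOFF, random = Math.random) {
  const ceiling = Math.min(cfg.maxMs, cfg.baseMs * Math.pow(cfg.multiplier, attempt));
  // Full jitter (assumed meaning): pick uniformly in [0, ceiling) so that
  // concurrent agents do not retry in lockstep.
  return cfg.jitter ? Math.floor(random() * ceiling) : ceiling;
}

const noJitter = { ...BACKOFF, jitter: false };
console.log([0, 1, 2].map((n) => backoffMs(n, noJitter))); // [ 5000, 10000, 20000 ]
console.log(backoffMs(10, noJitter)); // 300000 (capped at maxMs)
```

Passing the random source as a parameter keeps the jittered path deterministic under test.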

New: lib/agent-errors.cjs

Integration

  • Uses classifyAgentFailure() from agent-errors.cjs for agent-specific classification
  • Falls back to classifyFailure() from retry.cjs for generic errors
  • Does NOT modify lib/retry.cjs — coexists as a higher-level layer
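
The two-stage classification described above might look like the sketch below. The return shapes of classifyAgentFailure() and classifyFailure() are not shown in this PR, so agentClassifier and genericClassifier are hypothetical stand-ins that return plain strings or null, and the matching rules are invented purely for illustration.

```javascript
// Stand-in for classifyAgentFailure() (assumption: returns null when the
// error is not an agent-specific failure).
function agentClassifier(err) {
  if (/timed out/i.test(err.message)) return 'timeout';
  if (/permission denied/i.test(err.message)) return 'permission-denied';
  return null;
}

// Stand-in for classifyFailure() from retry.cjs (assumption: coarse
// transient/permanent split).
function genericClassifier(err) {
  return err.code === 'ECONNRESET' ? 'transient' : 'permanent';
}

// Try the agent-specific taxonomy first, then fall back to the generic one.
function classify(err) {
  return agentClassifier(err) ?? genericClassifier(err);
}

console.log(classify(new Error('agent timed out'))); // timeout
console.log(classify(Object.assign(new Error('socket hang up'), { code: 'ECONNRESET' }))); // transient
console.log(classify(new Error('unexpected token'))); // permanent
```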

Test Plan

  • Module loads cleanly: node -e "require('./lib/retry-policy.cjs')"
  • Default policies match spec: timeout=2, malformed-output=1, hallucination=0
  • shouldRetry() respects limits: timeout at attempt 2 returns false
  • Backoff values follow exponential formula: 5000, 10000, 20000 (no jitter)
  • Per-agent-type overrides correctly shadow global defaults
  • executeWithPolicy() retries transient failures and stops on permanent ones
  • executeWithPolicy() succeeds after retry on transient failure
  • onRetry callback fires with correct attempt, failureType, backoffMs
  • Non-retryable errors (hallucination) throw immediately without retry
  • Original error is thrown when retries exhausted
  • lib/retry.cjs exports and constants unchanged (19/19 verification checks pass)

Implement RetryPolicyEngine in lib/retry-policy.cjs that provides
per-failure-type retry limits (timeout: 2, malformed-output: 1,
hallucination: 0), exponential backoff with jitter, and per-agent-type
override configuration from .mgw/config.json.

The engine wraps Task() calls via executeWithPolicy() and classifies
failures using agent-errors.cjs taxonomy with retry.cjs fallback.
Returns the original error when all retries are exhausted.

Closes #232

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the core (Changes to core library) label Mar 6, 2026
@snipcodeit
Owner Author

Testing Procedures

Quick Validation

node -e "
const { RetryPolicyEngine } = require('./lib/retry-policy.cjs');
const e = new RetryPolicyEngine();
console.log('timeout retries:', e.getMaxRetries('timeout'));       // 2
console.log('hallucination retries:', e.getMaxRetries('hallucination')); // 0
console.log('should retry timeout@0:', e.shouldRetry('timeout', null, 0)); // true
console.log('should retry timeout@2:', e.shouldRetry('timeout', null, 2)); // false
console.log('backoff@0:', e.getBackoffMs(0), 'ms');
"

Async Retry Test

node -e "
const { RetryPolicyEngine } = require('./lib/retry-policy.cjs');
const e = new RetryPolicyEngine({ backoff: { baseMs: 10, maxMs: 100, jitter: false } });
let attempt = 0;
e.executeWithPolicy(async () => {
  attempt++;
  if (attempt < 3) throw new Error('agent timed out');
  return 'recovered';
}).then(r => console.log('Result:', r, 'after', attempt, 'attempts'));
"

Agent Override Test

node -e "
const { RetryPolicyEngine } = require('./lib/retry-policy.cjs');
const e = new RetryPolicyEngine({ agentOverrides: { 'gsd-planner': { timeout: 5 } } });
console.log('planner timeout:', e.getMaxRetries('timeout', 'gsd-planner'));  // 5
console.log('executor timeout:', e.getMaxRetries('timeout', 'gsd-executor')); // 2
"
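
For readers curious how such an override lookup could work, here is a minimal sketch. It assumes the agentOverrides shape used in the test above; maxRetries and DEFAULTS are hypothetical names, and treating unknown failure types as non-retryable is an assumption rather than documented behavior.

```javascript
// Hypothetical sketch; the real lookup in lib/retry-policy.cjs may differ.
const DEFAULTS = {
  timeout: 2,
  'malformed-output': 1,
  'partial-completion': 1,
  hallucination: 0,
  'permission-denied': 0,
};

function maxRetries(failureType, agentType, overrides = {}) {
  const perAgent = overrides[agentType] || {};
  // Own-property check so that an explicit override of 0 is honored.
  if (Object.prototype.hasOwnProperty.call(perAgent, failureType)) {
    return perAgent[failureType];
  }
  // Unknown failure types fall through to 0 retries (assumption).
  return DEFAULTS[failureType] ?? 0;
}

const overrides = { 'gsd-planner': { timeout: 5 } };
console.log(maxRetries('timeout', 'gsd-planner', overrides));  // 5
console.log(maxRetries('timeout', 'gsd-executor', overrides)); // 2
```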

Non-modification Verification

node -e "
const r = require('./lib/retry.cjs');
console.log('retry.cjs MAX_RETRIES:', r.MAX_RETRIES);  // 3
console.log('classifyFailure:', typeof r.classifyFailure);  // function
console.log('withRetry:', typeof r.withRetry);  // function
"

@snipcodeit snipcodeit merged commit be9019f into main Mar 6, 2026
4 of 5 checks passed
@snipcodeit snipcodeit deleted the issue/232-implement-configurable-retry-policy-en branch March 6, 2026 07:39
snipcodeit pushed a commit that referenced this pull request Mar 6, 2026
…ent-ex

Resolve conflicts in lib/retry-policy.cjs: keep model fallback additions
from PR #242 on top of base retry engine from PR #241.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>