fix: add exponential backoff to stream retries for terminated errors #1227
Conversation
…1187)

Addresses issue where 'terminated' errors from OpenAI Responses Provider (especially in Codex mode) were breaking the agent loop instead of being properly retried with exponential backoff.

Changes:
- Increase maxStreamingAttempts from 2 to 5 for Codex mode, 4 for regular
- Add exponential backoff with jitter between stream retries (5s initial, 30s max, ±30% jitter), matching fetch-level retry behavior
- Import delay utility for backoff timing
- Enhance debug logging to show retry delay information

Tests added:
- Verify exponential backoff delays between stream retries
- Verify Codex mode uses 5 attempts (vs 4 for regular mode)
- Verify regular mode fails after exhausting 4 attempts
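The retry policy in the commit above can be sketched as a small helper. The constant and function names here are illustrative assumptions, not the provider's actual code:

```typescript
// Hypothetical constants/helper reflecting the retry policy described above;
// the real values live inside OpenAIResponsesProvider.
const STREAM_RETRY_INITIAL_DELAY_MS = 5_000; // 5s initial backoff
const STREAM_RETRY_MAX_DELAY_MS = 30_000; // 30s cap

function maxStreamingAttempts(isCodexMode: boolean): number {
  // Codex mode gets an extra attempt (5 vs 4) because its API is flakier.
  return isCodexMode ? 5 : 4;
}
```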
Summary by CodeRabbit
Walkthrough: OpenAIResponsesProvider now implements exponential backoff with jitter for stream retries, introducing per-stream delay state and different maximum retry attempts (5 for Codex, 4 for other modes). Corresponding tests validate the retry timing and behavior across both standard and Codex modes.
🚥 Pre-merge checks: ✅ 4 passed
LLxprt PR Review – PR #1227

Title: fix: add exponential backoff to stream retries for terminated errors

Issue Alignment

Evidence: The PR directly addresses Issue #1187. The original code had only 2 retry attempts and no backoff delay between them.
Side Effects
Code Quality
Tests and Coverage

Coverage Impact: Increase

Three new tests added:
All tests mock the provider's streaming calls rather than hitting the live API.

Verdict: Ready

The PR implements the fix as described in Issue #1187 with proper exponential backoff, increased retry attempts for the flaky Codex API, and comprehensive test coverage. No blocking issues identified.
Code Coverage Summary
CLI Package - Full Text Report

Core Package - Full Text Report

For detailed HTML reports, please see the 'coverage-reports-24.x-ubuntu-latest' artifact from the main CI run.
Issue: #1228

Problem: AnthropicProvider had retryWithBackoff around the initial API call but no retry logic for mid-stream errors during stream consumption. Transient network errors (like 'terminated') that occurred during streaming would break the agent loop instead of being retried.

Solution:
- Added stream retry constants (matching OpenAIResponsesProvider pattern):
  - STREAM_RETRY_INITIAL_DELAY_MS = 5000ms (5 seconds)
  - STREAM_RETRY_MAX_DELAY_MS = 30000ms (30 seconds)
  - STREAM_RETRY_MAX_ATTEMPTS = 4 (consistent retry count for Anthropic)
- Imported delay utility for exponential backoff
- Wrapped entire stream consumption in retry loop (lines 1948-2260):
  - Tracks streamingAttempt counter and currentDelay for backoff
  - On transient stream error: checks if error is retryable via isNetworkTransientError
  - If retryable and attempts remain: logs, delays with exponential backoff + jitter, makes a fresh API call to get a new stream, then restarts consumption
  - If not retryable or attempts exhausted: throws error
- Key implementation details:
  - When the stream fails mid-consumption, makes a new API call (apiCallWithResponse) to get a fresh stream from the Anthropic SDK
  - Uses exponential backoff with 30% jitter to avoid thundering herd
  - Matches OpenAIResponsesProvider pattern (PR #1227) for consistency

Testing:
- Created AnthropicProvider.streamRetry.test.ts with comprehensive test coverage:
  - Test: retries when streaming terminates after a successful API call
  - Test: waits with exponential backoff between retries (validates backoff increases)
  - Test: fails after max attempts with error thrown (validates 4 attempts)
  - Test: throws non-retryable errors immediately (no retry for bad request)

All tests pass, including full verification cycle:
- npm run test (all 5555 tests pass)
- npm run lint (clean)
- npm run typecheck (clean)
- npm run format (clean)
- npm run build (clean)
- node scripts/start.js --profile-load synthetic "write me a haiku" (works)

Fixes #1228
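The retry loop described in that commit can be sketched as follows. The function name `consumeWithRetry`, the options bag, and the simplified `isNetworkTransientError` stand-in are assumptions for illustration; the real implementation lives inside AnthropicProvider:

```typescript
const STREAM_RETRY_INITIAL_DELAY_MS = 5_000;
const STREAM_RETRY_MAX_DELAY_MS = 30_000;
const STREAM_RETRY_MAX_ATTEMPTS = 4;

const delay = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

// Minimal stand-in: the real classifier inspects error codes and messages.
function isNetworkTransientError(err: unknown): boolean {
  return (
    err instanceof Error &&
    /terminated|ECONNRESET|ETIMEDOUT/i.test(err.message)
  );
}

async function consumeWithRetry<T>(
  openStream: () => Promise<AsyncIterable<T>>,
  onEvent: (event: T) => void,
  opts: { initialDelayMs?: number } = {}, // injectable for fast tests
): Promise<void> {
  let attempt = 0;
  let currentDelay = opts.initialDelayMs ?? STREAM_RETRY_INITIAL_DELAY_MS;
  for (;;) {
    try {
      // A fresh API call is made on every attempt so we get a new stream.
      const stream = await openStream();
      for await (const event of stream) onEvent(event);
      return; // stream fully consumed
    } catch (err) {
      attempt += 1;
      if (!isNetworkTransientError(err) || attempt >= STREAM_RETRY_MAX_ATTEMPTS) {
        throw err; // not retryable, or attempts exhausted
      }
      // Exponential backoff with ±30% jitter to avoid a thundering herd.
      const jitter = 1 + (Math.random() * 0.6 - 0.3);
      await delay(Math.min(currentDelay * jitter, STREAM_RETRY_MAX_DELAY_MS));
      currentDelay = Math.min(currentDelay * 2, STREAM_RETRY_MAX_DELAY_MS);
    }
  }
}
```

Note that because a partially consumed stream cannot be resumed, each retry re-issues the API call and restarts consumption from the beginning of a new stream.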
TLDR
Fixes the issue where `terminated` errors from OpenAI Responses Provider (especially in Codex mode) were breaking the agent loop instead of being properly retried with exponential backoff. This change treats stream-level terminated errors like 429 rate limit errors by adding proper exponential backoff with jitter between retry attempts.
Dive Deeper
Problem
The OpenAI Responses Provider had a stream-level retry mechanism that was insufficient: it allowed only 2 attempts and retried immediately, with no backoff delay between them.
While the `terminated` error was correctly classified as retryable via `isNetworkTransientError()`, the lack of backoff delays meant retries happened immediately, often failing again due to the transient nature of network issues.
Solution
Increased retry attempts: maxStreamingAttempts raised from 2 to 5 for Codex mode and to 4 for regular mode
Added exponential backoff with jitter between stream retries: 5s initial delay, 30s maximum, ±30% jitter, matching the fetch-level retry behavior
Enhanced debug logging to show retry attempt number, delay time, and error message
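The enhanced debug log line might be formatted roughly like this; the function name and message format are illustrative assumptions, not the provider's actual logger:

```typescript
// Hypothetical formatter for the retry debug log described above.
function formatStreamRetryLog(
  attempt: number,
  maxAttempts: number,
  delayMs: number,
  err: Error,
): string {
  return `stream retry ${attempt}/${maxAttempts} in ${delayMs}ms after: ${err.message}`;
}
```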
Implementation Details
The backoff formula matches the existing `retryWithBackoff` utility:
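A sketch of that formula, assuming exponential doubling capped at 30s with ±30% jitter as described in this PR (the function name is illustrative; `retryWithBackoff`'s exact internals may differ):

```typescript
const MAX_DELAY_MS = 30_000; // 30s cap

function backoffStep(currentDelayMs: number): {
  waitMs: number; // how long to sleep before this retry
  nextDelayMs: number; // base delay carried into the next retry
} {
  // Apply ±30% jitter to the wait, then double the base delay, capping at 30s.
  const jitterFactor = 1 + (Math.random() * 0.6 - 0.3);
  return {
    waitMs: Math.min(currentDelayMs * jitterFactor, MAX_DELAY_MS),
    nextDelayMs: Math.min(currentDelayMs * 2, MAX_DELAY_MS),
  };
}
```

Starting from a 5s base, successive base delays are 5s, 10s, 20s, 30s (capped), with each actual wait jittered within ±30% of the base.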
Reviewer Test Plan
Testing Matrix
Linked issues / bugs
closes #1187