Background
On Jan 9-10, 2026, the BGE Server hit OpenAI API rate limits (429 errors) due to an event flood. While retry logic was added (18b197c), a circuit breaker would provide better protection and faster failure.
Problem
Currently, when OpenAI returns 429 errors:
- Each request retries up to 5 times with exponential backoff
- During a flood, hundreds of requests queue up, all retrying
- This creates a "thundering herd" when the rate limit clears
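To make the amplification concrete, here is a rough back-of-the-envelope sketch. The queue size (300) and the 1-second backoff base are hypothetical; the actual retry parameters from 18b197c are not stated in this issue. It also assumes "retries up to 5 times" means one initial attempt plus five retries.

```python
def worst_case_api_calls(queued_requests: int, max_retries: int) -> int:
    """Upper bound on API calls if every queued request exhausts its retries."""
    # One initial attempt plus max_retries retries per request (assumption).
    return queued_requests * (1 + max_retries)


def backoff_delays(base_seconds: float, max_retries: int) -> list[float]:
    """Delay before each retry under plain exponential backoff, no jitter."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]


# Hypothetical flood: 300 queued requests, 5 retries each, 1 s base delay.
calls = worst_case_api_calls(queued_requests=300, max_retries=5)
delays = backoff_delays(base_seconds=1.0, max_retries=5)
print(calls)   # 1800
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Without jitter, requests that failed at the same moment also retry at the same moments, which is where the herd comes from.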
Proposed Solution
Implement the circuit breaker pattern for OpenAI API calls:
States:
┌────────┐    failures > threshold     ┌────────┐
│ CLOSED │ ──────────────────────────▶ │  OPEN  │
└────────┘                             └────────┘
    ▲                                      │
    │ success                              │ timeout
    │    ┌─────────────┐                   │
    └────│  HALF-OPEN  │◀──────────────────┘
         └─────────────┘
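CLOSED: Normal operation; requests go through.
OPEN: All requests fail immediately without an API call; the caller gets a cached result or an error.
HALF-OPEN: Allow a test request through. On success → CLOSED (after CIRCUIT_BREAKER_SUCCESS_THRESHOLD successes, per the configuration below); on failure → OPEN.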
Configuration
CIRCUIT_BREAKER_FAILURE_THRESHOLD = 5 # failures before opening
CIRCUIT_BREAKER_SUCCESS_THRESHOLD = 2 # successes to close
CIRCUIT_BREAKER_TIMEOUT = 60 # seconds before half-open
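A minimal sketch of the "custom implementation" route, using the three settings above. The class and exception names are hypothetical, and several choices here are assumptions rather than decisions made in this issue: it counts consecutive failures, opens on reaching the threshold, treats every exception as a failure (a real version would likely count only 429s), lets all calls through while half-open instead of exactly one, and is not thread-safe.

```python
import time
from enum import Enum

CIRCUIT_BREAKER_FAILURE_THRESHOLD = 5   # failures before opening
CIRCUIT_BREAKER_SUCCESS_THRESHOLD = 2   # successes to close
CIRCUIT_BREAKER_TIMEOUT = 60            # seconds before half-open


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the API."""


class CircuitBreaker:
    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so tests need not sleep
        self.state = State.CLOSED
        self._failures = 0           # consecutive failures while closed
        self._successes = 0          # successes while half-open
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < CIRCUIT_BREAKER_TIMEOUT:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: start letting trial requests through.
            self.state = State.HALF_OPEN
            self._successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self._failures += 1
        # Any failure while half-open reopens the circuit immediately.
        if (self.state is State.HALF_OPEN
                or self._failures >= CIRCUIT_BREAKER_FAILURE_THRESHOLD):
            self.state = State.OPEN
            self._opened_at = self._clock()

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self._successes += 1
            if self._successes >= CIRCUIT_BREAKER_SUCCESS_THRESHOLD:
                self.state = State.CLOSED
                self._failures = 0
        else:
            self._failures = 0       # a success breaks the failure streak
```

Usage would look like `breaker.call(fetch_embedding, text)`, where `fetch_embedding` stands in for whatever function currently wraps the OpenAI request.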
Benefits
- Fast failure - Don't waste time on doomed requests
- Reduced load - Stop hammering rate-limited API
- Graceful degradation - Return cached embeddings or skip
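The graceful-degradation path could look roughly like the sketch below. All names here (`embed_with_fallback`, `fetch`, `circuit_is_open`) are hypothetical, and the cache is a plain dict for illustration; how bge_server.py actually caches embeddings is not described in this issue.

```python
from typing import Callable, Optional, Sequence

Embedding = Sequence[float]


def embed_with_fallback(
    text: str,
    fetch: Callable[[str], Embedding],
    cache: dict,
    circuit_is_open: Callable[[], bool],
) -> Optional[Embedding]:
    """Return a fresh embedding, a cached one, or None so the caller can skip."""
    if circuit_is_open():
        # Fast failure: no API call while the breaker is open.
        return cache.get(text)
    embedding = fetch(text)  # exceptions propagate so the breaker can count them
    cache[text] = embedding
    return embedding


# Hypothetical usage with the circuit open: only cached texts get a result.
cache = {"hello": [0.1, 0.2]}
hit = embed_with_fallback("hello", lambda t: [9.9], cache, lambda: True)
miss = embed_with_fallback("new text", lambda t: [9.9], cache, lambda: True)
print(hit)   # [0.1, 0.2]
print(miss)  # None
```

Returning None rather than raising leaves the "skip" decision to the caller, which fits an event pipeline that can reprocess items later.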
Implementation Options
- pybreaker - Python circuit breaker library
- Custom implementation - Simple state machine
- tenacity - Already handles retries; a circuit breaker could be layered on top of it
Files to Modify
- /opt/projects/koi-processor/src/core/bge_server.py
- Possibly event bridge if it makes direct API calls
Related
Labels
enhancement, resilience