This is a Python-based proxy server designed to forward requests to the Cerebras API while managing multiple API keys through a round-robin rotation mechanism. It handles rate limiting (429) and server errors (500) by automatically rotating keys and retrying requests.
- Smart API Key Management:
  - Sticks with one key until it hits rate limits (429 errors)
  - Automatically switches to the next available key only when needed
  - Tracks cooldown periods for each key
  - Waits and retries if all keys are rate-limited instead of failing immediately
- Request Forwarding: Forwards all incoming HTTP requests to the Cerebras API endpoint (`https://api.cerebras.ai/v1/`).
- Dynamic Authorization: Injects the `Authorization: Bearer <api_key>` header dynamically for each request using the current key.
- Intelligent Error Handling:
  - Automatically rotates keys on `429 Too Many Requests` or `500 Internal Server Error` responses
  - Marks keys as temporarily unavailable and tracks when they can be retried
  - Waits for the next available key instead of immediately failing
- Concurrency Support: Built with `aiohttp` to efficiently handle multiple concurrent requests with thread-safe key rotation.
- Status Monitoring: Built-in `/_status` endpoint to monitor API key health and rotation state.
- Request/Response Logging: Optional filesystem logging to save all requests and responses as JSON files for auditing, debugging, or analysis (enabled by default).
- Automatic Tool Call Validation: Detects and fixes missing tool responses in chat completion requests by automatically injecting fake "failed" responses to maintain valid conversation flow.
- Python 3.7 or higher
- `aiohttp` library
- `Brotli` library (required for handling Brotli-compressed responses from the Cerebras API)

Install the required libraries:

```
pip install aiohttp Brotli
```

Or install from requirements.txt:

```
pip install -r requirements.txt
```

The proxy server uses environment variables for configuration:
- `CEREBRAS_API_KEYS`: JSON string containing your Cerebras API keys.

  Example JSON format:

  ```json
  {
    "key1": "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "key2": "sk-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy",
    "key3": "sk-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
  }
  ```

- `CEREBRAS_COOLDOWN`: Number of seconds to wait before retrying a rate-limited key (default: 60)

  Example: `CEREBRAS_COOLDOWN=90`

- `LOG_REQUESTS`: Enable or disable request/response logging (default: true)

  Example: `LOG_REQUESTS=false`

- `LOG_DIR`: Directory to save request/response logs (default: `./logs`)

  Example: `LOG_DIR=/var/log/cerebras-proxy`
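For illustration, parsing these variables can be sketched as follows. The `load_config` helper name and the ability to pass a fake environment are assumptions for this example; only the variable names and defaults come from the documentation above.

```python
import json
import os

def load_config(env=os.environ):
    """Illustrative helper (not from the codebase): read the proxy's
    settings from environment variables with the documented defaults."""
    keys = json.loads(env["CEREBRAS_API_KEYS"])            # {"key1": "sk-...", ...}
    cooldown = float(env.get("CEREBRAS_COOLDOWN", "60"))   # seconds
    log_requests = env.get("LOG_REQUESTS", "true").lower() == "true"
    log_dir = env.get("LOG_DIR", "./logs")
    return keys, cooldown, log_requests, log_dir

# Example usage with a fake environment:
fake_env = {"CEREBRAS_API_KEYS": '{"key1": "sk-aaa", "key2": "sk-bbb"}',
            "CEREBRAS_COOLDOWN": "90"}
keys, cooldown, log_requests, log_dir = load_config(fake_env)
print(len(keys), cooldown, log_requests, log_dir)  # 2 90.0 True ./logs
```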
- Set the `CEREBRAS_API_KEYS` environment variable with your JSON configuration:

  ```
  export CEREBRAS_API_KEYS='{"key1":"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx","key2":"sk-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy","key3":"sk-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"}'
  ```

- Run the proxy server:

  ```
  python proxy_server.py
  ```

By default, the server will start on `127.0.0.1:8080`. You can modify the host and port by adjusting the parameters in the `run()` method call within `proxy_server.py`.
This project includes Docker configuration for easy deployment.

- Build the Docker image:

  ```
  docker build -t cerebras-proxy .
  ```

- Run the container with your API keys:

  ```
  docker run -p 18080:8080 -e CEREBRAS_API_KEYS='{"key1":"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx","key2":"sk-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"}' cerebras-proxy
  ```
- Create a `.env` file in the project root with your API keys:

  ```
  CEREBRAS_API_KEYS={"key1":"sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx","key2":"sk-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"}
  ```

- Build and run with Docker Compose:

  ```
  docker-compose up --build
  ```

The proxy will be available at http://localhost:18080.

Note: Logs are automatically mounted to `./logs` on the host machine for easy access. You can customize logging behavior in the `.env` file:

```
LOG_REQUESTS=true
LOG_DIR=/app/logs
```
Make requests to the proxy server as you would to the Cerebras API, but use the proxy's address instead.

For example, if the Cerebras API endpoint is:

```
POST https://api.cerebras.ai/v1/chat/completions
```

you would send your request to:

```
POST http://127.0.0.1:8080/chat/completions
```

The proxy will handle adding the appropriate `Authorization` header.
Unlike traditional round-robin proxies that switch keys on every request, this proxy uses intelligent key management:
- Sticky Keys: The proxy sticks with one key and uses it for all requests until it encounters a problem
- Smart Switching: Only rotates to the next key when receiving a 429 (rate limit) or 500 (server error)
- Cooldown Tracking: Remembers when each key was rate-limited and won't try it again until the cooldown expires
- Automatic Waiting: If all keys are rate-limited, the proxy waits for the soonest available key instead of failing
- State Persistence: The current key position is maintained across requests, so you don't always start with key1
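The sticky-key behavior above can be sketched roughly as follows. This is a simplified synchronous illustration, not the project's actual implementation: the real `ApiKeyManager` is asynchronous and lock-protected, and `StickyKeyManager` is an invented name for this example.

```python
import time

class StickyKeyManager:
    """Illustrative sketch of the sticky rotation strategy: keep using one
    key, and only move on when it is marked rate-limited."""

    def __init__(self, keys, cooldown=60, clock=time.monotonic):
        self.keys = list(keys.items())   # [(name, secret), ...]
        self.cooldown = cooldown
        self.clock = clock               # injectable for testing
        self.current = 0                 # sticky position, kept across requests
        self.unavailable_until = {}      # key name -> timestamp when usable again

    def get_key(self):
        """Return the current key, skipping only keys still in cooldown."""
        now = self.clock()
        for _ in range(len(self.keys)):
            name, secret = self.keys[self.current]
            if self.unavailable_until.get(name, 0) <= now:
                return name, secret
            self.current = (self.current + 1) % len(self.keys)
        # All keys cooling down: report how long until the soonest comes back.
        wait = min(self.unavailable_until.values()) - now
        raise RuntimeError(f"all keys rate-limited; soonest available in {wait:.1f}s")

    def mark_rate_limited(self, name):
        """Called on a 429/500: start the cooldown and advance to the next key."""
        self.unavailable_until[name] = self.clock() + self.cooldown
        self.current = (self.current + 1) % len(self.keys)
```

In the real proxy, the "all keys rate-limited" case waits for the soonest cooldown to expire rather than raising; the exception here just makes the wait time visible in a synchronous sketch.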
Example Flow:

```
Request 1-100:   Uses key1 (all successful)
Request 101:     key1 hits rate limit → switches to key2
Request 102-200: Uses key2 (all successful)
Request 201:     key2 hits rate limit → switches to key3
Request 202:     key3 hits rate limit → waits for key1 cooldown to expire
Request 203-300: Uses key1 again (cooldown expired)
```
The proxy provides a built-in status endpoint to monitor API key health and rotation state.
Access the status endpoint at:
GET http://localhost:18080/_status
Example response:

```json
{
  "keys": [
    {
      "name": "key1",
      "available": true,
      "rate_limited_for": 0,
      "error_count": 0
    },
    {
      "name": "key2",
      "available": false,
      "rate_limited_for": 45.2,
      "error_count": 3
    }
  ],
  "current_key": "key1"
}
```

Fields:

- `available`: Whether the key is currently available for use
- `rate_limited_for`: Seconds remaining until the key can be retried (0 if available)
- `error_count`: Number of consecutive errors for this key
- `current_key`: Name of the key currently being used
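A monitoring script can interpret such a response using only these documented fields. `summarize_status` below is a hypothetical helper, not part of the project; it assumes only the field names shown in the example above.

```python
import json

def summarize_status(payload):
    """Hypothetical helper: condense a /_status response into the current
    key, the available keys, and the cooldowns of unavailable ones."""
    status = json.loads(payload) if isinstance(payload, str) else payload
    available = [k["name"] for k in status["keys"] if k["available"]]
    cooling = {k["name"]: k["rate_limited_for"]
               for k in status["keys"] if not k["available"]}
    return status["current_key"], available, cooling

example = {
    "keys": [
        {"name": "key1", "available": True, "rate_limited_for": 0, "error_count": 0},
        {"name": "key2", "available": False, "rate_limited_for": 45.2, "error_count": 3},
    ],
    "current_key": "key1",
}
print(summarize_status(example))  # ('key1', ['key1'], {'key2': 45.2})
```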
The proxy includes optional filesystem logging to save all requests and responses for auditing, debugging, or analysis purposes.
By default, logging is enabled. To disable it:
```
export LOG_REQUESTS=false
```

Logs are saved to `./logs` by default. You can specify a custom log directory:

```
export LOG_DIR=/var/log/cerebras-proxy
```

Each request/response pair is saved as a separate JSON file with the following naming convention:

```
YYYYMMDD_HHMMSS_microseconds_METHOD_path_requestid.json
```

Logs are organized in date-based subdirectories:

```
logs/
├── 2025-11-06/
│   ├── 20251106_143022_123456_POST_chat_completions_abc123de.json
│   ├── 20251106_143023_234567_POST_chat_completions_xyz789ab.json
│   └── ...
└── 2025-11-07/
    └── ...
```
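The naming convention above can be reproduced with a small helper. `log_path` is an illustrative function written for this example; the proxy's own code may build paths differently.

```python
from datetime import datetime
from pathlib import Path
import uuid

def log_path(base_dir, method, path, when=None, request_id=None):
    """Illustrative: build a log file path following the documented
    convention YYYYMMDD_HHMMSS_microseconds_METHOD_path_requestid.json
    inside a date-based subdirectory."""
    when = when or datetime.now()
    request_id = request_id or uuid.uuid4().hex[:8]
    safe_path = path.strip("/").replace("/", "_")   # chat/completions -> chat_completions
    name = (f"{when:%Y%m%d_%H%M%S}_{when.microsecond:06d}"
            f"_{method}_{safe_path}_{request_id}.json")
    return Path(base_dir) / f"{when:%Y-%m-%d}" / name

p = log_path("logs", "POST", "chat/completions",
             when=datetime(2025, 11, 6, 14, 30, 22, 123456),
             request_id="abc123de")
print(p)  # logs/2025-11-06/20251106_143022_123456_POST_chat_completions_abc123de.json
```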
Each log file contains:
```json
{
  "timestamp": "2025-11-06T14:30:22.123456",
  "request_id": "abc123de",
  "request": {
    "method": "POST",
    "path": "chat/completions",
    "headers": {
      "Content-Type": "application/json",
      "Authorization": "[REDACTED]"
    },
    "body": {
      "model": "llama3.1-70b",
      "messages": [...]
    }
  },
  "response": {
    "status": 200,
    "headers": {
      "Content-Type": "application/json"
    },
    "body": {
      "id": "chat-...",
      "choices": [...]
    }
  },
  "duration_ms": 1234.56
}
```

- Authorization headers are automatically redacted in logs to prevent API key leakage
- Binary data is base64-encoded if the body is not valid JSON or UTF-8
- Logs are stored locally and never transmitted elsewhere
- Consider disk space when enabling logging for high-traffic deployments
The proxy automatically detects and fixes invalid tool call sequences in chat completion requests. This prevents API errors when clients fail to provide responses for tool calls.
When processing /chat/completions requests, the proxy:
- Scans the message array for assistant messages containing `tool_calls`
- Tracks pending tool calls that are waiting for responses
- Detects missing responses when:
  - A tool call is followed by a non-tool message (like a user message)
  - A tool call appears at the end of the messages array with no response
- Automatically injects fake responses with `content: "failed"` for each missing tool call
- Updates the Content-Length header to match the modified request body
Before (Invalid - would cause a 422 error):

```json
{
  "messages": [
    {
      "role": "assistant",
      "tool_calls": [{"id": "call_123", "function": {...}}]
    },
    {
      "role": "user",
      "content": "test"
    }
  ]
}
```

After (Valid - automatically fixed by the proxy):

```json
{
  "messages": [
    {
      "role": "assistant",
      "tool_calls": [{"id": "call_123", "function": {...}}]
    },
    {
      "role": "tool",
      "tool_call_id": "call_123",
      "content": "failed"
    },
    {
      "role": "user",
      "content": "test"
    }
  ]
}
```

When the fix is applied, you'll see log messages like:

```
WARNING:__main__:Found 1 tool_calls without responses. Injecting fake 'failed' responses.
INFO:__main__:Injected fake tool response for tool_call_id: call_123
INFO:__main__:Applied tool_call fix: 2 -> 3 messages (size: 1234 -> 1456 bytes)
```
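The injection pass described above can be sketched as follows. `fix_tool_calls` is an illustrative name, not the proxy's actual function, and this sketch omits the logging and Content-Length update.

```python
def fix_tool_calls(messages):
    """Sketch: inject a fake {"role": "tool", "content": "failed"} reply for
    every tool call that never received a response, preserving order."""
    fixed = []
    pending = []  # tool_call ids still waiting for a "tool" reply

    def flush():
        # Inject fake "failed" responses for every unanswered tool call.
        for call_id in pending:
            fixed.append({"role": "tool", "tool_call_id": call_id,
                          "content": "failed"})
        pending.clear()

    for msg in messages:
        if msg.get("role") == "tool":
            # A genuine tool reply answers its pending call.
            if msg.get("tool_call_id") in pending:
                pending.remove(msg["tool_call_id"])
            fixed.append(msg)
            continue
        flush()  # any non-tool message means the pending calls went unanswered
        fixed.append(msg)
        if msg.get("role") == "assistant":
            pending.extend(c["id"] for c in msg.get("tool_calls", []))
    flush()      # tool calls dangling at the end of the array
    return fixed
```

Running this on the "Before" example above yields the "After" sequence: the fake `tool` message is inserted between the assistant's tool call and the user message.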
This feature is always enabled for all chat completion requests and requires no configuration.
This proxy server is built using the aiohttp framework for high performance and concurrency. It consists of two main components:
- `ApiKeyManager` (api_key_manager.py): Manages the pool of API keys with intelligent rotation:
  - Tracks each key's state (available vs. rate-limited)
  - Maintains the current active key instead of rotating on every request
  - Only switches keys when a rate limit (429) or server error (500) is encountered
  - Implements automatic wait/retry when all keys are temporarily unavailable
  - Uses `asyncio.Lock` for thread-safe operation across concurrent requests
- `ProxyServer` (proxy_server.py): The main application that listens for HTTP requests:
  - A catch-all route forwards requests to the Cerebras API
  - Integrates with `ApiKeyManager` to get the current key and handle rotation
  - Automatically retries with the next available key on failures
  - Provides a `/_status` endpoint for monitoring key health
The proxy implements smart error handling with automatic recovery:
- Rate Limiting (429): When a key hits rate limits:
  - The key is marked as unavailable with a cooldown period (default 60 seconds)
  - The proxy automatically switches to the next available key
  - The request is retried immediately with the new key
  - After the cooldown period, the key becomes available again
- Server Errors (500): Treated similarly to rate limits:
  - The key is marked as temporarily unavailable
  - The proxy automatically rotates to the next available key
  - The request is retried with the new key
- All Keys Rate-Limited: If all keys are temporarily unavailable:
  - The proxy calculates which key will become available soonest
  - Waits for that cooldown period
  - Automatically retries the request when a key becomes available
  - No immediate `503` error - the proxy handles the waiting for you
- Other Errors: Non-retryable errors (4xx client errors, network issues) are returned to the client immediately
- Maximum Retries: After (number of keys × 2) attempts, the proxy returns `503 Service Unavailable`
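The retry policy above can be condensed into a small loop. This is a self-contained sketch: `send` stands in for the real async forwarding call, key rotation is simplified to a list index, and `forward_with_retries` is an invented name.

```python
def forward_with_retries(send, keys):
    """Sketch of the documented retry policy: stick with the current key,
    rotate on 429/500, and give up with 503 after (number of keys × 2)
    attempts. `send(key)` returns an (http_status, body) pair."""
    current = 0
    for _ in range(len(keys) * 2):            # maximum retries
        status, body = send(keys[current])
        if status in (429, 500):
            current = (current + 1) % len(keys)   # rotate and retry
            continue
        return status, body                    # success or non-retryable error
    return 503, "Service Unavailable"
```

Non-retryable statuses (e.g. a 422 from a malformed request) fall through the `if` and are returned to the caller on the first attempt, matching the "Other Errors" rule above.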
This project is licensed under the MIT License - see the LICENSE file for details.