
Add OpenAI-compatible audio transcription endpoint#2224

Open
madwiki wants to merge 9 commits into router-for-me:main from madwiki:main

Conversation


@madwiki madwiki commented Mar 19, 2026

Summary

This PR adds support for OpenAI-compatible audio transcription requests via POST /v1/audio/transcriptions.

The new endpoint is implemented as a provider-native multipart proxy path, reusing the existing auth selection flow instead of introducing any client-specific behavior or additional credential requirements.

What Changed

  • Added POST /v1/audio/transcriptions
  • Accepted multipart form fields:
    • model
    • file
    • prompt (optional)
    • language (optional)
  • Normalized responses to JSON with at least a text field
  • Added a raw HTTP execution path in the auth manager for provider-native endpoints that do not fit the existing JSON translation pipeline
  • Preserved existing Codex/OpenAI auth selection behavior for audio requests
  • Updated Codex HTTP request preparation so multipart requests keep their original content type while still applying provider-specific auth headers
  • Exposed POST /v1/audio/transcriptions in the root endpoint list
  • Registered gpt-4o-mini-transcribe in Codex model sets
  • Added tests covering route exposure, multipart validation, response normalization, and Codex request preparation

Why

CLIProxyAPI already supports OpenAI-compatible chat and responses endpoints, but it did not expose the speech-to-text transcription endpoint.

Because audio transcription uses multipart/form-data, it does not fit cleanly into the existing JSON translator flow. This change adds the missing capability with minimal impact on the current architecture by routing it through a dedicated raw HTTP execution path.

Testing

  • go test ./...

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the CLI Proxy API by integrating an OpenAI-compatible audio transcription endpoint. This new capability allows users to perform speech-to-text conversions, addressing a previously missing feature. The implementation leverages a provider-native multipart proxy approach, ensuring compatibility with existing authentication mechanisms and minimizing architectural impact by routing through a specialized raw HTTP execution path for non-JSON requests.

Highlights

  • New Audio Transcription Endpoint: Added a new OpenAI-compatible endpoint at POST /v1/audio/transcriptions to support speech-to-text functionality.
  • Multipart Request Handling: Implemented robust handling for multipart form data, including file uploads and optional fields like model, prompt, and language, while preserving original content types during proxying.
  • Raw HTTP Execution Path: Introduced a dedicated raw HTTP execution path within the authentication manager to accommodate provider-native endpoints that do not fit the standard JSON translation pipeline.
  • Response Normalization: Ensured that all transcription responses are normalized to JSON format, always including at least a text field for consistency.
  • New Model Registration: Registered the gpt-4o-mini-transcribe model in the Codex model sets to enable its use with the new transcription service.


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully adds a new OpenAI-compatible audio transcription endpoint. The changes are well-structured, including a new raw HTTP execution path in the auth manager to handle multipart form data, which is a thoughtful architectural addition. The implementation also correctly normalizes responses to ensure a consistent API for clients. My review includes two main points: a high-severity suggestion to improve memory efficiency by streaming audio files instead of buffering them entirely in memory, and a medium-severity suggestion to simplify some redundant header-setting logic for better code clarity and maintainability.

Comment on lines +169 to +172
fileData, err := io.ReadAll(file)
if err != nil {
return nil, &audioRequestError{status: http.StatusBadRequest, msg: fmt.Sprintf("failed to read uploaded file: %v", err)}
}
Contributor


high

The current implementation reads the entire uploaded audio file into memory using io.ReadAll. This can lead to high memory consumption and potential out-of-memory errors for large files, creating a possible denial-of-service vulnerability. While the OpenAI API has a 25MB limit, this proxy might be used with backends that support larger files.

To improve efficiency and robustness, I recommend streaming the file content directly from the incoming request to the outgoing request to the provider, without buffering the entire file in memory. This could be achieved by modifying BuildHTTPRequest to use an io.Pipe to construct the multipart request body on-the-fly, which would avoid loading the whole file into r.FileData.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a7d266b64

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


@xkonjin xkonjin left a comment


Code Review

Overall: Well-structured addition of an OpenAI-compatible audio transcription endpoint. Good test coverage. A few items worth attention:

Security and Input Validation

  1. Memory usage: audioTranscriptionFormMemory is 32MB, and io.ReadAll(file) after ParseMultipartForm means file data is buffered twice (~50MB heap per request for a 25MB file). Consider reading directly from the multipart part, or document the memory ceiling.

  2. File validation is extension + MIME only: validateAudioFile checks file extension and Content-Type header but not magic bytes. Mitigated by upstream API rejection, but worth noting.

  3. RemoveAll in defer: Good. Multipart temp files get cleaned up on all paths.

Potential Issues

  1. Response normalization edge case: normalizeAudioTranscriptionResponse tries json.Unmarshal into a string as a fallback. If upstream returns a bare non-JSON string with a leading ", Unmarshal succeeds and strips quotes. Probably fine in practice but the behavior could surprise.

  2. Hardcoded Codex URL: The fallback https://chatgpt.com/backend-api/transcribe for codex auth is hardcoded. If this endpoint changes it requires a code change. Consider making it configurable.
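The normalization fallback described in item 1 can be illustrated with a small stand-in (a hypothetical normalize function; the PR's actual helper is normalizeAudioTranscriptionResponse and may differ in detail):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// normalize sketches the fallback behavior described above: JSON objects with
// a "text" field pass through, JSON-encoded strings are unwrapped, and
// anything else is treated as plain text. Note the edge case: a bare upstream
// body that happens to be a valid JSON string, like `"hello"`, is unquoted
// rather than kept verbatim.
func normalize(body []byte) []byte {
	var obj map[string]any
	if err := json.Unmarshal(body, &obj); err == nil {
		if _, ok := obj["text"]; ok {
			return body // already a JSON object with a text field
		}
	}
	var s string
	if err := json.Unmarshal(body, &s); err == nil {
		out, _ := json.Marshal(map[string]string{"text": s}) // quotes stripped
		return out
	}
	out, _ := json.Marshal(map[string]string{"text": string(body)}) // plain text
	return out
}

func main() {
	fmt.Println(string(normalize([]byte(`"hello"`))))       // {"text":"hello"}
	fmt.Println(string(normalize([]byte("hello"))))         // {"text":"hello"}
	fmt.Println(string(normalize([]byte(`{"text":"hi"}`)))) // {"text":"hi"}
}
```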

Test Quality

Tests are solid, covering happy path, validation, endpoint registration, and header preservation. Missing tests for:

  • Codex provider URL resolution path
  • normalizeAudioTranscriptionResponse edge cases (empty body, existing text field, bare JSON string)
  • Retry/failover behavior via ExecuteHTTPRequest

Architecture

The ExecuteHTTPRequestWithAuthManager + HTTPRequestBuilder approach to bypass the JSON translator stack for binary multipart uploads is the right call. Clean separation from chat/completion flow. The header refactoring in codex_executor.go (extracting applyCodexPreparedHeaders) is a clean prerequisite change.

Good work overall.


@xkonjin xkonjin left a comment


Code Review

Good work adding the audio transcription endpoint. The overall structure is solid — proper multipart parsing, file validation, auth routing, and response normalization. A few observations:

Potential Issues

  1. Unbounded file read into memory: io.ReadAll(file) in parseAudioTranscriptionRequest and io.ReadAll(upstreamResp.Body) in AudioTranscriptions have no size cap beyond the 32MB multipart form limit. If an upstream response is unexpectedly large (e.g. verbose JSON from a misconfigured provider), this could spike memory. Consider adding a LimitedReader on the upstream response body.

  2. cliCancel called with body bytes — In the success path, cliCancel(normalizedBody) passes the entire response body as the cancel reason. If this is for logging/tracing, fine, but make sure the downstream context handler can handle []byte (not just error or string). If it expects a specific type, this could silently fail or panic depending on the assertion.

  3. Model registration duplication: gpt-4o-mini-transcribe is copy-pasted into 4 tier blocks in models.json (codex-team, codex-plus, codex-pro, and the unnamed block). If tiers diverge or a new transcription model is added, this is easy to get out of sync. Not blocking, but worth noting for maintainability.

Nits

  • applyCodexHeaders now calls applyCodexPreparedHeaders and then unconditionally sets Content-Type to application/json. The PrepareRequest path intentionally skips setting Content-Type (so multipart is preserved). This split is correct but fragile — a comment on applyCodexHeaders noting it is only for JSON request paths would help future readers.

  • resolveAudioTranscriptionURL falls back to https://chatgpt.com/backend-api/transcribe for codex auth without a base_url. If that endpoint ever changes or requires different auth headers, this hardcoded URL could silently break. Consider making it configurable or at least adding a constant with a comment.

Good

  • File validation checks both extension and MIME type with a fallback to http.DetectContentType — solid defensive approach.
  • Tests cover both the happy path (plain-text wrapping) and rejection of unsupported formats.
  • The HTTPResponseError type with StatusCode() and Headers() is clean and will compose well with the existing retry machinery.

Overall: looks ready with the memory/size concern as the main thing to address before merge.

Author

madwiki commented Mar 19, 2026


Thanks for the review. Addressed in 78b3a10.

This follow-up keeps the external API unchanged and tightens the implementation around memory usage, bounds, and maintainability.

Changes made:

  • Replaced the previous ParseMultipartForm + full in-memory buffering path with a streaming MultipartReader flow.
  • The uploaded audio is now staged into a temp file, and the outbound multipart request is rebuilt by reopening that staged file instead of buffering the whole file in memory.
  • Bounded upstream response reads and now return a clear 502-style error if the upstream transcription payload exceeds the configured limit.
  • Replaced the relevant numeric and URL literals with named constants/helpers, including the default Codex transcription endpoint.
  • Added focused tests for response normalization edge cases, default Codex URL resolution, oversized upstream responses, auth failover on retriable upstream HTTP failures, and staged temp-file cleanup.
  • Added a short comment clarifying that applyCodexHeaders is the JSON-request helper, while the prepared-request path intentionally preserves caller-provided content types such as multipart/form-data.

On the cliCancel point, I did not change that path because the existing cancel helper already handles []byte payloads explicitly, so this is not a new issue introduced by this endpoint.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 78b3a104a4


Author

madwiki commented Mar 19, 2026

Addressed in ea79e77 in response to the Codex review feedback.

This follow-up fixes the compatibility and scheduler issues called out there:

  • explicit non-JSON transcription formats are now preserved instead of always being wrapped as JSON
  • non-file multipart fields are now bounded
  • the raw HTTP path now applies resolved upstream models before request construction
  • auth-specific request-build failures now fall through to the next eligible model/auth when appropriate
  • Codex transcription URL derivation now follows the existing Codex base_url semantics

I intentionally kept this scoped to the issues that affected correctness, compatibility, or robustness.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ea79e77196


Author

madwiki commented Mar 19, 2026

Addressed in 8cdeddc in response to the latest Codex review feedback.

This follow-up fixes the two error-path issues called out there:

  • raw HTTP upstream error bodies are now bounded before constructing HTTPResponseError
  • upstream error headers are now filtered through the same header filter before being surfaced to clients when passthrough headers are enabled

I did not add transcription streaming support in this pass, since that would be a broader compatibility expansion beyond the current non-streaming transcription flow.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8cdeddca17


Author

madwiki commented Mar 19, 2026

Addressed in 35a54e4 in response to the latest Codex review feedback.

This follow-up fixes the remaining model=auto routing gap for audio transcription. Instead of resolving auto from a global transcription-looking model list, the audio path now resolves it against auths that can actually serve /v1/audio/transcriptions and reuses the auth-manager's existing execution-model candidate logic.

That means:

  • auto no longer picks models from unsupported providers just because they look transcription-capable in the registry
  • OpenAI-compatible alias / model-pool setups are now respected when choosing the audio route model
  • the selected route model still expands to the correct upstream transcription model during raw HTTP execution

I also added regression coverage for both cases: avoiding a newer non-audio chat model and resolving through an OpenAI-compatible alias setup.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 35a54e4ad4


Comment on lines +592 to +597
for _, auth := range manager.List() {
if auth == nil || strings.TrimSpace(auth.ID) == "" {
continue
}
if _, err := resolveAudioTranscriptionURL(auth); err != nil {
continue


P2 Badge Exclude unavailable auths from audio auto resolution

This model=auto picker walks every auth returned by manager.List() and only filters on resolveAudioTranscriptionURL, so disabled or cooling-down credentials still contribute candidate models. The scheduler later rejects those auths in sdk/cliproxy/auth/selector.go, which means a newer transcription model on an unavailable credential can win here and then fail with auth_not_found even though an older active transcription model is available. Please restrict auto candidates to auths that are actually selectable.


Comment on lines +127 to +130
cliCtx, cliCancel := h.GetContextWithCancel(h, c, context.Background())
upstreamResp, _, errMsg := h.ExecuteHTTPRequestWithAuthManager(cliCtx, audioReq.Model, func(ctx context.Context, auth *coreauth.Auth, upstreamModel string) (*http.Request, error) {
return audioReq.BuildHTTPRequest(ctx, auth, upstreamModel)
})


P2 Badge Emit non-stream keepalives while transcriptions run

This new non-streaming handler goes straight from GetContextWithCancel to ExecuteHTTPRequestWithAuthManager without calling StartNonStreamingKeepAlive, unlike the existing non-streaming OpenAI/Responses handlers. In deployments that set non_stream_keep_alive_interval to keep reverse proxies alive, long transcription jobs will now sit silent until completion and can be timed out by nginx/Cloudflare even though the other long-running endpoints keep the connection open.


Comment on lines +722 to +728
limitedReader := &io.LimitedReader{R: body, N: audioTranscriptionUpstreamResponseLimitBytes + 1}
payload, err := io.ReadAll(limitedReader)
if err != nil {
return nil, fmt.Errorf("failed to read upstream transcription response: %w", err)
}
if int64(len(payload)) > audioTranscriptionUpstreamResponseLimitBytes {
return nil, &audioRequestError{


P2 Badge Remove the fixed 8 MiB cap on successful transcription bodies

Successful transcription responses are always buffered and rejected once they exceed 8 MiB. That is much smaller than the 32 MiB upload limit this handler accepts, and it breaks legitimate large outputs such as long text/subtitle responses or verbose_json with word-level timestamps, which can easily cross 8 MiB for larger audio files. In those cases the upstream request succeeds but the proxy turns it into a 502.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a40ff0c2d


Comment on lines +647 to +648
upstreamModels := manager.ExecutionModelCandidates(auth, routeModel)
if len(upstreamModels) == 0 {


P2 Badge Avoid advancing alias-pool offsets while resolving model=auto

When an OpenAI-compatible alias maps to multiple upstream transcription models, resolveAudioRouteModelCreatedAt now calls manager.ExecutionModelCandidates(...) only to inspect candidates. That method delegates to prepareExecutionModels in sdk/cliproxy/auth/conductor.go, which increments modelPoolOffsets via nextModelPoolOffset. The same request then enters ExecuteHTTPRequest and rotates the pool a second time, so model=auto requests skip the model/fallback order that should have been attempted first and skew round-robin balancing for every alias-backed transcription pool.


if value == "" {
return false
}
return strings.Contains(value, "transcribe") || strings.Contains(value, "transcription") || strings.Contains(value, "speech-to-text")


P2 Badge Treat whisper-1 as transcription-capable during auto resolution

The new model=auto detection only recognizes model names containing transcribe, transcription, or speech-to-text. If an OpenAI-compatible backend advertises whisper-1 (or another speech model without those substrings) through registry.GetModelsForClient, resolveAudioUpstreamModelCreatedAt rejects it and /v1/audio/transcriptions returns “no transcription-capable model is available” even though a manual model=whisper-1 request can still be proxied. That makes auto-selection fail on common Whisper-style deployments.
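A minimal version of that matcher plus the suggested broadening (the added whisper clause is one possible fix, an assumption rather than the repo's actual change):

```go
package main

import (
	"fmt"
	"strings"
)

// looksTranscription reproduces the substring check quoted above; the final
// clause is the broadened variant that also accepts classic Whisper-style
// model names such as whisper-1.
func looksTranscription(model string) bool {
	m := strings.ToLower(model)
	return strings.Contains(m, "transcribe") ||
		strings.Contains(m, "transcription") ||
		strings.Contains(m, "speech-to-text") ||
		strings.Contains(m, "whisper") // added: covers whisper-1 and friends
}

func main() {
	for _, m := range []string{"gpt-4o-mini-transcribe", "whisper-1", "gpt-4o"} {
		fmt.Println(m, looksTranscription(m))
	}
}
```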


Comment on lines +439 to +440
if field.Name == audioTranscriptionModelFieldName {
fieldValue = modelValue


P2 Badge Strip thinking suffixes from transcription model names

This handler writes upstreamModel into the multipart model field verbatim. Unlike the normal chat executors, which strip internal thinking suffixes with thinking.ParseSuffix(...).ModelName before calling upstream APIs (for example in internal/runtime/executor/codex_executor.go and openai_compat_executor.go), a request such as model=gpt-4o-mini-transcribe(high) or model=auto(high) will now be forwarded with the suffix intact. Speech-to-text backends do not understand those suffixes, so otherwise valid transcription requests will fail with upstream model not found errors.
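A simplified stand-in for the suffix stripping the reviewer refers to (the repo's real helper is thinking.ParseSuffix; this sketch only shows the shape of the concern):

```go
package main

import (
	"fmt"
	"strings"
)

// stripThinkingSuffix removes a trailing "(...)" qualifier, so
// "gpt-4o-mini-transcribe(high)" becomes "gpt-4o-mini-transcribe".
// It deliberately leaves names without a trailing parenthesis untouched.
func stripThinkingSuffix(model string) string {
	if i := strings.LastIndex(model, "("); i > 0 && strings.HasSuffix(model, ")") {
		return model[:i]
	}
	return model
}

func main() {
	fmt.Println(stripThinkingSuffix("gpt-4o-mini-transcribe(high)")) // gpt-4o-mini-transcribe
	fmt.Println(stripThinkingSuffix("whisper-1"))                    // whisper-1
}
```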



@xkonjin xkonjin left a comment


test


@xkonjin xkonjin left a comment


Code Review — Audio Transcription Endpoint

Solid, thorough implementation. Well-structured with proper resource cleanup, file validation, and size limits. A few observations:

Strengths:

  • Temp file staging with proper cleanup (defer + Cleanup method) prevents memory pressure from large uploads
  • Content sniffing + extension validation is a good defense-in-depth approach for file type checking
  • The countingWriter approach for pre-calculating multipart Content-Length is clever and avoids buffering the entire body
  • Error responses properly use the OpenAI error format (invalid_request_error type)
  • filteredErrorHeaders extraction DRYs up four near-identical blocks — good refactor
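The countingWriter trick mentioned above can be shown in isolation: write the multipart envelope once to a byte counter with an empty file part, then add the known file size. This is a sketch of the technique with illustrative names, not the PR's code:

```go
package main

import (
	"fmt"
	"mime/multipart"
)

// countingWriter discards bytes but records how many were written.
type countingWriter struct{ n int64 }

func (w *countingWriter) Write(p []byte) (int, error) {
	w.n += int64(len(p))
	return len(p), nil
}

// multipartLength computes the total body length for a form with one text
// field and one file part of fileSize bytes, without buffering the file:
// the dry run emits only the envelope (boundaries and part headers), and the
// real file bytes are accounted for by adding fileSize.
func multipartLength(boundary, model, filename string, fileSize int64) int64 {
	cw := &countingWriter{}
	mw := multipart.NewWriter(cw)
	mw.SetBoundary(boundary)
	mw.WriteField("model", model)
	mw.CreateFormFile("file", filename) // writes part headers only
	mw.Close()                          // writes the closing boundary
	return cw.n + fileSize
}

func main() {
	fmt.Println(multipartLength("BOUND", "gpt-4o-mini-transcribe", "a.wav", 1024))
}
```

Because the envelope bytes are identical on the dry run and the real send (same boundary, same field order), the sum is an exact Content-Length, which lets the proxy avoid chunked encoding without buffering the body.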

Potential issues:

  1. Temp file race window: Between os.CreateTemp and os.Remove in error paths, a crash could leave orphaned temp files in the OS temp dir matching cliproxy-audio-transcription-*. Consider a startup sweep or periodic cleanup for the pattern.

  2. Model auto-resolution iterates all auths: resolveAutoAudioModelBase calls manager.List() and iterates every auth to find transcription-capable models. If the auth list is large, this could be slow on every model=auto request. Worth caching or pre-computing if this endpoint sees frequent use.

  3. resolveCodexAudioTranscriptionURL assumption: The function strips /codex suffix and appends /transcribe. If the base URL has a different path structure (e.g., a versioned API path), this could construct the wrong URL. A comment documenting the expected Codex URL patterns would help future maintainers.

  4. Missing whisper-1 from model registry: The PR only adds gpt-4o-mini-transcribe to the models JSON. If users pass whisper-1 (the classic OpenAI transcription model), it would fail model lookup. Intentional?

  5. The model entry is duplicated four times (once per tier in models.json). If the model definition ever needs updating, that is four places to edit. Consider a shared reference or template approach if the registry supports it.

Tests look good — the TestCodexPrepareRequestPreservesMultipartContentType test specifically validates that the Content-Type is not overwritten, which is the key integration concern. The filteredErrorHeaders test verifies blocked headers are stripped.


@xkonjin xkonjin left a comment


Code Review

Overall: This is a substantial, well-structured PR adding OpenAI-compatible audio transcription to CLIProxyAPI. The code is defensive, well-tested (874 lines of tests), and handles edge cases thoroughly. A few items worth attention:

Security & Safety

  1. Temp file cleanup is solid. The defer Cleanup() pattern and the keepTempFile guard ensure files are removed on error paths. Good.

  2. Upload limit (32MB) is enforced correctly via LimitedReader with N+1 pattern to detect overflow. The sniff-then-copy approach avoids buffering the whole file in memory.

  3. Content type validation is thorough. Both extension-based and MIME-type-based checks, with sniffing via http.DetectContentType as fallback. However, http.DetectContentType will return application/octet-stream for most audio formats since Go's detector mainly knows video/webm and audio/wave — so the file extension check is doing the real work here. This is fine, just worth knowing.

Bugs / Potential Issues

  1. resolveCodexAudioTranscriptionURL path construction: If base_url is https://chatgpt.com/backend-api/codex, this strips /codex and appends /transcribe, yielding https://chatgpt.com/backend-api/transcribe. But if someone has a base URL like https://proxy.example.com/codex-api/codex, it would also strip the trailing /codex. This seems intentional for the Codex backend but could surprise users with custom proxies. Consider documenting the URL resolution behavior.

  2. normalizeAudioTranscriptionResponse handles non-JSON well by wrapping plain text in {"text": "..."}. The double-unmarshal for JSON-encoded strings is a nice touch for upstream responses that wrap text in quotes.

  3. Missing whisper-1 and gpt-4o-audio-preview in models.json — only gpt-4o-mini-transcribe is added. If users have OpenAI keys, they may expect whisper-1 to work. Is this intentional to only support the newer transcription model?
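The normalization described in point 2 can be approximated as follows; normalizeTranscription is a simplified stand-in for the PR's normalizeAudioTranscriptionResponse, not its actual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// normalizeTranscription guarantees a JSON object with at least a "text"
// field: JSON objects pass through, JSON-encoded strings are unwrapped
// (the "double-unmarshal" case), and bare plain text is wrapped.
func normalizeTranscription(body []byte) []byte {
	trimmed := strings.TrimSpace(string(body))
	var obj map[string]any
	if json.Unmarshal([]byte(trimmed), &obj) == nil {
		return []byte(trimmed) // already a JSON object
	}
	var s string
	if json.Unmarshal([]byte(trimmed), &s) == nil {
		trimmed = s // upstream wrapped the text in quotes
	}
	out, _ := json.Marshal(map[string]string{"text": trimmed})
	return out
}

func main() {
	fmt.Println(string(normalizeTranscription([]byte(`"hello world"`))))
	// {"text":"hello world"}
}
```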

Code Quality

  1. DRY improvement in filteredErrorHeaders — extracting the repeated error-header extraction pattern into a helper that also filters through FilterUpstreamHeaders is a nice cleanup. This also fixes a subtle bug where error responses previously leaked Set-Cookie and Connection headers from upstream.

  2. The ExecuteHTTPRequestWithAuthManager addition to the base handler is cleanly separated from the JSON translation path. The PrepareRequest refactoring to avoid overriding Content-Type for multipart is correct.

  3. Test coverage is comprehensive — validation, format preservation, plain text wrapping, Codex URL resolution, file size limits, and multipart field ordering are all covered.

Suggestions

  • Add a brief comment in the route setup explaining that audio transcription is a passthrough (no SDK translation) to help future readers understand why it's different from chat/completions.
  • Consider adding a response_format validation step (reject unknown formats early) rather than silently passing them through.
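The early-rejection suggestion could be as simple as an allowlist check before the request is forwarded upstream; the format set and error text below mirror OpenAI's documented transcription formats and are an assumption about what the endpoint should accept:

```go
package main

import "fmt"

var supportedFormats = map[string]struct{}{
	"json": {}, "text": {}, "srt": {}, "verbose_json": {}, "vtt": {},
}

// validateResponseFormat rejects unknown response_format values early with
// an OpenAI-style invalid_request_error instead of passing them through.
func validateResponseFormat(format string) error {
	if format == "" {
		return nil // the field is optional; upstream defaults to json
	}
	if _, ok := supportedFormats[format]; !ok {
		return fmt.Errorf("invalid_request_error: unsupported response_format %q", format)
	}
	return nil
}

func main() {
	fmt.Println(validateResponseFormat("xml") != nil) // true
}
```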

Strong PR. No blocking issues.

@xkonjin xkonjin left a comment

Code Review

Summary: Large, well-structured PR adding an OpenAI-compatible /v1/audio/transcriptions endpoint. Handles multipart file upload, temp file staging, content-type preservation, auto-model resolution, and response normalization. Includes comprehensive tests.

What looks good

  • Solid input validation: file size limits, content-type sniffing, supported format allowlist
  • Temp file cleanup with defer and the cleanupOnError pattern is correct
  • filteredErrorHeaders refactoring removes duplicated header-extraction code across 3 call sites
  • Header preservation in applyCodexPreparedHeaders correctly avoids overwriting multipart content types
  • Test coverage is thorough (874 lines of tests)

Issues to consider

  1. Temp file race window on crash. If the process crashes between creating the temp file and Cleanup(), orphaned files with prefix cliproxy-audio-transcription-* accumulate in the OS temp dir. Consider using a dedicated subdirectory that can be swept on startup, or setting a TTL-based cleanup in a background goroutine.

  2. resolveAutoAudioModelBase iterates all auths and all models. For deployments with many auth entries and large model registries, this could be slow on every auto request. Consider caching the resolved model with a short TTL (e.g., 30s) since the model registry does not change frequently.

  3. audioTranscriptionUploadLimitBytes is 32MB but no Content-Length pre-check. The limit is enforced after reading, so a malicious client can still force the server to read up to 32MB before rejection. Adding a quick Content-Length header check before streaming would short-circuit obviously oversized uploads.

  4. Models registry duplication. The gpt-4o-mini-transcribe entry is copy-pasted identically across 4 tier blocks in models.json. If this model needs updating, all 4 must be touched. Consider a shared definition or post-processing step.

  5. Missing gpt-4o-transcribe model. Only gpt-4o-mini-transcribe is added. OpenAI also offers the full gpt-4o-transcribe model. Intentional omission, or should it be included?

Nitpicks

  • audioTranscriptionContentSniffBytes = 512 is fine but http.DetectContentType only reads the first 512 bytes by design, so this is a perfect match. Worth a comment noting this coupling.
  • The countingWriter is clean but could be a shared utility since similar patterns exist elsewhere in the codebase.

Solid work overall. The core flow is secure and well-tested. The main concern is temp file lifecycle management.

@madwiki
Author

madwiki commented Mar 19, 2026

Addressed the remaining audio-transcription follow-ups in:

  • c8a585b (Realign audio transcription with auth manager semantics)
  • fc3a342 (Document audio passthrough and validate response formats)

This update tightens the audio transcription path so it follows the existing auth-manager / selector semantics more closely and closes the remaining compatibility gaps from review.

Changes included:

  • added a read-only execution-model preview path so audio model=auto no longer mutates alias/model-pool rotation during inspection
  • pinned the auth selected by audio auto into the actual raw HTTP execution path, so preview and execution no longer diverge across OpenAI-compatible providers
  • made provider-specific modality metadata authoritative for transcription-capability detection, with name-based fallback only when metadata is absent
  • added explicit whisper-1 registry support so direct model=whisper-1 requests resolve through the normal provider/model path
  • changed successful JSON normalization to stage and stream from temp files instead of reading the full response back into memory
  • documented in route setup that audio transcription uses a direct multipart passthrough path rather than the JSON translator stack
  • added early response_format validation so unsupported formats now fail fast with an OpenAI-style invalid request error instead of being passed through silently

Validation:

  • go test ./sdk/api/handlers/openai ./sdk/cliproxy/auth
  • go test ./...

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc3a342da5


Comment on lines +74 to +79
var supportedAudioResponseFormats = map[string]struct{}{
"json": {},
"srt": {},
"text": {},
"verbose_json": {},
"vtt": {},
}


P2: Accept the documented diarized_json transcription format

OpenAI's current Audio Transcriptions docs list diarized_json as a valid response_format for diarization-capable models such as gpt-4o-transcribe-diarize (docs). Because this allowlist omits that value, any OpenAI-compatible backend that supports diarized transcription will now be rejected locally with a 400 before the request ever reaches upstream, which breaks the endpoint's stated wire-compatibility goal.


Comment on lines +425 to +426
func IsAuthSelectableForModel(auth *Auth, model string, now time.Time) bool {
blocked, _, _ := isAuthBlockedForModel(auth, model, now)


P2: Exclude globally cooling-down auths from audio auto selection

Fresh evidence that the earlier unavailable-auth issue still exists: the new PreviewSelectableRouteModels path relies on IsAuthSelectableForModel, but isAuthBlockedForModel ignores auth.Unavailable/NextRetryAfter whenever model != "" and there is no ModelStates[model] entry. In that state, model=auto can still pin a newer transcription model on a credential that is globally cooling down, and the subsequent pinned execution fails instead of falling back to another ready auth.


@xkonjin xkonjin left a comment

Code Review — Audio Transcription Endpoint

Nice work on this feature. The multipart passthrough approach is the right call for binary payloads, and the test coverage is solid. A few observations:

Potential Issues

  1. Model registry duplication: gpt-4o-mini-transcribe and whisper-1 entries are copy-pasted across 4 tier blocks (openai, codex-team, codex-plus, codex-pro). If a model gets added or updated, you need to touch 4 places. Consider extracting shared model definitions or at minimum leaving a comment pointing at the canonical block.

  2. Temp file cleanup on panic paths: stageFilePart and normalizeAudioTranscriptionResponseFromReader both use a keepTempFile bool pattern with deferred cleanup. If any goroutine panics between setting keepTempFile = true and the caller registering cleanup, the temp file leaks. The writeMultipartBody goroutine is fire-and-forget with no recovery, so a panic there could leave the pipe hanging. Consider adding a recover() in the goroutine.

  3. normalizeAudioTranscriptionResponse (in-memory version) is defined but only used in tests. The production path uses normalizeAudioTranscriptionResponseFromReader which stages to disk. If the in-memory version drifts out of sync with the file-based one, test coverage becomes misleading. Worth either removing the in-memory version or adding a property test asserting equivalence.

  4. audioTranscriptionUploadLimitBytes is 32MB — the newer gpt-4o-mini-transcribe model may support larger files. Worth making this configurable or documenting the assumption.

Minor

  • filteredErrorHeaders helper is a good DRY refactor across three call sites.
  • applyCodexPreparedHeaders nil-guard on r.Header is defensive in a good way.
  • Content-length test (lastLength > 0) catches the regression where multipart bodies were sent with unknown length.

Security

  • File type validation via both extension and MIME sniffing is solid.
  • audioTranscriptionNonFileFieldsLimitBytes cap at 1MB prevents abuse via giant prompt fields.
  • No path traversal risk since filenames are only used in multipart Content-Disposition headers upstream.

Main thing to address before merge: goroutine panic safety in writeMultipartBody.

@madwiki
Author

madwiki commented Mar 20, 2026

Thanks a lot for the careful review, and sorry for the churn on this PR. You were right to call out the remaining rough edges.

I addressed the follow-ups in fc3a342d and 07f44a8c:

  • documented the audio transcription route as a direct multipart passthrough path and tightened early response_format validation
  • unified audio response_format handling and added diarized_json, so validation, Accept selection, and response handling now stay consistent
  • fixed audio model=auto selection so globally cooling-down auths are no longer treated as selectable when model-specific state is missing
  • hardened the multipart writer goroutine by recovering panics and surfacing them as request errors
  • removed the duplicate test-only normalization path and switched those tests to exercise the real file-backed normalization flow
  • cleaned up the transcription-model duplication introduced in this PR by adding an optional shared Codex model layer while keeping compatibility with the existing duplicated remote catalog shape

I also added regression coverage around these paths and re-ran go test ./....

On the broader temp-file orphan point: I agree the whole-process crash window is real. I kept this pass focused on request-path cleanup and panic hardening, since fully solving crash-time orphan cleanup would need a separate process-lifecycle policy rather than more endpoint-local logic.

Thanks again for the thoughtful review.
