
Add OpenAI-compatible audio transcription endpoint#2224

Open
madwiki wants to merge 9 commits into router-for-me:main from madwiki:main

Conversation


@madwiki madwiki commented Mar 19, 2026

Summary

This PR adds support for OpenAI-compatible audio transcription requests via POST /v1/audio/transcriptions.

The new endpoint is implemented as a provider-native multipart proxy path, reusing the existing auth selection flow instead of introducing any client-specific behavior or additional credential requirements.

What Changed

  • Added POST /v1/audio/transcriptions
  • Accepted multipart form fields:
    • model
    • file
    • prompt (optional)
    • language (optional)
  • Normalized responses to JSON with at least a text field
  • Added a raw HTTP execution path in the auth manager for provider-native endpoints that do not fit the existing JSON translation pipeline
  • Preserved existing Codex/OpenAI auth selection behavior for audio requests
  • Updated Codex HTTP request preparation so multipart requests keep their original content type while still applying provider-specific auth headers
  • Exposed POST /v1/audio/transcriptions in the root endpoint list
  • Registered gpt-4o-mini-transcribe in Codex model sets
  • Added tests covering route exposure, multipart validation, response normalization, and Codex request preparation

Why

CLIProxyAPI already supports OpenAI-compatible chat and responses endpoints, but it did not expose the speech-to-text transcription endpoint.

Because audio transcription uses multipart/form-data, it does not fit cleanly into the existing JSON translator flow. This change adds the missing capability with minimal impact on the current architecture by routing it through a dedicated raw HTTP execution path.

Testing

  • go test ./...

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the CLI Proxy API by integrating an OpenAI-compatible audio transcription endpoint. This new capability allows users to perform speech-to-text conversions, addressing a previously missing feature. The implementation leverages a provider-native multipart proxy approach, ensuring compatibility with existing authentication mechanisms and minimizing architectural impact by routing through a specialized raw HTTP execution path for non-JSON requests.

Highlights

  • New Audio Transcription Endpoint: Added a new OpenAI-compatible endpoint at POST /v1/audio/transcriptions to support speech-to-text functionality.
  • Multipart Request Handling: Implemented robust handling for multipart form data, including file uploads and optional fields like model, prompt, and language, while preserving original content types during proxying.
  • Raw HTTP Execution Path: Introduced a dedicated raw HTTP execution path within the authentication manager to accommodate provider-native endpoints that do not fit the standard JSON translation pipeline.
  • Response Normalization: Ensured that all transcription responses are normalized to JSON format, always including at least a text field for consistency.
  • New Model Registration: Registered the gpt-4o-mini-transcribe model in the Codex model sets to enable its use with the new transcription service.


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully adds a new OpenAI-compatible audio transcription endpoint. The changes are well-structured, including a new raw HTTP execution path in the auth manager to handle multipart form data, which is a thoughtful architectural addition. The implementation also correctly normalizes responses to ensure a consistent API for clients. My review includes two main points: a high-severity suggestion to improve memory efficiency by streaming audio files instead of buffering them entirely in memory, and a medium-severity suggestion to simplify some redundant header-setting logic for better code clarity and maintainability.

Comment on lines +169 to +172
fileData, err := io.ReadAll(file)
if err != nil {
return nil, &audioRequestError{status: http.StatusBadRequest, msg: fmt.Sprintf("failed to read uploaded file: %v", err)}
}
Contributor


high

The current implementation reads the entire uploaded audio file into memory using io.ReadAll. This can lead to high memory consumption and potential out-of-memory errors for large files, creating a possible denial-of-service vulnerability. While the OpenAI API has a 25MB limit, this proxy might be used with backends that support larger files.

To improve efficiency and robustness, I recommend streaming the file content directly from the incoming request to the outgoing request to the provider, without buffering the entire file in memory. This could be achieved by modifying BuildHTTPRequest to use an io.Pipe to construct the multipart request body on-the-fly, which would avoid loading the whole file into r.FileData.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a7d266b64

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


@xkonjin xkonjin left a comment


Code Review

Overall: Well-structured addition of an OpenAI-compatible audio transcription endpoint. Good test coverage. A few items worth attention:

Security and Input Validation

  1. Memory usage: audioTranscriptionFormMemory is 32MB, and io.ReadAll(file) after ParseMultipartForm means file data is buffered twice (~50MB heap per request for a 25MB file). Consider reading directly from the multipart part, or document the memory ceiling.

  2. File validation is extension + MIME only: validateAudioFile checks file extension and Content-Type header but not magic bytes. Mitigated by upstream API rejection, but worth noting.

  3. RemoveAll in defer: Good. Multipart temp files get cleaned up on all paths.

Potential Issues

  1. Response normalization edge case: normalizeAudioTranscriptionResponse tries json.Unmarshal into a string as a fallback. If upstream returns a bare non-JSON string with a leading ", Unmarshal succeeds and strips quotes. Probably fine in practice but the behavior could surprise.

  2. Hardcoded Codex URL: The fallback https://chatgpt.com/backend-api/transcribe for codex auth is hardcoded. If this endpoint changes it requires a code change. Consider making it configurable.
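The normalization fallback described in item 1 can be illustrated with a small stand-in (a hypothetical normalize function; the PR's actual helper is normalizeAudioTranscriptionResponse and may differ in detail):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// normalize sketches the fallback behavior described above: JSON objects with
// a "text" field pass through, JSON-encoded strings are unwrapped, and
// anything else is treated as plain text. Note the edge case: a bare upstream
// body that happens to be a valid JSON string, like `"hello"`, is unquoted
// rather than kept verbatim.
func normalize(body []byte) []byte {
	var obj map[string]any
	if err := json.Unmarshal(body, &obj); err == nil {
		if _, ok := obj["text"]; ok {
			return body // already a JSON object with a text field
		}
	}
	var s string
	if err := json.Unmarshal(body, &s); err == nil {
		out, _ := json.Marshal(map[string]string{"text": s}) // quotes stripped
		return out
	}
	out, _ := json.Marshal(map[string]string{"text": string(body)}) // plain text
	return out
}

func main() {
	fmt.Println(string(normalize([]byte(`"hello"`))))       // {"text":"hello"}
	fmt.Println(string(normalize([]byte("hello"))))         // {"text":"hello"}
	fmt.Println(string(normalize([]byte(`{"text":"hi"}`)))) // {"text":"hi"}
}
```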

Test Quality

Tests are solid, covering happy path, validation, endpoint registration, and header preservation. Missing tests for:

  • Codex provider URL resolution path
  • normalizeAudioTranscriptionResponse edge cases (empty body, existing text field, bare JSON string)
  • Retry/failover behavior via ExecuteHTTPRequest

Architecture

The ExecuteHTTPRequestWithAuthManager + HTTPRequestBuilder approach to bypass the JSON translator stack for binary multipart uploads is the right call. Clean separation from chat/completion flow. The header refactoring in codex_executor.go (extracting applyCodexPreparedHeaders) is a clean prerequisite change.

Good work overall.


@xkonjin xkonjin left a comment


Code Review

Good work adding the audio transcription endpoint. The overall structure is solid — proper multipart parsing, file validation, auth routing, and response normalization. A few observations:

Potential Issues

  1. Unbounded file read into memory: io.ReadAll(file) in parseAudioTranscriptionRequest and io.ReadAll(upstreamResp.Body) in AudioTranscriptions have no size cap beyond the 32MB multipart form limit. If an upstream response is unexpectedly large (e.g. verbose JSON from a misconfigured provider), this could spike memory. Consider adding a LimitedReader on the upstream response body.

  2. cliCancel called with body bytes — In the success path, cliCancel(normalizedBody) passes the entire response body as the cancel reason. If this is for logging/tracing, fine, but make sure the downstream context handler can handle []byte (not just error or string). If it expects a specific type, this could silently fail or panic depending on the assertion.

  3. Model registration duplication: gpt-4o-mini-transcribe is copy-pasted into 4 tier blocks in models.json (codex-team, codex-plus, codex-pro, and the unnamed block). If tiers diverge or a new transcription model is added, this is easy to get out of sync. Not blocking, but worth noting for maintainability.

Nits

  • applyCodexHeaders now calls applyCodexPreparedHeaders and then unconditionally sets Content-Type to application/json. The PrepareRequest path intentionally skips setting Content-Type (so multipart is preserved). This split is correct but fragile — a comment on applyCodexHeaders noting it is only for JSON request paths would help future readers.

  • resolveAudioTranscriptionURL falls back to https://chatgpt.com/backend-api/transcribe for codex auth without a base_url. If that endpoint ever changes or requires different auth headers, this hardcoded URL could silently break. Consider making it configurable or at least adding a constant with a comment.

Good

  • File validation checks both extension and MIME type with a fallback to http.DetectContentType — solid defensive approach.
  • Tests cover both the happy path (plain-text wrapping) and rejection of unsupported formats.
  • The HTTPResponseError type with StatusCode() and Headers() is clean and will compose well with the existing retry machinery.

Overall: looks ready with the memory/size concern as the main thing to address before merge.

Author

madwiki commented Mar 19, 2026


Thanks for the review. Addressed in 78b3a10.

This follow-up keeps the external API unchanged and tightens the implementation around memory usage, bounds, and maintainability.

Changes made:

  • Replaced the previous ParseMultipartForm + full in-memory buffering path with a streaming MultipartReader flow.
  • The uploaded audio is now staged into a temp file, and the outbound multipart request is rebuilt by reopening that staged file instead of buffering the whole file in memory.
  • Bounded upstream response reads and now return a clear 502-style error if the upstream transcription payload exceeds the configured limit.
  • Replaced the relevant numeric and URL literals with named constants/helpers, including the default Codex transcription endpoint.
  • Added focused tests for response normalization edge cases, default Codex URL resolution, oversized upstream responses, auth failover on retriable upstream HTTP failures, and staged temp-file cleanup.
  • Added a short comment clarifying that applyCodexHeaders is the JSON-request helper, while the prepared-request path intentionally preserves caller-provided content types such as multipart/form-data.

On the cliCancel point, I did not change that path because the existing cancel helper already handles []byte payloads explicitly, so this is not a new issue introduced by this endpoint.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 78b3a104a4


Author

madwiki commented Mar 19, 2026

Addressed in ea79e77 in response to the Codex review feedback.

This follow-up fixes the compatibility and scheduler issues called out there:

  • explicit non-JSON transcription formats are now preserved instead of always being wrapped as JSON
  • non-file multipart fields are now bounded
  • the raw HTTP path now applies resolved upstream models before request construction
  • auth-specific request-build failures now fall through to the next eligible model/auth when appropriate
  • Codex transcription URL derivation now follows the existing Codex base_url semantics

I intentionally kept this scoped to the issues that affected correctness, compatibility, or robustness.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ea79e77196


Author

madwiki commented Mar 19, 2026

Addressed in 8cdeddc in response to the latest Codex review feedback.

This follow-up fixes the two error-path issues called out there:

  • raw HTTP upstream error bodies are now bounded before constructing HTTPResponseError
  • upstream error headers are now filtered through the same header filter before being surfaced to clients when passthrough headers are enabled

I did not add transcription streaming support in this pass, since that would be a broader compatibility expansion beyond the current non-streaming transcription flow.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8cdeddca17


Author

madwiki commented Mar 19, 2026

Addressed in 35a54e4 in response to the latest Codex review feedback.

This follow-up fixes the remaining model=auto routing gap for audio transcription. Instead of resolving auto from a global transcription-looking model list, the audio path now resolves it against auths that can actually serve /v1/audio/transcriptions and reuses the auth-manager's existing execution-model candidate logic.

That means:

  • auto no longer picks models from unsupported providers just because they look transcription-capable in the registry
  • OpenAI-compatible alias / model-pool setups are now respected when choosing the audio route model
  • the selected route model still expands to the correct upstream transcription model during raw HTTP execution

I also added regression coverage for both cases: avoiding a newer non-audio chat model and resolving through an OpenAI-compatible alias setup.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 35a54e4ad4


Comment on lines +592 to +597
for _, auth := range manager.List() {
if auth == nil || strings.TrimSpace(auth.ID) == "" {
continue
}
if _, err := resolveAudioTranscriptionURL(auth); err != nil {
continue


P2 Badge Exclude unavailable auths from audio auto resolution

This model=auto picker walks every auth returned by manager.List() and only filters on resolveAudioTranscriptionURL, so disabled or cooling-down credentials still contribute candidate models. The scheduler later rejects those auths in sdk/cliproxy/auth/selector.go, which means a newer transcription model on an unavailable credential can win here and then fail with auth_not_found even though an older active transcription model is available. Please restrict auto candidates to auths that are actually selectable.


Comment on lines +127 to +130
cliCtx, cliCancel := h.GetContextWithCancel(h, c, context.Background())
upstreamResp, _, errMsg := h.ExecuteHTTPRequestWithAuthManager(cliCtx, audioReq.Model, func(ctx context.Context, auth *coreauth.Auth, upstreamModel string) (*http.Request, error) {
return audioReq.BuildHTTPRequest(ctx, auth, upstreamModel)
})


P2 Badge Emit non-stream keepalives while transcriptions run

This new non-streaming handler goes straight from GetContextWithCancel to ExecuteHTTPRequestWithAuthManager without calling StartNonStreamingKeepAlive, unlike the existing non-streaming OpenAI/Responses handlers. In deployments that set non_stream_keep_alive_interval to keep reverse proxies alive, long transcription jobs will now sit silent until completion and can be timed out by nginx/Cloudflare even though the other long-running endpoints keep the connection open.


Comment on lines +722 to +728
limitedReader := &io.LimitedReader{R: body, N: audioTranscriptionUpstreamResponseLimitBytes + 1}
payload, err := io.ReadAll(limitedReader)
if err != nil {
return nil, fmt.Errorf("failed to read upstream transcription response: %w", err)
}
if int64(len(payload)) > audioTranscriptionUpstreamResponseLimitBytes {
return nil, &audioRequestError{


P2 Badge Remove the fixed 8 MiB cap on successful transcription bodies

Successful transcription responses are always buffered and rejected once they exceed 8 MiB. That is much smaller than the 32 MiB upload limit this handler accepts, and it breaks legitimate large outputs such as long text/subtitle responses or verbose_json with word-level timestamps, which can easily cross 8 MiB for larger audio files. In those cases the upstream request succeeds but the proxy turns it into a 502.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a40ff0c2d


Comment on lines +647 to +648
upstreamModels := manager.ExecutionModelCandidates(auth, routeModel)
if len(upstreamModels) == 0 {


P2 Badge Avoid advancing alias-pool offsets while resolving model=auto

When an OpenAI-compatible alias maps to multiple upstream transcription models, resolveAudioRouteModelCreatedAt now calls manager.ExecutionModelCandidates(...) only to inspect candidates. That method delegates to prepareExecutionModels in sdk/cliproxy/auth/conductor.go, which increments modelPoolOffsets via nextModelPoolOffset. The same request then enters ExecuteHTTPRequest and rotates the pool a second time, so model=auto requests skip the model/fallback order that should have been attempted first and skew round-robin balancing for every alias-backed transcription pool.


if value == "" {
return false
}
return strings.Contains(value, "transcribe") || strings.Contains(value, "transcription") || strings.Contains(value, "speech-to-text")


P2 Badge Treat whisper-1 as transcription-capable during auto resolution

The new model=auto detection only recognizes model names containing transcribe, transcription, or speech-to-text. If an OpenAI-compatible backend advertises whisper-1 (or another speech model without those substrings) through registry.GetModelsForClient, resolveAudioUpstreamModelCreatedAt rejects it and /v1/audio/transcriptions returns “no transcription-capable model is available” even though a manual model=whisper-1 request can still be proxied. That makes auto-selection fail on common Whisper-style deployments.
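A minimal version of that matcher plus the suggested broadening (the added whisper clause is one possible fix, an assumption rather than the repo's actual change):

```go
package main

import (
	"fmt"
	"strings"
)

// looksTranscription reproduces the substring check quoted above; the final
// clause is the broadened variant that also accepts classic Whisper-style
// model names such as whisper-1.
func looksTranscription(model string) bool {
	m := strings.ToLower(model)
	return strings.Contains(m, "transcribe") ||
		strings.Contains(m, "transcription") ||
		strings.Contains(m, "speech-to-text") ||
		strings.Contains(m, "whisper") // added: covers whisper-1 and friends
}

func main() {
	for _, m := range []string{"gpt-4o-mini-transcribe", "whisper-1", "gpt-4o"} {
		fmt.Println(m, looksTranscription(m))
	}
}
```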


Comment on lines +439 to +440
if field.Name == audioTranscriptionModelFieldName {
fieldValue = modelValue


P2 Badge Strip thinking suffixes from transcription model names

This handler writes upstreamModel into the multipart model field verbatim. Unlike the normal chat executors, which strip internal thinking suffixes with thinking.ParseSuffix(...).ModelName before calling upstream APIs (for example in internal/runtime/executor/codex_executor.go and openai_compat_executor.go), a request such as model=gpt-4o-mini-transcribe(high) or model=auto(high) will now be forwarded with the suffix intact. Speech-to-text backends do not understand those suffixes, so otherwise valid transcription requests will fail with upstream model not found errors.
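A simplified stand-in for the suffix stripping the reviewer refers to (the repo's real helper is thinking.ParseSuffix; this sketch only shows the shape of the concern):

```go
package main

import (
	"fmt"
	"strings"
)

// stripThinkingSuffix removes a trailing "(...)" qualifier, so
// "gpt-4o-mini-transcribe(high)" becomes "gpt-4o-mini-transcribe".
// It deliberately leaves names without a trailing parenthesis untouched.
func stripThinkingSuffix(model string) string {
	if i := strings.LastIndex(model, "("); i > 0 && strings.HasSuffix(model, ")") {
		return model[:i]
	}
	return model
}

func main() {
	fmt.Println(stripThinkingSuffix("gpt-4o-mini-transcribe(high)")) // gpt-4o-mini-transcribe
	fmt.Println(stripThinkingSuffix("whisper-1"))                    // whisper-1
}
```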



@xkonjin xkonjin left a comment


test


@xkonjin xkonjin left a comment


Code Review — Audio Transcription Endpoint

Solid, thorough implementation. Well-structured with proper resource cleanup, file validation, and size limits. A few observations:

Strengths:

  • Temp file staging with proper cleanup (defer + Cleanup method) prevents memory pressure from large uploads
  • Content sniffing + extension validation is a good defense-in-depth approach for file type checking
  • The countingWriter approach for pre-calculating multipart Content-Length is clever and avoids buffering the entire body
  • Error responses properly use the OpenAI error format (invalid_request_error type)
  • filteredErrorHeaders extraction DRYs up four near-identical blocks — good refactor
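The countingWriter trick mentioned above can be shown in isolation: write the multipart envelope once to a byte counter with an empty file part, then add the known file size. This is a sketch of the technique with illustrative names, not the PR's code:

```go
package main

import (
	"fmt"
	"mime/multipart"
)

// countingWriter discards bytes but records how many were written.
type countingWriter struct{ n int64 }

func (w *countingWriter) Write(p []byte) (int, error) {
	w.n += int64(len(p))
	return len(p), nil
}

// multipartLength computes the total body length for a form with one text
// field and one file part of fileSize bytes, without buffering the file:
// the dry run emits only the envelope (boundaries and part headers), and the
// real file bytes are accounted for by adding fileSize.
func multipartLength(boundary, model, filename string, fileSize int64) int64 {
	cw := &countingWriter{}
	mw := multipart.NewWriter(cw)
	mw.SetBoundary(boundary)
	mw.WriteField("model", model)
	mw.CreateFormFile("file", filename) // writes part headers only
	mw.Close()                          // writes the closing boundary
	return cw.n + fileSize
}

func main() {
	fmt.Println(multipartLength("BOUND", "gpt-4o-mini-transcribe", "a.wav", 1024))
}
```

Because the envelope bytes are identical on the dry run and the real send (same boundary, same field order), the sum is an exact Content-Length, which lets the proxy avoid chunked encoding without buffering the body.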

Potential issues:

  1. Temp file race window: Between os.CreateTemp and os.Remove in error paths, a crash could leave orphaned temp files in the OS temp dir matching cliproxy-audio-transcription-*. Consider a startup sweep or periodic cleanup for the pattern.

  2. Model auto-resolution iterates all auths: resolveAutoAudioModelBase calls manager.List() and iterates every auth to find transcription-capable models. If the auth list is large, this could be slow on every model=auto request. Worth caching or pre-computing if this endpoint sees frequent use.

  3. resolveCodexAudioTranscriptionURL assumption: The function strips /codex suffix and appends /transcribe. If the base URL has a different path structure (e.g., a versioned API path), this could construct the wrong URL. A comment documenting the expected Codex URL patterns would help future maintainers.

  4. Missing whisper-1 from model registry: The PR only adds gpt-4o-mini-transcribe to the models JSON. If users pass whisper-1 (the classic OpenAI transcription model), it would fail model lookup. Intentional?

  5. The model entry is duplicated four times (once per tier in models.json). If the model definition ever needs updating, that is four places to edit. Consider a shared reference or template approach if the registry supports it.

Tests look good — the TestCodexPrepareRequestPreservesMultipartContentType test specifically validates that the Content-Type is not overwritten, which is the key integration concern. The filteredErrorHeaders test verifies blocked headers are stripped.


@xkonjin xkonjin left a comment


Code Review

Overall: This is a substantial, well-structured PR adding OpenAI-compatible audio transcription to CLIProxyAPI. The code is defensive, well-tested (874 lines of tests), and handles edge cases thoroughly. A few items worth attention:

Security & Safety

  1. Temp file cleanup is solid. The defer Cleanup() pattern and the keepTempFile guard ensure files are removed on error paths. Good.

  2. Upload limit (32MB) is enforced correctly via LimitedReader with N+1 pattern to detect overflow. The sniff-then-copy approach avoids buffering the whole file in memory.

  3. Content type validation is thorough. Both extension-based and MIME-type-based checks, with sniffing via http.DetectContentType as fallback. However, http.DetectContentType will return application/octet-stream for most audio formats since Go's detector mainly knows video/webm and audio/wave — so the file extension check is doing the real work here. This is fine, just worth knowing.

Bugs / Potential Issues

  1. resolveCodexAudioTranscriptionURL path construction: If base_url is https://chatgpt.com/backend-api/codex, this strips /codex and appends /transcribe, yielding https://chatgpt.com/backend-api/transcribe. But if someone has a base URL like https://proxy.example.com/codex-api/codex, it would also strip the trailing /codex. This seems intentional for the Codex backend but could surprise users with custom proxies. Consider documenting the URL resolution behavior.

  2. normalizeAudioTranscriptionResponse handles non-JSON well by wrapping plain text in {"text": "..."}. The double-unmarshal for JSON-encoded strings is a nice touch for upstream responses that wrap text in quotes.

  3. Missing whisper-1 and gpt-4o-audio-preview in models.json — only gpt-4o-mini-transcribe is added. If users have OpenAI keys, they may expect whisper-1 to work. Is this intentional to only support the newer transcription model?
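The normalization described in point 2 can be approximated as follows; normalizeTranscription is a simplified stand-in for the PR's normalizeAudioTranscriptionResponse, not its actual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// normalizeTranscription guarantees a JSON object with at least a "text"
// field: JSON objects pass through, JSON-encoded strings are unwrapped
// (the "double-unmarshal" case), and bare plain text is wrapped.
func normalizeTranscription(body []byte) []byte {
	trimmed := strings.TrimSpace(string(body))
	var obj map[string]any
	if json.Unmarshal([]byte(trimmed), &obj) == nil {
		return []byte(trimmed) // already a JSON object
	}
	var s string
	if json.Unmarshal([]byte(trimmed), &s) == nil {
		trimmed = s // upstream wrapped the text in quotes
	}
	out, _ := json.Marshal(map[string]string{"text": trimmed})
	return out
}

func main() {
	fmt.Println(string(normalizeTranscription([]byte(`"hello world"`))))
	// {"text":"hello world"}
}
```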

Code Quality

  1. DRY improvement in filteredErrorHeaders — extracting the repeated error-header extraction pattern into a helper that also filters through FilterUpstreamHeaders is a nice cleanup. This also fixes a subtle bug where error responses previously leaked Set-Cookie and Connection headers from upstream.

  2. The ExecuteHTTPRequestWithAuthManager addition to the base handler is cleanly separated from the JSON translation path. The PrepareRequest refactoring to avoid overriding Content-Type for multipart is correct.

  3. Test coverage is comprehensive — validation, format preservation, plain text wrapping, Codex URL resolution, file size limits, and multipart field ordering are all covered.

Suggestions

  • Add a brief comment in the route setup explaining that audio transcription is a passthrough (no SDK translation) to help future readers understand why it's different from chat/completions.
  • Consider adding a response_format validation step (reject unknown formats early) rather than silently passing them through.
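The early-rejection suggestion could be as simple as an allowlist check before the request is forwarded upstream; the format set and error text below mirror OpenAI's documented transcription formats and are an assumption about what the endpoint should accept:

```go
package main

import "fmt"

var supportedFormats = map[string]struct{}{
	"json": {}, "text": {}, "srt": {}, "verbose_json": {}, "vtt": {},
}

// validateResponseFormat rejects unknown response_format values early with
// an OpenAI-style invalid_request_error instead of passing them through.
func validateResponseFormat(format string) error {
	if format == "" {
		return nil // the field is optional; upstream defaults to json
	}
	if _, ok := supportedFormats[format]; !ok {
		return fmt.Errorf("invalid_request_error: unsupported response_format %q", format)
	}
	return nil
}

func main() {
	fmt.Println(validateResponseFormat("xml") != nil) // true
}
```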

Strong PR. No blocking issues.

@xkonjin xkonjin left a comment

Code Review

Summary: Large, well-structured PR adding an OpenAI-compatible /v1/audio/transcriptions endpoint. Handles multipart file upload, temp file staging, content-type preservation, auto-model resolution, and response normalization. Includes comprehensive tests.

What looks good

  • Solid input validation: file size limits, content-type sniffing, supported format allowlist
  • Temp file cleanup with defer and the cleanupOnError pattern is correct
  • filteredErrorHeaders refactoring removes duplicated header-extraction code across 3 call sites
  • Header preservation in applyCodexPreparedHeaders correctly avoids overwriting multipart content types
  • Test coverage is thorough (874 lines of tests)

Issues to consider

  1. Temp file race window on crash. If the process crashes between creating the temp file and Cleanup(), orphaned files with prefix cliproxy-audio-transcription-* accumulate in the OS temp dir. Consider using a dedicated subdirectory that can be swept on startup, or setting a TTL-based cleanup in a background goroutine.

  2. resolveAutoAudioModelBase iterates all auths and all models. For deployments with many auth entries and large model registries, this could be slow on every auto request. Consider caching the resolved model with a short TTL (e.g., 30s) since the model registry does not change frequently.

  3. audioTranscriptionUploadLimitBytes is 32MB but no Content-Length pre-check. The limit is enforced after reading, so a malicious client can still force the server to read up to 32MB before rejection. Adding a quick Content-Length header check before streaming would short-circuit obviously oversized uploads.

  4. Models registry duplication. The gpt-4o-mini-transcribe entry is copy-pasted identically across 4 tier blocks in models.json. If this model needs updating, all 4 must be touched. Consider a shared definition or post-processing step.

  5. Missing gpt-4o-transcribe model. Only gpt-4o-mini-transcribe is added. OpenAI also offers the full gpt-4o-transcribe model. Intentional omission, or should it be included?

Nitpicks

  • audioTranscriptionContentSniffBytes = 512 is fine but http.DetectContentType only reads the first 512 bytes by design, so this is a perfect match. Worth a comment noting this coupling.
  • The countingWriter is clean but could be a shared utility since similar patterns exist elsewhere in the codebase.

Solid work overall. The core flow is secure and well-tested. The main concern is temp file lifecycle management.

@madwiki
Author

madwiki commented Mar 19, 2026

Addressed the remaining audio-transcription follow-ups in:

  • c8a585b (Realign audio transcription with auth manager semantics)
  • fc3a342 (Document audio passthrough and validate response formats)

This update tightens the audio transcription path so it follows the existing auth-manager / selector semantics more closely and closes the remaining compatibility gaps from review.

Changes included:

  • added a read-only execution-model preview path so audio model=auto no longer mutates alias/model-pool rotation during inspection
  • pinned the auth selected by audio auto into the actual raw HTTP execution path, so preview and execution no longer diverge across OpenAI-compatible providers
  • made provider-specific modality metadata authoritative for transcription-capability detection, with name-based fallback only when metadata is absent
  • added explicit whisper-1 registry support so direct model=whisper-1 requests resolve through the normal provider/model path
  • changed successful JSON normalization to stage and stream from temp files instead of reading the full response back into memory
  • documented in route setup that audio transcription uses a direct multipart passthrough path rather than the JSON translator stack
  • added early response_format validation so unsupported formats now fail fast with an OpenAI-style invalid request error instead of being passed through silently

Validation:

  • go test ./sdk/api/handlers/openai ./sdk/cliproxy/auth
  • go test ./...

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc3a342da5


Comment on lines +74 to +79
var supportedAudioResponseFormats = map[string]struct{}{
"json": {},
"srt": {},
"text": {},
"verbose_json": {},
"vtt": {},
}


P2: Accept the documented diarized_json transcription format

OpenAI's current Audio Transcriptions docs list diarized_json as a valid response_format for diarization-capable models such as gpt-4o-transcribe-diarize (docs). Because this allowlist omits that value, any OpenAI-compatible backend that supports diarized transcription will now be rejected locally with a 400 before the request ever reaches upstream, which breaks the endpoint's stated wire-compatibility goal.


Comment on lines +425 to +426
func IsAuthSelectableForModel(auth *Auth, model string, now time.Time) bool {
blocked, _, _ := isAuthBlockedForModel(auth, model, now)


P2: Exclude globally cooling-down auths from audio auto selection

Fresh evidence that the earlier unavailable-auth issue still exists: the new PreviewSelectableRouteModels path relies on IsAuthSelectableForModel, but isAuthBlockedForModel ignores auth.Unavailable/NextRetryAfter whenever model != "" and there is no ModelStates[model] entry. In that state, model=auto can still pin a newer transcription model on a credential that is globally cooling down, and the subsequent pinned execution fails instead of falling back to another ready auth.


@xkonjin xkonjin left a comment

Code Review — Audio Transcription Endpoint

Nice work on this feature. The multipart passthrough approach is the right call for binary payloads, and the test coverage is solid. A few observations:

Potential Issues

  1. Model registry duplication: gpt-4o-mini-transcribe and whisper-1 entries are copy-pasted across 4 tier blocks (openai, codex-team, codex-plus, codex-pro). If a model gets added or updated, you need to touch 4 places. Consider extracting shared model definitions or at minimum leaving a comment pointing at the canonical block.

  2. Temp file cleanup on panic paths: stageFilePart and normalizeAudioTranscriptionResponseFromReader both use a keepTempFile bool pattern with deferred cleanup. If any goroutine panics between setting keepTempFile = true and the caller registering cleanup, the temp file leaks. The writeMultipartBody goroutine is fire-and-forget with no recovery, so a panic there could leave the pipe hanging. Consider adding a recover() in the goroutine.

  3. normalizeAudioTranscriptionResponse (in-memory version) is defined but only used in tests. The production path uses normalizeAudioTranscriptionResponseFromReader which stages to disk. If the in-memory version drifts out of sync with the file-based one, test coverage becomes misleading. Worth either removing the in-memory version or adding a property test asserting equivalence.

  4. audioTranscriptionUploadLimitBytes is 32MB — the newer gpt-4o-mini-transcribe model may support larger files. Worth making this configurable or documenting the assumption.

Minor

  • filteredErrorHeaders helper is a good DRY refactor across three call sites.
  • applyCodexPreparedHeaders nil-guard on r.Header is defensive in a good way.
  • Content-length test (lastLength > 0) catches the regression where multipart bodies were sent with unknown length.

Security

  • File type validation via both extension and MIME sniffing is solid.
  • audioTranscriptionNonFileFieldsLimitBytes cap at 1MB prevents abuse via giant prompt fields.
  • No path traversal risk since filenames are only used in multipart Content-Disposition headers upstream.

Main thing to address before merge: goroutine panic safety in writeMultipartBody.

@madwiki
Author

madwiki commented Mar 20, 2026

Thanks a lot for the careful review, and sorry for the churn on this PR. You were right to call out the remaining rough edges.

I addressed the follow-ups in fc3a342d and 07f44a8c:

  • documented the audio transcription route as a direct multipart passthrough path and tightened early response_format validation
  • unified audio response_format handling and added diarized_json, so validation, Accept selection, and response handling now stay consistent
  • fixed audio model=auto selection so globally cooling-down auths are no longer treated as selectable when model-specific state is missing
  • hardened the multipart writer goroutine by recovering panics and surfacing them as request errors
  • removed the duplicate test-only normalization path and switched those tests to exercise the real file-backed normalization flow
  • cleaned up the transcription-model duplication introduced in this PR by adding an optional shared Codex model layer while keeping compatibility with the existing duplicated remote catalog shape

I also added regression coverage around these paths and re-ran go test ./....

On the broader temp-file orphan point: I agree the whole-process crash window is real. I kept this pass focused on request-path cleanup and panic hardening, since fully solving crash-time orphan cleanup would need a separate process-lifecycle policy rather than more endpoint-local logic.

Thanks again for the thoughtful review.
