Add OpenAI-compatible audio transcription endpoint#2224
madwiki wants to merge 9 commits into router-for-me:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the CLI Proxy API by integrating an OpenAI-compatible audio transcription endpoint. This new capability allows users to perform speech-to-text conversions, addressing a previously missing feature. The implementation leverages a provider-native multipart proxy approach, ensuring compatibility with existing authentication mechanisms and minimizing architectural impact by routing through a specialized raw HTTP execution path for non-JSON requests.
Code Review
This pull request successfully adds a new OpenAI-compatible audio transcription endpoint. The changes are well-structured, including a new raw HTTP execution path in the auth manager to handle multipart form data, which is a thoughtful architectural addition. The implementation also correctly normalizes responses to ensure a consistent API for clients. My review includes two main points: a high-severity suggestion to improve memory efficiency by streaming audio files instead of buffering them entirely in memory, and a medium-severity suggestion to simplify some redundant header-setting logic for better code clarity and maintainability.
```go
fileData, err := io.ReadAll(file)
if err != nil {
	return nil, &audioRequestError{status: http.StatusBadRequest, msg: fmt.Sprintf("failed to read uploaded file: %v", err)}
}
```
The current implementation reads the entire uploaded audio file into memory using io.ReadAll. This can lead to high memory consumption and potential out-of-memory errors for large files, creating a possible denial-of-service vulnerability. While the OpenAI API has a 25MB limit, this proxy might be used with backends that support larger files.
To improve efficiency and robustness, I recommend streaming the file content directly from the incoming request to the outgoing request to the provider, without buffering the entire file in memory. This could be achieved by modifying BuildHTTPRequest to use an io.Pipe to construct the multipart request body on-the-fly, which would avoid loading the whole file into r.FileData.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2a7d266b64
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
xkonjin left a comment
Code Review
Overall: Well-structured addition of an OpenAI-compatible audio transcription endpoint. Good test coverage. A few items worth attention:
Security and Input Validation

- Memory usage: `audioTranscriptionFormMemory` is 32MB, and `io.ReadAll(file)` after `ParseMultipartForm` means file data is buffered twice (~50MB heap per request for a 25MB file). Consider reading directly from the multipart part, or document the memory ceiling.
- File validation is extension + MIME only: `validateAudioFile` checks the file extension and Content-Type header but not magic bytes. Mitigated by upstream API rejection, but worth noting.
- `RemoveAll` in defer: Good. Multipart temp files get cleaned up on all paths.
Potential Issues

- Response normalization edge case: `normalizeAudioTranscriptionResponse` tries `json.Unmarshal` into a `string` as a fallback. If upstream returns a bare non-JSON string with a leading `"`, Unmarshal succeeds and strips the quotes. Probably fine in practice, but the behavior could surprise.
- Hardcoded Codex URL: The fallback `https://chatgpt.com/backend-api/transcribe` for codex auth is hardcoded. If this endpoint changes, it requires a code change. Consider making it configurable.
Test Quality

Tests are solid, covering the happy path, validation, endpoint registration, and header preservation. Missing tests for:

- Codex provider URL resolution path
- `normalizeAudioTranscriptionResponse` edge cases (empty body, existing `text` field, bare JSON string)
- Retry/failover behavior via `ExecuteHTTPRequest`
Architecture
The ExecuteHTTPRequestWithAuthManager + HTTPRequestBuilder approach to bypass the JSON translator stack for binary multipart uploads is the right call. Clean separation from chat/completion flow. The header refactoring in codex_executor.go (extracting applyCodexPreparedHeaders) is a clean prerequisite change.
Good work overall.
xkonjin left a comment
Code Review
Good work adding the audio transcription endpoint. The overall structure is solid — proper multipart parsing, file validation, auth routing, and response normalization. A few observations:
Potential Issues

- Unbounded file read into memory — `io.ReadAll(file)` in `parseAudioTranscriptionRequest` and `io.ReadAll(upstreamResp.Body)` in `AudioTranscriptions` have no size cap beyond the 32MB multipart form limit. If an upstream response is unexpectedly large (e.g. verbose JSON from a misconfigured provider), this could spike memory. Consider adding a `LimitedReader` on the upstream response body.
- `cliCancel` called with body bytes — In the success path, `cliCancel(normalizedBody)` passes the entire response body as the cancel reason. If this is for logging/tracing, fine, but make sure the downstream context handler can handle `[]byte` (not just `error` or `string`). If it expects a specific type, this could silently fail or panic depending on the assertion.
- Model registration duplication — `gpt-4o-mini-transcribe` is copy-pasted into 4 tier blocks in `models.json` (codex-team, codex-plus, codex-pro, and the unnamed block). If tiers diverge or a new transcription model is added, this is easy to get out of sync. Not blocking, but worth noting for maintainability.
Nits

- `applyCodexHeaders` now calls `applyCodexPreparedHeaders` and then unconditionally sets `Content-Type` to `application/json`. The `PrepareRequest` path intentionally skips setting Content-Type (so multipart is preserved). This split is correct but fragile — a comment on `applyCodexHeaders` noting it is only for JSON request paths would help future readers.
- `resolveAudioTranscriptionURL` falls back to `https://chatgpt.com/backend-api/transcribe` for codex auth without a `base_url`. If that endpoint ever changes or requires different auth headers, this hardcoded URL could silently break. Consider making it configurable or at least adding a constant with a comment.
Good

- File validation checks both extension and MIME type with a fallback to `http.DetectContentType` — solid defensive approach.
- Tests cover both the happy path (plain-text wrapping) and rejection of unsupported formats.
- The `HTTPResponseError` type with `StatusCode()` and `Headers()` is clean and will compose well with the existing retry machinery.
Overall: looks ready with the memory/size concern as the main thing to address before merge.
Thanks for the review. Addressed in 78b3a10. This follow-up keeps the external API unchanged and tightens the implementation around memory usage, bounds, and maintainability. Changes made:
On the `cliCancel` point, I did not change that path because the existing cancel helper already handles `[]byte` payloads explicitly, so this is not a new issue introduced by this endpoint.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 78b3a104a4
Addressed in ea79e77 in response to the Codex review feedback. This follow-up fixes the compatibility and scheduler issues called out there:
I intentionally kept this scoped to the issues that affected correctness, compatibility, or robustness.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ea79e77196
Addressed in 8cdeddc in response to the latest Codex review feedback. This follow-up fixes the two error-path issues called out there:
I did not add transcription streaming support in this pass, since that would be a broader compatibility expansion beyond the current non-streaming transcription flow.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8cdeddca17
Addressed in 35a54e4 in response to the latest Codex review feedback. This follow-up fixes the remaining issue called out there. That means:
I also added regression coverage for both cases: avoiding a newer non-audio chat model and resolving through an OpenAI-compatible alias setup.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 35a54e4ad4
```go
for _, auth := range manager.List() {
	if auth == nil || strings.TrimSpace(auth.ID) == "" {
		continue
	}
	if _, err := resolveAudioTranscriptionURL(auth); err != nil {
		continue
```
Exclude unavailable auths from audio auto resolution
This model=auto picker walks every auth returned by manager.List() and only filters on resolveAudioTranscriptionURL, so disabled or cooling-down credentials still contribute candidate models. The scheduler later rejects those auths in sdk/cliproxy/auth/selector.go, which means a newer transcription model on an unavailable credential can win here and then fail with auth_not_found even though an older active transcription model is available. Please restrict auto candidates to auths that are actually selectable.
```go
cliCtx, cliCancel := h.GetContextWithCancel(h, c, context.Background())
upstreamResp, _, errMsg := h.ExecuteHTTPRequestWithAuthManager(cliCtx, audioReq.Model, func(ctx context.Context, auth *coreauth.Auth, upstreamModel string) (*http.Request, error) {
	return audioReq.BuildHTTPRequest(ctx, auth, upstreamModel)
})
```
Emit non-stream keepalives while transcriptions run
This new non-streaming handler goes straight from GetContextWithCancel to ExecuteHTTPRequestWithAuthManager without calling StartNonStreamingKeepAlive, unlike the existing non-streaming OpenAI/Responses handlers. In deployments that set non_stream_keep_alive_interval to keep reverse proxies alive, long transcription jobs will now sit silent until completion and can be timed out by nginx/Cloudflare even though the other long-running endpoints keep the connection open.
```go
limitedReader := &io.LimitedReader{R: body, N: audioTranscriptionUpstreamResponseLimitBytes + 1}
payload, err := io.ReadAll(limitedReader)
if err != nil {
	return nil, fmt.Errorf("failed to read upstream transcription response: %w", err)
}
if int64(len(payload)) > audioTranscriptionUpstreamResponseLimitBytes {
	return nil, &audioRequestError{
```
Remove the fixed 8 MiB cap on successful transcription bodies
Successful transcription responses are always buffered and rejected once they exceed 8 MiB. That is much smaller than the 32 MiB upload limit this handler accepts, and it breaks legitimate large outputs such as long text/subtitle responses or verbose_json with word-level timestamps, which can easily cross 8 MiB for larger audio files. In those cases the upstream request succeeds but the proxy turns it into a 502.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2a40ff0c2d
```go
upstreamModels := manager.ExecutionModelCandidates(auth, routeModel)
if len(upstreamModels) == 0 {
```
Avoid advancing alias-pool offsets while resolving `model=auto`
When an OpenAI-compatible alias maps to multiple upstream transcription models, resolveAudioRouteModelCreatedAt now calls manager.ExecutionModelCandidates(...) only to inspect candidates. That method delegates to prepareExecutionModels in sdk/cliproxy/auth/conductor.go, which increments modelPoolOffsets via nextModelPoolOffset. The same request then enters ExecuteHTTPRequest and rotates the pool a second time, so model=auto requests skip the model/fallback order that should have been attempted first and skew round-robin balancing for every alias-backed transcription pool.
```go
if value == "" {
	return false
}
return strings.Contains(value, "transcribe") || strings.Contains(value, "transcription") || strings.Contains(value, "speech-to-text")
```
Treat `whisper-1` as transcription-capable during auto resolution
The new model=auto detection only recognizes model names containing transcribe, transcription, or speech-to-text. If an OpenAI-compatible backend advertises whisper-1 (or another speech model without those substrings) through registry.GetModelsForClient, resolveAudioUpstreamModelCreatedAt rejects it and /v1/audio/transcriptions returns “no transcription-capable model is available” even though a manual model=whisper-1 request can still be proxied. That makes auto-selection fail on common Whisper-style deployments.
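One possible fix is simply to add `whisper` to the marker list; `looksTranscriptionCapable` below is an illustrative stand-in for the PR's detection helper, not its actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// looksTranscriptionCapable sketches the auto-resolution check with
// "whisper" added to the marker list, per the suggestion above.
func looksTranscriptionCapable(value string) bool {
	if value == "" {
		return false
	}
	for _, marker := range []string{"transcribe", "transcription", "speech-to-text", "whisper"} {
		if strings.Contains(value, marker) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(looksTranscriptionCapable("whisper-1"))              // true
	fmt.Println(looksTranscriptionCapable("gpt-4o-mini-transcribe")) // true
	fmt.Println(looksTranscriptionCapable("gpt-4o-mini"))            // false
}
```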
```go
if field.Name == audioTranscriptionModelFieldName {
	fieldValue = modelValue
```
Strip thinking suffixes from transcription model names
This handler writes upstreamModel into the multipart model field verbatim. Unlike the normal chat executors, which strip internal thinking suffixes with thinking.ParseSuffix(...).ModelName before calling upstream APIs (for example in internal/runtime/executor/codex_executor.go and openai_compat_executor.go), a request such as model=gpt-4o-mini-transcribe(high) or model=auto(high) will now be forwarded as ...(...). Speech-to-text backends do not understand those suffixes, so otherwise valid transcription requests will fail with upstream model not found errors.
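The suffix handling the chat executors perform can be approximated like this (illustrative only; the real code uses `thinking.ParseSuffix(...).ModelName`):

```go
package main

import (
	"fmt"
	"strings"
)

// stripThinkingSuffix drops a trailing "(...)" reasoning-effort marker from
// a model name, mimicking what the chat executors do before calling
// upstream APIs. Illustrative stand-in for the project's thinking package.
func stripThinkingSuffix(model string) string {
	if i := strings.LastIndex(model, "("); i > 0 && strings.HasSuffix(model, ")") {
		return model[:i]
	}
	return model
}

func main() {
	fmt.Println(stripThinkingSuffix("gpt-4o-mini-transcribe(high)")) // gpt-4o-mini-transcribe
	fmt.Println(stripThinkingSuffix("whisper-1"))                    // whisper-1
}
```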
xkonjin left a comment
Code Review — Audio Transcription Endpoint
Solid, thorough implementation. Well-structured with proper resource cleanup, file validation, and size limits. A few observations:
Strengths:

- Temp file staging with proper cleanup (defer + `Cleanup` method) prevents memory pressure from large uploads
- Content sniffing + extension validation is a good defense-in-depth approach for file type checking
- The `countingWriter` approach for pre-calculating multipart Content-Length is clever and avoids buffering the entire body
- Error responses properly use the OpenAI error format (`invalid_request_error` type)
- `filteredErrorHeaders` extraction DRYs up four near-identical blocks — good refactor
Potential issues:

- Temp file race window: Between `os.CreateTemp` and `os.Remove` in error paths, a crash could leave orphaned temp files in the OS temp dir matching `cliproxy-audio-transcription-*`. Consider a startup sweep or periodic cleanup for the pattern.
- Model auto-resolution iterates all auths: `resolveAutoAudioModelBase` calls `manager.List()` and iterates every auth to find transcription-capable models. If the auth list is large, this could be slow on every `model=auto` request. Worth caching or pre-computing if this endpoint sees frequent use.
- `resolveCodexAudioTranscriptionURL` assumption: The function strips the `/codex` suffix and appends `/transcribe`. If the base URL has a different path structure (e.g., a versioned API path), this could construct the wrong URL. A comment documenting the expected Codex URL patterns would help future maintainers.
- Missing `whisper-1` from model registry: The PR only adds `gpt-4o-mini-transcribe` to the models JSON. If users pass `whisper-1` (the classic OpenAI transcription model), it would fail model lookup. Intentional?
- The model entry is duplicated four times (once per tier in models.json). If the model definition ever needs updating, that is four places to edit. Consider a shared reference or template approach if the registry supports it.
Tests look good — the TestCodexPrepareRequestPreservesMultipartContentType test specifically validates that the Content-Type is not overwritten, which is the key integration concern. The filteredErrorHeaders test verifies blocked headers are stripped.
xkonjin left a comment
Code Review
Overall: This is a substantial, well-structured PR adding OpenAI-compatible audio transcription to CLIProxyAPI. The code is defensive, well-tested (874 lines of tests), and handles edge cases thoroughly. A few items worth attention:
Security & Safety

- Temp file cleanup is solid. The `defer Cleanup()` pattern and the `keepTempFile` guard ensure files are removed on error paths. Good.
- Upload limit (32MB) is enforced correctly via `LimitedReader` with the N+1 pattern to detect overflow. The sniff-then-copy approach avoids buffering the whole file in memory.
- Content type validation is thorough. Both extension-based and MIME-type-based checks, with sniffing via `http.DetectContentType` as fallback. However, `http.DetectContentType` will return `application/octet-stream` for most audio formats since Go's detector mainly knows video/webm and audio/wave — so the file extension check is doing the real work here. This is fine, just worth knowing.
Bugs / Potential Issues

- `resolveCodexAudioTranscriptionURL` path construction: If `base_url` is `https://chatgpt.com/backend-api/codex`, this strips `/codex` and appends `/transcribe` → `https://chatgpt.com/backend-api/transcribe`. But if someone has a base URL like `https://proxy.example.com/codex-api/codex`, it would also strip the trailing `/codex`. This seems intentional for the Codex backend but could surprise users with custom proxies. Consider documenting the URL resolution behavior.
- `normalizeAudioTranscriptionResponse` handles non-JSON well by wrapping plain text in `{"text": "..."}`. The double-unmarshal for JSON-encoded strings is a nice touch for upstream responses that wrap text in quotes.
- Missing `whisper-1` and `gpt-4o-audio-preview` in models.json — only `gpt-4o-mini-transcribe` is added. If users have OpenAI keys, they may expect `whisper-1` to work. Is this intentional to only support the newer transcription model?
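The described URL rewrite can be sketched as follows (`resolveTranscribeURL` mirrors the review's description and is illustrative, not the PR's actual function):

```go
package main

import (
	"fmt"
	"strings"
)

// resolveTranscribeURL drops a trailing "/codex" segment from the base URL
// and appends "/transcribe", per the behavior described above.
func resolveTranscribeURL(baseURL string) string {
	base := strings.TrimRight(baseURL, "/")
	base = strings.TrimSuffix(base, "/codex")
	return base + "/transcribe"
}

func main() {
	fmt.Println(resolveTranscribeURL("https://chatgpt.com/backend-api/codex"))
	// https://chatgpt.com/backend-api/transcribe
	fmt.Println(resolveTranscribeURL("https://proxy.example.com/codex-api/codex"))
	// https://proxy.example.com/codex-api/transcribe (the case that may surprise custom proxies)
}
```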
Code Quality
- DRY improvement in `filteredErrorHeaders` — extracting the repeated error-header extraction pattern into a helper that also filters through `FilterUpstreamHeaders` is a nice cleanup. This also fixes a subtle bug where error responses previously leaked `Set-Cookie` and `Connection` headers from upstream.
- The `ExecuteHTTPRequestWithAuthManager` addition to the base handler is cleanly separated from the JSON translation path. The `PrepareRequest` refactoring to avoid overriding `Content-Type` for multipart is correct.
- Test coverage is comprehensive — validation, format preservation, plain text wrapping, Codex URL resolution, file size limits, and multipart field ordering are all covered.
Suggestions
- Add a brief comment in the route setup explaining that audio transcription is a passthrough (no SDK translation) to help future readers understand why it's different from chat/completions.
- Consider adding a `response_format` validation step (reject unknown formats early) rather than silently passing them through.
Strong PR. No blocking issues.
xkonjin left a comment
Code Review
Summary: Large, well-structured PR adding an OpenAI-compatible /v1/audio/transcriptions endpoint. Handles multipart file upload, temp file staging, content-type preservation, auto-model resolution, and response normalization. Includes comprehensive tests.
What looks good

- Solid input validation: file size limits, content-type sniffing, supported format allowlist
- Temp file cleanup with `defer` and the `cleanupOnError` pattern is correct
- `filteredErrorHeaders` refactoring removes duplicated header-extraction code across 3 call sites
- Header preservation in `applyCodexPreparedHeaders` correctly avoids overwriting multipart content types
- Test coverage is thorough (874 lines of tests)
Issues to consider

- Temp file race window on crash. If the process crashes between creating the temp file and `Cleanup()`, orphaned files with prefix `cliproxy-audio-transcription-*` accumulate in the OS temp dir. Consider using a dedicated subdirectory that can be swept on startup, or setting a TTL-based cleanup in a background goroutine.
- `resolveAutoAudioModelBase` iterates all auths and all models. For deployments with many auth entries and large model registries, this could be slow on every `auto` request. Consider caching the resolved model with a short TTL (e.g., 30s) since the model registry does not change frequently.
- `audioTranscriptionUploadLimitBytes` is 32MB but there is no `Content-Length` pre-check. The limit is enforced after reading, so a malicious client can still force the server to read up to 32MB before rejection. Adding a quick `Content-Length` header check before streaming would short-circuit obviously oversized uploads.
- Models registry duplication. The `gpt-4o-mini-transcribe` entry is copy-pasted identically across 4 tier blocks in `models.json`. If this model needs updating, all 4 must be touched. Consider a shared definition or post-processing step.
- Missing `gpt-4o-transcribe` model. Only `gpt-4o-mini-transcribe` is added. OpenAI also offers the full `gpt-4o-transcribe` model. Intentional omission, or should it be included?
Nitpicks

- `audioTranscriptionContentSniffBytes = 512` is fine, and `http.DetectContentType` only considers the first 512 bytes by design, so this is a perfect match. Worth a comment noting this coupling.
- The `countingWriter` is clean but could be a shared utility since similar patterns exist elsewhere in the codebase.
Solid work overall. The core flow is secure and well-tested. The main concern is temp file lifecycle management.
Addressed the remaining audio-transcription follow-ups in:
This update tightens the audio transcription path so it follows the existing auth-manager / selector semantics more closely and closes the remaining compatibility gaps from review. Changes included:
Validation:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fc3a342da5
```go
var supportedAudioResponseFormats = map[string]struct{}{
	"json":         {},
	"srt":          {},
	"text":         {},
	"verbose_json": {},
	"vtt":          {},
```
Accept the documented `diarized_json` transcription format
OpenAI's current Audio Transcriptions docs list diarized_json as a valid response_format for diarization-capable models such as gpt-4o-transcribe-diarize (docs). Because this allowlist omits that value, any OpenAI-compatible backend that supports diarized transcription will now be rejected locally with a 400 before the request ever reaches upstream, which breaks the endpoint's stated wire-compatibility goal.
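If the maintainers decide to accept it locally rather than pass unknown formats through, the fix is a one-entry addition to the allowlist (sketch):

```go
package main

import "fmt"

// supportedAudioResponseFormats sketches the allowlist with the documented
// diarized_json value added; whether to validate locally at all is a
// separate design choice for the maintainers.
var supportedAudioResponseFormats = map[string]struct{}{
	"json":          {},
	"srt":           {},
	"text":          {},
	"verbose_json":  {},
	"vtt":           {},
	"diarized_json": {},
}

func main() {
	_, ok := supportedAudioResponseFormats["diarized_json"]
	fmt.Println(ok) // true
}
```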
```go
func IsAuthSelectableForModel(auth *Auth, model string, now time.Time) bool {
	blocked, _, _ := isAuthBlockedForModel(auth, model, now)
```
Exclude globally cooling-down auths from audio auto selection
Fresh evidence that the earlier unavailable-auth issue still exists: the new PreviewSelectableRouteModels path relies on IsAuthSelectableForModel, but isAuthBlockedForModel ignores auth.Unavailable/NextRetryAfter whenever model != "" and there is no ModelStates[model] entry. In that state, model=auto can still pin a newer transcription model on a credential that is globally cooling down, and the subsequent pinned execution fails instead of falling back to another ready auth.
xkonjin left a comment
Code Review — Audio Transcription Endpoint
Nice work on this feature. The multipart passthrough approach is the right call for binary payloads, and the test coverage is solid. A few observations:
Potential Issues

- Model registry duplication — `gpt-4o-mini-transcribe` and `whisper-1` entries are copy-pasted across 4 tier blocks (openai, codex-team, codex-plus, codex-pro). If a model gets added or updated, you need to touch 4 places. Consider extracting shared model definitions or at minimum leaving a comment pointing at the canonical block.
- Temp file cleanup on panic paths — `stageFilePart` and `normalizeAudioTranscriptionResponseFromReader` both use a `keepTempFile` bool pattern with deferred cleanup. If any goroutine panics between setting `keepTempFile = true` and the caller registering cleanup, the temp file leaks. The `writeMultipartBody` goroutine is fire-and-forget with no recovery, so a panic there could leave the pipe hanging. Consider adding a `recover()` in the goroutine.
- `normalizeAudioTranscriptionResponse` (in-memory version) is defined but only used in tests. The production path uses `normalizeAudioTranscriptionResponseFromReader`, which stages to disk. If the in-memory version drifts out of sync with the file-based one, test coverage becomes misleading. Worth either removing the in-memory version or adding a property test asserting equivalence.
- `audioTranscriptionUploadLimitBytes` is 32MB — the newer `gpt-4o-mini-transcribe` model may support larger files. Worth making this configurable or documenting the assumption.
Minor

- `filteredErrorHeaders` helper is a good DRY refactor across three call sites.
- `applyCodexPreparedHeaders` nil-guard on `r.Header` is defensive in a good way.
- Content-length test (`lastLength > 0`) catches the regression where multipart bodies were sent with unknown length.
Security

- File type validation via both extension and MIME sniffing is solid.
- `audioTranscriptionNonFileFieldsLimitBytes` cap at 1MB prevents abuse via giant prompt fields.
- No path traversal risk since filenames are only used in multipart Content-Disposition headers upstream.
Main thing to address before merge: goroutine panic safety in writeMultipartBody.
Thanks a lot for the careful review, and sorry for the churn on this PR. You were right to call out the remaining rough edges. I addressed the follow-ups in
I also added regression coverage around these paths and re-ran the full test suite.

On the broader temp-file orphan point: I agree the whole-process crash window is real. I kept this pass focused on request-path cleanup and panic hardening, since fully solving crash-time orphan cleanup would need a separate process-lifecycle policy rather than more endpoint-local logic.

Thanks again for the thoughtful review.
Summary
This PR adds support for OpenAI-compatible audio transcription requests via `POST /v1/audio/transcriptions`.

The new endpoint is implemented as a provider-native multipart proxy path, reusing the existing auth selection flow instead of introducing any client-specific behavior or additional credential requirements.
What Changed

- New endpoint: `POST /v1/audio/transcriptions`
- Multipart form fields: `model`, `file`, `prompt` (optional), `language` (optional)
- Responses are normalized to a JSON object with a `text` field
- `POST /v1/audio/transcriptions` added to the root endpoint list
- `gpt-4o-mini-transcribe` added to Codex model sets

Why
CLIProxyAPI already supports OpenAI-compatible chat and responses endpoints, but it did not expose the speech-to-text transcription endpoint.
Because audio transcription uses `multipart/form-data`, it does not fit cleanly into the existing JSON translator flow. This change adds the missing capability with minimal impact on the current architecture by routing it through a dedicated raw HTTP execution path.

Testing
`go test ./...`