feat: add zero-shot voice cloning to /v1/audio/speech#676
jundot merged 4 commits into jundot:main
Conversation
Forward ref_audio and ref_text to model.generate() when the model's generate() signature accepts them (checked via inspect.signature, consistent with existing voice/instruct/speed pattern). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
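The forwarding pattern that commit describes can be sketched as follows. The helper name is hypothetical; only the `inspect.signature` check on `model.generate()` comes from the commit message:

```python
import inspect

def build_generate_kwargs(model, **candidates):
    """Keep only the optional params that model.generate() actually accepts.

    Hypothetical helper illustrating the inspect.signature gate used for
    ref_audio/ref_text (and the existing voice/instruct/speed fields).
    """
    params = inspect.signature(model.generate).parameters
    has_var_kw = any(p.kind is inspect.Parameter.VAR_KEYWORD
                     for p in params.values())
    return {name: value for name, value in candidates.items()
            if value is not None and (name in params or has_var_kw)}
```

Older models whose `generate()` lacks the parameters simply never see them, so no signature change is forced on existing model classes.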
Voice cloning produces garbled output when ref_text doesn't match the reference audio. Make ref_text mandatory to prevent silent quality failures — discovered during manual testing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hi @ethannortharc, nice job. I was actually working on something similar -- but you beat me to the punch. I have created PR #678, which is checked out from your fork.
Reviewed the full diff. Clean implementation that follows existing patterns well. Good call going with base64 instead of URL fetching for ref_audio. Eliminates SSRF entirely and keeps the server from making outbound requests on behalf of clients. The tempfile lifecycle mirrors what STT already does, input validation covers the right cases (size limit, base64 validity, ref_text requirement), and the test coverage is solid. Merging this.
@ethannortharc @jundot Is it possible to support a local file path for ref_audio? If SSRF is the concern, you could reject anything that is not a local file path. We need to access it via a simple API call, and base64 does not work well for this kind of usage. mlx-audio supports this kind of file path for ref_audio. Could you please reconsider? Thanks.
Summary
Add zero-shot voice cloning support to the /v1/audio/speech endpoint by forwarding base64-encoded reference audio through to mlx-audio's native ICL (In-Context Learning) generation path.

- `AudioSpeechRequest` — two new optional fields: `ref_audio` (base64 string) and `ref_text` (transcript, required when `ref_audio` is set)
- `/v1/audio/speech` route — validates `ref_text` is present, decodes base64, validates size (≤20 MB ≈ ~60 s of audio), writes a temp file, passes it to the engine, cleans up in `finally`
- `TTSEngine.synthesize()` — new `ref_audio`/`ref_text` params forwarded to `model.generate()` when supported

Design decisions
Follows mlx-audio's native pattern
mlx-audio's own server (`mlx_audio/server.py`) already supports `ref_audio` + `ref_text` on the same `/v1/audio/speech` endpoint. We follow the same contract — same field names, same endpoint, same generation path — but replace their server-local filesystem path with base64 transport so clients don't need filesystem access to the server.

Base64 only, no URL fetching
Eliminates SSRF risk entirely. The closed PR #492 accepted arbitrary URLs via `urllib.request.urlretrieve`, which could reach internal network addresses (`http://169.254.169.254/`, `http://localhost:...`). Base64-only means the server never makes outbound requests on behalf of the client.

ref_text is required when ref_audio is provided
During manual testing, we discovered that omitting or providing incorrect `ref_text` causes the ICL model to produce garbled, choppy audio. The text must match what is spoken in the reference audio for proper alignment. Rather than silently producing bad output, we validate upfront and return a clear 400 error.

No audio truncation
mlx-audio's `_generate_icl()` already caps `effective_max_tokens` based on target text length, preventing runaway generation from long reference audio. We enforce a ~60-second ceiling via the base64 size limit (20 MB) instead of truncating audio ourselves — this avoids the WAV-only truncation bug from PR #492, where non-WAV formats would bypass the 15 s limit entirely.

No voice registry / caching
For a local inference server, the overhead of sending ~1 MB base64 per request over localhost is negligible. A two-step voice registration pattern (like ElevenLabs) adds storage management complexity that isn't justified yet. Can be added later if there's real demand.
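Taken together, the upfront checks from the sections above (mandatory `ref_text`, the 20 MB base64 cap, decode validation) can be sketched as below. The function name and `ValueError` are assumptions for illustration; the actual route presumably maps these to HTTP 400 responses:

```python
import base64
import binascii

MAX_REF_AUDIO_B64 = 20 * 1024 * 1024  # ≈60 s of audio, per the size limit above

def validate_and_decode_ref(ref_audio, ref_text):
    """Validate the cloning fields before synthesis (hypothetical helper).

    Returns the decoded audio bytes, or None when no ref_audio is given.
    """
    if ref_audio is None:
        return None
    if not ref_text:
        raise ValueError("ref_text is required when ref_audio is provided")
    if len(ref_audio) > MAX_REF_AUDIO_B64:
        raise ValueError("ref_audio exceeds the 20 MB base64 size limit")
    try:
        # validate=True rejects non-alphabet characters instead of ignoring them
        return base64.b64decode(ref_audio, validate=True)
    except binascii.Error as exc:
        raise ValueError("ref_audio is not valid base64") from exc
```

Checking the base64 string's length (rather than the decoded bytes) keeps the rejection cheap: an oversized payload is refused before any decoding work.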
Temp file with guaranteed cleanup
Decoded audio is written to a `NamedTemporaryFile` and deleted in a `finally` block — the same pattern used by the existing STT transcription endpoint (`_read_upload` + `os.unlink`).

Industry comparison
Test coverage
Engine-level (`TestTTSVoiceClonePassthrough`):

- `ref_audio` + `ref_text` forwarded to `model.generate()`
- neither provided → defaults to `None`
- `ref_audio` without `ref_text` passes `ref_text=None`

Route-level (`TestTTSVoiceCloneEndpoint`):

- `ref_audio` with `ref_text` returns 200
- decoded audio reaches `engine.synthesize()`
- `ref_audio` without `ref_text` → 400 with a clear error message
- plain requests (no `ref_audio`) unchanged

All 30 non-integration tests pass. No regressions to existing tests.
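A minimal sketch of the engine-level passthrough check, using a stand-in model. The class and wrapper names are assumptions; the real tests live in `TestTTSVoiceClonePassthrough`:

```python
import inspect

class FakeClonableModel:
    """Stand-in model whose generate() accepts the cloning params."""
    def __init__(self):
        self.seen = None

    def generate(self, text, ref_audio=None, ref_text=None):
        self.seen = {"ref_audio": ref_audio, "ref_text": ref_text}
        return b"\x00"  # placeholder audio bytes

def synthesize(model, text, ref_audio=None, ref_text=None):
    # Forward the cloning params only when generate() supports them,
    # mirroring the inspect.signature gate described in the summary.
    supported = inspect.signature(model.generate).parameters
    kwargs = {k: v for k, v in (("ref_audio", ref_audio), ("ref_text", ref_text))
              if k in supported}
    return model.generate(text, **kwargs)

model = FakeClonableModel()
synthesize(model, "hello", ref_audio="/tmp/ref.wav", ref_text="hello there")
assert model.seen == {"ref_audio": "/tmp/ref.wav", "ref_text": "hello there"}

# ref_audio without ref_text passes ref_text=None, per the engine-level tests
synthesize(model, "hello", ref_audio="/tmp/ref.wav")
assert model.seen == {"ref_audio": "/tmp/ref.wav", "ref_text": None}
```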
Manual testing ✅
Tested end-to-end against a live omlx server with Qwen3-TTS-12Hz-1.7B-Base-bf16:

- transcribed the reference clip with `Qwen3-ASR-1.7B-bf16` to get an accurate `ref_text`
- with matching `ref_text` — produced 18 s of clean cloned speech (31 s generation time)

Usage example
Test plan
- `TestTTSVoiceClonePassthrough` — engine passthrough (3 tests)
- `TestTTSVoiceCloneEndpoint` — route handling (7 tests)

🤖 Generated with Claude Code