feat: add zero-shot voice cloning to /v1/audio/speech#676

Merged
jundot merged 4 commits into jundot:main from ethannortharc:feat/voice-clone-tts
Apr 9, 2026

Conversation

@ethannortharc
Contributor

@ethannortharc ethannortharc commented Apr 8, 2026

Summary

Add zero-shot voice cloning support to the /v1/audio/speech endpoint by forwarding base64-encoded reference audio through to mlx-audio's native ICL (In-Context Learning) generation path.

  • AudioSpeechRequest — two new optional fields: ref_audio (base64 string) and ref_text (transcript, required when ref_audio is set)
  • /v1/audio/speech route — validates that ref_text is present, decodes the base64, validates size (≤20 MB, roughly 60 s of audio), writes a temp file, passes it to the engine, and cleans up in a finally block
  • TTSEngine.synthesize() — new ref_audio/ref_text params forwarded to model.generate() when supported
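The route-level validation described above can be sketched as a small helper. This is an illustrative reconstruction, not the PR's actual code: the function name, error-message format, and the assumption that the 20 MB limit applies to the decoded bytes are mine.

```python
import base64
import binascii

MAX_REF_AUDIO_BYTES = 20 * 1024 * 1024  # ~60 s of audio, per the PR's ceiling

def validate_ref_audio(ref_audio_b64, ref_text):
    """Hypothetical validator mirroring the route's checks.

    Returns the decoded reference-audio bytes, or raises ValueError with an
    HTTP-style status code prefix in the message.
    """
    if ref_text is None:
        raise ValueError("400: ref_text is required when ref_audio is provided")
    try:
        # validate=True rejects non-alphabet characters instead of ignoring them
        audio = base64.b64decode(ref_audio_b64, validate=True)
    except (binascii.Error, ValueError):
        raise ValueError("400: ref_audio is not valid base64")
    if len(audio) > MAX_REF_AUDIO_BYTES:
        raise ValueError("413: ref_audio exceeds the 20 MB limit")
    return audio
```

The check order matters: the cheap ref_text presence check runs before decoding, so a malformed request fails fast without touching the payload.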

Design decisions

Follows mlx-audio's native pattern

mlx-audio's own server (mlx_audio/server.py) already supports ref_audio + ref_text on the same /v1/audio/speech endpoint. We follow the same contract — same field names, same endpoint, same generation path — but replace their server-local filesystem path with base64 transport so clients don't need filesystem access to the server.

Base64 only, no URL fetching

Eliminates SSRF risk entirely. The closed PR #492 accepted arbitrary URLs via urllib.request.urlretrieve, which could reach internal network addresses (http://169.254.169.254/, http://localhost:...). Base64-only means the server never makes outbound requests on behalf of the client.

ref_text is required when ref_audio is provided

During manual testing, we discovered that omitting or providing incorrect ref_text causes the ICL model to produce garbled, choppy audio. The text must match what is spoken in the reference audio for proper alignment. Rather than silently producing bad output, we validate upfront and return a clear 400 error.

No audio truncation

mlx-audio's _generate_icl() already caps effective_max_tokens based on target text length, preventing runaway generation from long reference audio. We enforce a ~60-second ceiling via the base64 size limit (20 MB) instead of truncating audio ourselves — this avoids the WAV-only truncation bug from PR #492 where non-WAV formats would bypass the 15s limit entirely.
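A back-of-envelope check that the 20 MB limit comfortably covers ~60 s of audio (assuming 48 kHz, 16-bit stereo WAV and that the limit applies to the base64 payload; both are my assumptions, not statements from the PR):

```python
# 60 s of 48 kHz / 16-bit / stereo PCM, before and after base64 inflation
sample_rate, bytes_per_sample, channels, seconds = 48_000, 2, 2, 60
wav_bytes = sample_rate * bytes_per_sample * channels * seconds   # 11,520,000
b64_bytes = (wav_bytes + 2) // 3 * 4  # base64 encodes 3 bytes as 4 characters
assert b64_bytes < 20 * 1024 * 1024  # ~15.4 MB, under the 20 MB ceiling
```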

No voice registry / caching

For a local inference server, the overhead of sending ~1 MB base64 per request over localhost is negligible. A two-step voice registration pattern (like ElevenLabs) adds storage management complexity that isn't justified yet. Can be added later if there's real demand.

Temp file with guaranteed cleanup

Decoded audio is written to a NamedTemporaryFile and deleted in a finally block — same pattern used by the existing STT transcription endpoint (_read_upload + os.unlink).
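The lifecycle described above can be sketched as follows. The wrapper function and the synthesize callable's signature are hypothetical; only the NamedTemporaryFile-plus-finally pattern comes from the PR:

```python
import os
import tempfile

def synthesize_with_ref(audio_bytes, ref_text, synthesize):
    """Write decoded reference audio to a temp file, synthesize, then clean up.

    `synthesize` stands in for the engine call; its keyword names here are
    illustrative.
    """
    tmp_path = None
    try:
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            tmp.write(audio_bytes)
            tmp_path = tmp.name
        return synthesize(ref_audio=tmp_path, ref_text=ref_text)
    finally:
        # Runs whether synthesis succeeds or raises, so the file never leaks
        if tmp_path is not None and os.path.exists(tmp_path):
            os.unlink(tmp_path)
```

delete=False is deliberate: the file must outlive the `with` block so the engine can open it by path; the finally block takes over deletion.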

Industry comparison

Service Voice clone approach Audio transport Registration required?
OpenAI Not publicly available N/A N/A
ElevenLabs Two-step: register voice → use voice_id Multipart upload Yes
Fish Audio Same endpoint, zero-shot MessagePack binary Optional
CosyVoice Two-step: register via URL → use voice_id Public URL Yes
mlx-audio server Same endpoint, zero-shot Server-local file path No
omlx (this PR) Same endpoint, zero-shot Base64 in JSON body No

Test coverage

Engine-level (TestTTSVoiceClonePassthrough):

  • ref_audio + ref_text forwarded to model.generate()
  • Neither passed when both are None
  • ref_audio without ref_text passes ref_text=None
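The passthrough behavior these tests pin down can be sketched like this (names are illustrative; the PR's commits say the real implementation checks the model's generate() signature via inspect.signature, consistent with the existing voice/instruct/speed pattern):

```python
import inspect

def build_generate_kwargs(generate_fn, ref_audio=None, ref_text=None):
    """Forward ref_audio/ref_text only when generate_fn accepts them.

    When ref_audio is set, ref_text is passed alongside even if None, matching
    the third test case above.
    """
    accepted = inspect.signature(generate_fn).parameters
    kwargs = {}
    if ref_audio is not None and "ref_audio" in accepted:
        kwargs["ref_audio"] = ref_audio
        if "ref_text" in accepted:
            kwargs["ref_text"] = ref_text
    return kwargs
```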

Route-level (TestTTSVoiceCloneEndpoint):

  • Valid base64 ref_audio with ref_text returns 200
  • Decoded audio forwarded as temp file path to engine.synthesize()
  • Invalid base64 → 400 with clear error message
  • Oversized payload → 413
  • ref_audio without ref_text → 400 with clear error message
  • Temp file cleanup verified (file deleted after response)
  • Normal TTS (no ref_audio) unchanged

All 30 non-integration tests pass. No regressions to existing tests.

Manual testing ✅

Tested end-to-end against a live omlx server with Qwen3-TTS-12Hz-1.7B-Base-bf16:

  1. STT transcription of reference audio via Qwen3-ASR-1.7B-bf16 to get accurate ref_text
  2. Voice clone request with correct ref_text — produced 18s of clean cloned speech (31s generation time)
  3. Incorrect ref_text — confirmed garbled output (motivating the ref_text requirement)
  4. Error cases — invalid base64 returns 400, missing ref_text returns 400

Usage example

# 1. Transcribe reference audio to get accurate ref_text
REF_TEXT=$(curl -s -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@reference_voice.wav" \
  -F "model=Qwen3-ASR-1.7B-bf16" | jq -r '.text')

# 2. Encode reference audio (GNU base64 wraps output at 76 columns by default,
#    which breaks JSON embedding; strip newlines to be portable)
REF_B64=$(base64 < reference_voice.wav | tr -d '\n')

# 3. Voice clone request
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"Qwen3-TTS-12Hz-1.7B-Base-bf16\",
    \"input\": \"Hello, this is a cloned voice speaking.\",
    \"ref_audio\": \"$REF_B64\",
    \"ref_text\": \"$REF_TEXT\"
  }" \
  --output cloned_speech.wav

Test plan

  • TestTTSVoiceClonePassthrough — engine passthrough (3 tests)
  • TestTTSVoiceCloneEndpoint — route handling (7 tests)
  • All existing TTS tests pass (20 tests, no regressions)
  • Manual test with real Qwen3-TTS model and reference audio

🤖 Generated with Claude Code

ethannortharc and others added 4 commits April 8, 2026 08:31
Forward ref_audio and ref_text to model.generate() when the model's
generate() signature accepts them (checked via inspect.signature,
consistent with existing voice/instruct/speed pattern).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Voice cloning produces garbled output when ref_text doesn't match the
reference audio. Make ref_text mandatory to prevent silent quality
failures — discovered during manual testing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jnchaba
Contributor

jnchaba commented Apr 8, 2026

Hi @ethannortharc, nice job. I was actually working on something similar, but you beat me to the punch. I have created PR #678, checked out from your fork's feat/voice-clone-tts branch, to add VoiceDesign support along with generation param support. Hope you don't mind.

@ethannortharc
Contributor Author

Thanks @jnchaba! Don't mind at all — glad the branch was useful. I've left a review on #678 with some feedback on the VoiceDesign routing.

@jundot
Owner

jundot commented Apr 9, 2026

Reviewed the full diff. Clean implementation that follows existing patterns well.

Good call going with base64 instead of URL fetching for ref_audio. Eliminates SSRF entirely and keeps the server from making outbound requests on behalf of clients. The tempfile lifecycle mirrors what STT already does, input validation covers the right cases (size limit, base64 validity, ref_text requirement), and the test coverage is solid.

Merging this.

@orghapygq1

@ethannortharc @jundot Is it possible to support a local file path for ref_audio? If SSRF is the concern, you could reject anything that is not a local file path. We need to access it via a simple API call, and base64 does not work well for this kind of usage. mlx-audio supports this kind of file path for ref_audio. Could you please reconsider? Thanks.
