feat: add zero-shot voice cloning to /v1/audio/speech#676
jundot merged 4 commits into jundot:main
Conversation
Forward ref_audio and ref_text to model.generate() when the model's generate() signature accepts them (checked via inspect.signature, consistent with existing voice/instruct/speed pattern). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
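The forwarding pattern that commit describes can be sketched as follows. The helper name is hypothetical; only the `inspect.signature` check on `model.generate()` comes from the commit message:

```python
import inspect

def build_generate_kwargs(model, **candidates):
    """Keep only the optional params that model.generate() actually accepts.

    Hypothetical helper illustrating the inspect.signature gate used for
    ref_audio/ref_text (and the existing voice/instruct/speed fields).
    """
    params = inspect.signature(model.generate).parameters
    has_var_kw = any(p.kind is inspect.Parameter.VAR_KEYWORD
                     for p in params.values())
    return {name: value for name, value in candidates.items()
            if value is not None and (name in params or has_var_kw)}
```

Older models whose `generate()` lacks the parameters simply never see them, so no signature change is forced on existing model classes.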
Voice cloning produces garbled output when ref_text doesn't match the reference audio. Make ref_text mandatory to prevent silent quality failures — discovered during manual testing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hi @ethannortharc, nice job. I was actually working on something similar -- but you beat me to the punch. I have created PR #678, which is checked out from your fork.
Reviewed the full diff. Clean implementation that follows existing patterns well. Good call going with base64 instead of URL fetching for ref_audio. Eliminates SSRF entirely and keeps the server from making outbound requests on behalf of clients. The tempfile lifecycle mirrors what STT already does, input validation covers the right cases (size limit, base64 validity, ref_text requirement), and the test coverage is solid. Merging this.
@ethannortharc @jundot Is it possible to support a local file path for ref_audio? If SSRF is the concern, you could reject anything that is not a local file path. We need to access it via a simple API call, and base64 does not work well for this kind of usage. mlx-audio supports this kind of file path for ref_audio. Could you please reconsider? Thanks.
Summary
Add zero-shot voice cloning support to the /v1/audio/speech endpoint by forwarding base64-encoded reference audio through to mlx-audio's native ICL (In-Context Learning) generation path.

- `AudioSpeechRequest` — two new optional fields: `ref_audio` (base64 string) and `ref_text` (transcript, required when `ref_audio` is set)
- `/v1/audio/speech` route — validates `ref_text` is present, decodes base64, validates size (≤20 MB ≈ ~60 s of audio), writes a temp file, passes it to the engine, cleans up in `finally`
- `TTSEngine.synthesize()` — new `ref_audio`/`ref_text` params forwarded to `model.generate()` when supported

Design decisions
Follows mlx-audio's native pattern
mlx-audio's own server (`mlx_audio/server.py`) already supports `ref_audio` + `ref_text` on the same `/v1/audio/speech` endpoint. We follow the same contract — same field names, same endpoint, same generation path — but replace their server-local filesystem path with base64 transport so clients don't need filesystem access to the server.

Base64 only, no URL fetching
Eliminates SSRF risk entirely. The closed PR #492 accepted arbitrary URLs via `urllib.request.urlretrieve`, which could reach internal network addresses (`http://169.254.169.254/`, `http://localhost:...`). Base64-only means the server never makes outbound requests on behalf of the client.

ref_text is required when ref_audio is provided
During manual testing, we discovered that omitting or providing incorrect `ref_text` causes the ICL model to produce garbled, choppy audio. The text must match what is spoken in the reference audio for proper alignment. Rather than silently producing bad output, we validate upfront and return a clear 400 error.

No audio truncation
mlx-audio's `_generate_icl()` already caps `effective_max_tokens` based on target text length, preventing runaway generation from long reference audio. We enforce a ~60-second ceiling via the base64 size limit (20 MB) instead of truncating audio ourselves — this avoids the WAV-only truncation bug from PR #492, where non-WAV formats would bypass the 15 s limit entirely.

No voice registry / caching
For a local inference server, the overhead of sending ~1 MB base64 per request over localhost is negligible. A two-step voice registration pattern (like ElevenLabs) adds storage management complexity that isn't justified yet. Can be added later if there's real demand.
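Taken together, the upfront checks from the sections above (mandatory `ref_text`, the 20 MB base64 cap, decode validation) can be sketched as below. The function name and `ValueError` are assumptions for illustration; the actual route presumably maps these to HTTP 400 responses:

```python
import base64
import binascii

MAX_REF_AUDIO_B64 = 20 * 1024 * 1024  # ≈60 s of audio, per the size limit above

def validate_and_decode_ref(ref_audio, ref_text):
    """Validate the cloning fields before synthesis (hypothetical helper).

    Returns the decoded audio bytes, or None when no ref_audio is given.
    """
    if ref_audio is None:
        return None
    if not ref_text:
        raise ValueError("ref_text is required when ref_audio is provided")
    if len(ref_audio) > MAX_REF_AUDIO_B64:
        raise ValueError("ref_audio exceeds the 20 MB base64 size limit")
    try:
        # validate=True rejects non-alphabet characters instead of ignoring them
        return base64.b64decode(ref_audio, validate=True)
    except binascii.Error as exc:
        raise ValueError("ref_audio is not valid base64") from exc
```

Checking the base64 string's length (rather than the decoded bytes) keeps the rejection cheap: an oversized payload is refused before any decoding work.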
Temp file with guaranteed cleanup
Decoded audio is written to a `NamedTemporaryFile` and deleted in a `finally` block — the same pattern used by the existing STT transcription endpoint (`_read_upload` + `os.unlink`).

Industry comparison
Test coverage
Engine-level (`TestTTSVoiceClonePassthrough`):

- `ref_audio` + `ref_text` forwarded to `model.generate()`
- neither provided → defaults to `None`
- `ref_audio` without `ref_text` passes `ref_text=None`

Route-level (`TestTTSVoiceCloneEndpoint`):

- `ref_audio` with `ref_text` returns 200
- decoded audio reaches `engine.synthesize()`
- `ref_audio` without `ref_text` → 400 with a clear error message
- plain requests (no `ref_audio`) unchanged

All 30 non-integration tests pass. No regressions to existing tests.
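A minimal sketch of the engine-level passthrough check, using a stand-in model. The class and wrapper names are assumptions; the real tests live in `TestTTSVoiceClonePassthrough`:

```python
import inspect

class FakeClonableModel:
    """Stand-in model whose generate() accepts the cloning params."""
    def __init__(self):
        self.seen = None

    def generate(self, text, ref_audio=None, ref_text=None):
        self.seen = {"ref_audio": ref_audio, "ref_text": ref_text}
        return b"\x00"  # placeholder audio bytes

def synthesize(model, text, ref_audio=None, ref_text=None):
    # Forward the cloning params only when generate() supports them,
    # mirroring the inspect.signature gate described in the summary.
    supported = inspect.signature(model.generate).parameters
    kwargs = {k: v for k, v in (("ref_audio", ref_audio), ("ref_text", ref_text))
              if k in supported}
    return model.generate(text, **kwargs)

model = FakeClonableModel()
synthesize(model, "hello", ref_audio="/tmp/ref.wav", ref_text="hello there")
assert model.seen == {"ref_audio": "/tmp/ref.wav", "ref_text": "hello there"}

# ref_audio without ref_text passes ref_text=None, per the engine-level tests
synthesize(model, "hello", ref_audio="/tmp/ref.wav")
assert model.seen == {"ref_audio": "/tmp/ref.wav", "ref_text": None}
```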
Manual testing ✅
Tested end-to-end against a live omlx server with Qwen3-TTS-12Hz-1.7B-Base-bf16:

- transcribed the reference clip with `Qwen3-ASR-1.7B-bf16` to get an accurate `ref_text`
- with matching `ref_text` — produced 18 s of clean cloned speech (31 s generation time)

Usage example
Test plan
- `TestTTSVoiceClonePassthrough` — engine passthrough (3 tests)
- `TestTTSVoiceCloneEndpoint` — route handling (7 tests)

🤖 Generated with Claude Code