Feature/audio voicedesign params by jnchaba · Pull Request #678 · jundot/omlx

jnchaba · 2026-04-08T17:39:01Z

Extends PR #676's voice cloning support with two additional Qwen3-TTS capabilities:

VoiceDesign: when a model exposes generate_voice_design() and instructions are provided, routes through the dedicated VD path instead of standard generate()
Generation params: temperature, top_k, top_p, repetition_penalty, and max_tokens are accepted in the API schema and forwarded to whichever generation path is used

Also fixes the test fixtures to use a plain FakeModel instead of MagicMock, preventing false hasattr() positives for VoiceDesign detection.
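To illustrate why MagicMock causes false positives here, a minimal sketch follows. The FakeModel body and its sample_rate value are assumptions for illustration, not the fixture from this PR.

```python
from unittest.mock import MagicMock


class FakeModel:
    """Plain stand-in that defines only what a standard TTS model would expose."""

    sample_rate = 24000  # assumed value, for illustration only

    def generate(self, text, **kwargs):
        return []


mock_model = MagicMock()
fake_model = FakeModel()

# MagicMock creates attributes on first access, so any hasattr() probe succeeds.
print(hasattr(mock_model, "generate_voice_design"))  # True

# A plain class only reports attributes it actually defines.
print(hasattr(fake_model, "generate_voice_design"))  # False
```

With a MagicMock fixture, every test would appear to exercise a VoiceDesign-capable model, whether or not that was intended.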

You can test this with the following CLI commands:

# request VoiceDesign TTS
curl -X POST http://localhost:8000/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16",
      "input": "Hello, this is a test of voice design.",
      "instructions": "female, calm, slow, clear enunciation",
      "temperature": 0.9
    }' --output voice_design.wav

# play wav file
afplay voice_design.wav

Cloning also works with the designed voice, thanks to PR #676:

# request cloned voice TTS (reads the voice_design.wav written by the previous command)
curl -X POST http://localhost:8000/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen3-TTS-12Hz-1.7B-Base-bf16",
      "input": "Now I am saying something completely different in the same voice.",
      "voice": "en",
      "ref_audio": "'$(base64 -i voice_design.wav)'",
      "ref_text": "Hello, this is a test of voice design.",
      "temperature": 0.9
    }' --output cloned.wav

# play wav file
afplay cloned.wav
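For callers who prefer Python over curl, the cloning request could be sketched roughly as below, using only the standard library. The helper names build_clone_payload and post_speech are hypothetical. The endpoint and field names are taken from the curl commands in this description, and the response is assumed to be raw WAV bytes, as the --output usage suggests.

```python
import base64
import json
import urllib.request


def build_clone_payload(ref_wav_bytes: bytes, ref_text: str, text: str) -> str:
    """Return the JSON body for a cloning request, with ref_audio base64-encoded."""
    payload = {
        "model": "Qwen3-TTS-12Hz-1.7B-Base-bf16",
        "input": text,
        "voice": "en",
        "ref_audio": base64.b64encode(ref_wav_bytes).decode("ascii"),
        "ref_text": ref_text,  # should be the transcript of the reference audio
        "temperature": 0.9,
    }
    return json.dumps(payload)


def post_speech(body: str, url: str = "http://localhost:8000/v1/audio/speech") -> bytes:
    """POST the JSON body and return the response bytes (assumed to be WAV audio)."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

A caller would read voice_design.wav in binary mode, pass the bytes to build_clone_payload, and write the result of post_speech to cloned.wav.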

Forward ref_audio and ref_text to model.generate() when the model's generate() signature accepts them (checked via inspect.signature, consistent with the existing voice/instruct/speed pattern).
Co-Authored-By: Claude Sonnet 4.6
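The signature check described in this commit message could look roughly like the sketch below. The helper supported_kwargs and the ToyModel class are hypothetical and are not the code in tts.py; only inspect.signature(...).parameters is taken from the discussion.

```python
import inspect


def supported_kwargs(fn, candidates):
    """Keep only the non-None candidates that fn's signature names explicitly."""
    params = inspect.signature(fn).parameters
    return {k: v for k, v in candidates.items() if v is not None and k in params}


class ToyModel:
    # Hypothetical model: accepts voice, ref_audio and ref_text, but not speed.
    def generate(self, text, voice=None, ref_audio=None, ref_text=None):
        return {"text": text, "voice": voice, "ref_audio": ref_audio, "ref_text": ref_text}


model = ToyModel()
kwargs = supported_kwargs(
    model.generate,
    {"voice": "en", "speed": 1.2, "ref_audio": b"wav-bytes", "ref_text": "hello"},
)
# speed is dropped because ToyModel.generate() does not declare it.
result = model.generate("hi", **kwargs)
```

Note that a generate() declared with a bare **kwargs would not list these names in its parameters, so this sketch would forward nothing to it; a real implementation would have to decide how to treat that case.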


Voice cloning produces garbled output when ref_text doesn't match the reference audio. Make ref_text mandatory to prevent silent quality failures (discovered during manual testing).
Co-Authored-By: Claude Opus 4.6 (1M context)
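One way to enforce this rule at the request level is sketched below with a standard-library dataclass. SpeechRequestSketch is a hypothetical stand-in, not the project's AudioSpeechRequest, and it assumes "mandatory" means "required whenever ref_audio is supplied".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechRequestSketch:
    """Hypothetical stand-in for the request schema; only the cloning fields are modeled."""

    input: str
    ref_audio: Optional[str] = None  # base64-encoded reference audio
    ref_text: Optional[str] = None   # transcript of the reference audio

    def __post_init__(self):
        # Reject cloning requests that omit the transcript of the reference audio.
        if self.ref_audio is not None and not (self.ref_text and self.ref_text.strip()):
            raise ValueError("ref_text is required when ref_audio is provided")
```

Failing fast at validation time turns a silent quality problem into an explicit client error.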

ethannortharc · 2026-04-09T06:09:09Z

Thanks for building on the voice clone work. The generation params, FakeModel test fix, and model.sample_rate usage are all solid improvements.

One concern with the VoiceDesign routing:

The hasattr(model, "generate_voice_design") check at the omlx engine level introduces a second routing layer that duplicates what mlx-audio already does internally. When instruct is passed to model.generate(), the model's own generate() method already routes to generate_voice_design() based on config.tts_model_type == "voice_design"; this is the authoritative check.

The problem with hasattr detection: CustomVoice models also have generate_voice_design() on them (inherited from the same base class), but they should route to generate_custom_voice() instead. If a user sends instructions to a CustomVoice model, the current PR would call generate_voice_design() directly, bypassing the model's own type-based routing. The model-level method does have a guard (if self.config.tts_model_type != "voice_design": raise ValueError), so it won't silently produce wrong output, but the user gets a confusing error instead of the correct CustomVoice behavior.

Suggestion: remove the hasattr-based VoiceDesign branch and keep passing instruct=instructions through model.generate(). The model already knows how to route it correctly based on its own tts_model_type. The generation params, FakeModel fix, and sample_rate changes can stand on their own.
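The failure mode can be reproduced with a toy model, sketched below. Every class and function here is hypothetical and only mirrors the names used in this comment; it is not mlx-audio's code, and the "custom_voice" type string is an assumption.

```python
class ToyTTSModel:
    """Hypothetical model: every instance has all methods, as with a shared base class."""

    def __init__(self, tts_model_type):
        self.tts_model_type = tts_model_type

    def generate_voice_design(self, text, instruct):
        # Guard similar to the one described above.
        if self.tts_model_type != "voice_design":
            raise ValueError("model is not a VoiceDesign model")
        return f"voice_design:{instruct}"

    def generate_custom_voice(self, text, instruct):
        return f"custom_voice:{instruct}"

    def generate(self, text, instruct=None):
        # Type-based dispatch lives inside the model.
        if instruct is not None and self.tts_model_type == "voice_design":
            return self.generate_voice_design(text, instruct)
        if instruct is not None and self.tts_model_type == "custom_voice":
            return self.generate_custom_voice(text, instruct)
        return "standard"


def route_with_hasattr(model, text, instructions):
    # Engine-level routing as criticized above: true for every model type.
    if instructions and hasattr(model, "generate_voice_design"):
        return model.generate_voice_design(text, instructions)
    return model.generate(text, instruct=instructions)


def route_through_generate(model, text, instructions):
    # Suggested approach: let the model decide.
    return model.generate(text, instruct=instructions)
```

With ToyTTSModel("custom_voice"), route_through_generate reaches the CustomVoice path, while route_with_hasattr raises ValueError from the guard.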

jundot · 2026-04-09T22:07:55Z

Thanks for the PR. The generation params forwarding, FakeModel test fixture, and model.sample_rate usage are all good improvements that I want to keep.

@ethannortharc's feedback on the VoiceDesign routing is correct. The existing tts.py already uses inspect.signature(model.generate).parameters to route kwargs through model.generate(), which lets mlx-audio handle the internal dispatch based on config.tts_model_type. Adding a hasattr(model, "generate_voice_design") branch at the omlx layer creates a second routing layer that conflicts with this.

The specific problem: CustomVoice models inherit generate_voice_design() from the same base class. If a user sends instructions to a CustomVoice model, this PR would call generate_voice_design() directly, bypassing the model's own type check. The model's guard would raise a ValueError instead of routing to the correct CustomVoice path.

Suggestion: drop the hasattr-based VoiceDesign branch and keep passing instruct=instructions through model.generate() as the existing code does. The generation params, FakeModel fixture, and sample_rate changes can stand on their own without it.

Also, now that #676 is merged, this branch will need a rebase since both PRs add ref_audio/ref_text to AudioSpeechRequest.

ethannortharc and others added 5 commits April 8, 2026 08:31

feat: add ref_audio and ref_text fields to AudioSpeechRequest

d199de2

feat: add voice clone support with base64 ref_audio to /v1/audio/speech

2d1028b

Co-Authored-By: Claude Sonnet 4.6

feat: add VoiceDesign path and generation params to TTS engine

31b903a

jnchaba mentioned this pull request Apr 8, 2026

feat: add zero-shot voice cloning to /v1/audio/speech #676

Merged

jnchaba wants to merge 5 commits into jundot:main from jnchaba:feature/audio-voicedesign-params
