
Feature/audio voicedesign params #678

Open
jnchaba wants to merge 5 commits into jundot:main from jnchaba:feature/audio-voicedesign-params

@jnchaba jnchaba commented Apr 8, 2026

Extends PR #676's voice cloning support with two additional Qwen3-TTS capabilities:

  • VoiceDesign: when a model exposes generate_voice_design() and instructions are provided, routes through the dedicated VD path instead of standard generate()
  • Generation params: temperature, top_k, top_p, repetition_penalty, and max_tokens are accepted in the API schema and forwarded to whichever generation path is used
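One way the optional generation params could be collected before forwarding is sketched below. `SpeechRequestSketch` and `generation_kwargs` are hypothetical names for illustration, not the PR's actual schema or helper; only the field names come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechRequestSketch:
    """Hypothetical stand-in for the API schema; only the field names are from the PR."""
    model: str
    input: str
    instructions: Optional[str] = None
    temperature: Optional[float] = None
    top_k: Optional[int] = None
    top_p: Optional[float] = None
    repetition_penalty: Optional[float] = None
    max_tokens: Optional[int] = None


GENERATION_FIELDS = ("temperature", "top_k", "top_p", "repetition_penalty", "max_tokens")


def generation_kwargs(request):
    """Return only the generation params the caller actually set."""
    collected = {}
    for name in GENERATION_FIELDS:
        value = getattr(request, name)
        if value is not None:
            collected[name] = value
    return collected


request = SpeechRequestSketch(model="demo", input="Hello", temperature=0.9)
forwarded = generation_kwargs(request)
```

Dropping unset values leaves the model's own defaults in place for anything the request omits.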

Also fixes test fixtures to use a plain FakeModel instead of MagicMock, preventing false hasattr() positives for VoiceDesign detection.
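The false positive is easy to reproduce in isolation: MagicMock creates attributes on first access, so hasattr() reports True for almost any name. The `FakeModel` below is a hypothetical minimal stand-in, not the fixture from this PR.

```python
from unittest.mock import MagicMock


class FakeModel:
    """Hypothetical plain test double exposing only what the test needs."""
    sample_rate = 24000

    def generate(self, text, **kwargs):
        return []


mock_model = MagicMock()
fake_model = FakeModel()

# MagicMock auto-creates the attribute, so detection always succeeds.
mock_has_vd = hasattr(mock_model, "generate_voice_design")

# A plain class only has the attributes it defines.
fake_has_vd = hasattr(fake_model, "generate_voice_design")
```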

You can test this with the following CLI commands:

# request VoiceDesign TTS
curl -X POST http://localhost:8000/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16",
      "input": "Hello, this is a test of voice design.",
      "instructions": "female, calm, slow, clear enunciation",
      "temperature": 0.9
    }' --output voice_design.wav

# play wav file
afplay voice_design.wav

Cloning also works with the designed voice, thanks to PR #676:

# request cloned voice TTS
curl -X POST http://localhost:8000/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen3-TTS-12Hz-1.7B-Base-bf16",
      "input": "Now I am saying something completely different in the same voice.",
      "voice": "en",
      "ref_audio": "'$(base64 -i ~/voice_design.wav)'",
      "ref_text": "Hello, this is a test of voice design.",
      "temperature": 0.9
    }' --output cloned.wav

# play wav file
afplay cloned.wav
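For scripted testing, the same cloning request body can be built in Python, which avoids the shell quoting around the base64 substitution. This is a sketch that mirrors the fields in the curl example above; `build_clone_payload` is a hypothetical helper, and it assumes the server accepts standard base64 as produced by the `base64` CLI.

```python
import base64
import json


def build_clone_payload(ref_audio_bytes, ref_text, text, temperature=0.9):
    """Build the JSON body for a cloned-voice request (fields mirror the curl example)."""
    return json.dumps({
        "model": "Qwen3-TTS-12Hz-1.7B-Base-bf16",
        "input": text,
        "voice": "en",
        # Reference audio is sent as a base64 string, as in the curl command.
        "ref_audio": base64.b64encode(ref_audio_bytes).decode("ascii"),
        "ref_text": ref_text,
        "temperature": temperature,
    })


# Placeholder bytes; in practice, read the designed-voice WAV file from disk.
body = build_clone_payload(b"RIFF", "Hello, this is a test of voice design.", "Something different.")
```

The resulting string can be sent as the POST body to /v1/audio/speech with any HTTP client.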

ethannortharc and others added 5 commits April 8, 2026 08:31
Forward ref_audio and ref_text to model.generate() when the model's
generate() signature accepts them (checked via inspect.signature,
consistent with existing voice/instruct/speed pattern).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
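The signature check described in this commit can be sketched roughly as follows. `filter_supported_kwargs` and the two model classes are hypothetical; the real tts.py may be structured differently.

```python
import inspect


def filter_supported_kwargs(generate_fn, candidates):
    """Keep only the kwargs that generate_fn names explicitly in its signature.

    Note: a generate() that only takes **kwargs would not list these names,
    so this sketch would forward nothing to it.
    """
    params = inspect.signature(generate_fn).parameters
    return {k: v for k, v in candidates.items() if v is not None and k in params}


class CloneCapableModel:
    def generate(self, text, voice=None, ref_audio=None, ref_text=None):
        return (text, voice, ref_audio, ref_text)


class BasicModel:
    def generate(self, text, voice=None):
        return (text, voice)


candidates = {"voice": "en", "ref_audio": "UklGRg==", "ref_text": "Hello"}
clone_kwargs = filter_supported_kwargs(CloneCapableModel().generate, candidates)
basic_kwargs = filter_supported_kwargs(BasicModel().generate, candidates)
```

With this pattern, a model that does not support cloning silently ignores ref_audio and ref_text rather than failing with a TypeError.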
Voice cloning produces garbled output when ref_text doesn't match the
reference audio. Make ref_text mandatory to prevent silent quality
failures — discovered during manual testing.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
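A minimal sketch of the kind of check this commit describes, assuming ref_text is required only when ref_audio is supplied. `validate_clone_inputs` and the error message are hypothetical, not the PR's actual validation code.

```python
def validate_clone_inputs(ref_audio, ref_text):
    """Reject cloning requests that supply reference audio without its transcript."""
    if ref_audio is not None and not (ref_text and ref_text.strip()):
        raise ValueError("ref_text is required when ref_audio is provided")


# A request without cloning inputs passes untouched.
validate_clone_inputs(None, None)

# A cloning request with a matching transcript passes too.
validate_clone_inputs("UklGRg==", "Hello, this is a test of voice design.")
```

Failing fast here turns a silent quality problem (garbled audio) into an explicit error for the caller.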
@ethannortharc
Contributor

Thanks for building on the voice clone work. The generation params, FakeModel test fix, and model.sample_rate usage are all solid improvements.

One concern with the VoiceDesign routing:

The hasattr(model, "generate_voice_design") check at the omlx engine level introduces a second routing layer that
duplicates what mlx-audio already does internally. When instruct is passed to model.generate(), the model's own
generate() method already routes to generate_voice_design() based on config.tts_model_type == "voice_design" — this is the authoritative check.

The problem with hasattr detection: CustomVoice models also have generate_voice_design() (inherited from the same base class), but they should route to generate_custom_voice() instead. If a user sends instructions to a CustomVoice model, the current PR would call generate_voice_design() directly, bypassing the model's own type-based routing. The model-level method does have a guard (if self.config.tts_model_type != "voice_design": raise ValueError), so it won't silently produce wrong output, but the user gets a confusing error instead of the correct CustomVoice behavior.

Suggestion: remove the hasattr-based VoiceDesign branch and keep passing instruct=instructions through
model.generate() — the model already knows how to route it correctly based on its own tts_model_type. The generation params, FakeModel fix, and sample_rate changes can stand on their own.
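The difference between the two routing approaches can be illustrated with a toy model. Everything here is a simplified sketch, not mlx-audio's actual code: `BaseTTSSketch` and the two route functions are hypothetical, and the "custom_voice" type string is an assumption. Only `tts_model_type`, "voice_design", `generate_voice_design()`, and `generate_custom_voice()` are taken from the discussion above.

```python
from types import SimpleNamespace


class BaseTTSSketch:
    """Toy base class: every model type inherits generate_voice_design()."""

    def __init__(self, tts_model_type):
        self.config = SimpleNamespace(tts_model_type=tts_model_type)

    def generate(self, text, instruct=None):
        # Type-based dispatch lives inside the model.
        if instruct and self.config.tts_model_type == "voice_design":
            return self.generate_voice_design(text, instruct)
        if instruct and self.config.tts_model_type == "custom_voice":  # assumed type string
            return self.generate_custom_voice(text, instruct)
        return f"standard:{text}"

    def generate_voice_design(self, text, instruct):
        if self.config.tts_model_type != "voice_design":
            raise ValueError("model is not a VoiceDesign model")
        return f"voice_design:{text}"

    def generate_custom_voice(self, text, instruct):
        return f"custom_voice:{text}"


def route_with_hasattr(model, text, instructions):
    """Engine-level routing as in this PR: hasattr is True for every subclass."""
    if instructions and hasattr(model, "generate_voice_design"):
        return model.generate_voice_design(text, instructions)
    return model.generate(text, instruct=instructions)


def route_through_generate(model, text, instructions):
    """Suggested routing: let the model dispatch on its own type."""
    return model.generate(text, instruct=instructions)


custom_model = BaseTTSSketch("custom_voice")
delegated = route_through_generate(custom_model, "hi", "calm")
```

Under these assumptions, `route_with_hasattr(custom_model, "hi", "calm")` raises ValueError, while delegating to generate() reaches the CustomVoice path.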

@jundot
Owner

jundot commented Apr 9, 2026

Thanks for the PR. The generation params forwarding, FakeModel test fixture, and model.sample_rate usage are all good improvements that I want to keep.

@ethannortharc's feedback on the VoiceDesign routing is correct. The existing tts.py already uses inspect.signature(model.generate).parameters to route kwargs through model.generate(), which lets mlx-audio handle the internal dispatch based on config.tts_model_type. Adding a hasattr(model, "generate_voice_design") branch at the omlx layer creates a second routing layer that conflicts with this.

The specific problem: CustomVoice models inherit generate_voice_design() from the same base class. If a user sends instructions to a CustomVoice model, this PR would call generate_voice_design() directly, bypassing the model's own type check. The model's guard would raise a ValueError instead of routing to the correct CustomVoice path.

Suggestion: drop the hasattr-based VoiceDesign branch and keep passing instruct=instructions through model.generate() as the existing code does. The generation params, FakeModel fixture, and sample_rate changes can stand on their own without it.

Also, now that #676 is merged, this branch will need a rebase since both PRs add ref_audio/ref_text to AudioSpeechRequest.
