feat: chunked TTS generation with quality selector #99
glaucusj-sai wants to merge 2 commits into jamiepine:main from
Conversation
Long text that exceeds the model's max_new_tokens limit now gets automatically split at sentence boundaries, generated per-chunk, and concatenated with a short crossfade. A runtime-configurable quality setting lets users choose between standard (24 kHz native) and high (44.1 kHz via soxr VHQ resampling).

Changes:
- Add backend/utils/chunked_tts.py with text splitting, audio concatenation, and resampling utilities
- Integrate chunking directly into PyTorchTTSBackend.generate() so both the UI /generate endpoint and any API consumer benefit
- Add GET/POST /tts/settings endpoints for runtime quality control
- Bump GenerationRequest.text max_length from 5000 to 50000
- Add soxr to requirements.txt

Tested with 9K+ character input producing ~12 minutes of seamless audio on an NVIDIA DGX Spark (Qwen3-TTS 1.7B).
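The "short crossfade" concatenation can be sketched as follows. This is a minimal illustration, not the PR's actual `chunked_tts.py` code; the function name, the linear fade shape, and the 50 ms default are assumptions.

```python
import numpy as np

def crossfade_concat(chunks, sample_rate, fade_ms=50):
    """Concatenate audio chunks, blending each boundary with a linear crossfade."""
    fade_len = int(sample_rate * fade_ms / 1000)
    out = chunks[0].astype(np.float32)
    for nxt in chunks[1:]:
        nxt = nxt.astype(np.float32)
        n = min(fade_len, len(out), len(nxt))
        if n == 0:
            out = np.concatenate([out, nxt])
            continue
        fade = np.linspace(0.0, 1.0, n, dtype=np.float32)
        # Fade the tail of the running output down while fading the next chunk in.
        blended = out[-n:] * (1.0 - fade) + nxt[:n] * fade
        out = np.concatenate([out[:-n], blended, nxt[n:]])
    return out
```

The overlap hides the hard discontinuity that would otherwise click at each chunk boundary; each join shortens the total length by one fade window.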
AI-generated pull request. Please review the code to make sure it works.
Yes, it has been tested, and I used AI to post this since I didn't have much experience with forking. This was done for my project, which needed large podcast scripts where quality matters and long text must be converted into the voice. I have generated more than 10 hours of voice so far with this. Thanks for your comments.
Thank you for being honest. I approve of the commit.
- models.py: keep max_length=50000 from PR + language pattern with 'he' from main
- requirements.txt: keep soxr from PR + numba and httpx from main
📝 Walkthrough

The changes introduce a chunked text-to-speech generation system with configurable settings. Long text is split into sentence-boundary chunks, processed individually, concatenated with crossfading, and optionally upsampled. New API endpoints expose runtime configuration for quality and chunk size settings, along with supporting utility functions for audio processing.
Sequence Diagram(s)

sequenceDiagram
participant Client
participant API as API<br/>(main.py)
participant Backend as Backend<br/>(pytorch_backend.py)
participant Utils as Utils<br/>(chunked_tts.py)
participant Engine as TTS Engine
Client->>API: POST /generate (long text)
API->>Backend: generate(text)
Backend->>Utils: split_text_into_chunks(text)
Utils-->>Backend: [chunk1, chunk2, ...]
loop For each chunk
Backend->>Backend: _generate_single(chunk)
Backend->>Engine: inference(chunk)
Engine-->>Backend: audio_chunk
end
Backend->>Utils: concatenate_audio_chunks(chunks, sr)
Utils-->>Backend: concatenated_audio
Backend->>Utils: resample_audio(audio, src_rate, dst_rate)
Utils-->>Backend: upsampled_audio
Backend->>API: return (audio, sample_rate)
API-->>Client: 200 OK with audio
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/backends/pytorch_backend.py (1)
404-427: ⚠️ Potential issue | 🔴 Critical

Critical: `language` parameter missing from `_generate_single()`.

The `_generate_sync()` function at line 422 references `language`:

```python
language=LANGUAGE_CODE_TO_NAME.get(language, "auto"),
```

However, `language` is a parameter of `generate()`, not `_generate_single()`. Since `_generate_single()` is a separate method (not a nested function inside `generate()`), it doesn't have access to `language` through closure. This will cause a `NameError: name 'language' is not defined` at runtime.

🐛 Proposed fix: Add `language` parameter

```diff
 async def _generate_single(
     self,
     text: str,
     voice_prompt: dict,
+    language: str = "en",
     seed: Optional[int] = None,
     instruct: Optional[str] = None,
 ) -> Tuple[np.ndarray, int]:
     """Generate audio for a single text segment (no chunking)."""
```

And update all call sites in `generate()`:

```diff
 audio, sample_rate = await self._generate_single(
-    text, voice_prompt, seed, instruct,
+    text, voice_prompt, language, seed, instruct,
 )
```

```diff
 chunk_audio, chunk_sr = await self._generate_single(
-    chunk_text, voice_prompt, seed, instruct,
+    chunk_text, voice_prompt, language, seed, instruct,
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/backends/pytorch_backend.py` around lines 404 - 427, The helper _generate_single is referencing an undefined name language, causing a NameError; add language: str (or Optional[str]) as a parameter to _generate_single and propagate it into its inner _generate_sync call (use language when calling LANGUAGE_CODE_TO_NAME), then update all places that call _generate_single from generate() to pass the generate()'s language argument (and any other callers) so the method receives the correct language value; ensure function signature changes for _generate_single(...) and its callers match.
🧹 Nitpick comments (1)
backend/backends/pytorch_backend.py (1)
367-401: Thread safety: settings read without synchronization.
`_tts_settings` is read at the start of generation (lines 367, 393) but can be modified concurrently via `update_tts_settings()`. While not causing crashes, a concurrent update mid-generation could result in chunks being split with one `max_chunk_chars` value while resampling uses a different `quality` setting.

For a "Chill" review, this is acceptable for an MVP since the impact is limited to occasional inconsistent audio quality settings within a single long request.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/backends/pytorch_backend.py` around lines 367 - 401, The code reads _tts_settings multiple times during generation which can be concurrently modified by update_tts_settings(), causing inconsistent behavior; fix by taking a consistent snapshot of settings at the start of generate (e.g., acquire an existing self._tts_settings_lock or create one, then copy = dict(self._tts_settings) / copy.deepcopy under the lock) and use that local snapshot (e.g., local_settings) for split_text_into_chunks, quality/upsample_rate checks, and _generate_single calls so all uses (max_chunk_chars, quality, upsample_rate) are consistent for the whole request; ensure update_tts_settings() also uses the same lock when mutating _tts_settings.
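The snapshot-under-lock fix described in the prompt above can be sketched like this. The class and attribute names are illustrative, not the repository's actual code; the point is that `generate()` reads one consistent copy for the whole request.

```python
import threading

class SettingsStore:
    """Holds mutable TTS settings; readers take an atomic snapshot."""

    def __init__(self, **defaults):
        self._lock = threading.Lock()
        self._settings = dict(defaults)

    def snapshot(self) -> dict:
        # One consistent copy for the whole request.
        with self._lock:
            return dict(self._settings)

    def update(self, **changes) -> None:
        with self._lock:
            self._settings.update(changes)

store = SettingsStore(quality="standard", max_chunk_chars=800)
cfg = store.snapshot()        # generate() uses cfg for splitting AND resampling
store.update(quality="high")  # a concurrent update no longer affects cfg
```

Because `cfg` is a plain copy, a mid-request `POST /tts/settings` can no longer make the chunking and resampling stages see different values.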
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/main.py`:
- Around line 1469-1483: In update_tts_settings, the caught ValueError is
re-raised as an HTTPException without preserving the exception chain; modify the
except block in the update_tts_settings function so you raise the HTTPException
using "raise HTTPException(... ) from e" (include the same status_code and
detail=str(e)) to preserve the original exception context for debugging while
keeping the existing behavior of update_tts_settings.
In `@backend/models.py`:
- Line 61: Update the frontend validation and UI to match the backend's new
50,000 character limit: change the Zod schema in useGenerationForm.ts (the
z.string().min(1).max(5000) call) to max(50000), update the UI text in
GenerationForm.tsx that currently says "Max 5000 characters" to "Max 50000
characters", and regenerate or adjust the generated API schema
$GenerationRequest (the maxLength: 5000 entry) to maxLength: 50000 so
client-side validation and submitted payloads align with the backend.
In `@backend/utils/chunked_tts.py`:
- Around line 27-31: The _tts_settings dict initialization uses
int(os.getenv(...)) which will raise ValueError on malformed env values; update
the code to parse and validate TTS_MAX_CHUNK_CHARS and TTS_UPSAMPLE_RATE before
assigning to _tts_settings (e.g., a small helper like parse_int_env(var_name,
default) that attempts int(), catches ValueError/TypeError, logs/warns, and
returns the default), and then set "max_chunk_chars" and "upsample_rate" from
that validated result so import-time crashes are avoided.
---
Outside diff comments:
In `@backend/backends/pytorch_backend.py`:
- Around line 404-427: The helper _generate_single is referencing an undefined
name language, causing a NameError; add language: str (or Optional[str]) as a
parameter to _generate_single and propagate it into its inner _generate_sync
call (use language when calling LANGUAGE_CODE_TO_NAME), then update all places
that call _generate_single from generate() to pass the generate()'s language
argument (and any other callers) so the method receives the correct language
value; ensure function signature changes for _generate_single(...) and its
callers match.
---
Nitpick comments:
In `@backend/backends/pytorch_backend.py`:
- Around line 367-401: The code reads _tts_settings multiple times during
generation which can be concurrently modified by update_tts_settings(), causing
inconsistent behavior; fix by taking a consistent snapshot of settings at the
start of generate (e.g., acquire an existing self._tts_settings_lock or create
one, then copy = dict(self._tts_settings) / copy.deepcopy under the lock) and
use that local snapshot (e.g., local_settings) for split_text_into_chunks,
quality/upsample_rate checks, and _generate_single calls so all uses
(max_chunk_chars, quality, upsample_rate) are consistent for the whole request;
ensure update_tts_settings() also uses the same lock when mutating
_tts_settings.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0a98e6cb-ae35-4736-b2c4-d068db0dce6e
📒 Files selected for processing (5)
backend/backends/pytorch_backend.py
backend/main.py
backend/models.py
backend/requirements.txt
backend/utils/chunked_tts.py
```python
@app.get("/tts/settings")
async def get_tts_settings():
    """Get current TTS chunking and quality settings."""
    from .utils.chunked_tts import get_tts_settings as _get_settings
    return _get_settings()


@app.post("/tts/settings")
async def update_tts_settings(request: models.TTSSettingsUpdate):
    """Update TTS quality and chunking settings at runtime."""
    from .utils.chunked_tts import update_tts_settings as _update_settings
    try:
        return _update_settings(request.model_dump(exclude_none=True))
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
```
Preserve exception chain with raise ... from e.
The static analysis tool flagged line 1483: within an except clause, exceptions should be raised with raise ... from e to preserve the exception chain for debugging.
🔧 Proposed fix
```diff
 except ValueError as e:
-    raise HTTPException(status_code=400, detail=str(e))
+    raise HTTPException(status_code=400, detail=str(e)) from e
```
🧰 Tools
🪛 Ruff (0.15.5)
[warning] 1483-1483: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/main.py` around lines 1469 - 1483, In update_tts_settings, the caught
ValueError is re-raised as an HTTPException without preserving the exception
chain; modify the except block in the update_tts_settings function so you raise
the HTTPException using "raise HTTPException(... ) from e" (include the same
status_code and detail=str(e)) to preserve the original exception context for
debugging while keeping the existing behavior of update_tts_settings.
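The effect of `raise ... from e` can be shown with a minimal stand-in. `HTTPError` below is a hypothetical substitute for `fastapi.HTTPException` so the sketch has no dependencies; the chaining behavior is standard Python.

```python
class HTTPError(Exception):
    """Stand-in for fastapi.HTTPException in this sketch."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code

def update_settings(payload: dict) -> dict:
    try:
        if payload.get("quality") not in ("standard", "high"):
            raise ValueError(f"invalid quality: {payload.get('quality')!r}")
        return payload
    except ValueError as e:
        # "from e" records the ValueError as __cause__, so tracebacks show
        # both exceptions instead of an implicit "during handling of..." note.
        raise HTTPError(400, str(e)) from e
```

Without `from e`, the original `ValueError` is still attached as `__context__`, but chaining explicitly marks it as the direct cause, which is what Ruff's B904 asks for.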
```diff
     """Request model for voice generation."""
     profile_id: str
-    text: str = Field(..., min_length=1, max_length=5000)
+    text: str = Field(..., min_length=1, max_length=50000)
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the frontend validation limits
echo "=== Checking frontend text length validation ==="
rg -n 'max.*5000|max_length.*5000' --type=ts --type=tsx -g '!node_modules'
rg -n 'Max 5000' --type=ts --type=tsx -g '!node_modules'

Repository: jamiepine/voicebox
Length of output: 183
🏁 Script executed:
#!/bin/bash
# Find the frontend files mentioned in the review
echo "=== Locating frontend files ==="
find . -name "useGenerationForm.ts" 2>/dev/null
find . -name "GenerationForm.tsx" 2>/dev/null
# Search for 5000 character limits using a broader pattern
echo -e "\n=== Searching for 5000 character validation ==="
rg -n '5000' --max-count=20 2>/dev/null | grep -E '(max|Max|length)'

Repository: jamiepine/voicebox
Length of output: 1025
🏁 Script executed:
#!/bin/bash
echo "=== useGenerationForm.ts around line 14 ==="
sed -n '10,20p' ./app/src/lib/hooks/useGenerationForm.ts
echo -e "\n=== GenerationForm.tsx around lines 85-86 ==="
sed -n '80,90p' ./app/src/components/Generation/GenerationForm.tsx
echo -e "\n=== $GenerationRequest.ts around line 15 ==="
sed -n '10,20p' ./app/src/lib/api/schemas/\$GenerationRequest.ts
echo -e "\n=== backend/utils/validation.py around line 9 ==="
sed -n '1,15p' backend/utils/validation.py
echo -e "\n=== Check if validation.py's validate_text is used for the text field ==="
rg -n 'validate_text' --max-count=10

Repository: jamiepine/voicebox
Length of output: 1849
🏁 Script executed:
#!/bin/bash
echo "=== Check where validate_text is called ==="
rg -n 'validate_text' -A 2 -B 2
echo -e "\n=== Find the API endpoint handler for generation ==="
rg -n 'def.*generation|@.*post|@.*app' backend/ --max-count=20 | head -30
echo -e "\n=== Look for the GenerationRequest handler ==="
rg -n 'GenerationRequest' backend/ -A 3 -B 1 | head -40

Repository: jamiepine/voicebox
Length of output: 3978
Frontend validation mismatch will block users from using the expanded limit.
The backend now accepts up to 50,000 characters, but the frontend still enforces a 5,000 character limit in three places:
- app/src/lib/hooks/useGenerationForm.ts:14 — Zod schema: z.string().min(1).max(5000)
- app/src/components/Generation/GenerationForm.tsx:85-86 — UI shows "Max 5000 characters"
- app/src/lib/api/schemas/$GenerationRequest.ts:15 — Generated schema: maxLength: 5000
Users won't be able to submit texts longer than 5,000 characters from the UI despite the backend supporting it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/models.py` at line 61, Update the frontend validation and UI to match
the backend's new 50,000 character limit: change the Zod schema in
useGenerationForm.ts (the z.string().min(1).max(5000) call) to max(50000),
update the UI text in GenerationForm.tsx that currently says "Max 5000
characters" to "Max 50000 characters", and regenerate or adjust the generated
API schema $GenerationRequest (the maxLength: 5000 entry) to maxLength: 50000 so
client-side validation and submitted payloads align with the backend.
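On the backend side, the Pydantic `Field` constraint enforces the new limit at request validation time. A minimal stand-in (assuming Pydantic v2; the model here is a sketch, not the full `GenerationRequest`):

```python
from pydantic import BaseModel, Field, ValidationError

class GenerationRequest(BaseModel):
    """Minimal stand-in for backend/models.py after the limit bump."""
    text: str = Field(..., min_length=1, max_length=50000)

GenerationRequest(text="a" * 50000)      # exactly at the limit: accepted
try:
    GenerationRequest(text="a" * 50001)  # one character over: rejected
    over_limit_accepted = True
except ValidationError:
    over_limit_accepted = False
```

Until the frontend's Zod schema is raised to the same bound, the server-side limit is unreachable from the UI, which is exactly the mismatch flagged above.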
```python
_tts_settings = {
    "quality": os.getenv("TTS_QUALITY", "standard"),
    "max_chunk_chars": int(os.getenv("TTS_MAX_CHUNK_CHARS", "800")),
    "upsample_rate": int(os.getenv("TTS_UPSAMPLE_RATE", "44100")),
}
```
Invalid environment variable will crash at import time.
int(os.getenv(...)) will raise ValueError if the environment variable contains a non-integer value (e.g., TTS_MAX_CHUNK_CHARS=abc). This will crash the application at startup with an unhelpful error message.
🛡️ Proposed fix: Add validation with defaults
```diff
+def _parse_int_env(key: str, default: int) -> int:
+    """Parse integer from env var with fallback to default."""
+    val = os.getenv(key)
+    if val is None:
+        return default
+    try:
+        return int(val)
+    except ValueError:
+        logger.warning(f"Invalid {key}='{val}', using default {default}")
+        return default
+
+
 _tts_settings = {
     "quality": os.getenv("TTS_QUALITY", "standard"),
-    "max_chunk_chars": int(os.getenv("TTS_MAX_CHUNK_CHARS", "800")),
-    "upsample_rate": int(os.getenv("TTS_UPSAMPLE_RATE", "44100")),
+    "max_chunk_chars": _parse_int_env("TTS_MAX_CHUNK_CHARS", 800),
+    "upsample_rate": _parse_int_env("TTS_UPSAMPLE_RATE", 44100),
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/utils/chunked_tts.py` around lines 27 - 31, The _tts_settings dict
initialization uses int(os.getenv(...)) which will raise ValueError on malformed
env values; update the code to parse and validate TTS_MAX_CHUNK_CHARS and
TTS_UPSAMPLE_RATE before assigning to _tts_settings (e.g., a small helper like
parse_int_env(var_name, default) that attempts int(), catches
ValueError/TypeError, logs/warns, and returns the default), and then set
"max_chunk_chars" and "upsample_rate" from that validated result so import-time
crashes are avoided.
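A runnable version of the fallback parser described above, demonstrating the behavior on a malformed and a valid environment value (the logging call is omitted here to keep the sketch dependency-free):

```python
import os

def parse_int_env(key: str, default: int) -> int:
    """Return int(os.environ[key]), or default when missing or malformed."""
    val = os.getenv(key)
    if val is None:
        return default
    try:
        return int(val)
    except ValueError:
        # The real module would log a warning here before falling back.
        return default

os.environ["TTS_MAX_CHUNK_CHARS"] = "abc"   # malformed: falls back to default
os.environ["TTS_UPSAMPLE_RATE"] = "48000"   # valid: parsed normally
settings = {
    "max_chunk_chars": parse_int_env("TTS_MAX_CHUNK_CHARS", 800),
    "upsample_rate": parse_int_env("TTS_UPSAMPLE_RATE", 44100),
}
```

With this guard, a typo in deployment configuration degrades to the default instead of crashing the whole backend at import time.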
Text exceeding max_chunk_chars (default 800) is automatically split at sentence boundaries, generated per-chunk, and concatenated with a 50ms crossfade. Works with all engines (Qwen, LuxTTS, Chatterbox, Turbo).

- Abbreviation-aware sentence splitter (Dr., Mr., e.g., decimals)
- CJK sentence-ending punctuation support
- Paralinguistic tag preservation ([laugh], [cough], etc.)
- Per-chunk seed variation to avoid correlated RNG artefacts
- Per-chunk Chatterbox trim (catches hallucination at each boundary)
- max_chunk_chars exposed as a per-request param on GenerationRequest
- Text max_length raised to 50,000 characters

Closes #99
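An abbreviation-aware splitter of the kind listed above can be sketched as follows. This is an illustration, not the PR's implementation: the abbreviation list is a small sample, and the single-letter heuristic (for initials and the dots inside "e.g.") is an assumption about one workable approach.

```python
import re

ABBREVIATIONS = ("Dr.", "Mr.", "Mrs.", "Ms.", "e.g.", "i.e.")  # illustrative subset

def split_sentences(text: str) -> list[str]:
    """Split on . ! ? and CJK 。！？, skipping known abbreviations,
    single-letter initials, and decimal numbers like 3.14."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?。！？]", text):
        i = m.end()
        before = text[start:i]
        if m.group() == ".":
            # Decimal number: a digit immediately follows the dot.
            if i < len(text) and text[i].isdigit():
                continue
            # Single-letter token before the dot (initials, "e." in "e.g.").
            if re.search(r"(?:^|[\s.])[A-Za-z]\.$", before):
                continue
            # Known abbreviation ending.
            if any(before.endswith(a) for a in ABBREVIATIONS):
                continue
        sentences.append(before.strip())
        start = i
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return [s for s in sentences if s]
```

CJK punctuation needs no abbreviation handling, so 。！？ always terminate a sentence; the resulting pieces would then be packed greedily into chunks of up to max_chunk_chars.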
Summary
Long text that exceeds the Qwen3-TTS model's max_new_tokens=2048 limit (~170s of audio) is now handled automatically:

- Two quality modes: standard (24kHz native) and high (44.1kHz via soxr VHQ resampling)
- GET/POST /tts/settings endpoints for runtime quality control without restart

Short text (<800 chars) uses the original single-shot fast path with zero overhead.
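The high-quality path resamples the model's 24 kHz output to 44.1 kHz with soxr VHQ. A dependency-free sketch of the same interface using linear interpolation (in practice soxr's polyphase filter would replace this, with much better anti-aliasing):

```python
import numpy as np

def resample_audio(audio: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Naive linear-interpolation resampler; the PR uses soxr VHQ instead."""
    if src_rate == dst_rate:
        return audio
    n_out = int(round(len(audio) * dst_rate / src_rate))
    src_t = np.arange(len(audio)) / src_rate   # timestamps of input samples
    dst_t = np.arange(n_out) / dst_rate        # timestamps of output samples
    return np.interp(dst_t, src_t, audio).astype(audio.dtype)
```

Note that upsampling to 44.1 kHz cannot add frequency content above the source's 12 kHz Nyquist limit; it mainly ensures compatibility with pipelines that expect CD-rate audio.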
Changes
- backend/utils/chunked_tts.py: text splitting, concatenation, and resampling utilities
- backend/backends/pytorch_backend.py: chunking in generate(), extract _generate_single()
- backend/main.py: GET/POST /tts/settings endpoints
- backend/models.py: TTSSettingsUpdate model, bump text max_length to 50000
- backend/requirements.txt: soxr>=0.3.0 for high-quality resampling

Environment variables
| Variable | Default | Notes |
|---|---|---|
| TTS_QUALITY | standard | standard = 24kHz, high = 44.1kHz |
| TTS_MAX_CHUNK_CHARS | 800 | |
| TTS_UPSAMPLE_RATE | 44100 | |

Test plan
- GET /tts/settings returns current config
- POST /tts/settings with {"quality":"high"} updates at runtime

Tested on NVIDIA DGX Spark with Qwen3-TTS 1.7B — 9K character input produced ~12 minutes of seamless audio.