
feat: chunked TTS generation with quality selector #99

Closed
glaucusj-sai wants to merge 2 commits into jamiepine:main from glaucusj-sai:feat/chunked-tts-quality

Conversation


@glaucusj-sai glaucusj-sai commented Feb 19, 2026

Summary

Long text that exceeds the Qwen3-TTS model's max_new_tokens=2048 limit (~170s of audio) is now handled automatically:

  • Text splitting: Splits at sentence boundaries (with clause/word fallbacks) into configurable chunks (default 800 chars)
  • Crossfade concatenation: Joins audio chunks with a 50ms crossfade to eliminate clicks at boundaries
  • Quality selector: Runtime-switchable between standard (24kHz native) and high (44.1kHz via soxr VHQ resampling)
  • Settings API: New GET/POST /tts/settings endpoints for runtime quality control without restart

Short text (<800 chars) uses the original single-shot fast path with zero overhead.
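A minimal sketch of the sentence-boundary splitting with word-level fallback described above (illustrative only; the PR's actual splitter also handles clause boundaries, and the naive regex here does not handle abbreviations):

```python
import re

def split_text_into_chunks(text: str, max_chars: int = 800) -> list[str]:
    """Split text at sentence boundaries, packing sentences into chunks of
    at most max_chars. Oversized sentences fall back to a word-level split."""
    # Naive sentence split on ., !, ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            # Oversized sentence: flush what we have, then split by words.
            if current:
                chunks.append(current)
                current = ""
            piece = ""
            for word in sentence.split():
                if len(piece) + len(word) + 1 > max_chars and piece:
                    chunks.append(piece)
                    piece = word
                else:
                    piece = f"{piece} {word}".strip()
            current = piece
        elif len(current) + len(sentence) + 1 > max_chars and current:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Short inputs return a single chunk, which is what lets the backend keep the single-shot fast path for text under the limit.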

Changes

File Change
backend/utils/chunked_tts.py New: text splitting, audio concat, resampling utilities
backend/backends/pytorch_backend.py Integrate chunking into generate(), extract _generate_single()
backend/main.py Add GET/POST /tts/settings endpoints
backend/models.py Add TTSSettingsUpdate model, bump text max_length to 50000
backend/requirements.txt Add soxr>=0.3.0 for high-quality resampling

Environment variables

Variable Default Description
TTS_QUALITY standard Output quality (standard=24kHz, high=44.1kHz)
TTS_MAX_CHUNK_CHARS 800 Max characters per chunk
TTS_UPSAMPLE_RATE 44100 Target sample rate for high quality
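As a sketch, the high-quality path controlled by TTS_QUALITY and TTS_UPSAMPLE_RATE might look like this (function name is illustrative; per the PR, soxr VHQ is preferred, with a linear-interpolation fallback when soxr is unavailable):

```python
import numpy as np

def resample_audio(audio: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Resample mono audio, preferring soxr VHQ and falling back to
    linear interpolation if soxr is not installed."""
    if src_rate == dst_rate:
        return audio
    try:
        import soxr  # optional dependency (soxr>=0.3.0 in this PR)
        return soxr.resample(audio, src_rate, dst_rate, quality="VHQ")
    except ImportError:
        # Dependency-free fallback: coarse but always available.
        duration = len(audio) / src_rate
        n_out = int(round(duration * dst_rate))
        x_old = np.linspace(0.0, duration, num=len(audio), endpoint=False)
        x_new = np.linspace(0.0, duration, num=n_out, endpoint=False)
        return np.interp(x_new, x_old, audio).astype(audio.dtype)
```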

Test plan

  • Short text (<800 chars): uses single-shot path, no chunking overhead
  • Long text (9K+ chars): splits into ~12 chunks, generates and concatenates seamlessly
  • Quality switch to high: output sample rate changes to 44100
  • Switch back to standard: output returns to 24000
  • GET /tts/settings returns current config
  • POST /tts/settings with {"quality":"high"} updates at runtime

Tested on NVIDIA DGX Spark with Qwen3-TTS 1.7B — 9K character input produced ~12 minutes of seamless audio.
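The crossfade concatenation from the summary can be sketched as follows (assuming mono float32 chunks; the 50 ms default matches the description above):

```python
import numpy as np

def concatenate_audio_chunks(chunks: list[np.ndarray], sample_rate: int,
                             crossfade_ms: float = 50.0) -> np.ndarray:
    """Join audio chunks with a short linear crossfade so chunk
    boundaries don't produce audible clicks."""
    if not chunks:
        return np.zeros(0, dtype=np.float32)
    fade = int(sample_rate * crossfade_ms / 1000.0)
    out = chunks[0].astype(np.float32)
    for nxt in chunks[1:]:
        nxt = nxt.astype(np.float32)
        n = min(fade, len(out), len(nxt))
        if n == 0:
            out = np.concatenate([out, nxt])
            continue
        # Fade the tail of the running output down while the head of
        # the next chunk fades up, overlapping by n samples.
        ramp = np.linspace(0.0, 1.0, n, dtype=np.float32)
        blended = out[-n:] * (1.0 - ramp) + nxt[:n] * ramp
        out = np.concatenate([out[:-n], blended, nxt[n:]])
    return out
```

Each crossfade consumes n samples of total length, so a 12-chunk generation is shortened by eleven 50 ms overlaps relative to naive concatenation.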

Summary by CodeRabbit

  • New Features
    • Extended text input support: TTS now handles texts up to 50,000 characters (previously 5,000).
    • Configurable quality settings: adjust audio quality levels (standard/high) and text chunk sizes at runtime.
    • New settings endpoints: retrieve and update TTS configuration dynamically via API.
    • Improved audio processing: enhanced handling for longer texts with optimized seamless transitions.

Long text that exceeds the model's max_new_tokens limit now gets
automatically split at sentence boundaries, generated per-chunk,
and concatenated with a short crossfade. A runtime-configurable
quality setting lets users choose between standard (24 kHz native)
and high (44.1 kHz via soxr VHQ resampling).

Changes:
- Add backend/utils/chunked_tts.py with text splitting, audio
  concatenation, and resampling utilities
- Integrate chunking directly into PyTorchTTSBackend.generate()
  so both the UI /generate and any API consumer benefit
- Add GET/POST /tts/settings endpoints for runtime quality control
- Bump GenerationRequest.text max_length from 5000 to 50000
- Add soxr to requirements.txt

Tested with 9K+ character input producing ~12 minutes of
seamless audio on an NVIDIA DGX Spark (Qwen3-TTS 1.7B).
@TacoDark

AI Generated pull request. Please review code to make sure it works.

@glaucusj-sai
Author

AI Generated pull request. Please review code to make sure it works.

Yes, it has been tested. I used AI to write the PR since I didn't have much experience with forking. This was done for a project of mine that needed large podcast scripts converted to voice, where quality matters and the text is long. I have generated more than 10 hours of voice with it so far. Thanks for your comments.

@TacoDark

AI Generated pull request. Please review code to make sure it works.

Yes, it has been tested. I used AI to write the PR since I didn't have much experience with forking. This was done for a project of mine that needed large podcast scripts converted to voice, where quality matters and the text is long. I have generated more than 10 hours of voice with it so far. Thanks for your comments.

Thank you for being honest; I approve of the commit.

- models.py: keep max_length=50000 from PR + language pattern with 'he' from main
- requirements.txt: keep soxr from PR + numba and httpx from main

coderabbitai bot commented Mar 13, 2026

📝 Walkthrough


The changes introduce a chunked text-to-speech generation system with configurable settings. Long text is split into sentence-boundary chunks, processed individually, concatenated with crossfading, and optionally upsampled. New API endpoints expose runtime configuration for quality and chunk size settings, along with supporting utility functions for audio processing.

Changes

Cohort / File(s) Summary
Core TTS Processing
backend/backends/pytorch_backend.py, backend/utils/chunked_tts.py
New chunked generation flow with text splitting at sentence/clause boundaries, per-chunk audio generation via _generate_single, crossfade-based concatenation, and optional quality-based upsampling. Comprehensive audio utilities including resampling via soxr with fallback to linear interpolation.
API & Configuration
backend/main.py, backend/models.py
Added GET/POST endpoints (/tts/settings) for runtime TTS settings management. Introduced TTSSettingsUpdate model for validating quality and max_chunk_chars parameters. Increased GenerationRequest.text max_length from 5000 to 50000 characters.
Dependencies
backend/requirements.txt
Added soxr>=0.3.0 dependency for high-quality audio resampling.
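The POST /tts/settings validation described in the API & Configuration cohort can be sketched as a small validator (a hypothetical re-implementation; field names follow the PR description, and the real code uses a pydantic TTSSettingsUpdate model):

```python
VALID_QUALITIES = {"standard", "high"}

def update_settings(settings: dict, patch: dict) -> dict:
    """Apply a validated patch to a settings dict, raising ValueError on
    bad input so the endpoint can translate it into an HTTP 400."""
    new = dict(settings)
    if "quality" in patch:
        if patch["quality"] not in VALID_QUALITIES:
            raise ValueError(f"quality must be one of {sorted(VALID_QUALITIES)}")
        new["quality"] = patch["quality"]
    if "max_chunk_chars" in patch:
        value = patch["max_chunk_chars"]
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            raise ValueError("max_chunk_chars must be a positive integer")
        new["max_chunk_chars"] = value
    return new
```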

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API as API<br/>(main.py)
    participant Backend as Backend<br/>(pytorch_backend.py)
    participant Utils as Utils<br/>(chunked_tts.py)
    participant Engine as TTS Engine

    Client->>API: POST /generate (long text)
    API->>Backend: generate(text)
    Backend->>Utils: split_text_into_chunks(text)
    Utils-->>Backend: [chunk1, chunk2, ...]
    
    loop For each chunk
        Backend->>Backend: _generate_single(chunk)
        Backend->>Engine: inference(chunk)
        Engine-->>Backend: audio_chunk
    end
    
    Backend->>Utils: concatenate_audio_chunks(chunks, sr)
    Utils-->>Backend: concatenated_audio
    Backend->>Utils: resample_audio(audio, src_rate, dst_rate)
    Utils-->>Backend: upsampled_audio
    Backend->>API: return (audio, sample_rate)
    API-->>Client: 200 OK with audio

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 Hop, hop—chunks of audio flow,
Sentences split and voices glow,
Crossfades blend them smooth and tight,
Long texts sing in sterling bright! ✨🎵

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 76.92% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding chunked TTS generation with a quality selector feature to handle longer text inputs.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/backends/pytorch_backend.py (1)

404-427: ⚠️ Potential issue | 🔴 Critical

Critical: language parameter missing from _generate_single().

The _generate_sync() function at line 422 references language:

language=LANGUAGE_CODE_TO_NAME.get(language, "auto"),

However, language is a parameter of generate(), not _generate_single(). Since _generate_single() is a separate method (not a nested function inside generate()), it doesn't have access to language through closure.

This will cause a NameError: name 'language' is not defined at runtime.

🐛 Proposed fix: Add `language` parameter
     async def _generate_single(
         self,
         text: str,
         voice_prompt: dict,
+        language: str = "en",
         seed: Optional[int] = None,
         instruct: Optional[str] = None,
     ) -> Tuple[np.ndarray, int]:
         """Generate audio for a single text segment (no chunking)."""

And update all call sites in generate():

             audio, sample_rate = await self._generate_single(
-                text, voice_prompt, seed, instruct,
+                text, voice_prompt, language, seed, instruct,
             )
                 chunk_audio, chunk_sr = await self._generate_single(
-                    chunk_text, voice_prompt, seed, instruct,
+                    chunk_text, voice_prompt, language, seed, instruct,
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/backends/pytorch_backend.py` around lines 404 - 427, The helper
_generate_single is referencing an undefined name language, causing a NameError;
add language: str (or Optional[str]) as a parameter to _generate_single and
propagate it into its inner _generate_sync call (use language when calling
LANGUAGE_CODE_TO_NAME), then update all places that call _generate_single from
generate() to pass the generate()'s language argument (and any other callers) so
the method receives the correct language value; ensure function signature
changes for _generate_single(...) and its callers match.
🧹 Nitpick comments (1)
backend/backends/pytorch_backend.py (1)

367-401: Thread safety: settings read without synchronization.

_tts_settings is read at the start of generation (lines 367, 393) but can be modified concurrently via update_tts_settings(). While not causing crashes, a concurrent update mid-generation could result in chunks being split with one max_chunk_chars value while resampling uses a different quality setting.

For a "Chill" review, this is acceptable for an MVP since the impact is limited to occasional inconsistent audio quality settings within a single long request.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/backends/pytorch_backend.py` around lines 367 - 401, The code reads
_tts_settings multiple times during generation which can be concurrently
modified by update_tts_settings(), causing inconsistent behavior; fix by taking
a consistent snapshot of settings at the start of generate (e.g., acquire an
existing self._tts_settings_lock or create one, then copy =
dict(self._tts_settings) / copy.deepcopy under the lock) and use that local
snapshot (e.g., local_settings) for split_text_into_chunks,
quality/upsample_rate checks, and _generate_single calls so all uses
(max_chunk_chars, quality, upsample_rate) are consistent for the whole request;
ensure update_tts_settings() also uses the same lock when mutating
_tts_settings.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0a98e6cb-ae35-4736-b2c4-d068db0dce6e

📥 Commits

Reviewing files that changed from the base of the PR and between c12b5d6 and 4b8ca7a.

📒 Files selected for processing (5)
  • backend/backends/pytorch_backend.py
  • backend/main.py
  • backend/models.py
  • backend/requirements.txt
  • backend/utils/chunked_tts.py

Comment on lines +1469 to +1483
@app.get("/tts/settings")
async def get_tts_settings():
"""Get current TTS chunking and quality settings."""
from .utils.chunked_tts import get_tts_settings as _get_settings
return _get_settings()


@app.post("/tts/settings")
async def update_tts_settings(request: models.TTSSettingsUpdate):
"""Update TTS quality and chunking settings at runtime."""
from .utils.chunked_tts import update_tts_settings as _update_settings
try:
return _update_settings(request.model_dump(exclude_none=True))
except ValueError as e:
raise HTTPException(status_code=400, detail=str(e))

⚠️ Potential issue | 🟡 Minor

Preserve exception chain with raise ... from e.

The static analysis tool flagged line 1483: within an except clause, exceptions should be raised with raise ... from e to preserve the exception chain for debugging.

🔧 Proposed fix
     except ValueError as e:
-        raise HTTPException(status_code=400, detail=str(e))
+        raise HTTPException(status_code=400, detail=str(e)) from e
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@app.post("/tts/settings")
async def update_tts_settings(request: models.TTSSettingsUpdate):
    """Update TTS quality and chunking settings at runtime."""
    from .utils.chunked_tts import update_tts_settings as _update_settings
    try:
        return _update_settings(request.model_dump(exclude_none=True))
-    except ValueError as e:
-        raise HTTPException(status_code=400, detail=str(e))
+    except ValueError as e:
+        raise HTTPException(status_code=400, detail=str(e)) from e
🧰 Tools
🪛 Ruff (0.15.5)

[warning] 1483-1483: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/main.py` around lines 1469 - 1483, In update_tts_settings, the caught
ValueError is re-raised as an HTTPException without preserving the exception
chain; modify the except block in the update_tts_settings function so you raise
the HTTPException using "raise HTTPException(... ) from e" (include the same
status_code and detail=str(e)) to preserve the original exception context for
debugging while keeping the existing behavior of update_tts_settings.

"""Request model for voice generation."""
profile_id: str
text: str = Field(..., min_length=1, max_length=5000)
text: str = Field(..., min_length=1, max_length=50000)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the frontend validation limits
echo "=== Checking frontend text length validation ==="
rg -n 'max.*5000|max_length.*5000' --type=ts --type=tsx -g '!node_modules'
rg -n 'Max 5000' --type=ts --type=tsx -g '!node_modules'

Repository: jamiepine/voicebox

Length of output: 183


🏁 Script executed:

#!/bin/bash
# Find the frontend files mentioned in the review
echo "=== Locating frontend files ==="
find . -name "useGenerationForm.ts" 2>/dev/null
find . -name "GenerationForm.tsx" 2>/dev/null

# Search for 5000 character limits using a broader pattern
echo -e "\n=== Searching for 5000 character validation ==="
rg -n '5000' --max-count=20 2>/dev/null | grep -E '(max|Max|length)'

Repository: jamiepine/voicebox

Length of output: 1025


🏁 Script executed:

#!/bin/bash
echo "=== useGenerationForm.ts around line 14 ==="
sed -n '10,20p' ./app/src/lib/hooks/useGenerationForm.ts

echo -e "\n=== GenerationForm.tsx around lines 85-86 ==="
sed -n '80,90p' ./app/src/components/Generation/GenerationForm.tsx

echo -e "\n=== $GenerationRequest.ts around line 15 ==="
sed -n '10,20p' ./app/src/lib/api/schemas/\$GenerationRequest.ts

echo -e "\n=== backend/utils/validation.py around line 9 ==="
sed -n '1,15p' backend/utils/validation.py

echo -e "\n=== Check if validation.py's validate_text is used for the text field ==="
rg -n 'validate_text' --max-count=10

Repository: jamiepine/voicebox

Length of output: 1849


🏁 Script executed:

#!/bin/bash
echo "=== Check where validate_text is called ==="
rg -n 'validate_text' -A 2 -B 2

echo -e "\n=== Find the API endpoint handler for generation ==="
rg -n 'def.*generation|@.*post|@.*app' backend/ --max-count=20 | head -30

echo -e "\n=== Look for the GenerationRequest handler ==="
rg -n 'GenerationRequest' backend/ -A 3 -B 1 | head -40

Repository: jamiepine/voicebox

Length of output: 3978


Frontend validation mismatch will block users from using the expanded limit.

The backend now accepts up to 50,000 characters, but the frontend still enforces a 5,000 character limit in three places:

  • app/src/lib/hooks/useGenerationForm.ts:14 — Zod schema: z.string().min(1).max(5000)
  • app/src/components/Generation/GenerationForm.tsx:85-86 — UI shows "Max 5000 characters"
  • app/src/lib/api/schemas/$GenerationRequest.ts:15 — Generated schema: maxLength: 5000

Users won't be able to submit texts longer than 5,000 characters from the UI despite the backend supporting it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/models.py` at line 61, Update the frontend validation and UI to match
the backend's new 50,000 character limit: change the Zod schema in
useGenerationForm.ts (the z.string().min(1).max(5000) call) to max(50000),
update the UI text in GenerationForm.tsx that currently says "Max 5000
characters" to "Max 50000 characters", and regenerate or adjust the generated
API schema $GenerationRequest (the maxLength: 5000 entry) to maxLength: 50000 so
client-side validation and submitted payloads align with the backend.

Comment on lines +27 to +31
_tts_settings = {
    "quality": os.getenv("TTS_QUALITY", "standard"),
    "max_chunk_chars": int(os.getenv("TTS_MAX_CHUNK_CHARS", "800")),
    "upsample_rate": int(os.getenv("TTS_UPSAMPLE_RATE", "44100")),
}

⚠️ Potential issue | 🟡 Minor

Invalid environment variable will crash at import time.

int(os.getenv(...)) will raise ValueError if the environment variable contains a non-integer value (e.g., TTS_MAX_CHUNK_CHARS=abc). This will crash the application at startup with an unhelpful error message.

🛡️ Proposed fix: Add validation with defaults
+def _parse_int_env(key: str, default: int) -> int:
+    """Parse integer from env var with fallback to default."""
+    val = os.getenv(key)
+    if val is None:
+        return default
+    try:
+        return int(val)
+    except ValueError:
+        logger.warning(f"Invalid {key}='{val}', using default {default}")
+        return default
+
+
 _tts_settings = {
     "quality": os.getenv("TTS_QUALITY", "standard"),
-    "max_chunk_chars": int(os.getenv("TTS_MAX_CHUNK_CHARS", "800")),
-    "upsample_rate": int(os.getenv("TTS_UPSAMPLE_RATE", "44100")),
+    "max_chunk_chars": _parse_int_env("TTS_MAX_CHUNK_CHARS", 800),
+    "upsample_rate": _parse_int_env("TTS_UPSAMPLE_RATE", 44100),
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/utils/chunked_tts.py` around lines 27 - 31, The _tts_settings dict
initialization uses int(os.getenv(...)) which will raise ValueError on malformed
env values; update the code to parse and validate TTS_MAX_CHUNK_CHARS and
TTS_UPSAMPLE_RATE before assigning to _tts_settings (e.g., a small helper like
parse_int_env(var_name, default) that attempts int(), catches
ValueError/TypeError, logs/warns, and returns the default), and then set
"max_chunk_chars" and "upsample_rate" from that validated result so import-time
crashes are avoided.

jamiepine added a commit that referenced this pull request Mar 13, 2026
Text exceeding max_chunk_chars (default 800) is automatically split at
sentence boundaries, generated per-chunk, and concatenated with a 50ms
crossfade.  Works with all engines (Qwen, LuxTTS, Chatterbox, Turbo).

- Abbreviation-aware sentence splitter (Dr., Mr., e.g., decimals)
- CJK sentence-ending punctuation support
- Paralinguistic tag preservation ([laugh], [cough], etc.)
- Per-chunk seed variation to avoid correlated RNG artefacts
- Per-chunk Chatterbox trim (catches hallucination at each boundary)
- max_chunk_chars exposed as per-request param on GenerationRequest
- Text max_length raised to 50,000 characters

Closes #99
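The per-chunk seed variation mentioned above could be derived like this (a hypothetical sketch; the actual commit's derivation may differ):

```python
from typing import Optional

def chunk_seed(base_seed: Optional[int], chunk_index: int) -> Optional[int]:
    """Derive a distinct but reproducible seed per chunk from a base seed,
    so adjacent chunks don't share correlated RNG state."""
    if base_seed is None:
        return None  # no seed requested: stay fully random per chunk
    # A fixed prime offset keeps full runs reproducible while
    # decorrelating the per-chunk generators.
    return (base_seed + 1000003 * chunk_index) % (2**31)
```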