Obvious Pauses Between Text Segments in Current OpenAITTSModel Implementation Affect Speech Fluency
#493
Labels: enhancement
Describe the feature
What is the feature you're requesting? How would it work? Please provide examples and details if possible.
Problem Description

The current implementation of `OpenAITTSModel` only supports handling a single, complete `text` per call, roughly as shown below.
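The snippet the author referenced seems to have been lost in formatting. As a stand-in, here is a rough sketch of the single-text-per-call shape being described; the class and method names come from the issue itself, but the body and the simplified `TTSModelSettings` stand-in are paraphrased and may differ from the actual source.

```python
from collections.abc import AsyncIterator
from dataclasses import dataclass

from openai import AsyncOpenAI


@dataclass
class TTSModelSettings:
    """Simplified stand-in for the SDK's settings object."""
    voice: str = "alloy"


class OpenAITTSModel:
    """Synthesizes exactly one complete piece of text per call."""

    def __init__(self, model: str, openai_client: AsyncOpenAI) -> None:
        self.model = model
        self._client = openai_client

    async def run(self, text: str, settings: TTSModelSettings) -> AsyncIterator[bytes]:
        # Every call opens its own request to the OpenAI speech endpoint.
        async with self._client.audio.speech.with_streaming_response.create(
            model=self.model,
            voice=settings.voice,
            input=text,
            response_format="pcm",
        ) as response:
            async for chunk in response.iter_bytes():
                yield chunk
```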
In the upper-layer business logic, the LLM usually outputs text incrementally. As soon as a piece of text is generated, the TTS layer immediately calls `run` to synthesize and play it, for example:
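The original calling example also appears to be missing; the pattern described reads roughly like the following sketch, where `llm_text_stream` and `play_audio` are hypothetical placeholders for the caller's own logic.

```python
async def play_audio(chunk: bytes) -> None:
    ...  # placeholder: hand the chunk to the audio output device


async def speak_llm_output(llm_text_stream, tts_model, settings) -> None:
    """Hypothetical upper-layer loop: synthesize each LLM segment as it arrives."""
    async for text in llm_text_stream:
        # Each finished segment immediately becomes its own TTS request.
        async for audio_chunk in tts_model.run(text, settings):
            await play_audio(audio_chunk)
```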
This approach means that every `text` segment triggers a new, independent OpenAI TTS request, and each request pays its own connection setup and synthesis latency, producing an audible pause between segments.

User Experience Problem
This issue is particularly noticeable in scenarios like LLM streaming conversations or long-form content reading — where the content is supposed to sound continuous, but instead feels unnaturally fragmented due to frequent pauses.
Optimization Goal
Maintain a persistent TTS WebSocket connection.
Whenever new `text` is generated by the LLM, it should be sent over that existing connection so speech synthesis continues without starting a new request for each segment.
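To make the goal concrete, here is a minimal sketch of the kind of interface this could enable. Everything in it (`StreamingTTSSession`, `send_text`, the queue-based fake transport) is a hypothetical illustration, not an existing API: segments are pushed into one long-lived session instead of each one opening a fresh request.

```python
import asyncio
from collections.abc import AsyncIterator


class StreamingTTSSession:
    """Hypothetical long-lived TTS session: one connection, many text segments."""

    def __init__(self) -> None:
        self._text_queue: asyncio.Queue[str | None] = asyncio.Queue()

    async def send_text(self, text: str) -> None:
        # Called by the upper layer each time the LLM emits a new segment.
        await self._text_queue.put(text)

    async def close(self) -> None:
        await self._text_queue.put(None)  # sentinel: no more text is coming

    async def stream_audio(self) -> AsyncIterator[bytes]:
        # A real implementation would hold a single persistent WebSocket
        # connection, forward each queued segment over it, and yield audio as
        # the server returns it. The transport is faked here so the control
        # flow stays visible.
        while True:
            text = await self._text_queue.get()
            if text is None:
                break
            yield f"<audio for: {text!r}>".encode()  # stand-in for PCM bytes


async def main() -> None:
    session = StreamingTTSSession()

    async def feed() -> None:
        for segment in ["Hello,", " this is", " continuous speech."]:
            await session.send_text(segment)
        await session.close()

    feeder = asyncio.create_task(feed())
    async for chunk in session.stream_audio():
        print(chunk)  # play back without per-segment connection setup
    await feeder


asyncio.run(main())
```

The design point is that connection setup happens once per session, so the only per-segment cost is synthesis itself, which is what removes the audible gaps between segments.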