Obvious Pauses Between Text Segments in Current OpenAITTSModel Implementation Affect Speech Fluency #493


Open
mikuh opened this issue Apr 14, 2025 · 1 comment
Labels: enhancement (New feature or request)

mikuh commented Apr 14, 2025

Please read this first

  • Have you read the docs? (Agents SDK docs) Yes
  • Have you searched for related issues? Others may have had similar requests. Yes

Describe the feature

What is the feature you're requesting? How would it work? Please provide examples and details if possible.

Problem Description

The current implementation of OpenAITTSModel accepts only a single, complete text per call, as shown below:

async def run(self, text: str, settings: TTSModelSettings) -> AsyncIterator[bytes]:
    ...

In the upper-layer application logic, the LLM usually emits text incrementally. As soon as a segment of text is generated, the TTS layer calls run to synthesize and play it, for example:

for text in stream_output:
    async for chunk in tts.run(text, settings):
        yield chunk

This approach leads to:

  1. Each text segment triggers a new, independent OpenAI TTS request.
  2. Playback suffers noticeable pauses between segments while each new request is set up.
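To make the cost concrete, here is a small simulation (illustrative only: SETUP_DELAY, run_once, and synthesize_per_segment are hypothetical stand-ins, not SDK code) where every per-segment request pays a fixed connection/time-to-first-byte cost, so the total dead air grows linearly with the number of segments:

```python
import asyncio
import time
from typing import AsyncIterator

# Hypothetical per-request setup cost (connection + time to first audio byte).
SETUP_DELAY = 0.05


async def run_once(text: str) -> AsyncIterator[bytes]:
    # Stand-in for a per-call TTS request: each call starts a fresh request,
    # so each call pays the setup cost before any audio arrives.
    await asyncio.sleep(SETUP_DELAY)
    yield text.encode("utf-8")


async def synthesize_per_segment(segments: list[str]) -> float:
    # The current pattern: one independent TTS request per text segment.
    start = time.monotonic()
    for seg in segments:
        async for _chunk in run_once(seg):
            pass  # audio chunks would be played here
    return time.monotonic() - start


elapsed = asyncio.run(synthesize_per_segment(["one", "two", "three", "four"]))
print(f"4 segments pay at least {4 * SETUP_DELAY:.2f}s of setup overhead "
      f"(measured {elapsed:.2f}s)")
```

With a single persistent stream, that setup cost would be paid once rather than once per segment.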

User Experience Problem

This issue is particularly noticeable in LLM streaming conversations and long-form content reading, where the output is supposed to sound continuous but instead feels unnaturally fragmented because of the pauses between requests.


Optimization Goal

Maintain a persistent TTS WebSocket connection.

Whenever new text is generated by the LLM:

  • Directly send the incremental text to the TTS WebSocket stream.
  • The backend continuously pushes audio chunks without restarting or reconnecting.
  • The audio player can play these chunks seamlessly in real-time, without waiting for new connections or full-text inputs.
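A minimal sketch of what such a persistent streaming interface could look like, assuming a queue-backed session in place of the real WebSocket backend (all names here, e.g. StreamingTTSSession, feed, and stream_audio, are hypothetical illustrations, not existing Agents SDK APIs):

```python
import asyncio
from typing import AsyncIterator, List, Optional


class StreamingTTSSession:
    # Hypothetical sketch of the requested interface: one long-lived session
    # that accepts incremental text and emits a single continuous audio
    # stream, with no per-segment connect/teardown (and thus no forced gap).
    def __init__(self) -> None:
        self._text_queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()

    async def feed(self, text: str) -> None:
        # Send an incremental text segment over the persistent connection.
        await self._text_queue.put(text)

    async def close(self) -> None:
        # Signal end of input so the audio stream can finish.
        await self._text_queue.put(None)

    async def stream_audio(self) -> AsyncIterator[bytes]:
        # One continuous audio stream across all fed segments.
        while True:
            text = await self._text_queue.get()
            if text is None:
                return
            # In a real backend, this would be audio pushed by the server.
            yield text.encode("utf-8")


async def main() -> List[bytes]:
    session = StreamingTTSSession()

    async def producer() -> None:
        # Feed segments as the LLM emits them.
        for segment in ["Hello, ", "world", "!"]:
            await session.feed(segment)
        await session.close()

    chunks: List[bytes] = []

    async def consumer() -> None:
        # Play (here: collect) audio chunks as they arrive.
        async for chunk in session.stream_audio():
            chunks.append(chunk)

    await asyncio.gather(producer(), consumer())
    return chunks


print(asyncio.run(main()))  # → [b'Hello, ', b'world', b'!']
```

The key design point is that the producer and consumer run concurrently over one session: new text can be fed while earlier audio is still streaming, which is what removes the inter-segment pauses.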
@mikuh mikuh added the enhancement New feature or request label Apr 14, 2025
rm-openai (Collaborator) commented:
Thanks, this makes sense. @dkundel-openai - assigning you for when you're back.
