Kokoro FastAPI Audio Streaming Support? #445
I use FastKoko directly to read back material I'm writing for a book. When processing text, the start of the material begins playing back while the rest of the input is still being processed. This is under Docker 4.59 on Windows 11, on a Legion 7 with an i9 and RTX 4090. I'd put the delay at perhaps 1-2 seconds tops. YMMV. For real-time conversation this might be a problem; otherwise, life's good.
It would seem that my implementation could be the issue. I am using OpenAI TTS.
For what it's worth, I did finally get streaming to work with Kokoro for Home Assistant! It required wyoming_openai to handle the API rather than OpenAI TTS. Streaming is not about the speed of audio generation; yes, Kokoro is fast. Instead, streaming is the ability to run text generation in parallel with TTS audio generation: the first sentence the LLM produces can be synthesized to audio while the LLM is still writing the second sentence, then the third, and so on (see the sketch below). The alternative is linear processing, where the LLM has to generate the full final text before TTS audio generation can begin. The result of enabling streaming is a shorter gap between LLM prompt processing and the first spoken word in an STT > LLM > TTS voice assistant pipeline.
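To make the parallelism concrete, here is a minimal Python sketch of sentence-boundary chunking: it buffers streamed LLM tokens, yields each sentence as soon as it is complete, and hands it to a TTS call. The `tokens` iterable and the `synthesize` callable are hypothetical stand-ins for your LLM token stream and whatever TTS client you use (wyoming_openai, a Kokoro HTTP call, etc.); this is not the actual wyoming_openai implementation.

```python
# Sketch: start TTS on each completed sentence while the LLM is still
# generating the rest of the reply. `tokens` and `synthesize` are assumptions.
import re
from typing import Callable, Iterable, Iterator

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_from_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream finishes them."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the last fragment is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()

def speak_while_generating(tokens: Iterable[str],
                           synthesize: Callable[[str], None]) -> None:
    """Send each finished sentence to TTS without waiting for the full reply."""
    for sentence in sentences_from_stream(tokens):
        synthesize(sentence)  # audio for sentence N plays while N+1 is written
```

With this pattern the time-to-first-word depends only on how fast the LLM finishes its first sentence plus one short TTS call, rather than on the length of the whole reply.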
Piper TTS streams audio on sentence boundaries, which significantly reduces wait times for long text-to-speech responses. This is extremely helpful with LLMs: as the text is generated, TTS generation starts for each completed sentence.
Is this possible with Kokoro for Home Assistant Voice Assist?
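For reference, on the HTTP side this usually comes down to whether the server returns audio as a chunked response that a client can start playing before generation finishes. Below is a minimal sketch against an OpenAI-compatible speech endpoint of the kind Kokoro-FastAPI exposes; the URL, port, model name, voice id, and the `stream` field are assumptions, so check your server's API docs before relying on them.

```python
# Sketch: pull chunked audio from an OpenAI-compatible TTS endpoint.
# Endpoint URL, port 8880, "kokoro"/"af_bella", and "stream" are assumptions.
import requests

def stream_tts(text: str,
               url: str = "http://localhost:8880/v1/audio/speech") -> None:
    payload = {
        "model": "kokoro",          # assumed model name
        "voice": "af_bella",        # assumed voice id
        "input": text,
        "response_format": "mp3",
        "stream": True,             # assumed flag; some servers stream by default
    }
    with requests.post(url, json=payload, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open("reply.mp3", "wb") as f:
            for chunk in resp.iter_content(chunk_size=4096):
                if chunk:
                    # First chunks arrive while the rest is still being
                    # generated, so playback (or piping to a player) can
                    # start immediately instead of waiting for the full file.
                    f.write(chunk)
```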