QuiLLMan is a complete voice chat application built on Modal: you speak and the chatbot speaks back!
OpenAI's Whisper V3 produces a transcript of your speech, which is passed into the LLaMA 3.1 8B language model to generate a response, which is then synthesized into speech by Coqui's XTTS text-to-speech model. Altogether, this produces a voice-to-voice chat experience.
We’ve enjoyed playing around with QuiLLMan enough at Modal HQ that we decided to share the repo and put up a live demo.
Everything — the React frontend, the backend API, the LLMs — is deployed serverlessly, allowing it to automatically scale and ensuring you only pay for the compute you use. Read on to see how Modal makes this easy!
This post provides a high-level walkthrough of the repo. We’re looking to add more models and features to this as time goes on, and contributions are welcome!
Traditionally, building a robust serverless web application as complex as QuiLLMan would require a lot of work — you’re setting up a backend API and three different inference modules, running in separate custom containers and autoscaling independently.
But with Modal, it’s as simple as writing 4 different classes and running a CLI command.
Our project structure looks like this:
- Language model module: continues a text conversation with a text reply.
- Transcription module: converts speech audio into text.
- Text-to-speech module: converts text into speech.
- FastAPI server: runs server-side app logic.
- React frontend: runs client-side interaction logic.
Let’s go through each of these components in more detail.
Language models are trained to predict what text will come at the end of incomplete text. From this simple task emerge the sparks of artificial general intelligence.
In this case, we want to predict the text that a helpful, friendly assistant might write to continue a conversation with a user.
As with all Modal applications, we start by describing the environment (the container Image), which we construct via Python method chaining:
llama_image = (
    modal.Image.debian_slim(python_version="3.10")
    .pip_install(
        "transformers==4.44.2",
        "vllm==0.6.0",
        "torch==2.4.0",
        "hf_transfer==0.1.8",
        "huggingface_hub==0.24.6",
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)
The chain starts with a base Debian container with Python 3.10, then uses pip to install the Python packages we need. Pinning the versions of our dependencies improves the reproducibility of the image.
We use vLLM, a high-performance open-source library for running large language models on GPUs, to serve the LLaMA model. Its inference server can continuously batch concurrent requests.
The models we use define a generate function that constructs an input to our language model from a prompt template, the conversation history, and the latest text from the user. It then yields (streams) words as they are produced. Remote Python generators work out-of-the-box in Modal, so building streaming interactions is easy.
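To make that structure concrete, here is a condensed sketch of what such a class can look like on Modal, building on the llama_image defined above. The class body, prompt template, and sampling settings are illustrative assumptions rather than the repo's exact code, and for brevity it yields the finished reply word by word instead of streaming tokens as vLLM emits them:

import modal

app = modal.App("quillman-llama-sketch")  # hypothetical app name

with llama_image.imports():
    # heavy dependencies are only imported inside the container
    from vllm import LLM, SamplingParams


@app.cls(image=llama_image, gpu="A100")
class Llama:
    @modal.enter()
    def load(self):
        # download weights and start the vLLM engine once per container
        self.llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")

    @modal.method()
    def generate(self, user_text: str, history: list[str] | None = None):
        # build the model input from a prompt template plus the conversation so far
        turns = list(history or [])
        prompt = "\n".join([*turns, f"User: {user_text}", "Assistant:"])
        params = SamplingParams(temperature=0.7, max_tokens=512)
        reply = self.llm.generate([prompt], params)[0].outputs[0].text
        for word in reply.split(" "):
            yield word + " "  # callers consume this with .remote_gen()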
Although we’re going to call this model from our backend API, it’s useful to test it directly as well.
To get a basic "smoke test" going, we define a local_entrypoint:
@app.local_entrypoint()
def main(prompt: str):
    model = Llama()
    for val in model.generate.remote_gen(prompt):
        print(val, end="", flush=True)
Now, we can modal run the model with a prompt of our choice from the terminal:
modal run -q src.llama --prompt "Why do computers get hot?"
In the Whisper module, we define a Modal class that uses OpenAI's Whisper V3 to transcribe audio in real time. To speed up transcription, we use Flash Attention kernels, which require nvcc, the CUDA compiler. Check out this guide for more on how to set up and use CUDA on Modal.
We’re using an A10G GPU for transcriptions, which lets us transcribe most segments in under 2 seconds.
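For illustration, a stripped-down version of this module might look like the sketch below. The base image tag, package list, and pipeline settings are assumptions rather than the repo's exact configuration; the important parts are the CUDA "devel" base image (which provides nvcc), the Flash Attention install, and the A10G class definition:

import modal

app = modal.App("quillman-whisper-sketch")  # hypothetical app name

cuda_image = (
    # the "devel" CUDA images ship nvcc, which flash-attn needs at build time
    modal.Image.from_registry("nvidia/cuda:12.2.0-devel-ubuntu22.04", add_python="3.10")
    .apt_install("ffmpeg")  # lets the transformers pipeline decode raw audio bytes
    .pip_install("torch==2.4.0", "transformers==4.44.2", "ninja", "packaging", "wheel")
    .pip_install("flash-attn", extra_options="--no-build-isolation")
)


@app.cls(image=cuda_image, gpu="A10G")
class Whisper:
    @modal.enter()
    def load(self):
        import torch
        from transformers import pipeline

        # load Whisper V3 once per container, with Flash Attention enabled
        self.pipe = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-large-v3",
            torch_dtype=torch.float16,
            device="cuda",
            model_kwargs={"attn_implementation": "flash_attention_2"},
        )

    @modal.method()
    def transcribe(self, audio: bytes) -> str:
        # `audio` is one segment of speech streamed up from the browser
        return self.pipe(audio)["text"]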
The text-to-speech module uses Coqui's XTTS model, which offers a variety of voices and languages. We parallelize TTS generation by splitting transcripts into smaller chunks and synthesizing them across multiple GPUs.
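A minimal sketch of that module follows, with assumed details (package version, built-in speaker, output encoding) that may differ from the repo:

import io

import modal

app = modal.App("quillman-xtts-sketch")  # hypothetical app name

tts_image = (
    modal.Image.debian_slim(python_version="3.10")
    .pip_install("TTS==0.22.0", "soundfile")
    .env({"COQUI_TOS_AGREED": "1"})  # accept the XTTS license non-interactively
)


@app.cls(image=tts_image, gpu="A10G")
class XTTS:
    @modal.enter()
    def load(self):
        from TTS.api import TTS

        # load Coqui's XTTS v2 once per container and move it onto the GPU
        self.model = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

    @modal.method()
    def speak(self, text: str) -> bytes:
        import numpy as np
        import soundfile as sf

        # synthesize one chunk of text; chunks are fanned out across containers
        wav = self.model.tts(text=text, language="en", speaker="Ana Florence")
        buffer = io.BytesIO()
        sf.write(buffer, np.array(wav), samplerate=24_000, format="WAV")
        return buffer.getvalue()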
The file app.py contains a FastAPI app that chains the inference modules into a single pipeline. We can serve this app over the internet, without any extra effort on top of writing the localhost version, by slapping on an @asgi_app decorator.
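The serving pattern itself is only a few lines. The sketch below uses assumed names (web_image, a placeholder /status route) and elides the real route logic, but it shows the shape of it:

import modal
from fastapi import FastAPI

app = modal.App("quillman-web-sketch")  # hypothetical app name
web_image = modal.Image.debian_slim(python_version="3.10").pip_install("fastapi[standard]")

fastapi_app = FastAPI()


@fastapi_app.get("/status")
def status():
    # placeholder route; the real app.py defines /pipeline and /prewarm here
    return {"ok": True}


@app.function(image=web_image)
@modal.asgi_app()
def serve():
    # returning the ASGI app is all Modal needs to host it at a public URL
    return fastapi_app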
We structure the pipeline to minimize the perceived latency for the user during conversations. Human conversations involve turn-taking with delays under one second, so that's our target.
We take advantage of the fact that transcription and generation proceed faster than real time to reduce this latency.
While the user is speaking, we stream audio to the server in chunks, so most of the transcript is complete by the time the user finishes speaking.
And text-to-speech happens faster than audio playback, so after we generate an initial speech response, either filler words or the first chunk of the language model's response, we can generate additional speech while the user is listening.
We use these techniques in /pipeline to achieve this (a sketch of how they fit together follows the list):
- The /pipeline endpoint is a WebSocket connection, enabling streaming audio in chunks.
- All intermediate data in the pipeline is streamed through Python generators between inference modules.
- We start transcribing before the user finishes speaking.
- The transcription is done in parallel using Modal spawn, since each chunk is roughly independent.
- The text response generation is a Modal remote generator function, streaming output as it's generated. Outputs can be fed into the text-to-speech module before the full transcript is available.
- The text-to-speech is done in parallel using Modal map, since each sentence is independent of the others.
- We stream synthesized audio back to the client as it's generated so the browser can start playback as soon as possible.
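Pieced together, the handler looks roughly like the sketch below. It reuses the Llama, Whisper, and XTTS classes and the fastapi_app sketched earlier; receive_audio_chunks and sentence_stream are hypothetical helpers, and the Modal calls are shown synchronously for brevity (a real handler would use Modal's async variants so the event loop isn't blocked):

from fastapi import WebSocket


@fastapi_app.websocket("/pipeline")
async def pipeline(ws: WebSocket):
    await ws.accept()

    # 1. transcribe audio chunks in parallel as they arrive, via .spawn()
    jobs = []
    async for chunk in receive_audio_chunks(ws):  # hypothetical helper
        jobs.append(Whisper().transcribe.spawn(chunk))
    transcript = " ".join(job.get() for job in jobs)

    # 2. stream the language model's reply and cut it into sentences
    sentences = sentence_stream(Llama().generate.remote_gen(transcript))  # hypothetical helper

    # 3. synthesize sentences in parallel with .map() and send audio back
    #    to the browser as soon as each chunk is ready
    for wav_bytes in XTTS().speak.map(sentences):
        await ws.send_bytes(wav_bytes)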
In addition, we add a /prewarm endpoint, to be called by the client before they run the first /pipeline request. This ensures all on-GPU models are loaded and ready to go.
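One simple way to implement this (an assumption, not necessarily how the repo does it) is to give each inference class a no-op wake method, not shown in the sketches above, whose only job is to trigger a container boot and its @modal.enter() model loading, and then spawn it from the endpoint:

@fastapi_app.get("/prewarm")
def prewarm():
    # .spawn() returns immediately, so the client isn't blocked while GPUs warm up
    Whisper().wake.spawn()  # `wake` is a hypothetical no-op @modal.method()
    Llama().wake.spawn()
    XTTS().wake.spawn()
    return {"status": "warming"}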
We use the Web Audio API to record snippets of audio from the user's microphone. The file src/frontend/processor.js defines an AudioWorkletProcessor that distinguishes between speech and silence and emits events for speech buffers so we can transcribe them.
The frontend maintains a state machine to manage the state of the conversation. This is implemented with the help of the incredible XState library.
The code for this entire example is available on GitHub. Follow the instructions in the README to deploy it yourself on Modal.