Merge pull request #22 from modal-labs/charlesfrye/merge-erik-dunteman
update example, add testing
charlesfrye authored Sep 18, 2024
2 parents c9e8a0f + 756c221 commit 576003a
Showing 28 changed files with 1,430 additions and 1,326 deletions.
21 changes: 21 additions & 0 deletions .github/actions/setup/action.yml
@@ -0,0 +1,21 @@
name: setup

description: Set up a Python environment.

inputs:
  version:
    description: Which Python version to install
    required: false
    default: "3.11"

runs:
  using: composite
  steps:
    - name: Install Python
      uses: actions/setup-python@v5
      with:
        python-version: ${{ inputs.version }}

    - name: Install project requirements
      shell: bash
      run: pip install -r requirements/requirements.txt
29 changes: 29 additions & 0 deletions .github/workflows/check.yml
@@ -0,0 +1,29 @@
name: Check
on:
  push:
    branches:
      - main
  pull_request:
  workflow_dispatch:
  schedule:
    - cron: "17 9 * * *"

jobs:
  e2etest:
    name: e2etest
    runs-on: ubuntu-20.04
    env:
      MODAL_TOKEN_ID: ${{ secrets.MODAL_MODAL_LABS_TOKEN_ID }}
      MODAL_TOKEN_SECRET: ${{ secrets.MODAL_MODAL_LABS_TOKEN_SECRET }}
      MODAL_ENVIRONMENT: main
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 1
      - uses: ./.github/actions/setup
      - name: Run testing script
        run: |
          pip install -r requirements/requirements-dev.txt
          modal token set --token-id $MODAL_TOKEN_ID --token-secret $MODAL_TOKEN_SECRET
          modal serve src.app &
          python tests/e2e_test.py
1 change: 1 addition & 0 deletions .gitignore
@@ -1,2 +1,3 @@
**/__pycache__
venv/
.venv
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

-Copyright (c) 2023 Modal Labs
+Copyright (c) 2024 Modal Labs

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
30 changes: 21 additions & 9 deletions README.md
@@ -4,9 +4,7 @@ A complete chat app that transcribes audio in real-time, streams back a response

This repo is meant to serve as a starting point for your own language model-based apps, as well as a playground for experimentation. Contributions are welcome and encouraged!

-![quillman](https://user-images.githubusercontent.com/5786378/233804923-c13627de-97db-4050-a36b-62d955db9c19.gif)
-
-The language model used is [Zephyr](https://arxiv.org/abs/2310.16944). [OpenAI Whisper](https://github.com/openai/whisper) is used for transcription, and [Metavoice Tortoise TTS](https://github.com/metavoicexyz/tortoise-tts) is used for text-to-speech. The entire app, including the frontend, is made to be deployed serverlessly on [Modal](http://modal.com/).
+OpenAI [Whisper V3](https://huggingface.co/openai/whisper-large-v3) is used to produce a transcript, which is then passed into the [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) language model to generate a response, which is then synthesized by Coqui's [XTTS](https://github.com/coqui-ai/TTS) text-to-speech model. All together, this produces a voice-to-voice chat experience.

You can find the demo live [here](https://modal-labs--quillman-web.modal.run/).

@@ -16,9 +14,9 @@ You can find the demo live [here](https://modal-labs--quillman-web.modal.run/).

1. React frontend ([`src/frontend/`](./src/frontend/))
2. FastAPI server ([`src/app.py`](./src/app.py))
-3. Whisper transcription module ([`src/transcriber.py`](./src/transcriber.py))
-4. Tortoise text-to-speech module ([`src/tts.py`](./src/tts.py))
-5. Zephyr language model module ([`src/llm_zephyr.py`](./src/llm_zephyr.py))
+3. Whisper transcription module ([`src/whisper.py`](./src/whisper.py))
+4. XTTS text-to-speech module ([`src/xtts.py`](./src/xtts.py))
+5. Llama 3.1 text generation module ([`src/llama.py`](./src/llama.py))

Read the accompanying [docs](https://modal.com/docs/examples/llm-voice-chat) for a detailed look at each of these components.

@@ -30,17 +28,31 @@ Read the accompanying [docs](https://modal.com/docs/examples/llm-voice-chat) for
- A [Modal](http://modal.com/) account
- A Modal token set up in your environment (`modal token new`)

-### Develop on Modal
+### Developing the inference modules

The Whisper, XTTS, and Llama modules each define a [local_entrypoint()](https://modal.com/docs/reference/modal.App#local_entrypoint) that runs when you invoke that file directly.
This is useful for testing each module standalone, without needing to run the whole app.

For example, to test the Whisper transcription module, run:
```shell
modal run -q src.whisper
```

### Developing the HTTP server and frontend

The HTTP server at `src/app.py` is a [FastAPI](https://fastapi.tiangolo.com/) app that chains the inference modules into a single pipeline.

It also serves the frontend as static files.

To [serve](https://modal.com/docs/guide/webhooks#developing-with-modal-serve) the app on Modal, run this command from the root directory of this repo:

```shell
modal serve src.app
```

-In the terminal output, you'll find a URL that you can visit to use your app. While the `modal serve` process is running, changes to any of the project files will be automatically applied. `Ctrl+C` will stop the app.
+In the terminal output, you'll find a URL that you can visit to use your app. While the `modal serve` process is running, changes to any of the project files will be automatically applied. `Ctrl+C` will stop the app. Note that for frontend changes, the browser cache will need to be cleared.

-### Deploy to Modal
+### Deploying to Modal

Once you're happy with your changes, [deploy](https://modal.com/docs/guide/managing-deployments#creating-deployments) your app:

107 changes: 107 additions & 0 deletions README_DOCS.md
@@ -0,0 +1,107 @@
# **QuiLLMan: Voice Chat with LLMs**

[QuiLLMan](https://github.com/modal-labs/quillman) is a complete voice chat application built on Modal: you speak and the chatbot speaks back!

OpenAI's [Whisper V3](https://huggingface.co/openai/whisper-large-v3) is used to produce a transcript, which is then passed into the [Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) language model to generate a response, which is then synthesized by Coqui's [XTTS](https://github.com/coqui-ai/TTS) text-to-speech model. All together, this produces a voice-to-voice chat experience.

We’ve enjoyed playing around with QuiLLMan enough at Modal HQ that we decided to [share the repo](https://github.com/modal-labs/quillman) and put up [a live demo](https://modal-labs--quillman-web.modal.run/).

Everything — the React frontend, the backend API, the LLMs — is deployed serverlessly, allowing it to automatically scale and ensuring you only pay for the compute you use. Read on to see how Modal makes this easy!

This post provides a high-level walkthrough of the [repo](https://github.com/modal-labs/quillman). We’re looking to add more models and features to this as time goes on, and contributions are welcome!

## **Code overview**

Traditionally, building a robust serverless web application as complex as QuiLLMan would require a lot of work — you’re setting up a backend API and three different inference modules, running in separate custom containers and autoscaling independently.

But with Modal, it’s as simple as writing 4 different classes and running a CLI command.

Our project structure looks like this:

1. [Language model module](https://modal.com/docs/examples/llm-voice-chat#language-model): continues a text conversation with a text reply.
2. [Transcription module](https://modal.com/docs/examples/llm-voice-chat#transcription): converts speech audio into text.
3. [Text-to-speech module](https://modal.com/docs/examples/llm-voice-chat#text-to-speech): converts text into speech.
4. [FastAPI server](https://modal.com/docs/examples/llm-voice-chat#fastapi-server): runs server-side app logic.
5. [React frontend](https://modal.com/docs/examples/llm-voice-chat#react-frontend): runs client-side interaction logic.

Let’s go through each of these components in more detail.

You’ll want to have the code handy — look for GitHub links in this guide to see the code for each component.

### **Language model**

Language models are trained to predict what text will come at the end of incomplete text. From this simple task emerge the sparks of artificial general intelligence.

In this case, we want to predict the text that a helpful, friendly assistant might write to continue a conversation with a user.

As with all Modal applications, we start by describing the environment (the container `Image`), which we construct via Python method chaining:

```python
llama_image = (
    modal.Image.debian_slim(python_version="3.10")
    .pip_install(
        "transformers==4.44.2",
        "vllm==0.6.0",
        "torch==2.4.0",
        "hf_transfer==0.1.8",
        "huggingface_hub==0.24.6",
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)
```


The chain starts from a base Debian Slim container with Python 3.10, then uses [`pip` to `install` the Python packages we need](https://modal.com/docs/reference/modal.Image#pip_install). Pinning the versions of our dependencies ensures that the built image is reproducible.

We use [vLLM](https://github.com/vllm-project/vllm), a high-performance open-source library for running large language models on GPUs, to run the Llama model. vLLM batches concurrent requests on the GPU, keeping costs down for our LLM module.

Our model class defines a `generate` function that constructs an input to the language model from a prompt template, the conversation history, and the latest text from the user, then `yield`s (streams) words as they are produced. Remote Python generators work out-of-the-box in Modal, so building streaming interactions is easy.
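
As a rough sketch of how this fits together on Modal (the class structure, GPU choice, and sampling parameters below are illustrative simplifications, not the repo's exact code), the module might look something like this:

```python
import modal

app = modal.App("quillman-llama-sketch")  # hypothetical app name, for illustration

llama_image = (
    modal.Image.debian_slim(python_version="3.10")
    .pip_install("vllm==0.6.0", "torch==2.4.0")
)


@app.cls(image=llama_image, gpu="A100")  # GPU choice is illustrative
class Llama:
    @modal.enter()
    def load(self):
        # Build the engine once per container so the weights stay loaded on the GPU.
        from vllm import AsyncEngineArgs, AsyncLLMEngine

        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
        )

    @modal.method()
    async def generate(self, prompt: str):
        import uuid

        from vllm import SamplingParams

        previous = ""
        # vLLM yields the cumulative completion; emit only the newly generated piece.
        async for output in self.engine.generate(
            prompt, SamplingParams(temperature=0.7, max_tokens=512), uuid.uuid4().hex
        ):
            text = output.outputs[0].text
            yield text[len(previous):]
            previous = text
```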

Although we’re going to call this model from our backend API, it’s useful to test it directly as well. To do this, we define a [`local_entrypoint`](https://modal.com/docs/guide/apps#entrypoints-for-ephemeral-apps):

```python
@app.local_entrypoint()
def main(prompt: str):
    model = Llama()
    for val in model.generate.remote_gen(prompt):
        print(val, end="", flush=True)
```


Now, we can [`run`](https://modal.com/docs/guide/apps#ephemeral-apps) the model with a prompt of our choice from the terminal:

```shell
modal run -q src.llama --prompt "How do antihistamines work?"
```

### **Transcription**

In the Whisper module, we define a Modal class that uses [OpenAI’s Whisper V3](https://huggingface.co/openai/whisper-large-v3) to transcribe audio in real-time.

To speed up transcription, we use [Flash-Attention](https://github.com/Dao-AILab/flash-attention), which requires some additional container customization, such as building from a CUDA-devel image to get access to `nvcc`. [Check out this guide](https://modal.com/docs/guide/cuda) if you'd like to understand how CUDA works on Modal.

We’re using an [A10G GPU](https://www.nvidia.com/en-us/data-center/products/a10-gpu/) for transcriptions, which lets us transcribe most segments in under 2 seconds.
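
A stripped-down sketch of the module is below; the image details, GPU choice, and method names are illustrative, and the flash-attn install itself is omitted for brevity:

```python
import modal

app = modal.App("quillman-whisper-sketch")  # hypothetical app name, for illustration

# A CUDA "devel" base image provides nvcc, which flash-attn needs at build time.
whisper_image = (
    modal.Image.from_registry("nvidia/cuda:12.1.1-devel-ubuntu22.04", add_python="3.11")
    .apt_install("ffmpeg")  # lets the transformers pipeline decode raw audio bytes
    .pip_install("torch==2.4.0", "transformers==4.44.2", "accelerate")
    # flash-attn would be pip-installed here as well; build flags omitted in this sketch
)


@app.cls(image=whisper_image, gpu="A10G")
class Whisper:
    @modal.enter()
    def load(self):
        import torch
        from transformers import pipeline

        # Load the model once per container and keep it resident on the GPU.
        self.pipe = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-large-v3",
            torch_dtype=torch.float16,
            device="cuda",
        )

    @modal.method()
    def transcribe(self, audio_bytes: bytes) -> str:
        # Each call handles one independent chunk of audio.
        return self.pipe(audio_bytes)["text"]
```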



### **Text-to-speech**

The text-to-speech module uses Coqui's [XTTS](https://github.com/coqui-ai/TTS) text-to-speech model, which offers a variety of voices and languages. We parallelize TTS generation by splitting transcripts into smaller chunks and synthesizing them across multiple GPUs.
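
A minimal sketch of such a module, assuming Coqui's Python `TTS` API (the GPU choice and reference-voice path are illustrative, not the repo's exact code):

```python
import modal

app = modal.App("quillman-xtts-sketch")  # hypothetical app name, for illustration

xtts_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("TTS==0.22.0", "torch==2.4.0")
    .env({"COQUI_TOS_AGREED": "1"})  # accept the Coqui model license non-interactively
)


@app.cls(image=xtts_image, gpu="A10G")
class XTTS:
    @modal.enter()
    def load(self):
        from TTS.api import TTS

        # Load the multilingual XTTS v2 model onto the GPU once per container.
        self.tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

    @modal.method()
    def speak(self, sentence: str) -> bytes:
        import numpy as np

        # A short reference clip sets the output voice (path here is hypothetical).
        wav = self.tts.tts(text=sentence, speaker_wav="/voice_sample.wav", language="en")
        return np.array(wav, dtype=np.float32).tobytes()
```

Because each sentence is synthesized independently, fanning the work out with `XTTS().speak.map(...)` spreads the chunks across however many GPU containers Modal spins up.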

### **FastAPI server**

The file `app.py` is a [FastAPI](https://fastapi.tiangolo.com/) app that chains the inference modules into a single pipeline. We can serve this app over the internet without any extra effort on top of writing the `localhost` version by slapping on an [`@asgi_app`](https://modal.com/docs/guide/webhooks#serving-asgi-and-wsgi-apps) decorator.
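
The shape of that wrapper is roughly the following minimal sketch (the placeholder route is ours, not the repo's):

```python
import modal

app = modal.App("quillman-web-sketch")  # hypothetical app name, for illustration

web_image = modal.Image.debian_slim(python_version="3.11").pip_install("fastapi[standard]")


@app.function(image=web_image)
@modal.asgi_app()
def web():
    from fastapi import FastAPI

    api = FastAPI()

    @api.get("/health")  # placeholder route, just to show the shape
    def health():
        return {"ok": True}

    return api
```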

To make the experience as conversational as possible, it's important to focus on reducing the latency between the end of the user's speech and the start of the reply audio playback.

Because we're running on GPUs, we have an advantage: every stage of the pipeline runs faster than real time.

Transcription happens faster than the user's real-time speech. So while the user is speaking, we can stream that audio to the server in chunks, and have the transcript mostly complete by the time the user finishes speaking.

And text generation and text-to-speech happen faster than audio playback. So if we generate, synthesize, and return the first sentence of the response ASAP, we can get it playing in the browser while we finish the rest of the response in the background.

This keeps the user's perceived latency as short as possible: most of the real latency is hidden behind the spoken interaction itself.

We use these techniques in `/pipeline` to achieve this:
1. The `/pipeline` endpoint is a websocket connection, enabling audio to be streamed in chunks.
2. All intermediate data in the pipeline is streamed between inference modules through Python generators.
3. We start transcribing even before the user finishes speaking.
4. The transcription is done in parallel using Modal [spawn](https://modal.com/docs/reference/modal.Function#spawn), since each chunk is independent of the others. We use spawn rather than map since we're in an async block while receiving audio from the websocket.
5. The text response generation is a Modal [remote generator function](https://modal.com/docs/reference/modal.Function#remote_gen), streaming output as it's generated. This can be fed directly into the text-to-speech module, even before the full response text is available.
6. The text-to-speech is done in parallel using Modal [map](https://modal.com/docs/reference/modal.Function#map), since each sentence is independent of the others.
7. We stream synthesized audio back to the client as it's generated, so the browser can start playback as soon as possible.
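
Put together, those techniques give the websocket handler roughly the shape sketched below, reusing the `Whisper`, `Llama`, and `XTTS` classes sketched above. The end-of-speech marker and the `sentence_stream` helper are illustrative inventions, and a real handler would use the async variants of Modal's calls so it never blocks the event loop; treat this as a shape, not the repo's actual code.

```python
from fastapi import FastAPI, WebSocket

api = FastAPI()


def sentence_stream(pieces):
    # Hypothetical helper: accumulate streamed text and emit complete sentences.
    buffer = ""
    for piece in pieces:
        buffer += piece
        while any(p in buffer for p in ".!?"):
            cut = min(buffer.find(p) for p in ".!?" if p in buffer) + 1
            yield buffer[:cut].strip()
            buffer = buffer[cut:]
    if buffer.strip():
        yield buffer.strip()


@api.websocket("/pipeline")
async def pipeline(ws: WebSocket):
    await ws.accept()

    # 1. Spawn a transcription job per audio chunk while the user is still speaking.
    handles = []
    while True:
        chunk = await ws.receive_bytes()
        if chunk == b"<END>":  # hypothetical end-of-speech marker
            break
        handles.append(Whisper().transcribe.spawn(chunk))

    # 2. Chunks are independent, so their transcripts can simply be joined in order.
    transcript = " ".join(handle.get() for handle in handles)

    # 3. Stream the LLM reply and cut it into sentences as soon as they complete.
    sentences = sentence_stream(Llama().generate.remote_gen(transcript))

    # 4. Fan the sentences out to TTS and return audio as soon as it is synthesized.
    for wav_bytes in XTTS().speak.map(sentences):
        await ws.send_bytes(wav_bytes)
```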

In addition, we add a `/prewarm` endpoint, to be called by the client before it issues its first `/pipeline` request. This ensures all on-GPU models are loaded and ready to go.


### **React frontend**

We use the [Web Audio API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API) to record snippets of audio from the user’s microphone. The file [`src/frontend/processor.js`](https://github.com/modal-labs/quillman/blob/main/src/frontend/processor.js) defines an [AudioWorkletProcessor](https://developer.mozilla.org/en-US/docs/Web/API/AudioWorkletProcessor) that distinguishes between speech and silence, and emits events for speech buffers so we can transcribe them.

The frontend maintains a state machine to manage the state of the conversation. This is implemented with the help of the incredible [XState](https://github.com/statelyai/xstate) library.

## **Steal this example**

The code for this entire example is [available on GitHub](https://github.com/modal-labs/quillman). Follow the instructions in the README for how to run or deploy it yourself on Modal.
3 changes: 3 additions & 0 deletions requirements/requirements-dev.txt
@@ -0,0 +1,3 @@
-r requirements.txt
requests
websockets
2 changes: 2 additions & 0 deletions requirements/requirements.txt
@@ -0,0 +1,2 @@
modal
# that's it :)
Binary file added src/.DS_Store