Merge pull request #22 from modal-labs/charlesfrye/merge-erik-dunteman
update example, add testing

Showing 28 changed files with 1,430 additions and 1,326 deletions.

```
@@ -0,0 +1,21 @@
name: setup

description: Set up a Python environment.

inputs:
  version:
    description: Which Python version to install
    required: false
    default: "3.11"

runs:
  using: composite
  steps:
    - name: Install Python
      uses: actions/setup-python@v5
      with:
        python-version: ${{ inputs.version }}

    - name: Install project requirements
      shell: bash
      run: pip install -r requirements/requirements.txt
```

```
@@ -0,0 +1,29 @@
name: Check
on:
  push:
    branches:
      - main
  pull_request:
  workflow_dispatch:
  schedule:
    - cron: "17 9 * * *"

jobs:
  e2etest:
    name: e2etest
    runs-on: ubuntu-20.04
    env:
      MODAL_TOKEN_ID: ${{ secrets.MODAL_MODAL_LABS_TOKEN_ID }}
      MODAL_TOKEN_SECRET: ${{ secrets.MODAL_MODAL_LABS_TOKEN_SECRET }}
      MODAL_ENVIRONMENT: main
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 1
      - uses: ./.github/actions/setup
      - name: Run testing script
        run: |
          pip install -r requirements/requirements-dev.txt
          modal token set --token-id $MODAL_TOKEN_ID --token-secret $MODAL_TOKEN_SECRET
          modal serve src.app &
          python tests/e2e_test.py
```

```
@@ -1,2 +1,3 @@
**/__pycache__
venv/
.venv
```

@@ -0,0 +1,107 @@

# **QuiLLMan: Voice Chat with LLMs**

[QuiLLMan](https://github.com/modal-labs/quillman) is a complete voice chat application built on Modal: you speak and the chatbot speaks back!

OpenAI's [Whisper V3](https://huggingface.co/openai/whisper-large-v3) is used to produce a transcript, which is then passed into the [Zephyr](https://arxiv.org/abs/2310.16944) language model to generate a response, which is then synthesized by Coqui's [XTTS](https://github.com/coqui-ai/TTS) text-to-speech model. All together, this produces a voice-to-voice chat experience.

We’ve enjoyed playing around with QuiLLMan enough at Modal HQ that we decided to [share the repo](https://github.com/modal-labs/quillman) and put up [a live demo](https://modal-labs--quillman-web.modal.run/).

Everything — the React frontend, the backend API, the LLMs — is deployed serverlessly, allowing it to automatically scale and ensuring you only pay for the compute you use. Read on to see how Modal makes this easy!

This post provides a high-level walkthrough of the [repo](https://github.com/modal-labs/quillman). We’re looking to add more models and features to this as time goes on, and contributions are welcome!

## **Code overview**

Traditionally, building a robust serverless web application as complex as QuiLLMan would require a lot of work — you’re setting up a backend API and three different inference modules, running in separate custom containers and autoscaling independently.

But with Modal, it’s as simple as writing 4 different classes and running a CLI command.

Our project structure looks like this:

1. [Language model module](https://modal.com/docs/examples/llm-voice-chat#language-model): continues a text conversation with a text reply.
2. [Transcription module](https://modal.com/docs/examples/llm-voice-chat#transcription): converts speech audio into text.
3. [Text-to-speech module](https://modal.com/docs/examples/llm-voice-chat#text-to-speech): converts text into speech.
4. [FastAPI server](https://modal.com/docs/examples/llm-voice-chat#fastapi-server): runs server-side app logic.
5. [React frontend](https://modal.com/docs/examples/llm-voice-chat#react-frontend): runs client-side interaction logic.

Let’s go through each of these components in more detail.

You’ll want to have the code handy — look for GitHub links in this guide to see the code for each component.

### **Language model**

Language models are trained to predict what text will come at the end of incomplete text. From this simple task emerge the sparks of artificial general intelligence.

In this case, we want to predict the text that a helpful, friendly assistant might write to continue a conversation with a user.

As with all Modal applications, we start by describing the environment (the container `Image`), which we construct via Python method chaining:

```python
llama_image = (
    modal.Image.debian_slim(python_version="3.10")
    .pip_install(
        "transformers==4.44.2",
        "vllm==0.6.0",
        "torch==2.4.0",
        "hf_transfer==0.1.8",
        "huggingface_hub==0.24.6",
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)
```

The chain starts from a slim Debian base image with `python_version` 3.10 installed, then uses [`pip` to `install` the Python packages we need](https://modal.com/docs/reference/modal.Image#pip_install). Pinning the versions of our dependencies ensures that the built image is reproducible.

We use [vLLM](https://github.com/vllm-project/vllm), a high-performance open-source library for serving large language models, to run the Llama model. This server scales to handle multiple concurrent requests, keeping costs down for our LLM module.

The model class defines a `generate` function that constructs an input to our language model from a prompt template, the conversation history, and the latest text from the user. Then, it `yield`s (streams) words as they are produced. Remote Python generators work out-of-the-box in Modal, so building streaming interactions is easy.

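
To make that concrete, here is a minimal sketch of what such a class can look like on Modal. The `Llama` class name and the streaming `generate` method match the ones used in this guide, but the checkpoint, prompt handling, and vLLM invocation details are illustrative assumptions rather than the repo's exact implementation:

```python
import modal

sketch_image = (
    modal.Image.debian_slim(python_version="3.10")
    .pip_install("vllm==0.6.0", "torch==2.4.0")
)

app = modal.App("llama-sketch")


@app.cls(image=sketch_image, gpu="A10G")
class Llama:
    @modal.enter()
    def load(self):
        # Load the model once per container start, so warm requests skip this step.
        from vllm import LLM, SamplingParams

        # Small stand-in checkpoint; an assumption, not necessarily what QuiLLMan deploys.
        self.llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
        self.sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

    @modal.method()
    def generate(self, prompt: str, history: list[str] | None = None):
        # Build the model input from the conversation history plus the new message.
        full_prompt = "\n".join((history or []) + [f"USER: {prompt}", "ASSISTANT:"])

        # vLLM's offline API returns the completed text; we yield it word by word so
        # callers can consume it as a stream via `.remote_gen`. The real app streams
        # tokens as they are generated.
        output = self.llm.generate([full_prompt], self.sampling_params)[0]
        for word in output.outputs[0].text.split():
            yield word + " "
```
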
Although we’re going to call this model from our backend API, it’s useful to test it directly as well. To do this, we define a [`local_entrypoint`](https://modal.com/docs/guide/apps#entrypoints-for-ephemeral-apps):

```python
@app.local_entrypoint()
def main(prompt: str):
    model = Llama()
    for val in model.generate.remote_gen(prompt):
        print(val, end="", flush=True)
```

Now, we can [`run`](https://modal.com/docs/guide/apps#ephemeral-apps) the model with a prompt of our choice from the terminal:

```bash
modal run -q src.llama --prompt "How do antihistamines work?"
```

### **Transcription**

In the Whisper module, we define a Modal class that uses [OpenAI’s Whisper V3](https://huggingface.co/openai/whisper-large-v3) to transcribe audio in real time.

To speed up transcription, we use [Flash-Attention](https://github.com/Dao-AILab/flash-attention), which requires some additional container customization, such as using a CUDA-devel base image to get access to `nvcc`. [Check out this guide](https://modal.com/docs/guide/cuda) if you'd like to understand how CUDA works on Modal.

We’re using an [A10G GPU](https://www.nvidia.com/en-us/data-center/products/a10-gpu/) for transcriptions, which lets us transcribe most segments in under 2 seconds.

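
Stripped of the Flash-Attention and CUDA-devel customizations described above, the transcription module boils down to something like the following sketch (the class and method names, and the use of the `transformers` pipeline API, are assumptions):

```python
import modal

whisper_image = (
    modal.Image.debian_slim(python_version="3.10")
    .apt_install("ffmpeg")  # needed to decode raw audio bytes
    .pip_install("transformers==4.44.2", "torch==2.4.0")
)

app = modal.App("whisper-sketch")


@app.cls(image=whisper_image, gpu="A10G")
class Whisper:
    @modal.enter()
    def load(self):
        from transformers import pipeline

        # Load Whisper V3 once per container start.
        self.pipe = pipeline(
            "automatic-speech-recognition",
            model="openai/whisper-large-v3",
            device="cuda",
        )

    @modal.method()
    def transcribe(self, audio: bytes) -> str:
        # `audio` is an encoded audio segment (e.g. WAV bytes) sent by the server.
        return self.pipe(audio)["text"]
```
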
### **Text-to-speech**

The text-to-speech module uses the Coqui [XTTS](https://github.com/coqui-ai/TTS) text-to-speech model, offering a variety of voices and languages. We're able to parallelize TTS generation by splitting transcripts into smaller chunks and synthesizing them in parallel across multiple GPUs.

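
The fan-out is just Modal's `map` applied to a list of text chunks. Here is a self-contained sketch of the pattern with a stand-in class; the real module wraps Coqui XTTS, which is omitted here:

```python
import modal

app = modal.App("tts-map-sketch")


@app.cls()  # the real class would request a GPU here, e.g. gpu="A10G"
class XTTS:  # stand-in for the real Coqui XTTS wrapper
    @modal.method()
    def speak(self, sentence: str) -> bytes:
        # The real implementation runs XTTS inference here and returns audio bytes.
        return sentence.encode()  # placeholder "audio"


@app.local_entrypoint()
def main():
    reply = "Antihistamines block H1 receptors. That relieves allergy symptoms."
    sentences = [s.strip() + "." for s in reply.split(".") if s.strip()]

    # `.map` fans the sentences out across containers in parallel and yields
    # results in input order, so audio chunks can be streamed back in sequence.
    for audio in XTTS().speak.map(sentences):
        print(f"got {len(audio)} bytes of audio")
```
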
### **FastAPI server**

The file `app.py` is a [FastAPI](https://fastapi.tiangolo.com/) app that chains the inference modules into a single pipeline. We can serve this app over the internet without any extra effort on top of writing the `localhost` version by slapping on an [`@asgi_app`](https://modal.com/docs/guide/webhooks#serving-asgi-and-wsgi-apps) decorator.

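
Hooking a FastAPI app up to Modal looks roughly like this bare-bones sketch; the real `app.py` additionally wires its routes into the three inference modules:

```python
import modal

web_image = modal.Image.debian_slim(python_version="3.10").pip_install("fastapi[standard]==0.115.4")

app = modal.App("web-sketch")


@app.function(image=web_image)
@modal.asgi_app()
def web():
    from fastapi import FastAPI

    web_app = FastAPI()

    @web_app.get("/prewarm")
    def prewarm():
        # In the real app, this kicks off loading of the GPU-backed models.
        return {"status": "ok"}

    return web_app
```

Running `modal serve` on a file like this exposes a temporary development URL, and `modal deploy` publishes it.
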
To make the experience as conversational as possible, it's important to minimize the latency between the end of the user's speech and the start of the reply audio playback.

Because we're running on GPUs, we have an advantage: every stage of the pipeline runs faster than real time.

Transcription happens faster than the user's real-time speech. So while the user is speaking, we can stream their audio to the server in chunks and have the transcript mostly complete by the time they finish speaking.

And text generation and text-to-speech happen faster than audio playback. So if we generate, synthesize, and return the first sentence of the response ASAP, we can start playing it in the browser while we finish the rest of the response in the background.

This keeps the user's perceived latency as short as possible: most of the processing is hidden behind the spoken interaction itself.

We use these techniques in `/pipeline` to achieve this:

1. The `/pipeline` endpoint is a websocket connection, enabling audio to be streamed in chunks.
2. All intermediate data in the pipeline is streamed between inference modules through Python generators.
3. We start transcribing even before the user finishes speaking.
4. The transcription is done in parallel using Modal [spawn](https://modal.com/docs/reference/modal.Function#spawn), since each chunk is independent of the others. We use `spawn` rather than `map` because we're inside an async block while receiving audio from the websocket.
5. The text response generation is a Modal [remote generator function](https://modal.com/docs/reference/modal.Function#remote_gen), streaming output as it's generated. This can be fed directly into the text-to-speech module, even before the full response text is available (see the sentence-streaming sketch after this list).
6. The text-to-speech is done in parallel using Modal [map](https://modal.com/docs/reference/modal.Function#map), since each sentence is independent of the others.
7. We stream synthesized audio back to the client as it's generated, so the browser can start playback as soon as possible.

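
As a small, self-contained illustration of the sentence-streaming idea from step 5, here is one way to group a token stream into complete sentences so each can be handed to text-to-speech the moment it is finished (the token stream below is faked for illustration):

```python
def sentence_stream(tokens):
    """Group a stream of tokens into complete sentences as they arrive."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush anything left over at the end of the stream
        yield buffer.strip()


fake_tokens = ["Antihistamines ", "block ", "H1 ", "receptors. ", "That ", "relieves ", "symptoms."]
for sentence in sentence_stream(fake_tokens):
    # In the app, each completed sentence would be sent to the TTS module
    # immediately, instead of waiting for the full response.
    print(sentence)
```
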
In addition, we add a `/prewarm` endpoint, to be called by the client before it sends the first `/pipeline` request. This ensures all on-GPU models are loaded and ready to go.

### **React frontend**

We use the [Web Audio API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API) to record snippets of audio from the user’s microphone. The file [`src/frontend/processor.js`](https://github.com/modal-labs/quillman/blob/main/src/frontend/processor.js) defines an [AudioWorkletProcessor](https://developer.mozilla.org/en-US/docs/Web/API/AudioWorkletProcessor) that distinguishes between speech and silence and emits events for speech buffers so we can transcribe them.

The frontend maintains a state machine to manage the state of the conversation. This is implemented with the help of the incredible [XState](https://github.com/statelyai/xstate) library.

## **Steal this example**

The code for this entire example is [available on GitHub](https://github.com/modal-labs/quillman). Follow the instructions in the README to run or deploy it yourself on Modal.

```
@@ -0,0 +1,3 @@
-r requirements.txt
requests
websockets
```

```
@@ -0,0 +1,2 @@
modal
# that's it :)
```