---
title: "Announcing vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving"
author: "The vLLM-Omni Team"
---

We are excited to announce the official release of **vLLM-Omni**, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.

Since its inception, vLLM has focused on high-throughput, memory-efficient serving for Large Language Models (LLMs). However, the landscape of generative AI is shifting rapidly. Models are no longer just text-in, text-out: today's state-of-the-art models reason across text, images, audio, and video, and they generate heterogeneous outputs using diverse architectures.

**vLLM-Omni** answers this shift, bringing vLLM's performance and efficiency to multi-modal and non-autoregressive inference.

<p align="center">
<img src="/assets/figures/vllm-omni-logo-text-dark.png" alt="vLLM Omni Logo" width="60%">
</p>

## **Why vLLM-Omni?**

Traditional serving engines were optimized for text-based autoregressive (AR) tasks. As models evolve into "omni" agents that can see, hear, and speak, the serving infrastructure must evolve with them.

vLLM-Omni addresses three critical shifts in model architecture:

1. **True omni-modality:** processing and generating text, image, video, and audio seamlessly.
2. **Beyond autoregression:** extending vLLM's efficient memory management to **Diffusion Transformers (DiT)** and other parallel generation models.
3. **Heterogeneous pipelines:** managing complex workflows in which a single request might trigger a visual encoder, an AR reasoning step, and a diffusion-based video generation step.

## **Inside the Architecture**

vLLM-Omni is not just a wrapper; it is a re-imagining of how vLLM handles data flow. It introduces a fully disaggregated pipeline that allows resources to be allocated dynamically across the different stages of generation.

<p align="center">
<img src="/assets/figures/omni-modality-model-architecture.png" alt="Omni-modality model architecture" width="80%">
</p>

As shown above, the architecture unifies three distinct phases:

* **Modality encoders:** efficiently processing inputs (ViT, T5, etc.).
* **LLM core:** leveraging vLLM's PagedAttention for the autoregressive reasoning stage.
* **Modality generators:** high-performance serving for DiT and other decoding heads that produce rich media outputs.
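
To make the staged data flow concrete, here is a deliberately simplified Python sketch of how a single request could pass through the three phases. The class and function names are purely illustrative and are not vLLM-Omni's actual API; each stub stands in for a real encoder, the PagedAttention-backed LLM core, or a DiT generator.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the three phases; none of these names come
# from the real vLLM-Omni codebase.

@dataclass
class OmniRequest:
    prompt: str
    image_path: str | None = None
    outputs: dict = field(default_factory=dict)

def encode_modalities(req: OmniRequest) -> list[float]:
    """Modality encoder stage: turn raw inputs (e.g. an image) into embeddings."""
    return [0.0] * 16 if req.image_path else []  # a real ViT/audio encoder would run here

def llm_reason(req: OmniRequest, embeddings: list[float]) -> str:
    """LLM core stage: autoregressive reasoning over the prompt and encoded inputs."""
    return f"plan for '{req.prompt}' conditioned on {len(embeddings)} embedding dims"

def generate_media(req: OmniRequest, plan: str) -> None:
    """Modality generator stage: e.g. a DiT head producing image/video/audio bytes."""
    req.outputs["text"] = plan
    req.outputs["image"] = b"<generated-bytes>"  # placeholder for real media output

def run_pipeline(req: OmniRequest) -> OmniRequest:
    # Encoder -> LLM core -> generator, matching the figure above.
    embeddings = encode_modalities(req)
    plan = llm_reason(req, embeddings)
    generate_media(req, plan)
    return req

if __name__ == "__main__":
    result = run_pipeline(OmniRequest(prompt="a cat surfing", image_path="cat.png"))
    print(result.outputs["text"])
```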

### **Key Features**

* **Simplicity:** If you know how to use vLLM, you know how to use vLLM-Omni. We maintain seamless integration with Hugging Face models and offer an OpenAI-compatible API server.

<!-- TODO(@liuhongsheng): add the vLLM-Omni architecture figure -->

* **Flexibility:** The OmniStage abstraction provides a simple, straightforward way to support a variety of omni-modality models, including Qwen-Omni, Qwen-Image, and Stable Diffusion (SD) models.

* **Performance:** We use pipelined stage execution to overlap computation across stages, so that while one stage is processing, the others are not sitting idle (see the sketch below).

<!-- TODO(@zhoutaichang): add a figure illustrating pipelined stage execution -->
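
As a rough illustration of why pipelining helps, the sketch below uses plain Python `asyncio` (not vLLM-Omni code) to push several requests through two toy stages concurrently, so encoding of one request overlaps with generation of the previous one. The stage names and timings are invented for the example.

```python
import asyncio

# Toy two-stage pipeline: stage names and timings are invented for
# illustration; vLLM-Omni's real scheduler is far more sophisticated.

async def encode_stage(q_in: asyncio.Queue, q_out: asyncio.Queue) -> None:
    while (req := await q_in.get()) is not None:
        await asyncio.sleep(0.1)      # pretend encoder work
        await q_out.put(f"{req}:encoded")
    await q_out.put(None)             # propagate the shutdown signal

async def generate_stage(q_in: asyncio.Queue) -> None:
    while (item := await q_in.get()) is not None:
        await asyncio.sleep(0.1)      # pretend generation work
        print(f"finished {item}")

async def main() -> None:
    q1: asyncio.Queue = asyncio.Queue()
    q2: asyncio.Queue = asyncio.Queue()
    for i in range(4):
        q1.put_nowait(f"req{i}")
    q1.put_nowait(None)               # shutdown marker
    # Run both stages concurrently: while the encoder works on request N,
    # the generator is already finishing request N-1.
    await asyncio.gather(encode_stage(q1, q2), generate_stage(q2))

asyncio.run(main())
```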

## **Performance**

We benchmarked vLLM-Omni against Hugging Face Transformers to demonstrate the efficiency gains in omni-modal serving.

| Metric | vLLM-Omni | HF Transformers | Improvement |
| :---- | :---- | :---- | :---- |
| **Throughput** (req/s) | **TBD** | TBD | **TBD x** |
| **Latency** (TTFT, ms) | **TBD** | TBD | **TBD x** |
| **GPU memory** (GB) | **TBD** | TBD | **TBD %** |

*Note: Benchmarks were run on [Insert Hardware Specs] using [Insert Model Name].*

## **Future Roadmap**

vLLM-Omni is evolving rapidly. Our roadmap focuses on expanding model support and pushing efficient inference even further.

* **Expanded model support:** supporting a wider range of open-source omni-models and diffusion transformers as they emerge.
* **Deeper vLLM integration:** merging core omni features upstream to make multi-modality a first-class citizen across the entire vLLM ecosystem.
* **Diffusion acceleration:** parallel inference (DP/TP/SP/USP, ...), cache acceleration (TeaCache/DBCache, ...), and compute acceleration (quantization, sparse attention, ...).
* **Full disaggregation:** building on the OmniStage abstraction to fully disaggregate the encoder/prefill/decode/generation stages, improving throughput and reducing latency.
* **Hardware support:** expanding support for additional hardware backends through the hardware plugin system, so vLLM-Omni runs efficiently everywhere.

Contributions and collaborations from the open-source community are welcome.

## **Getting Started**

Getting started with vLLM-Omni is straightforward. The initial release is built on top of vLLM v0.11.0.

### **Installation**

First, set up your environment:

```bash
# Create a virtual environment
uv venv --python 3.12 --seed
source .venv/bin/activate

# Install the base vLLM
uv pip install vllm==0.11.0 --torch-backend=auto
```

Next, install the vLLM-Omni extension:

```bash
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install -e .
```
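
To confirm that the extension is importable, a quick check can't hurt. The module name `vllm_omni` below is an assumption based on the repository name; adjust it if the released package exposes a different module.

```python
# Sanity check; the module name `vllm_omni` is assumed from the repo name.
import vllm_omni

print(getattr(vllm_omni, "__version__", "vllm_omni imported successfully"))
```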

### **Running the Qwen3-Omni model**

<!-- TODO(@huayongxiang): add the Gradio example for Qwen3-Omni model inference -->
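
Until the Gradio demo lands, here is a minimal sketch of querying a Qwen3-Omni model through the OpenAI-compatible API server mentioned above. The server launch command, port, and model name are assumptions for illustration only; refer to the documentation and examples for the exact interface.

```python
# Minimal sketch: query a locally running vLLM-Omni OpenAI-compatible server.
# The model name, port, and server launch command below are assumptions,
# not vLLM-Omni's documented interface.
#
# Assumed server launch (adjust to the real CLI):
#   vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8000

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```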

Check out our [examples directory](https://github.com/vllm-project/vllm-omni/tree/main/examples) for scripts that launch image, audio, and video generation workflows.

## **Join the Community**

This is just the beginning for omni-modality serving. We are actively developing support for more architectures and invite the community to help shape the future of vLLM-Omni.

<!-- TODO(@gaohan): update the links after vllm-omni is released -->
* **Code & Docs:** [GitHub Repository](https://github.com/vllm-project/vllm-omni) | [Documentation](https://vllm-omni.readthedocs.io/en/latest/)
* **Weekly Meeting:** Join us every Wednesday at 11:30 (UTC+8) to discuss the roadmap and features. [Join here](https://tinyurl.com/vllm-omni-meeting).

Let's build the future of omni-modal serving together!