---
layout: post
title: "Notes on vLLM v.s. DeepSpeed"
title: "Notes on vLLM v.s. DeepSpeed-FastGen"
author: "vLLM Team"
---

---
**TL;DR:**

- vLLM matches DeepSpeed-FastGen's speed in common scenarios and surpasses it when handling longer outputs.
- DeepSpeed-FastGen only outperforms vLLM in scenarios with long prompts and short outputs, due to its Dynamic SplitFuse optimization. This optimization is on vLLM’s roadmap.
- vLLM’s mission is to build the fastest and easiest-to-use open-source LLM inference and serving engine. It is Apache 2.0-licensed and community-owned, offering extensive model and optimization support.

---

The DeepSpeed team recently published [a blog post](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen) claiming 2x throughput improvement over vLLM, achieved by leveraging the Dynamic SplitFuse technique.
We are happy to see the technological advances coming from the open-source community.
In this blog post, we show the specific scenarios where the Dynamic SplitFuse technique is advantageous, noting that these cases are relatively limited.
For the majority of workloads, vLLM is faster than (or performs comparably to) DeepSpeed-FastGen.


### Performance Benchmark

We've identified two key differences between vLLM and DeepSpeed-FastGen in terms of performance optimization:

1. **DeepSpeed-FastGen adopts a conservative/suboptimal memory allocation scheme**, which wastes memory when output lengths are large (see the back-of-the-envelope sketch below).
2. DeepSpeed-FastGen’s Dynamic SplitFuse scheduling gives **speedup only when prompt lengths are much greater than output lengths**.
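
To make the first point concrete, here is a back-of-the-envelope sketch in Python. The allocator behavior, block size, and request sizes are illustrative assumptions rather than measurements of either system: a conservative allocator reserves KV-cache space for the worst-case output at admission, while a paged allocator (in the style of vLLM's PagedAttention) holds only the blocks a request actually fills.

```python
# Back-of-the-envelope KV-cache accounting. BLOCK_SIZE and the request
# sizes below are illustrative assumptions, not measured values.

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM-style paging)


def reserved_tokens(prompt_len: int, max_output_len: int) -> int:
    """A conservative allocator reserves worst-case space up front."""
    return prompt_len + max_output_len


def paged_tokens(prompt_len: int, actual_output_len: int) -> int:
    """A paged allocator holds only the blocks actually filled."""
    used = prompt_len + actual_output_len
    num_blocks = -(-used // BLOCK_SIZE)  # ceiling division
    return num_blocks * BLOCK_SIZE


# A short-prompt request that stops well before the configured cap:
prompt, actual_out, max_out = 128, 900, 4096
print(reserved_tokens(prompt, max_out))   # 4224 tokens held throughout
print(paged_tokens(prompt, actual_out))   # 1040 tokens held at the end
```

Under these assumptions, the reserved-but-unused slack grows with the output-length cap; wasted reservations shrink the number of requests the GPU can batch, and throughput drops with it.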

As a result, DeepSpeed-FastGen outperforms vLLM when the workload consists of consistently long prompts and short outputs.
In other scenarios, vLLM shows superior performance.

#### Scenario 1: Long Prompt Length, Short Output
Here, DeepSpeed-FastGen's Dynamic SplitFuse scheduling is expected to shine.
However, the performance gain we observe is not as large as the claimed 2x.

<p align="center">
Expand All @@ -40,7 +40,7 @@ However, the performance gain we observe isn't as significant as 2x.
</p>
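
For context, Dynamic SplitFuse (as described in the DeepSpeed-FastGen blog post) splits long prompts into fixed-size chunks and fuses those chunks with the single-token decode steps of other requests, so each forward pass does a similar amount of work. A minimal scheduling sketch, with the chunk budget and data layout as our own assumptions:

```python
# Minimal sketch of Dynamic SplitFuse-style scheduling, as we understand
# it from the DeepSpeed-FastGen blog post. CHUNK is an assumed budget.
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_len


CHUNK = 256  # prompt tokens a request may contribute per forward pass


def schedule_step(requests: list[Request]) -> list[tuple[Request, int]]:
    """Build one fused batch: requests still in prefill contribute a
    chunk of prompt tokens; the rest contribute one decode token each."""
    batch = []
    for req in requests:
        if req.in_prefill:
            n = min(CHUNK, req.prompt_len - req.prefilled)
            req.prefilled += n
            batch.append((req, n))  # split: partial prefill this step
        else:
            batch.append((req, 1))  # fuse: decode rides in the same batch
    return batch
```

Because each step mixes bounded prefill chunks with decode tokens, decode requests are never stalled behind another request's full prefill, and per-step compute stays near a fixed budget. The benefit is largest when prompts are long relative to outputs, which is exactly this scenario.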

#### Scenario 2: Other cases
In these cases, vLLM is up to **1.8x** faster than DeepSpeed-FastGen.

<p align="center">
<picture>
Expand All @@ -58,10 +58,10 @@ Specifically for the Dynamic SplitFuse optimization, we are actively investigati

### Appendix: Feature Comparison

DeepSpeed-FastGen currently offers basic functionality, supporting only three model types and lacking popular features like stop strings and parallel sampling (e.g., beam search).
We expect DeepSpeed-FastGen to catch up quickly, and we welcome further creative innovation in the market!
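
For illustration, here is roughly what stop strings and parallel sampling look like in vLLM's offline Python API (argument names as in the v0.2.x releases contemporary with this post; the model choice is arbitrary):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # any supported model

params = SamplingParams(
    n=4,              # parallel sampling: four completions per prompt
    temperature=0.8,
    max_tokens=256,
    stop=["\n\n"],    # stop strings: cut generation at a delimiter
)

outputs = llm.generate(["Briefly explain paged KV-cache memory."], params)
for completion in outputs[0].outputs:
    print(completion.text)
```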

| | vLLM | DeepSpeed-FastGen |
|----------------------------|:---------------------------------------:|:-----------------------------------------------:|
| Runtime | Python/PyTorch | Python/PyTorch |
| Model implementation | HuggingFace Transformers | Custom implementation + converter for HF models |