---
layout: post
title: "Notes on vLLM v.s. DeepSpeed"
title: "Notes on vLLM v.s. DeepSpeed-FastGen"
author: "vLLM Team"
---

---
**TL;DR:**

- vLLM matches DeepSpeed-FastGen's speed in common scenarios and surpasses it when handling longer outputs.
- DeepSpeed-FastGen only outperforms vLLM in scenarios with long prompts and short outputs, due to its Dynamic SplitFuse optimization. This optimization is on vLLM’s roadmap.
- vLLM’s mission is to build the fastest and easiest-to-use open-source LLM inference and serving engine. It is Apache 2.0-licensed and community-owned, offering extensive model and optimization support.

---

The DeepSpeed team recently published [a blog post](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen) claiming 2x throughput improvement over vLLM, achieved by leveraging the Dynamic SplitFuse technique.
We are happy to see the technological advances coming from the open-source community.
In this blog post, we show the specific scenarios where the Dynamic SplitFuse technique is advantageous, noting that these cases are relatively limited.
For the majority of workloads, vLLM is faster than (or performs comparably to) DeepSpeed-FastGen.


### Performance Benchmark

We've identified two key differences between vLLM and DeepSpeed-FastGen in terms of performance optimization:

1. **DeepSpeed-FastGen adopts a conservative/suboptimal memory allocation scheme**, which wastes memory when output lengths are large (see the back-of-the-envelope sketch below).
2. DeepSpeed-FastGen’s Dynamic SplitFuse scheduling gives **speedup only when prompt lengths are much greater than output lengths**.
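
To make the first point concrete, here is a back-of-the-envelope sketch in Python. The allocator behavior, block size, and request sizes are illustrative assumptions rather than measurements of either system: a conservative allocator reserves KV-cache space for the worst-case output at admission, while a paged allocator (in the style of vLLM's PagedAttention) holds only the blocks a request actually fills.

```python
# Back-of-the-envelope KV-cache accounting. BLOCK_SIZE and the request
# sizes below are illustrative assumptions, not measured values.

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM-style paging)


def reserved_tokens(prompt_len: int, max_output_len: int) -> int:
    """A conservative allocator reserves worst-case space up front."""
    return prompt_len + max_output_len


def paged_tokens(prompt_len: int, actual_output_len: int) -> int:
    """A paged allocator holds only the blocks actually filled."""
    used = prompt_len + actual_output_len
    num_blocks = -(-used // BLOCK_SIZE)  # ceiling division
    return num_blocks * BLOCK_SIZE


# A short-prompt request that stops well before the configured cap:
prompt, actual_out, max_out = 128, 900, 4096
print(reserved_tokens(prompt, max_out))   # 4224 tokens held throughout
print(paged_tokens(prompt, actual_out))   # 1040 tokens held at the end
```

Under these assumptions, the reserved-but-unused slack grows with the output-length cap; wasted reservations shrink the number of requests the GPU can batch, and throughput drops with it.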

As a result, DeepSpeed-FastGen outperforms vLLM when the workload consists of consistently long prompts and short outputs.
In other scenarios, vLLM shows superior performance.

#### Scenario 1: Long Prompt Length, Short Output
Here, DeepSpeed-FastGen's Dynamic SplitFuse scheduling is expected to shine.
However, the performance gain we observe is not as large as the claimed 2x.

<p align="center">
Expand All @@ -40,7 +40,7 @@ However, the performance gain we observe isn't as significant as 2x.
</p>
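
For context, Dynamic SplitFuse (as described in the DeepSpeed-FastGen blog post) splits long prompts into fixed-size chunks and fuses those chunks with the single-token decode steps of other requests, so each forward pass does a similar amount of work. A minimal scheduling sketch, with the chunk budget and data layout as our own assumptions:

```python
# Minimal sketch of Dynamic SplitFuse-style scheduling, as we understand
# it from the DeepSpeed-FastGen blog post. CHUNK is an assumed budget.
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_len


CHUNK = 256  # prompt tokens a request may contribute per forward pass


def schedule_step(requests: list[Request]) -> list[tuple[Request, int]]:
    """Build one fused batch: requests still in prefill contribute a
    chunk of prompt tokens; the rest contribute one decode token each."""
    batch = []
    for req in requests:
        if req.in_prefill:
            n = min(CHUNK, req.prompt_len - req.prefilled)
            req.prefilled += n
            batch.append((req, n))  # split: partial prefill this step
        else:
            batch.append((req, 1))  # fuse: decode rides in the same batch
    return batch
```

Because each step mixes bounded prefill chunks with decode tokens, decode requests are never stalled behind another request's full prefill, and per-step compute stays near a fixed budget. The benefit is largest when prompts are long relative to outputs, which is exactly this scenario.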

#### Scenario 2: Other cases
In these cases, vLLM is up to **1.8x** faster than DeepSpeed-FastGen.

<p align="center">
<picture>
Expand All @@ -58,10 +58,10 @@ Specifically for the Dynamic SplitFuse optimization, we are actively investigati

### Appendix: Feature Comparison

DeepSpeed-FastGen currently offers basic functionality, supporting only three model types and lacking popular features like stop strings and parallel sampling (e.g., beam search).
We expect DeepSpeed-FastGen to catch up quickly, and we welcome further creative innovation in the market!
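
For illustration, here is roughly what stop strings and parallel sampling look like in vLLM's offline Python API (argument names as in the v0.2.x releases contemporary with this post; the model choice is arbitrary):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # any supported model

params = SamplingParams(
    n=4,              # parallel sampling: four completions per prompt
    temperature=0.8,
    max_tokens=256,
    stop=["\n\n"],    # stop strings: cut generation at a delimiter
)

outputs = llm.generate(["Briefly explain paged KV-cache memory."], params)
for completion in outputs[0].outputs:
    print(completion.text)
```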

| | vLLM | DeepSpeed-FastGen |
|----------------------------|:---------------------------------------:|:-----------------------------------------------:|
| Runtime | Python/PyTorch | Python/PyTorch |
| Model implementation | HuggingFace Transformers | Custom implementation + converter for HF models |