
Commit d9f53bc

Remove Section number and adjust image size (#22)
1 parent c0c5484 commit d9f53bc

File tree: 1 file changed, +11 -10 lines

content/posts/2025-05-21-v0.3.0-release.md (11 additions & 10 deletions)
@@ -31,14 +31,14 @@ We introduce a fully integrated distributed KVCache offloading system, adaptive

 ## New Features!

-### 1. Multi-tier KV Cache Offloading System
+### Multi-tier KV Cache Offloading System

 In the AIBrix v0.2.0 release, we introduced distributed KVCache for the first time by integrating Vineyard and making experimental changes in vLLM. This early design aimed to address a key bottleneck in LLM inference: as model sizes and context lengths grow, the KV cache increasingly consumes GPU memory and limits scalability. However, v0.2.0 had notable limitations: the KVConnector interface in vLLM had only just been merged upstream, and we hadn’t yet had the opportunity to fully leverage it or to build a dedicated cache server optimized for LLM workloads. Meanwhile, systems like Dynamo, MoonCake, and LMCache have explored KV offloading to CPU, SSDs, or remote memory, but their multi-tier designs remain incomplete or limited: Dynamo lacks a full implementation, while other solutions rely on weak eviction strategies and allocation schemes whose efficiency falls short of production needs.

 With AIBrix v0.3.0, we close these gaps and introduce a production-ready KVCache Offloading Framework, which enables efficient memory tiering and low-overhead cross-engine reuse. By default, the framework leverages **L1 DRAM-based caching**, which already provides significant performance improvements by offloading GPU memory pressure with minimal latency impact. For scenarios requiring multi-node sharing or larger-scale reuse, AIBrix allows users to optionally enable **L2 remote caching**, unlocking the benefits of a distributed KV cache layer. This release also marks the debut of **InfiniStore** ([https://github.com/bytedance/infinistore](https://github.com/bytedance/infinistore)), a high-performance RDMA-based KV cache server developed by Bytedance, purpose-built to support large-scale, low-latency, multi-tiered KV caching for LLM inference workloads.

 <p align="center">
-<img src="/images/v0.3.0-release/aibrix-kvcache-offloading-framework.png" width="80%" style="display:inline-block; margin-right:1%" />
+<img src="/images/v0.3.0-release/aibrix-kvcache-offloading-framework.png" width="75%" style="display:inline-block; margin-right:1%" />
 </p>


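For readers skimming the diff, here is a minimal conceptual sketch of the two-tier behavior the paragraph above describes: a DRAM-resident L1 cache is consulted first, with an optional remote L2 store (an InfiniStore-like service) as a larger shared fallback. The class, method names, and eviction policy below are illustrative assumptions, not AIBrix's actual API.

```python
from collections import OrderedDict

class TieredKVCache:
    """Conceptual sketch of an L1 (DRAM) + optional L2 (remote) KV cache lookup.
    Illustrative only; AIBrix's real framework differs in interface and policy."""

    def __init__(self, l1_capacity: int, l2_store=None):
        self.l1 = OrderedDict()     # DRAM tier with simple LRU ordering
        self.l1_capacity = l1_capacity
        self.l2 = l2_store          # optional dict-like remote tier

    def get(self, block_hash: str):
        # 1. Try the local DRAM tier first (lowest latency).
        if block_hash in self.l1:
            self.l1.move_to_end(block_hash)
            return self.l1[block_hash]
        # 2. Fall back to the remote tier if it is enabled.
        if self.l2 is not None and block_hash in self.l2:
            value = self.l2[block_hash]
            self.put(block_hash, value)   # promote into L1 for future reuse
            return value
        return None                       # miss: the engine must recompute the KV block

    def put(self, block_hash: str, kv_block) -> None:
        self.l1[block_hash] = kv_block
        self.l1.move_to_end(block_hash)
        # Evict least-recently-used blocks when the DRAM budget is exceeded,
        # demoting them to the remote tier instead of dropping them when possible.
        while len(self.l1) > self.l1_capacity:
            evicted_hash, evicted_block = self.l1.popitem(last=False)
            if self.l2 is not None:
                self.l2[evicted_hash] = evicted_block
```

The sketch only shows the tier-ordering idea; the production framework's eviction and allocation strategies are considerably more sophisticated, as the post emphasizes.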
@@ -62,7 +62,7 @@ Figure 1 and Table 1 illustrate the performance results of all systems (Cache-X

 <p align="center"><strong><em>Figure 1:</em> Average Time to First Token (seconds) with Varied QPS - Workload-1</strong></p>
 <p align="center">
-<img src="/images/v0.3.0-release/benchmark-kvcache-workload1.png" width="80%" style="display:inline-block; margin-right:1%" />
+<img src="/images/v0.3.0-release/benchmark-kvcache-workload1.png" width="70%" style="display:inline-block; margin-right:1%" />
 </p>

 **Note:** The vLLM baseline is **EXCLUDED** from this chart because its performance is significantly worse than others’, making their curves difficult to distinguish at this scale.
@@ -93,7 +93,7 @@ Figure 1 and Table 1 illustrate the performance results of all systems (Cache-X

 <p align="center"><strong><em>Figure 2:</em> Average Time to First Token (seconds) with Varied QPS - Workload-2</strong></p>
 <p align="center">
-<img src="/images/v0.3.0-release/benchmark-kvcache-workload2.png" width="80%" style="display:inline-block; margin-right:1%" />
+<img src="/images/v0.3.0-release/benchmark-kvcache-workload2.png" width="70%" style="display:inline-block; margin-right:1%" />
 </p>

 <details>
@@ -125,7 +125,7 @@ For Scenario-2, we construct a workload (i.e., Workload-3) that theoretically co

 <p align="center"><strong><em>Figure 3:</em> Average Time to First Token (seconds) with Varied QPS - Workload-3</strong></p>
 <p align="center">
-<img src="/images/v0.3.0-release/benchmark-kvcache-workload3.png" width="80%" style="display:inline-block; margin-right:1%" />
+<img src="/images/v0.3.0-release/benchmark-kvcache-workload3.png" width="70%" style="display:inline-block; margin-right:1%" />
 </p>

 <details>
@@ -152,7 +152,7 @@ For Scenario-2, we construct a workload (i.e., Workload-3) that theoretically co
 <br>
 <br>

-### 2. Enhanced Routing Capabilities
+### Enhanced Routing Capabilities

 This release upgrades routing logic with intelligent, adaptive strategies for LLM serving:

@@ -171,7 +171,7 @@ Inference engine such as vLLM provides prefix-caching where KV cache of existing
 > To use [prefix-cache routing](https://aibrix.readthedocs.io/latest/features/gateway-plugins.html#routing-strategies), include header *`"routing-strategy": "prefix-cache"`*.

 <p align="center">
-<img src="/images/v0.3.0-release/aibrix-prefix-cache-aware.png" width="80%" style="display:inline-block; margin-right:1%" />
+<img src="/images/v0.3.0-release/aibrix-prefix-cache-aware.png" width="90%" style="display:inline-block; margin-right:1%" />
 </p>

 #### 2.2 Preble Paper based: Prefix-Cache Routing Implementation
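Since routing behavior is selected through a request header, a short request example may help. Only the `routing-strategy` header name and its `prefix-cache` value come from the text above; the gateway URL and model name below are placeholders, not values from the release.

```python
import requests

# Hypothetical gateway endpoint and model name; adjust for your deployment.
AIBRIX_GATEWAY = "http://localhost:8888/v1/chat/completions"

resp = requests.post(
    AIBRIX_GATEWAY,
    headers={
        "Content-Type": "application/json",
        # Header documented in the post: opt into prefix-cache-aware routing.
        "routing-strategy": "prefix-cache",
    },
    json={
        "model": "example-model",
        "messages": [{"role": "user", "content": "Summarize the AIBrix v0.3.0 release."}],
    },
    timeout=60,
)
print(resp.json())
```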
@@ -190,8 +190,8 @@ By combining these three costs (L + M + P), Preble assigns each request to the G


 <p align="center">
-<img src="/images/v0.3.0-release/benchmark-routing-1k.png" width="45%" style="display:inline-block; margin-right:1%" />
-<img src="/images/v0.3.0-release/benchmark-routing-8k.png" width="45%" style="display:inline-block;" />
+<img src="/images/v0.3.0-release/benchmark-routing-1k.png" width="40%" style="display:inline-block; margin-right:1%" />
+<img src="/images/v0.3.0-release/benchmark-routing-8k.png" width="40%" style="display:inline-block;" />
 </p>

 <p align="center"><em>Benchmark result for different prefix cache and load aware routing strategies</em></p>
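The hunk header above notes that the Preble-based strategy combines three costs (L + M + P) and assigns each request to the GPU with the best combined score. A heavily hedged sketch of that selection rule follows; the three cost terms are treated as given inputs here, not computed as in the Preble paper.

```python
def pick_gpu(costs_per_gpu):
    """Sketch of a Preble-style rule: score each GPU by L + M + P and take the minimum.
    costs_per_gpu maps a GPU id to its (L, M, P) cost tuple; definitions of the
    terms come from the Preble paper, not from this snippet."""
    best_gpu, best_score = None, float("inf")
    for gpu, (l_cost, m_cost, p_cost) in costs_per_gpu.items():
        score = l_cost + m_cost + p_cost
        if score < best_score:
            best_gpu, best_score = gpu, score
    return best_gpu

# Example: GPU "b" wins because its combined cost is lowest.
print(pick_gpu({"a": (0.4, 0.3, 0.2), "b": (0.1, 0.2, 0.1)}))
```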
@@ -204,7 +204,8 @@ The Virtual Token Counter (VTC) is a fair scheduling algorithm for LLM serving b

 > To use a fairness-oriented routing, include header *`"routing-strategy": "vtc-basic"`*. Current status is *experimental*.

-### 3. Synthetic Benchmarking & Load Generation Framework
+
+### Synthetic Benchmarking & Load Generation Framework

 Modern LLM deployments face unpredictable workloads, fluctuating user sessions, and a wide range of prompt/generation patterns. To meet these demands, AIBrix v0.3.0 introduces a **fully modular, production-grade benchmark toolkit** designed to evaluate AI inference systems with unmatched realism and flexibility.

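The fairness-oriented strategy is selected the same way as the prefix-cache example shown earlier; only the header value changes, and the value below is taken from the note in this hunk. The feature is marked experimental in this release.

```python
# Reuse the request pattern from the prefix-cache example; swap the header value
# to opt into the experimental VTC-based fair routing.
headers = {
    "Content-Type": "application/json",
    "routing-strategy": "vtc-basic",
}
```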