content/posts/2025-05-21-v0.3.0-release.md (11 additions & 10 deletions)
@@ -31,14 +31,14 @@ We introduce a fully integrated distributed KVCache offloading system, adaptive
## New Features!
-### 1. Multi-tier KV Cache Offloading System
+### Multi-tier KV Cache Offloading System
In the AIBrix v0.2.0 release, we introduced distributed KVCache for the first time by integrating Vineyard and making experimental changes in vLLM. This early design aimed to address a key bottleneck in LLM inference: as model sizes and context lengths grow, the KV cache increasingly consumes GPU memory and limits scalability. However, v0.2.0 had notable limitations: the KVConnector interface in vLLM had only just been merged upstream, and we had not yet had the opportunity to fully leverage it or to build a dedicated cache server optimized for LLM workloads. Meanwhile, systems like Dynamo, MoonCake, and LMCache explored KV offloading to CPU, SSDs, or remote memory, but their multi-tier designs remain incomplete or limited: Dynamo lacks a full implementation, while other solutions employ weak eviction strategies and allocation schemes that cannot meet production efficiency needs.
With AIBrix v0.3.0, we close these gaps and introduce a production-ready KVCache Offloading Framework, which enables efficient memory tiering and low-overhead cross-engine reuse. By default, the framework leverages **L1 DRAM-based caching**, which already provides significant performance improvements by offloading GPU memory pressure with minimal latency impact. For scenarios requiring multi-node sharing or larger-scale reuse, AIBrix allows users to optionally enable **L2 remote caching**, unlocking the benefits of a distributed KV cache layer. This release also marks the debut of **InfiniStore** ([https://github.com/bytedance/infinistore](https://github.com/bytedance/infinistore)), a high-performance RDMA-based KV cache server developed by ByteDance, purpose-built to support large-scale, low-latency, multi-tiered KV caching for LLM inference workloads.
**Note:** The vLLM baseline is **EXCLUDED** from this chart because its performance is significantly worse than the other systems', which would make their curves difficult to distinguish at this scale.
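To make the tiering flow concrete, here is a minimal, hypothetical sketch of an L1 (DRAM) lookup with an optional L2 (remote) fallback. It is illustrative only: the class, method names, and the `l2_client` interface are assumptions and do not reflect the actual AIBrix or InfiniStore APIs.

```python
# Conceptual sketch only: illustrates the L1 (DRAM) + optional L2 (remote) lookup
# flow described above. Names are hypothetical and do NOT reflect the actual
# AIBrix or InfiniStore APIs.
from collections import OrderedDict
from typing import Optional


class TieredKVCache:
    def __init__(self, l1_capacity: int, l2_client: Optional[object] = None):
        self.l1 = OrderedDict()          # DRAM tier, kept in LRU order
        self.l1_capacity = l1_capacity   # max number of KV blocks held in DRAM
        self.l2 = l2_client              # optional remote tier (e.g., an RDMA-backed store)

    def get(self, block_hash: str) -> Optional[bytes]:
        # 1. Try DRAM first: cheapest path, no network round trip.
        if block_hash in self.l1:
            self.l1.move_to_end(block_hash)      # refresh LRU position
            return self.l1[block_hash]
        # 2. Fall back to the remote tier only if it is enabled.
        if self.l2 is not None:
            data = self.l2.get(block_hash)       # hypothetical remote get()
            if data is not None:
                self._put_l1(block_hash, data)   # promote hot block into DRAM
            return data
        return None                              # miss: engine recomputes the KV block

    def put(self, block_hash: str, data: bytes) -> None:
        self._put_l1(block_hash, data)
        if self.l2 is not None:
            self.l2.put(block_hash, data)        # hypothetical remote put()

    def _put_l1(self, block_hash: str, data: bytes) -> None:
        self.l1[block_hash] = data
        self.l1.move_to_end(block_hash)
        while len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)          # evict the least-recently-used block
```

The design point this sketch captures is that the DRAM tier absorbs most hits at near-zero latency, while the remote tier is consulted only when explicitly enabled, so engines pay the network cost only for blocks worth sharing across nodes.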
@@ -93,7 +93,7 @@ Figure 1 and Table 1 illustrate the performance results of all systems (Cache-X
<p align="center"><strong><em>Figure 2:</em> Average Time to First Token (seconds) with Varied QPS - Workload-2</strong></p>
@@ -152,7 +152,7 @@ For Scenario-2, we construct a workload (i.e., Workload-3) that theoretically co
<br>
<br>
-### 2. Enhanced Routing Capabilities
+### Enhanced Routing Capabilities
This release upgrades routing logic with intelligent, adaptive strategies for LLM serving:
@@ -171,7 +171,7 @@ Inference engine such as vLLM provides prefix-caching where KV cache of existing
> To use [prefix-cache routing](https://aibrix.readthedocs.io/latest/features/gateway-plugins.html#routing-strategies), include header *`"routing-strategy": "prefix-cache"`*.
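As an illustration, such a request might look like the Python snippet below. The gateway address and model name are placeholders; only the `"routing-strategy": "prefix-cache"` header comes from the documentation referenced above.

```python
# Minimal example of setting the routing-strategy header on an OpenAI-compatible
# chat completion request. The gateway URL and model name are placeholders.
import requests

resp = requests.post(
    "http://<aibrix-gateway>/v1/chat/completions",   # placeholder gateway address
    headers={
        "routing-strategy": "prefix-cache",          # ask the gateway for prefix-cache routing
        "Content-Type": "application/json",
    },
    json={
        "model": "your-model-name",                  # placeholder model name
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json())
```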
Modern LLM deployments face unpredictable workloads, fluctuating user sessions, and a wide range of prompt/generation patterns. To meet these demands, AIBrix v0.3.0 introduces a **fully modular, production-grade benchmark toolkit** designed to evaluate AI inference systems with unmatched realism and flexibility.
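For intuition, the sketch below shows one simple way to measure average time-to-first-token (TTFT) at a fixed request rate, similar in spirit to the TTFT-vs-QPS results reported earlier. It is a simplified stand-in, not the AIBrix benchmark toolkit; the endpoint, payload, and streaming handling are assumptions.

```python
# Conceptual sketch only: a tiny open-loop load generator that fires requests at a
# fixed QPS and records time-to-first-token (TTFT). This is NOT the AIBrix benchmark
# toolkit; endpoint, payload, and streaming format are simplified placeholders.
import asyncio
import time

import aiohttp

GATEWAY = "http://<aibrix-gateway>/v1/completions"   # placeholder endpoint
QPS = 5                                              # target request rate
DURATION_S = 10                                      # how long to generate load


async def one_request(session: aiohttp.ClientSession, ttfts: list) -> None:
    payload = {"model": "your-model-name", "prompt": "Hello", "stream": True, "max_tokens": 64}
    start = time.perf_counter()
    async with session.post(GATEWAY, json=payload) as resp:
        async for _chunk in resp.content.iter_any():   # first streamed chunk ~ first token
            ttfts.append(time.perf_counter() - start)
            break


async def main() -> None:
    ttfts: list = []
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(QPS * DURATION_S):
            tasks.append(asyncio.create_task(one_request(session, ttfts)))
            await asyncio.sleep(1.0 / QPS)             # open-loop: fixed inter-arrival time
        await asyncio.gather(*tasks, return_exceptions=True)
    if ttfts:
        print(f"avg TTFT: {sum(ttfts) / len(ttfts):.3f}s over {len(ttfts)} requests")


if __name__ == "__main__":
    asyncio.run(main())
```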