Commit 83fc01f

Update v0.4.0 blog post (#25)
Signed-off-by: Jiaxin Shan <[email protected]>
1 parent a1130fb commit 83fc01f

1 file changed: +13 -14 lines changed

content/posts/2025-08-04-v0.4.0-release.md

Lines changed: 13 additions & 14 deletions
@@ -17,7 +17,7 @@ ShowToc: true
17 17
tocopen: true
18 18
---
19 19

20-
AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release \- v0.4.0. This release tackles key bottlenecks in orchestration and routing for **Prefill/Decode(P/D) Disaggregation** and **Large‑scale Expert Parallelism(EP)**, optimizations in the **AIBrix KVCache V1 Connector**, **KV Event synchronization** from engine and **Multi‑Engine** support.
20+
AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release - v0.4.0. This release tackles key bottlenecks in orchestration and routing for **Prefill/Decode(P/D) Disaggregation** and **Large‑scale Expert Parallelism(EP)**, optimizations in the **AIBrix KVCache V1 Connector**, **KV Event synchronization** from engine and **Multi‑Engine** support.
21 21

22 22
## v0.4.0 Highlight Features
23 23

@@ -41,11 +41,7 @@ The handling of the prefill request depends on the underlying inference engine:
41 41
<img src="/images/v0.4.0-release/aibrix-pd-router.png" width="100%" style="display:inline-block; margin-right:1%" />
42 42
</p>
43 43

44-
After the prefill step is complete, a decode worker is selected. In the current implementation, the decode worker is chosen randomly. However, future enhancements aim to optimize this selection by considering factors such as KV cache transfer latency and worker load to improve efficiency.
45-
46-
The connection details of the selected decode worker are then returned to the Envoy proxy, which forwards the decode request accordingly. The subsequent propagation and response-handling mechanism from the Envoy proxy to the decode worker remains unchanged.
47-
48-
The key distinction in this workflow lies in the special handling of the prefill request, which introduces a dedicated step to route and process prefill separately before proceeding to decode.
44+
After the prefill step is complete, a decode worker is selected. In the current implementation, the decode worker is chosen randomly. However, future enhancements aim to optimize this selection by considering factors such as KV cache transfer latency and worker load to improve efficiency. The connection details of the selected decode worker are then returned to the Envoy proxy, which forwards the decode request accordingly. The subsequent propagation and response-handling mechanism from the Envoy proxy to the decode worker remains unchanged. The key distinction in this workflow lies in the special handling of the prefill request, which introduces a dedicated step to route and process prefill separately before proceeding to decode.
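
The decode-worker selection described above can be sketched as follows. This is a minimal illustration, not AIBrix's router code: the `DecodeWorker` fields, the function names, and the scoring weights are all hypothetical. `pick_random` mirrors the current random choice, while `pick_scored` shows one possible shape for the load- and latency-aware policy mentioned as future work.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeWorker:
    # Hypothetical fields for illustration; not AIBrix's actual types.
    address: str
    pending_requests: int   # rough proxy for worker load
    kv_transfer_ms: float   # estimated KV cache transfer latency


def pick_random(workers, rng=random):
    """Current behavior as described: choose a decode worker uniformly at random."""
    if not workers:
        raise ValueError("no decode workers available")
    return rng.choice(workers)


def pick_scored(workers, load_weight=1.0, latency_weight=0.1):
    """Possible future policy (assumed): lowest weighted sum of load and latency."""
    if not workers:
        raise ValueError("no decode workers available")
    return min(
        workers,
        key=lambda w: load_weight * w.pending_requests
        + latency_weight * w.kv_transfer_ms,
    )


workers = [
    DecodeWorker("10.0.0.1:8000", pending_requests=4, kv_transfer_ms=12.0),
    DecodeWorker("10.0.0.2:8000", pending_requests=1, kv_transfer_ms=30.0),
    DecodeWorker("10.0.0.3:8000", pending_requests=2, kv_transfer_ms=8.0),
]
```

With these sample values, `pick_scored(workers)` selects the third worker, since its combined load and transfer cost is the lowest. The weights are placeholders and would need tuning against real measurements.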
49 45

50 46
The following figures illustrate the benefits of prefix-aware routing enabled by AIBrix's PD-aware routing support. To evaluate the impact of this feature, we design two workloads inspired by real-world scenarios. The **prefix-sharing workload** simulates requests that share a few long common prefixes, mimicking scenarios with significant prefix overlap (as described in our [benchmark setting](https://github.com/vllm-project/aibrix/blob/41289350823fc924acaf72ba648ed2116d4cfc44/benchmarks/config.yaml#L23)). The exact sharing patterns used are specified below. The **multiturn workload** simulates a multi-turn conversation, with a mean request length of 2,000 tokens (standard deviation: 500) and an average of [3.55 turns per conversation](https://github.com/vllm-project/aibrix/blob/41289350823fc924acaf72ba648ed2116d4cfc44/benchmarks/config.yaml#L18).
51 47

@@ -111,8 +107,7 @@ The latest version brings several key features and enhancements:
111 107
* Offered network auto-configuration functionality for RDMA-capable environments.
112 108
* Introduced new AIBrix KVCache L2 connectors for PrisDB and EIC, ByteDance's key-value stores engineered for low-latency, scalable multi-tier caching architectures optimized for LLM inference workloads.
113 109

114-
Benchmarks by the EIC team demonstrate 89.27% reduction in average TTFT and 3.97x throughput improvement under high-concurrency scenarios (70B model).
115-
We also conducted the same benchmarks as v0.3.0 using PrisDB as the L2 cache backend. These benchmarks are carried out with two simulated production workloads. Both workloads maintain identical sharing characteristics but different scaling. All unique requests in **Workload-1** can be fit in the GPU KV cache, while **Workload-2** scales the unique request memory footprint to 8 times, simulating capacity-constrained use cases where cache contention is severe. Compared to vLLM Baseline (w/o prefix caching) and vLLM Prefix Caching, AIBrix + PrisDB shows superior TTFT performance, particularly under increasing QPS. The following figure shows that AIBrix + PrisDB delivers sub-second TTFT and orders of magnitude TTFT advantages across all load levels and benchmarks.
110+
Benchmarks by the [Elastic Instant Cache](https://www.volcengine.com/product/eic) (EIC) team demonstrate an 89.27% reduction in average TTFT and a 3.97x throughput improvement under high-concurrency scenarios (70B model). We also ran the same benchmarks as in v0.3.0 using PrisDB as the L2 cache backend. These benchmarks use two simulated production workloads with identical sharing characteristics but different scale. All unique requests in **Workload-1** fit in the GPU KV cache, while **Workload-2** scales the unique request memory footprint to 8x, simulating capacity-constrained use cases where cache contention is severe. Compared to vLLM Baseline (w/o prefix caching) and vLLM Prefix Caching, AIBrix + PrisDB shows superior TTFT performance, particularly under increasing QPS. The following figure shows that AIBrix + PrisDB delivers sub-second TTFT and an orders-of-magnitude TTFT advantage across all load levels and benchmarks.
116 111

117 112
(Notation: 8B-1U = DeepSeek-R1-Distill-Llama-8B + Workload-1; 8B-8U = Workload-2 variant; 70B-1U = DeepSeek-R1-Distill-Llama-70B model + Workload-1)
118 113

@@ -122,17 +117,18 @@ We also conducted the same benchmarks as v0.3.0 using PrisDB as the L2 cache bac
122 117

123 118
### KV Event Subscription System
124 119

125-
AIBrix v0.4.0's new KV Event Subscription System improves prefix cache hit rates by synchronizing KV cache states in real-time across distributed nodes. Here, we will cover its design, trade-offs, and implementation details. The introduction of this new system offers a choice with different trade-offs, allowing users to decide between **system simplicity** and **prefix cache state accuracy** based on their needs.
120+
AIBrix v0.4.0's new KV Event Subscription System improves prefix cache hit rates by synchronizing KV cache states in real-time across distributed nodes. The introduction of this new system offers a choice with different trade-offs, allowing users to decide between **system simplicity** and **prefix cache state accuracy** based on their needs.
126 121

127 122
The core idea of this feature is to broadcast KV cache state change events across all routers via messaging middleware. This provides the routing layer with a near real-time, global view of the cache, enabling **more precise routing decisions**. (See PR [\#1349](https://github.com/vllm-project/aibrix/pull/1349) for details)
128-
In theory, global state synchronization can significantly improve the cluster's potential prefix cache hit rate. However, this advantage comes at a cost. The approach introduces **additional overhead** from message queue management, increasing system complexity. In the current version, performance gains are not guaranteed, as the routing algorithms have not yet been fully adapted. Furthermore, the indexer may face **scalability challenges** in large-scale deployments.
123+
In theory, global state synchronization can significantly improve the cluster's potential prefix cache hit rate. However, this advantage comes at a cost. The approach introduces **additional overhead** from message queue management, increasing system complexity. In the current version, performance gains are not guaranteed, as the routing algorithms have not yet been fully adapted. Furthermore, the indexer may face **scalability challenges** in large-scale deployments.
129 124

130-
In contrast, the traditional unsynchronized approach is simpler and more lightweight, requiring **no extra synchronization components**. Its main drawback is the potential for **inconsistencies**, as each node runs its eviction policies independently, which can lower the overall cluster's prefix cache hit rate.
131-
To enable the KV event subscription system, the remote tokenizer mode must be active, and the following environment variables must be set in gateway plugin component:
125+
In contrast, the traditional unsynchronized approach is simpler and more lightweight, requiring **no extra synchronization components**. Its main drawback is the potential for **inconsistencies**, as each node runs its eviction policies independently, which can lower the overall cluster's prefix cache hit rate.
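
To make the synchronized approach more concrete, here is a minimal sketch of a router-side index kept up to date by KV cache events. It is an assumption-laden illustration rather than the design from the PR: the event dictionary shape, the `PrefixIndex` class, and its method names are hypothetical, and the messaging middleware is left out entirely.

```python
from collections import defaultdict


class PrefixIndex:
    """Router-side view of which pods hold which KV blocks (hypothetical design)."""

    def __init__(self):
        self._pods_by_block = defaultdict(set)  # block hash -> pod names

    def apply(self, event):
        # Assumed event shape:
        # {"type": "stored" | "removed", "pod": str, "blocks": [hash, ...]}
        for block in event["blocks"]:
            if event["type"] == "stored":
                self._pods_by_block[block].add(event["pod"])
            elif event["type"] == "removed":
                self._pods_by_block[block].discard(event["pod"])

    def matched_blocks(self, pod, request_blocks):
        """Count leading blocks of the request that the pod is believed to cache."""
        count = 0
        for block in request_blocks:
            if pod not in self._pods_by_block.get(block, ()):
                break
            count += 1
        return count

    def best_pod(self, pods, request_blocks):
        """Pick the pod with the longest cached prefix; ties go to the first pod."""
        return max(pods, key=lambda p: self.matched_blocks(p, request_blocks))


index = PrefixIndex()
index.apply({"type": "stored", "pod": "pod-a", "blocks": ["h1", "h2", "h3"]})
index.apply({"type": "stored", "pod": "pod-b", "blocks": ["h1"]})
index.apply({"type": "removed", "pod": "pod-a", "blocks": ["h3"]})
```

Because the eviction of `h3` on `pod-a` is reflected in the index, the router does not overestimate that pod's cached prefix. Without synchronization, the router would have to guess at this state, which is the source of the inconsistencies noted above.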
126+
127+
To enable the KV event subscription system, the remote tokenizer mode must be active, and the following environment variables must be set in the gateway plugin component:
132 128

133 129
```
134 130
// Enable KV event synchronization
135-
AIBRIX_KV_EVENT_SYNC_ENABLED: true
131+
AIBRIX_KV_EVENT_SYNC_ENABLED: true
136 132
// Depends on and enables remote tokenizer mode
137 133
AIBRIX_USE_REMOTE_TOKENIZE: true
138 134
```
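
The dependency between the two variables can be checked before deployment with a small helper like the one below. This is only a sketch: the post does not show how the gateway parses these values, so the truthy-string rule in `parse_bool` and the helper names are assumptions.

```python
def parse_bool(value):
    """Interpret common truthy strings; anything else counts as false (assumed rule)."""
    return str(value).strip().lower() in {"1", "true", "yes"}


def check_kv_event_sync(env):
    """Return a list of problems with the KV event sync settings (empty if none)."""
    problems = []
    sync = parse_bool(env.get("AIBRIX_KV_EVENT_SYNC_ENABLED", "false"))
    remote = parse_bool(env.get("AIBRIX_USE_REMOTE_TOKENIZE", "false"))
    if sync and not remote:
        problems.append(
            "AIBRIX_KV_EVENT_SYNC_ENABLED requires AIBRIX_USE_REMOTE_TOKENIZE=true"
        )
    return problems
```

Passing `os.environ` or a dictionary built from a manifest's `env` section returns an empty list when the settings are consistent.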
@@ -147,7 +143,7 @@ The KV Event Subscription System is a step for AIBrix towards high-performance d
147 143

148 144
### Multi‑Engine Support
149 145

150-
Previously, AIBrix primarily supported the vLLM engine, limiting flexibility for comparing different inference backends. However, growing community demand—as seen in [#137](https://github.com/vllm-project/aibrix/issues/137), [[#843](https://github.com/vllm-project/aibrix/issues/843), and [#1245](https://github.com/vllm-project/aibrix/issues/1245) —highlighted the need for broader engine support. With the latest update, AIBrix now supports **multi-engine deployment**, allowing developers to run **vLLM, SGLang, and xLLM** side-by-side within a single AIBrix cluster. This unlocks new possibilities for benchmarking and production deployment while leveraging AIBrix’s unified serving infrastructure.
146+
Previously, AIBrix primarily supported the vLLM engine, limiting flexibility for comparing different inference backends. However, growing community demand—as seen in [#137](https://github.com/vllm-project/aibrix/issues/137), [#843](https://github.com/vllm-project/aibrix/issues/843), and [#1245](https://github.com/vllm-project/aibrix/issues/1245) —highlighted the need for broader engine support. With the latest update, AIBrix now supports **multi-engine deployment**, allowing developers to run **vLLM, SGLang, and [xLLM](https://www.volcengine.com/docs/6459/72358)** side-by-side within a single AIBrix cluster. This unlocks new possibilities for benchmarking and production deployment while leveraging AIBrix’s unified serving infrastructure.
151 147

152 148
Key points include:
153 149

@@ -159,6 +155,8 @@ Multi‑engine support makes it easy to run vLLM and SGLang side‑by‑side and
159 155

160 156
## Other Improvements
161 157

158+
While the highlights focus on new architectural and orchestration capabilities, we've also delivered several foundational improvements that strengthen AIBrix’s robustness and observability in real-world deployments.
159+
162 160
AIBrix Gateway now supports SLO-aware routing with request profiling and deadline-based traffic control, enabling more intelligent and responsive load handling under dynamic traffic patterns ([#1192](https://github.com/vllm-project/aibrix/pull/1192), [#1305](https://github.com/vllm-project/aibrix/pull/1305), [#1368](https://github.com/vllm-project/aibrix/pull/1368)). Additional enhancements include configurable timeouts, custom metrics ports, and a ready-to-use Grafana dashboard for observability ([#1211](https://github.com/vllm-project/aibrix/pull/1211), [#1212](https://github.com/vllm-project/aibrix/pull/1212)).
163 161

164 162
On the control plane side, we've strengthened webhook validation, CRD existence checks, and added mechanisms to safely resync cache state during component restarts ([#1170](https://github.com/vllm-project/aibrix/pull/1170), [#1187](https://github.com/vllm-project/aibrix/pull/1187), [#1219](https://github.com/vllm-project/aibrix/pull/1219)).
@@ -180,6 +178,7 @@ We deeply appreciate your contributions and feedback. Keep them coming!
180 178
## Next Steps
181 179

182 180
We're continuing to push the boundaries of LLM system infrastructure, and **AIBrix v0.5.0** will focus on unlocking powerful capabilities for **agent-based use cases**, **multi-modality**, and **cost-efficient multi-tenant serving**. Here's a glimpse of what's coming:
181+
- **P/D Disaggregation Improvements**: Introduce additional production-ready deployment patterns and examples, with improved integration of PodGroup for better scheduling alignment and enhanced autoscaling support.
183 182
- **Batch API**: Introduce a new batch inference API to improve GPU utilization under latency-insensitive scenarios.
184 183
- **Multi-Tenancy**: Add tenant-aware isolation, request segregation, and per-tenant SLO controls for safer shared deployments.
185 184
- **Context Cache for Agents**: Enable efficient reuse of session history across multi-turn conversations and agentic programs via a new context caching interface.
