content/posts/2025-08-04-v0.4.0-release.md
Lines changed: 13 additions & 14 deletions
Original file line number
Diff line number
Diff line change
@@ -17,7 +17,7 @@ ShowToc: true
17
17
tocopen: true
18
18
---
19
19
20
-
AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release \- v0.4.0. This release tackles key bottlenecks in orchestration and routing for **Prefill/Decode(P/D) Disaggregation** and **Large‑scale Expert Parallelism(EP)**, optimizations in the **AIBrix KVCache V1 Connector**, **KV Event synchronization** from engine and **Multi‑Engine** support.
20
+
AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update: v0.4.0. This release tackles key bottlenecks in orchestration and routing for **Prefill/Decode (P/D) Disaggregation** and **Large‑scale Expert Parallelism (EP)**, and adds optimizations in the **AIBrix KVCache V1 Connector**, **KV Event synchronization** from the engine, and **Multi‑Engine** support.
21
21
22
22
## v0.4.0 Highlight Features
23
23
@@ -41,11 +41,7 @@ The handling of the prefill request depends on the underlying inference engine:
After the prefill step is complete, a decode worker is selected. In the current implementation, the decode worker is chosen randomly. However, future enhancements aim to optimize this selection by considering factors such as KV cache transfer latency and worker load to improve efficiency.
45
-
46
-
The connection details of the selected decode worker are then returned to the Envoy proxy, which forwards the decode request accordingly. The subsequent propagation and response-handling mechanism from the Envoy proxy to the decode worker remains unchanged.
47
-
48
-
The key distinction in this workflow lies in the special handling of the prefill request, which introduces a dedicated step to route and process prefill separately before proceeding to decode.
44
+
After the prefill step is complete, a decode worker is selected. In the current implementation, the decode worker is chosen randomly. However, future enhancements aim to optimize this selection by considering factors such as KV cache transfer latency and worker load to improve efficiency. The connection details of the selected decode worker are then returned to the Envoy proxy, which forwards the decode request accordingly. The subsequent propagation and response-handling mechanism from the Envoy proxy to the decode worker remains unchanged. The key distinction in this workflow lies in the special handling of the prefill request, which introduces a dedicated step to route and process prefill separately before proceeding to decode.
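
The selection step described above can be sketched as follows. This is a minimal illustration in Python, not the gateway's implementation: `DecodeWorker`, `pick_random`, and `pick_scored` are hypothetical names, and the linear cost with its weights is an assumption. The post only states that selection is currently random and that KV cache transfer latency and worker load may be considered in the future.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeWorker:
    # Hypothetical fields; the post does not specify what the gateway tracks.
    address: str
    kv_transfer_latency_ms: float
    running_requests: int


def pick_random(workers, rng=random):
    """Current behavior per the post: choose a decode worker uniformly at random."""
    if not workers:
        raise ValueError("no decode workers available")
    return rng.choice(workers)


def pick_scored(workers, latency_weight=1.0, load_weight=10.0):
    """Possible future behavior: prefer low transfer latency and low load.

    The linear cost and its weights are illustrative assumptions.
    """
    if not workers:
        raise ValueError("no decode workers available")
    return min(
        workers,
        key=lambda w: latency_weight * w.kv_transfer_latency_ms
        + load_weight * w.running_requests,
    )
```

With these assumed weights, a worker with moderate latency and little load can beat a nearby but busy one, which is the kind of trade-off the planned enhancement would need to tune.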
49
45
50
46
The following figures illustrate the benefits of prefix-aware routing enabled by AIBrix's PD-aware routing support. To evaluate the impact of this feature, we design two workloads inspired by real-world scenarios. The **prefix-sharing workload** simulates requests that share a few long common prefixes, mimicking scenarios with significant prefix overlap (as described in our [benchmark setting](https://github.com/vllm-project/aibrix/blob/41289350823fc924acaf72ba648ed2116d4cfc44/benchmarks/config.yaml#L23)). The exact sharing patterns used are specified below. The **multiturn workload** simulates a multi-turn conversation, with a mean request length of 2,000 tokens (standard deviation: 500) and an average of [3.55 turns per conversation](https://github.com/vllm-project/aibrix/blob/41289350823fc924acaf72ba648ed2116d4cfc44/benchmarks/config.yaml#L18).
51
47
@@ -111,8 +107,7 @@ The latest version brings several key features and enhancements:
111
107
* Offered network auto-configuration functionality for RDMA-capable environments.
112
108
* Introduced new AIBrix KVCache L2 connectors for PrisDB and EIC, ByteDance's key-value stores engineered for low-latency, scalable multi-tier caching architectures optimized for LLM inference workloads.
113
109
114
-
Benchmarks by the EIC team demonstrate 89.27% reduction in average TTFT and 3.97x throughput improvement under high-concurrency scenarios (70B model).
115
-
We also conducted the same benchmarks as v0.3.0 using PrisDB as the L2 cache backend. These benchmarks are carried out with two simulated production workloads. Both workloads maintain identical sharing characteristics but different scaling. All unique requests in **Workload-1** can be fit in the GPU KV cache, while **Workload-2** scales the unique request memory footprint to 8 times, simulating capacity-constrained use cases where cache contention is severe. Compared to vLLM Baseline (w/o prefix caching) and vLLM Prefix Caching, AIBrix + PrisDB shows superior TTFT performance, particularly under increasing QPS. The following figure shows that AIBrix + PrisDB delivers sub-second TTFT and orders of magnitude TTFT advantages across all load levels and benchmarks.
110
+
Benchmarks by the [Elastic Instant Cache](https://www.volcengine.com/product/eic) (EIC) team demonstrate an 89.27% reduction in average TTFT and a 3.97x throughput improvement under high-concurrency scenarios (70B model). We also ran the same benchmarks as in v0.3.0, using PrisDB as the L2 cache backend, with two simulated production workloads. Both workloads have identical sharing characteristics but differ in scale: all unique requests in **Workload-1** fit in the GPU KV cache, while **Workload-2** scales the unique-request memory footprint to 8x, simulating capacity-constrained use cases where cache contention is severe. Compared to the vLLM baseline (without prefix caching) and vLLM Prefix Caching, AIBrix + PrisDB shows superior TTFT, particularly as QPS increases. The following figure shows that AIBrix + PrisDB delivers sub-second TTFT and orders-of-magnitude TTFT advantages across all load levels and benchmarks.
@@ -122,17 +117,18 @@ We also conducted the same benchmarks as v0.3.0 using PrisDB as the L2 cache bac
122
117
123
118
### KV Event Subscription System
124
119
125
-
AIBrix v0.4.0's new KV Event Subscription System improves prefix cache hit rates by synchronizing KV cache states in real-time across distributed nodes. Here, we will cover its design, trade-offs, and implementation details. The introduction of this new system offers a choice with different trade-offs, allowing users to decide between **system simplicity** and **prefix cache state accuracy** based on their needs.
120
+
AIBrix v0.4.0's new KV Event Subscription System improves prefix cache hit rates by synchronizing KV cache states in real time across distributed nodes. It gives users a choice between **system simplicity** and **prefix cache state accuracy**, depending on their needs.
126
121
127
122
The core idea of this feature is to broadcast KV cache state change events across all routers via messaging middleware. This provides the routing layer with a near real-time, global view of the cache, enabling **more precise routing decisions**. (See PR [\#1349](https://github.com/vllm-project/aibrix/pull/1349) for details)
128
-
In theory, global state synchronization can significantly improve the cluster's potential prefix cache hit rate. However, this advantage comes at a cost. The approach introduces **additional overhead** from message queue management, increasing system complexity. In the current version, performance gains are not guaranteed, as the routing algorithms have not yet been fully adapted. Furthermore, the indexer may face **scalability challenges** in large-scale deployments.
123
+
In theory, global state synchronization can significantly improve the cluster's potential prefix cache hit rate. However, this advantage comes at a cost. The approach introduces **additional overhead** from message queue management, increasing system complexity. In the current version, performance gains are not guaranteed, as the routing algorithms have not yet been fully adapted. Furthermore, the indexer may face **scalability challenges** in large-scale deployments.
129
124
130
-
In contrast, the traditional unsynchronized approach is simpler and more lightweight, requiring **no extra synchronization components**. Its main drawback is the potential for **inconsistencies**, as each node runs its eviction policies independently, which can lower the overall cluster's prefix cache hit rate.
131
-
To enable the KV event subscription system, the remote tokenizer mode must be active, and the following environment variables must be set in gateway plugin component:
125
+
In contrast, the traditional unsynchronized approach is simpler and more lightweight, requiring **no extra synchronization components**. Its main drawback is the potential for **inconsistencies**, as each node runs its eviction policies independently, which can lower the overall cluster's prefix cache hit rate.
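
To make the trade-off concrete, here is a minimal Python sketch of a router-side index fed by KV cache events. The event shape (`stored` / `removed` with a pod name and a list of block hashes) and all names (`KVEventIndex`, `apply`, `best_pod`) are assumptions for illustration, not AIBrix's actual schema or API; PR #1349 has the real design.

```python
from collections import defaultdict


class KVEventIndex:
    """Hypothetical global view: block hash -> set of pods caching that block."""

    def __init__(self):
        self._pods_by_block = defaultdict(set)

    def apply(self, event):
        # Assumed event shape:
        # {"type": "stored" | "removed", "pod": str, "blocks": [hash, ...]}
        for block in event["blocks"]:
            if event["type"] == "stored":
                self._pods_by_block[block].add(event["pod"])
            elif event["type"] == "removed":
                self._pods_by_block[block].discard(event["pod"])

    def best_pod(self, prompt_blocks, pods):
        """Return (pod, matched) for the pod holding the longest cached prefix."""
        best, best_len = None, 0
        for pod in pods:
            matched = 0
            for block in prompt_blocks:
                if pod not in self._pods_by_block.get(block, ()):
                    break
                matched += 1
            if matched > best_len:
                best, best_len = pod, matched
        return best, best_len
```

In this sketch the `removed` events are what keep the router's view accurate after an eviction. Without them, which corresponds to the unsynchronized approach, a router could keep sending requests to a pod that no longer holds the prefix.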
126
+
127
+
To enable the KV event subscription system, the remote tokenizer mode must be active, and the following environment variables must be set in the gateway plugin component:
132
128
133
129
```
134
130
// Enable KV event synchronization
135
-
AIBRIX_KV_EVENT_SYNC_ENABLED: true
131
+
AIBRIX_KV_EVENT_SYNC_ENABLED: true
136
132
// Depends on and enables remote tokenizer mode
137
133
AIBRIX_USE_REMOTE_TOKENIZE: true
138
134
```
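
The dependency between the two flags can be expressed as a small check. This is a hedged Python sketch, not the gateway's actual startup logic: the helper names `is_true` and `kv_event_sync_active` are hypothetical, the variable names are copied from the snippet above, and treating only the string `true` as enabled is an assumption.

```python
import os


def is_true(value):
    """Interpret an environment value as a boolean (assumed: only 'true' counts)."""
    return value is not None and value.strip().lower() == "true"


def kv_event_sync_active(env=None):
    """Return True only when sync is enabled and remote tokenizer mode is on."""
    env = os.environ if env is None else env
    sync = is_true(env.get("AIBRIX_KV_EVENT_SYNC_ENABLED"))
    remote_tokenizer = is_true(env.get("AIBRIX_USE_REMOTE_TOKENIZE"))
    return sync and remote_tokenizer
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to exercise in isolation.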
@@ -147,7 +143,7 @@ The KV Event Subscription System is a step for AIBrix towards high-performance d
147
143
148
144
### Multi‑Engine Support
149
145
150
-
Previously, AIBrix primarily supported the vLLM engine, limiting flexibility for comparing different inference backends. However, growing community demand—as seen in [#137](https://github.com/vllm-project/aibrix/issues/137), [[#843](https://github.com/vllm-project/aibrix/issues/843), and [#1245](https://github.com/vllm-project/aibrix/issues/1245) —highlighted the need for broader engine support. With the latest update, AIBrix now supports **multi-engine deployment**, allowing developers to run **vLLM, SGLang, and xLLM** side-by-side within a single AIBrix cluster. This unlocks new possibilities for benchmarking and production deployment while leveraging AIBrix’s unified serving infrastructure.
146
+
Previously, AIBrix primarily supported the vLLM engine, limiting flexibility for comparing different inference backends. However, growing community demand, as seen in [#137](https://github.com/vllm-project/aibrix/issues/137), [#843](https://github.com/vllm-project/aibrix/issues/843), and [#1245](https://github.com/vllm-project/aibrix/issues/1245), highlighted the need for broader engine support. With the latest update, AIBrix now supports **multi-engine deployment**, allowing developers to run **vLLM, SGLang, and [xLLM](https://www.volcengine.com/docs/6459/72358)** side-by-side within a single AIBrix cluster. This unlocks new possibilities for benchmarking and production deployment while leveraging AIBrix’s unified serving infrastructure.
151
147
152
148
Key points include:
153
149
@@ -159,6 +155,8 @@ Multi‑engine support makes it easy to run vLLM and SGLang side‑by‑side and
159
155
160
156
## Other Improvements
161
157
158
+
While the highlights focus on new architectural and orchestration capabilities, we've also delivered several foundational improvements that strengthen AIBrix’s robustness and observability in real-world deployments.
159
+
162
160
AIBrix Gateway now supports SLO-aware routing with request profiling and deadline-based traffic control, enabling more intelligent and responsive load handling under dynamic traffic patterns ([#1192](https://github.com/vllm-project/aibrix/pull/1192), [#1305](https://github.com/vllm-project/aibrix/pull/1305), [#1368](https://github.com/vllm-project/aibrix/pull/1368)). Additional enhancements include configurable timeouts, custom metrics ports, and a ready-to-use Grafana dashboard for observability ([#1211](https://github.com/vllm-project/aibrix/pull/1211), [#1212](https://github.com/vllm-project/aibrix/pull/1212)).
163
161
164
162
On the control plane side, we've strengthened webhook validation, CRD existence checks, and added mechanisms to safely resync cache state during component restarts ([#1170](https://github.com/vllm-project/aibrix/pull/1170), [#1187](https://github.com/vllm-project/aibrix/pull/1187), [#1219](https://github.com/vllm-project/aibrix/pull/1219)).
@@ -180,6 +178,7 @@ We deeply appreciate your contributions and feedback. Keep them coming!
180
178
## Next Steps
181
179
182
180
We're continuing to push the boundaries of LLM system infrastructure, and **AIBrix v0.5.0** will focus on unlocking powerful capabilities for **agent-based use cases**, **multi-modality**, and **cost-efficient multi-tenant serving**. Here's a glimpse of what's coming:
181
+
- **P/D Disaggregation Improvements**: Introduce additional production-ready deployment patterns and examples, with improved PodGroup integration for better scheduling alignment and enhanced autoscaling support.
183
182
- **Batch API**: Introduce a new batch inference API to improve GPU utilization under latency-insensitive scenarios.
184
183
- **Multi-Tenancy**: Add tenant-aware isolation, request segregation, and per-tenant SLO controls for safer shared deployments.
185
184
- **Context Cache for Agents**: Enable efficient reuse of session history across multi-turn conversations and agentic programs via a new context caching interface.