content/posts/2025-05-21-v0.3.0-release.md (11 additions & 10 deletions)
@@ -31,14 +31,14 @@ We introduce a fully integrated distributed KVCache offloading system, adaptive
## New Features!
-### 1. Multi-tier KV Cache Offloading System
+### Multi-tier KV Cache Offloading System
In the AIBrix v0.2.0 release, we introduced distributed KVCache for the first time by integrating Vineyard and making experimental changes in vLLM. This early design aimed to address a key bottleneck in LLM inference: as model sizes and context lengths grow, the KV cache increasingly consumes GPU memory and limits scalability. However, v0.2.0 had notable limitations: the KVConnector interface in vLLM had only just been merged upstream, and we had not yet had the opportunity to fully leverage it or to build a dedicated cache server optimized for LLM workloads. Meanwhile, systems like Dynamo, MoonCake, and LMCache explored KV offloading to CPU, SSDs, or remote memory, but their multi-tier designs remain incomplete or limited: Dynamo lacks a full implementation, while other solutions employ weak eviction strategies and allocation schemes that cannot meet production efficiency needs.
With AIBrix v0.3.0, we close these gaps and introduce a production-ready KVCache Offloading Framework, which enables efficient memory tiering and low-overhead cross-engine reuse. By default, the framework leverages **L1 DRAM-based caching**, which already provides significant performance improvements by offloading GPU memory pressure with minimal latency impact. For scenarios requiring multi-node sharing or larger-scale reuse, AIBrix allows users to optionally enable **L2 remote caching**, unlocking the benefits of a distributed KV cache layer. This release also marks the debut of **InfiniStore** ([https://github.com/bytedance/infinistore](https://github.com/bytedance/infinistore)), a high-performance RDMA-based KV cache server developed by ByteDance, purpose-built to support large-scale, low-latency, multi-tiered KV caching for LLM inference workloads.
**Note:** The vLLM baseline is **EXCLUDED** from this chart because its performance is significantly worse than the other systems', which would make their curves difficult to distinguish at this scale.
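To make the tiering flow concrete, here is a minimal, hypothetical sketch of an L1 (DRAM) lookup with an optional L2 (remote) fallback. It is illustrative only: the class, method names, and the `l2_client` interface are assumptions and do not reflect the actual AIBrix or InfiniStore APIs.

```python
# Conceptual sketch only: illustrates the L1 (DRAM) + optional L2 (remote) lookup
# flow described above. Names are hypothetical and do NOT reflect the actual
# AIBrix or InfiniStore APIs.
from collections import OrderedDict
from typing import Optional


class TieredKVCache:
    def __init__(self, l1_capacity: int, l2_client: Optional[object] = None):
        self.l1 = OrderedDict()          # DRAM tier, kept in LRU order
        self.l1_capacity = l1_capacity   # max number of KV blocks held in DRAM
        self.l2 = l2_client              # optional remote tier (e.g., an RDMA-backed store)

    def get(self, block_hash: str) -> Optional[bytes]:
        # 1. Try DRAM first: cheapest path, no network round trip.
        if block_hash in self.l1:
            self.l1.move_to_end(block_hash)      # refresh LRU position
            return self.l1[block_hash]
        # 2. Fall back to the remote tier only if it is enabled.
        if self.l2 is not None:
            data = self.l2.get(block_hash)       # hypothetical remote get()
            if data is not None:
                self._put_l1(block_hash, data)   # promote hot block into DRAM
            return data
        return None                              # miss: engine recomputes the KV block

    def put(self, block_hash: str, data: bytes) -> None:
        self._put_l1(block_hash, data)
        if self.l2 is not None:
            self.l2.put(block_hash, data)        # hypothetical remote put()

    def _put_l1(self, block_hash: str, data: bytes) -> None:
        self.l1[block_hash] = data
        self.l1.move_to_end(block_hash)
        while len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)          # evict the least-recently-used block
```

The design point this sketch captures is that the DRAM tier absorbs most hits at near-zero latency, while the remote tier is consulted only when explicitly enabled, so engines pay the network cost only for blocks worth sharing across nodes.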
@@ -93,7 +93,7 @@ Figure 1 and Table 1 illustrate the performance results of all systems (Cache-X
<p align="center"><strong><em>Figure 2:</em> Average Time to First Token (seconds) with Varied QPS - Workload-2</strong></p>
@@ -152,7 +152,7 @@ For Scenario-2, we construct a workload (i.e., Workload-3) that theoretically co
<br>
<br>
-### 2. Enhanced Routing Capabilities
+### Enhanced Routing Capabilities
This release upgrades routing logic with intelligent, adaptive strategies for LLM serving:
@@ -171,7 +171,7 @@ Inference engine such as vLLM provides prefix-caching where KV cache of existing
> To use [prefix-cache routing](https://aibrix.readthedocs.io/latest/features/gateway-plugins.html#routing-strategies), include header *`"routing-strategy": "prefix-cache"`*.
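As an illustration, such a request might look like the Python snippet below. The gateway address and model name are placeholders; only the `"routing-strategy": "prefix-cache"` header comes from the documentation referenced above.

```python
# Minimal example of setting the routing-strategy header on an OpenAI-compatible
# chat completion request. The gateway URL and model name are placeholders.
import requests

resp = requests.post(
    "http://<aibrix-gateway>/v1/chat/completions",   # placeholder gateway address
    headers={
        "routing-strategy": "prefix-cache",          # ask the gateway for prefix-cache routing
        "Content-Type": "application/json",
    },
    json={
        "model": "your-model-name",                  # placeholder model name
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json())
```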
Modern LLM deployments face unpredictable workloads, fluctuating user sessions, and a wide range of prompt/generation patterns. To meet these demands, AIBrix v0.3.0 introduces a **fully modular, production-grade benchmark toolkit** designed to evaluate AI inference systems with unmatched realism and flexibility.
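For intuition, the sketch below shows one simple way to measure average time-to-first-token (TTFT) at a fixed request rate, similar in spirit to the TTFT-vs-QPS results reported earlier. It is a simplified stand-in, not the AIBrix benchmark toolkit; the endpoint, payload, and streaming handling are assumptions.

```python
# Conceptual sketch only: a tiny open-loop load generator that fires requests at a
# fixed QPS and records time-to-first-token (TTFT). This is NOT the AIBrix benchmark
# toolkit; endpoint, payload, and streaming format are simplified placeholders.
import asyncio
import time

import aiohttp

GATEWAY = "http://<aibrix-gateway>/v1/completions"   # placeholder endpoint
QPS = 5                                              # target request rate
DURATION_S = 10                                      # how long to generate load


async def one_request(session: aiohttp.ClientSession, ttfts: list) -> None:
    payload = {"model": "your-model-name", "prompt": "Hello", "stream": True, "max_tokens": 64}
    start = time.perf_counter()
    async with session.post(GATEWAY, json=payload) as resp:
        async for _chunk in resp.content.iter_any():   # first streamed chunk ~ first token
            ttfts.append(time.perf_counter() - start)
            break


async def main() -> None:
    ttfts: list = []
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(QPS * DURATION_S):
            tasks.append(asyncio.create_task(one_request(session, ttfts)))
            await asyncio.sleep(1.0 / QPS)             # open-loop: fixed inter-arrival time
        await asyncio.gather(*tasks, return_exceptions=True)
    if ttfts:
        print(f"avg TTFT: {sum(ttfts) / len(ttfts):.3f}s over {len(ttfts)} requests")


if __name__ == "__main__":
    asyncio.run(main())
```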