As mentioned, our team (members: @lianghao208 @pokerfaceSad @will-qq @xdxd1234-bit) has developed an elastic memory management system to solve the KV cache over-allocation problem, which aligns with the kvcached project's goal. During performance benchmarking, our solution performed on par with vLLM, while KVCached showed a significant performance gap relative to both.
GPU: 1 x H20
Model: Llama2 7B
Benchmark: vllm bench + ShareGPT with random QPS

| Metric | vLLM | ours | kvcached |
| --- | --- | --- | --- |
| Mean TTFT (ms) | 80.95 | 80.69 | 267.72 |
| P99 TTFT (ms) | 260.72 | 263.28 | 915.25 |
| Mean TPOT (ms) | 29.56 | 29.71 | 68.22 |
| P99 TPOT (ms) | 65.19 | 65.18 | 218.51 |

Through our analysis, we identified several optimization opportunities in KVCached that could further enhance performance, as detailed below:
1. Redundant Object Creation Optimization
Implement an object pooling and reuse strategy (e.g. for KVCacheBlockClass instances in get_new_blocks()).
2. Reduce CUDA Call Overhead in available_size
Eliminate the expensive get_avail_physical_pages() CUDA calls made by the available_size() method during block allocation, to minimize block allocation latency.
3. Page Allocator Migration from Python to C++
Rewrite the page allocation logic from Python as a high-performance C++ implementation to eliminate Python interpreter overhead and GIL contention in memory allocation operations.
4. Asynchronous Pages Release
Transform the page release process from a synchronous to an asynchronous operation.
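
To make items 1 and 2 more concrete, here is a minimal sketch of what pooling plus a host-side free counter could look like. It is not kvcached's actual code: `Block`, `BlockPool`, and `free_blocks` are hypothetical names chosen for illustration, and only `get_new_blocks` and `available_size` echo the method names mentioned above. The idea is that released block objects are kept and reused instead of being re-created, and that the free count is maintained in Python so reading it does not require a device query.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a KV cache block object."""
    block_id: int = -1
    ref_cnt: int = 0


class BlockPool:
    """Reuses Block objects and tracks free capacity on the host side."""

    def __init__(self, num_blocks: int) -> None:
        # Reverse order so that pop() hands out the lowest ID first.
        self._free_ids = list(range(num_blocks - 1, -1, -1))
        self._spare: list[Block] = []  # released objects kept for reuse
        self._avail = num_blocks       # counter updated on every alloc/free

    def available_size(self) -> int:
        # O(1) read of the cached counter; no CUDA call on this path.
        return self._avail

    def get_new_blocks(self, n: int) -> list[Block]:
        if n > self._avail:
            raise ValueError("not enough free blocks")
        out = []
        for _ in range(n):
            # Reuse a spare object when one exists, otherwise create one.
            blk = self._spare.pop() if self._spare else Block()
            blk.block_id = self._free_ids.pop()
            blk.ref_cnt = 1
            out.append(blk)
        self._avail -= n
        return out

    def free_blocks(self, blocks: list[Block]) -> None:
        for blk in blocks:
            self._free_ids.append(blk.block_id)
            blk.block_id, blk.ref_cnt = -1, 0
            self._spare.append(blk)
        self._avail += len(blocks)
```

A cached counter like this is only valid if every code path that maps or unmaps physical pages also updates it; if other processes share the same GPU memory, the counter would need to be refreshed occasionally rather than trusted indefinitely.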
This issue tracks the implementation of critical performance optimizations for the system. The current implementation has several areas where overhead can be significantly reduced to improve overall system throughput and reduce latency. These optimizations target key bottlenecks identified in production usage, particularly focusing on reducing CUDA call overhead, Python interpreter overhead, object creation overhead, and synchronous page operations.
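
For the asynchronous page release item, one possible shape is a background worker fed by a queue. This is a sketch under assumptions rather than a proposal for the final design: `AsyncPageReleaser` and `release_fn` are hypothetical names, and `release_fn` stands in for whatever slow unmap or free call the allocator currently makes inline.

```python
import queue
import threading


class AsyncPageReleaser:
    """Hands page IDs to a worker thread so callers do not block on release."""

    def __init__(self, release_fn):
        self._release_fn = release_fn  # hypothetical: the slow unmap/free call
        self._q = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def release(self, page_ids):
        # Returns immediately; the actual release runs on the worker thread.
        self._q.put(list(page_ids))

    def _run(self):
        while True:
            batch = self._q.get()
            try:
                if batch is None:  # sentinel: stop the worker
                    return
                self._release_fn(batch)
            finally:
                self._q.task_done()

    def flush(self):
        # Block until every queued batch has been processed.
        self._q.join()

    def close(self):
        self._q.put(None)
        self._worker.join()


if __name__ == "__main__":
    released = []
    releaser = AsyncPageReleaser(released.extend)
    releaser.release([1, 2])
    releaser.release([3])
    releaser.flush()
    print(released)
    releaser.close()
```

Two caveats apply to a design like this. Pages that are queued for release must not be handed out again until the worker has finished with them, and a Python thread only helps if the underlying release call drops the GIL while it runs, which is one reason this item pairs naturally with the C++ migration in item 3.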