Performance Optimization Track Issue

As mentioned, our team (members: @lianghao208 @pokerfaceSad @will-qq @xdxd1234-bit) has developed an elastic memory management system to solve the KV cache over-allocation problem, which aligns with the goal of the kvcached project. During performance benchmarking, we observed a significant performance gap between kvcached and both vLLM and our solution.

- GPU: 1 × H20
- Model: Llama 2 7B
- Benchmark: `vllm bench` on the ShareGPT dataset with random QPS

| | vLLM | ours | kvcached |
| :--- | :--- | :--- | :--- |
| Mean TTFT (ms) | 80.95 | 80.69 | 267.72 |
| P99 TTFT (ms) | 260.72 | 263.28 | 915.25 |
| Mean TPOT (ms) | 29.56 | 29.71 | 68.22 |
| P99 TPOT (ms) | 65.19 | 65.18 | 218.51 |

Our analysis identified several optimization opportunities in kvcached that could close this gap, as detailed below:

- [ ] Redundant Object Creation Optimization #300 
- [ ] Reduce CUDA Call Overhead in available_size
- [ ] Page Allocator Migration from Python to C++ #319 
- [ ] Asynchronous Pages Release


# 1. Redundant Object Creation Optimization
Implement an object pooling and reuse strategy so that hot-path objects are recycled rather than constructed on every allocation (e.g. **KVCacheBlockClass** instances in `get_new_blocks()`).
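To make the pooling idea concrete, here is a minimal sketch. `Block` and `BlockPool` are hypothetical names used only for illustration; they are not the actual kvcached or vLLM classes, and the real block object carries more state than shown here.

```python
from typing import List


class Block:
    """Hypothetical stand-in for a KV cache block object."""

    __slots__ = ("block_id", "ref_cnt")

    def __init__(self, block_id: int = -1) -> None:
        self.block_id = block_id
        self.ref_cnt = 0


class BlockPool:
    """Recycles Block instances instead of constructing one per allocation."""

    def __init__(self) -> None:
        self._free: List[Block] = []

    def acquire(self, block_id: int) -> Block:
        # Reuse a pooled object when one is available; construct only on a miss.
        block = self._free.pop() if self._free else Block()
        block.block_id = block_id
        block.ref_cnt = 1
        return block

    def release(self, block: Block) -> None:
        # Reset state so a stale id or refcount never leaks to the next user.
        block.block_id = -1
        block.ref_cnt = 0
        self._free.append(block)
```

With this shape, an allocation path would call `acquire()` per block id and return objects through `release()` when the blocks are freed, so steady-state traffic creates no new Python objects.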

# 2. Reduce CUDA Call Overhead in available_size

Eliminate the expensive `get_avail_physical_pages()` CUDA call that `available_size()` makes during block allocation, in order to minimize block allocation latency.
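One possible way to do this, sketched below, is to query the driver once and then maintain the count in host memory as pages are allocated and freed. `CachedPageCounter` and its methods are hypothetical names, and `query_fn` stands in for whatever call currently performs the CUDA query; this is an assumption about the approach, not a description of the existing code.

```python
from typing import Callable


class CachedPageCounter:
    """Tracks available physical pages on the host instead of asking CUDA each time."""

    def __init__(self, query_fn: Callable[[], int]) -> None:
        # query_fn is the expensive call (assumed to hit the CUDA driver).
        self._query_fn = query_fn
        self._avail = query_fn()  # one query at startup

    def on_alloc(self, n_pages: int) -> None:
        self._avail -= n_pages

    def on_free(self, n_pages: int) -> None:
        self._avail += n_pages

    def available(self) -> int:
        # Hot path: a plain attribute read, no driver call.
        return self._avail

    def resync(self) -> None:
        # Re-read the true value, e.g. periodically or after an allocation failure.
        self._avail = self._query_fn()
```

If other processes can change the number of free physical pages, the cached value can drift, which is why the sketch keeps an explicit `resync()` rather than assuming the counter is always exact.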

# 3. Page Allocator Migration from Python to C++
Rewrite the page allocation logic from Python as a high-performance C++ implementation to eliminate Python interpreter overhead and GIL contention in memory allocation operations.

# 4. Asynchronous Pages Release
Change the page release process from a synchronous operation to an asynchronous one, so that freeing pages does not block the caller.
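A minimal sketch of the asynchronous release pattern follows, using a background worker thread and a queue. `AsyncPageReleaser` is a hypothetical name, and `release_fn` stands in for whatever function actually unmaps or frees the pages; neither is taken from the kvcached code base.

```python
import queue
import threading
from typing import Callable, List, Optional


class AsyncPageReleaser:
    """Hands page releases to a worker thread so the caller returns immediately."""

    def __init__(self, release_fn: Callable[[List[int]], None]) -> None:
        # release_fn is the slow, synchronous release call (hypothetical).
        self._release_fn = release_fn
        self._queue: "queue.Queue[Optional[List[int]]]" = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def release(self, page_ids: List[int]) -> None:
        # Enqueue a copy and return; the actual release runs on the worker.
        self._queue.put(list(page_ids))

    def _run(self) -> None:
        while True:
            batch = self._queue.get()
            try:
                if batch is None:  # shutdown sentinel
                    return
                self._release_fn(batch)
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        # Block until every queued release has been processed.
        self._queue.join()

    def close(self) -> None:
        self._queue.put(None)
        self._worker.join()
```

A Python thread only removes the latency from the caller if the underlying release call drops the GIL while it runs. Pages that are queued but not yet released must also not be counted as available until the worker has processed them.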

This issue tracks the implementation of critical performance optimizations for the system. The current implementation has several areas where overhead can be significantly reduced to improve overall system throughput and reduce latency. These optimizations target key bottlenecks identified in production usage, particularly focusing on reducing CUDA call overhead, Python interpreter overhead, object creation overhead, and synchronous page operations.

