
Performance Optimization Track Issue #299

@lianghao208


As mentioned, our team (members: @lianghao208, @pokerfaceSad, @will-qq, @xdxd1234-bit) has developed an elastic memory management system that addresses the KV cache over-allocation problem, in line with the goal of the kvcached project. During performance benchmarking, we found that our solution performs on par with vanilla vLLM, while kvcached shows a significant performance gap relative to both.

GPU: 1 × H20
Model: Llama 2 7B
Benchmark: `vllm bench` with the ShareGPT dataset and random QPS

| Metric | vLLM | ours | kvcached |
| --- | ---: | ---: | ---: |
| Mean TTFT (ms) | 80.95 | 80.69 | 267.72 |
| P99 TTFT (ms) | 260.72 | 263.28 | 915.25 |
| Mean TPOT (ms) | 29.56 | 29.71 | 68.22 |
| P99 TPOT (ms) | 65.19 | 65.18 | 218.51 |
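To put the gap in relative terms, dividing kvcached's numbers by vLLM's from the table above gives the following slowdown factors (rounded to two decimals):

$$
\frac{267.72}{80.95} \approx 3.31\ (\text{mean TTFT}),\quad
\frac{915.25}{260.72} \approx 3.51\ (\text{P99 TTFT}),\quad
\frac{68.22}{29.56} \approx 2.31\ (\text{mean TPOT}),\quad
\frac{218.51}{65.19} \approx 3.35\ (\text{P99 TPOT})
$$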

Through our analysis, we identified several optimization opportunities in kvcached that could further improve its performance, as detailed below:

1. Redundant Object Creation Optimization

Implement an object pooling and reuse strategy (e.g., for `KVCacheBlockClass` instances created in `get_new_blocks()`).
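A minimal sketch of what pooling could look like, assuming block objects carry only a small amount of per-use state. `KVCacheBlock`, `BlockObjectPool`, and this `get_new_blocks()` variant are hypothetical stand-ins, not kvcached's or vLLM's actual classes: released objects go back to a free list and are reset and reused on the next allocation instead of being constructed again.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KVCacheBlock:
    """Hypothetical stand-in for a KV cache block object."""
    block_id: int = -1
    ref_cnt: int = 0


class BlockObjectPool:
    """Reuses block objects instead of constructing one per allocation."""

    def __init__(self) -> None:
        self._free: List[KVCacheBlock] = []
        self.created = 0  # how many objects were actually constructed

    def acquire(self, block_id: int) -> KVCacheBlock:
        if self._free:
            blk = self._free.pop()
        else:
            blk = KVCacheBlock()
            self.created += 1
        # Reset per-use state so a recycled object looks freshly created.
        blk.block_id = block_id
        blk.ref_cnt = 1
        return blk

    def release(self, blk: KVCacheBlock) -> None:
        # Clear state before returning the object to the free list.
        blk.block_id = -1
        blk.ref_cnt = 0
        self._free.append(blk)


def get_new_blocks(pool: BlockObjectPool, block_ids: List[int]) -> List[KVCacheBlock]:
    """Hypothetical pooled variant of a get_new_blocks()-style helper."""
    return [pool.acquire(i) for i in block_ids]


# Usage: the second allocation reuses objects from the first one.
pool = BlockObjectPool()
blocks = get_new_blocks(pool, [0, 1, 2])
for b in blocks:
    pool.release(b)
blocks = get_new_blocks(pool, [3, 4])
print(pool.created)  # 3
```

Whether pooling pays off depends on how expensive the real block constructor is; profiling `get_new_blocks()` before and after is the way to confirm it.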

2. Reduce CUDA Call Overhead in available_size

Eliminate the expensive `get_avail_physical_pages()` CUDA calls made in the `available_size()` method during block allocation, to reduce block allocation latency.
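One possible approach, sketched under the assumption that the allocator sees every page allocation and release: keep a host-side counter of free pages so that `available_size()` becomes plain integer arithmetic, and consult the slow device query only in an occasional resync. `FreePageCounter`, `on_alloc`, `on_free`, `resync`, and the `query_physical_pages` callback are all hypothetical names; the callback stands in for whatever wraps the real CUDA query.

```python
import threading
from typing import Callable


class FreePageCounter:
    """Hypothetical host-side bookkeeping of free physical pages."""

    def __init__(self, total_pages: int, blocks_per_page: int,
                 query_physical_pages: Callable[[], int]) -> None:
        self._lock = threading.Lock()
        self._free_pages = total_pages
        self._blocks_per_page = blocks_per_page
        # Slow path (e.g. a CUDA query); used only for resync, never per allocation.
        self._query_physical_pages = query_physical_pages

    def on_alloc(self, n_pages: int) -> None:
        with self._lock:
            if n_pages > self._free_pages:
                raise RuntimeError("not enough free pages")
            self._free_pages -= n_pages

    def on_free(self, n_pages: int) -> None:
        with self._lock:
            self._free_pages += n_pages

    def available_size(self) -> int:
        # Fast path: no device call, just the cached counter.
        with self._lock:
            return self._free_pages * self._blocks_per_page

    def resync(self) -> None:
        # Refresh from the authoritative source, off the allocation hot path.
        with self._lock:
            self._free_pages = self._query_physical_pages()


# Usage with a fake query that records how often it is called.
calls = []


def fake_query() -> int:
    calls.append(1)
    return 90


counter = FreePageCounter(total_pages=100, blocks_per_page=16,
                          query_physical_pages=fake_query)
counter.on_alloc(10)
counter.on_free(5)
print(counter.available_size(), len(calls))  # 1520 0
counter.resync()
```

The trade-off is freshness: if other processes can consume GPU memory behind the allocator's back, the cached count can drift, so it would need periodic resyncs or a conservative bound.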

3. Page Allocator Migration from Python to C++

Rewrite the page allocation logic from Python into a high-performance C++ implementation, eliminating Python interpreter overhead and GIL contention in memory allocation operations.

4. Asynchronous Page Release

Convert the page release process from a synchronous to an asynchronous operation.
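A minimal sketch of asynchronous release, assuming the expensive part of freeing pages can be wrapped in a single callable. `AsyncPageReleaser` and `release_fn` are hypothetical: `free_pages()` only enqueues page ids and returns, and a background thread performs the actual release in FIFO order. This helps most when `release_fn` spends its time in native code that drops the GIL.

```python
import queue
import threading
from typing import Callable, List, Optional


class AsyncPageReleaser:
    """Hypothetical helper that moves page release off the caller's path."""

    def __init__(self, release_fn: Callable[[List[int]], None]) -> None:
        self._release_fn = release_fn
        self._queue: "queue.Queue[Optional[List[int]]]" = queue.Queue()
        self.errors: List[Exception] = []
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def free_pages(self, page_ids: List[int]) -> None:
        # Returns immediately; the actual release happens in the worker.
        self._queue.put(list(page_ids))

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            try:
                if item is None:  # shutdown sentinel
                    return
                self._release_fn(item)
            except Exception as exc:  # keep the worker alive, record the failure
                self.errors.append(exc)
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        # Block until every queued release has completed.
        self._queue.join()

    def close(self) -> None:
        self._queue.put(None)
        self._worker.join()


# Usage: releases are processed in order by the background worker.
released: List[int] = []
releaser = AsyncPageReleaser(lambda ids: released.extend(ids))
releaser.free_pages([1, 2])
releaser.free_pages([3])
releaser.flush()
print(released)  # [1, 2, 3]
releaser.close()
```

One design point to watch: pages still sitting in the queue are neither free nor in use, so the allocator must not hand them out (or count them as available) until the worker has actually released them.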

This issue tracks the implementation of critical performance optimizations for the system. The current implementation has several areas where overhead can be significantly reduced to improve overall system throughput and reduce latency. These optimizations target key bottlenecks identified in production usage, particularly focusing on reducing CUDA call overhead, Python interpreter overhead, object creation overhead, and synchronous page operations.
