[feature] add gpu memory management #324
/gemini review |
Code Review
This pull request introduces KV cache management APIs for vLLM and SGLang, enabling users to query and manage KV cache memory directly through the inference framework's port. It implements an auto-calculated safety floor mechanism to prevent invalid memory configurations and adds autopatch support to inject these endpoints into the HTTP servers. The changes also include updates to the PageAllocator and MemInfoTracker to support these features and improve prefix cache eviction logic during shrink mode. The reviewer identified an incorrect CLI flag in the documentation for disabling prefix caching in vLLM, which should be corrected to --disable-prefix-caching.
| # vLLM | ||
| vllm serve MODEL --no-enable-prefix-caching |
The command to disable prefix caching in vLLM appears to be incorrect. The correct flag to disable this feature in vLLM is --disable-prefix-caching, as --no-enable-prefix-caching is not a valid argument. Please update the example command for accuracy.
Suggested change:
  # vLLM
- vllm serve MODEL --no-enable-prefix-caching
+ vllm serve MODEL --disable-prefix-caching
@cui36 FYI |
Hi @flyerming, thanks so much for the contribution! Would you mind joining our kvcached-slack to walk us through your requirements in a bit more detail? We'd love to learn more about your use case and see how we can best support you. Look forward to the collaboration! |
I've joined the OVG Project. My ID is flyerming. How can I get in touch with you? |
Summary:
For cloud-hosted large model services, we need dynamic model service management.
In recent tests, I found that when a model is served with kvcached support, HBM fills up after serving for a period of time. This prevents the scheduling platform from dynamically deploying new models to GPUs whose existing models are only lightly loaded.
To address this, I developed a KV cache management feature for kvcached.
Use these endpoints to manage GPU memory:
/kvcache/status
/kvcache/limit
/kvcache/limit_percent
/kvcache/trim
/kvcache/safety_floor
Tested with vLLM 0.19.1.
Example:
curl http://localhost:8080/kvcache/status
curl -X POST http://localhost:8080/kvcache/limit_percent -H "Content-Type: application/json" -d '{"percent": 20}'
curl -X POST http://localhost:8080/kvcache/limit -H "Content-Type: application/json" -d '{"size": "3G"}'
To lift the GPU memory restriction on this model, set a very large GPU memory limit:
curl -X POST http://localhost:8080/kvcache/limit -H "Content-Type: application/json" -d '{"size": "90G"}'
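For a scheduling platform that drives these endpoints programmatically, the curl calls above could be wrapped in a small client. The following is only a sketch using the Python standard library: the helper names `build_request` and `call` are hypothetical, the base URL is taken from the examples above, and the response is returned as raw text because the response schema is not described in this thread.

```python
import json
import urllib.request

# Assumption: same host and port as the curl examples above.
BASE_URL = "http://localhost:8080"


def build_request(path, payload=None):
    """Build a GET request (no payload) or a JSON POST request (with payload).

    Hypothetical helper; it only mirrors the shape of the curl examples.
    """
    url = BASE_URL + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path, payload=None, timeout=5.0):
    """Send the request and return the response body as text.

    The body is not parsed, since the response format is an unknown here.
    """
    request = build_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage against a running server (not executed here):
#   call("/kvcache/status")
#   call("/kvcache/limit_percent", {"percent": 20})
#   call("/kvcache/limit", {"size": "3G"})
```

Keeping request construction separate from sending makes it possible to check the method, URL, and JSON body without a live server.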