[feature] add gpu memory management #324

Open
flyerming wants to merge 3 commits into ovg-project:main from flyerming:add_hbm_memory_recycling_port

flyerming commented May 6, 2026

Summary:
In cloud large-model serving, we need dynamic model service management.
In recent tests, I found that when a model runs with kvcached support, HBM fills up after the model has been serving for a while. This prevents the scheduling platform from dynamically deploying new models onto GPUs whose existing models are only lightly loaded.
I developed a KV cache management feature for kvcached.
Use these endpoints to manage GPU memory:
/kvcache/status
/kvcache/limit
/kvcache/limit_percent
/kvcache/trim
/kvcache/safety_floor

Tested with vLLM 0.19.1.

Example:
curl http://localhost:8080/kvcache/status
curl -X POST http://localhost:8080/kvcache/limit_percent -H "Content-Type: application/json" -d '{"percent": 20}'
curl -X POST http://localhost:8080/kvcache/limit -H "Content-Type: application/json" -d '{"size": "3G"}'

To lift the GPU memory restriction on this model, set a very large GPU memory limit:
curl -X POST http://localhost:8080/kvcache/limit -H "Content-Type: application/json" -d '{"size": "90G"}'
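
For a scheduling platform that drives these endpoints from code rather than curl, the same calls could be sketched in Python roughly as follows. This is only an illustration that mirrors the curl examples above: the helper names `build_request` and `call` are hypothetical and not part of this PR, the base URL is taken from the examples, and the response format is not specified here, so the body is returned as raw text.

```python
import json
import urllib.request

# Assumption: the inference server listens here, as in the curl examples above.
BASE_URL = "http://localhost:8080"


def build_request(path, payload=None):
    """Build a GET (no payload) or JSON POST request for a /kvcache/* endpoint.

    Hypothetical helper for illustration; not part of this PR.
    """
    url = f"{BASE_URL}{path}"
    if payload is None:
        return urllib.request.Request(url, method="GET")
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path, payload=None, timeout=5.0):
    """Send the request and return the raw response body as text.

    The response schema is not described in this PR, so nothing is parsed.
    """
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With a server running, `call("/kvcache/status")` corresponds to the first curl example, `call("/kvcache/limit_percent", {"percent": 20})` to the second, and `call("/kvcache/limit", {"size": "3G"})` to the third.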

@flyerming
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces KV cache management APIs for vLLM and SGLang, enabling users to query and manage KV cache memory directly through the inference framework's port. It implements an auto-calculated safety floor mechanism to prevent invalid memory configurations and adds autopatch support to inject these endpoints into the HTTP servers. The changes also include updates to the PageAllocator and MemInfoTracker to support these features and improve prefix cache eviction logic during shrink mode. The reviewer identified an incorrect CLI flag in the documentation for disabling prefix caching in vLLM, which should be corrected to --disable-prefix-caching.

Comment on lines +418 to +419
# vLLM
vllm serve MODEL --no-enable-prefix-caching
Contributor

medium

The command to disable prefix caching in vLLM appears to be incorrect. The correct flag to disable this feature in vLLM is --disable-prefix-caching, as --no-enable-prefix-caching is not a valid argument. Please update the example command for accuracy.

Suggested change
- # vLLM
- vllm serve MODEL --no-enable-prefix-caching
+ # vLLM
+ vllm serve MODEL --disable-prefix-caching

@flyerming
Author

@cui36 FYI

@cui36
Collaborator

cui36 commented May 6, 2026

Hi @flyerming, thanks so much for the contribution!

Would you mind joining our kvcached-slack to walk us through your requirements in a bit more detail? We'd love to learn more about your use case and see how we can best support you.

Look forward to the collaboration!

@flyerming
Author

> Hi @flyerming, thanks so much for the contribution!
>
> Would you mind joining our kvcached-slack to walk us through your requirements in a bit more detail? We'd love to learn more about your use case and see how we can best support you.
>
> Look forward to the collaboration!

I've joined the OVG Project. My ID is flyerming. How can I get in touch with you?
