[feature]add gpu memory management by flyerming · Pull Request #324 · ovg-project/kvcached

flyerming · 2026-05-06T09:30:29Z

Summary:
In cloud-hosted large-model serving, we need to manage model services dynamically.
In recent tests, I found that when a model runs with kvcached support, HBM fills up after it has been serving for a while. This prevents the scheduling platform from dynamically deploying new models to GPUs whose existing models are only lightly loaded.
I developed a KV cache memory management feature for kvcached.
Use these endpoints to manage GPU memory:
/kvcache/status
/kvcache/limit
/kvcache/limit_percent
/kvcache/trim
/kvcache/safety_floor

Tested with vLLM 0.19.1.

Example:
curl http://localhost:8080/kvcache/status
curl -X POST http://localhost:8080/kvcache/limit_percent -H "Content-Type: application/json" -d '{"percent": 20}'
curl -X POST http://localhost:8080/kvcache/limit -H "Content-Type: application/json" -d '{"size": "3G"}'

To lift the GPU memory restriction on this model, set a very large limit:
curl -X POST http://localhost:8080/kvcache/limit -H "Content-Type: application/json" -d '{"size": "90G"}'
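
The curl calls above can also be issued from a script. The following is a minimal Python sketch, not part of this PR: it only reuses the host, port, paths, and JSON payloads shown in the examples. The helper names `build_request` and `send` are hypothetical, and because the response format is not described here, `send` returns the raw response text.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # host and port taken from the curl examples


def build_request(path, payload=None, base_url=BASE_URL):
    """Build (but do not send) a request for one of the /kvcache endpoints.

    With no payload this is a GET; with a payload it is a JSON POST.
    Hypothetical helper, not part of kvcached.
    """
    url = base_url.rstrip("/") + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5.0):
    """Send a prepared request and return the raw response text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# The same three calls as the curl examples, prepared but not yet sent.
status_req = build_request("/kvcache/status")
percent_req = build_request("/kvcache/limit_percent", {"percent": 20})
size_req = build_request("/kvcache/limit", {"size": "3G"})
```

Calling `send(status_req)` against a running server performs the actual request. Building and sending are kept separate so the requests can be inspected without a live server.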

flyerming · 2026-05-06T09:51:26Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces KV cache management APIs for vLLM and SGLang, enabling users to query and manage KV cache memory directly through the inference framework's port. It implements an auto-calculated safety floor mechanism to prevent invalid memory configurations and adds autopatch support to inject these endpoints into the HTTP servers. The changes also include updates to the PageAllocator and MemInfoTracker to support these features and improve prefix cache eviction logic during shrink mode. The reviewer identified an incorrect CLI flag in the documentation for disabling prefix caching in vLLM, which should be corrected to --disable-prefix-caching.
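
To make the safety floor idea concrete, here is a minimal Python sketch of one possible reading: a requested limit is parsed from a size string and never applied below the floor. This is an illustration only, not the PR's implementation. The names `parse_size` and `clamp_to_floor`, the use of binary units for "G", and clamping (rather than rejecting) a too-small request are all assumptions.

```python
def parse_size(text):
    """Convert a size string such as "3G" into bytes.

    Assumption: suffixes are binary units (1G = 1024**3 bytes).
    A string without a suffix is read as a plain byte count.
    """
    units = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(float(text[:-1]) * units[text[-1]])
    return int(text)


def clamp_to_floor(requested_bytes, floor_bytes):
    """Return the limit to apply, never lower than the safety floor.

    Assumption: a request below the floor is raised to the floor
    instead of being rejected.
    """
    return max(requested_bytes, floor_bytes)


# A 1G request against a hypothetical 2G floor is raised to 2G.
effective = clamp_to_floor(parse_size("1G"), parse_size("2G"))
```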

gemini-code-assist · 2026-05-06T09:54:50Z

+# vLLM
+vllm serve MODEL --no-enable-prefix-caching


The command to disable prefix caching in vLLM appears to be incorrect. The correct flag to disable this feature in vLLM is --disable-prefix-caching, as --no-enable-prefix-caching is not a valid argument. Please update the example command for accuracy.

Suggested change

-# vLLM
-vllm serve MODEL --no-enable-prefix-caching
+# vLLM
+vllm serve MODEL --disable-prefix-caching

flyerming · 2026-05-06T10:07:36Z

@cui36 FYI

cui36 · 2026-05-06T15:22:37Z

Hi @flyerming, thanks so much for the contribution!

Would you mind joining our kvcached-slack to walk us through your requirements in a bit more detail? We'd love to learn more about your use case and see how we can best support you.

Look forward to the collaboration!

flyerming · 2026-05-07T03:30:47Z

I've joined the OVG Project. My ID is flyerming. How can I get in touch with you?

flyerming added 3 commits May 6, 2026 17:09

add hbm memory management

d5a84b8

api doc for english

76a9877

Merge branch 'main' into add_hbm_memory_recycling_port

4d952ab

flyerming wants to merge 3 commits into ovg-project:main from flyerming:add_hbm_memory_recycling_port
