@ZackyLake
Details:

A proposal to eagerly release the output memory of kvcache-related primitives, so that unnecessary memory/tensors are not kept alive during the idle state.

Currently the readvalue primitive keeps a reference to the "past state" in its output, which leads to a memory spike due to a duplicated kvcache. This is not fully resolvable, since the output is used during the kvcache node's execution, but we can release it after the whole graph has executed.

The kvcache primitive also keeps a reference to the "new state" in its output, which should also be held by the variable state. This prevents us from releasing the kvcache explicitly in OVEP.

Memory reclaim/release only happens during primitive execution, so the idle state suffers from high memory usage.

With this change, memory usage drops right before returning from execution, and releasing (resetting) the variable state correctly releases the memory.
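The ownership problem described above can be illustrated with a minimal, self-contained sketch. It is not the plugin's actual code: `Buffer`, `VariableState`, `KVCachePrimitive`, `release_output`, and `freed_after_reset` are hypothetical names invented for illustration, and `std::shared_ptr` merely stands in for whatever reference-counted memory handle the plugin uses.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a device memory buffer.
struct Buffer {
    explicit Buffer(std::size_t n) : bytes(n) {}
    std::vector<char> bytes;
};

// Hypothetical stand-in for a variable state that holds the kvcache.
struct VariableState {
    std::shared_ptr<Buffer> memory;
    void reset() { memory.reset(); }
};

// Hypothetical stand-in for a kvcache primitive instance.
struct KVCachePrimitive {
    // Extra reference to the "new state", kept alive between executions.
    std::shared_ptr<Buffer> output;

    void execute(VariableState& state, std::size_t new_size) {
        auto new_state = std::make_shared<Buffer>(new_size);
        state.memory = new_state;  // reference held by the variable state
        output = new_state;        // second reference held by the output
    }

    // The eager release proposed here: drop the output reference
    // once the whole graph has finished executing.
    void release_output() { output.reset(); }
};

// Returns true if resetting the state actually frees the buffer.
bool freed_after_reset(bool eager_release) {
    VariableState state;
    KVCachePrimitive prim;
    prim.execute(state, 1024);
    if (eager_release) {
        prim.release_output();
    }
    std::weak_ptr<Buffer> probe = state.memory;  // observes without owning
    state.reset();
    return probe.expired();
}
```

In this sketch, `freed_after_reset(false)` reports that the buffer survives the reset because the primitive's output still references it, while `freed_after_reset(true)` reports that it is freed. The sketch is single-threaded and does not model the synchronization concerns raised below.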

There are some concerns about synchronization and data races, so please let us know how this can be made to work properly.

Tickets:

@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Nov 18, 2025
@sys-openvino-ci sys-openvino-ci added the ExternalIntelPR External contributor from Intel label Nov 18, 2025