Conversation

@hlin99 (Contributor) commented Dec 23, 2025

Logic to handle chunked prefill/prefix caching for HPU
Due to HPU padding constraints, batching requests with existing
history (ctx != 0) causes excessive memory usage, as the entire
batch must be padded to the longest context, leading to OOM.

This patch enforces a batch size of 1 for prefill operations when
ctx != 0. Although this sacrifices some throughput in corner cases,
it effectively eliminates the OOM risk.

Signed-off-by: Tony Lin <tony.lin@intel.com>
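
For illustration only, here is a minimal sketch of the batching rule described above, assuming hypothetical names (`PrefillRequest`, `form_prefill_batches`) that are not part of the actual vllm-gaudi code: prefill requests with existing context (ctx != 0) are scheduled as batches of 1, while fresh prefills (ctx == 0) are still batched up to the configured limit.

```python
from dataclasses import dataclass


@dataclass
class PrefillRequest:
    req_id: str
    num_computed_tokens: int  # existing context length (ctx)
    num_new_tokens: int       # tokens to prefill in this step


def form_prefill_batches(requests: list[PrefillRequest],
                         max_batch_size: int) -> list[list[PrefillRequest]]:
    """Group prefill requests into batches.

    Requests with ctx != 0 (chunked-prefill continuations or prefix-cache
    hits) run alone, because padding a mixed batch to the longest context
    on HPU inflates memory use and can cause OOM.
    """
    batches: list[list[PrefillRequest]] = []
    current: list[PrefillRequest] = []
    for req in requests:
        if req.num_computed_tokens != 0:
            # Flush the in-progress batch, then run this request as a batch of 1.
            if current:
                batches.append(current)
                current = []
            batches.append([req])
        else:
            current.append(req)
            if len(current) == max_batch_size:
                batches.append(current)
                current = []
    if current:
        batches.append(current)
    return batches
```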
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
326e7c31055812277957e3e2b43715b4f366facb

@hlin99 hlin99 changed the title Logic to handle chunked prefill/prefix caching for HPU Prefill batching logic to handle chunked prefill/prefix caching for HPU Dec 23, 2025
@xuechendi (Collaborator) left a comment

LGTM. @adobrzyn, could you take a second review?

@xuechendi xuechendi self-assigned this Jan 6, 2026
@xuechendi (Collaborator)

@hlin99, hmm, on second thought, I am a little unsure how the changes impact unified attention. Will need @kzawora-intel to check.

@xuechendi xuechendi assigned kzawora-intel and unassigned xuechendi Jan 6, 2026
@hlin99 (Contributor, Author) commented Jan 7, 2026

> @hlin99, hmm, on second thought, I am a little unsure how the changes impact unified attention. Will need @kzawora-intel to check.

Sure. I wasn't aware unified attention (UA) goes through the same code path. If UA can overcome the HPU padding constraint, we can definitely split the code into UA and non-UA paths. @kzawora-intel, please advise. Thanks.

@afierka-intel afierka-intel merged commit 872795d into vllm-project:main Jan 13, 2026
50 checks passed
jinyouzhi pushed a commit to jinyouzhi/vllm-gaudi that referenced this pull request Jan 14, 2026
…PU (vllm-project#753)

Logic to handle chunked prefill/prefix caching for HPU
Due to HPU padding constraints, batching requests with existing
history (ctx != 0) causes excessive memory usage, as the entire
batch must be padded to the longest context, leading to OOM.

This patch enforces a batch size of 1 for prefill operations when
ctx != 0. Although this sacrifices some throughput in corner cases,
it effectively eliminates the OOM risk.

Signed-off-by: Tony Lin <tony.lin@intel.com>
Signed-off-by: Jin, Youzhi <youzhi.jin@intel.com>