
Conversation

@hsubramony
Contributor

No description provided.

@github-actions

github-actions bot commented Oct 3, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

Signed-off-by: Harish Subramony <[email protected]>
@github-actions

github-actions bot commented Oct 3, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

@github-actions

github-actions bot commented Oct 3, 2025

✅ CI Passed

All checks passed successfully against the following vllm commit:
be22bb6f3dd7aaf8559a4a0a1beb98a37a5a8138


MODELS=(
"/root/software/data/pytorch/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/"
"/software/data/pytorch/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/"
Collaborator

let's not use an internal path here; I didn't realize that last time.

#)
#MODELS=(
# "Qwen/Qwen3-0.6B"
#)
Collaborator

please clean up the model comments here

# --port 9111 \
# --seed "$(date +%s)" \
# --model /root/software/data/pytorch/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/ \
# --tokenizer /root/software/data/pytorch/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/ \
Collaborator

same here

def wait_for_save(self):
    assert self.connector_worker is not None
    assert isinstance(self._connector_metadata, NixlConnectorMetadata)
    self.connector_worker.rewrite_kv_based_on_transfer_layout(self._connector_metadata)
Collaborator

please elaborate on what rewrite_kv_based_on_transfer_layout is meant to do here

Collaborator

I mean, add dev-doc comments for future code readers.
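
For example, a docstring along these lines could work (a sketch inferred from the code in this diff, not the author's wording):

def rewrite_kv_based_on_transfer_layout(self, metadata: NixlConnectorMetadata):
    """Reorder the KV blocks that are about to be transferred.

    Each block is stored as (block_size, num_kv_heads, head_size). When the
    decoder runs with a different TP size, each decoder rank only owns
    num_kv_heads / decoder_tp_ratio heads, so this pass splits the heads of
    every block to be saved into decoder_tp_ratio groups and lays each group
    out contiguously, so that one contiguous region of the block maps to one
    decoder rank's head slice during the NIXL transfer.
    """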

Collaborator

please add a conditional check and only enable this when P and D have different TP sizes.
Please assert when the split isn't possible, i.e. when the ratio is not 2x or 4x.

Contributor Author

It's based on the ratio check, which needs to be specified on the command line; there is no other way to get it.
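
For reference, a minimal sketch of the guard being asked for above, assuming the ratio keeps coming from the DECODER_TP_RATIO environment variable used elsewhere in this diff:

decoder_tp_ratio = int(os.getenv('DECODER_TP_RATIO', 1))
if decoder_tp_ratio > 1:  # only rewrite when P and D run with different TP sizes
    assert decoder_tp_ratio in (2, 4), \
        f"unsupported decoder TP ratio {decoder_tp_ratio}: only 2x and 4x splits are handled"
    self.connector_worker.rewrite_kv_based_on_transfer_layout(self._connector_metadata)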

kv_selected = torch.concat(vecs, dim=1).reshape(kv_selected.shape)
kv.index_copy_(dim=0, index=indices, source=kv_selected)
if len(metadata.reqs_to_save) > 0:
    torch.hpu.synchronize()
Collaborator

is the sync necessary?

kv_selected = torch.index_select(kv, 0, indices)
bc, bs, h, d = kv_selected.shape
shape = int(bs * h / decoder_tp_ratio * d)
blocks = torch.chunk(kv_selected, 2, dim=2)
Collaborator

why is 2 hard-coded?



def rewrite_kv_based_on_transfer_layout(self, metadata: NixlConnectorMetadata):
    decoder_tp_ratio = int(os.getenv('DECODER_TP_RATIO', 1))
Collaborator

Is this one necessary? Can you get it from somewhere else?

Contributor Author

Yes, I'm not sure if there is another way to get the ratio.

blocks = torch.chunk(kv_selected, 2, dim=2)
vecs = [b.reshape([bc, shape]) for b in blocks]
kv_selected = torch.concat(vecs, dim=1).reshape(kv_selected.shape)
kv.index_copy_(dim=0, index=indices, source=kv_selected)
Collaborator

The implementation here doesn't look efficient to me. Does kv here mean the host_buffer only?
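
As a possible simplification, the chunk/reshape/concat round trip is equivalent to a single view + permute (a sketch; r stands for the number of chunks, currently hard-coded to 2, eventually decoder_tp_ratio):

# kv_selected: (bc, bs, h, d) -> split heads into r contiguous groups per block
kv_selected = (kv_selected.view(bc, bs, r, h // r, d)
                          .permute(0, 2, 1, 3, 4)
                          .reshape(bc, bs, h, d))
kv.index_copy_(dim=0, index=indices, source=kv_selected)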

self.profiler.record_counter(self.event_start, counters)

if decoder_tp_ratio > 1:
    self.rewrite_kv_based_on_transfer_layout(scheduler_output)
Collaborator

Does this happen every time after the model forward pass?


It only runs when there is a KV transfer.
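
i.e. at the call site it amounts to something like this (a sketch combining the two conditions already present in this diff):

if decoder_tp_ratio > 1 and scheduler_output.kv_connector_metadata:
    # skip the rewrite on steps that produced no KV blocks to transfer
    self.rewrite_kv_based_on_transfer_layout(scheduler_output)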

for layer_idx in range(len(self.kv_caches)):
    k = self.kv_caches[layer_idx][0]
    v = self.kv_caches[layer_idx][1]
    gb, h, d = v.shape
@libinta Oct 9, 2025

gb, h, d means (block_count * block_size, num_kv_heads, head_size)
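
Written as inline comments, that would look roughly like this (a sketch using the names from the diff):

for layer_idx in range(len(self.kv_caches)):
    k = self.kv_caches[layer_idx][0]  # (num_blocks * block_size, num_kv_heads, head_size)
    v = self.kv_caches[layer_idx][1]  # same layout as k
    gb, h, d = v.shape                # gb = num_blocks * block_size
    gbhd = [int(gb / self.block_size), self.block_size, h, d]
    for kv_tensor in [k, v]:
        kv = kv_tensor.reshape(gbhd)  # (num_blocks, block_size, num_kv_heads, head_size)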

indices = torch.tensor(block_ids, device=v.device)
gbhd = [int(gb / self.block_size), self.block_size, h, d]
for kv_tensor in [k, v]:
    kv = kv_tensor.reshape(gbhd)

add comments

_TYPE_CACHE: dict[str, dict[str, Any]] = {}

hpu_buffer: list[list[torch.Tensor]] = []
decoder_tp_ratio = int(os.getenv('DECODER_TP_RATIO', 1))
Collaborator

can you get it from nixl_connector?


please check what the tp_ratio is during the handshake and pass it here

kv_selected = torch.index_select(kv, 0, indices)
bc, bs, h, d = kv_selected.shape
shape = int(bs * h / decoder_tp_ratio * d)
blocks = torch.chunk(kv_selected, 2, dim=2)
Collaborator

is 2 hard-coded?

@libinta Oct 9, 2025

change 2 to the TP ratio
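
i.e. something along these lines (a sketch; decoder_tp_ratio is the env-derived ratio used elsewhere in this diff):

kv_selected = torch.index_select(kv, 0, indices)
bc, bs, h, d = kv_selected.shape
shape = int(bs * h / decoder_tp_ratio * d)
# one head group per decoder rank instead of a fixed split into 2
blocks = torch.chunk(kv_selected, decoder_tp_ratio, dim=2)
vecs = [b.reshape([bc, shape]) for b in blocks]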

gb, h, d = v.shape
indices = torch.tensor(block_ids, device=v.device)
gbhd = [int(gb / self.block_size), self.block_size, h, d]
for kv_tensor in [k, v]:
Collaborator

please add dev-doc comments for the dimension names and the shapes before/after

return model_runner_output

def rewrite_kv_based_on_transfer_layout(self, scheduler_output: "SchedulerOutput"):
    if scheduler_output.kv_connector_metadata:

add a condition to make sure the number of KV heads is divisible by tp_ratio; otherwise assert with a message like "model kv head count can't be divided by tp_ratio".
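
A sketch of that check, with h being the num_kv_heads dimension from the snippets above:

# guard against head counts that cannot be split evenly across decoder ranks
assert h % decoder_tp_ratio == 0, (
    f"model kv head count {h} can't be divided by tp_ratio {decoder_tp_ratio}")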

@xuechendi
Collaborator

I also have an overall question: is the code here assuming the prefill node is HPU or CUDA?
We also need to add a check for NHD vs. HND layout here.

@xinyu-intel
Contributor

What does p2d4 mean here? P: TP2, D: TP4; or P: 2xTP1, D: 4xTP1; or P: TP2, D: 2xTP2?
