
Conversation

@hsubramony
Contributor

No description provided.

@github-actions

github-actions bot commented Oct 3, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

Signed-off-by: Harish Subramony <[email protected]>
@github-actions

github-actions bot commented Oct 3, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.

@github-actions

github-actions bot commented Oct 3, 2025

✅ CI Passed

All checks passed successfully against the following vllm commit:
be22bb6f3dd7aaf8559a4a0a1beb98a37a5a8138


MODELS=(
"/root/software/data/pytorch/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/"
"/software/data/pytorch/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/"
Collaborator

let's not use an internal path here; I didn't realize that last time.

#)
#MODELS=(
# "Qwen/Qwen3-0.6B"
#)
Collaborator

please clean up the model comments here

# --port 9111 \
# --seed "$(date +%s)" \
# --model /root/software/data/pytorch/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/ \
# --tokenizer /root/software/data/pytorch/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/ \
Collaborator

same here

def wait_for_save(self):
    assert self.connector_worker is not None
    assert isinstance(self._connector_metadata, NixlConnectorMetadata)
    self.connector_worker.rewrite_kv_based_on_transfer_layout(self._connector_metadata)
Collaborator

please elaborate on what rewrite_kv_based_on_transfer_layout is meant to do here

Collaborator

I mean, add dev-doc comments for future code readers.
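
For example, a docstring along these lines could work (a sketch inferred from the code in this diff, not the author's wording):

def rewrite_kv_based_on_transfer_layout(self, metadata: NixlConnectorMetadata):
    """Reorder the KV blocks that are about to be transferred.

    Each block is stored as (block_size, num_kv_heads, head_size). When the
    decoder runs with a different TP size, each decoder rank only owns
    num_kv_heads / decoder_tp_ratio heads, so this pass splits the heads of
    every block to be saved into decoder_tp_ratio groups and lays each group
    out contiguously, so that one contiguous region of the block maps to one
    decoder rank's head slice during the NIXL transfer.
    """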

Collaborator

please add a conditional check and only enable this when P and D have different TP sizes.
Please assert when the split isn't possible, i.e. when the ratio is not 2x or 4x.

Contributor Author

It's based on the ratio check, which needs to be specified on the command line; there is no other way to get it.
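
For reference, a minimal sketch of the guard being asked for above, assuming the ratio keeps coming from the DECODER_TP_RATIO environment variable used elsewhere in this diff:

decoder_tp_ratio = int(os.getenv('DECODER_TP_RATIO', 1))
if decoder_tp_ratio > 1:  # only rewrite when P and D run with different TP sizes
    assert decoder_tp_ratio in (2, 4), \
        f"unsupported decoder TP ratio {decoder_tp_ratio}: only 2x and 4x splits are handled"
    self.connector_worker.rewrite_kv_based_on_transfer_layout(self._connector_metadata)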

kv_selected = torch.concat(vecs, dim=1).reshape(kv_selected.shape)
kv.index_copy_(dim=0, index=indices, source=kv_selected)
if len(metadata.reqs_to_save) > 0:
    torch.hpu.synchronize()
Collaborator

is the sync necessary?

kv_selected = torch.index_select(kv, 0, indices)
bc, bs, h, d = kv_selected.shape
shape = int(bs * h / decoder_tp_ratio * d)
blocks = torch.chunk(kv_selected, 2, dim=2)
Collaborator

why is 2 hard-coded?



def rewrite_kv_based_on_transfer_layout(self, metadata: NixlConnectorMetadata):
    decoder_tp_ratio = int(os.getenv('DECODER_TP_RATIO', 1))
Collaborator

Is this one necessary? Can you get it from somewhere else?

Contributor Author

Yes, I'm not sure if there is another way to get the ratio.

blocks = torch.chunk(kv_selected, 2, dim=2)
vecs = [b.reshape([bc, shape]) for b in blocks]
kv_selected = torch.concat(vecs, dim=1).reshape(kv_selected.shape)
kv.index_copy_(dim=0, index=indices, source=kv_selected)
Collaborator

The implementation here doesn't look efficient to me. Does kv here mean the host_buffer only?
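
As a possible simplification, the chunk/reshape/concat round trip is equivalent to a single view + permute (a sketch; r stands for the number of chunks, currently hard-coded to 2, eventually decoder_tp_ratio):

# kv_selected: (bc, bs, h, d) -> split heads into r contiguous groups per block
kv_selected = (kv_selected.view(bc, bs, r, h // r, d)
                          .permute(0, 2, 1, 3, 4)
                          .reshape(bc, bs, h, d))
kv.index_copy_(dim=0, index=indices, source=kv_selected)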

self.profiler.record_counter(self.event_start, counters)

if decoder_tp_ratio > 1:
    self.rewrite_kv_based_on_transfer_layout(scheduler_output)
Collaborator

Does this happen every time after the model forward pass?


It only runs when there is a KV transfer.
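
i.e. at the call site it amounts to something like this (a sketch combining the two conditions already present in this diff):

if decoder_tp_ratio > 1 and scheduler_output.kv_connector_metadata:
    # skip the rewrite on steps that produced no KV blocks to transfer
    self.rewrite_kv_based_on_transfer_layout(scheduler_output)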

for layer_idx in range(len(self.kv_caches)):
    k = self.kv_caches[layer_idx][0]
    v = self.kv_caches[layer_idx][1]
    gb, h, d = v.shape
@libinta Oct 9, 2025

gb, h, d means (block_count * block_size, num_kv_heads, head_size)
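
Written as inline comments, that would look roughly like this (a sketch using the names from the diff):

for layer_idx in range(len(self.kv_caches)):
    k = self.kv_caches[layer_idx][0]  # (num_blocks * block_size, num_kv_heads, head_size)
    v = self.kv_caches[layer_idx][1]  # same layout as k
    gb, h, d = v.shape                # gb = num_blocks * block_size
    gbhd = [int(gb / self.block_size), self.block_size, h, d]
    for kv_tensor in [k, v]:
        kv = kv_tensor.reshape(gbhd)  # (num_blocks, block_size, num_kv_heads, head_size)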

indices = torch.tensor(block_ids, device=v.device)
gbhd = [int(gb / self.block_size), self.block_size, h, d]
for kv_tensor in [k, v]:
    kv = kv_tensor.reshape(gbhd)

add comments

_TYPE_CACHE: dict[str, dict[str, Any]] = {}

hpu_buffer: list[list[torch.Tensor]] = []
decoder_tp_ratio = int(os.getenv('DECODER_TP_RATIO', 1))
Collaborator

can you get it from nixl_connector?


please check what the tp_ratio is during the handshake and pass it here

kv_selected = torch.index_select(kv, 0, indices)
bc, bs, h, d = kv_selected.shape
shape = int(bs * h / decoder_tp_ratio * d)
blocks = torch.chunk(kv_selected, 2, dim=2)
Collaborator

is 2 hard-coded?

@libinta Oct 9, 2025

change 2 to the TP ratio
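
i.e. something along these lines (a sketch; decoder_tp_ratio is the env-derived ratio used elsewhere in this diff):

kv_selected = torch.index_select(kv, 0, indices)
bc, bs, h, d = kv_selected.shape
shape = int(bs * h / decoder_tp_ratio * d)
# one head group per decoder rank instead of a fixed split into 2
blocks = torch.chunk(kv_selected, decoder_tp_ratio, dim=2)
vecs = [b.reshape([bc, shape]) for b in blocks]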

gb, h, d = v.shape
indices = torch.tensor(block_ids, device=v.device)
gbhd = [int(gb / self.block_size), self.block_size, h, d]
for kv_tensor in [k, v]:
Collaborator

please add dev-doc comments for the dimension names and the shapes before/after

return model_runner_output

def rewrite_kv_based_on_transfer_layout(self, scheduler_output: "SchedulerOutput"):
    if scheduler_output.kv_connector_metadata:

add a condition to make sure the number of KV heads is divisible by tp_ratio; otherwise assert with a message like "model kv head count can't be divided by tp_ratio".
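
A sketch of that check, with h being the num_kv_heads dimension from the snippets above:

# guard against head counts that cannot be split evenly across decoder ranks
assert h % decoder_tp_ratio == 0, (
    f"model kv head count {h} can't be divided by tp_ratio {decoder_tp_ratio}")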

@xuechendi
Collaborator

I also have an overall question: is the code here assuming the prefill node is HPU or CUDA?
We also need to add a check for NHD vs. HND layout here.

@xinyu-intel
Contributor

What does p2d4 mean here? P: TP2, D: TP4; or P: 2xTP1, D: 4xTP1; or P: TP2, D: 2xTP2?
