enable p2d4 #253
base: main
Conversation
Signed-off-by: Harish Subramony <[email protected]>
Signed-off-by: Harish Subramony <[email protected]>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Signed-off-by: Harish Subramony <[email protected]>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
✅ CI Passed: All checks passed successfully against the following vllm commit:
MODELS=(
    "/root/software/data/pytorch/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/"
    "/software/data/pytorch/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/"
Let's not use an internal path here; I didn't realize that last time.
#)
#MODELS=(
# "Qwen/Qwen3-0.6B"
#)
Please clean up the commented-out model entries here.
# --port 9111 \
# --seed "$(date +%s)" \
# --model /root/software/data/pytorch/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/ \
# --tokenizer /root/software/data/pytorch/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659/ \
same here
def wait_for_save(self):
    assert self.connector_worker is not None
    assert isinstance(self._connector_metadata, NixlConnectorMetadata)
    self.connector_worker.rewrite_kv_based_on_transfer_layout(self._connector_metadata)
Please elaborate on what rewrite_kv_based_on_transfer_layout is meant to do here.
I mean, add dev-doc comments for future code readers.
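For future readers, something along these lines could serve as the requested dev-doc; it is only a sketch inferred from the shapes and the DECODER_TP_RATIO handling visible elsewhere in this diff, not the author's wording:

def rewrite_kv_based_on_transfer_layout(self, metadata: NixlConnectorMetadata):
    """Re-pack saved KV blocks for a decoder running with a different TP size.

    When prefill and decode use different tensor-parallel sizes, the head
    layout written by the prefill side does not match the per-rank slices the
    decode side reads. For each block queued for transfer, split the head
    dimension into decoder_tp_ratio contiguous chunks and lay the chunks out
    back to back inside the block, so every decode rank's share of heads is
    contiguous in the transferred buffer.
    """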
Please add a conditional check and only enable this when P and D have different TP sizes.
Please also assert when you can't split, i.e. when the ratio isn't 2x or 4x.
It's based on the ratio check, which needs to be specified on the command line; there is no other way to get it.
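A minimal sketch of the requested gate, assuming the ratio keeps coming from the DECODER_TP_RATIO environment variable as elsewhere in this PR (and that os is already imported, as the diff's os.getenv usage suggests):

def wait_for_save(self):
    assert self.connector_worker is not None
    assert isinstance(self._connector_metadata, NixlConnectorMetadata)
    # Only rewrite when the decode side runs with a different TP size.
    if int(os.getenv('DECODER_TP_RATIO', 1)) > 1:
        self.connector_worker.rewrite_kv_based_on_transfer_layout(
            self._connector_metadata)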
kv_selected = torch.concat(vecs, dim=1).reshape(kv_selected.shape)
kv.index_copy_(dim=0, index=indices, source=kv_selected)
if len(metadata.reqs_to_save) > 0:
    torch.hpu.synchronize()
is the sync necessary?
kv_selected = torch.index_select(kv, 0, indices)
bc, bs, h, d = kv_selected.shape
shape = int(bs * h / decoder_tp_ratio * d)
blocks = torch.chunk(kv_selected, 2, dim=2)
Why is 2 hard-coded?
def rewrite_kv_based_on_transfer_layout(self, metadata: NixlConnectorMetadata):
    decoder_tp_ratio = int(os.getenv('DECODER_TP_RATIO', 1))
Is this one necessary? Can you get it from somewhere else?
Yes; I'm not sure if there is another way to get the ratio.
blocks = torch.chunk(kv_selected, 2, dim=2)
vecs = [b.reshape([bc, shape]) for b in blocks]
kv_selected = torch.concat(vecs, dim=1).reshape(kv_selected.shape)
kv.index_copy_(dim=0, index=indices, source=kv_selected)
The implementation here doesn't look efficient to me. Does the kv here refer to the host_buffer only?
Signed-off-by: Harish Subramony <[email protected]>
Signed-off-by: Harish Subramony <[email protected]>
self.profiler.record_counter(self.event_start, counters)

if decoder_tp_ratio > 1:
    self.rewrite_kv_based_on_transfer_layout(scheduler_output)
Does this happen every time after model_fwd?
It only runs if there is a KV transfer.
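That is, the rewrite is gated on both conditions that appear in this diff; a sketch restating them together:

# Only rewrite when a KV transfer is pending and the decode side
# runs with a larger TP size.
if decoder_tp_ratio > 1 and scheduler_output.kv_connector_metadata:
    self.rewrite_kv_based_on_transfer_layout(scheduler_output)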
for layer_idx in range(len(self.kv_caches)):
    k = self.kv_caches[layer_idx][0]
    v = self.kv_caches[layer_idx][1]
    gb, h, d = v.shape
gb, h, d means (block_count * block_size, num_kv_heads, head_size).
indices = torch.tensor(block_ids, device=v.device)
gbhd = [int(gb / self.block_size), self.block_size, h, d]
for kv_tensor in [k, v]:
    kv = kv_tensor.reshape(gbhd)
add comments
_TYPE_CACHE: dict[str, dict[str, Any]] = {}

hpu_buffer: list[list[torch.Tensor]] = []
decoder_tp_ratio = int(os.getenv('DECODER_TP_RATIO', 1))
can you get it from nixl_connector?
Please check what the tp_ratio is during the handshake and pass it here.
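A sketch of what that could look like; the names below (self._remote_tp_size, remote_engine_id, self.world_size) are hypothetical placeholders, not the connector's actual fields:

# Hypothetical: derive the ratio from the TP sizes exchanged during the
# NIXL handshake instead of reading DECODER_TP_RATIO from the environment.
remote_tp = self._remote_tp_size[remote_engine_id]  # hypothetical field
local_tp = self.world_size                          # hypothetical field
decoder_tp_ratio = max(1, remote_tp // local_tp)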
kv_selected = torch.index_select(kv, 0, indices)
bc, bs, h, d = kv_selected.shape
shape = int(bs * h / decoder_tp_ratio * d)
blocks = torch.chunk(kv_selected, 2, dim=2)
Is 2 hard-coded?
Change 2 to the TP ratio.
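A minimal sketch of that change, reusing the variable names from this diff and assuming decoder_tp_ratio is in scope:

# Split the head dimension into one chunk per decode rank instead of the
# hard-coded 2, so 2x and 4x ratios share the same path.
blocks = torch.chunk(kv_selected, decoder_tp_ratio, dim=2)
vecs = [b.reshape([bc, shape]) for b in blocks]
kv_selected = torch.concat(vecs, dim=1).reshape(kv_selected.shape)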
gb, h, d = v.shape
indices = torch.tensor(block_ids, device=v.device)
gbhd = [int(gb / self.block_size), self.block_size, h, d]
for kv_tensor in [k, v]:
Please add dev-doc comments for the dimension names and the shapes before/after.
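A sketch of the kind of annotation being asked for, using the dimension names given earlier in this review (block_count * block_size, num_kv_heads, head_size):

gb, h, d = v.shape  # (block_count * block_size, num_kv_heads, head_size)
indices = torch.tensor(block_ids, device=v.device)
# View the flat cache per block: (block_count, block_size, num_kv_heads, head_size)
gbhd = [int(gb / self.block_size), self.block_size, h, d]
for kv_tensor in [k, v]:
    kv = kv_tensor.reshape(gbhd)  # same storage, viewed block by block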
return model_runner_output

def rewrite_kv_based_on_transfer_layout(self, scheduler_output: "SchedulerOutput"):
    if scheduler_output.kv_connector_metadata:
Add a condition to make sure the head count can be divided by tp_ratio; otherwise log and assert ("model kv head can't be divided by tp_ratio").
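A sketch of the requested check, reusing h (num_kv_heads) and decoder_tp_ratio from the diffs above:

# Fail loudly when the KV heads cannot be split evenly across decode ranks.
assert h % decoder_tp_ratio == 0, (
    f"model kv head count {h} can't be divided by tp_ratio {decoder_tp_ratio}")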
I also have an overall question: is the code here for a prefill node running on HPU, or CUDA?

What does p2d4 mean here? P: TP2, D: TP4; or P: 2xTP1, D: 4xTP1; or P: TP2, D: 2xTP2?
No description provided.