[KVPool] Support pp #4761
Conversation
Signed-off-by: baxingpiaochong <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for pipeline parallelism (PP) to the KV pool. The changes include updating data structures to be aware of the pipeline rank and modifying the cache lookup logic. However, the implementation of the cache lookup across different pipeline stages in lookup_scheduler is flawed. It does not correctly generate keys for all combinations of tensor and pipeline parallel ranks, and the subsequent result processing is broken. This critical issue will lead to incorrect cache hit detection and potential failures. I have provided a detailed comment with a suggested fix for this logic.
```python
for i in range(1, self.pp_size):
    for item in keys:
        new_str = item.replace(  # type: ignore[attr-defined]
            "@pp_rank:0", f"@pp_rank:{i}", 1)
        multi_tp_keys.append(new_str)
```
The current logic for checking key existence across both tensor-parallel (TP) and pipeline-parallel (PP) ranks is flawed:

- Incomplete key generation: it fails to generate keys for all (TP, PP) rank combinations, only checking (TP=i, PP=0) and (TP=0, PP=j). This will result in missed cache hits when both TP and PP are greater than 1.
- Incorrect result processing: the subsequent result-processing logic (lines 573-576) is not updated for pipeline parallelism and will likely fail with an IndexError or produce incorrect results.

The entire key generation and result processing block needs to be refactored. Additionally, the implementation relies on hardcoded `rank:0` strings, which is brittle; a more robust solution would replace the current worker's actual rank in the key string. The sketch below illustrates the missed combination.
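For concreteness, here is a minimal standalone sketch of the coverage gap. The key strings are modeled on the ones in this diff, and the `tp_size`/`pp_size` values are illustrative, not taken from the code:

```python
# Varying only one parallelism axis at a time misses combinations:
# with tp_size = pp_size = 2, it yields 3 keys where 4 are needed.
tp_size, pp_size = 2, 2
base = "block0@head_or_tp_rank:0@pp_rank:0"

one_axis = {base.replace("@pp_rank:0", f"@pp_rank:{j}", 1) for j in range(pp_size)}
one_axis |= {base.replace("@head_or_tp_rank:0", f"@head_or_tp_rank:{i}", 1)
             for i in range(tp_size)}

cross = {base.replace("@pp_rank:0", f"@pp_rank:{j}", 1)
             .replace("@head_or_tp_rank:0", f"@head_or_tp_rank:{i}", 1)
         for i in range(tp_size) for j in range(pp_size)}

print(sorted(cross - one_axis))
# ['block0@head_or_tp_rank:1@pp_rank:1']  <- this rank's keys are never checked
```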
Here is a corrected implementation for lines 554-579 that addresses the combination issue, assuming this lookup is always performed from a worker with tp_rank=0 and pp_rank=0:
```python
multi_tp_keys = []
for pp_i in range(self.pp_size):
    for tp_i in range(min(self.tp_size, self.num_kv_head)):
        for item in keys:
            item_with_pp = item.replace("@pp_rank:0", f"@pp_rank:{pp_i}", 1)
            new_str = item_with_pp.replace("@head_or_tp_rank:0",
                                           f"@head_or_tp_rank:{tp_i}", 1)
            multi_tp_keys.append(new_str)
res = self.m_store.exists(multi_tp_keys)  # type: ignore[assignment]
num_block = len(keys)
if use_layerwise:
    res = self.check_all_layers_exists(res, self.num_layers)
    num_block = len(keys) // self.num_layers
num_ranks = self.pp_size * min(self.tp_size, self.num_kv_head)
multi_rank_values = [
    res[i * num_block:(i + 1) * num_block] for i in range(num_ranks)
]
index = self.find_min_first_non_one_index(multi_rank_values)
if index != -1:
    return starts[index]
```
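On the brittleness point above, one possible shape for the more robust variant is sketched here. The helper name and regexes are mine, and the key format is assumed from the strings in this diff; this is not the PR's implementation:

```python
import re

def retarget_key(key: str, tp_rank: int, pp_rank: int) -> str:
    """Rewrite a pool key to target a specific (TP, PP) rank, regardless
    of which worker's rank the key was originally built with."""
    key = re.sub(r"@head_or_tp_rank:\d+", f"@head_or_tp_rank:{tp_rank}",
                 key, count=1)
    return re.sub(r"@pp_rank:\d+", f"@pp_rank:{pp_rank}", key, count=1)

# Example: a key built on worker (tp=3, pp=1) retargeted to (0, 0).
print(retarget_key("block0@head_or_tp_rank:3@pp_rank:1", 0, 0))
# block0@head_or_tp_rank:0@pp_rank:0
```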
When pipeline parallelism (PP) is enabled,
Signed-off-by: baxingpiaochong <[email protected]>
The ADXL-related fixes will be included in 8.5.RC1.
### What this PR does / why we need it?

Support pp for kv pool

- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

---------

Signed-off-by: baxingpiaochong <[email protected]>
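To illustrate the pipeline-rank-aware data structures this PR introduces (per the review summary above), here is a hypothetical sketch. The class and field names are mine; only the `@head_or_tp_rank:N@pp_rank:M` key format comes from the diff:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PoolKey:
    block_hash: str
    tp_rank: int
    pp_rank: int  # identifies the pipeline stage that owns this KV slice

    def __str__(self) -> str:
        return (f"{self.block_hash}"
                f"@head_or_tp_rank:{self.tp_rank}@pp_rank:{self.pp_rank}")

print(PoolKey("block0", tp_rank=0, pp_rank=1))
# block0@head_or_tp_rank:0@pp_rank:1
```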