
Conversation

@baxingpiaochong (Contributor) commented Dec 6, 2025

What this PR does / why we need it?

Support pipeline parallelism (PP) for the KV pool.

Signed-off-by: baxingpiaochong <[email protected]>
@github-actions bot commented Dec 6, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for pipeline parallelism (PP) to the KV pool. The changes include updating data structures to be aware of the pipeline rank and modifying the cache lookup logic. However, the implementation of the cache lookup across different pipeline stages in lookup_scheduler is flawed. It does not correctly generate keys for all combinations of tensor and pipeline parallel ranks, and the subsequent result processing is broken. This critical issue will lead to incorrect cache hit detection and potential failures. I have provided a detailed comment with a suggested fix for this logic.
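
For reference, the per-rank cache keys discussed below embed the pipeline and tensor ranks as marker substrings such as "@pp_rank:0" and "@head_or_tp_rank:0" (both visible in the quoted snippet and the suggestion). A minimal sketch of rewriting one such key for a different rank pair; everything in the example key outside those two markers is an assumption, not taken from the patch:

    # Hypothetical key layout: only the "@head_or_tp_rank:" and "@pp_rank:"
    # markers come from the patch; the rest of the string is illustrative.
    key = "blockhash-abc123@head_or_tp_rank:0@pp_rank:0"

    def rekey(key: str, tp_rank: int, pp_rank: int) -> str:
        """Rewrite the rank markers in a cache key for a given (TP, PP) pair."""
        key = key.replace("@head_or_tp_rank:0", f"@head_or_tp_rank:{tp_rank}", 1)
        return key.replace("@pp_rank:0", f"@pp_rank:{pp_rank}", 1)

    print(rekey(key, tp_rank=2, pp_rank=1))
    # -> blockhash-abc123@head_or_tp_rank:2@pp_rank:1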

Comment on lines +561 to +565
for i in range(1, self.pp_size):
    for item in keys:
        new_str = item.replace(  # type: ignore[attr-defined]
            "@pp_rank:0", f"@pp_rank:{i}", 1)
        multi_tp_keys.append(new_str)

critical

The current logic for checking key existence across both tensor (TP) and pipeline parallel (PP) ranks is flawed.

  1. Incomplete key generation: It fails to generate keys for all (TP, PP) rank combinations, only checking for (TP=i, PP=0) and (TP=0, PP=j). This will result in missed cache hits when both TP and PP are greater than 1.
  2. Incorrect result processing: The subsequent result processing logic (lines 573-576) is not updated for pipeline parallelism and will likely fail with an IndexError or produce incorrect results.

The entire key generation and result processing block needs to be refactored. Additionally, the implementation relies on hardcoded rank:0 strings, which is brittle. A more robust solution would replace the current worker's rank in the key string.

Here is a corrected implementation for lines 554-579 that addresses the combination issue, assuming this lookup is always performed from a worker with tp_rank=0 and pp_rank=0:

            multi_tp_keys = []
            for pp_i in range(self.pp_size):
                for tp_i in range(min(self.tp_size, self.num_kv_head)):
                    for item in keys:
                        item_with_pp = item.replace("@pp_rank:0", f"@pp_rank:{pp_i}", 1)
                        new_str = item_with_pp.replace("@head_or_tp_rank:0", f"@head_or_tp_rank:{tp_i}", 1)
                        multi_tp_keys.append(new_str)

            res = self.m_store.exists(
                multi_tp_keys)  # type: ignore[assignment]
            num_block = len(keys)
            if use_layerwise:
                res = self.check_all_layers_exists(res, self.num_layers)
                num_block = len(keys) // self.num_layers

            num_ranks = self.pp_size * min(self.tp_size, self.num_kv_head)
            multi_rank_values = [
                res[i * num_block:(i + 1) * num_block]
                for i in range(num_ranks)
            ]
            index = self.find_min_first_non_one_index(multi_rank_values)
            if index != -1:
                return starts[index]
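
Following up on the note about the hardcoded rank:0 strings: below is a minimal sketch of deriving the substitution from the current worker's own ranks instead, so the lookup also works when it is not issued from tp_rank=0 / pp_rank=0. The tp_rank and pp_rank parameters are assumptions for illustration, not attributes confirmed by the patch:

    def expand_keys(keys, tp_rank, pp_rank, tp_size, pp_size, num_kv_head):
        """Generate lookup keys for every (TP, PP) rank pair by substituting
        the current worker's own rank markers rather than assuming rank 0."""
        multi_rank_keys = []
        for pp_i in range(pp_size):
            for tp_i in range(min(tp_size, num_kv_head)):
                for item in keys:
                    new_key = item.replace(
                        f"@pp_rank:{pp_rank}", f"@pp_rank:{pp_i}", 1)
                    new_key = new_key.replace(
                        f"@head_or_tp_rank:{tp_rank}",
                        f"@head_or_tp_rank:{tp_i}", 1)
                    multi_rank_keys.append(new_key)
        return multi_rank_keys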

@LCAIZJ (Contributor) commented Dec 8, 2025

When pipeline parallelism (PP) is enabled, self.maybe_wait_for_kv_save() cannot remain in its current position and must be moved back to its previous location. Additionally, we need to synchronize with the ADXL version that includes the fix for the hang issue.

Signed-off-by: baxingpiaochong <[email protected]>
@LCAIZJ (Contributor) commented Dec 9, 2025

The ADXL-related fixes will be included in 8.5.RC1.

@wangxiyuan wangxiyuan merged commit dda027e into vllm-project:main Dec 9, 2025
27 checks passed
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 10, 2025
### What this PR does / why we need it?
Support pp for kv pool

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

---------

Signed-off-by: baxingpiaochong <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 10, 2025
