[BugFix] [PD Disaggregation] fix v1 scheduler prefill node profile run & ipc transfer protocol #5132

liyonghua0910 · 2025-11-19T11:19:42Z

Motivation

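This PR fixes the following issues:

Problem 1: With ENABLE_V1_KVCACHE_SCHEDULER=1, the service fails to start when running single-machine PD disaggregation with the IPC protocol for data transfer.

Problem 2: With ENABLE_V1_KVCACHE_SCHEDULER=1, the service fails to start unless --num-gpu-blocks-override is specified manually.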

Modifications

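Problem 1: The block_tables of a request on the P instance were mistakenly passed to the D node. After the D node allocated resources for the request, its block_tables no longer matched the P node's in length. The fix is to have the D instance ignore the block_tables carried by requests coming from P.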
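The length mismatch behind Problem 1 can be illustrated with a minimal sketch. The names below (`Request`, `allocate_on_decode`, the free-block list) are hypothetical and are not taken from `resource_manager_v1.py`; the sketch only assumes that the decode side allocates its own blocks and shows how extending the table received from P differs from replacing it.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical request object; not FastDeploy's real class."""
    request_id: str
    block_tables: list[int] = field(default_factory=list)


def allocate_on_decode(request: Request, free_blocks: list[int], num_blocks: int,
                       ignore_prefill_tables: bool = True) -> list[int]:
    """Allocate `num_blocks` blocks for a request arriving on the decode instance."""
    allocated = [free_blocks.pop() for _ in range(num_blocks)]
    if ignore_prefill_tables:
        # Fixed behaviour: discard whatever the prefill instance sent along.
        request.block_tables = allocated
    else:
        # Buggy behaviour: prefill's block ids stay in front of decode's own.
        request.block_tables.extend(allocated)
    return request.block_tables


# The prefill instance used 3 blocks and (mistakenly) shipped their ids with the request.
buggy = allocate_on_decode(Request("r0", [7, 8, 9]), list(range(100, 110)), 3,
                           ignore_prefill_tables=False)
fixed = allocate_on_decode(Request("r0", [7, 8, 9]), list(range(100, 110)), 3)

print(len(buggy))  # 6: decode's table is longer than prefill's 3 blocks
print(len(fixed))  # 3: lengths match again
```

In the sketch, replacing the table keeps the decode-side length equal to the number of blocks the decode instance actually allocated, which is the property the fix restores.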

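Problem 2: Under V1 scheduling, KV signal communication runs in per_chunk mode. During the profile run, data is written to the message queue but no receiver reads it, so the process hangs. The fix is to disable the message queue's init and send_signal operations in profile mode.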
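The hang described in Problem 2 can be sketched with a bounded queue that has no consumer. This is an illustration only: `KVSignalSender` and its methods are hypothetical stand-ins, Python's `queue.Queue` replaces the real IPC message queue, and `put_nowait` is used so the example raises instead of actually blocking.

```python
import queue


class KVSignalSender:
    """Hypothetical stand-in for the per-chunk KV signal channel."""

    def __init__(self, capacity: int = 2):
        self._capacity = capacity
        self._queue = None  # created by init() unless this is a profile run
        self.inited = False

    def init(self, is_profile_run: bool) -> None:
        # During a profile run nobody reads the queue, so do not create it.
        if is_profile_run:
            return
        self._queue = queue.Queue(maxsize=self._capacity)
        self.inited = True

    def send_signal(self, layer_id: int) -> bool:
        # Guard: sending to an uninitialized queue is a no-op.
        if not self.inited:
            return False
        # A blocking put() would wait forever once the queue is full and no
        # receiver exists; put_nowait() raises queue.Full instead.
        self._queue.put_nowait(layer_id)
        return True


# Without the profile guard: the queue fills up because nothing consumes it.
unguarded = KVSignalSender(capacity=2)
unguarded.init(is_profile_run=False)
try:
    for layer in range(3):
        unguarded.send_signal(layer)
    stalled = False
except queue.Full:
    stalled = True  # with a blocking put this would be the hang

# With the guard: init is skipped, so every send is dropped.
guarded = KVSignalSender(capacity=2)
guarded.init(is_profile_run=True)
sent = [guarded.send_signal(layer) for layer in range(3)]

print(stalled)  # True
print(sent)     # [False, False, False]
```

The two guards are complementary in this sketch: skipping `init` avoids creating a queue nobody reads, and the `inited` check makes later sends harmless.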

Usage or Command

bash examples/splitwise/start_v1_tp1.sh

Accuracy Tests

Checklist

- Add at least one tag in the PR title.
  - Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
  - You can add new tags based on the PR content, but the semantics must be clear.
- Format your code and run pre-commit before committing.
- Add unit tests. If no unit tests are added, explain why in this PR.
- Provide accuracy results.
- If the current PR targets the release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2025-11-19T11:19:56Z

Thanks for your contribution!

Copilot

Pull Request Overview

This PR fixes two critical bugs in PD (Prefill-Decode) disaggregation when using the V1 KV cache scheduler with IPC protocol:

Key Changes:

- Fixed block_tables mishandling: Decode instances now replace (not extend) block_tables from prefill instances
- Fixed profile run hang: Added is_profiling flag to skip IPC message queue initialization during memory profiling
- Removed obsolete NotImplementedError checks that blocked V1 scheduler usage with IPC protocol

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| tests/e2e/test_ernie_03b_pd_router_v1.py | Added E2E test for V1 scheduler with PD disaggregation and IPC protocol |
| fastdeploy/worker/gpu_model_runner.py | Added `is_profiling` parameter to `_prepare_inputs` and `initialize_forward_meta` methods |
| fastdeploy/model_executor/forward_meta.py | Added `is_profiling` boolean flag to ForwardMeta dataclass |
| fastdeploy/model_executor/layers/attention/mla_attention_backend.py | Skip `init_kv_signal_per_query` during profiling to prevent message queue hang |
| fastdeploy/model_executor/layers/attention/flash_attn_backend.py | Skip `init_kv_signal_per_query` during profiling to prevent message queue hang |
| fastdeploy/model_executor/layers/attention/append_attn_backend.py | Skip `init_kv_signal_per_query` during profiling to prevent message queue hang |
| fastdeploy/engine/sched/resource_manager_v1.py | Changed decode instance to replace block_tables instead of extending them |
| fastdeploy/engine/args_utils.py | Removed NotImplementedError for V1 scheduler with IPC protocol and missing num_gpu_blocks_override |
| fastdeploy/cache_manager/cache_messager.py | Fixed IPC target_id to use device_ids instead of rdma_ports |
| custom_ops/xpu_ops/src/ops/remote_cache_kv_ipc.h | Wrapped send_signal in inited check to prevent sending to uninitialized message queue |
| custom_ops/gpu_ops/remote_cache_kv_ipc.h | Wrapped send_signal in inited check and applied code formatting improvements |

gongshaotian

LGTM

yuanlehome · 2025-11-20T11:51:02Z

fastdeploy/model_executor/layers/attention/append_attn_backend.py

        metadata.kv_signal_data_list = [None] * self.num_layers
        if self.pd_disaggregation_mode == "per_chunk":
-            if not self.keep_pd_step_flag:
+            if not self.keep_pd_step_flag and not forward_meta.is_profiling:


Please rename is_profiling. "Profiling" usually means measuring and analyzing a program's performance; change it to is_dummy_or_profile_run.
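A minimal sketch of how the guard reads with the suggested name is given below. The classes are reduced, hypothetical versions that keep only the fields appearing in the diff above; `init_attention_metadata` and the `signal_inits` counter are illustrative assumptions standing in for the real backend method and for the call to `init_kv_signal_per_query`.

```python
from dataclasses import dataclass


@dataclass
class ForwardMeta:
    """Reduced, hypothetical version of the ForwardMeta dataclass."""
    is_dummy_or_profile_run: bool = False


class AttentionBackendSketch:
    """Hypothetical backend keeping only the fields used by the condition."""

    def __init__(self, pd_disaggregation_mode: str, keep_pd_step_flag: bool, num_layers: int):
        self.pd_disaggregation_mode = pd_disaggregation_mode
        self.keep_pd_step_flag = keep_pd_step_flag
        self.num_layers = num_layers
        self.signal_inits = 0  # counts calls standing in for init_kv_signal_per_query

    def init_attention_metadata(self, forward_meta: ForwardMeta) -> list:
        kv_signal_data_list = [None] * self.num_layers
        if self.pd_disaggregation_mode == "per_chunk":
            # Skip signal initialization for dummy or profile runs.
            if not self.keep_pd_step_flag and not forward_meta.is_dummy_or_profile_run:
                self.signal_inits += 1
        return kv_signal_data_list


backend = AttentionBackendSketch("per_chunk", keep_pd_step_flag=False, num_layers=2)

profile_out = backend.init_attention_metadata(ForwardMeta(is_dummy_or_profile_run=True))
inits_after_profile = backend.signal_inits  # signal init skipped

backend.init_attention_metadata(ForwardMeta())
inits_after_real = backend.signal_inits  # signal init performed once
```

Keeping the flag on the per-step metadata object, rather than on the backend, lets the same backend instance serve both the profile run and real requests.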

[fix] fix v1 scheduler profile run for append attention in prefill node

4ea5681

liyonghua0910 added 5 commits November 19, 2025 11:57

[fix] skip send_signal if kv signal not inited for gpu and xpu

2c1ec9b

Merge remote-tracking branch 'upstream/develop' into develop+fix_v1_p…

ed9eeee

…_profile_run

[fix] extend fix to flash_attn & mla_attn

2e118ef

[fix] fix v1 pd run in ipc transfer protocol

93eeb17

[ci] add test for v1 pd profile run using ipc transfer protocol

8e26d9f

liyonghua0910 changed the title ~~[BugFix] fix v1 scheduler profile run for append attention in prefill node~~ [BugFix] fix v1 scheduler prefill node profile run & ipc transfer protocol Nov 19, 2025

liyonghua0910 changed the title ~~[BugFix] fix v1 scheduler prefill node profile run & ipc transfer protocol~~ [BugFix] [PD Disaggregation] fix v1 scheduler prefill node profile run & ipc transfer protocol Nov 19, 2025

liyonghua0910 added 2 commits November 19, 2025 12:50

[style] fix code style check

572242a

[style] fix code style again

b499a03

juncaipeng previously approved these changes Nov 20, 2025

juncaipeng requested a review from Copilot November 20, 2025 02:54

Copilot started reviewing on behalf of juncaipeng November 20, 2025 02:54

Copilot finished reviewing on behalf of juncaipeng November 20, 2025 02:59

Copilot AI reviewed Nov 20, 2025

[fix] fix profile run

597b7aa

liyonghua0910 dismissed juncaipeng’s stale review via 597b7aa November 20, 2025 07:57

liyonghua0910 mentioned this pull request Nov 20, 2025

[BugFix] [PD Disaggregation] fix v1 pd run in ipc transfer protocol #5075

Closed

5 tasks

[update] remove --num-gpu-blocks-override in example script

0bb9a7c

juncaipeng previously approved these changes Nov 20, 2025

gongshaotian previously approved these changes Nov 20, 2025

yuanlehome reviewed Nov 20, 2025

[chore] rename forward_meta is_profiling to is_dummy_or_profile_run

09bc7ef

liyonghua0910 dismissed stale reviews from gongshaotian and juncaipeng via 09bc7ef November 20, 2025 12:16

yuanlehome approved these changes Nov 20, 2025

Jiang-Jia-Jun added the skip-ci: coverage label Nov 20, 2025

Jiang-Jia-Jun merged commit 43097a5 into PaddlePaddle:develop Nov 20, 2025
15 of 18 checks passed

5 participants
