Collaborator

@liyonghua0910 liyonghua0910 commented Nov 19, 2025

Motivation

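This PR fixes the following issues:

Issue 1: With ENABLE_V1_KVCACHE_SCHEDULER=1, a single-machine PD disaggregation deployment fails to start if data transfer uses the IPC protocol.

Issue 2: With ENABLE_V1_KVCACHE_SCHEDULER=1, the service fails to start unless --num-gpu-blocks-override is specified manually.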

Modifications

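Issue 1: The block_tables of a request on the P instance were mistakenly passed to the D node; after the D node allocated resources for the request, its block_tables no longer matched the P node's in length. The fix is to have the D instance ignore the block_tables carried by requests from P.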
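
The difference between extending and replacing block_tables can be shown with a minimal sketch. The `Request` class and `allocate_on_decode` function are hypothetical names used only for illustration, not FastDeploy's actual API; the sketch assumes the decode side allocates its own block ids independently of the prefill side.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    """Hypothetical request object; only block_tables matters here."""
    request_id: str
    block_tables: List[int] = field(default_factory=list)


def allocate_on_decode(request: Request, new_blocks: List[int], replace: bool = True) -> Request:
    """Assign decode-side blocks to a request received from a prefill instance."""
    if replace:
        # Fixed behaviour: ignore whatever block_tables the prefill instance sent.
        request.block_tables = list(new_blocks)
    else:
        # Buggy behaviour: prefill-side block ids leak into the decode-side table.
        request.block_tables.extend(new_blocks)
    return request


# A request arrives from the prefill instance still carrying its own block ids.
buggy = allocate_on_decode(Request("r1", [10, 11, 12]), [0, 1, 2], replace=False)
fixed = allocate_on_decode(Request("r1", [10, 11, 12]), [0, 1, 2], replace=True)

print(len(buggy.block_tables))  # 6: twice the number of blocks actually allocated
print(fixed.block_tables)       # [0, 1, 2]
```

With `replace=False` the table holds six entries for three allocated blocks, which is the kind of length mismatch described above.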

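Issue 2: Under V1 scheduling, KV signal communication uses per_chunk mode. During the profile run, data is written to the message queue but no receiver reads it, so the run hangs. The fix is to disable the message queue's init and send_signal operations in profile mode.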
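
The hang follows the usual pattern of a bounded queue with a writer and no reader. The sketch below uses Python's standard `queue.Queue` as a stand-in for the IPC message queue (an assumption for illustration; the real queue lives in the custom ops), and `send_signal` and `run_layers` are hypothetical helpers, not the actual ops.

```python
import queue


def send_signal(q: queue.Queue, layer_id: int, is_profiling: bool) -> bool:
    """Write a per-layer signal unless this is a profile run.

    Returns True if the signal was written, False if it was skipped.
    """
    if is_profiling:
        # No receiver exists during a profile run, so do not write at all.
        return False
    # A short timeout stands in for what would otherwise block indefinitely.
    q.put(layer_id, timeout=0.1)
    return True


def run_layers(num_layers: int, is_profiling: bool, capacity: int = 2) -> str:
    """Send one signal per layer into a queue that nobody reads from."""
    q = queue.Queue(maxsize=capacity)
    try:
        for layer_id in range(num_layers):
            send_signal(q, layer_id, is_profiling)
    except queue.Full:
        return "stuck"  # the writer found the queue full with no reader
    return "finished"


print(run_layers(num_layers=4, is_profiling=False))  # stuck
print(run_layers(num_layers=4, is_profiling=True))   # finished
```

Skipping the write entirely, rather than draining the queue, matches the fix described above: during a profile run the signals have no consumer and carry no useful information.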

Usage or Command

bash examples/splitwise/start_v1_tp1.sh

Accuracy Tests

image

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot bot commented Nov 19, 2025

Thanks for your contribution!

@liyonghua0910 liyonghua0910 changed the title from "[BugFix] fix v1 scheduler profile run for append attention in prefill node" to "[BugFix] fix v1 scheduler prefill node profile run & ipc transfer protocol" Nov 19, 2025
@liyonghua0910 liyonghua0910 changed the title from "[BugFix] fix v1 scheduler prefill node profile run & ipc transfer protocol" to "[BugFix] [PD Disaggregation] fix v1 scheduler prefill node profile run & ipc transfer protocol" Nov 19, 2025
juncaipeng previously approved these changes Nov 20, 2025
@juncaipeng juncaipeng requested a review from Copilot November 20, 2025 02:54
Copilot finished reviewing on behalf of juncaipeng November 20, 2025 02:59
Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes two critical bugs in PD (Prefill-Decode) disaggregation when using the V1 KV cache scheduler with the IPC protocol.

Key Changes:

  • Fixed block_tables mishandling: Decode instances now replace (not extend) block_tables from prefill instances
  • Fixed profile run hang: Added is_profiling flag to skip IPC message queue initialization during memory profiling
  • Removed obsolete NotImplementedError checks that blocked V1 scheduler usage with IPC protocol

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| tests/e2e/test_ernie_03b_pd_router_v1.py | Added E2E test for V1 scheduler with PD disaggregation and IPC protocol |
| fastdeploy/worker/gpu_model_runner.py | Added is_profiling parameter to _prepare_inputs and initialize_forward_meta methods |
| fastdeploy/model_executor/forward_meta.py | Added is_profiling boolean flag to ForwardMeta dataclass |
| fastdeploy/model_executor/layers/attention/mla_attention_backend.py | Skip init_kv_signal_per_query during profiling to prevent message queue hang |
| fastdeploy/model_executor/layers/attention/flash_attn_backend.py | Skip init_kv_signal_per_query during profiling to prevent message queue hang |
| fastdeploy/model_executor/layers/attention/append_attn_backend.py | Skip init_kv_signal_per_query during profiling to prevent message queue hang |
| fastdeploy/engine/sched/resource_manager_v1.py | Changed decode instance to replace block_tables instead of extending them |
| fastdeploy/engine/args_utils.py | Removed NotImplementedError for V1 scheduler with IPC protocol and missing num_gpu_blocks_override |
| fastdeploy/cache_manager/cache_messager.py | Fixed IPC target_id to use device_ids instead of rdma_ports |
| custom_ops/xpu_ops/src/ops/remote_cache_kv_ipc.h | Wrapped send_signal in inited check to prevent sending to uninitialized message queue |
| custom_ops/gpu_ops/remote_cache_kv_ipc.h | Wrapped send_signal in inited check and applied code formatting improvements |
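
The inited check listed for the two remote_cache_kv_ipc.h headers guards against sending on a queue that was never set up. A rough Python analogue follows, assuming a hypothetical `KVSignalSender` class; the real code is C++ and its details are not shown in this conversation.

```python
class KVSignalSender:
    """Hypothetical sender that only sends after init() has been called."""

    def __init__(self) -> None:
        self.inited = False
        self.sent = []  # stands in for the message queue

    def init(self) -> None:
        self.inited = True

    def send_signal(self, layer_id: int) -> None:
        # Guard: with no initialized queue, sending is a no-op.
        if not self.inited:
            return
        self.sent.append(layer_id)


profile_sender = KVSignalSender()  # init() skipped, as in a profile run
profile_sender.send_signal(0)

normal_sender = KVSignalSender()
normal_sender.init()
normal_sender.send_signal(0)

print(profile_sender.sent)  # []
print(normal_sender.sent)   # [0]
```

Putting the guard inside the send path means callers do not each need to know whether initialization was skipped.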

gongshaotian previously approved these changes Nov 20, 2025
Collaborator

@gongshaotian gongshaotian left a comment

LGTM

  metadata.kv_signal_data_list = [None] * self.num_layers
  if self.pd_disaggregation_mode == "per_chunk":
-     if not self.keep_pd_step_flag:
+     if not self.keep_pd_step_flag and not forward_meta.is_profiling:
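
The effect of the added condition can be checked in isolation. The sketch below restates it as a pure function; `ForwardMetaSketch` and `should_init_kv_signal` are hypothetical names, and only the per_chunk branch shown in the diff is modeled.

```python
from dataclasses import dataclass


@dataclass
class ForwardMetaSketch:
    """Hypothetical stand-in for ForwardMeta; only the new flag is modeled."""
    is_profiling: bool = False


def should_init_kv_signal(
    pd_disaggregation_mode: str,
    keep_pd_step_flag: bool,
    forward_meta: ForwardMetaSketch,
) -> bool:
    """Return True when the per_chunk signal init would run."""
    if pd_disaggregation_mode != "per_chunk":
        # Other modes are outside the snippet shown in the diff.
        return False
    return not keep_pd_step_flag and not forward_meta.is_profiling


# Normal forward pass: the signal is initialized.
print(should_init_kv_signal("per_chunk", False, ForwardMetaSketch()))
# Profile run: the new flag suppresses initialization.
print(should_init_kv_signal("per_chunk", False, ForwardMetaSketch(is_profiling=True)))
```

Defaulting the flag to False keeps existing call sites unchanged, so only the profile-run path has to opt in.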
Collaborator

Please rename is_profiling. "Profiling" usually means "measuring and analyzing a program's performance"; change it to is_dummy_or_profile_run.

Collaborator Author

ok

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 43097a5 into PaddlePaddle:develop Nov 20, 2025
15 of 18 checks passed