1. Main changes:
Added support for the Pipeline Parallel (PP) feature: the original model is split evenly across multiple ranks along its outermost structure, with each rank holding a subset of layers and the corresponding parameters. During training, the forward pass runs from rank 0 to rank n following the computation-graph order, the backward pass runs from rank n back to rank 0, and ranks exchange data via point-to-point communication.
net.cc: when building the model, constructs only the sub-block and corresponding parameters that belong to the current rank, based on PP_size and pp_rank.
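A minimal sketch of how such an even split could be computed from PP_size and pp_rank; the helper name LayerRangeForRank and the index-range representation are illustrative assumptions rather than the actual code in net.cc.

```cpp
// Illustrative only: evenly assigning a model's top-level layers to
// pipeline-parallel ranks. LayerRangeForRank is a hypothetical helper,
// not the repo's actual API.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <utility>

// Half-open range [begin, end) of top-level layer indices owned by pp_rank
// when num_layers are divided across pp_size stages.
std::pair<int64_t, int64_t> LayerRangeForRank(int64_t num_layers, uint32_t pp_size,
                                              uint32_t pp_rank) {
    const int64_t base = num_layers / pp_size;  // layers every rank gets
    const int64_t rem = num_layers % pp_size;   // first `rem` ranks get one extra
    const int64_t begin = pp_rank * base + std::min<int64_t>(pp_rank, rem);
    const int64_t end = begin + base + (pp_rank < rem ? 1 : 0);
    return {begin, end};
}

int main() {
    // Example: 32 transformer blocks split across 8 stages -> 4 blocks per rank.
    for (uint32_t rank = 0; rank < 8; ++rank) {
        const auto [begin, end] = LayerRangeForRank(32, 8, rank);
        std::cout << "pp_rank " << rank << " owns layers [" << begin << ", " << end << ")\n";
    }
    return 0;
}
```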

pipeline_parallel.cc: PipelineParallel wraps the model; each rank owns one PipelineParallel instance, which wires together the PipelineSchedule, PipelineStage, and Optimizers. It provides TrainStep as the training entry point, which calls the scheduler's training method Step.
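A hedged sketch of what such a per-rank wrapper could look like; the placeholder Tensor, Optimizer, and PipelineStage types and all signatures below are assumptions for illustration, not the classes introduced by this PR.

```cpp
// Hedged sketch of a per-rank PipelineParallel wrapper. Tensor, Optimizer,
// PipelineStage, and every signature here are placeholders assumed for
// illustration; the classes in this PR will differ in detail.
#include <memory>
#include <utility>
#include <vector>

struct Tensor {};        // stand-in for the framework's tensor type
class PipelineStage {};  // the sub-graph owned by this rank (see pipeline_stage.cc)

class Optimizer {
public:
    void ZeroGrad() {}
    void Step() {}
};

class PipelineSchedule {
public:
    virtual ~PipelineSchedule() = default;
    // Runs one full training round (all micro-batches) on this rank's stage.
    virtual float Step(const std::vector<Tensor>& inputs,
                       const std::vector<Tensor>& targets) = 0;
};

class PipelineParallel {
public:
    PipelineParallel(std::shared_ptr<PipelineStage> stage,
                     std::unique_ptr<PipelineSchedule> schedule,
                     std::vector<std::shared_ptr<Optimizer>> optimizers)
        : stage_(std::move(stage)),
          schedule_(std::move(schedule)),
          optimizers_(std::move(optimizers)) {}

    // Training entry point: delegate one full round to the scheduler's Step,
    // then apply the optimizers that own this rank's parameters.
    float TrainStep(const std::vector<Tensor>& inputs,
                    const std::vector<Tensor>& targets) {
        for (auto& opt : optimizers_) opt->ZeroGrad();
        const float loss = schedule_->Step(inputs, targets);
        for (auto& opt : optimizers_) opt->Step();
        return loss;
    }

private:
    std::shared_ptr<PipelineStage> stage_;  // held so the stage outlives training
    std::unique_ptr<PipelineSchedule> schedule_;
    std::vector<std::shared_ptr<Optimizer>> optimizers_;
};
```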
pipeline_schedule.cc: PipelineSchedule is the scheduler base class; its Step function runs one complete training round. ScheduleGPipe is the GPipe scheduler subclass (schematic shown below), and StepMicrobatches contains the concrete scheduling logic.
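For reference, the GPipe schedule boils down to two phases per training round: run the forward pass for every micro-batch, then run every backward pass. A sketch under that assumption, with StageLike, ComputeLoss, and Backward as hypothetical stand-ins for the real framework:

```cpp
// Sketch of the two-phase GPipe schedule: all micro-batch forwards first,
// then all backwards. StageLike, ComputeLoss, and Backward are hypothetical
// stand-ins for the framework's stage, loss, and autograd entry points.
#include <cstddef>
#include <vector>

struct Tensor {};

struct StageLike {
    // Forwards one micro-batch through the sub-graph held by this rank.
    Tensor ForwardOneChunk(const Tensor& input) { return input; }
};

Tensor ComputeLoss(const Tensor& output, const Tensor& /*target*/) { return output; }
void Backward(const Tensor& /*root*/) { /* autograd walks back through ISend/IRecv */ }

// Core idea of a StepMicrobatches-style loop over n micro-batches.
void StepMicrobatches(StageLike& stage, bool is_last_stage,
                      const std::vector<Tensor>& micro_inputs,
                      const std::vector<Tensor>& micro_targets) {
    std::vector<Tensor> outputs;
    outputs.reserve(micro_inputs.size());

    // Phase 1: forward every micro-batch in order (activations flow rank 0 -> rank n).
    for (const Tensor& mb : micro_inputs) {
        outputs.push_back(stage.ForwardOneChunk(mb));
    }

    // Phase 2: backward every micro-batch (gradients flow rank n -> rank 0).
    // Only the last stage computes a loss; earlier stages start their backward
    // from the gradient received at the stage boundary.
    for (std::size_t i = 0; i < outputs.size(); ++i) {
        if (is_last_stage) {
            Backward(ComputeLoss(outputs[i], micro_targets[i]));
        } else {
            Backward(outputs[i]);
        }
    }
}
```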
pipeline_stage.cc: PipelineStage represents the sub-graph held by the current rank and provides the ForwardOneChunk method to run the forward computation inside that sub-graph.
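A minimal sketch of what ForwardOneChunk amounts to, assuming a generic Module interface; the real layer and tensor types are the framework's own.

```cpp
// Minimal sketch of a PipelineStage that runs one micro-batch through the
// layers owned by the current rank. Module and Tensor are placeholder types.
#include <memory>
#include <utility>
#include <vector>

struct Tensor {};

struct Module {
    virtual ~Module() = default;
    virtual Tensor Forward(const Tensor& x) = 0;
};

class PipelineStage {
public:
    explicit PipelineStage(std::vector<std::shared_ptr<Module>> layers)
        : layers_(std::move(layers)) {}

    // Run the forward computation of this rank's sub-graph on one micro-batch.
    // The activation exchange at the stage boundaries is handled by the
    // ISend/IRecv autograd nodes described in the next item.
    Tensor ForwardOneChunk(Tensor x) {
        for (auto& layer : layers_) {
            x = layer->Forward(x);
        }
        return x;
    }

private:
    std::vector<std::shared_ptr<Module>> layers_;  // layers assigned to this pp_rank
};
```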
send_recv.cc: ISend and IRecv are two autograd nodes used to send tensors between specific ranks, relying on the autograd mechanism. The last step of rank x's backward pass sends the gradient to rank x-1; this triggers ISend::Backward on rank x-1 to receive the gradient and start rank x-1's backward pass.
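To show how the pairing could work, a hedged sketch of ISend/IRecv as autograd nodes; the AutogradNode interface and the P2PSend/P2PRecv helpers are hypothetical placeholders for the framework's actual autograd base class and point-to-point communication layer.

```cpp
// Hedged sketch of ISend/IRecv as autograd nodes. AutogradNode, Tensor, and
// the P2PSend/P2PRecv helpers are hypothetical placeholders for the
// framework's autograd base class and point-to-point communication layer.
#include <vector>

struct Tensor {};

// Hypothetical blocking point-to-point primitives (the real code would sit on
// top of the framework's communicator).
void P2PSend(const Tensor& /*t*/, int /*peer_rank*/) {}
Tensor P2PRecv(int /*peer_rank*/) { return Tensor{}; }

struct AutogradNode {
    virtual ~AutogradNode() = default;
    virtual std::vector<Tensor> Forward(const std::vector<Tensor>& inputs) = 0;
    virtual std::vector<Tensor> Backward(const std::vector<Tensor>& grad_outputs) = 0;
};

// ISend lives on the sending rank (rank x-1): its forward ships the boundary
// activation to rank x, and its backward waits for the gradient coming back
// from rank x, which then continues rank x-1's backward pass.
struct ISend : AutogradNode {
    int next_rank;
    explicit ISend(int next) : next_rank(next) {}
    std::vector<Tensor> Forward(const std::vector<Tensor>& inputs) override {
        P2PSend(inputs[0], next_rank);
        return inputs;  // keep the node in the local graph
    }
    std::vector<Tensor> Backward(const std::vector<Tensor>& /*unused*/) override {
        return {P2PRecv(next_rank)};  // gradient arriving from rank x
    }
};

// IRecv lives on the receiving rank (rank x): its forward waits for the
// activation from rank x-1, and its backward is the last step of rank x's
// backward pass, sending the gradient back to rank x-1.
struct IRecv : AutogradNode {
    int prev_rank;
    explicit IRecv(int prev) : prev_rank(prev) {}
    std::vector<Tensor> Forward(const std::vector<Tensor>& /*unused*/) override {
        return {P2PRecv(prev_rank)};
    }
    std::vector<Tensor> Backward(const std::vector<Tensor>& grad_outputs) override {
        P2PSend(grad_outputs[0], prev_rank);
        return {};
    }
};
```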
2. Command-line arguments:
--pipeline_parallel  # uint32; enables pipeline parallelism. The value is the number of parallel devices, i.e. the number of stages.
Example:
./llama3 --input_bin <input_path> --llmc_filepath <model_path> --device cuda --nthread_per_process 8 --batch_size 10 --total_batch_size 5120 --num_iteration 10 --pipeline_parallel 8