Global Tensor Distributed Parallelism Strategies #512
Conversation
In our earlier meeting, the suggestion was to "first draw up an outline down to the second-level headings, and then discuss the structure together".
What we have now is a full first draft, but I think the overall direction is off, and it probably needs to be redone more or less from scratch.
I also recommend reading https://github.com/Oneflow-Inc/OneTeam/blob/master/tutorial/oneflow_docs_system.md
My suggestions:
- Set this draft aside for now, go back to the "organize the outline" stage, and rework the outline first.
- While doing so, keep in mind that this is a how-to article.
- The outline does not need concrete code yet, but it should note which kind of demo you plan to use, so that Xiaoyu and I can discuss it and give feedback (for example, if you simply say you want a matmul example, I can point out that it feels far removed from real workloads).
1. The data $x$ is split along dimension 0 (`sbp=flow.sbp.split(dim=0)`) and distributed across two GPUs (`placement=flow.placement(type="cuda", ranks=[0, 1])`)
2. The model $w$ is kept whole (`sbp=flow.sbp.broadcast`) and distributed across the same two GPUs (`placement=flow.placement(type="cuda", ranks=[0, 1])`)

After the change, the full code is as follows:
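The full code itself is not part of this excerpt. A minimal sketch of what such a data-parallel example might look like, assuming a hypothetical `data_parallel.py` that reuses the matmul shapes from the discussion (the code in the PR may differ):

```python
# data_parallel.py -- hypothetical file name, not the PR's actual code
import oneflow as flow

placement = flow.placement(type="cuda", ranks=[0, 1])

# Data x is split along dim 0: each rank holds 2 of the 4 rows.
x = flow.randn(4, 5, placement=placement, sbp=flow.sbp.split(dim=0))
# Model w is broadcast: each rank holds a full copy.
w = flow.randn(5, 8, placement=placement, sbp=flow.sbp.broadcast)

out = flow.matmul(x, w)
print(out.shape)  # (4, 8); the result stays split along dim 0
```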
This only gives the code and says nothing about how to launch it, which goes against the "practice first" principle. A user who copies this kind of code and runs it directly may find that it does not run at all.
(Has this code actually been run?)
The code has not been tested yet, which is why the launch instructions are missing. Since the amount of code is small, I just put it up here for illustration.
> Considering whether to add some further notes on data parallelism here, for example:
>> Under the data-parallel strategy, the gradients on each device need to be AllReduce-d during backpropagation so that the model replicas on all devices stay consistent.
>
>> When the dataset is large and the model is small, the communication cost of synchronizing gradients in the backward pass is low, so data parallelism is usually advantageous; common vision classification models such as ResNet50 are well suited to it.
These notes again read too much like a tutorial; I actually feel they can be left out. If they are kept, they should be written in how-to style, grounded in practice and in solving real problems at work, for example listing the linear speedup of data parallelism for a model like ResNet50 (where it works well) versus a model like BERT (where it works poorly).
> Should we also cover "Stage ID and gradient accumulation settings" here?
### Hybrid Parallelism
For hybrid parallelism, given the knowledge gap involved, it is enough to combine the three earlier examples rather than writing a complex, concrete case.
In real use, hybrid parallelism requires 2D sbp, which can be covered in a separate article.
Does combining the three examples mean giving another matmul-based program? With pairwise combinations plus using all three, there are four hybrid schemes in total; do all of them need examples?
No need for complex combinations; a single case is enough. An example does need to be included.
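A minimal sketch of what one such combined case might look like, assuming a two-stage matmul pipeline whose first stage is data parallel and whose second stage is tensor (model) parallel; the file name, shapes, ranks, and launch command are illustrative assumptions, not necessarily what the article ended up using:

```python
# hybrid_parallel.py -- hypothetical example
# Assumed launch: python3 -m oneflow.distributed.launch --nproc_per_node 4 hybrid_parallel.py
import oneflow as flow

P01 = flow.placement(type="cuda", ranks=[0, 1])  # stage 0: data parallel
P23 = flow.placement(type="cuda", ranks=[2, 3])  # stage 1: tensor (model) parallel

# Stage 0: x is split by sample (dim 0), w0 is replicated -> data parallelism.
x = flow.randn(4, 5, placement=P01, sbp=flow.sbp.split(dim=0))
w0 = flow.randn(5, 8, placement=P01, sbp=flow.sbp.broadcast)
out_stage0 = flow.matmul(x, w0)                   # sbp: split(dim=0)

# Pipeline boundary: move the activation to the second group of devices.
in_stage1 = out_stage0.to_global(placement=P23, sbp=flow.sbp.broadcast)

# Stage 1: w1 is split by output column (dim 1) -> tensor (model) parallelism.
w1 = flow.randn(8, 3, placement=P23, sbp=flow.sbp.split(dim=1))
out_stage1 = flow.matmul(in_stage1, w1)           # sbp: split(dim=1)
print(out_stage1.shape)                           # (4, 3)
```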
Running this through test.py:

```python
# test.py
import oneflow as flow

placement = flow.placement(type="cpu", ranks=[0, 1])
x = flow.randn(4, 5, placement=placement, sbp=flow.sbp.split(dim=0))
w = flow.randn(5, 8, placement=placement, sbp=flow.sbp.broadcast)
out = flow.matmul(x, w)
print(out.shape)  # (4, 8)
```

the output contains what looks like an error.

That is just an informational message: the multi-node processes are establishing their connection; wait a moment and it will continue.
```python
in_stage1 = out_stage0.to_global(placement=P23, sbp=flow.sbp.broadcast)
out_stage1 = flow.matmul(in_stage1, w1)
print(out_stage1.shape)  # (4, 3)
```
For this example, please also add an nn.Graph version and get it running end to end.
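A rough sketch of what such an nn.Graph version might look like, assuming the same two-stage matmul pipeline (placements P01/P23, weights w0/w1) as in the fragment above; this illustrates the static-graph wrapper only and is not the code that was eventually merged:

```python
import oneflow as flow
import oneflow.nn as nn

P01 = flow.placement(type="cuda", ranks=[0, 1])
P23 = flow.placement(type="cuda", ranks=[2, 3])

class PipelineGraph(nn.Graph):
    def __init__(self):
        super().__init__()

    def build(self, x, w0, w1):
        # Stage 0 runs on P01; to_global moves the activation to stage 1 on P23.
        out_stage0 = flow.matmul(x, w0)
        in_stage1 = out_stage0.to_global(placement=P23, sbp=flow.sbp.broadcast)
        return flow.matmul(in_stage1, w1)

x = flow.randn(4, 5, placement=P01, sbp=flow.sbp.split(dim=0))
w0 = flow.randn(5, 8, placement=P01, sbp=flow.sbp.broadcast)
w1 = flow.randn(8, 3, placement=P23, sbp=flow.sbp.broadcast)

graph = PipelineGraph()
out = graph(x, w0, w1)
print(out.shape)  # (4, 3)
```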
# Using Global Tensor for Multi-Device Multi-GPU Programming: Distributed Parallelism Strategies
Remember to delete this outline before merging; it does not need to go into the repository.
Also, since mkdocs.yml has not been modified yet, the newly added article will not actually show up on the website.
Please follow the PRs that added other articles, modify mkdocs.yml, run mkdocs build locally, and check the result.
Attach a screenshot of the final result as well; that makes it easier to track down formatting issues introduced when the HTML is built.
OK.
```python
    return out


class ModuleModel(nn.Module):
```
Could the eager version above also reuse this?
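A minimal sketch of the reuse being suggested, assuming `ModuleModel` is a simple one-matmul module (its real body is not shown in this excerpt); the same instance is called once in eager mode and once wrapped by nn.Graph:

```python
import oneflow as flow
import oneflow.nn as nn

placement = flow.placement(type="cuda", ranks=[0, 1])

class ModuleModel(nn.Module):
    # Assumed body: the actual module in the PR may differ.
    def __init__(self):
        super().__init__()
        self.w0 = nn.Parameter(
            flow.randn(5, 8, placement=placement, sbp=flow.sbp.broadcast)
        )

    def forward(self, x):
        out = flow.matmul(x, self.w0)
        return out

model = ModuleModel()
x = flow.randn(4, 5, placement=placement, sbp=flow.sbp.split(dim=0))

# Eager: call the module directly.
eager_out = model(x)

# Static graph: reuse the same module instance inside nn.Graph.
class ModelGraph(nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def build(self, x):
        return self.model(x)

graph = ModelGraph(model)
graph_out = graph(x)
print(eager_out.shape, graph_out.shape)  # (4, 8) (4, 8)
```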
Remember to modify mkdocs.yml and attach a screenshot of the built HTML.
I will find someone to translate the English version later.
…neflow-documentation into global-distributed
In the English version, the sample code could use a few blank lines to separate segments with different purposes, for readability.
Co-authored-by: Guoliang Cheng <[email protected]>
@@ -0,0 +1,329 @@
# Using Global Tensor for Multi-Device Multi-GPU Programming: Distributed Parallelism Strategies
Multi-Device Multi-GPU
The phrase "多机多设备编程" (multi-machine multi-device programming) now feels too wordy; it can be changed to "分布式编程", and the English translation adjusted to "distributed programming" accordingly.
Please update both articles:
- 使用 Global Tensor 进行多机多设备编程:基础操作 (Using Global Tensor for Multi-Device Multi-GPU Programming: Basic Operations): cookies/global_tensor.md
- 使用 Global Tensor 进行多机多设备编程:分布式并行策略 (Using Global Tensor for Multi-Device Multi-GPU Programming: Distributed Parallelism Strategies): cookies/global_tensor_distributed.md
en/mkdocs.yml
Outdated
@@ -133,7 +133,8 @@ nav:
      - Pipelining Parallelism: parallelism/06_pipeline.md

  - Cookbook:
      - Basic Operations for Using Global Tensor to Program on Cluster: cookies/global_tensor.md
      - Using Global Tensor to Program on Multi-Device Multi-GPU: Basic Operations: cookies/global_tensor.md
This needs to be changed here as well.
en/docs/cookies/global_tensor.md
Outdated
@@ -1,4 +1,4 @@
# Using Global Tensor to Program on Multi-Device Multi-GPU: Basic Operations
# Using Global Tensor for Distributed Programming: Basic Operations |
-> Distributed Programming with Global Tensor ?
@@ -0,0 +1,329 @@
# Using Global Tensor for Distributed Programming: Distributed Parallelism Strategies |
-> Distributed Programming with Global Tensor ?
Related issue: #481 (comment)
Lays out the ideas behind the "Distributed Parallelism Strategies" article along with part of its content.
TODO: