
Global Tensor 分布式并行策略 (Distributed Parallelism Strategies) #512

Merged — 27 commits merged into master on Sep 8, 2022
Conversation

@lmyybh (Contributor) commented on Aug 4, 2022

Related issue: #481 (comment)
This lists the planned structure and some draft content for the article 《分布式并行策略》 (Distributed Parallelism Strategies).

TODO:

  • Add authors
  • Test the code
  • Flesh out the content

@lmyybh added the cn (Chinese documentation) label on Aug 4, 2022
@doombeaker (Collaborator) left a comment

At the earlier meeting, our suggestion was to "first draft an outline down to second-level headings, then discuss the structure together".

Instead, a first draft has come out directly, and I think its overall direction is wrong. It probably needs to be redone more or less from scratch.

I also suggest reading https://github.com/Oneflow-Inc/OneTeam/blob/master/tutorial/oneflow_docs_system.md

My suggestions:

  1. Set this draft aside for now, go back to the "outline" stage, and rework the outline.
  2. While doing so, keep in mind that this is a how-to article.
  3. When outlining, you can leave out concrete code, but do note what kind of demo you plan to use, so that Xiaoyu and I, among others, can discuss it and give feedback (for example, if you say you want a matmul example, I can give my view that it feels far removed from real-world work).

cn/docs/cookies/global_tensor_distributed.md (Outdated)
1. The data $x$ is split along dimension 0 (`sbp=flow.sbp.split(dim=0)`) and placed on the two devices (`placement=flow.placement(type="cuda", ranks=[0, 1])`)
2. The model $w$ is kept whole (`sbp=flow.sbp.broadcast`) and placed on the two devices (`placement=flow.placement(type="cuda", ranks=[0, 1])`)

After this change, the complete code is as follows:
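(A minimal sketch of what such data-parallel code looks like, based on the `test.py` shown later in this thread; the two-GPU placement and the tensor shapes are assumptions:)

```python
# Sketch: data-parallel matmul with Global Tensor (assumed shapes; two GPUs).
import oneflow as flow

placement = flow.placement(type="cuda", ranks=[0, 1])

# x is split along dim 0 across the two devices; w is broadcast (kept whole).
x = flow.randn(4, 5, placement=placement, sbp=flow.sbp.split(dim=0))
w = flow.randn(5, 8, placement=placement, sbp=flow.sbp.broadcast)

out = flow.matmul(x, w)
print(out.shape)  # oneflow.Size([4, 8])
```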
@doombeaker (Collaborator) commented on Aug 4, 2022

This only provides the code without saying how to launch it, which does not fit the "practice first" principle. A user who copies this code and runs it directly will likely find that it does not work.

(Has this code actually been run?)

@lmyybh (Author):

The code has not been tested yet, so I did not write the launch part; since the code is short, my idea was just to post it here to show the approach.

Comment on lines 83 to 86
> I am considering whether to add more discussion of data parallelism here, for example:
>> Under the data-parallel strategy, the gradients on each device need to be AllReduce'd during the backward pass, so that the model replicas on all devices stay consistent.

>> When the dataset is large and the model is small, the communication cost of synchronizing gradients in the backward pass is low, so data parallelism is usually advantageous; common vision classification models, such as ResNet50, are a good fit for data parallelism.
Collaborator:

This kind of introduction again reads too much like a tutorial; I think it can be left out. If it stays, it should also be written in how-to style, oriented toward practice and toward solving problems from real work. For example, list the linear speedup of data parallelism for a model like ResNet50 (where it works well) and for a model like BERT (where it works poorly).


> Should "Stage ID and gradient accumulation settings" be covered here?

### Hybrid Parallelism
Contributor:

For hybrid parallelism, to keep the conceptual jump manageable, it is enough to combine the three earlier examples rather than write a complicated, concrete case.

In real use, hybrid parallelism needs 2D sbp, which can get its own separate article.

@lmyybh (Author):

By "combining the three examples", do you mean giving another matmul-based program? For hybrid parallelism, the pairwise combinations plus using all three give four combinations in total — do examples need to be provided for all of them?

Contributor:

No need for a complicated combination; one case is enough. But yes, please include the example.
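(As a rough illustration of such a single combined case, a sketch along these lines might work — the placements, shapes, and division of work across stages below are assumptions, not the PR's actual example:)

```python
# Sketch: one hybrid case combining the earlier examples — data parallelism on
# stage 0 (ranks 0,1), tensor parallelism on stage 1 (ranks 2,3), connected
# pipeline-style via to_global(). All shapes and placements are illustrative.
# Launch with: python3 -m oneflow.distributed.launch --nproc_per_node 4 hybrid.py
import oneflow as flow

P01 = flow.placement(type="cuda", ranks=[0, 1])  # stage 0 devices
P23 = flow.placement(type="cuda", ranks=[2, 3])  # stage 1 devices

# Stage 0: data parallel — x split along dim 0, w0 kept whole on each device.
x = flow.randn(4, 5, placement=P01, sbp=flow.sbp.split(dim=0))
w0 = flow.randn(5, 8, placement=P01, sbp=flow.sbp.broadcast)
out_stage0 = flow.matmul(x, w0)

# Stage 1: tensor (model) parallel — activation broadcast, w1 split along dim 1.
in_stage1 = out_stage0.to_global(placement=P23, sbp=flow.sbp.broadcast)
w1 = flow.randn(8, 3, placement=P23, sbp=flow.sbp.split(dim=1))
out_stage1 = flow.matmul(in_stage1, w1)
print(out_stage1.shape)  # oneflow.Size([4, 3])
```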

@lmyybh (Author) commented on Aug 8, 2022

When running the following program with `python3 -m oneflow.distributed.launch --nproc_per_node 2 test.py`:

```python
# test.py
import oneflow as flow

# Two-process CPU placement; x is split along dim 0, w is broadcast.
placement = flow.placement(type="cpu", ranks=[0, 1])
x = flow.randn(4, 5, placement=placement, sbp=flow.sbp.split(dim=0))
w = flow.randn(5, 8, placement=placement, sbp=flow.sbp.broadcast)
out = flow.matmul(x, w)
print(out.shape)  # (4, 8)
```

the output is:

```
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
W20220808 14:14:29.378463 17476 rpc_client.cpp:190] LoadServer 127.0.0.1 Failed at 0 times error_code 14 error_message failed to connect to all addresses
oneflow.Size([4, 8])
oneflow.Size([4, 8])
```

An error appears: `LoadServer 127.0.0.1 Failed at 0 times error_code 14 error_message failed to connect to all addresses`. How can it be avoided?

@strint (Contributor) commented on Aug 8, 2022

> LoadServer 127.0.0.1 Failed at 0 times error_code 14 error_message failed to connect to all addresses — how can this be avoided?

That message is only informational: the processes are still establishing connections with each other. Wait a moment and it resolves on its own.

```python
in_stage1 = out_stage0.to_global(placement=P23, sbp=flow.sbp.broadcast)
out_stage1 = flow.matmul(in_stage1, w1)
print(out_stage1.shape) # (4, 3)
```
Contributor:

For this example, please also add an nn.Graph version that runs end to end.
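(A sketch of what such an nn.Graph version might look like — the class name, placements, and shapes are assumptions; the two stages follow the snippet quoted above:)

```python
# Sketch: nn.Graph version of the two-stage pipeline matmul example.
# Names, placements, and shapes are assumed for illustration.
# Launch with: python3 -m oneflow.distributed.launch --nproc_per_node 4 graph_pipeline.py
import oneflow as flow
import oneflow.nn as nn

P01 = flow.placement(type="cuda", ranks=[0, 1])  # stage 0
P23 = flow.placement(type="cuda", ranks=[2, 3])  # stage 1

class PipelineGraph(nn.Graph):
    def build(self, x, w0, w1):
        # Stage 0 runs on P01; stage 1 on P23, connected with to_global().
        out_stage0 = flow.matmul(x, w0)
        in_stage1 = out_stage0.to_global(placement=P23, sbp=flow.sbp.broadcast)
        return flow.matmul(in_stage1, w1)

x = flow.randn(4, 5, placement=P01, sbp=flow.sbp.broadcast)
w0 = flow.randn(5, 8, placement=P01, sbp=flow.sbp.broadcast)
w1 = flow.randn(8, 3, placement=P23, sbp=flow.sbp.broadcast)

graph = PipelineGraph()
out_stage1 = graph(x, w0, w1)
print(out_stage1.shape)  # oneflow.Size([4, 3])
```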


Comment on lines 1 to 2
# 使用 Global Tensor 进行多机多设备编程:分布式并行策略

Collaborator:

Remember to delete this outline before merging; it should not go into the repository.

Also, since mkdocs.yml has not been modified yet, the newly added article will not actually show up on the website.

Please follow the PRs that added other new articles, modify mkdocs.yml, and run mkdocs build locally to check the result.

Post a screenshot of the final rendered page, so it is easier to spot formatting issues that appear when the HTML is built.

@lmyybh (Author):

OK.

```python
    return out


class ModuleModel(nn.Module):
```
Contributor:

Could the eager version above reuse this as well?
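(For example, a sketch of reusing one module for both eager and nn.Graph execution might look like this — the module body, shapes, and placement are assumptions:)

```python
# Sketch: reusing the same nn.Module for eager execution and inside nn.Graph.
# The module definition, shapes, and placement here are illustrative assumptions.
import oneflow as flow
import oneflow.nn as nn

class ModuleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 8)

    def forward(self, x):
        out = self.linear(x)
        return out

placement = flow.placement(type="cuda", ranks=[0, 1])
model = ModuleModel().to_global(placement=placement, sbp=flow.sbp.broadcast)
x = flow.randn(4, 5, placement=placement, sbp=flow.sbp.split(dim=0))

# Eager: call the module directly.
eager_out = model(x)

# Graph: wrap the very same module instance in an nn.Graph.
class ModelGraph(nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def build(self, x):
        return self.model(x)

graph_out = ModelGraph(model)(x)
print(eager_out.shape, graph_out.shape)  # both oneflow.Size([4, 8])
```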

@doombeaker (Collaborator) left a comment

Remember to modify mkdocs.yml and to attach the HTML screenshot.
I will find someone to translate the English version afterwards.

@lmyybh (Author) commented on Aug 23, 2022

Rendered page:

[screenshot: distributed]

@lmyybh (Author) left a comment

In the English version, the sample code could use a few blank lines to separate code segments with different purposes, to make it easier to read.

@@ -0,0 +1,329 @@
# Using Global Tensor for Multi-Device Multi-GPU Programming: Distributed Parallelism Strategies
Contributor:

Multi-Device Multi-GPU

"多机多设备编程",现在感觉太啰嗦了,可以改了 “分布式编程”,这里翻译需要调整为 distributed programming

请帮忙把这两篇文章都改一下:

  • 使用 Global Tensor 进行多机多设备编程:基础操作: cookies/global_tensor.md
  • 使用 Global Tensor 进行多机多设备编程:分布式并行策略: cookies/global_tensor_distributed.md

en/mkdocs.yml (Outdated)
@@ -133,7 +133,8 @@ nav:
- Pipelining Parallelism: parallelism/06_pipeline.md

- Cookbook:
-  - Basic Operations for Using Global Tensor to Program on Cluster: cookies/global_tensor.md
+  - Using Global Tensor to Program on Multi-Device Multi-GPU: Basic Operations: cookies/global_tensor.md
Contributor:

This needs to be changed here as well.

@@ -1,4 +1,4 @@
-# Using Global Tensor to Program on Multi-Device Multi-GPU: Basic Operations
+# Using Global Tensor for Distributed Programming: Basic Operations
Contributor:

-> Distributed Programming with Global Tensor ?

@@ -0,0 +1,329 @@
# Using Global Tensor for Distributed Programming: Distributed Parallelism Strategies
Contributor:

-> Distributed Programming with Global Tensor ?

@doombeaker added the en (English documentation) label on Sep 8, 2022
@doombeaker merged commit 5d42824 into master on Sep 8, 2022
@doombeaker deleted the global-distributed branch on September 8, 2022 08:18