Global Tensor Distributed Parallelism Strategies #512
Conversation
In our earlier meeting, the suggestion was to "first draw up an outline down to the second-level headings, and then discuss the structure together".
What we have now is a full first draft, but I think the overall direction is off, and it probably needs to be redone more or less from scratch.
I also recommend reading https://github.com/Oneflow-Inc/OneTeam/blob/master/tutorial/oneflow_docs_system.md
My suggestions:
- Set this draft aside for now, go back to the "organize the outline" stage, and rework the outline first.
- While doing so, keep in mind that this is a how-to article.
- The outline does not need concrete code yet, but it should note which kind of demo you plan to use, so that Xiaoyu and I can discuss it and give feedback (for example, if you simply say you want a matmul example, I can point out that it feels far removed from real workloads).
1. The data $x$ is split along dimension 0 (`sbp=flow.sbp.split(dim=0)`) and distributed across two GPUs (`placement=flow.placement(type="cuda", ranks=[0, 1])`)
2. The model $w$ is kept whole (`sbp=flow.sbp.broadcast`) and distributed across the same two GPUs (`placement=flow.placement(type="cuda", ranks=[0, 1])`)

After the change, the full code is as follows:
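The full code itself is not part of this excerpt. A minimal sketch of what such a data-parallel example might look like, assuming a hypothetical `data_parallel.py` that reuses the matmul shapes from the discussion (the code in the PR may differ):

```python
# data_parallel.py -- hypothetical file name, not the PR's actual code
import oneflow as flow

placement = flow.placement(type="cuda", ranks=[0, 1])

# Data x is split along dim 0: each rank holds 2 of the 4 rows.
x = flow.randn(4, 5, placement=placement, sbp=flow.sbp.split(dim=0))
# Model w is broadcast: each rank holds a full copy.
w = flow.randn(5, 8, placement=placement, sbp=flow.sbp.broadcast)

out = flow.matmul(x, w)
print(out.shape)  # (4, 8); the result stays split along dim 0
```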
This only gives the code and says nothing about how to launch it, which goes against the "practice first" principle. A user who copies this kind of code and runs it directly may find that it does not run at all.
(Has this code actually been run?)
The code has not been tested yet, which is why the launch instructions are missing. Since the amount of code is small, I just put it up here for illustration.
> Considering whether to add some further notes on data parallelism here, for example:
>> Under the data-parallel strategy, the gradients on each device need to be AllReduce-d during backpropagation so that the model replicas on all devices stay consistent.
>
>> When the dataset is large and the model is small, the communication cost of synchronizing gradients in the backward pass is low, so data parallelism is usually advantageous; common vision classification models such as ResNet50 are well suited to it.
These notes again read too much like a tutorial; I actually feel they can be left out. If they are kept, they should be written in how-to style, grounded in practice and in solving real problems at work, for example listing the linear speedup of data parallelism for a model like ResNet50 (where it works well) versus a model like BERT (where it works poorly).
> Should we also cover "Stage ID and gradient accumulation settings" here?
### Hybrid Parallelism
For hybrid parallelism, given the knowledge gap involved, it is enough to combine the three earlier examples rather than writing a complex, concrete case.
In real use, hybrid parallelism requires 2D sbp, which can be covered in a separate article.
Does combining the three examples mean giving another matmul-based program? With pairwise combinations plus using all three, there are four hybrid schemes in total; do all of them need examples?
No need for complex combinations; a single case is enough. An example does need to be included.
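A minimal sketch of what one such combined case might look like, assuming a two-stage matmul pipeline whose first stage is data parallel and whose second stage is tensor (model) parallel; the file name, shapes, ranks, and launch command are illustrative assumptions, not necessarily what the article ended up using:

```python
# hybrid_parallel.py -- hypothetical example
# Assumed launch: python3 -m oneflow.distributed.launch --nproc_per_node 4 hybrid_parallel.py
import oneflow as flow

P01 = flow.placement(type="cuda", ranks=[0, 1])  # stage 0: data parallel
P23 = flow.placement(type="cuda", ranks=[2, 3])  # stage 1: tensor (model) parallel

# Stage 0: x is split by sample (dim 0), w0 is replicated -> data parallelism.
x = flow.randn(4, 5, placement=P01, sbp=flow.sbp.split(dim=0))
w0 = flow.randn(5, 8, placement=P01, sbp=flow.sbp.broadcast)
out_stage0 = flow.matmul(x, w0)                   # sbp: split(dim=0)

# Pipeline boundary: move the activation to the second group of devices.
in_stage1 = out_stage0.to_global(placement=P23, sbp=flow.sbp.broadcast)

# Stage 1: w1 is split by output column (dim 1) -> tensor (model) parallelism.
w1 = flow.randn(8, 3, placement=P23, sbp=flow.sbp.split(dim=1))
out_stage1 = flow.matmul(in_stage1, w1)           # sbp: split(dim=1)
print(out_stage1.shape)                           # (4, 3)
```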
Running this through test.py:

```python
# test.py
import oneflow as flow

placement = flow.placement(type="cpu", ranks=[0, 1])
x = flow.randn(4, 5, placement=placement, sbp=flow.sbp.split(dim=0))
w = flow.randn(5, 8, placement=placement, sbp=flow.sbp.broadcast)
out = flow.matmul(x, w)
print(out.shape)  # (4, 8)
```

the output contains what looks like an error.

That is just an informational message: the multi-node processes are establishing their connection; wait a moment and it will continue.
```python
in_stage1 = out_stage0.to_global(placement=P23, sbp=flow.sbp.broadcast)
out_stage1 = flow.matmul(in_stage1, w1)
print(out_stage1.shape)  # (4, 3)
```
For this example, please also add an nn.Graph version and get it running end to end.
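A rough sketch of what such an nn.Graph version might look like, assuming the same two-stage matmul pipeline (placements P01/P23, weights w0/w1) as in the fragment above; this illustrates the static-graph wrapper only and is not the code that was eventually merged:

```python
import oneflow as flow
import oneflow.nn as nn

P01 = flow.placement(type="cuda", ranks=[0, 1])
P23 = flow.placement(type="cuda", ranks=[2, 3])

class PipelineGraph(nn.Graph):
    def __init__(self):
        super().__init__()

    def build(self, x, w0, w1):
        # Stage 0 runs on P01; to_global moves the activation to stage 1 on P23.
        out_stage0 = flow.matmul(x, w0)
        in_stage1 = out_stage0.to_global(placement=P23, sbp=flow.sbp.broadcast)
        return flow.matmul(in_stage1, w1)

x = flow.randn(4, 5, placement=P01, sbp=flow.sbp.split(dim=0))
w0 = flow.randn(5, 8, placement=P01, sbp=flow.sbp.broadcast)
w1 = flow.randn(8, 3, placement=P23, sbp=flow.sbp.broadcast)

graph = PipelineGraph()
out = graph(x, w0, w1)
print(out.shape)  # (4, 3)
```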
# Using Global Tensor for Multi-Device Multi-GPU Programming: Distributed Parallelism Strategies
Remember to delete this outline before merging; it does not need to go into the repository.
Also, since mkdocs.yml has not been modified yet, the newly added article will not actually show up on the website.
Please follow the PRs that added other articles, modify mkdocs.yml, run mkdocs build locally, and check the result.
Attach a screenshot of the final result as well; that makes it easier to track down formatting issues introduced when the HTML is built.
OK.
```python
    return out


class ModuleModel(nn.Module):
```
Could the eager version above also reuse this?
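A minimal sketch of the reuse being suggested, assuming `ModuleModel` is a simple one-matmul module (its real body is not shown in this excerpt); the same instance is called once in eager mode and once wrapped by nn.Graph:

```python
import oneflow as flow
import oneflow.nn as nn

placement = flow.placement(type="cuda", ranks=[0, 1])

class ModuleModel(nn.Module):
    # Assumed body: the actual module in the PR may differ.
    def __init__(self):
        super().__init__()
        self.w0 = nn.Parameter(
            flow.randn(5, 8, placement=placement, sbp=flow.sbp.broadcast)
        )

    def forward(self, x):
        out = flow.matmul(x, self.w0)
        return out

model = ModuleModel()
x = flow.randn(4, 5, placement=placement, sbp=flow.sbp.split(dim=0))

# Eager: call the module directly.
eager_out = model(x)

# Static graph: reuse the same module instance inside nn.Graph.
class ModelGraph(nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def build(self, x):
        return self.model(x)

graph = ModelGraph(model)
graph_out = graph(x)
print(eager_out.shape, graph_out.shape)  # (4, 8) (4, 8)
```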
Remember to modify mkdocs.yml and attach a screenshot of the built HTML.
I will find someone to translate the English version later.
…neflow-documentation into global-distributed
In the English version, the sample code could use a few blank lines to separate segments with different purposes, for readability.
Co-authored-by: Guoliang Cheng <[email protected]>
@@ -0,0 +1,329 @@
# Using Global Tensor for Multi-Device Multi-GPU Programming: Distributed Parallelism Strategies
Multi-Device Multi-GPU
The phrase "多机多设备编程" (multi-machine multi-device programming) now feels too wordy; it can be changed to "分布式编程", and the English translation adjusted to "distributed programming" accordingly.
Please update both articles:
- 使用 Global Tensor 进行多机多设备编程:基础操作 (Using Global Tensor for Multi-Device Multi-GPU Programming: Basic Operations): cookies/global_tensor.md
- 使用 Global Tensor 进行多机多设备编程:分布式并行策略 (Using Global Tensor for Multi-Device Multi-GPU Programming: Distributed Parallelism Strategies): cookies/global_tensor_distributed.md
en/mkdocs.yml
Outdated
@@ -133,7 +133,8 @@ nav:
      - Pipelining Parallelism: parallelism/06_pipeline.md

  - Cookbook:
      - Basic Operations for Using Global Tensor to Program on Cluster: cookies/global_tensor.md
      - Using Global Tensor to Program on Multi-Device Multi-GPU: Basic Operations: cookies/global_tensor.md
This needs to be changed here as well.
en/docs/cookies/global_tensor.md
Outdated
@@ -1,4 +1,4 @@
# Using Global Tensor to Program on Multi-Device Multi-GPU: Basic Operations
# Using Global Tensor for Distributed Programming: Basic Operations |
-> Distributed Programming with Global Tensor ?
@@ -0,0 +1,329 @@
# Using Global Tensor for Distributed Programming: Distributed Parallelism Strategies |
-> Distributed Programming with Global Tensor ?
Related issue: #481 (comment)
Lays out the ideas behind the "Distributed Parallelism Strategies" article along with part of its content.
TODO: