Add interface to launch parallel dygraph by multiprocessing #26044
Conversation
gongweibao
left a comment
It would be best to have a performance comparison for the spawn mode?
guru4elephant
left a comment
LGTM
```python
ParallelStrategy = core.ParallelStrategy

def init_parallel_env(backend='nccl'):
```
NCCL is an underlying communication library; I don't think it's necessary to let users know we have different backends here. If we want to support operating systems such as Windows that don't support NCCL, it's better to detect the operating system inside the init function and use another communication library, such as gloo. I highly recommend removing the backend argument for now, for simplicity of usage.
thx, I think it is okay to remove it; we can discuss it and remove this argument via a cherry-pick
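A minimal sketch of the internal detection described above; the `_select_backend` helper is hypothetical and not this PR's implementation:

```python
import sys

def _select_backend():
    # Hypothetical helper: choose the communication backend from the
    # platform instead of exposing a `backend` argument to users.
    if sys.platform == 'win32':
        # Windows does not support NCCL, so fall back to gloo.
        return 'gloo'
    return 'nccl'

def init_parallel_env():
    # No user-facing backend argument; the backend is chosen internally.
    backend = _select_backend()
    # ... initialize the parallel environment with `backend` ...
```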
guru4elephant
left a comment
please remove the backend argument for simplicity
Thanks for the suggestion; there should indeed be one. Would it be okay if I produce a report later? The development time for this interface was a bit short: we spent the past week discussing and iterating on the interface design, and it has to ship with the 2.0-beta release, so we have only verified correctness and have not yet had time to run the performance comparison. In theory this interface is no different from launch: it only changes the way the multiple processes are started and adds no extra implementation, so there should be no performance difference. It is also just an optional start method and does not affect the existing usage of launch.
XiaoguangHu01
left a comment
LGTM
jzhang533
left a comment
lgtm
raindrops2sea
left a comment
LGTM
PR types
New features
PR changes
APIs
Describe
This PR adds the multiprocessing start methods `start_processes` and `spawn` for dygraph data parallel training.

1. Start method difference

- launch: `python -m paddle.distributed.launch --selected_gpus=0,1 train.py`
- spawn: `python train.py`, with `spawn` added in the `__main__` method, for example (see the sketch below):

2. Simple example
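The original example code did not survive extraction, so the following is a minimal sketch using the APIs added by this PR; the `Linear` model, the `nprocs` value, and the training-loop placeholder are assumptions:

```python
import paddle
import paddle.distributed as dist

def train():
    # Initialize the parallel environment (replaces prepare_context).
    dist.init_parallel_env()

    # Build a model and wrap it for data parallel training; the
    # `strategy` argument is now optional.
    model = paddle.nn.Linear(10, 10)
    dp_model = paddle.DataParallel(model)

    print("rank {} of {}".format(dist.get_rank(), dist.get_world_size()))
    # ... optimizer setup and training loop go here ...

if __name__ == '__main__':
    # Start 2 training processes, instead of running
    # `python -m paddle.distributed.launch --selected_gpus=0,1 train.py`.
    dist.spawn(train, nprocs=2)
```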
3. API change
Add 4 new APIs:

- `paddle.distributed.spawn`: start multi-process training by the spawn method
- `paddle.distributed.init_parallel_env`: initialize parallel environment variables & get the parallel strategy
- `paddle.distributed.get_rank`: get the current process rank
- `paddle.distributed.get_world_size`: get the current world size

Move 2 old APIs:

- `paddle.prepare_context` (`fluid.dygraph.prepare_context`) -> `paddle.distributed.prepare_context`
- `paddle.ParallelEnv` (`fluid.dygraph.ParallelEnv`) -> `paddle.distributed.ParallelEnv`

Refine 1 old API:

- `paddle.DataParallel` (`fluid.dygraph.DataParallel`): make `strategy` an optional argument (see the sketch below)

Deprecate 1 old API:

- `paddle.distributed.prepare_context` (`fluid.dygraph.prepare_context`): to be replaced by `paddle.distributed.init_parallel_env` later
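A minimal sketch of the refined `paddle.DataParallel` usage; the `Linear` model is a placeholder, and the "before" lines are shown only for contrast:

```python
import paddle
import paddle.distributed as dist

model = paddle.nn.Linear(10, 10)

# Before this PR, a parallel strategy had to be created and passed in:
#   strategy = dist.prepare_context()
#   dp_model = paddle.DataParallel(model, strategy)

# After this PR, `strategy` is optional; init_parallel_env prepares it.
dist.init_parallel_env()
dp_model = paddle.DataParallel(model)
```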
4. Correctness

Verify the correctness of the interface in the following models:
- `test_parallel_dygraph_mnist.py`
- `test_parallel_dygraph_se_resnext.py`
- `test_parallel_dygraph_transformer.py`

5. Related docs