
Commit c79decd

Yi Wang and wayi authored
[v1.8 patch] [Resubmission] Add a documentation page for DDP communication hooks (pytorch#52215)
Co-authored-by: wayi <[email protected]>
1 parent c307a3f commit c79decd

File tree

4 files changed: +190 -103 lines changed

docs/source/ddp_comm_hooks.rst

+74
@@ -0,0 +1,74 @@
DDP Communication Hooks
=======================

DDP communication hook is a generic interface to control how gradients are
communicated across workers by overriding the vanilla allreduce in
`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel>`_.
A few built-in communication hooks are provided, and users can easily apply
any of these hooks to optimize communication. In addition, the hook interface
can also support user-defined communication strategies for more advanced use
cases.

.. warning ::
    DDP communication hook is experimental and subject to change.

.. warning ::
    DDP communication hooks only support single-process single-device mode
    on the NCCL backend.

How to Use a Communication Hook?
--------------------------------

To use a communication hook, the user just needs to let the DDP model register
the hook before the training loop, as shown below.

.. automethod:: torch.nn.parallel.DistributedDataParallel.register_comm_hook
    :noindex:
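
For illustration, a minimal sketch of registering one of the built-in hooks
(the FP16 compression hook from
``torch.distributed.algorithms.ddp_comm_hooks.default_hooks``); it assumes the
NCCL process group has already been initialized and that each process drives
exactly one GPU:

.. code-block:: python

    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Assumes dist.init_process_group("nccl", ...) was already called and that
    # this process owns the GPU with index `rank`.
    rank = dist.get_rank()
    model = DDP(nn.Linear(16, 16).to(rank), device_ids=[rank])

    # Register the hook once, before the training loop starts. For stateless
    # default hooks, the state argument is a process group or None (None means
    # the default process group).
    model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
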

Default Communication Hooks
---------------------------

Default communication hooks are simple **stateless** hooks, so the input state
in ``register_comm_hook`` is either a process group or ``None``.

.. automodule:: torch.distributed.algorithms.ddp_comm_hooks.default_hooks
    :members:

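For example, a default hook can also be given an explicit process group as its
state. Below is a minimal sketch (reusing the ``model`` from the sketch above;
the subgroup built here simply contains all ranks and is only meant to show
the shape of the API):

.. code-block:: python

    import torch.distributed as dist
    from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

    # Hand the hook a process group as its state, so the hook's allreduce runs
    # on that group instead of the default one.
    subgroup = dist.new_group(ranks=list(range(dist.get_world_size())))
    model.register_comm_hook(state=subgroup, hook=default_hooks.allreduce_hook)
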
PowerSGD Communication Hook
---------------------------

PowerSGD (`Vogels et al., NeurIPS 2019 <https://arxiv.org/abs/1905.13727>`_)
is a gradient compression algorithm that can provide very high compression
rates and accelerate bandwidth-bound distributed training.
This algorithm needs to maintain both some hyperparameters and the internal
state. Therefore, the PowerSGD communication hook is a **stateful** hook,
and the user needs to provide a state object, defined below.

PowerSGD State
^^^^^^^^^^^^^^^^

.. currentmodule:: torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook
.. autoclass:: PowerSGDState

PowerSGD Hooks
^^^^^^^^^^^^^^^^

.. warning ::
    PowerSGD typically requires extra memory of the same size as the model's
    gradients to enable error feedback, which can compensate for biased
    compressed communication and improve accuracy.

.. warning ::
    The current implementation may cause gradient overflow for FP16 input.

.. autofunction:: powerSGD_hook
.. autofunction:: batched_powerSGD_hook

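Putting these together, a minimal sketch of registering the stateful PowerSGD
hook (reusing the ``model`` from the earlier sketch; ``matrix_approximation_rank=1``
is only an illustrative choice, and ``process_group=None`` means the default
process group):

.. code-block:: python

    from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

    # The state object carries the hyperparameters as well as the error-feedback
    # and warm-start tensors that PowerSGD maintains across iterations.
    state = powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1)

    # Register the stateful hook on the DDP-wrapped model before training.
    model.register_comm_hook(state, powerSGD.powerSGD_hook)
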
Acknowledgements
----------------

Many thanks to PowerSGD paper author **Thijs Vogels** for the code review on
PowerSGD communication hook, as well as the
`comparison experiments <https://observablehq.com/@tvogels/powersgd-benchmark>`_,
which show that the performance of PowerSGD communication hook is on par with
the implementation in the original `paper <https://arxiv.org/abs/1905.13727>`_.

docs/source/index.rst

+1
@@ -71,6 +71,7 @@ Features described in this documentation are classified by release status:
   onnx
   optim
   complex_numbers
+  ddp_comm_hooks
   pipeline
   quantization
   rpc