DDP Communication Hooks
=======================

DDP communication hook is a generic interface to control how gradients are
communicated across workers by overriding the vanilla allreduce in
`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel>`_.
A few built-in communication hooks are provided,
and users can easily apply any of these hooks to optimize communication.
Besides, the hook interface can also support user-defined communication
strategies for more advanced use cases.

.. warning::
    DDP communication hook is experimental and subject to change.

.. warning::
    DDP communication hooks only support single-process single-device mode
    on the NCCL backend.

How to Use a Communication Hook?
--------------------------------

To use a communication hook, the user just needs to register the hook on the
DDP model before the training loop starts, as shown below.

.. automethod:: torch.nn.parallel.DistributedDataParallel.register_comm_hook
    :noindex:
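
The sketch below illustrates this registration pattern with a hypothetical
no-op hook that skips communication and simply hands back the bucket's local
gradients. It assumes the process group has already been initialized on the
NCCL backend, and that the gradient bucket passed to the hook exposes a
``get_tensors()`` accessor (as in the PyTorch version this page targets)::

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Assumes dist.init_process_group("nccl", ...) has already been called
    # (single process, single device per process).
    rank = dist.get_rank()
    model = DDP(torch.nn.Linear(8, 8).to(rank), device_ids=[rank])

    def noop_hook(state, bucket):
        # Hypothetical no-op strategy: skip communication and return the
        # bucket's local gradients wrapped in an already-completed future.
        fut = torch.futures.Future()
        fut.set_result(bucket.get_tensors())
        return fut

    # Register the hook once, before the training loop starts.
    model.register_comm_hook(state=None, hook=noop_hook)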

Default Communication Hooks
---------------------------

Default communication hooks are simple **stateless** hooks, so the input state
in ``register_comm_hook`` is either a process group or ``None``.

.. automodule:: torch.distributed.algorithms.ddp_comm_hooks.default_hooks
    :members:
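
As a usage sketch (assuming ``fp16_compress_hook`` is among the hooks
documented above, and reusing the DDP ``model`` from the earlier example), a
default hook is registered with either ``None`` or a process group as the
state::

    import torch.distributed as dist
    from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

    # Stateless hook: passing None as the state runs the compressed
    # allreduce on the default process group.
    model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)

    # Alternatively, pass an explicit (sub)group as the state, e.g. one
    # created with dist.new_group(...) over a subset of ranks:
    # subgroup = dist.new_group(ranks=[0, 1])
    # model.register_comm_hook(state=subgroup, hook=default_hooks.fp16_compress_hook)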

PowerSGD Communication Hook
---------------------------

PowerSGD (`Vogels et al., NeurIPS 2019 <https://arxiv.org/abs/1905.13727>`_)
is a gradient compression algorithm that can provide very high compression
rates and accelerate bandwidth-bound distributed training.
The algorithm needs to maintain both hyperparameters and internal state.
Therefore, the PowerSGD communication hook is a **stateful** hook,
and the user needs to provide a state object as defined below.

PowerSGD State
^^^^^^^^^^^^^^

.. currentmodule:: torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook
.. autoclass:: PowerSGDState
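
As a rough sketch of constructing the state object (the parameter names here
are assumptions; the authoritative list is the ``PowerSGDState`` documentation
above)::

    from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

    # A lower matrix_approximation_rank gives a higher compression rate but a
    # coarser low-rank approximation of each gradient tensor. Error feedback
    # compensates for the bias introduced by the compression.
    state = powerSGD.PowerSGDState(
        process_group=None,  # None means the default process group
        matrix_approximation_rank=1,
        use_error_feedback=True,
        warm_start=True,
    )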

PowerSGD Hooks
^^^^^^^^^^^^^^

.. warning::
    PowerSGD typically requires extra memory of the same size as the model's
    gradients to enable error feedback, which can compensate for biased
    compressed communication and improve accuracy.

.. warning::
    The current implementation may cause gradient overflow for FP16 input.

.. autofunction:: powerSGD_hook
.. autofunction:: batched_powerSGD_hook
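
Putting the pieces together, a sketch of registering PowerSGD on the DDP
``model`` with the ``state`` object from the earlier sketches::

    from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

    # Replace DDP's built-in allreduce with PowerSGD compression.
    model.register_comm_hook(state, powerSGD.powerSGD_hook)

    # The batched variant compresses each bucket's flattened gradients as a
    # single matrix, which can be faster but may reduce accuracy:
    # model.register_comm_hook(state, powerSGD.batched_powerSGD_hook)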

Acknowledgements
----------------

Many thanks to PowerSGD paper author **Thijs Vogels** for the code review on
the PowerSGD communication hook, as well as the
`comparison experiments <https://observablehq.com/@tvogels/powersgd-benchmark>`_,
which show that the performance of the PowerSGD communication hook is on par
with the implementation in the original `paper <https://arxiv.org/abs/1905.13727>`_.