Do ZeRO2 and ZeRO3 need gradient accumulation? #4118

Gy-Lu · 2023-06-29T08:32:29Z

As we know, ZeRO2 and ZeRO3 shard the gradients across ranks, which at first glance looks incompatible with gradient accumulation.
However, the two are not fundamentally incompatible.
For instance, each rank can accumulate the gradient shard it owns after the communication for each micro-batch.
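To make the idea concrete, here is a minimal single-process sketch in plain Python. It is not any library's actual implementation: the function names (`reduce_scatter_mean`, `accumulate_sharded`) are hypothetical, the "communication" is simulated with lists, and it assumes the parameter count divides evenly by the number of ranks.

```python
# Single-process simulation of sharded gradient accumulation.
# Assumption: all names are illustrative; no real distributed backend is used.

def reduce_scatter_mean(per_rank_grads):
    """Simulate reduce-scatter: average the gradients across ranks,
    then give rank r only the r-th contiguous shard of the result."""
    world_size = len(per_rank_grads)
    n = len(per_rank_grads[0])
    shard = n // world_size  # assumes n is divisible by world_size
    averaged = [sum(g[i] for g in per_rank_grads) / world_size for i in range(n)]
    return [averaged[r * shard:(r + 1) * shard] for r in range(world_size)]


def accumulate_sharded(micro_batches):
    """micro_batches[k][r] is the full local gradient that rank r
    computed on micro-batch k. Returns one accumulated shard per rank."""
    steps = len(micro_batches)
    world_size = len(micro_batches[0])
    shard = len(micro_batches[0][0]) // world_size
    acc = [[0.0] * shard for _ in range(world_size)]
    for per_rank_grads in micro_batches:
        # Communication still happens on every micro-batch.
        shards = reduce_scatter_mean(per_rank_grads)
        for r in range(world_size):
            for i in range(shard):
                # Each rank accumulates only the shard it owns.
                acc[r][i] += shards[r][i] / steps
    return acc


# 2 micro-batches, 2 ranks, 4 parameters (made-up gradient values).
micro_batches = [
    [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]],
    [[0.0, 2.0, 4.0, 6.0], [4.0, 2.0, 0.0, 2.0]],
]
acc = accumulate_sharded(micro_batches)
print(acc)  # [[2.0, 2.0], [2.0, 3.0]]
```

Concatenating the per-rank shards gives `[2.0, 2.0, 2.0, 3.0]`, which matches the mean gradient over all ranks and micro-batches, so each rank only ever needs a buffer the size of its own shard.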

Drawback

This version of gradient accumulation saves no communication: gradients are still communicated on every micro-batch rather than once per accumulated step.

Advantage

For users who need a large batch to train their model (e.g. for convergence) but have limited GPU memory, gradient accumulation can be a solution.

Gy-Lu · 2023-06-29T08:34:15Z

@ver217 @FrankLeeeee @kurisusnowdeng
