Do ZeRO2 and ZeRO3 need gradient accumulation? #4118
                  
                    
Gy-Lu started this conversation in Development | Core
              
ZeRO2 and ZeRO3 shard the gradients across ranks, which at first glance looks incompatible with gradient accumulation.
However, the two are not fundamentally incompatible.
For instance, each rank can accumulate only the gradient shard it owns, after the gradient communication (e.g. a reduce-scatter) of every micro-batch.
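To make the idea concrete, here is a minimal single-process sketch in pure Python. It is only a simulation under stated assumptions, not how any particular library implements it: the function names `reduce_scatter` and `accumulate_sharded` are hypothetical, the parameter count is assumed to be divisible by the number of ranks, and real implementations would use collective operations on GPU tensors.

```python
from typing import List, Tuple


def reduce_scatter(per_rank_grads: List[List[float]], world_size: int) -> List[List[float]]:
    """Simulated reduce-scatter (hypothetical helper).

    Sums the full gradients of all ranks element-wise, then hands rank r
    the r-th contiguous shard of the sum.
    """
    n = len(per_rank_grads[0])
    shard = n // world_size  # assumption: n is divisible by world_size
    summed = [sum(g[i] for g in per_rank_grads) for i in range(n)]
    return [summed[r * shard:(r + 1) * shard] for r in range(world_size)]


def accumulate_sharded(
    micro_batches: List[List[List[float]]], world_size: int
) -> Tuple[List[List[float]], int]:
    """Accumulate gradients shard-wise over several micro-batches.

    micro_batches[step][rank] is the full local gradient that `rank`
    computed at micro-step `step`. Each rank keeps an accumulation buffer
    that is only as large as its own shard.
    """
    n = len(micro_batches[0][0])
    shard = n // world_size
    acc = [[0.0] * shard for _ in range(world_size)]
    comm_calls = 0
    for per_rank_grads in micro_batches:
        # Communication happens on every micro-batch, not once per step.
        shards = reduce_scatter(per_rank_grads, world_size)
        comm_calls += 1
        for r in range(world_size):
            for i in range(shard):
                acc[r][i] += shards[r][i]
    return acc, comm_calls


if __name__ == "__main__":
    # 3 micro-steps, 2 ranks, 4 parameters (illustrative values only).
    grads = [
        [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]],
        [[1.0, 0.0, 1.0, 0.0], [2.0, 2.0, 2.0, 2.0]],
        [[0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]],
    ]
    acc, calls = accumulate_sharded(grads, world_size=2)
    print(acc, calls)
```

In this sketch the accumulated shards match what you would get by accumulating the full gradient first and sharding afterwards, while each rank only ever stores a buffer of shard size.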
Drawback
This form of gradient accumulation saves no communication: gradients are still communicated on every micro-batch rather than once per optimizer step.
Advantage
For users who need a large batch to train their model (e.g. for convergence) but have limited GPU memory, gradient accumulation can be a solution.
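As a rough illustration of the batch-size arithmetic, assuming plain data parallelism where every rank processes its own micro-batch, the sketch below computes the effective batch size and the number of accumulation steps needed to reach a target. The function names and the numbers are hypothetical and chosen only for the example.

```python
def effective_batch_size(micro_batch_per_gpu: int, accumulation_steps: int, world_size: int) -> int:
    """Samples contributing to one optimizer step (data-parallel assumption)."""
    return micro_batch_per_gpu * accumulation_steps * world_size


def accumulation_steps_needed(target_batch: int, micro_batch_per_gpu: int, world_size: int) -> int:
    """Smallest number of micro-steps whose effective batch reaches target_batch."""
    per_step = micro_batch_per_gpu * world_size
    return -(-target_batch // per_step)  # integer ceiling division


if __name__ == "__main__":
    # Illustrative numbers: 8 GPUs that can each fit only 4 samples at a time.
    print(effective_batch_size(4, 8, 8))
    print(accumulation_steps_needed(256, 4, 8))
```

The per-GPU activation memory is governed by the micro-batch size, so the effective batch can grow through `accumulation_steps` without increasing it.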