Using ColossalAI's native fp16 hangs in the clip_grad_norm_fp32 function #2253

Unanswered · yhcc asked this question in Community | Q&A

The function below hangs when called by FP16Optimizer (it runs for a while before hanging, but the hang always occurs at the same iteration count). I am using pipeline parallelism + tensor parallelism. Does anyone have a guess at what could cause this, so that I can try to debug it?

ColossalAI/colossalai/utils/common.py
Line 279 in 8897b8f
Replies: 1 comment

- It's probably due to this bug: #2255