Fix long CPU->GPU synchronization during Gradient clipping (#3318)
Summary:
Pull Request resolved: #3318
1. Recent changes from D79128843 introduced a sync point in `clipping.py`, which showed up in the trace.
2. The code created CPU tensors and moved them **synchronously** to CUDA devices, causing long wait times in training that appeared as `cudaStreamSynchronize` in the trace.
3. This caused QPS degradation in the CTX FM model, which I was actively working on optimizing, and in most other models, including OmniFM, that enable optimizer gradient clipping in their YAML config.
4. This fix improves QPS by around 5% while keeping NE unaffected.
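
The pattern described in points 1 and 2 can be sketched roughly as follows. This is an illustrative example only, not the actual change to `clipping.py`: the function names `clip_grads_blocking` and `clip_grads_on_device` are hypothetical, and the sketch assumes the sync came from building a small tensor on the host and copying it to the device with a default (blocking) `.to(device)`.

```python
import torch


def clip_grads_blocking(grads, max_norm):
    """Hypothetical 'before' pattern: stage a tensor on the host, then copy it over."""
    device = grads[0].device
    # torch.tensor(...) allocates in pageable host memory. On a CUDA device the
    # default .to(device) copy makes the host wait until the transfer finishes.
    limit = torch.tensor(max_norm).to(device)
    total_norm = torch.stack([g.norm(2) for g in grads]).norm(2)
    scale = torch.clamp(limit / (total_norm + 1e-6), max=1.0)
    for g in grads:
        g.mul_(scale)
    return total_norm


def clip_grads_on_device(grads, max_norm):
    """Hypothetical 'after' pattern: create the tensor directly on the device."""
    device = grads[0].device
    # torch.full allocates and fills on `device`, so no host staging tensor
    # and no blocking host-to-device copy is involved.
    limit = torch.full((), max_norm, device=device)
    total_norm = torch.stack([g.norm(2) for g in grads]).norm(2)
    # The scale factor stays a device tensor; it is never read back to the host.
    scale = torch.clamp(limit / (total_norm + 1e-6), max=1.0)
    for g in grads:
        g.mul_(scale)
    return total_norm
```

Both functions compute the same clipped gradients; the difference is only where the helper tensor is created. On a CPU-only machine the two behave identically, so the effect is visible only in a CUDA trace.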
Reviewed By: wz337
Differential Revision: D80959986
fbshipit-source-id: 55b0ae4165cabe4d5ce66ad442814868d408a1ac