[BUG] Convergence Issue: Training BERT for Embedding with ZeRO-2 and 3 as Compared to Torchrun #6911
Comments
@dawnik17, thanks for reporting this issue. Can you please provide more details that enable us to reproduce the problem?
Hi @tjruwase, I am using BERT mini (L=4, dim=256) from here. An update from my side: I started another training run with a revised DeepSpeed config after going through other GitHub issues. Below is the part I've changed from the previous config (mainly, I've set "overlap_comm" to False).
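A minimal sketch of the kind of ZeRO fragment being described here (the exact values are assumptions for illustration, not the reporter's actual config):

```python
# Illustrative sketch only; values are assumptions, not the reporter's config.
ds_zero_fragment = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": False,          # changed from True to rule out overlap_comm issues
        "contiguous_gradients": True,   # assumption
    }
}
```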
I'm not sure if this will solve the issue. The training curve looks identical so far, but I can't really tell yet because the deviation starts after 1000 steps. I'm attaching a screenshot of the training so far. Please let me know what you think and what else I can try. Thanks! :)
@dawnik17, thanks for the update. We have seen reports of potential bugs in overlap_comm. The new loss curve looks promising. Hopefully that works and unblocks you. What about the grad_norm curve? That seemed to show the error sooner.
@tjruwase Unfortunately, the grad_norm curve has started to show deviation. Even the training curve has started to diverge now. The divergence in both starts at the same number of steps (as expected).
@dawnik17, can you share the command line for your run?
@tjruwase I'm running things with the deepspeed command like so:
And I'm reading the hyperparameters from a YAML file.
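For context, a minimal sketch of how such a setup typically wires YAML hyperparameters into DeepSpeed; the file name, config key, and the `build_model` helper below are assumptions for illustration, not the reporter's actual code.

```python
# Illustrative sketch only; names and keys here are hypothetical.
# Launched with something like: deepspeed train.py --hparams hparams.yaml
import yaml
import deepspeed

with open("hparams.yaml") as f:           # hypothetical hyperparameter file
    hparams = yaml.safe_load(f)

model = build_model(hparams)              # hypothetical model-construction helper

# deepspeed.initialize wraps the model and returns the DeepSpeed engine;
# `config` accepts either a dict or a path to a JSON config file.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=hparams["deepspeed_config"],
)
```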
@dawnik17, I am unfamiliar with this BERT codebase, so I will need very specific instructions. In particular:
Thanks!
@tjruwase Let me share the codebase with you in a couple of hours.
@tjruwase I have added all the relevant files here - https://github.com/dawnik17/debug_deepspeed/tree/main |
@tjruwase Thanks! :)
@tjruwase An observation: though the loss curves deviate, there is an uncanny similarity between them (the signal pattern in both curves is exactly the same).
PS: An update on your 3rd point: zero_stage = 0 follows the same loss curve as ZeRO-3. Also, I get the following after every step with ZeRO-3:
@dawnik17, the curve similarity is indeed interesting. I currently have no ideas about the cause. Can you share the corresponding grad norm curves?
Thanks for sharing this ablation. It suggests the problem is independent of ZeRO, since stage=0 is DeepSpeed's implementation of DDP.
This can be safely ignored. It indicates that ZeRO-3 is detecting a new module trace and is thus invalidating its trace cache.
Is it possible for you to share your dataset or a sample version?
Describe the bug
There is a convergence issue when using ZeRO-2/3 as compared to running with torch DDP. I'm attaching my DeepSpeed config and training screenshots below. I'm training BERT for an embedding task.
I use reentrant = True while training.
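For reference, "reentrant = True" here presumably refers to PyTorch's reentrant activation-checkpointing implementation; a minimal sketch of that call is below (an assumption about the setup, not code taken from the reporter's repository).

```python
# Hypothetical illustration of reentrant activation checkpointing;
# the actual training code may enable it differently.
from torch.utils.checkpoint import checkpoint

def checkpointed_forward(block, hidden_states):
    # use_reentrant=True selects the original, reentrant checkpoint
    # implementation (the historical default in PyTorch).
    return checkpoint(block, hidden_states, use_reentrant=True)
```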
Expected behavior
The training curves (loss and grad norm) should be similar across the two setups.
DeepSpeed Config
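A purely illustrative sketch of a ZeRO-3 config consistent with the thread; only stage=3 and overlap_comm=False come from the discussion, and every other value is an assumption rather than the reporter's actual configuration.

```python
# Purely illustrative DeepSpeed config sketch; not the reporter's actual file.
ds_config = {
    "train_micro_batch_size_per_gpu": 32,   # assumption
    "gradient_accumulation_steps": 1,       # assumption
    "bf16": {"enabled": True},              # assumption
    "gradient_clipping": 1.0,               # assumption
    "zero_optimization": {
        "stage": 3,                          # from the thread
        "overlap_comm": False,               # from the thread
        "contiguous_gradients": True,        # assumption
    },
}
```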
Training Curves
The green curve is ZeRO-3 and the purple curve is torch DDP.