Training strategy for Zipformer using fp16? #1461
Unanswered
ZQuang2202 asked this question in Q&A
Replies: 1 comment 1 reply
Are you using a single GPU with max-duration=300? The gradient noise might be large with such a small batch size. You could try a smaller base-lr, such as 0.025, and keep lr_batches/lr_epochs unchanged. Usually you don't need to tune the Balancer and Whitener configurations.
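To see what lowering base-lr changes while lr_batches/lr_epochs stay fixed, here is a minimal sketch of an Eden-style schedule. The formula and the default values (lr_batches=7500, lr_epochs=3.5, a 500-batch warmup starting at 0.5) are assumptions based on my reading of icefall's scheduler, not something stated in this thread; `eden_lr` is a hypothetical helper, so please check `optim.py` in your checkout before relying on it.

```python
def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=3.5,
            warmup_batches=500.0, warmup_start=0.5):
    """Eden-style learning rate (assumed formula, see the note above)."""
    # Decay terms: each stays near 1 early on and falls off slowly
    # once batch >> lr_batches or epoch >> lr_epochs.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    # Linear warmup from warmup_start to 1.0 over the first warmup_batches.
    if batch >= warmup_batches:
        warmup = 1.0
    else:
        warmup = warmup_start + (1.0 - warmup_start) * (batch / warmup_batches)
    return base_lr * batch_factor * epoch_factor * warmup


if __name__ == "__main__":
    # Compare an assumed default base-lr (0.045) with the suggested 0.025.
    for batch in (0, 500, 800, 7500):
        print(batch, eden_lr(0.045, batch, 0), eden_lr(0.025, batch, 0))
```

Under these assumptions base-lr only rescales the whole curve, so the shape of the warmup and decay is the same for both settings.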
-
Hi everyone,
I am a student trying to reproduce the Zipformer results on LibriSpeech 100h, but limited hardware prevents me from using the recommended configuration. I have therefore reduced the batch size (max_duration) from the recommended 1000 to 300. However, I am struggling to find an appropriate configuration for the Eden learning-rate scheduler.
Following the rule of thumb that the learning rate should decrease by a factor of √k when the batch size decreases by a factor of k, I initially set base_lr to 0.03 and kept the other settings at their defaults, but training diverged. Adjusting lr_batches, lr_epochs (3.5-6), and base_lr (0.3-0.45) did not help. Notably, training diverges when batch_count is around 700-900, leading to 'parameter domination' issues in embed_conv and some attention modules. I attach some log information below.
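For reference, the √k rule above can be written out as a short calculation. This is only a sketch: the reference base_lr of 0.045 at max_duration=1000 is an assumed default (it is not stated in this post), and `scale_base_lr` is a hypothetical helper, not part of icefall.

```python
import math


def scale_base_lr(ref_lr, ref_max_duration, new_max_duration):
    """Scale the learning rate by 1/sqrt(k), where k is the batch-size reduction."""
    k = ref_max_duration / new_max_duration  # e.g. 1000 / 300
    return ref_lr / math.sqrt(k)


if __name__ == "__main__":
    # Assumed reference point: base_lr=0.045 at max_duration=1000.
    print(scale_base_lr(0.045, 1000, 300))
```

With that assumed reference point, the rule gives a value slightly below 0.025 for max_duration=300.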



In an effort to address this, I tried reducing the gradient scale of the layers experiencing 'parameter domination', but this proved ineffective.
I have a few questions:
Thank you.