First of all, thank you for your project; it looks great! I have been trying to apply it to ViT, much like V-MoE. During training I observed the loss behavior shown in the graph below, and I have a few questions about whether these situations are normal:
For the `balance_loss`: it briefly increases and then stabilizes around 5.0 without decreasing further. How can I verify whether the experts have actually achieved balance in this case?
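For reference, this is roughly how I have been checking balance on my side (a minimal sketch with my own names; it is not taken from your code): I log the fraction of tokens routed to each expert per batch and see whether it stays close to `1/num_experts`.

```python
import torch

def expert_load_fractions(expert_indices: torch.Tensor, num_experts: int) -> torch.Tensor:
    # expert_indices: long tensor with the expert id chosen for each token (top-1 route)
    counts = torch.bincount(expert_indices.flatten(), minlength=num_experts).float()
    return counts / counts.sum()

# Quick check with random assignments; in practice expert_indices would come
# from the router's argmax over the gate logits for one training batch.
fractions = expert_load_fractions(torch.randint(0, 8, (4096,)), num_experts=8)
print(fractions)  # ideally each entry stays near 1/8; one dominant entry suggests collapse
```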
The `aux_loss`, i.e. the sum of `weighted_balance_loss` and `weighted_router_z_loss`, contributes relatively little to the overall loss. Although it is indeed decreasing, should I increase the two `coef` values in your code?
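For context, this is how I understand the two coefficients entering the objective (a sketch with names and values of my own choosing, not your exact code); these are the values I would be scaling up:

```python
# My reading of how the auxiliary terms are combined (names/values are mine,
# roughly the order of magnitude used in ST-MoE).
balance_coef  = 1e-2   # weight on the load-balancing loss
router_z_coef = 1e-3   # weight on the router z-loss

def total_loss(task_loss, balance_loss, router_z_loss):
    weighted_balance_loss  = balance_coef * balance_loss
    weighted_router_z_loss = router_z_coef * router_z_loss
    aux_loss = weighted_balance_loss + weighted_router_z_loss
    return task_loss + aux_loss
```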
Is there a recommended `batch_size` for training MoE? I have noticed that different `batch_size` values yield different results, and the `batch_size` mentioned in the ST-MoE paper is too large for an individual user like me to use as a reference.
![image](https://private-user-images.githubusercontent.com/58496879/301460316-8d71aa6c-46cc-421a-8177-8b0aadf9efae.png)