muP (maximum update parametrization) #650
Conversation
Hello, I looked over your implementation of muP in both CUDA and Python, and found a scaling in the CUDA code that I haven't seen in your PyTorch version.
@alxndrTL it's mentioned in
OK, I didn't realize that the layernorm code I showcased is only used pre-logits, as per 3.3 (I thought it was used for every layer norm).
Hyperparam sweeps

Conclusion: ~1/2^10 (≈ 0.001) is a sweet spot for the lr. The curves are stable as we increase the depth, i.e. the optimal lr does not shift.

cc: @karpathy
@YuchenJin it would be great to kick off a 7B muP run if you have some bandwidth! :)
Hey @gordicaleksa, happy to! Do you want me to just run the two scripts?
Line 165 in b125cc6
AFAIK this line zero-initializes modules with the 'LLMC_SKIP_INIT' flag if muP is enabled. There is only one module with the 'LLMC_SKIP_INIT' flag: lm_head. The lm_head weight is tied to wte.weight. Since the embedding layers are initialized later in the code, what is the purpose of the zero initialization referenced above?
After reading mup.md, I can see now that muP requires output layers to be initialized to zero. The code assumes that embeddings are initialized before linear layers, which is correct but IMHO a weak assumption. Thanks for the great work.
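To illustrate why the ordering matters, here is a minimal PyTorch sketch (not the llm.c code; the shapes are GPT-2-like placeholders): with weight tying there is a single shared tensor, so whichever initialization runs last determines both the embedding and the readout.

```python
import torch
import torch.nn as nn

# Minimal illustration (not llm.c): lm_head.weight and wte.weight become the
# same tensor once tied, so the last initialization to run "wins".
wte = nn.Embedding(50257, 768)
lm_head = nn.Linear(768, 50257, bias=False)
lm_head.weight = wte.weight                 # weight tying: one shared tensor

nn.init.zeros_(lm_head.weight)              # muP-style zero init of the readout
print(wte.weight.abs().sum().item())        # 0.0: the embedding is zeroed too

nn.init.normal_(wte.weight, std=0.02)       # a later embedding init overwrites the zeros
print(lm_head.weight.abs().sum().item())    # non-zero: the zero init has been undone
```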
Main changes (see the `mup.md` file for more details; both changes are sketched in code below):

- attention uses `1/d` scaling instead of `1/sqrt(d)`, with an additional tunable `attn_mult` coefficient
- activations are scaled by `1/width_mult` before mapping into logits

where:

- `width_mult` is the ratio of widths of the current model to the base model
- `d` is the number of channels in a single attn head
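Roughly, the two changes amount to the following (a PyTorch sketch of the idea only, not this PR's CUDA kernels; the function names are illustrative):

```python
import torch

def attn_scores(q, k, attn_mult=1.0, use_mup=True):
    # q, k: (batch, heads, seq, d), where d is the per-head channel count
    d = q.size(-1)
    # muP: scale attention logits by attn_mult / d instead of the usual 1 / sqrt(d)
    scale = attn_mult / d if use_mup else 1.0 / d ** 0.5
    return (q @ k.transpose(-2, -1)) * scale

def final_logits(x, lm_head_weight, width_mult=1.0, use_mup=True):
    # x: (batch, seq, n_embd); muP scales activations by 1/width_mult
    # before mapping them into logits
    if use_mup:
        x = x / width_mult
    return x @ lm_head_weight.t()
```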
Test

To test muP vs SP (standard parametrization), run the following (a toy illustration of what the coordinate check measures is sketched below):

- the `scripts/mup_coordinate_check.sh` script
- the `dev/mup_coordinate_check_visualize.py` script
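For intuition, here is a heavily simplified sketch of the kind of statistic a coordinate check looks at, using a small MLP stand-in instead of GPT-2 and plain Adam; everything in this snippet is illustrative, the scripts above are the real thing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def coord_check(widths=(256, 512, 1024), steps=5, base_width=256, lr=6e-4):
    # Train a toy muP-scaled MLP for a few steps at several widths and log the
    # mean |activation| of the hidden layer. Under muP these values should stay
    # roughly O(1) as the width grows; under SP they typically drift with width.
    for width in widths:
        torch.manual_seed(0)
        width_mult = width / base_width
        fc_in = nn.Linear(64, width)            # "embedding-like" input layer
        fc_hid = nn.Linear(width, width)        # hidden ("matrix-like") layer
        fc_out = nn.Linear(width, 10, bias=False)
        nn.init.zeros_(fc_out.weight)           # muP: zero-init the readout
        params = list(fc_in.parameters()) + list(fc_hid.parameters()) + list(fc_out.parameters())
        opt = torch.optim.Adam(params, lr=lr)
        for step in range(steps):
            x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
            h = torch.relu(fc_hid(torch.relu(fc_in(x))))
            logits = fc_out(h) / width_mult     # muP: 1/width_mult before the logits
            loss = F.cross_entropy(logits, y)
            opt.zero_grad(); loss.backward(); opt.step()
            print(f"width={width} step={step} mean|h|={h.abs().mean().item():.3f}")

coord_check()
```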
Run

- set `use_mup` to `1`
- set `mup_width_mult` to the ratio of widths of your target model to your base model (see the example below)
- `mup_base_attn_mult` is a tunable param; 1 seems to be working nicely for our family of models.
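For example (with purely illustrative numbers): if the base model has 256 channels and the target model has 1024, then `mup_width_mult = 1024 / 256 = 4`.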
Ablations

The coord check results are highly dependent on the learning rate and the max width used.
In my preliminary ablations (max width = 1024 & lr = 0.0006) I concluded that the only thing that would mess up the coordinate check was this line:
```c
// when muP is enabled, apply mup_scale_inv to every parameter tensor except
// i == 0 and i == 1 (the wte / wpe embeddings in llm.c's parameter ordering)
scale = (model->use_mup && i != 0 && i != 1) ? mup_scale_inv*scale : scale;
```
In my subsequent ablations (max width = 1024 & lr = 0.006, i.e. an lr almost the same as in the reference muP GPT-2 implementation, which uses 0.01), I concluded that the results are much more sensitive: the Adam modifications also matter, the `1/width_mult` logits scaling matters, and so does whether we use `1/d`. See the next comment for more thorough ablation results.
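For reference, the "Adam modifications" boil down to per-parameter-group learning rates. A minimal sketch, assuming the standard muP recipe of dividing the lr of matrix-like hidden weights by `width_mult` (the grouping is illustrative, not necessarily this PR's exact rule):

```python
import torch

def build_mup_adamw(hidden_matrix_params, other_params, lr, width_mult,
                    betas=(0.9, 0.95), weight_decay=0.0):
    # Hidden matrices (both dimensions grow with width, e.g. attention and MLP
    # weights) get lr / width_mult; embeddings, biases and other vector-like
    # parameters keep the base lr. Illustrative grouping only.
    return torch.optim.AdamW(
        [{"params": hidden_matrix_params, "lr": lr / width_mult},
         {"params": other_params, "lr": lr}],
        lr=lr, betas=betas, weight_decay=weight_decay)
```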
References: