
muP (maximum update parametrization) #650

Open · wants to merge 37 commits into master

Conversation

gordicaleksa
Contributor

@gordicaleksa commented Jun 26, 2024

Main changes (see mup.md file for more details):

  • Modify random initialization
  • Scale attention scores by 1/d instead of 1/sqrt(d), and add a tunable attn_mult coefficient
  • Scale activations by 1/width_mult before mapping into logits (both scalings are sketched below)
  • Update learning rate & weight decay for a subset of layers
  • Add a coordinate check test - it's like a gradient check, but for muP

where:

  • width_mult is the ratio of the width of the current model to that of the base model
  • d is the number of channels in a single attention head
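
For illustration, here is a minimal PyTorch sketch of the two forward-pass scalings listed above (the function names and call sites are mine, not the PR's; the actual changes live in the CUDA kernels and train_gpt2.py):

```python
import math

def attention_scores(q, k, attn_mult=1.0, use_mup=True):
    # q, k: (batch, heads, seq, head_dim)
    d = q.size(-1)
    if use_mup:
        # muP: scale q.k by attn_mult / d instead of the usual 1 / sqrt(d)
        return (q @ k.transpose(-2, -1)) * (attn_mult / d)
    return (q @ k.transpose(-2, -1)) / math.sqrt(d)

def readout_logits(x, lm_head_weight, width_mult=1.0, use_mup=True):
    # x: (batch, seq, channels); lm_head_weight: (vocab_size, channels)
    if use_mup:
        x = x / width_mult  # scale activations by 1/width_mult before the readout
    return x @ lm_head_weight.t()
```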

Test

To test muP vs SP (standard parametrization):

  • Run the scripts/mup_coordinate_check.sh script
  • Run the dev/mup_coordinate_check_visualize.py script
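
For intuition about what the coordinate check measures, here is a generic toy sketch (not the repo's script; a small MLP stands in for GPT-2): record the mean absolute activation of each layer over the first few training steps at several widths. Under muP these magnitudes should stay roughly O(1) as width grows, whereas under SP they typically drift with width.

```python
import torch
import torch.nn as nn

def coord_check(widths=(64, 128, 256, 512), steps=3, lr=1e-3, seed=0):
    """Toy coordinate check: mean |activation| per layer, per width."""
    results = {}
    for width in widths:
        torch.manual_seed(seed)
        model = nn.Sequential(nn.Linear(32, width), nn.ReLU(),
                              nn.Linear(width, width), nn.ReLU(),
                              nn.Linear(width, 10))
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        acts = []

        def hook(_module, _inputs, output):
            acts.append(output.abs().mean().item())

        for layer in model:
            layer.register_forward_hook(hook)

        for _ in range(steps):
            acts.clear()                      # keep only the current step's stats
            x = torch.randn(16, 32)
            y = torch.randint(0, 10, (16,))
            loss = nn.functional.cross_entropy(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
        results[width] = list(acts)           # per-layer mean |activation|, last step
    return results

if __name__ == "__main__":
    for width, per_layer in coord_check().items():
        print(width, ["%.3f" % a for a in per_layer])
```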

Run

  • Set use_mup to 1
  • Set mup_width_mult to the ratio of the width of your target model to that of your base model (see the example below)
  • mup_base_attn_mult is a tunable param; 1 seems to work nicely for our family of models
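
For example (illustrative numbers, taking "width" to mean the model's channel count): if the base model has 256 channels and the target model has 1024, you would set mup_width_mult = 1024 / 256 = 4.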

Ablations

The coord check results are highly dependent on the learning rate and max width used.

In my preliminary ablations (max width = 1024 & lr = 0.0006) I concluded that the only thing that would mess up the coordinate check was this line:
scale = (model->use_mup && i != 0 && i != 1) ? mup_scale_inv*scale : scale;

In my subsequent ablations (max width = 1024 & lr = 0.006, i.e. an lr almost the same as in the reference muP GPT-2 implementation, which uses 0.01) I concluded that the results are much more sensitive: the Adam modifications also matter, the 1/width_mult logit scaling matters, and so does whether we use 1/d or 1/sqrt(d) attention scaling.
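
For context, the "Adam modifications" refer to the usual muP recipe of scaling the Adam learning rate of matrix-like hidden weights by 1/width_mult. A hedged PyTorch sketch of what such param grouping could look like (the layer-selection predicate and the weight-decay handling are my assumptions, not the PR's exact logic, which lives in the C/CUDA code):

```python
import torch

def mup_param_groups(model, base_lr, base_wd, width_mult, is_matrix_like):
    """Split parameters into muP-scaled (hidden matrix weights) and regular ones.
    `is_matrix_like` is a hypothetical predicate over (name, param)."""
    scaled, regular = [], []
    for name, param in model.named_parameters():
        (scaled if is_matrix_like(name, param) else regular).append(param)
    return [
        # muP (Adam): hidden-weight lr is scaled by 1/width_mult.
        {"params": scaled, "lr": base_lr / width_mult, "weight_decay": base_wd},
        # Embeddings, biases, layernorm params keep the base lr.
        {"params": regular, "lr": base_lr, "weight_decay": base_wd},
    ]

# Usage sketch (predicate is illustrative):
# groups = mup_param_groups(model, base_lr=6e-4, base_wd=0.1, width_mult=4.0,
#                           is_matrix_like=lambda n, p: p.ndim == 2 and "wte" not in n)
# optimizer = torch.optim.AdamW(groups)
```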

See the next comment for more thorough ablation results.


gordicaleksa force-pushed the mup branch 3 times, most recently from ce71c19 to 9864277 on June 29, 2024
gordicaleksa changed the title from "muP (maximum update parametrization)" to "[WIP] muP (maximum update parametrization)" on Jun 29, 2024
@gordicaleksa
Contributor Author

Ablation study

Setting:
lr = 0.006
width goes from 64 -> 4096 (geometric sequence, 2x coefficient)

Baseline muP:
[coordinate check plot]

At step 4 it seems we observe a bit more oscillation. This does disappear with a slightly lower learning rate, so I'm not sure what to make of it. Ultimately, running a sweep of runs and confirming we have stable HPs will be the final test.

Here is what happens when setting attn_mult to 1. It looks like it's a bit more stable? My implementation follows that of mutransformers, but it's possible they had a bug in the attn_mult logic; I'll have to consult the paper again:
[coordinate check plot]

SP (standard parametrization) baseline:
[coordinate check plot]
These clearly explode.


Comment out the zeroing of the embedding/readout layer & queries:
[coordinate check plot]
Observe that one of the layers is decreasing with width, which is undesirable.

Remove learning rate / weight decay modulating logic:
[coordinate check plot]

Remove scale = (model->use_mup && i != 0 && i != 1) ? mup_scale_inv*scale : scale;:
[coordinate check plot]

Remove logit scaling by 1/width_mult:
[coordinate check plot]

Set attn_mult to 1 and replace 1/d q*k scaling with the usual 1/sqrt(d):
[coordinate check plot]

@alxndrTL

alxndrTL commented Jul 15, 2024

Hello, I looked over your implementation of muP in both CUDA and Python, and found that in CUDA (layernorm.cuh) you scale the output of all the layernorms by mup_scale (which is defined as sqrt(mup_width_multiplier)):

[screenshot of the layernorm.cuh scaling code]

but I haven't seen the same scaling in your PyTorch version (train_gpt2.py) (which is very concise and clear btw!).
Also, this is not mentioned in the doc/mup/mup.md file, so I was wondering where it comes from. Maybe I missed something.
Thank you

@gordicaleksa
Contributor Author

gordicaleksa commented Jul 19, 2024

@alxndrTL it is mentioned in mup.md, under 3.3 (1. / model->mup_width_mult). I don't use sqrt in the C/CUDA code?

@alxndrTL

Ok, I didn't realize that the layernorm code I showcased is only used pre-logits, as per 3.3 (I thought it was applied to every layernorm).

@gordicaleksa
Contributor Author

gordicaleksa commented Jul 19, 2024

Hyperparam sweeps

Note:

  • y-axis is always training loss unless mentioned otherwise.
  • These checkpoints were trained to convergence on a 10B FineWeb subset.
  • For more info/experiments check out the following Discord thread and a few threads below it.

scheduler sweep:
[scheduler sweep plot]

Conclusion: cosine is a good choice.

attn_mult tunable param sweep:
[attn_mult sweep plot]

Conclusion: Using 1 is a good default.

lr sweeps (note x-axis should be parsed as 1/2^x):
[lr sweep plot]

Conclusion: ~1/2^10 is a sweet spot for lr. The curves are stable as we increase the depth, i.e. the optimal lr is invariant to depth scaling.
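(For reference on that axis convention: x = 10 corresponds to lr = 1/2^10 ≈ 0.00098.)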

out_mult sweep (out_mult is currently not supported in this PR, but it's a minor tweak I've implemented locally):
[out_mult sweep plot]

Conclusion: 1 is a good default.

next steps:

cc: @karpathy

@gordicaleksa
Contributor Author

@YuchenJin would be great to kick off a 7B mup run if you have some bandwidth! :)

@YuchenJin
Contributor

> @YuchenJin would be great to kick off a 7B mup run if you have some bandwidth! :)

Hey @gordicaleksa, happy to! Do you want me to just run the two scripts (scripts/mup_coordinate_check.sh and dev/mup_coordinate_check_visualize.py)? What LR should I use for the 7B model?

@habanoz

habanoz commented Nov 5, 2024

@gordicaleksa

if self.config.use_mup: torch.nn.init.zeros_(module.weight)

AFAIK this line zero-initializes modules with the 'LLMC_SKIP_INIT' flag if muP is enabled. There is only one module with the 'LLMC_SKIP_INIT' flag: lm_head, and lm_head's weight is tied to wte.weight.

Since embedding layers are initialized later in the code, what is the purpose of the zero initialization referenced above?

@habanoz

habanoz commented Nov 6, 2024

> @gordicaleksa
>
> if self.config.use_mup: torch.nn.init.zeros_(module.weight)
>
> AFAIK this line zero-initializes modules with the 'LLMC_SKIP_INIT' flag if muP is enabled. There is only one module with the 'LLMC_SKIP_INIT' flag: lm_head, and lm_head's weight is tied to wte.weight.
>
> Since embedding layers are initialized later in the code, what is the purpose of the zero initialization referenced above?

After reading mup.md, I can see now that muP requires output layers to be initialized to zero.

The code assumes that embeddings are initialized before linear layers, which is correct, but IMHO a weak assumption.
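
(To make the ordering concern concrete, here is a toy illustration, not the repo's code: because lm_head.weight is tied to wte.weight, the two names refer to one shared tensor, so whichever init call runs last determines the final values.)

```python
import torch.nn as nn

vocab_size, channels = 8, 4
wte = nn.Embedding(vocab_size, channels)
lm_head = nn.Linear(channels, vocab_size, bias=False)
lm_head.weight = wte.weight             # weight tying: one tensor, two names

nn.init.normal_(wte.weight, std=0.02)   # embedding init runs first...
nn.init.zeros_(lm_head.weight)          # ...then the muP zero-init of the readout
print(wte.weight.abs().sum().item())    # 0.0 -- the tied embedding is zeroed too

# If the two init calls ran in the opposite order, the later normal_() would
# silently overwrite the zeros -- hence the ordering assumption matters.
```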

Thanks for the great work.
