Add support for gradient checkpointing #60

Open
eric-czech wants to merge 1 commit into kuleshov-group:main from eric-czech:main
Conversation

@eric-czech
Collaborator

eric-czech commented Jan 11, 2025

This adds support for gradient checkpointing using both the Mosaic Composer Trainer and Hugging Face Trainer interfaces.

Both of those interfaces assume that checkpointing is enabled at training time rather than at model-configuration time, so little changes in the configuration itself beyond a new gradient_checkpointing_stride parameter that controls how frequently checkpoints are inserted between the Mamba blocks.
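To make the stride semantics concrete, here is a minimal sketch of how a stride-based checkpointing loop can work in plain PyTorch. The class and parameter names below (BlockStack, checkpointing_stride) are illustrative stand-ins, not the actual Caduceus code:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class BlockStack(nn.Module):
    """Toy stand-in for a stack of Mamba blocks."""
    def __init__(self, n_layer: int, d_model: int, checkpointing_stride: int = 0):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layer))
        self.stride = checkpointing_stride  # 0 disables checkpointing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            # Checkpoint every `stride`-th block: its activations are not
            # stored but recomputed during the backward pass, trading
            # compute for memory.
            if self.training and self.stride and i % self.stride == 0:
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

model = BlockStack(n_layer=8, d_model=16, checkpointing_stride=2).train()
out = model(torch.randn(4, 16, requires_grad=True))
out.sum().backward()
```

A larger stride checkpoints fewer blocks (less recomputation, more activation memory); a stride of 1 checkpoints every block.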

I went back and forth a little on how to validate this functionality and ultimately landed on counting forward-pass executions (via hooks) as the cleanest way to do it. Let me know if anybody is aware of other ways to test it.
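A minimal sketch of that hook-based validation strategy, using a plain torch.nn.Linear as a stand-in for a model block (illustrative only, not the PR's test code): with checkpointing enabled, the block's forward runs once eagerly and once more during backward recomputation, so a forward hook should fire twice instead of once.

```python
import torch
from torch.utils.checkpoint import checkpoint

counts = {"fwd": 0}
block = torch.nn.Linear(8, 8)
# A forward hook fires every time the module's forward executes,
# including recomputation triggered by checkpointing.
block.register_forward_hook(lambda m, i, o: counts.__setitem__("fwd", counts["fwd"] + 1))

x = torch.randn(2, 8, requires_grad=True)

# Without checkpointing: exactly one forward execution.
block(x).sum().backward()
assert counts["fwd"] == 1

# With checkpointing: one eager forward plus one recomputation
# during backward, so the hook fires twice.
counts["fwd"] = 0
checkpoint(block, x, use_reentrant=False).sum().backward()
assert counts["fwd"] == 2
```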

@linnnnCTCT

Thanks for providing this implementation!

I've been testing scaling up the model parameters on an A100-80G GPU using gradient checkpointing. However, I hit a limit around 300M parameters (with a caduceus_ph config, d_model:1024, n_layer:48).

Have you tested larger parameter counts? Do you have any recommended strategies for scaling further?
