Description
System Info
- transformers version: 4.41.2
- Platform: Linux-5.15.0-1042-nvidia-x86_64-with-glibc2.35
- Python version: 3.9.18
- Huggingface_hub version: 0.23.3
- Safetensors version: 0.4.2
- Accelerate version: 0.31.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
n/a
Expected behavior
The run_clm.py script and the other scripts in https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling notionally support training / fine-tuning models whose training memory footprint (weights, gradients, optimizer state) is too large to fit on a single GPU, judging by their CLI options. However, there is no example showing how to actually do that.
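For concreteness, something along these lines is the kind of example I'd hope the docs show. This is only my guess at an invocation, assuming the Trainer's --fsdp option is the intended way to shard the model across GPUs; the model/dataset names and flag values are placeholders I have not verified:

```bash
# Sketch only, not a verified recipe: shard a 7B model across 2 GPUs with FSDP
# via the standard Trainer flags (my assumption about the intended mechanism).
torchrun --nproc_per_node 2 run_clm.py \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --bf16 \
    --fsdp "full_shard auto_wrap" \
    --do_train \
    --output_dir ./mistral-7b-clm
```

An accelerate launch equivalent, or a DeepSpeed config, would presumably also work; that is exactly the part I'd like the documentation to spell out.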
For instance, accelerate estimate-memory says training the Mistral-7B family with Adam takes roughly 55 GB in float16, which is more memory than a single 40 GB A100 has, so I'd need to use more than one GPU.
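For reference, that estimate came from an invocation along these lines (reconstructed from memory, so the exact flags may differ):

```bash
# Reproduces the ~55 GB Adam/float16 training estimate quoted above
# (command written from memory; flag names may not be exact).
accelerate estimate-memory mistralai/Mistral-7B-v0.1 \
    --library_name transformers \
    --dtypes float16
```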
Would it be possible to modify the language_modeling documentation to explain how to do that?