
Conversation

@BenjaminBossan
Member

Trying to train LoRA with int4 torchao used to raise a RuntimeError. Lately, this error is no longer being raised, suggesting that int4 training is unblocked.

For this to work, we need:

- fbgemm-gpu-genai package from PyTorch (added to Dockerfile)
- transformers has to allow int4 training

The latter is not yet implemented, so this PR is WIP for now.
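For context, here is a minimal sketch of what the intended flow could look like once transformers permits int4 training. The model id, group size, and LoRA hyperparameters are placeholders, not values from this PR:

```python
import torch
from transformers import AutoModelForCausalLM, TorchAoConfig
from peft import LoraConfig, get_peft_model

# int4 weight-only quantization via torchao; the int4 kernels rely on the
# fbgemm-gpu-genai package mentioned above (group size is a placeholder)
quant_config = TorchAoConfig("int4_weight_only", group_size=128)

# bfloat16 is the dtype torchao recommends; the model id is a placeholder
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

# attach LoRA adapters; rank, alpha, and target modules are illustrative
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)

# training on top of the int4 base model is the part transformers
# does not yet allow, hence the WIP status of this PR
model.train()
```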

Moreover, on my machine, I still get an error with int4, namely:

> RuntimeError: cutlass cannot initialize

This could be something specific to my setup and has to be investigated further.

Additionally, there was a warning that bfloat16 should be used with torchao, so I switched the dtype used in the tests to bfloat16. This required increasing the tolerances for comparing outputs in one test.
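As an illustration of that tolerance change, the comparison might take roughly this shape; the tolerance values here are assumptions, not the ones actually used in the test suite:

```python
import torch

def assert_outputs_close(out_bf16: torch.Tensor, out_ref: torch.Tensor) -> None:
    # bfloat16 keeps only 8 mantissa bits, so outputs drift more than in
    # float32; the tolerances are illustrative, not the PR's actual values
    torch.testing.assert_close(
        out_bf16.float(), out_ref.float(), atol=1e-2, rtol=1e-2
    )

# toy usage: random tensors standing in for model outputs
ref = torch.randn(4, 16)
assert_outputs_close(ref.to(torch.bfloat16), ref)
```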


@BenjaminBossan
Member Author

Update: The error with int4 is now confirmed on two other machines. Let's not merge this for now (nor make the change in transformers).

