Add autoquant support for torchao quantizer #35503
Conversation
README can be updated after #35490 is landed
cc @SunMarc can you help review?
Thanks for adding this! You also wanted to update the minimum version of torchao, no? If so, we need to update the checks in TorchAoConfig and in TorchAoHfQuantizer. Also, it would be nice to update the docs with details about the autoquant option!
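A minimal sketch of what such a version gate could look like; the version threshold and exact placement inside TorchAoConfig / TorchAoHfQuantizer are assumptions, not taken from the PR:

```python
# Hypothetical minimum-version check for torchao; the "0.6.0" threshold
# is a placeholder, not the value chosen in this PR.
import importlib.metadata
from packaging import version

TORCHAO_MIN_VERSION = "0.6.0"  # placeholder

def _check_torchao_version():
    installed = version.parse(importlib.metadata.version("torchao"))
    if installed < version.parse(TORCHAO_MIN_VERSION):
        raise ImportError(
            f"autoquant requires torchao >= {TORCHAO_MIN_VERSION}, found {installed}"
        )
```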
@SunMarc please take a look again, thanks
Nice, thanks for adding this and for updating the docstring! It would be even better if you could add a paragraph on autoquant inside the torchao docs here: https://huggingface.co/docs/transformers/main/en/quantization/torchao
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
output = quantized_model.generate(**input_ids, max_new_tokens=self.max_new_tokens)
quantized_model.finalize_autoquant()
Make sure to include this in the doc, otherwise it will be hard for users to understand how to use autoquant. Also, instead of generate, can we just call the model's forward, or does that make a difference?
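For the doc, a rough end-to-end sketch of the autoquant flow could look like the following; the model name and generation arguments are illustrative, and the TorchAoConfig("autoquant") spelling is assumed from this PR rather than confirmed:

```python
# Hedged sketch of the autoquant flow discussed in this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder model
quantization_config = TorchAoConfig("autoquant")  # assumed spelling for the autoquant mode

tokenizer = AutoTokenizer.from_pretrained(model_id)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)

# Run the code path under test (generate) so autoquant can benchmark the layers,
# then finalize to pick the best quantization for each one.
input_ids = tokenizer("What are we having for dinner?", return_tensors="pt").to(quantized_model.device)
output = quantized_model.generate(**input_ids, max_new_tokens=128)
quantized_model.finalize_autoquant()
print(tokenizer.decode(output[0], skip_special_tokens=True))
```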
I think we should call the code path that is actually being tested, in this case generate.
If you are calling generate, you need to make sure the static cache is used, otherwise compilation will happen at each step.
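Concretely, the suggestion seems to be an addition like the one below to the test snippet above; cache_implementation="static" is the usual way to request the static KV cache from generate, and the rest mirrors the existing test code:

```python
# Continuing the test snippet above: request the static KV cache so compiled
# shapes stay fixed across decoding steps instead of recompiling at every step.
output = quantized_model.generate(
    **input_ids,
    max_new_tokens=self.max_new_tokens,
    cache_implementation="static",
)
quantized_model.finalize_autoquant()
```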
Gentle ping @ArthurZucker, as @jerryzh168 will soon be on PTO.
@ArthurZucker can you help land this?
Can you update the documentation @jerryzh168? After that, I will merge the PR.
LGTM, this needs some default generation config as well (see the sketch below):
- static cache
- disable compile (add disable compile option #36161 will bring it!)
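A loose sketch of what those defaults might look like on the quantized model's generation config; note that disable_compile is assumed to be the flag #36161 introduces and may not exist yet in the installed version:

```python
# Hypothetical defaults; `disable_compile` is assumed to come from #36161.
quantized_model.generation_config.cache_implementation = "static"
quantized_model.generation_config.disable_compile = True
```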
@ArthurZucker I added the cache_implementation settings; I didn't use disable_compile for now, is it required as well? @SunMarc I added autoquant to the doc.
Thanks for iterating!
Summary:
att, also verified that the autoquantized model can be saved and loaded:
save: https://gist.github.com/jerryzh168/01d367aaf44dbbbfd4068a4a10a00061
load: https://gist.github.com/jerryzh168/d5c6c401b2abdf18e0b6771341f1525c
Test Plan:
tested locally with above script
model uploaded to https://huggingface.co/jerryzh168/llama3-8b-autoquant
Reviewers:
Subscribers:
Tasks:
Tags:
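For reference, the save/load verification described in the linked gists could look roughly like the sketch below; the paths and the safe_serialization=False flag are assumptions here, not taken from the gists:

```python
# Hedged sketch of saving and reloading an autoquantized model;
# see the linked gists for the actual scripts used in this PR.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

save_dir = "llama3-8b-autoquant"  # placeholder path
quantized_model.save_pretrained(save_dir, safe_serialization=False)  # assumed flag
tokenizer.save_pretrained(save_dir)

# Reload later (or from the Hub upload mentioned above).
reloaded = AutoModelForCausalLM.from_pretrained(
    save_dir, torch_dtype=torch.bfloat16, device_map="auto"
)
reloaded_tokenizer = AutoTokenizer.from_pretrained(save_dir)
```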