
Add autoquant support for torchao quantizer #35503

Merged: 12 commits into huggingface:main on Feb 24, 2025

Conversation

jerryzh168 (Contributor):

Summary:
As titled; also verified that the autoquantized model can be saved and loaded:

save: https://gist.github.com/jerryzh168/01d367aaf44dbbbfd4068a4a10a00061
load: https://gist.github.com/jerryzh168/d5c6c401b2abdf18e0b6771341f1525c

Test Plan:
Tested locally with the scripts above; model uploaded to https://huggingface.co/jerryzh168/llama3-8b-autoquant

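For reference, here is a minimal sketch of the end-to-end flow this PR enables, pieced together from the gists above and the test snippet discussed below; the model name, generation settings, and the "autoquant" quant_type string are illustrative assumptions, not a definitive API reference:

```python
# Sketch of autoquant via transformers, assuming TorchAoConfig accepts
# quant_type="autoquant" as added in this PR. Model name and generation
# settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_name = "meta-llama/Meta-Llama-3-8B"  # illustrative checkpoint
quantization_config = TorchAoConfig("autoquant")

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_ids = tokenizer("What are we having for dinner?", return_tensors="pt").to("cuda")

# Run the real workload once so autoquant can benchmark candidate kernels,
# then freeze the quantization decisions.
output = quantized_model.generate(
    **input_ids, max_new_tokens=128, cache_implementation="static"
)
quantized_model.finalize_autoquant()

# Save and reload, as verified in the gists above (torchao checkpoints are
# assumed to need safe_serialization=False).
quantized_model.save_pretrained("llama3-8b-autoquant", safe_serialization=False)
```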

jerryzh168 (Contributor, Author):

The README can be updated after #35490 lands.

jerryzh168 (Contributor, Author) commented Jan 3, 2025:

cc @SunMarc, can you help review?

SunMarc (Member) left a comment:

Thanks for adding this! Did you also want to bump the minimum torchao version? If so, we need to update the checks in TorchAoConfig and in TorchAoHfQuantizer. It would also be nice to update the docs with details about the autoquant option!
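As an aside, the kind of minimum-version guard being discussed could look like the sketch below; the helper name and the "0.6.0" floor are hypothetical, not the actual values used in TorchAoConfig or TorchAoHfQuantizer:

```python
# Hypothetical sketch of a torchao minimum-version check; the real guard in
# TorchAoConfig / TorchAoHfQuantizer may use different names and thresholds.
import importlib.metadata

from packaging import version

MIN_TORCHAO_VERSION = "0.6.0"  # assumed floor for autoquant support


def validate_torchao_version() -> None:
    installed = version.parse(importlib.metadata.version("torchao"))
    if installed < version.parse(MIN_TORCHAO_VERSION):
        raise ImportError(
            f"autoquant requires torchao >= {MIN_TORCHAO_VERSION}, "
            f"found {installed}; upgrade with `pip install -U torchao`."
        )
```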

jerryzh168 requested a review from SunMarc on January 8, 2025.
jerryzh168 (Contributor, Author):

@SunMarc please take a look again, thanks

SunMarc (Member) left a comment:

Nice, thanks for adding this, and for updating the docstring! It would be even better if you could add a paragraph on autoquant to the torchao docs here: https://huggingface.co/docs/transformers/main/en/quantization/torchao

SunMarc requested a review from ArthurZucker on January 8, 2025.

jerryzh168 (Contributor, Author):

@SunMarc thanks for the review. Yeah, for the doc I plan to add autoquant after #35490 lands, to avoid conflicts.

Comment on lines 259 to 273:

```python
# Run generate once so autoquant can profile the workload,
# then finalize the quantization decisions.
output = quantized_model.generate(**input_ids, max_new_tokens=self.max_new_tokens)
quantized_model.finalize_autoquant()
```
SunMarc (Member):

Make sure to include this in the doc, otherwise it will be hard for users to understand how to use autoquant. Instead of generate, can we just call the model's forward, or does it make a difference?

jerryzh168 (Contributor, Author) replied Jan 16, 2025:

I think we should call the code path that is ultimately being tested, in this case generate.

ArthurZucker (Collaborator):

If you are calling generate, you need to make sure the static cache is used, otherwise compilation will happen at each step.
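Concretely, that suggestion maps to something like the sketch below; cache_implementation="static" is an existing generate() argument, while the max_new_tokens value is illustrative:

```python
# With torch.compile in the loop, a static KV cache lets compilation happen
# once up front instead of recompiling at every decoding step.
output = quantized_model.generate(
    **input_ids,
    max_new_tokens=128,  # illustrative value
    cache_implementation="static",
)
quantized_model.finalize_autoquant()
```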

SunMarc (Member) commented Jan 14, 2025:

Gentle ping @ArthurZucker, as @jerryzh168 will soon be on PTO.

jerryzh168 (Contributor, Author):

@ArthurZucker can you help land this?

SunMarc (Member) commented Feb 4, 2025:

Can you update the documentation, @jerryzh168? After that, I will merge the PR.

ArthurZucker (Collaborator) left a comment:

LGTM, this needs some default generation config as well (see the comment above about using the static cache with generate).


jerryzh168 (Contributor, Author):

@ArthurZucker I added the cache_implementation settings. I didn't use disable_compile for now; is it required as well?

@SunMarc I added autoquant to the docs.

SunMarc (Member) left a comment:

Thanks for iterating!

SunMarc requested a review from ArthurZucker on February 21, 2025.
SunMarc merged commit 2af272c into huggingface:main on Feb 24, 2025. All 23 checks passed.