Refactor and clean up hf_ptq.py #665
Conversation
Force-pushed from 832fb13 to 070ae87.
Commit message: This script contains several separate pieces of logic whose code is entangled, making it really hard to add new features. Refactor so that these are separated:
1. sparsity: all logic goes to sparsity_main. TODO: we may eventually move this logic out into a separate script.
2. quantize: all logic goes to quantize_main.
   2.1 plain quantization with a single quantization format
   2.2 auto quantization
Within the quantization pipeline, split the flow into: 1. model loading, 2. calibration dataset loading, 3. pre-quantize processing, 4. actual quantization, 5. post-quantize processing, 6. quantized model export.
Signed-off-by: Shengliang Xu <[email protected]>
Force-pushed from 070ae87 to a89625b.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main     #665      +/-   ##
==========================================
- Coverage   74.80%   74.78%   -0.02%
==========================================
  Files         192      192
  Lines       18814    18814
==========================================
- Hits        14073    14070       -3
- Misses       4741     4744       +3

View full report in Codecov by Sentry.
|
| args.qformat, | ||
| args.kv_cache_qformat, | ||
| args.awq_block_size, | ||
| None, |
@shengliangxu is this intended?
Yes, this None was for auto_quantize, but the target function does not apply to auto_quantize at all, so I removed this unused arg.
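For illustration, a minimal runnable sketch of the change being discussed. build_quant_cfg and its arguments are hypothetical stand-ins, not the actual hf_ptq.py API; the point is only that the trailing None existed solely for the auto_quantize path, which never reaches this function.

    # Hypothetical sketch; names are illustrative, not the real hf_ptq.py code.
    from argparse import Namespace


    def build_quant_cfg(qformat, kv_cache_qformat, awq_block_size):
        # Stand-in for the real config builder.  Only the plain-quantization
        # path calls it, so no auto-quantize placeholder argument is needed.
        return {
            "qformat": qformat,
            "kv_cache_qformat": kv_cache_qformat,
            "awq_block_size": awq_block_size,
        }


    args = Namespace(qformat="fp8", kv_cache_qformat="fp8", awq_block_size=128)

    # Before the refactor the call site also passed None as a fourth argument;
    # since auto_quantize never uses this function, that argument was dropped.
    quant_cfg = build_quant_cfg(args.qformat, args.kv_cache_qformat, args.awq_block_size)
    print(quant_cfg)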
examples/llm_ptq/hf_ptq.py (outdated)
    mts.export(full_model)


    def plain_quantize(
nit: maybe let's call it default_quantize or single_precision_quantize?
It's not necessarily single precision, just a single config; we can have mixed precision even with a single config.
default_quantize sounds a bit too general and a bit tedious. How about mono_quantize, meaning quantize using a single config?
auto_quantize vs mono_quantize, quite symmetric
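To make the distinction concrete, a rough sketch assuming the public modelopt API (mtq.quantize and mtq.FP8_DEFAULT_CFG); the toy model and calibration data are placeholders, and the auto-quantize call is described only in a comment to avoid guessing its exact signature.

    # Illustrative sketch; not the hf_ptq.py implementation.
    import torch
    import torch.nn as nn
    import modelopt.torch.quantization as mtq

    model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
    calib_data = [torch.randn(8, 16) for _ in range(4)]


    def forward_loop(m):
        # Calibration pass: run representative inputs through the model so the
        # inserted quantizers can collect activation statistics.
        for batch in calib_data:
            m(batch)


    # "mono" quantization: one fixed config for the whole model.  The config can
    # still mix precisions per layer, but it is a single recipe chosen up front.
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

    # auto quantization (mtq.auto_quantize) would instead search over several
    # candidate formats per layer under an accuracy/size constraint.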
cjluo-nv left a comment:
Thanks @shengliangxu for the refactoring. Could you also validate that the checkpoint before and after this change is the same?
Done, updated the description.
Signed-off-by: Shengliang Xu <[email protected]>
What does this PR do?
Refactor and clean up hf_ptq.py.
This script contains several separate pieces of logic whose code is entangled, making it really hard to add new features.
Refactor so that these are separated:
1. sparsity: all logic goes to sparsity_main. TODO: we may eventually move this logic out into a separate script.
2. quantize: all logic goes to quantize_main.
   2.1 plain quantization with a single quantization format
   2.2 auto quantization
Within quantize_main, split the quantization pipeline into (a rough skeleton is sketched below):
1. model loading
2. calibration dataset loading
3. pre-quantize processing
4. actual quantization
5. post-quantize processing
6. quantized model export
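To visualize the split, a runnable toy skeleton of the flow described above; every helper here is a hypothetical stand-in (plain strings instead of real models), not the functions this PR actually adds to hf_ptq.py.

    # Toy skeleton of the refactored flow; all helpers are illustrative stubs.
    from types import SimpleNamespace


    def load_model(path):
        return f"model({path})", f"tokenizer({path})"


    def load_calib_dataset(tokenizer, name, size):
        return [f"{name}[{i}]" for i in range(size)]


    def pre_quantize_process(model, args):
        return model  # placeholder for any model preparation before calibration


    def mono_quantize(model, calib, args):
        return f"quantized({model}, {args.qformat})"


    def auto_quantize(model, calib, args):
        return f"auto_quantized({model})"


    def post_quantize_process(model, args):
        return model  # placeholder for any fixups after calibration


    def export_quantized(model, path):
        print(f"export {model} -> {path}")


    def quantize_main(args):
        model, tokenizer = load_model(args.pyt_ckpt_path)                     # 1. model loading
        calib = load_calib_dataset(tokenizer, args.dataset, args.calib_size)  # 2. calibration dataset loading
        model = pre_quantize_process(model, args)                             # 3. pre-quantize processing
        if args.auto_quantize_bits:                                           # 4. actual quantization
            model = auto_quantize(model, calib, args)
        else:
            model = mono_quantize(model, calib, args)
        model = post_quantize_process(model, args)                            # 5. post-quantize processing
        export_quantized(model, args.export_path)                             # 6. quantized model export


    quantize_main(SimpleNamespace(
        pyt_ckpt_path="ckpt", dataset="cnn_dailymail", calib_size=4,
        auto_quantize_bits=None, qformat="fp8", export_path="out/",
    ))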
Testing
Tested mono quantization:
and deployed to vLLM and TRTLLM; validated accuracy using lm_eval.
Tested auto quantize:
and compared exported files before and after the change (one way to do such a comparison is sketched below).
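Not part of the PR itself, but for reference, one simple way to compare exported checkpoints before and after a refactor is to hash every file in the two export directories; the directory names below are placeholders.

    # Hash all files under two export directories and report any differences.
    import hashlib
    from pathlib import Path


    def checksum_tree(root: Path) -> dict:
        return {
            str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(root.rglob("*"))
            if p.is_file()
        }


    before = checksum_tree(Path("export_before_refactor"))
    after = checksum_tree(Path("export_after_refactor"))
    mismatched = {k for k in set(before) | set(after) if before.get(k) != after.get(k)}
    print("identical" if not mismatched else f"differs: {sorted(mismatched)}")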