Added support to export BF16 weights and amax for vLLM fakequant QAT #579
base: main
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main     #579      +/-   ##
==========================================
+ Coverage   74.43%   74.47%   +0.04%
==========================================
  Files         182      182
  Lines       18234    18255      +21
==========================================
+ Hits        13572    13596      +24
+ Misses       4662     4659       -3

☔ View full report in Codecov by Sentry.
Signed-off-by: Kinjal Patel <[email protected]>
Force-pushed from 9946463 to 560dfc7
## Known Problems

1. AWQ is not yet supported in vLLM.
2. PTQ/QAT checkpoint doesn't work with KV Cache quantization enabled.
Thanks, Kinjal, for documenting this. Created a JIRA ticket to address it: https://jirasw.nvidia.com/browse/OMNIML-3051
Thank you for creating the ticket
realAsma left a comment:
Is it possible to export the entire TensorQuantizer state? That way we could seamlessly support PTQ/QAT and fake quantization.
The current support does not cover cases such as mixed-precision quantization (some layers in FP4, some in FP8, some disabled, etc.); we would need manual workarounds there. It also does not work for other quantization algorithms such as AWQ.
We are relying on the fact that vllm_serve quantizes the model with the same quantization formats that were used during PTQ/QAT.
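For illustration only, a minimal sketch of what collecting per-quantizer calibration state could look like, assuming (as the merge code further down does) that quantizer buffers appear in the model state dict with keys ending in `_amax`; the helper name `collect_quantizer_state` is hypothetical and not part of this PR:

```python
import torch


def collect_quantizer_state(model: torch.nn.Module) -> dict[str, torch.Tensor]:
    """Hypothetical helper: gather fake-quant calibration buffers.

    Pulls every buffer whose name ends in "_amax" out of the state dict,
    which is the same naming convention the gate/up merge code in this PR
    matches on.
    """
    quantizer_state = {}
    for key, value in model.state_dict().items():
        if key.endswith("_amax"):
            # Detach and move to CPU so the dict can be serialized on its own.
            quantizer_state[key] = value.detach().cpu()
    return quantizer_state


# Usage sketch: save the amax values next to the BF16 weights so the
# fake-quant serving path can restore the same scales.
# torch.save(collect_quantizer_state(model), "quantizer_state.pt")
```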
Signed-off-by: Kinjal Patel <[email protected]>
Force-pushed from e5a095d to 13f6bcd
I have created two tickets to explore mixed-precision and other quantization algorithm support. Exporting and loading the TensorQuantizer state may require additional effort, since the vLLM model also combines multiple layers, etc.
meenchen left a comment:
Can you also add unit tests for modelopt.torch.export.unified_export_hf.export_hf_checkpoint and modelopt.torch.export.unified_export_megatron.export_mcore_gpt_to_hf?
# Merge per-projection amax entries for gate_proj/up_proj under a single
# fused gate_up_proj key, skipping keys from "mixer" modules.
gate_up_match = "mixer" not in key and re.search(r"(.*\.)(gate|up)_proj(\..+_amax)$", key)
if gate_up_match:
    base_pattern = gate_up_match.group(1) + "gate_up_proj" + gate_up_match.group(3)
    merge_groups[base_pattern].append((key, value))
    continue
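For reference, a small standalone check of what that regex does: a gate_proj amax key (the key and value below are made up for illustration) is rewritten to the fused gate_up_proj key so the gate and up amax values can later be merged:

```python
import re
from collections import defaultdict

merge_groups = defaultdict(list)
key = "model.layers.0.mlp.gate_proj.input_quantizer._amax"  # illustrative key
value = 0.5  # placeholder amax value

gate_up_match = "mixer" not in key and re.search(r"(.*\.)(gate|up)_proj(\..+_amax)$", key)
if gate_up_match:
    # group(1) = "model.layers.0.mlp.", group(3) = ".input_quantizer._amax"
    base_pattern = gate_up_match.group(1) + "gate_up_proj" + gate_up_match.group(3)
    merge_groups[base_pattern].append((key, value))

print(base_pattern)  # model.layers.0.mlp.gate_up_proj.input_quantizer._amax
```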
Does this work with MoE models that use this quant module:
class _QuantFusedMoEBase(QuantModule):
Can you give an example of which model you are talking about?
Sure, you can try Qwen/Qwen3-30B-A3B-Instruct-2507
Signed-off-by: Kinjal Patel <[email protected]>
Force-pushed from 32968c9 to 32f7ae8
@meenchen Added tests for both in a separate file: tests/gpu/torch/export/test_vllm_fakequant_export.py
meenchen left a comment:
LGTM. Could you also check whether Qwen/Qwen3-30B-A3B-Instruct-2507 works with this change?
@meenchen Thank you. I checked.
What does this PR do?
Type of change: New Feature
Overview:
Adds support for evaluating vLLM fake-quantized QAT/QAD checkpoints. This MR adds the ability to export a checkpoint as BF16 weights plus amax values, using export_hf_checkpoint for HF and export_mcore_gpt_to_hf for MCore, via the export_bf16_weights_amax option. The exported weights and amax can then be used with the vllm_serve_fakequant.py script to run the saved checkpoint.
Usage
Refer to README.md
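For orientation, a rough usage sketch under stated assumptions: the export_bf16_weights_amax flag comes from this PR, but its exact placement in the export_hf_checkpoint signature, the model name, the quantization config, and the calibration stub are illustrative rather than the definitive API (see README.md for the supported flow):

```python
import torch
from transformers import AutoModelForCausalLM

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

# Load an HF model in BF16 (model name is illustrative).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16, device_map="cuda"
)


def forward_loop(m):
    # Placeholder calibration: run a few batches so quantizers record amax.
    for _ in range(8):
        m(torch.randint(0, m.config.vocab_size, (1, 128), device="cuda"))


# Fake-quantize with a standard config, e.g. FP8.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=forward_loop)

# Export BF16 weights plus amax for use with vllm_serve_fakequant.py
# (keyword placement of the new flag is an assumption).
export_hf_checkpoint(
    model,
    export_dir="exported_bf16_amax",
    export_bf16_weights_amax=True,
)
```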
Testing
Before your PR is "Ready for review"
Additional Information
The MCore export script does not currently expose an option to enable this export.