
Conversation

@shengliangxu (Contributor) commented Dec 8, 2025

What does this PR do?

Refactor and clean up hf_ptq.py

This script contains several separate pieces of logic whose code is entangled, making it really hard to add new features.

Refactor so that these pieces are separated (see the sketch after this list):

  1. sparsity: all logic goes to sparsity_main. TODO: we may eventually move this logic out into a separate script.

  2. quantization: all logic goes to quantize_main.

    2.1 plain quantization with a single quantization format

    2.2 auto quantization
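
A minimal sketch of this top-level split; sparsity_main and quantize_main are the names from this PR, but the dispatch condition on a sparsity_fmt flag is an illustrative assumption, not taken from the diff:

def main(args):
    # sparsity_main / quantize_main are the entry points introduced by this
    # refactor; the dispatch condition below is an assumption for illustration.
    if args.sparsity_fmt != "dense":
        sparsity_main(args)   # sparsification path (may move to its own script)
    else:
        quantize_main(args)   # plain (mono) quantization or auto quantization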

In the quantization pipeline, split the flow into these stages (a sketch follows the list):

  1. model loading
  2. calibration dataset loading
  3. pre-quantize processing
  4. actual quantization
  5. post-quantize processing
  6. quantized model export
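
A minimal sketch of the staged pipeline; the helper names below are hypothetical, and only the six-stage order comes from this description:

def quantize_main(args):
    # Hypothetical helper names; only the stage order is from the PR.
    model = load_model(args)                    # 1. model loading
    calib_data = load_calib_dataset(args)       # 2. calibration dataset loading
    model = pre_quantize_process(model, args)   # 3. pre-quantize processing
    model = quantize(model, calib_data, args)   # 4. actual quantization
    model = post_quantize_process(model, args)  # 5. post-quantize processing
    export_quantized_model(model, args)         # 6. quantized model export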

Testing

Tested mono quantization (a single quantization config):

python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path=Qwen/Qwen3-8B \
    --export_path=qwen3-8B_fp8 \
    --qformat=fp8 \
    --kv_cache_qformat=fp8 \
    --calib_size=16 \
    --batch_size=0 \
    --trust_remote_code \
    --export_fmt=hf

and deployed to vLLM and TRTLLM, and validated accuracy using lm_eval.
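
For reference, an accuracy check along these lines could look like the following, assuming the exported checkpoint directory is loadable by vLLM; the task and arguments are illustrative, not taken from this PR:

lm_eval --model vllm \
    --model_args pretrained=qwen3-8B_fp8 \
    --tasks mmlu \
    --batch_size auto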

Tested auto quantization:

python examples/llm_ptq/hf_ptq.py \
    --qformat=nvfp4,fp8 \
    --auto_quantize_score_size 128 \
    --auto_quantize_bits 5.0 \
    --auto_quantize_checkpoint llama-8B-auto-quantize-checkpoint \
    --pyt_ckpt_path=meta-llama/Meta-Llama-3-8B \
    --export_path=llama-8B_auto_quantize \
    --kv_cache_qformat=fp8 \
    --calib_size=16 \
    --batch_size=0 \
    --trust_remote_code \
    --export_fmt=tensorrt_llm

and compared the exported files against those produced before this change.
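
One way to run that comparison, assuming the exports from before and after the refactor are kept in separate directories (the paths here are hypothetical):

# Hypothetical paths holding the pre- and post-refactor exports.
diff -r llama-8B_auto_quantize_before llama-8B_auto_quantize_after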

copy-pr-bot commented Dec 8, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

shengliangxu force-pushed the shengliangx/hf_ptq_refactor_cleanup branch from 832fb13 to 070ae87 on December 8, 2025 21:01
shengliangxu force-pushed the shengliangx/hf_ptq_refactor_cleanup branch from 070ae87 to a89625b on December 8, 2025 22:05
shengliangxu marked this pull request as ready for review December 8, 2025 22:11
shengliangxu requested review from a team as code owners December 8, 2025 22:11
codecov bot commented Dec 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.78%. Comparing base (cd0d185) to head (defa50a).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #665      +/-   ##
==========================================
- Coverage   74.80%   74.78%   -0.02%     
==========================================
  Files         192      192              
  Lines       18814    18814              
==========================================
- Hits        14073    14070       -3     
- Misses       4741     4744       +3     


shengliangxu self-assigned this Dec 11, 2025
args.qformat,
args.kv_cache_qformat,
args.awq_block_size,
None,
Collaborator:

@shengliangxu is this intended?

Contributor Author:

Yes, this None is for auto_quantize. The target function does not apply to auto_quantize at all, so this unused argument was removed.

mts.export(full_model)


def plain_quantize(
Collaborator:

nit: maybe let's call it default_quantize or single_precision_quantize?

Contributor Author:

It's not necessarily single precision, just a single config. We can have mixed precision even when using a single config.

default_quantize sounds a bit too general and a bit tedious. How about mono_quantize, meaning quantize using a single config?

Contributor Author:

auto_quantize vs mono_quantize, quite symmetric.

@cjluo-nv (Collaborator) left a comment:

Thanks @shengliangxu for the refactoring. Could you also validate that the checkpoint before and after this change is the same?

@shengliangxu (Contributor Author) replied:

> Thanks @shengliangxu for the refactoring. Could you also validate that the checkpoint before and after this change is the same?

Done, updated the description.
