v0.7.0
🚀 Highlights
- Enhanced the NVFP4 algorithm and added support for exporting MXFP4/NVFP4 to the `llm-compressor` format by @WeiweiZhang1 and @wenhuach21
- Improved the W2A16 quantization algorithm by @wenhuach21
- Introduced the `scheme` interface for easier configuration of quantization settings by @wenhuach21
- Added support for using FP8 models and string model names as input in the API by @wenhuach21 and @n1ck-guo
- Unified the `device` and `device_map` arguments and introduced `device_map="auto"` to simplify quantization of extremely large models by @Kaihui-intel (see the usage sketch below)
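
The highlights above all touch the public quantization entry point. Below is a minimal sketch of how they might combine in v0.7.0, assuming the `AutoRound` class and its `quantize_and_save` helper accept the `scheme`, `device_map`, and string-model arguments as described in the highlights; the model id and exact argument spellings are illustrative, so check the project README for the authoritative API.

```python
from auto_round import AutoRound

# The model can now be passed as a string name (or an FP8 checkpoint),
# and device placement is unified under a single device_map argument.
ar = AutoRound(
    "Qwen/Qwen2.5-7B-Instruct",  # illustrative model id, not tied to this release
    scheme="NVFP4",              # new scheme interface; e.g. "W2A16", "MXFP4", "NVFP4"
    device_map="auto",           # replaces the separate device / device_map arguments
)

# Export to the llm-compressor format added in this release.
ar.quantize_and_save("./qmodel-nvfp4", format="llm_compressor")
```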
What's Changed
- fix ut import issue by @WeiweiZhang1 in #686
 - support to export static afp8 model by @n1ck-guo in #662
 - Add ruff and isort by @XuehaoSun in #578
 - Improved log message for unsupported dataset by @wenhuach21 in #688
 - support rceil for mxfp by @wenhuach21 in #660
 - Add black and blacken-docs in pre-commit by @XuehaoSun in #692
 - support static global scale for nvfp4 and update readme by @wenhuach21 in #691
 - Update readme by @wenhuach21 in #695
 - Add script for cuda unit test by @XuehaoSun in #567
 - support to save image_processor by @n1ck-guo in #694
 - support for static activation quantization calibration with group_size by @n1ck-guo in #693
 - fix xpu oom checker by @n1ck-guo in #705
 - FIXBUG: CPU Offloading for Cache Blocks in Low-Memory GPU Systems or Single GPU on ROCM Configs by @JartX in #703
 - fix bug of zero accuracy for mx-fp by @n1ck-guo in #709
 - catch oom error and move to cpu directly by @n1ck-guo in #708
 - code optimization of vlm by @n1ck-guo in #704
 - fix critical bug of gguf tuning by @wenhuach21 in #710
 - support fp8 model and str as input in llm quantization by @wenhuach21 in #699
 - change act_scale to input_scale for fp8 export by @n1ck-guo in #711
 - simplify CpuInfo class by @wenhuach21 in #715
 - Update step_by_step.md by @wenhuach21 in #717
 - fix bug of activation quant when act_max is None by @n1ck-guo in #718
 - Bump transformers in /test/test_cuda by @dependabot[bot] in #719
 - Freeze torchvision version in CI by @XuehaoSun in #720
 - update autoround mllm and support Mistral 3.2 series by @n1ck-guo in #713
 - Fix hpu CI by @XuehaoSun in #723
 - fix fp8 model input issue by @wenhuach21 in #724
 - update gguf convert.py and support for gpt-oss by @n1ck-guo in #721
 - new cast_to_nvfp4 with high performance by @xin3he in #727
 - make the tuning deterministic and move infrequently used arguments to kwargs by @wenhuach21 in #726
 - add original convert file and support for the newest llama.cpp by @n1ck-guo in #729
 - fix bug for exporting afp8 fake format by @n1ck-guo in #731
 - Fix torch_zp infer bug & API disable_deterministic_algorithms bug by @WeiweiZhang1 in #733
 - fix gguf mistral_common import by @n1ck-guo in #736
 - Enable mxfp exporting by @WeiweiZhang1 in #649
 - support for glm4.5 gguf by @n1ck-guo in #735
 - support auto-round-mllm command by @n1ck-guo in #742
 - Optimize pack zeros for int sym by @WeiweiZhang1 in #743
 - fix UT check for int zp by @WeiweiZhang1 in #745
 - support llama4 quant by @mengniwang95 in #744
 - fix bug of loading fp8 model by @n1ck-guo in #747
 - improved algorithm for int2 by @wenhuach21 in #748
 - Add Static FP8 KV Support by @yiliu30 in #737
 - refine code by @wenhuach21 in #749
 - mllm supports loading fp8 model and fix bug of loading fp8 model by @n1ck-guo in #750
 - support deepspeed LinearLayer and LinearAllreduce by @xin3he in #698
 - fix alg_ext moe and model str input bug by @wenhuach21 in #751
 - api support for fp8 model and mllm api support load from str by @n1ck-guo in #752
 - fix some torch compile warnings by @wenhuach21 in #755
 - Speedup FP4 packing by @yiliu30 in #760
 - fix_script_fp_layer_config_for_bits_checking by @WeiweiZhang1 in #756
 - support quant lm_head for rtn w8afp8 static quant by @n1ck-guo in #754
 - Revert "Speedup FP4 packing" by @yiliu30 in #763
 - refine code and fix activation quantization eval regression by @wenhuach21 in #762
 - fix gguf ut bug by @n1ck-guo in #767
 - fix gguf bug by int zp by @n1ck-guo in #771
 - Keep the model’s buffer dtype unchanged in most cases by @wenhuach21 in #770
 - fix set_layer_config bug by @wenhuach21 in #768
 - fix bug of auto_round exporting by @n1ck-guo in #772
 - gguf format supports for fp8 model by @n1ck-guo in #778
 - [API CHANGE] Stage 1 add quant scheme and consolidate device and device_map by @wenhuach21 in #774
 - Speedup FP4 packing by @yiliu30 in #766
 - hot fix for nvfp4 scheme by @wenhuach21 in #784
 - fix alg_ext regression and support mxfp4 in it with slight improvement by @wenhuach21 in #785
 - refine nvfp code, typofix by @WeiweiZhang1 in #777
 - mxfp/nvfp/fp8 support torch compile in tuning by @wenhuach21 in #789
 - refine nvfp4 algorithm by @wenhuach21 in #790
 - add limit arg for eval by @n1ck-guo in #764
 - torch backend bugfix and speedup ut by @WeiweiZhang1 in #793
 - Support auto device mapping by @Kaihui-intel in #781
 - fix bug and add nvfp in alg-ext with slight improvement by @wenhuach21 in #794
 - rename llmcompressor to llm_compressor to align with other formats by @WeiweiZhang1 in #780
 - align formats packing device to API by @WeiweiZhang1 in #795
 - add fp8 export format check by @n1ck-guo in #779
 - fix several regressions including lm-head quantization, 3bit asym torch backend, etc. by @wenhuach21 in #796
 - refine readme by @wenhuach21 in #798
 - fix typo in readme by @wenhuach21 in #799
 - fix several cuda ut bug by @n1ck-guo in #797
 - enable model python files saving by @WeiweiZhang1 in #802
 - AutoRoundMLLM supports scheme and fix device_map=dict regression by @n1ck-guo in #801
 - improve the robustness of scheme by @wenhuach21 in #803
 - fix mxfp exporting by @WeiweiZhang1 in #806
 
New Contributors
- @JartX made their first contribution in #703
 - @mengniwang95 made their first contribution in #744
 
Full Changelog: v0.6.0...v0.7.0