Releases: ModelCloud/GPTQModel
GPTQModel v5.0.0
Notable Changes:
- New data-parallel quant support for MoE models on multi-gpu using nogil Python (Python >= 3.13t with `PYTHON_GIL=0` env).
- New `offload_to_disk` support, enabled by default, to massively reduce CPU RAM usage.
- New Intel-optimized and AMD-compatible CPU hw-accelerated `TorchFused` kernel.
- Packing stage is now 4x faster and inlined with quantization.
- VRAM pressure for large models reduced during quantization.
- `act_group_aware` is now 16k+ times faster and the default when `desc_act=False`, for higher-quality recovery without the inference penalty of `desc_act=True`.
- New beta-quality AWQ support with full GEMM, GEMM_Fast, and Marlin kernel support.
- New LFM, Ling, and Qwen3 Omni model support.
- Bitblas kernel updated to support the Bitblas 0.1.0.post1 release.
- Quantization is now faster with reduced VRAM usage.
- Enhanced logging support with LogBar.
- And much, much more...
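Taken together, a minimal sketch of what a v5.0.0 quant run could look like under these defaults (`offload_to_disk` on by default, `act_group_aware` active when `desc_act=False`). The model id and calibration set are placeholders, and the commented-out field names are assumptions taken from the notes above, not confirmed API:

```python
# Data-parallel MoE quantization needs a free-threaded interpreter:
#   PYTHON_GIL=0 python quantize_moe.py   (Python >= 3.13t)
from gptqmodel import GPTQModel, QuantizeConfig

calibration = ["GPTQModel quantizes LLMs with minimal accuracy loss."] * 256

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,  # with desc_act=False, act_group_aware is now the default
    # Assumed field names based on the release notes; verify against the API:
    # act_group_aware=True,  # higher-quality recovery, no desc_act inference penalty
    # offload_to_disk=True,  # already on by default in v5.0.0 to cut CPU RAM usage
)

model = GPTQModel.load("Qwen/Qwen3-30B-A3B", quant_config)  # placeholder model id
model.quantize(calibration)  # packing now runs inline with quantization
model.save("qwen3-moe-gptq-4bit")
```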
What's Changed
- rename `torch_dtype` to `dtype` to sync with hf transformers by @Qubitium in #1804 (see the sketch after this list)
- drop support for python < 3.11 by @CSY-ModelCloud in #1805
- hard deprecated ipex in favor of torch_fused by @Qubitium in #1807
- update pyproject.toml by @CSY-ModelCloud in #1808
- [CI] release with 3.13t by @CSY-ModelCloud in #1811
- [QUANTIZATION] Add AWQ support by @ZX-ModelCloud in #1703
- find mapping by @LRL-ModelCloud in #1812
- Update README.md by @Qubitium in #1813
- Update version.py by @Qubitium in #1814
- Turtle in a half shell by @Qubitium in #1809
- note about memory saving by @Qubitium in #1817
- move fail_safe by @LRL-ModelCloud in #1818
- rename turtle method by @Qubitium in #1820
- add threads by @Qubitium in #1821
- remove AWQ mod defs by @ZX-ModelCloud in #1822
- [CI] use new docker by @CSY-ModelCloud in #1823
- Fix awq quantize by @LRL-ModelCloud in #1824
- [CI] use new docker for release source by @CSY-ModelCloud in #1825
- fix awq pack by @LRL-ModelCloud in #1826
- fix loading autoawq models and hf/vllm/sglang loading of newly awq qu… by @Qubitium in #1827
- wrong arg check by @Qubitium in #1828
- fix thread task var scoping by @Qubitium in #1829
- fix call param by @Qubitium in #1830
- fix threads > 1 not considered (unsafe) by @Qubitium in #1832
- cleanup by @Qubitium in #1833
- fix gptqmodel offload paths conflict by @Qubitium in #1834
- CI test by @Qubitium in #1835
- eora: always diff in fp32 + cleanup by @Qubitium in #1836
- add register_buffer/parameter to NamedModule class by @Qubitium in #1837
- typo by @Qubitium in #1839
- add thread safety to all classes by @Qubitium in #1840
- fix fail_safe by @LRL-ModelCloud in #1844
- update marlin kernel by @ZX-ModelCloud in #1838
- fix fp32 reduce on/off by @Qubitium in #1845
- bypass marlin kernel bias issue by @Qubitium in #1846
- disable marlin atomics by default as it failed ci accuracy test by @Qubitium in #1847
- [FIX] awq marlin by @ZX-ModelCloud in #1816
- cleanup var names by @Qubitium in #1849
- pack per module by @LRL-ModelCloud in #1842
- [CI] use new docker by @CSY-ModelCloud in #1850
- tweak eora test by @Qubitium in #1851
- wait for thread tasks only when every module has completed. by @Qubitium in #1852
- [FIX] Compatible with vllm v0.10.2 by @ZX-ModelCloud in #1855
- move req.txt into toml by @CSY-ModelCloud in #1858
- do not create buffers only to overwrite them by @Qubitium in #1857
- pop states after use by @Qubitium in #1859
- [FIX] multiple "register_buffers" parameters by @ZX-ModelCloud in #1860
- Low memory pack by @Qubitium in #1861
- fix packing ci test by @Qubitium in #1862
- simplify by @Qubitium in #1853
- Fix 3bit packing regression in previous commit by @Qubitium in #1863
- remove deprecated `parallel_packing` property by @Qubitium in #1864
- Fix qqq quant/offloading by @Qubitium in #1866
- temp disable awq gemm kernel due to failing ci by @Qubitium in #1867
- update vllm compat by @Qubitium in #1869
- fix regression by @Qubitium in #1870
- fix setup.py crash when torch does not support float8_e8m0fnu by @CSY-ModelCloud in #1871
- [FIX] AwqGEMMQuantLinear skip gptq_v1 convert to v2 by @ZX-ModelCloud in #1872
- Fix awq gemm auto kernel selection order by @Qubitium in #1873
- Update README.md by @Qubitium in #1874
- reduce forwarding to minimal by @Qubitium in #1876
- Update README.md by @Qubitium in #1877
- fix exllama tests by @Qubitium in #1879
- debug print all params/buffers by @Qubitium in #1880
- skip internal loading of non-pkg compatible quantization models, i.e.… by @Qubitium in #1881
- Loader by @Qubitium in #1882
- Cleanup awq by @Qubitium in #1883
- remove broken test by @Qubitium in #1884
- [CI] remove old cuda/torch support for release by @CSY-ModelCloud in #1885
- fix loader by @LRL-ModelCloud in #1886
- fix nvcc warnings about pending cuda > 13.x compat by @Qubitium in #1887
- fix packing speed test by @Qubitium in #1889
- fix licenses warning by @CSY-ModelCloud in #1888
- set licenses to apache by @CSY-ModelCloud in #1890
- [FIX] AwqGEMMQuantLinear should be PackableQuantLinear by @ZX-ModelCloud in #1891
- skip modules that have no parameters and no buffers since they can't be offloaded by @LRL-ModelCloud in #1892
- skip modules that have no parameters and no buffers since they can't offload by @LRL-ModelCloud in #1894
- Fix device check by @Qubitium in #1896
- [CI] disable test install by @CSY-ModelCloud in #1895
- remove hash feature by @Qubitium in #1897
- fix cuda ext cannot be loaded by @Qubitium in #1898
- lock numpy to 2.2.6 by @CSY-ModelCloud in #1899
- [FIX] test_lm_eval.py by @ZX-ModelCloud in #1900
- Patch fix model save by @Qubitium in #1901
- Ugly patch save 2 by @Qubitium in #1902
- fix potential leak by @Qubitium in #1904
- [FIX] test_integration by @ZX-ModelCloud in #1903
- fix build uploading an empty wheel by @CSY-ModelCloud in #1905
- fix lm_head quant by @LRL-ModelCloud in #1906
- batch tweaks by @Qubitium in #1907
- [FIX] test_kernel_output_torch_fused by @ZX-ModelCloud in ...
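For the `torch_dtype` → `dtype` rename in #1804 (first item in this list), callers should switch to the new kwarg, mirroring the same rename in HF Transformers. A minimal before/after sketch with a placeholder model id:

```python
from gptqmodel import GPTQModel

# Before (deprecated): GPTQModel.load("ModelCloud/some-gptq-model", torch_dtype="auto")
# After (synced with HF Transformers' kwarg rename):
model = GPTQModel.load("ModelCloud/some-gptq-model", dtype="auto")
```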
GPTQModel v4.2.5
What's Changed
- Cleanup hyb_act by @Qubitium in #1791
- Remove torch import in setup.py by @Qubitium in #1729
- Refactor: rename `hyb_act` to `act_group_aware` by @Qubitium in #1794
- Cleanup by @Qubitium in #1795, #1796
- [CI] Add torch 2.8.0 by @CSY-ModelCloud in #1797
- [CI] torch-2.6.0+cu128-python-3.9 does not exist by @CSY-ModelCloud in #1798
- Fix wf_unsqueeze_zero and wf_unsqueeze_neg_one by @LRL-ModelCloud in #1799
- GAR field save to meta on quant save by @Qubitium in #1800
- Add pyproject.toml by @CSY-ModelCloud in #1801
- [CI] Don't detect arch list when it has already been set & fix build-system requirements by @CSY-ModelCloud in #1802
Full Changelog: v4.2.0...v4.2.5
GPTQModel v4.2.0
Notable Changes
- Add Qwen3-Next by @Qubitium and @LRL-ModelCloud in #1787
- Add Apertus support by @LRL-ModelCloud in #1767
- Add Kimi k2 support by @LRL-ModelCloud in #1768
- Add Klear support by @LRL-ModelCloud in #1769
- Add FastLLM support by @LRL-ModelCloud in #1771
- Add Nemotron H support by @LRL-ModelCloud in #1773
- Add `fail_safe` option by @LRL-ModelCloud in #1775 (see the sketch after this list)
- Use threading lock to protect unsafe tensor moves in multi-gpu by @Qubitium in #1778
- Avoid building experimental extensions to reduce wheel size by @Qubitium in #1763
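A sketch of how the new `fail_safe` option might be passed. Whether it lives on the config or on `quantize()` is an assumption, so verify against the shipped signature; the model id is a placeholder:

```python
from gptqmodel import GPTQModel, QuantizeConfig

calibration = ["example calibration text"] * 128

model = GPTQModel.load(
    "Qwen/Qwen3-Next-80B-A3B-Instruct",  # placeholder model id
    QuantizeConfig(bits=4, group_size=128),
)
# fail_safe is assumed to guard modules (e.g. MoE experts) that saw too little
# activation during calibration; kwarg placement is an assumption.
model.quantize(calibration, fail_safe=True)
```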
What's Changed
- Fix LlavaQwen2GPTQ by @LRL-ModelCloud in #1772
- Fix Q.to on multi-gpu gptq when proceeding fast with many experts and gpus by @avtc in #1774
- Bump actions/setup-python from 5 to 6 in the github-actions group by @dependabot[bot] in #1758
- [CI] fix release jobs were skipped by @CSY-ModelCloud in #1759
- ignore compile warns about var declared but not used by @Qubitium in #1760
- allow prebuilt wheel path to be customized via env by @Qubitium in #1761
- add build toggles for all cpp kernels by @Qubitium in #1764
- fix multi gpu inference by @LRL-ModelCloud in #1762
- [CI] reduce wheel download size by @CSY-ModelCloud in #1765
- start 4.2.0-dev cycle by @Qubitium in #1766
- fix klear by @LRL-ModelCloud in #1770
- FIX transformers >= 4.56.1 force-changed `torch.default_dtype` by @Qubitium in #1779
- fix multi gpu fail_safe by @LRL-ModelCloud in #1780
- fix device instance by @LRL-ModelCloud in #1783
- prepare for 4.2 release by @Qubitium in #1785
Full Changelog: v4.1.0...v4.2.0
GPTQModel v4.1.0
Notable Changes:
- Add a config option: `mock_quantization` to simplify heavy computations… by @avtc in #1731 (see the sketch after this list)
- Add GLM-4.5-Air support by @avtc in #1730
- Add GPT-OSS support by @LRL2-ModelCloud in #1737
- Add LongCatFlashGPTQ by @LRL-ModelCloud in #1751
- Add Llama 4 Support by @Qubitium in #1508
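A dry-run sketch using the new `mock_quantization` option from #1731. Treating it as a `QuantizeConfig` field is an assumption, as is the model id:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# mock_quantization skips the heavy math so a full pipeline pass can be
# smoke-tested quickly; field placement is assumed, not confirmed.
cfg = QuantizeConfig(bits=4, group_size=128, mock_quantization=True)
model = GPTQModel.load("zai-org/GLM-4.5-Air", cfg)  # placeholder model id
model.quantize(["smoke-test sample"] * 8)
```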
What's Changed
- Minor Cleanup by @Qubitium in #1718
- disable some compilation on torch 2.8 due to compat issues by @Qubitium in #1727
- add glm4 moe test by @LRL2-ModelCloud in #1734
- deprecate autoround by @Qubitium in #1735
- [FIX] test_kernel_output with XPU by @ZX-ModelCloud in #1741
- cleanup checks for GIL control, GIL=0, and python >= 3.13.3t by @Qubitium in #1743
- update torch/transformer depends by @Qubitium in #1749
- reduce pkg depend by @Qubitium in #1750
- fix triton compat check for 3.13.3t by @Qubitium in #1752
- Bump torch from 2.7.1 to 2.8.0 in /gptqmodel_ext/exllama_eora by @dependabot[bot] in #1755
- pkg update: tokenicer 0.0.5 by @Qubitium in #1756
Full Changelog: v4.0.0...v4.1.0
GPTQModel v4.0.0
Notable Changes
- Support: add glm4 by @glide-the in #1559
- Add Xiaomi MiMo model by @Qubitium in #1571
- Free-threading (GIL-free) quantization for linear NxGPU scaling by @Qubitium in #1581 (see the GIL check sketch after this list)
- feat: add Qwen-Omni support. by @tiger-of-shawn in #1613
- add Qwen 2.5 Omni support by @Qubitium in #1615
- [MODEL] ERNIE4.5 by @LRL-ModelCloud in #1645
- [MODEL] support pangu_alpha model by @ZX-ModelCloud in #1646
- new baidu ernie & huawei pangu model support by @Qubitium in #1647
- [MODEL] Add falcon h1 support by @LRL-ModelCloud in #1621
- feat(gemma3): also support larger gemma3 models and not only small te… by @joennlae in #1627
- Add Group Aware Reordering (GAR) for Efficient Activation Reordering by @tgafni in #1656
- Enable pytorch fused op on XPU by @jiqing-feng in #1660
- [MODEL] Add Seed-OSS support by @LRL2-ModelCloud in #1702
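Since free-threading and GIL-control checks feature in this release, here is a minimal, self-contained sketch of how a caller can verify it is on a free-threaded build before expecting NxGPU quantization scaling. `sys._is_gil_enabled()` and the `Py_GIL_DISABLED` config var are standard CPython 3.13 features:

```python
import sys
import sysconfig

def gil_free() -> bool:
    """True only on a free-threaded (3.13t) build running with the GIL off."""
    if sysconfig.get_config_var("Py_GIL_DISABLED") != 1:
        return False  # regular GIL build: no free-threaded scaling
    # On 3.13t the GIL can still be re-enabled at runtime (e.g. PYTHON_GIL=1),
    # so also check the live interpreter state.
    return not sys._is_gil_enabled()

if __name__ == "__main__":
    print("free-threaded quantization scaling available:", gil_free())
```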
Other Changes
- [CI] add release source with github's vm by @CSY-ModelCloud in #1543
- Fix rotation for tied embedding models by @smpanaro in #1550
- Fix input processing for convolution by @Cecilwang in #1554
- [FIX] moe model quant division by zero issue by @LRL-ModelCloud in #1565
- [FIX] remove too short calib data by @LRL-ModelCloud in #1566
- [FIX] hook_module and qwen3_moe by @LRL-ModelCloud in #1569
- [FIX] hook linear and triton by @LRL-ModelCloud in #1570
- [MISC] simplify model definition by @LRL-ModelCloud in #1572
- [FIX] qwen2 moe loop module by @LRL-ModelCloud in #1574
- [CI] fix unit test was unable to run by @CSY-ModelCloud in #1580
- fix has_gil was not imported & device-smi api wrong by @CSY-ModelCloud in #1586
- fix older python didn't have EnumType by @CSY-ModelCloud in #1590
- [FIX] get_module_by_name_prefix by @LRL-ModelCloud in #1591
- [CI] update release CI, add torch 2.7.0 by @CSY-ModelCloud in #1592
- [FIX] Qwen2.5 vl quant by @LRL-ModelCloud in #1623
- Bump torch from 2.6.0 to 2.7.1 in /gptqmodel_ext/exllama_eora by @dependabot[bot] in #1628
- fix bug for device error by @kaixuanliu in #1631
- [FIX] config seq len by @LRL-ModelCloud in #1640
- register buffer for `wf_unsqueeze_zero` and `wf_unsqueeze_neg_one` to… by @kaixuanliu in #1642
- set_postfix is a tqdm function, no need anymore by @CSY-ModelCloud in #1643
- fix exception to avoid memory issue by @jiqing-feng in #1679
- lm_head hooked by @Chunfei-He in #1673
- Bump the github-actions group across 1 directory with 2 updates by @dependabot[bot] in #1677
- Model config.use_cache not correctly used during inference for some models by @LRL2-ModelCloud in #1686
- [FIX] transformers compat by @LRL2-ModelCloud in #1687
- Update module_looper.py by @LRL2-ModelCloud in #1690
- Update requirements.txt by @LRL2-ModelCloud in #1689
- add ACCEPT_USE_FLASH_ATTEN2_ARG by @LRL2-ModelCloud in #1693
- Fix kwarg vs pos arg hidden states by @LRL2-ModelCloud in #1694
- fix import Perplexity failed by @CSY-ModelCloud in #1695
- [CI] fix CI installed wrong libs' version by @CSY-ModelCloud in #1696
- [FIX] GIL Check by @ZX-ModelCloud in #1697
- [FIX] minicpm test by @LRL2-ModelCloud in #1698
- [FIX] use AutoModelForImageTextToText instead of AutoModelForVision2Seq by @ZX-ModelCloud in #1699
- [CI] change qwen2.5-omni model path by @ZX-ModelCloud in #1701
- [CI] install jieba for test_pangu_alpha by @CSY-ModelCloud in #1706
- disable torch.compile by @LRL2-ModelCloud in #1707
- FIX minicpm CI test by @LRL2-ModelCloud in #1708
- [CI] update torch for build by @CSY-ModelCloud in #1709
- [CI] update release matrix by @CSY-ModelCloud in #1710
- [CI] install torch compiled with cuda 126 by @CSY-ModelCloud in #1711
- use "attn_implementation" by @LRL2-ModelCloud in #1712
- [CI] add 5090 support & install latest intel_extension_for_pytorch by @CSY-ModelCloud in #1713
- [CI] don't compile 5090 for cuda < 12.8 by @CSY-ModelCloud in #1714
- [CI] Update unit test docker by @CSY-ModelCloud in #1715
- [CI] fix release ci by @CSY-ModelCloud in #1716
- fix model path is not public by @CSY-ModelCloud in #1720
- [CI] don't exit when package doesn't exist by @CSY-ModelCloud in #1719
- [CI] no need install logbar manually by @CSY-ModelCloud in #1721
- [CI] remove legacy tests & skip intel tests & disable flash_attn for some models by @CSY-ModelCloud in #1722
- [CI] no need install uv by @CSY-ModelCloud in #1723
- [CI] use new docker with uv binary to fix shim/uv didn't exist by @CSY-ModelCloud in #1724
New Contributors
- @Cecilwang made their first contribution in #1554
- @glide-the made their first contribution in #1559
- @tiger-of-shawn made their first contribution in https://github.com/ModelClo...
GPTQModel v3.0.0
🎉 New ground-breaking GPTQ v2 quantization option for improved model quantization accuracy, validated by GSM8K_PLATINUM benchmarks vs the original GPTQ.
✨ New Phi4-MultiModal model support.
✨ New Nvidia Nemotron Ultra model support.
✨ New Dream model support.
✨ New experimental multi-gpu quantization support.
✨ Reduced VRAM usage and faster quantization.
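A sketch of opting into the new GPTQ v2 path. The exact toggle name (`v2` on `QuantizeConfig` here) is an assumption, so check the config class for the shipped field; the model id is a placeholder:

```python
from gptqmodel import GPTQModel, QuantizeConfig

cfg = QuantizeConfig(bits=4, group_size=128, v2=True)  # "v2" field name is assumed
model = GPTQModel.load("microsoft/Phi-4-multimodal-instruct", cfg)  # placeholder id
model.quantize(["calibration sample"] * 256)
```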
What's Changed
- Multi GPU Quantization by @Qubitium in #1502
- experimental multi-gpu quantization by @Qubitium in #1503
- reduce allocation by @Qubitium in #1504
- revert add_ by @Qubitium in #1506
- Switch to non-deprecated mlx.core.clear_cache() by @smpanaro in #1510
- Dream Model Support by @Qubitium in #1512
- fix disabling batch/mask for dream by @Qubitium in #1514
- reduce tensor device movement by @Qubitium in #1516
- fix deepseek v3 module order by @Qubitium in #1517
- Nemotron Ultra Support by @Qubitium in #1518
- faster process_batch by @Qubitium in #1519
- Fix missing arg due to recent `Processor` api changes by @Qubitium in #1523
- Fix gpt2 columns calculation by @Qubitium in #1524
- temp damper should not overwrite damp cfg by @Qubitium in #1526
- Replace module hooking with tree-defined targeting by @Qubitium in #1527
- Fix compat with XPU by @Qubitium in #1535
- Phi4 MultiModal by @Qubitium in #1511
- disable selection of ExllamaV2 kernel for group_size=16 for now by @Qubitium in #1537
- Add Gptqv2 by @yhhhli and @Qubitium in #1533
Full Changelog: v2.2.0...v3.0.0
GPTQModel v2.2.0
What's Changed
✨ New Qwen 2.5 VL model support. Prelim Qwen 3 model support.
✨ New samples log column during quantization to track module activation in MoE models.
✨ Loss log column now color-coded to highlight modules that are friendly/resistant to quantization.
✨ Progress (per-step) stats during quantization now streamed to log file.
✨ Auto bfloat16 dtype loading for models based on model config.
✨ Fix kernel compile for Pytorch/ROCm.
✨ Slightly faster quantization and auto-resolve some low-level oom issues for smaller vram gpus.
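An illustration of the auto-bfloat16 loading rule described above: prefer the dtype recorded in the model config, and fall back to bf16 only where the hardware supports it. This is a standalone sketch of the policy, not GPTQModel's internal code:

```python
import torch
from transformers import AutoConfig

def resolve_load_dtype(model_id: str) -> torch.dtype:
    # prefer the dtype the model was saved with, if the config records one
    cfg_dtype = getattr(AutoConfig.from_pretrained(model_id), "torch_dtype", None)
    if isinstance(cfg_dtype, str):  # configs may store the dtype as a string
        cfg_dtype = getattr(torch, cfg_dtype, None)
    if isinstance(cfg_dtype, torch.dtype):
        return cfg_dtype
    # otherwise pick bf16 when the GPU supports it, else fp16
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16
```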
- Enable ipex tests for CPU/XPU by @jiqing-feng in #1460
- test kernel accuracies with more shapes on cuda by @Qubitium in #1461
- Fix rocm flags by @Qubitium in #1467
- use table like logging format by @Qubitium in #1471
- stream process log entries to persistent file by @Qubitium in #1472
- fix some models need trust-remote-code arg by @Qubitium in #1474
- Fix wq dtype by @Qubitium in #1475
- add colors to quant loss column by @Qubitium in #1477
- add prelim qwen3 support by @Qubitium in #1478
- Update eora.py for further optimization by @nbasyl in #1488
- faster cholesky inverse and avoid oom when possible by @Qubitium in #1494
- [MODEL] supports qwen2_5_vl by @ZX-ModelCloud in #1493
Full Changelog: v2.1.0...v2.2.0
GPTQModel v2.1.0
What's Changed
✨ New QQQ quantization method and inference support!
✨ New Google Gemma 3 day-zero model support.
✨ New Alibaba Ovis 2 VL model support.
✨ New AMD Instella day-zero model support.
✨ New GSM8K Platinum and MMLU-Pro benchmarking support.
✨ Peft Lora training with GPTQModel is now 30%+ faster on all gpu and IPEX devices.
✨ Auto detect MoE modules not activated during quantization due to insufficient calibration data.
✨ ROCm setup.py compat fixes.
✨ Optimum and Peft compat fixes.
✨ Fixed Peft bfloat16 training.
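For the new GSM8K Platinum / MMLU-Pro benchmarking, one way to exercise a quantized checkpoint is through lm-eval's Python API. The task name `gsm8k_platinum` and the model id are assumptions, so check `lm_eval --tasks list` for the registered name:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ModelCloud/some-gptq-model,device_map=auto",  # placeholder
    tasks=["gsm8k_platinum"],  # task name assumed; verify with `lm_eval --tasks list`
)
print(results["results"])
```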
- auto enable flash_attn only when flash-attn was installed by @CSY-ModelCloud in #1372
- Fix rocm compat by @Qubitium in #1373
- fix unnecessary mkdir by @CSY-ModelCloud in #1374
- add test_kernel_output_xpu.py by @CSY-ModelCloud in #1382
- clean test_kernel_output_xpu.py by @CSY-ModelCloud in #1383
- remove xpu support of triton kernel by @Qubitium in #1384
- [MODEL] Add instella support by @LRL-ModelCloud in #1385
- Fix optimum/peft trainer integration by @CSY-ModelCloud in #1381
- rename peft test file by @CSY-ModelCloud in #1387
- [CI] fix wandb was not installed & update test_olora_finetuning_xpu.py by @CSY-ModelCloud in #1388
- Add lm-eval `GSM8k Platinum` by @Qubitium in #1394
- Remove cuda kernel by @Qubitium in #1396
- fix exllama kernels not compiled by @Qubitium in #1397
- update tests by @Qubitium in #1398
- make the kernel output validation more robust by @Qubitium in #1399
- speed up ci by @Qubitium in #1400
- add fwd counter by @yuchiwang in #1389
- allow triton and ipex to inherit torch kernel and use torch for train… by @Qubitium in #1401
- fix skip moe modules when fwd count is 0 by @Qubitium in #1404
- fix ipex linear post init for finetune by @jiqing-feng in #1406
- fix optimum compat by @Qubitium in #1408
- [Feature] Add mmlupro API by @CL-ModelCloud in #1405
- add training callback by @CSY-ModelCloud in #1409
- Fix bf16 training by @Qubitium in #1410
- fix bf16 forward for triton by @Qubitium in #1411
- Add QQQ by @Qubitium in #1402
- make IPEX or any kernel that uses Torch for Training to auto switch v… by @Qubitium in #1412
- [CI] xpu inference test by @CL-ModelCloud in #1380
- [FIX] qqq with eora by @ZX-ModelCloud in #1415
- [FIX] device error by @ZX-ModelCloud in #1417
- make quant linear expose internal buffers by @Qubitium in #1418
- Fix bfloat16 kernels by @Qubitium in #1420
- fix qqq bfloat16 forward by @Qubitium in #1423
- Fix ci10 by @Qubitium in #1424
- fix marlin bf16 compat by @Qubitium in #1427
- [CI] no need reinstall requirements by @CSY-ModelCloud in #1426
- [FIX] dynamic save error by @ZX-ModelCloud in #1428
- [FIX] super().post_init() calling order by @ZX-ModelCloud in #1431
- fix bitblas choose IPEX in cuda env by @CSY-ModelCloud in #1432
- Fix exllama is not packable by @Qubitium in #1433
- disable exllama for training by @Qubitium in #1435
- remove TritonV2QuantLinear for xpu test by @CSY-ModelCloud in #1436
- [MODEL] add gemma3 support by @LRL-ModelCloud in #1434
- fix the error when downloading models using modelscope by @mushenL in #1437
- Add QQQ Rotation by @ZX-ModelCloud in #1425
- fix no init.py by @CSY-ModelCloud in #1438
- Fix hadamard import by @Qubitium in #1441
- Eora final by @nbasyl in #1440
- triton is not validated for ipex by @Qubitium in #1445
- Fix exllama adapter by @Qubitium in #1446
- fix rocm compile by @Qubitium in #1447
- [FIX] Correctly obtain the submodule's device by @ZX-ModelCloud in #1448
- fix rocm not compatible with exllama v2 and eora kernel by @Qubitium in #1449
- revert overflow code by @Qubitium in #1450
- add kernel dtype support and add full float16 vs bfloat16 kernel testing by @Qubitium in #1452
- [MODEL] add Ovis2 support and bug fix by @Fusionplay in #1454
- add unit test for ovis2 by @CSY-ModelCloud in #1456
New Contributors
- @yuchiwang made their first contribution in #1389
- @mushenL made their first contribution in #1437
- @nbasyl made their first contribution in #1440
- @Fusionplay made their first contribution in #1454
Full Changelog: v2.0.0...v2.1.0
GPTQModel v2.0.0
What's Changed
🎉 GPTQ quantization internals are now broken into multiple stages (processes) for feature expansion.
🎉 Synced Marlin kernel inference quality fix from upstream. Added MARLIN_FP16, a lower-quality but faster backend.
🎉 ModelScope support added.
🎉 Logging and cli progress bar output has been revamped with sticky bottom progress.
🎉 Added CI tests to track regression in kernel inference quality and sweep all bits/group_sizes.
🎉 Delegate logging/progress-bar to LogBar pkg.
🐛 Fix ROCm version auto detection in setup install.
🐛 Fixed generation_config.json save and load.
🐛 Fixed Transformers v4.49.0 compat. Fixed compat of models without bos.
🐛 Fixed group_size=-1 and bits=3 packing regression.
🐛 Fixed Qwen 2.5 MoE regressions.
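Selecting the new faster-but-lower-accuracy Marlin path at load time. The `BACKEND` enum is part of GPTQModel and `MARLIN_FP16` is named above, though the exact member spelling should be verified; the model id is a placeholder:

```python
from gptqmodel import BACKEND, GPTQModel

model = GPTQModel.load(
    "ModelCloud/some-4bit-gptq-model",  # placeholder model id
    backend=BACKEND.MARLIN_FP16,        # trades some accuracy for speed vs default Marlin
)
```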
- fix 3 bit packing regression, fixed #1278 by @CSY-ModelCloud in #1280
- Fix supported models list (syntax error) by @Forenche in #1281
- feat: load model from modelscope by @suluyana in #1283
- merge eval & utils.lm_eval by @CSY-ModelCloud in #1282
- fix modelscope import & tests by @CSY-ModelCloud in #1285
- allow passing model instance to evalplus & update tokenizer loading logics by @CSY-ModelCloud in #1284
- fix lm-eval & vllm check tokenizer type by @CSY-ModelCloud in #1287
- Fix `generation_config.json` not auto-saved by @Qubitium in #1292
- [SAVE] Save config files with empty state dict by @ZX-ModelCloud in #1293
- [SAVE] Save processor related config files by @ZX-ModelCloud in #1295
- fix wrong order of config save causing sharded tensors to be removed by @Qubitium in #1297
- [FIX] not pack when group_size=-1 by @ZX-ModelCloud in #1298
- cleanup marlin paths: marlin does conversion on `post_init` by @Qubitium in #1310
- bump tokenicer to v0.0.3 by @CSY-ModelCloud in #1308
- clean is_marlin_format for tests by @CSY-ModelCloud in #1311
- [CI] fix sglang test name & add status logs & remove exllama packing test by @CSY-ModelCloud in #1312
- skip v1 to v2 conversion for sym=True only kernels by @Qubitium in #1314
- bump tokenicer to 0.0.4 & remove FORMAT_FIELD_COMPAT_MARLIN by @CSY-ModelCloud in #1315
- revert is_marlin_format check by @CSY-ModelCloud in #1316
- Improve Marlin accuracy (default) but add `MARLIN_FP16` backend for faster inference at slightly lower accuracy by @Qubitium in #1317
- marlin fp32 mode should also be enabled if kernel was selected due to… by @Qubitium in #1318
- refactor logger by @Qubitium in #1319
- fix typo by @Qubitium in #1320
- refactor logger and have progress bar sticky to bottom of cli by @Qubitium in #1322
- [CI] fix tokenicer upgraded transformers & install bitblas for test_save_quanted_model by @CSY-ModelCloud in #1321
- [CI] allow to select compiler server & move model test to correct dir by @CSY-ModelCloud in #1323
- fix bitblas loading regression by @Qubitium in #1324
- marlin fp16 warning missed check by @Qubitium in #1325
- fix custom logger overriding system level logger by @Qubitium in #1327
- fix progress bar for packing by @CSY-ModelCloud in #1326
- More log fixes by @Qubitium in #1328
- fix no backend when creating a quant linear by @CSY-ModelCloud in #1329
- use relative path instead of importing gptqmodel by @CSY-ModelCloud in #1331
- no need patch vllm now by @CSY-ModelCloud in #1332
- [CI] fix CI url by @CSY-ModelCloud in #1333
- fix oom by @CSY-ModelCloud in #1335
- add default value for backend, fix optimum doesn't pass it by @CSY-ModelCloud in #1334
- refactor pb and pb usage by @Qubitium in #1341
- fix generator has no length info by @CSY-ModelCloud in #1342
- replace utils.Progressbar with logbar by @CSY-ModelCloud in #1343
- [CI] update UI by @CSY-ModelCloud in #1344
- fix logbar api usage by @CSY-ModelCloud in #1345
- fix v2 to v1 missed logic bypass by @Qubitium in #1347
- [CI] fix xpu env has no logbar by @CSY-ModelCloud in #1346
- [CI] update runner ip env & fix show-statistics didn't run by @CSY-ModelCloud in #1348
- fix time was not imported by @CSY-ModelCloud in #1349
- update device-smi depend to v0.4.0 by @Qubitium in #1351
- [CI] install requirements.txt for m4 by @CSY-ModelCloud in #1352
- Exllama V1 is Packable by @ZX-ModelCloud in #1356
- [FIX] test_packable.py by @ZX-ModelCloud in #1357
- [setup] use torch.version.hip for rocm version check by @CSY-ModelCloud in #1360
- save/load peft lora by @Qubitium in #1358
- update device-smi to 0.4.1 for rocm fix by @Qubitium in #1362
- strip model path by @Qubitium in #1363
- [CI] exllama v1 kernel now eligible for quant stage by @Qubitium in #1364
- Fix transformers modeling code passing `input.shape[0] == 0` to nn.module by @Qubitium in #1365
- simplify log var by @Qubitium in #1368
- fix import by @CSY-ModelCloud in #1369
- update by @Qubitium in #1370
Full Changelog: v1.9.0...v2.0.0
GPTQModel v1.9.0
What's Changed
⚡ Offload tokenizer fixes to Toke(n)icer pkg.
⚡ Optimized lm_head quant time and vram usage.
⚡ Optimized DeepSeek v3/R1 model quant vram usage.
⚡ 3x speed-up for Torch kernel when using Pytorch >= 2.5.0 with model.compile().
⚡ New `calibration_dataset_concat_size` option to enable calibration data concat mode, mimicking the original GPTQ data-packing strategy, which may improve quant speed and accuracy for datasets like wikitext2.
🐛 Fixed Optimum compat and XPU/IPEX auto kernel selection regression in v1.8.1.
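A sketch of the concat mode described above. `calibration_dataset_concat_size` is named in these notes, but passing it as a `quantize()` kwarg (and the 2048-token target, and the model id) are assumptions:

```python
from gptqmodel import GPTQModel, QuantizeConfig

model = GPTQModel.load(
    "meta-llama/Llama-3.2-1B",  # placeholder model id
    QuantizeConfig(bits=4, group_size=128),
)

# concat mode packs calibration rows into fixed-size sequences, mimicking the
# original GPTQ data-packing strategy (helps wikitext2-style datasets).
model.quantize(
    ["wikitext-style calibration text"] * 512,
    calibration_dataset_concat_size=2048,  # kwarg placement assumed
)
model.save("llama-gptq-4bit")
```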
- Fix init arg order and `optimum` compat by @CSY-ModelCloud in #1240
- [FIX][Optimize] lm_head quantize by @ZX-ModelCloud in #1239
- [Model] [DeepSeek] un-merge `gate_proj` and `up_proj` by @LRL-ModelCloud in #1241
- Use Toke(n)icer by @CL-ModelCloud in #1242, #1244
- Add Tokenicer Test by @CL-ModelCloud in #1245
- prepare for 1.8.2 release by @Qubitium in #1243
- simplify calls to tokenicer by @CL-ModelCloud in #1246
- Update requirements.txt by @Qubitium in #1248
- fix trust_remote was lost by @CSY-ModelCloud in #1249
- fix trust_remote was lost by @CSY-ModelCloud in #1250
- prepare for 1.8.5 release by @Qubitium in #1251
- fix unit tests & tweak logic for selecting backends by @CSY-ModelCloud in #1253
- install tokenicer form git & do ruff by @CSY-ModelCloud in #1254
- fix k,v is not a dict by @CSY-ModelCloud in #1255
- fix not enough values to unpack (expected 2, got 1) by @CSY-ModelCloud in #1256
- fix sglang test requires numpy<2.0 by @CSY-ModelCloud in #1258
- fix ipex backend by @jiqing-feng in #1259
- ipex should be packable, reverted pr #1259 importer.py changes by @CSY-ModelCloud in #1260
- remove sentencepiece by @CSY-ModelCloud in #1261
- speed up torch dequantize by @Qubitium in #1262
- Add `calibration_dataset_concat_size` option/mode by @LRL-ModelCloud in #1257
- add transformers test by @CSY-ModelCloud in #1264
- Add kernel torch.compile hook by @Qubitium in #1265
- [FIX]fix vl model prepare_dataset by @LRL-ModelCloud in #1266
Full Changelog: v1.8.1...v1.9.0