LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU via HF, vLLM, and SGLang.
- 10/21/2025 5.0.0: 🎉 Data-parallel quant support for `MoE` models on multi-gpu using `nogil` Python. `offload_to_disk` support enabled by default to massively reduce cpu ram usage. New `Intel` and `AMD` cpu hw-accelerated `TorchFused` kernel. Packing stage is now 4x faster and inlined with quantization. Vram pressure for large models reduced during quantization. `Machete` kernel added for Hopper+/Blackwell acceleration for gptq and awq models. `act_group_aware` is 16k+ times faster and now the default when `desc_act=False`, for higher quality recovery without the inference penalty of `desc_act=True`. New beta-quality `AWQ` support with full `gemm`, `gemm_fast`, and `marlin` kernel support. `LFM`, `Ling`, and `Qwen3 Omni` model support. Quantization is now faster with reduced vram usage. Enhanced logging support with `LogBar`.
- 09/16/2025 4.2.5: `hyb_act` renamed to `act_group_aware`. Removed finicky `torch` import within `setup.py`. Packing bug fix and prebuilt Pytorch 2.8 whls.
- 09/12/2025 4.2.0: ✨ New Models Support: Qwen3-Next, Apertus, Kimi K2, Klear, FastLLM, Nemotron H. New `fail_safe` boolean toggle to `.quantize()` to patch-fix non-activated `MoE` modules due to highly uneven MoE model training. Fixed LavaQwen2 compat. Patch fix GIL=0 cuda error for multi-gpu. Fix compat with autoround + new transformers.
- 09/04/2025 4.1.0: ✨ Meituan LongCat Flash Chat, Llama 4, GPT-OSS (BF16), and GLM-4.5-Air support. New experimental `mock_quantization` config to skip complex computational code paths during quantization to accelerate model quant testing.
- 08/21/2025 4.0.0: 🎉 New Group Aware Reordering (GAR) support. New models support: Bytedance Seed-OSS, Baidu Ernie, Huawei PanGu, Gemma3, Xiaomi Mimo, Qwen 3/MoE, Falcon H1, GPT-Neo. Memory leak and multiple model compatibility fixes related to Transformers >= 4.54. Python >= 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and also linear N x Cpu Core scaling of packing stage. Early access Pytorch 2.8 fused-ops on Intel XPU for up to 50% speedup.
Archived News
* 10/17/2025 5.0.0-dev `main`: 👀 EoRA now multi-gpu compatible. Fixed both quality stability of multi-gpu quants and vram usage. New LFM and Ling models support.
* 09/30/2025 5.0.0-dev `main`: 👀 New Data Parallel + Multi-GPU + Python 3.13T (PYTHON_GIL=0) equals 80%+ overall quant time reduction of large MoE models vs v4.2.5.
* 09/29/2025 5.0.0-dev `main`: 🎉 New Qwen3 Omni model support. AWQ Marlin kernel integrated + many disk offload, threading, and memory usage fixes.
* 09/24/2025 5.0.0-dev `main`: 🎉 Up to 90% cpu mem saving for large MoE models with faster/inline packing! 26% quant time reduction for Qwen3 MoE! AWQ Marlin kernel added. AWQ Gemm loading bug fixes. `act_group_aware` now faster and auto enabled for GPTQ when `desc_act` is False for higher quality recovery.
* 09/19/2025 5.0.0-dev `main`: 👀 Cpu memory saving of ~73.5% during quantization stage with new `offload_to_disk` quantization config property defaulting to `True`.
* 09/18/2025 5.0.0-dev `main`: 🎉 AWQ quantization support! Complete refactor and simplification of model definitions in preparation for future quantization formats.
* 08/19/2025 4.0.0-dev `main`: Fix quantization memory usage due to some models' incorrect application of `config.use_cache` during inference. Fixed `Transformers` >= 4.54.0 compat which changed layer forward return signature for some models.
* 08/18/2025 4.0.0-dev `main`: GPT-Neo model support. Memory leak fix in error capture (stacktrace) and fixed `lm_head` quantization compatibility for many models.
* 07/31/2025 4.0.0-dev `main`: New Group Aware Reordering (GAR) support and prelim Pytorch 2.8 fused-ops for Intel XPU for up to 50% speedup.
* 07/03/2025 4.0.0-dev `main`: New Baidu Ernie and Huawei PanGu model support.
* 07/02/2025 4.0.0-dev `main`: Gemma3 4B model compat fix.
* 05/29/2025 4.0.0-dev `main`: Falcon H1 model support. Fixed Transformers `4.52+` compat with Qwen 2.5 VL models.
* 05/19/2025 4.0.0-dev `main`: Qwen 2.5 Omni model support.
* 05/05/2025 4.0.0-dev `main`: Python 3.13t free-threading support added with near N x GPU linear scaling for quantization of MoE models and also linear N x Cpu Core scaling of packing stage.
* 04/29/2025 3.1.0-dev (Now 4.) `main`: Xiaomi Mimo model support. Qwen 3 and 3 MoE model support. New arg for `quantize(..., calibration_dataset_min_length=10)` to filter out bad calibration data that exists in public datasets (wikitext).
* 04/13/2025 [3.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v3.0.0): 🎉 New experimental `GPTQ v2` quantization option for improved model quantization accuracy validated by `GSM8K_PLATINUM` [benchmarks](https://github.com/ModelCloud/GPTQModel#quantization-using-gptq-v2) vs original `gptq`. New `Phi4-MultiModal` model support. New Nvidia Nemotron-Ultra model support. New `Dream` model support. New experimental `multi-gpu` quantization support. Reduced vram usage. Faster quantization.
* 04/2/2025 [2.2.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v2.2.0): New `Qwen 2.5 VL` model support. New `samples` log column during quantization to track module activation in MoE models. `Loss` log column now color-coded to highlight modules that are friendly/resistant to quantization. Progress (per-step) stats during quantization now streamed to log file. Auto `bfloat16` dtype loading for models based on model config. Fix kernel compile for Pytorch/ROCm. Slightly faster quantization and auto-resolve some low-level oom issues for smaller vram gpus.
* 03/12/2025 [2.1.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v2.1.0): ✨ New `QQQ` quantization method and inference support! New Google `Gemma 3` zero-day model support. New Alibaba `Ovis 2` VL model support. New AMD `Instella` zero-day model support. New `GSM8K Platinum` and `MMLU-Pro` benchmarking support. Peft Lora training with GPT-QModel is now 30%+ faster on all gpu and IPEX devices. Auto detect MoE modules not activated during quantization due to insufficient calibration data. `ROCm` `setup.py` compat fixes. `Optimum` and `Peft` compat fixes. Fixed `Peft` `bfloat16` training.
* 03/03/2025 [2.0.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v2.0.0): 🎉 `GPTQ` quantization internals are now broken into multiple stages (processes) for feature expansion. Synced `Marlin` kernel inference quality fix from upstream. Added `MARLIN_FP16`, lower-quality but faster backend. `ModelScope` support added. Logging and cli progress bar output has been revamped with sticky bottom progress. Fixed `generation_config.json` save and load. Fixed Transformers v4.49.0 compat. Fixed compat of models without `bos`. Fixed `group_size=-1` and `bits=3` packing regression. Fixed Qwen 2.5 MoE regressions. Added CI tests to track regression in kernel inference quality and sweep all bits/group_sizes. Delegate logging/progress bar to [LogBar](https://github.com/modelcloud/logbar) pkg. Fix ROCm version auto detection in `setup` install.
* 02/12/2025 [1.9.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.9.0): ⚡ Offload `tokenizer` fixes to [Toke(n)icer](https://github.com/modelcloud/tokenicer) pkg. Optimized `lm_head` quant time and vram usage. Optimized `DeepSeek v3/R1` model quant vram usage. Fixed `Optimum` compat regression in `v1.8.1`. 3x speed-up for `Torch` kernel when using Pytorch >= 2.5.0 with `model.optimize()`. New `calibration_dataset_concat_size` option to enable calibration data `concat` mode to mimic original GPTQ data packing strategy which may improve quant speed and accuracy for datasets like `wikitext2`.
* 02/08/2025 [1.8.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.8.1): ⚡ `DeepSeek v3/R1` model support. New flexible weight `packing`: allow quantized weights to be packed to `[int32, int16, int8]` dtypes. `Triton` and `Torch` kernels support the full range of new `QuantizeConfig.pack_dtype`. New `auto_gc: bool` control in `quantize()` which can reduce quantization time for small models with no chance of oom. New `GPTQModel.push_to_hub()` api for easy quant model upload to HF repo. New `buffered_fwd: bool` control in `model.quantize()`. Over 50% quantization speed-up for visual (vl) models. Fixed `bits=3` packing and `group_size=-1` regression in v1.7.4.
* 01/26/2025 [1.7.4](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.4): New `compile()` api for ~4-8% inference tps improvement. Faster `pack()` for post-quantization model save. `Triton` kernel validated for Intel/`XPU` when Intel Triton packages are installed. Fixed Transformers (bug) downcasting tokenizer class on save.
* 01/20/2025 [1.7.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.3): New Telechat2 (China Telecom) and PhiMoE model support. Fixed `lm_head` weights duplicated in post-quantize save() for models with tied-embedding.
* 01/19/2025 [1.7.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.2): Effective BPW (bits per weight) will now be logged during `load()`. Reduce loading time on Intel Arc A770/B580 `XPU` by 3.3x. Reduce memory usage in MLX conversion and fix Marlin kernel auto-select not checking CUDA compute version.
* 01/17/2025 [1.7.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.7.0): 👀 ✨ `backend.MLX` added for runtime-conversion and execution of GPTQ models on Apple's `MLX` framework on Apple Silicon (M1+). Exports of `gptq` models to `mlx` also now possible. We have added `mlx` exported models to [huggingface.co/ModelCloud](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2). ✨ `lm_head` quantization now fully supported by GPTQModel without external pkg dependency.
* 01/07/2025 [1.6.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.6.1): 🎉 New OpenAI api compatible end-point via `model.serve(host, port)`. Auto-enable flash-attention2 for inference. Fixed `sym=False` loading regression.
* 01/06/2025 [1.6.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.6.0): ⚡ 25% faster quantization. 35% reduction in vram usage vs v1.5. 👀 AMD ROCm (6.2+) support added and validated for 7900XT+ GPU. Auto-tokenizer loader via `load()` api. For most models you no longer need to manually init a tokenizer for both inference and quantization.
* 01/01/2025 [1.5.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.5.1): 🎉 2025! Added `QuantizeConfig.device` to clearly define which device is used for quantization: default = `auto`. Non-quantized models are always loaded on cpu by default and each layer is moved to `QuantizeConfig.device` during quantization to minimize vram usage. Compatibility fixes for `attn_implementation_autoset` in latest transformers.
- 12/23/2024 1.5.0: Multi-modal (image-to-text) optimized quantization support has been added for Qwen 2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less than optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only text layers are quantized.
- 12/19/2024 1.4.5: Windows 11 support added/validated. Ovis VL model support with image dataset calibration. Fixed `dynamic` loading. Reduced quantization vram usage.
- 12/15/2024 1.4.2: MacOS `gpu` (Metal) and `cpu` (M+) support added/validated for inference and quantization. Cohere 2 model support added.
- 12/13/2024 1.4.1: Added Qwen2-VL model support. `mse` quantization control exposed in `QuantizeConfig`. Monkey patch `patch_vllm()` and `patch_hf()` api added to allow Transformers/Optimum/PEFT and vLLM to correctly load GPTQModel quantized models while upstream PRs are in pending status.
- 12/10/2024 1.4.0 `EvalPlus` harness integration merged upstream. We now support both `lm-eval` and `EvalPlus`. Added pure torch `Torch` kernel. Refactored `Cuda` kernel to be `DynamicCuda` kernel. `Triton` kernel now auto-padded for max model support. `Dynamic` quantization now supports both positive `+:` (default) and negative `-:` matching, which allows matched modules to be skipped entirely for quantization. Fixed auto-`Marlin` kernel selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refactor and cleanup in preparation for transformers/optimum/peft upstream PR merge. Deprecated the saving of `Marlin` weight format since `Marlin` supports auto conversion of `gptq` format to `Marlin` during runtime.
- 11/29/2024 1.3.1 Olmo2 model support. Intel XPU acceleration via IPEX. Model sharding Transformer compat fix due to api deprecation in HF. Removed triton dependency. Triton kernel now optionally dependent on triton pkg.
- 11/26/2024 1.3.0 Zero-Day Hymba model support. Removed `tqdm` and `rogue` dependency.
- 11/24/2024 1.2.3 HF GLM model support. ClearML logging integration. Use `device-smi` and replace `gputil` + `psutil` depends. Fixed model unit tests.
- 11/11/2024 🚀 1.2.1 Meta MobileLLM model support added. `lm-eval[gptqmodel]` integration merged upstream. Intel/IPEX cpu inference merged replacing QBits (deprecated). Auto-fix/patch ChatGLM-3/GLM-4 compat with latest transformers. New `.load()` and `.save()` api.
- 10/29/2024 🚀 1.1.0 IBM Granite model support. Full auto-buildless wheel install from pypi. Reduce max cpu memory usage by >20% during quantization. 100% CI model/feature coverage.
- 10/12/2024 ✨ 1.0.9 Move AutoRound to optional and fix pip install regression in v1.0.8.
- 10/11/2024 ✨ 1.0.8 Add wheel for python 3.12 and cuda 11.8.
- 10/08/2024 ✨ 1.0.7 Fixed marlin (faster) kernel not being auto-selected for some models.
- 09/26/2024 ✨ 1.0.6 Fixed Llama 3.2 Vision quantized model loader.
- 09/26/2024 ✨ 1.0.5 Partial Llama 3.2 Vision model support (mllama): only text-layer quantization is supported for now.
- 09/26/2024 ✨ 1.0.4 Integrated Liger Kernel support for ~1/2 memory reduction on some models during quantization. Added control toggle to disable parallel packing.
- 09/18/2024 ✨ 1.0.3 Added Microsoft GRIN-MoE and MiniCPM3 support.
- 08/16/2024 ✨ 1.0.2 Support Intel/AutoRound v0.3, pre-built whl packages, and PyPI release.
- 08/14/2024 ✨ 1.0.0 40% faster `packing`, fixed Python 3.9 compat, added `lm_eval` api.
- 08/10/2024 🚀 0.9.11 Added LG EXAONE 3.0 model support. New `dynamic` per layer/module flexible quantization where each layer/module may have different bits/params. Added proper sharding support to `backend.BITBLAS`. Auto-heal quantization errors due to small damp values.
- 07/31/2024 🚀 0.9.10 Ported vllm/nm `gptq_marlin` inference kernel with expanded bits (8bits), group_size (64,32), and desc_act support for all GPTQ models with `FORMAT.GPTQ`. Auto calculate auto-round nsamples/seglen parameters based on calibration dataset. Fixed save_quantized() called on pre-quantized models with non-supported backends. HF transformers depend updated to ensure Llama 3.1 fixes are correctly applied to both quant and inference.
- 07/25/2024 🚀 0.9.9: Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang, and more.
- 07/13/2024 🚀 0.9.8: Run quantized models directly using GPTQModel using fast `vLLM` or `SGLang` backend! Both vLLM and SGLang are optimized for dynamic batching inference for maximum `TPS` (check usage under examples). Marlin backend also got full end-to-end in/out features padding to enhance current/future model compatibility.
- 07/08/2024 🚀 0.9.7: InternLM 2.5 model support added.
- 07/08/2024 🚀 0.9.6: Intel/AutoRound QUANT_METHOD support added for a potentially higher quality quantization with `lm_head` module quantization support for even more vram reduction: format export to `FORMAT.GPTQ` for max inference compatibility.
- 07/05/2024 🚀 0.9.5: Cuda kernels have been fully deprecated in favor of Exllama(v1/v2)/Marlin/Triton.
- 07/03/2024 🚀 0.9.4: HF Transformers integration added and bug-fixed Gemma 2 support.
- 07/02/2024 🚀 0.9.3: Added Gemma 2 support, faster PPL calculations on gpu, and more code/arg refactor.
- 06/30/2024 🚀 0.9.2: Added auto-padding of model in/out-features for exllama and exllama v2. Fixed quantization of OPT and DeepSeek V2-Lite models. Fixed inference for DeepSeek V2-Lite.
- 06/29/2024 🚀 0.9.1: With 3 new models (DeepSeek-V2, DeepSeek-V2-Lite, DBRX Converted), BITBLAS new format/kernel, proper batching of calibration dataset resulting in > 50% quantization speedup, security hash check of loaded model weights, tons of refactor/usability improvements, bug fixes and much more.
- 06/20/2024 ✨ 0.9.0: Thanks for all the work from the ModelCloud team and the opensource ML community for their contributions!
GPT-QModel is a production-ready LLM model compression/quantization toolkit with hw-accelerated inference support for both cpu/gpu via HF Transformers, vLLM, and SGLang.
Public and ModelCloud's internal tests have shown that GPTQ is on par with and/or exceeds other 4bit quantization methods in terms of both quality recovery and production-level inference speed for token latency and rps. GPTQ has the optimal blend of quality and inference speed you need in a real-world production deployment.
GPT-QModel not only supports GPTQ but also QQQ, GPTQ v2, and EoRA, with more quantization methods and enhancements planned.
GPT-QModel has a modular design that supports multiple quantization methods and feature extensions.
Quantization Feature | GPT-QModel | Transformers | vLLM | SGLang | Lora Training |
---|---|---|---|---|---|
GPTQ | ✅ | ✅ | ✅ | ✅ | ✅ |
EoRA | ✅ | ✅ | ✅ | ✅ | x |
Group Aware Act Reordering | ✅ | ✅ | ✅ | ✅ | ✅ |
AWQ | ✅ | ✅* | ✅* | ✅* | ✅* |
QQQ | ✅ | x | x | x | x |
Rotation | ✅ | x | x | x | x |
GPTQ v2* | ✅ | ✅ | ✅ | ✅ | ✅ |
Native support for some of the most popular multi-modal models:
Multi-Modal | |
---|---|
Qwen 2.5 Omni | ✅ |
Qwen2 VL | ✅ |
Ovis 1.6 + 2 | ✅ |
Phi-4 MultiModal | ✅ |
- ✨ Native integration with HF Transformers, Optimum, and Peft (main)
- 🚀 vLLM and SGLang inference integration for quantized models with format = `FORMAT.GPTQ`.
- ✨ GPTQ, AWQ, and QQQ quantization formats with hw-accelerated inference kernels.
- 🚀 Data Parallelism for 80%+ quantization time reduction with Multi-GPU.
- 🚀 Optimized for Python >= 3.13t (free threading) with lock-free threading.
- ✨ Linux, MacOS, Windows platform quantization and accelerated inference support for CUDA (Nvidia), XPU (Intel), ROCm (AMD), MPS (Apple Silicon), CPU (Intel/AMD/Apple Silicon).
- ✨ `Dynamic` mixed quantization control on a per-module basis. Each layer/module can have a unique quantization config or be excluded from quantization altogether.
- 🚀 Intel Torch 2.8 fused kernel support for XPU [`Arc` + `Datacenter Max`] and CPU [`avx`, `amx`, `xmx`].
- 🚀 Python 3.13.3t (free-threading, GIL disabled) support for multi-gpu accelerated quantization of MoE models and multi-core cpu boost for post-quant packing.
- ✨ Asymmetric `Sym=False` support (see the config sketch after this list). Model weights sharding support with optional hash check of model weights on load.
- ✨ `lm_head` module quant inference support for further VRAM reduction.
- 🚀 Microsoft/BITBLAS format + dynamically compiled inference.
- 💯 100% CI unit-test coverage for all supported models and kernels including post-quantization quality regression.
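As a quick illustration of the config flags behind a few of the features above, here is a minimal sketch. It assumes `sym` and `offload_to_disk` are accepted as `QuantizeConfig` properties, as referenced in the feature list and release notes; verify against your installed version.
from gptqmodel import QuantizeConfig
# asymmetric quantization (Sym=False) with disk offload to cut cpu ram usage during quantization
# offload_to_disk defaults to True since v5.0.0; shown here only for illustration
quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    sym=False,             # asymmetric quantization
    offload_to_disk=True,  # offload module state to disk during the quantization stage
)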
🤗 ModelCloud quantized Vortex models on HF
Model | |||||||||
---|---|---|---|---|---|---|---|---|---|
Apertus | ✅ | EXAONE 3.0 | ✅ | InternLM 1/2.5 | ✅ | Mixtral | ✅ | Qwen 2/3 (Next/MoE) | ✅ |
Baichuan | ✅ | Falcon (H1) | ✅ | Kimi K2 | ✅ | MobileLLM | ✅ | Qwen 2/2.5 VL | ✅ |
Bloom | ✅ | FastVLM | ✅ | Klear | ✅ | MOSS | ✅ | Qwen 2.5/3 Omni | ✅ |
ChatGLM | ✅ | Gemma 1/2/3 | ✅ | LING/RING | ✅ | MPT | ✅ | RefinedWeb | ✅ |
CodeGen | ✅ | GPTBigCode | ✅ | Llama 1-3.3 | ✅ | Nemotron H | ✅ | StableLM | ✅ |
Cohere 1-2 | ✅ | GPT-Neo/GPT-NeoX | ✅ | Llama 3.2 VL | ✅ | Nemotron Ultra | ✅ | StarCoder2 | ✅ |
DBRX Converted | ✅ | GPT-2 | ✅ | Llama 4 | ✅ | OPT | ✅ | TeleChat2 | ✅ |
Deci | ✅ | GPT-J | ✅ | LongCatFlash | ✅ | OLMo2 | ✅ | Yi | ✅ |
DeepSeek-V2/V3/R1 | ✅ | GPT-OSS | ✅ | LongLLaMA | ✅ | Ovis 1.6/2 | ✅ | Seed-OSS | ✅ |
DeepSeek-V2-Lite | ✅ | Granite | ✅ | Instella | ✅ | Phi 1-4 | ✅ | XVERSE | ✅ |
Dream | ✅ | GRIN-MoE | ✅ | MiniCPM3 | ✅ | PanGu-α | ✅ | ||
ERNIE 4.5 | ✅ | Hymba | ✅ | Mistral | ✅ | Qwen 1/2/3 | ✅ |
GPT-QModel is validated for Linux, MacOS, and Windows 11:
Platform | Device | Optimized | Arch | Kernels |
---|---|---|---|---|
🐧 Linux | Nvidia GPU | ✅ | Ampere+ | Machete, Marlin, Exllama V2, Exllama V1, Triton, Torch |
🐧 Linux | AMD GPU | ✅ | 7900XT+, ROCm 6.2+ | Exllama V2, Exllama V1, Torch |
🐧 Linux | Intel XPU | ✅ | Arc, Datacenter Max | Torch Fused (PyTorch 2.8+), Torch |
🐧 Linux | Intel/AMD CPU | ✅ | avx, amx, xmx | Torch Fused (PyTorch 2.8+), Torch |
🍎 MacOS | GPU (Metal) / CPU | ✅ | Apple Silicon, M1+ | Torch, MLX via conversion |
🪟 Windows | GPU (Nvidia) / CPU | ✅ | Nvidia | Torch |
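Kernel selection is automatic at load time, but a specific kernel from the table above can also be requested explicitly. The sketch below assumes the `BACKEND` enum exported by gptqmodel (with `BACKEND.MARLIN` used as an example member); check the enum for the kernels available in your install.
from gptqmodel import BACKEND, GPTQModel
# request a specific inference kernel instead of relying on auto-selection;
# unsupported kernel/module pairs normally fall back to a compatible kernel automatically
model = GPTQModel.load(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5",
    backend=BACKEND.MARLIN,
)
print(model.tokenizer.decode(model.generate("Uncovering deep insights begins with")[0]))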
# You can install optional modules like autoround, ipex, vllm, sglang, bitblas.
# Example: pip install -v --no-build-isolation gptqmodel[vllm,sglang,bitblas]
pip install -v gptqmodel --no-build-isolation
uv pip install -v gptqmodel --no-build-isolation
# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel
# python3-dev is required
apt install python3-dev
# ninja is required to speed up kernel compile by many factors
pip install ninja
# pip: compile and install
# You can install optional modules like vllm, sglang, bitblas.
# Example: pip install -v --no-build-isolation .[vllm,sglang,bitblas]
pip install -v . --no-build-isolation
Three-line api to use GPT-QModel for gptq model inference:
from gptqmodel import GPTQModel
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
To use models from ModelScope instead of HuggingFace Hub, set an environment variable:
export GPTQMODEL_USE_MODELSCOPE=True
from gptqmodel import GPTQModel
# load Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4 from modelscope
model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4")
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
# load model using above inference guide first
model.serve(host="0.0.0.0",port="12345")
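Once `model.serve()` is running, the endpoint can be queried like any OpenAI-compatible server. The sketch below is illustrative only: the `/v1/chat/completions` route and payload shape are assumptions based on OpenAI API conventions, not taken from GPT-QModel docs.
# hypothetical client for the OpenAI-compatible endpoint started by model.serve()
# the /v1/chat/completions route and payload shape are assumed, following OpenAI API conventions
import json
import urllib.request

payload = {
    "model": "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5",
    "messages": [{"role": "user", "content": "Uncovering deep insights begins with"}],
}
request = urllib.request.Request(
    "http://127.0.0.1:12345/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["choices"][0]["message"]["content"])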
Basic example of using GPT-QModel to quantize an llm model:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"
calibration_dataset = load_dataset(
"allenai/c4",
data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
).select(range(1024))["text"]
quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load(model_id, quant_config)
# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=1)
model.save(quant_path)
Quantization using GPTQ V2* (Experimental, not MoE compatible, and results may not be better than v1)
Enable GPTQ v2 quantization by setting `v2 = True`.
# Note v2 is currently experimental, not MoE compatible, and requires 2-4x more vram to execute
# We have many reports of v2 not working better or exceeding v1 so please use for testing only
# If oom on 1 gpu, please set CUDA_VISIBLE_DEVICES=0,1 to 2 gpu and gptqmodel will auto use second gpu
quant_config = QuantizeConfig(bits=4, group_size=128, v2=True)
Llama 3.1 8B-Instruct quantized using `test/models/test_llama3_2.py`:
Method | Bits/Group Size | ARC_CHALLENGE | GSM8K_Platinum_COT |
---|---|---|---|
GPTQ | 4 / 128 | 49.15 | 48.30 |
GPTQ v2 | 4 / 128 | 49.74 +1.20% | 61.46 +27.25% |
GPTQ | 3 / 128 | 39.93 | 43.26 |
GPTQ v2 | 3 / 128 | 41.13 +3.01% | 50.54 +16.83% |
# test post-quant inference
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output
GPT-QModel now supports EoRA, a LoRA method that can further improve the accuracy of the quantized model.
# higher rank improves accuracy at the cost of vram usage
# suggestion: test rank 64 and 32 before 128 or 256 as latter may overfit while increasing memory usage
eora = Lora(
# for eora generation, path is adapter save path; for load, it is loading path
path=f"{quant_path}/eora_rank32",
rank=32,
)
# provide a previously gptq quantized model path
GPTQModel.adapter.generate(
adapter=eora,
model_id_or_path=model_id,
quantized_model_id_or_path=quant_path,
calibration_dataset=calibration_dataset,
calibration_dataset_concat_size=0,
)
# post-eora inference
model = GPTQModel.load(
model_id_or_path=quant_path,
adapter=eora
)
tokens = model.generate("Capital of France is")[0]
result = model.tokenizer.decode(tokens)
print(f"Result: {result}")
# For more details on EoRA, please see GPTQModel/examples/eora
# Please use the benchmark tools later in this README to evaluate EoRA effectiveness
For more advanced features of model quantization, please refer to this script
Read the `gptqmodel/models/llama.py` code, which explains in detail via comments how model support is defined. Use it as a guide when submitting PRs for new models. Most models follow the same pattern.
GPTQModel inference is integrated into both `lm-eval` and `evalplus`. We highly recommend avoiding `ppl` and using `lm-eval`/`evalplus` to validate post-quantization model quality. `ppl` should only be used for regression tests and is not a good indicator of model output quality.
# gptqmodel is integrated into lm-eval >= v0.4.7
pip install "lm-eval>=0.4.7"
# gptqmodel is integrated into evalplus[main]
pip install -U "evalplus @ git+https://github.com/evalplus/evalplus"
Below is a basic sample using the `GPTQModel.eval` API:
from gptqmodel import GPTQModel
from gptqmodel.utils.eval import EVAL
model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1"
# Use `lm-eval` as framework to evaluate the model
lm_eval_data = GPTQModel.eval(model_id,
framework=EVAL.LM_EVAL,
tasks=[EVAL.LM_EVAL.ARC_CHALLENGE])
# Use `evalplus` as framework to evaluate the model
evalplus_data = GPTQModel.eval(model_id,
framework=EVAL.EVALPLUS,
tasks=[EVAL.EVALPLUS.HUMAN])
`QuantizeConfig.dynamic` is a dynamic control which allows specific matching modules to be skipped for quantization (negative matching), or to have a unique `[bits, group_size, sym, desc_act, mse, pack_dtype]` property override per matching module vs the base `QuantizeConfig` (positive match with override).
Sample `QuantizeConfig.dynamic` usage:
dynamic = {
# `.*\.` matches the layers_node prefix
# layer index start at 0
# positive match: layer 19, gate module
r"+:.*\.18\..*gate.*": {"bits": 4, "group_size": 32},
    # positive match: layer 20, gate module (prefix defaults to positive if missing)
r".*\.19\..*gate.*": {"bits": 8, "group_size": 64},
# negative match: skip layer 21, gate module
r"-:.*\.20\..*gate.*": {},
# negative match: skip all down modules for all layers
r"-:.*down.*": {},
}
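The `dynamic` dict is then passed to `QuantizeConfig` and used like any other quantization run. A minimal sketch, assuming `dynamic` is accepted as a constructor argument and reusing `model_id`, `calibration_dataset`, and `quant_path` from the quantization example above:
# per-module overrides applied on top of the base 4-bit / group_size=128 config
quant_config = QuantizeConfig(bits=4, group_size=128, dynamic=dynamic)

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset, batch_size=1)
model.save(quant_path)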
Group Aware Reordering (GAR) is an enhanced activation reordering scheme designed to significantly improve the accuracy of quantized models without incurring additional inference overhead. Unlike traditional activation reordering, GAR restricts permutations to within individual groups or rearrangements of entire groups. This ensures each group's associated scales and zero-points remain efficiently accessible during inference, thereby avoiding any inference-time overhead.
How to enable GAR:
Set the `act_group_aware` parameter to `True` and disable the default activation reordering by setting `desc_act` to `False` in your `QuantizeConfig`. For example:
quant_config = QuantizeConfig(bits=4, group_size=128, act_group_aware=True)
- GPTQ v2: set `v2=True` in quantization config.
- GPTQ (v1): IST-DASLab, main-author: Elias Frantar, arXiv:2210.17323
- GPTQ (v2*): Yale Intelligent Computing Lab, main-author: Yuhang Li, arXiv:2504.02692. The v2 naming is by the Yale author and is not endorsed by the original GPTQ authors.
- QQQ: Meituan, main-author Ying Zhang, arXiv:2406.09904
- EoRA: Nvidia, main-author: Shih-Yang Liu, arXiv preprint arXiv:2410.21271.
- GAR: Intel, main-author: T Gafni, A Karnieli, Y Hanani, Paper
- AWQ: main-authors: Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song
# GPT-QModel
@misc{qubitium2024gptqmodel,
author = {ModelCloud.ai and qubitium@modelcloud.ai},
title = {GPT-QModel},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/modelcloud/gptqmodel}},
note = {Contact: qubitium@modelcloud.ai},
year = {2024},
}
# GPTQ
@article{frantar-gptq,
title={{GPTQ}: Accurate Post-training Compression for Generative Pretrained Transformers},
author={Elias Frantar and Saleh Ashkboos and Torsten Hoefler and Dan Alistarh},
journal={arXiv preprint arXiv:2210.17323},
year={2022}
}
# EoRA
@article{liu2024eora,
title={EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},
author={Liu, Shih-Yang and Yang, Huck and Wang, Chien-Yi and Fung, Nai Chit and Yin, Hongxu and Sakr, Charbel and Muralidharan, Saurav and Cheng, Kwang-Ting and Kautz, Jan and Wang, Yu-Chiang Frank and others},
journal={arXiv preprint arXiv:2410.21271},
year={2024}
}
# Group Aware Reordering (GAR)
@article{gar,
title={Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference, CVPRW 2025.},
author={T. Gafni, A. Karnieli, Y. Hanani},
journal={arXiv preprint arXiv:2505.14638},
year={2025}
}
# GPTQ Marlin Kernel
@article{frantar2024marlin,
title={MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models},
author={Frantar, Elias and Castro, Roberto L and Chen, Jiale and Hoefler, Torsten and Alistarh, Dan},
journal={arXiv preprint arXiv:2408.11743},
year={2024}
}
# QQQ
@article{zhang2024qqq,
title={QQQ: Quality Quattuor-Bit Quantization for Large Language Models},
author={Ying Zhang and Peng Zhang and Mincong Huang and Jingyang Xiang and Yujie Wang and Chao Wang and Yineng Zhang and Lei Yu and Chuan Liu and Wei Lin},
journal={arXiv preprint arXiv:2406.09904},
year={2024}
}
# AWQ
@article{lin2023awq,
title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
journal={arXiv},
year={2023}
}
# GPTQ v2
@article{li2025gptqv2,
title={GPTQv2: Efficient Finetuning-Free Quantization for Asymmetric Calibration},
author={Yuhang Li and Ruokai Yin and Donghyun Lee and Shiting Xiao and Priyadarshini Panda},
journal={arXiv preprint arXiv:2504.02692},
year={2025}
}