I tried quantizing the same model with the same quantize config. With AutoGPTQ I can finish in 15~20 minutes, but with this library I need over 2 hours...

Please use and increase the calibration batch size. auto-gptq has broken batch support for calibration; GPTQModel has batching support, but you need to set the batch size to a value appropriate to your GPU capability and VRAM size.

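In case a concrete example helps, here is a minimal sketch of setting the calibration batch size. It assumes the README-style GPTQModel API (`GPTQModel.load`, `QuantizeConfig`, and a `batch_size` argument to `quantize`); the model id, calibration data, and `batch_size` value are placeholders to adapt to your own GPU and VRAM.

```python
# Minimal sketch, assuming the README-style GPTQModel API; adjust the model id,
# calibration data, and batch_size for your GPU and VRAM size.
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "Qwen/QwQ-32B"           # example model; use your own
quant_path = "QwQ-32B-gptq-4bit"    # output directory for the quantized model

# Small calibration set; a larger set improves quality but takes longer.
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)

# batch_size is the knob discussed above: higher values speed up calibration
# but use more VRAM. Start small and increase until VRAM is nearly full.
model.quantize(calibration_dataset, batch_size=4)

model.save(quant_path)
```
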
@CHNtentes Was the speed issue resolved by increasing the batch size?

@CHNtentes In addition to the batch size suggestion above: quantization time is a bit slower than AutoGPTQ at the moment, but we hope to fix that in our next release. More importantly, GPTQModel's quants have consistently lower error loss, which is critical for quantization quality.

@CHNtentes The memory usage issue has been fixed in https://github.com/ModelCloud/GPTQModel/releases/tag/v1.6.0. VRAM usage is now 35% lower than before and 15% lower than AutoGPTQ. In our per-layer test quants with QwQ 32B, using the same calibration data, GPTQModel has consistently lower error_loss, which is critical for quantization. Speed is about 7.5% slower than AutoGPTQ for QwQ 32B, but with lower VRAM usage and higher quality quants it is worth the one-off cost. We will try to improve the speed in our next release.

We now use 90% less memory than AutoGPTQ when quantizing large models, and quantization should also be much faster with multi-GPU acceleration.