Title: KeyError: 'packing' in Quantizer.dequantize after save/load round-trip with nbits=8
Summary
Models quantized with nbits=8, saved via HF save_pretrained, and reloaded via from_pretrained crash on the first forward pass with KeyError: 'packing'. The same model works fine when used in-memory without the save/load round-trip. The bug reproduces for any 8-bit quantized model; I hit it with google/gemma-4-E4B-it via the transformers SinqConfig integration (huggingface/transformers#46050), but the fault is in sinq itself.
Reproduction
from transformers import AutoProcessor, AutoModelForCausalLM, SinqConfig

model_id = 'google/gemma-4-E4B-it'
save_dst = './gemma-4-E4B-it-sinq/'
quant_cfg = SinqConfig(
    nbits=8, group_size=64, tiling_mode='2D', method='sinq',
    modules_to_not_convert=["lm_head", "model.audio_tower"],
)

# Quantize + save
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map='cpu', quantization_config=quant_cfg,
)
model.save_pretrained(save_dst)

# Reload + run
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(save_dst, device_map='cpu')
inputs = processor.apply_chat_template(
    [{'role': 'user', 'content': 'this is a test.'}],
    add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors='pt',
).to(model.device)
model.generate(**inputs, max_new_tokens=32)  # KeyError: 'packing'
Traceback tail:
  File ".../sinq/sinqlinear.py", line 329, in dequantize
    W_est = Quantizer.dequantize(W_q, meta, use_unpack_kernel=self.use_unpack_kernel)
  File ".../sinq/quantizer.py", line 246, in dequantize
    if meta["packing"]:
       ~~~~^^^^^^^^^^^
KeyError: 'packing'
Root cause
Two sites interact:
1. SINQLinear.load_state_dict (sinq/sinqlinear.py:430-431) drops a stale "packing" flag when the stored tensor's element count already matches the unpacked size:

       if self.meta.get("packing") and numel == expected_unpacked:
           self.meta.pop("packing", None)

   For nbits=8, expected_packed = N*K*8//8 = N*K = expected_unpacked, so this branch always fires and the "packing" key is always removed on reload. (For nbits=4 the two sizes differ, the key survives, and this path is fine.)

2. Quantizer.dequantize (sinq/quantizer.py:246) uses bracket access on a key the rest of the package treats as optional:

       if meta["packing"]:  # KeyError after step 1
           ...
           W_r = cls.unpack[meta["packing"]](W_q, dtype=compute_dtype)
       else:
           W_r = W_q.to(compute_dtype)

   The else branch is the correct one for unpacked 8-bit tensors; only the bracket access on the if line crashes before we get there.
Note that every other consumer in sinqlinear.py (lines 184, 298, 430) already guards with meta.get("packing"). Line 246 of quantizer.py is the lone outlier.
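The size arithmetic behind step 1 can be checked without sinq installed. The following is a minimal sketch, not sinq's actual code: it only mirrors the two lines quoted above, assumes the stored tensor holds N*K*nbits//8 one-byte elements, and uses a made-up flag value ("u8") and made-up helper names.

```python
def packed_numel(n_rows: int, n_cols: int, nbits: int) -> int:
    """Storage elements needed for n_rows*n_cols codes, assuming 8-bit storage words."""
    return n_rows * n_cols * nbits // 8


def packing_flag_survives(n_rows: int, n_cols: int, nbits: int) -> bool:
    """Replay the quoted load_state_dict check on a plain dict (hypothetical helper)."""
    meta = {"packing": "u8"}  # hypothetical flag value, for illustration only
    numel = packed_numel(n_rows, n_cols, nbits)  # element count of the stored tensor
    expected_unpacked = n_rows * n_cols
    # Same condition as the snippet quoted from sinqlinear.py:430-431.
    if meta.get("packing") and numel == expected_unpacked:
        meta.pop("packing", None)
    return "packing" in meta


for nbits in (4, 8):
    print(f"nbits={nbits}: packing key survives reload -> "
          f"{packing_flag_survives(128, 64, nbits)}")
```

Under these assumptions the key survives for nbits=4 and is dropped for nbits=8, which matches the behaviour described above.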
Proposed fix
--- a/sinq/quantizer.py
+++ b/sinq/quantizer.py
@@ -243,7 +243,7 @@ class Quantizer:
         compute_dtype = meta.get("compute_dtype", torch.float16)
         # 1) Unpack to per-element codes
-        if meta["packing"]:
+        if meta.get("packing"):
             if meta.get("view_as_float", False):
                 W_q = W_q.view(meta["unpack_view_dtype"])
             W_r = cls.unpack[meta["packing"]](W_q, dtype=compute_dtype)
I've verified this locally against the repro above — inference completes and outputs are sensible. Happy to open a PR with the patch plus a regression test that exercises an 8-bit quantize → save → load → forward round-trip if it would help.
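As a rough illustration of what the regression test would pin down, here is a self-contained sketch that contrasts bracket access with .get on a meta dict that has lost its "packing" key. Both functions are stand-ins written for this issue, not sinq's Quantizer.dequantize; the list-of-floats "dequantization" and the meta contents are assumptions chosen to keep the example dependency-free.

```python
def dequantize_unguarded(w_q, meta):
    """Stand-in for the current behaviour: bracket access on an optional key."""
    if meta["packing"]:  # raises KeyError when the key was popped on reload
        raise NotImplementedError("unpack path is not modelled in this sketch")
    return [float(v) for v in w_q]


def dequantize_guarded(w_q, meta):
    """Stand-in for the proposed behaviour: a missing key means 'not packed'."""
    if meta.get("packing"):
        raise NotImplementedError("unpack path is not modelled in this sketch")
    return [float(v) for v in w_q]


# Meta as it might look after an 8-bit reload: no "packing" entry (assumed contents).
meta_after_reload = {"compute_dtype": "float16"}
codes = [0, 127, 255]

try:
    dequantize_unguarded(codes, meta_after_reload)
except KeyError as exc:
    print("unguarded access failed with:", repr(exc))

print("guarded access returned:", dequantize_guarded(codes, meta_after_reload))
```

A real test would replace the stand-ins with an actual 8-bit quantize, save, load, and forward on a small module.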
Environment
- sinq: 0.2.0
- transformers: 5.8.1
- torch: 2.11.0+cu130
- Python: 3.12.13
- Platform: Linux x86_64
- GPU: NVIDIA RTX 3070 Ti (issue is CPU-path; GPU not exercised)
Related: huggingface/transformers#46050