Add Gemma 4 31B-IT model, export, and quantization framework for ExecuTorch #19213
mergennachin wants to merge 11 commits into main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19213
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 5 Unrelated Failures
As of commit 5b54f50 with merge base 8a397b4:
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
Pull request overview
Adds a new Gemma 4 31B-IT example pipeline for ExecuTorch (CUDA backend), including a packing-agnostic quantization format + recipes, CUDA packers, export/inference scripts, a C++ runner, and CI coverage.
Changes:
- Introduces examples/models/gemma4_31b/quant/ with a recipe → quantize → serialize → pack flow, plus unit tests (a hedged sketch of the recipe idea follows this list).
- Adds the Gemma 4 31B model implementation with hybrid attention and a sliding-window KV cache, plus export and eager inference entrypoints.
- Adds CUDA runner build targets and runs Gemma 4 31B tests in the CUDA GitHub Actions workflow.
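The recipe layer is the model-facing surface of that flow. A minimal sketch of the declarative matching idea: class names mirror the PR's QuantConfig/QuantRule/QuantRecipe, but the exact fields and semantics here are assumptions, not the PR's actual API.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class QuantConfig:
    bits: int                # storage width, e.g. 4 or 8
    group_size: int = 128    # quantization group granularity
    algo: str = "min_max"    # or "hqq"

@dataclass
class QuantRule:
    pattern: str             # regex matched against parameter FQNs
    config: QuantConfig
    # Optional extra filter on the layer index, e.g. edge layers only.
    layer_filter: Optional[Callable[[int], bool]] = None

@dataclass
class QuantRecipe:
    rules: List[QuantRule] = field(default_factory=list)

    def match(self, fqn: str) -> Optional[QuantConfig]:
        """Return the config of the first matching rule, or None (keep bf16)."""
        for rule in self.rules:
            m = re.search(rule.pattern, fqn)
            if m is None:
                continue
            if rule.layer_filter is not None:
                layer = int(m.group("layer")) if "layer" in m.groupdict() else -1
                if not rule.layer_filter(layer):
                    continue
            return rule.config
        return None

# A "sensitive"-flavored toy recipe: INT8 for edge-layer v_proj, INT4 elsewhere.
recipe = QuantRecipe(rules=[
    QuantRule(r"layers\.(?P<layer>\d+)\..*v_proj", QuantConfig(bits=8),
              layer_filter=lambda l: l in (0, 1)),
    QuantRule(r"(q|k|v|o)_proj|down_proj|up_proj", QuantConfig(bits=4, algo="hqq")),
])
assert recipe.match("model.layers.0.self_attn.v_proj.weight").bits == 8
assert recipe.match("model.layers.5.self_attn.v_proj.weight").bits == 4
```

First-match-wins ordering is what lets a narrow "sensitive" rule shadow the broad INT4 default.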
Reviewed changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| examples/models/gemma4_31b/test_pipeline.py | CPU-only integration tests for quantize/save/load roundtrip and tiny checkpoint fixtures. |
| examples/models/gemma4_31b/test_cuda_pipeline.py | CUDA integration tests for pack/infer/export on a tiny model. |
| examples/models/gemma4_31b/sampler.py | GPU-side Gumbel-max sampler used by the exported model. |
| examples/models/gemma4_31b/quantize_and_save.py | CLI to quantize HF checkpoints and write packing-agnostic safetensors bundles + production recipes. |
| examples/models/gemma4_31b/quant/test_serialize.py | Unit tests for nibble packing and safetensors serialization format. |
| examples/models/gemma4_31b/quant/test_recipe.py | Unit tests for regex/layer-filter recipe matching + production recipe regression tests. |
| examples/models/gemma4_31b/quant/test_quantize.py | Unit tests for quantize_weight and quantize_model APIs (CPU + CUDA/HQQ paths). |
| examples/models/gemma4_31b/quant/test_pack_cuda.py | CUDA unit tests for int4/int8 packers and load-and-pack dispatcher behavior. |
| examples/models/gemma4_31b/quant/serialize.py | Canonical quantized weight format + safetensors save/load with versioned metadata. |
| examples/models/gemma4_31b/quant/recipe.py | Declarative quantization recipe/rule objects with regex FQN matching and optional layer filters. |
| examples/models/gemma4_31b/quant/quantize.py | Implements min-max and HQQ quantization into canonical (packing-free) representations. |
| examples/models/gemma4_31b/quant/pack_cuda.py | CUDA-specific packers converting canonical weights into torchao runtime tensor subclasses. |
| examples/models/gemma4_31b/quant/pack.py | Backend-agnostic pack dispatcher that assigns weights/buffers and calls module-type packers. |
| examples/models/gemma4_31b/quant/init.py | Public API re-exports for quant/ package. |
| examples/models/gemma4_31b/quant/README.md | Documentation of the quant framework, data flow, and backend extension points. |
| examples/models/gemma4_31b/model.py | Gemma 4 31B model definition, HF checkpoint loader, ring KV cache for sliding layers, runtime buffer materialization. |
| examples/models/gemma4_31b/model.md | Architecture/design notes for model + quant pipeline. |
| examples/models/gemma4_31b/main.cpp | ExecuTorch CUDA runner driving exported prefill/decode and HF tokenizer decoding. |
| examples/models/gemma4_31b/inference.py | Eager CUDA inference script loading prequantized weights, packing, and generating text. |
| examples/models/gemma4_31b/export.py | Export + lowering pipeline (decode + prefill methods) targeting the CUDA backend. |
| examples/models/gemma4_31b/init.py | Package marker for the new model example. |
| examples/models/gemma4_31b/README.md | User-facing instructions for quantize/export/inference/build/run workflows. |
| examples/models/gemma4_31b/CMakePresets.json | CMake preset for building the Gemma 4 31B CUDA runner. |
| examples/models/gemma4_31b/CMakeLists.txt | CMake build for the Gemma 4 31B runner, linking ExecuTorch + CUDA backend + tokenizer. |
| examples/models/gemma4/text_decoder/gemma4_norm.py | Replaces transformers RMSNorm dependency with a self-contained implementation (see the sketch after this table). |
| examples/models/gemma4/text_decoder/init.py | Exposes attention/norm/MLP primitives used by gemma4_31b for shared numerically-sensitive ops. |
| Makefile | Adds gemma4_31b-cuda build target. |
| .github/workflows/cuda.yml | Adds Gemma 4 31B quant + pipeline tests to the CUDA CI job. |
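For the gemma4_norm.py row above: Gemma-family models conventionally compute RMSNorm in float32 and scale by (1 + weight). A self-contained sketch under the assumption that Gemma 4 keeps this convention; this is not the PR's file, just the standard pattern it likely mirrors.

```python
import torch
import torch.nn as nn

class Gemma4RMSNorm(nn.Module):
    """RMSNorm in the Gemma style: normalize in float32, scale by (1 + weight).
    Whether Gemma 4 keeps exactly this convention is an assumption based on
    earlier Gemma models."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # zero-init => identity scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x32 = x.float()
        x32 = x32 * torch.rsqrt(x32.pow(2).mean(-1, keepdim=True) + self.eps)
        return (x32 * (1.0 + self.weight.float())).type_as(x)
```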
Pull request overview
Adds a full Gemma 4 31B-IT example to ExecuTorch, including a new packing-agnostic quantization framework, CUDA packing/export/inference tooling, GGUF import support, a C++ CUDA runner, and a comprehensive test suite integrated into CI.
Changes:
- Introduces the examples/models/gemma4_31b/quant/ canonical quantization framework (recipe → quantize → serialize → pack) with CUDA packers and safetensors persistence.
- Adds the Gemma 4 31B-IT model implementation with a ring-buffer KV cache for sliding-window layers, plus export/eager inference/runner scripts.
- Adds unit + integration tests (CPU and CUDA) and runs them in the CUDA CI workflow.
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| examples/models/gemma4_31b/tests/test_pipeline.py | CPU-only integration tests for quantize→save→load roundtrip and fixtures for CUDA tests. |
| examples/models/gemma4_31b/tests/test_cuda_pipeline.py | CUDA integration tests for packing, generation, chunked prefill, and export. |
| examples/models/gemma4_31b/sampler.py | GPU-side Gumbel-max sampler used by the exported model for on-device sampling. |
| examples/models/gemma4_31b/quantize_and_save.py | CLI to quantize HF checkpoints and persist packing-agnostic safetensors checkpoints. |
| examples/models/gemma4_31b/quant/tests/test_serialize.py | Unit tests for canonical serialize/deserialize and nibble pack/unpack. |
| examples/models/gemma4_31b/quant/tests/test_recipe.py | Unit tests for regex/layer-filter recipe matching + production recipe regression tests. |
| examples/models/gemma4_31b/quant/tests/test_quantize.py | Unit tests for canonical quantize/dequantize APIs and model-walking quantization. |
| examples/models/gemma4_31b/quant/tests/test_pack_cuda.py | CUDA unit tests for packing canonical weights into CUDA runtime formats and dispatch. |
| examples/models/gemma4_31b/quant/tests/test_gguf.py | Unit tests validating GGUF Q4_K/Q6_K unpacking against reference formulas. |
| examples/models/gemma4_31b/quant/serialize.py | Canonical CQW representation + safetensors format, nibble packing, save/load. |
| examples/models/gemma4_31b/quant/recipe.py | Declarative quantization recipe/rule/config structures and matching logic. |
| examples/models/gemma4_31b/quant/quantize.py | Canonical quantization implementations (min_max, HQQ) + per-model quantization walk. |
| examples/models/gemma4_31b/quant/pack_cuda.py | CUDA packers from canonical weights to tinygemm/int8 subclass runtime formats. |
| examples/models/gemma4_31b/quant/pack.py | Backend-agnostic pack dispatcher grouping weights per module and applying packers. |
| examples/models/gemma4_31b/quant/gguf.py | GGUF tensor unpacker/streamer to canonical CQW or dense tensors. |
| examples/models/gemma4_31b/quant/init.py | Public API re-exports for quant/ package. |
| examples/models/gemma4_31b/quant/README.md | Documentation of the quant framework layers, data flow, and on-disk format. |
| examples/models/gemma4_31b/model.py | Gemma 4 31B-IT model definition, ring-buffer KV cache, HF load/remap, runtime buffer materialization. |
| examples/models/gemma4_31b/model.md | Architecture/design notes including attention flavors, caching strategy, and export methods. |
| examples/models/gemma4_31b/main.cpp | CUDA ExecuTorch runner driving exported prefill/decode with tokenizer integration. |
| examples/models/gemma4_31b/inference.py | Eager CUDA inference path (load/pack/materialize/compile + generate loop). |
| examples/models/gemma4_31b/gguf_loader.py | Streams GGUF weights, remaps keys, handles tied weights, and packs into runtime model. |
| examples/models/gemma4_31b/export.py | Export+lower pipeline producing .pte/.ptd with decode/prefill methods and CUDA lowering. |
| examples/models/gemma4_31b/init.py | Package marker for the Gemma 4 31B example. |
| examples/models/gemma4_31b/README.md | User-facing docs for quantize/export/inference/build/run flows. |
| examples/models/gemma4_31b/CMakePresets.json | CMake workflow presets to build the CUDA runner. |
| examples/models/gemma4_31b/CMakeLists.txt | CMake build for the Gemma 4 31B CUDA runner binary. |
| examples/models/gemma4/text_decoder/gemma4_norm.py | Replace transformers RMSNorm dependency with a self-contained implementation. |
| examples/models/gemma4/text_decoder/init.py | Export shared primitives (attention helpers, KV cache, MLP, norm) for reuse by gemma4_31b. |
| Makefile | Add gemma4_31b-cuda build target and help entry. |
| .github/workflows/cuda.yml | Run Gemma 4 31B quant and pipeline tests as part of CUDA unit test job. |
@claude Review this PR
Claude finished @mergennachin's task in 2m 54s. View job

PR Review: Gemma 4 31B-IT Model, Export, and Quantization Framework

Overall Assessment

This is a well-structured, substantial addition (~5300 lines) that introduces a full Gemma 4 31B-IT pipeline. The architecture is clean: the quantization framework separates concerns nicely (recipe → quantize → serialize → pack), and the code is well organized for reuse with future models. The test coverage is good (78 tests across multiple files), and the CI integration is appropriate. I've reviewed the full diff and have the following feedback.

Architecture & Design (Positive)

Issues

1.
@claude Review this PR
Claude finished @mergennachin's task in 2m 30s. View job

PR Review: Gemma 4 31B-IT Model, Export, and Quantization Framework

Overall Assessment

This is a high-quality, well-architected PR (~5500 lines) adding a full Gemma 4 31B-IT pipeline. The separation of concerns is clean: recipe declares intent, quantize produces canonical form, serialize persists, pack converts to backend format. Test coverage is comprehensive (78 tests), and the PR addresses the previous review's concerns well (added the

Architecture & Design (Strengths)

Issues & Suggestions

1.
| File | Scope | Key tests |
|---|---|---|
| test_recipe.py | Regex matching, layer filters, production recipe regression | 14 test cases |
| test_quantize.py | quantize_weight/quantize_model API contract, HQQ paths | 12 test cases |
| test_serialize.py | Nibble packing (sketched below), save/load, 3D tensors, versioned headers | ~15 test cases |
| test_pack_cuda.py | Module-type dispatch, matmul accuracy, mixed precision, MoE | ~20 test cases |
| test_gguf.py | Q4_K/Q6_K unpacking vs reference formulas, serialize roundtrip | 7 test cases |
| test_pipeline.py | Quantize→save→load integration, RingKVCache isolation | 7 test cases |
| test_cuda_pipeline.py | Pack→generate, chunked prefill correctness, export e2e | 4 test cases |
Good coverage. The RingKVCache unit tests (test_pipeline.py:212-266) are a welcome addition addressing the earlier review's gap.
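For the test_serialize.py nibble-packing row: packing two 4-bit values per byte is the standard scheme. A minimal sketch of the kind of pack/unpack roundtrip those tests exercise; this is an illustration of the scheme, not the PR's code.

```python
import torch

def pack_nibbles(q: torch.Tensor) -> torch.Tensor:
    """Pack uint8 values in [0, 15] two-per-byte (low nibble first)."""
    assert q.dtype == torch.uint8 and q.shape[-1] % 2 == 0
    lo, hi = q[..., 0::2], q[..., 1::2]
    return lo | (hi << 4)

def unpack_nibbles(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_nibbles: recover the original uint8 tensor."""
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    return torch.stack((lo, hi), dim=-1).flatten(-2)

q = torch.randint(0, 16, (4, 8), dtype=torch.uint8)
assert torch.equal(unpack_nibbles(pack_nibbles(q)), q)
```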
Summary
This is a well-executed, production-quality addition. The architecture decisions are sound, the code is clean and readable (comments only where non-obvious), and the test coverage is thorough. The main items from previous reviews have been addressed. No blocking issues remain.
Actionable items (non-blocking):
- Comment the group_size=5376 coupling to hidden_size in quantize_and_save.py:51.
- Consider adding a @torch.no_grad() decorator to Gemma4_31B.forward since this is inference-only.
Pull request overview
Adds a full Gemma 4 31B-IT text-only pipeline to the ExecuTorch examples, including a reusable packing-agnostic quantization framework (recipe/quantize/serialize/pack), GGUF import, CUDA packing/export/inference flows, and a CUDA runner, with CI coverage.
Changes:
- Introduces the examples/models/gemma4_31b/ model implementation (ring-buffer KV cache), export/inference scripts, GGUF loader, and C++ CUDA runner + build targets.
- Adds a new quant/ framework (recipes, min-max + HQQ quantization, safetensors format, CUDA packing, GGUF Q4_K/Q6_K unpack).
- Adds unit/integration tests and wires them into the CUDA GitHub Actions workflow.
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| examples/models/gemma4_31b/tests/test_pipeline.py | CPU-only pipeline + RingKVCache tests for quant/save/load and cache behavior |
| examples/models/gemma4_31b/tests/test_cuda_pipeline.py | CUDA integration tests for pack/infer/export + chunked prefill equivalence |
| examples/models/gemma4_31b/sampler.py | GPU-side Gumbel-max sampler (mirrors Qwen sampler behavior) |
| examples/models/gemma4_31b/quantize_and_save.py | CLI to quantize HF checkpoint and save canonical safetensors checkpoint |
| examples/models/gemma4_31b/quant/tests/test_serialize.py | Unit tests for canonical format + nibble packing + safetensors I/O |
| examples/models/gemma4_31b/quant/tests/test_recipe.py | Unit tests for regex/layer-filter recipe matching + production recipe regression |
| examples/models/gemma4_31b/quant/tests/test_quantize.py | Unit tests for min-max + HQQ quantize/dequantize and quantize_model behavior |
| examples/models/gemma4_31b/quant/tests/test_pack_cuda.py | CUDA unit tests for packers (int4 tinygemm, int8 intx, dispatch/grouping) |
| examples/models/gemma4_31b/quant/tests/test_gguf.py | Unit tests for GGUF Q4_K/Q6_K unpacking and serialize roundtrip |
| examples/models/gemma4_31b/quant/serialize.py | CanonicalQuantizedWeight + serialize/deserialize + safetensors save/load |
| examples/models/gemma4_31b/quant/recipe.py | QuantConfig/QuantRule/QuantRecipe declarative matching logic |
| examples/models/gemma4_31b/quant/quantize.py | min-max + HQQ quantize_weight/dequantize_weight + per-model quantization |
| examples/models/gemma4_31b/quant/pack_cuda.py | CUDA packers for Linear/Embedding and load+pack convenience wrapper |
| examples/models/gemma4_31b/quant/pack.py | Backend-agnostic pack_model/pack_one dispatch + grouping by parent module |
| examples/models/gemma4_31b/quant/gguf.py | GGUF tensor unpack + streaming iterator to canonical representation |
| examples/models/gemma4_31b/quant/init.py | Public API exports for quant framework and CUDA packers |
| examples/models/gemma4_31b/quant/README.md | Framework overview, dataflow, and backend/model extension guidance |
| examples/models/gemma4_31b/model.py | Gemma4 31B model, ring-buffer KV cache, HF loader, runtime buffer materialization |
| examples/models/gemma4_31b/model.md | Architecture/design notes + export/quantization details |
| examples/models/gemma4_31b/main.cpp | CUDA runner driving exported prefill/decode methods with chunked prefill |
| examples/models/gemma4_31b/inference.py | Eager CUDA inference with torch.compile + GGUF/prequantized load paths |
| examples/models/gemma4_31b/gguf_loader.py | Streams GGUF tensors, remaps keys, handles tied embed/lm_head, packs model |
| examples/models/gemma4_31b/export.py | Export+lower (decode+prefill) to ExecuTorch with CUDA backend and metadata |
| examples/models/gemma4_31b/init.py | Package marker for gemma4_31b example |
| examples/models/gemma4_31b/README.md | End-to-end usage docs for quantize/export/infer/build/run |
| examples/models/gemma4_31b/CMakePresets.json | CMake workflow presets for building the runner (CUDA preset) |
| examples/models/gemma4_31b/CMakeLists.txt | Runner build configuration + CUDA backend linkage |
| Makefile | Adds gemma4_31b-cuda build target |
| .github/workflows/cuda.yml | Runs gemma4_31b quant + pipeline tests in CUDA CI |
| examples/models/gemma4/text_decoder/gemma4_norm.py | Replaces transformers RMSNorm import with self-contained implementation |
| examples/models/gemma4/text_decoder/init.py | Exposes additional shared primitives used by gemma4_31b model |
@claude Review this PR
Claude finished @mergennachin's task in 2m 30s. View job

PR Review: Gemma 4 31B-IT Model, Export, and Quantization Framework

Overall Assessment

This is a well-structured, high-quality addition (~5600 lines across 31 files) introducing a full Gemma 4 31B-IT text-only pipeline. The architecture is clean: the quantization framework separates concerns into orthogonal layers (recipe → quantize → serialize → pack), the model implementation is faithful to the HF reference, and the streaming GGUF support keeps peak memory tractable. Test coverage is comprehensive (78 tests), and the PR has clearly evolved through multiple review iterations addressing earlier feedback. The codebase is ready for merge with minor non-blocking suggestions below.

Architecture Strengths

Issues & Suggestions

1.
| File | Tests | Key coverage |
|---|---|---|
| quant/tests/test_recipe.py | 14 | Regex matching, layer filters, production recipe regression |
| quant/tests/test_quantize.py | 12 | quantize_weight/dequantize_weight roundtrip, HQQ paths, error cases |
| quant/tests/test_serialize.py | ~15 | Nibble pack/unpack, save/load, 3D tensors, format versioning |
| quant/tests/test_pack_cuda.py | ~20 | Module dispatch, matmul accuracy, mixed precision, MoE grouping |
| quant/tests/test_gguf.py | 7 | Q4_K/Q6_K vs reference formulas, serialize roundtrip, edge cases |
| tests/test_pipeline.py | 7 | Quantize→save→load, RingKVCache isolation, corrupted checkpoint |
| tests/test_cuda_pipeline.py | 4 | Pack→generate, chunked prefill correctness, export e2e |
The GGUF test skip handling (test_gguf.py:21-28) is properly implemented — @unittest.skipUnless(_HAS_GGUF, ...) on each class, conditional import of unpack_gguf_tensor, and CI installs gguf before running (cuda.yml:152).
The RingKVCache unit tests (test_pipeline.py:212-266) cover sequential write, wraparound, multi-token, and oversized-prefill assertion — good targeted coverage for the most subtle piece of logic.
Summary
This is production-quality work with sound architecture decisions for memory efficiency at 31B scale. The main actionable items are:
- Add <cstring> and <cmath> includes in main.cpp (fragile transitive dependency).
- Update the group_size comment in recipe.py to reflect actual supported values.
- Consider reshaping qdata in deserialize() for 8-bit weights (defensive hardening).
None of these are blocking — the code is correct as written and the invariants are upheld by the export/runner logic. Previous review iterations (symmetric INT4 dequant fix, RingKVCache assert, GGUF skip handling, allclose tolerances) have all been addressed properly.
@dataclass
class CanonicalQuantizedWeight:
There is subclass support for this "canonical" format that is later converted to specialized formats.
The basic flow is:
- On your device-specific subclass, define a class method constructor "from_{canonical_format}_tensor", e.g., Int4PreshuffledTensor.from_int4_tensor(tensor)
- Load "canonical" tensor subclass checkpoint
- Iterate through canonical tensors, applying the conversion above.
IIRC, this is the path used by torchao in vLLM (cc @jerryzh168 to confirm).
CanonicalQuantizedWeight feels like a new abstraction that we'd maintain to do the same thing?
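A minimal sketch of the conversion loop the reviewer describes, with the canonical class and converter passed in as parameters so no specific torchao entry point is assumed; Int4PreshuffledTensor.from_int4_tensor in the usage comment is the reviewer's example, and whether your torchao version exposes it under that name is an assumption to verify.

```python
from typing import Callable, Dict

import torch

def specialize_state_dict(
    state_dict: Dict[str, torch.Tensor],
    canonical_cls: type,
    converter: Callable[[torch.Tensor], torch.Tensor],
) -> Dict[str, torch.Tensor]:
    """Rewrite canonical quantized tensors into a device-specific subclass;
    dense tensors pass through unchanged."""
    return {
        fqn: converter(t) if isinstance(t, canonical_cls) else t
        for fqn, t in state_dict.items()
    }

# Usage under the reviewer's example (names assumed, check your torchao build):
#   packed = specialize_state_dict(
#       sd, Int4Tensor, Int4PreshuffledTensor.from_int4_tensor)
```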
@metascroy Good point.
I'll use Int4Tensor and IntxUnpackedToInt8Tensor, as they seem like canonical/portable formats that can be serialized to/from safetensors.
Will delete CanonicalQuantizedWeight.
@metascroy Made the change. Thanks for the recommendation.
Force-pushed from 49cfc55 to 69474e3.
@claude Review this PR
Claude encountered an error. View job

PR Review: Gemma 4 31B-IT Model, Export, and Quantization Framework
Pull request overview
Adds a new ExecuTorch example for Gemma 4 31B-IT with a reusable INT4/INT8 weight-only quantization + serialization + packing pipeline (torchao tensor subclasses + safetensors), along with export/inference tooling, a CUDA runner, and extensive test coverage.
Changes:
- Introduces the examples/models/gemma4_31b/ model implementation, export/inference scripts, GGUF loader, and CUDA runner (CMake + Make target).
- Adds the quant/ framework (recipe → quantize → pack) using torchao tensor subclasses and torchao safetensors flatten/unflatten.
- Adds CPU/CUDA pipeline tests + quant unit tests; extends the CUDA CI workflow to execute them.
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| examples/models/gemma4_31b/tests/test_pipeline.py | CPU-only integration tests for quantize/save/load + unit tests for RingKVCache and GGUF key mapping. |
| examples/models/gemma4_31b/tests/test_cuda_pipeline.py | CUDA integration tests for generate, chunked prefill correctness, and export paths. |
| examples/models/gemma4_31b/sampler.py | GPU-side Gumbel-max sampler used to keep exported programs single-output across temperatures (see the sketch after this table). |
| examples/models/gemma4_31b/quantize_and_save.py | CLI to quantize HF checkpoints on CPU and save torchao-subclass safetensors checkpoints. |
| examples/models/gemma4_31b/quant/tests/test_safetensors_roundtrip.py | Smoke tests for safetensors roundtrip of torchao tensor subclasses. |
| examples/models/gemma4_31b/quant/tests/test_recipe.py | Unit tests for QuantRecipe matching + regression tests for production recipes. |
| examples/models/gemma4_31b/quant/tests/test_quantize.py | Unit tests for quantize/dequantize behavior and error paths (incl. HQQ). |
| examples/models/gemma4_31b/quant/tests/test_pack_cuda.py | CUDA unit tests for packing INT4/INT8 weights and model-level pack/load paths. |
| examples/models/gemma4_31b/quant/tests/test_gguf.py | Tests for GGUF Q4_K/Q6_K unpacking correctness + safetensors roundtrip. |
| examples/models/gemma4_31b/quant/recipe.py | Declarative quantization recipe types (QuantConfig/Rule/Recipe) with regex + layer filtering. |
| examples/models/gemma4_31b/quant/quantize.py | Quantize/dequantize and model-walk quantization producing torchao tensor subclasses. |
| examples/models/gemma4_31b/quant/pack_cuda.py | CUDA packers for nn.Linear / nn.Embedding (INT4 tinygemm + INT8 intx pass-through). |
| examples/models/gemma4_31b/quant/pack.py | Backend-agnostic dispatch for packing state dicts into meta-built runtime models. |
| examples/models/gemma4_31b/quant/gguf.py | GGUF tensor unpacking (Q4_K/Q6_K/F16/F32) into torchao subclasses with streaming iterator. |
| examples/models/gemma4_31b/quant/init.py | Re-exports for quant package public API. |
| examples/models/gemma4_31b/quant/README.md | Documentation of the quant framework dataflow and on-disk format (torchao safetensors). |
| examples/models/gemma4_31b/model.py | Export-friendly Gemma 4 31B model with hybrid attention + ring-buffer KV cache + sampling. |
| examples/models/gemma4_31b/model.md | Architecture/design notes covering attention variants, export methods, quantization, runtime buffers. |
| examples/models/gemma4_31b/main.cpp | CUDA runner driving exported prefill/decode with chunking and BOS/EOS handling. |
| examples/models/gemma4_31b/inference.py | Eager CUDA inference path (optionally torch.compile) for prequantized or GGUF-loaded models. |
| examples/models/gemma4_31b/gguf_loader.py | Streams GGUF tensors, remaps keys, unties embed/lm_head behavior, and packs for backend. |
| examples/models/gemma4_31b/export.py | Export+lower pipeline producing shared-buffer prefill/decode methods + metadata constants. |
| examples/models/gemma4_31b/init.py | Package marker for the new example. |
| examples/models/gemma4_31b/README.md | End-user instructions and recommended workflows (quantize → export/infer). |
| examples/models/gemma4_31b/CMakePresets.json | Presets to build the Gemma 4 31B CUDA runner. |
| examples/models/gemma4_31b/CMakeLists.txt | CMake target for runner; enforces CUDA build and links required ExecuTorch extensions. |
| examples/models/gemma4/text_decoder/gemma4_norm.py | Removes transformers dependency by re-implementing Gemma4 RMSNorm in-tree. |
| examples/models/gemma4/text_decoder/init.py | Exposes additional gemma4 text_decoder primitives (attention helpers, norms, MLP). |
| Makefile | Adds gemma4_31b-cuda build target and help entry. |
| .github/workflows/cuda.yml | Runs Gemma 4 31B quant + pipeline tests in CUDA CI (installs gguf). |
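For the sampler.py row above: the Gumbel-max trick turns sampling into a pure argmax, which keeps the exported program single-output whether temperature is zero (greedy) or positive. A sketch of the idea, not the file's exact code:

```python
import torch

def gumbel_sample(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """argmax(logits/T + Gumbel noise) is an exact sample from softmax(logits/T);
    temperature == 0 degenerates to greedy argmax, so both modes return one
    token id from the same code path."""
    if temperature == 0.0:
        return torch.argmax(logits, dim=-1)
    u = torch.rand_like(logits).clamp_min(1e-20)   # U ~ Uniform(0, 1)
    gumbel = -torch.log(-torch.log(u))             # G ~ Gumbel(0, 1)
    return torch.argmax(logits / temperature + gumbel, dim=-1)
```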
Force-pushed from 69474e3 to 551f3b0.
Force-pushed from 551f3b0 to 2604159.
Pull request overview
Copilot reviewed 31 out of 31 changed files in this pull request and generated 3 comments.
Force-pushed from 2604159 to 53612c9.
Pull request overview
Copilot reviewed 30 out of 30 changed files in this pull request and generated 4 comments.
Force-pushed from b26fb75 to d488474.
Pull request overview
Copilot reviewed 35 out of 35 changed files in this pull request and generated 2 comments.
#include <cinttypes>
#include <fstream>
#include <string>
#include <vector>

#include <executorch/runtime/platform/platform.h>
#include <executorch/runtime/platform/types.h>

extern "C" void et_pal_emit_log_message(
    ET_UNUSED et_timestamp_t timestamp,
    et_pal_log_level_t level,
    const char* filename,
    ET_UNUSED const char* function,
    size_t line,
    const char* message,
    ET_UNUSED size_t length) {
  if (level < 'W') {
    return;
  }
  fprintf(stderr, "%c [%s:%zu] %s\n", (char)level, filename, line, message);
}
    GGUF names (e.g., ``blk.0.attn_q.weight``); the caller handles key
    remapping. GGUF shapes are reversed to PyTorch convention automatically.
    """
    from gguf import GGUFReader
Curious: do we need GGUF support even for enablement?
We'd like to do an apples-to-apples latency comparison on the same checkpoint.
…uTorch

Text-only export of Gemma 4 31B-IT to ExecuTorch with the CUDA backend and INT4/INT8 weight quantization via a new packing-agnostic quant/ framework.

The quant/ package separates quantization into four concerns:
- recipe.py: declarative QuantRecipe with regex FQN matching
- quantize.py: produces CanonicalQuantizedWeight (min_max, HQQ)
- serialize.py: save/load to safetensors with versioned headers
- pack.py + pack_cuda.py: per-module packer dispatch for CUDA

Two production recipes: "default" (INT4 min_max + INT8 embedding) and "sensitive" (INT8 for edge-layer v_proj/down_proj, INT4 HQQ elsewhere).

Sliding window attention uses a ring-buffer KV cache (2x window size) for the 50 sliding layers, saving memory for long sequences. The 10 full-attention layers use a standard flat KV cache.

Includes a C++ runner (main.cpp), an eager inference script, and 60+ unit and integration tests across quant/ and pipeline test files.
- Sliding window layers use RingKVCache (2x window) instead of a flat max_seq_len buffer, reducing KV cache memory for long sequences (a sketch of the ring-buffer update follows below).
- Prefill is capped to the ring buffer size; the C++ runner chunks longer prompts automatically via get_max_prefill_chunk metadata.
- Both recipes now quantize embed_tokens to INT8 per-axis (~1.4 GB savings vs bf16). The embedding packer uses IntxUnpackedToInt8Tensor, which supports gather.
- pack_model handles top-level FQNs (no parent module).
- C++ runner aligned with Qwen patterns: #ifdef guards for non-CUDA builds, better weight_sharing error handling, cudaDeviceSynchronize between prefill and decode.
- Test suite split into test_pipeline.py (CPU) and test_cuda_pipeline.py (CUDA) with shared fixtures. New chunked prefill correctness test.
- Prequantized checkpoint available at huggingface.co/SocialLocalMobile/gemma-4-31B-it-HQQ-INT4.
- Added Gemma 4 31B tests to the cuda.yml CI workflow.
- Cleaned up stale terminology, docstrings, and comments throughout.
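A minimal sketch of the ring-buffer update referenced above, assuming absolute positions wrap modulo the buffer size; names, shapes, and method signatures are illustrative, not the PR's exact API.

```python
import torch

class RingKVCache:
    """Ring-buffer KV cache for sliding-window layers; buf_size would be
    2x the window per the commit above."""

    def __init__(self, buf_size: int, n_heads: int, head_dim: int,
                 dtype: torch.dtype = torch.bfloat16):
        self.buf_size = buf_size
        self.k = torch.zeros(1, n_heads, buf_size, head_dim, dtype=dtype)
        self.v = torch.zeros(1, n_heads, buf_size, head_dim, dtype=dtype)

    def update(self, start_pos: int, k: torch.Tensor, v: torch.Tensor):
        seq_len = k.shape[2]
        # Mirrors the assert the PR adds: a prefill chunk may not exceed the
        # ring buffer, which is why the runner chunks long prompts.
        assert seq_len <= self.buf_size
        slots = (start_pos + torch.arange(seq_len)) % self.buf_size
        self.k[:, :, slots] = k  # absolute positions wrap modulo buf_size
        self.v[:, :, slots] = v
        return self.k, self.v
```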
- quant/gguf.py: unpack Q4_K/Q6_K GGUF blocks to CanonicalQuantizedWeight, with iter_gguf_tensors for streaming (low peak memory). Validated against original bf16 weights (Q4_K: 7.9%, Q6_K: 1.9% error).
- gguf_loader.py: Gemma 4 31B GGUF key mapping + load_gguf_model. Handles tied embed/lm_head: embedding dequantized to bf16 (gather), lm_head keeps Q4_K (tinygemm matmul).
- export.py and inference.py: --gguf flag for direct GGUF file loading.
- quant/quantize.py: dequantize_weight (inverse of quantize_weight).
- quant/pack.py: pack_one for single-weight streaming; pack_model delegates to pack_one for unquantized weights and groups quantized weights by parent for multi-weight modules (MoE-compatible).
- quant/serialize.py: CanonicalQuantizedWeight.__post_init__ validation (dtype, shape, symmetric/zero consistency).
- Tests moved to tests/ folders (quant/tests/ and tests/).
- dequantize_weight now subtracts 8 from symmetric 4-bit qdata (stored as unsigned [0, 15]) before scaling, matching the quantize_weight shift (sketched below).
- Guard test_gguf.py with skipUnless so CI doesn't break without gguf.
- Install gguf in cuda.yml for GGUF test coverage.
- Use torch.allclose instead of torch.equal for chunked prefill logit comparison to avoid CUDA FP flakiness.
- Fix Usage docblock paths in test_pipeline.py and test_cuda_pipeline.py.
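To make the symmetric-shift fix concrete: 4-bit symmetric values live in [-8, 7] but are stored unsigned in [0, 15], so dequantization must undo the +8 shift before scaling. Illustrative code for the convention, not the PR's implementation:

```python
import torch

def quantize_sym4(w: torch.Tensor, group_size: int):
    """Per-group symmetric 4-bit min-max quantization; values land in
    [-7, 7] and are stored unsigned as qdata in [0, 15] via a +8 shift."""
    g = w.reshape(-1, group_size)
    scale = (g.abs().amax(dim=1, keepdim=True) / 7.0).clamp_min(1e-8)
    qdata = torch.clamp(torch.round(g / scale) + 8, 0, 15).to(torch.uint8)
    return qdata, scale

def dequantize_sym4(qdata: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # The fix: subtract the same 8 that quantize added; without it every
    # symmetric weight comes back shifted by +8 * scale.
    return (qdata.float() - 8.0) * scale

w = torch.randn(8, 64)
q, s = quantize_sym4(w, group_size=64)
assert (dequantize_sym4(q, s).reshape_as(w) - w).abs().max() < s.max()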
- Fix float→uint64 truncation in main.cpp read_token (use llrintf).
- Add an assert in RingKVCache.update to catch seq_len > buf_size misuse.
- Add RingKVCache unit tests (sequential, wraparound, multi-token, assert).
- Add CanonicalQuantizedWeight __post_init__ validation error path tests.
- Add a GGUF Q4_K through tinygemm pack pipeline test (asymmetric).
- Add an 8-bit asymmetric matmul test.
- Add an F16 GGUF tensor type test.
- Document QuantConfig.bits as storage width and the _INT8_PER_AXIS coupling.
- serialize.py: add an iter_load() generator that streams weights one at a time from safetensors, keeping peak memory proportional to the largest single weight instead of loading all weights at once (a sketch follows below).
- pack_cuda.py: rewrite load_and_pack_for_cuda to use iter_load for streaming, avoiding ~40 GB peak memory when loading the 31B checkpoint.
- __init__.py: remove low-level CUDA packer internals (pack_int4_for_cuda, pack_int8_for_cuda, pack_linear_for_cuda, pack_embedding_for_cuda) from the public API. Tests import these directly from pack_cuda.py.
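A sketch of the iter_load idea using safetensors' lazy safe_open API; the PR's version additionally reassembles torchao tensor subclasses from their flattened parts, and pack_one in the usage comment is the PR's per-weight dispatcher.

```python
from safetensors import safe_open

def iter_load(path: str, device: str = "cpu"):
    """Yield (fqn, tensor) one at a time so peak memory stays proportional
    to the largest single weight rather than the whole checkpoint."""
    with safe_open(path, framework="pt", device=device) as f:
        for key in f.keys():
            yield key, f.get_tensor(key)

# Consumers pack each weight into the runtime model and drop the reference:
# for fqn, w in iter_load("model.safetensors"):
#     pack_one(model, fqn, w)
```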
Gemma's HuggingFace tokenizer does not auto-prepend BOS. Without it the model's logits collapse. Add --bos_id (default 2) to prepend and --eos_id (default 1) as a fallback stop token.
Delete the custom CanonicalQuantizedWeight dataclass and the serialize.py format. Quantized weights are now stored as torchao's native Int4Tensor (4-bit) and IntxUnpackedToInt8Tensor (8-bit) subclasses, serialized via torchao's safetensors integration.

Key changes:
- quantize_weight returns Int4Tensor or IntxUnpackedToInt8Tensor.
- quantize_model returns a single state_dict (not two dicts).
- 8-bit quantization is done in float32 to avoid bf16 precision loss (manual quantize + direct IntxUnpackedToInt8Tensor construction).
- The sensitive recipe uses HQQ asymmetric INT4 (scale + zero optimization).
- pack_model takes a single state_dict and dispatches by isinstance.
- pack.py uses TorchAOBaseTensor for quantized weight detection.
- The GGUF unpacker produces Int4Tensor/IntxUnpackedToInt8Tensor directly.
- serialize.py dissolved; callers inline torchao safetensors directly.

Breaking change: existing prequantized checkpoints (old format) must be regenerated with quantize_and_save.py.
- Use .detach() instead of .data when moving the packed INT4 weight to CPU, to preserve tensor subclass identity safely.
- Remove the unused loaded_keys set in load_and_pack_for_cuda.
- Handle top-level tensor keys (no dot) in load_and_pack_for_cuda.
Extend ReplaceEdgeOpWithTritonOpPass to select triton::sdpa_decode_splitk
for SDPA nodes where L_q=1 (decode) and L_kv exceeds 2048 (large KV
cache). This dramatically improves GPU utilization for full-attention
layers at long context lengths — standard SDPA launches only a handful
of CTAs (proportional to H_kv), while split-K partitions the KV sequence
across up to 128 CTAs.
Benchmarked on A100 with Gemma4 31B shapes at 128K context:
Full-attention decode (H_kv=4, D=512, L_kv=131072):
standard SDPA: 15.7ms/layer → split-K: 0.7ms/layer (22x)
Sliding-attention decode (H_kv=16, D=256, L_kv=2048):
unchanged (standard SDPA is faster for small L_kv)
The threshold of 2048 is chosen to match the sliding-window ring buffer
size — anything above is a full-attention cache where split-K wins.
No changes to model code — the pass inspects Q/K shapes in the exported
graph and selects the kernel automatically.
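The selection logic reduces to a shape predicate evaluated per SDPA node in the exported graph. A sketch of that predicate with the graph/node plumbing omitted; the threshold constant and shape layout are taken from the commit text above, everything else is illustrative.

```python
SPLITK_MIN_KV = 2048  # matches the sliding-window ring buffer size

def use_splitk(q_shape, k_shape) -> bool:
    """Decode (L_q == 1) against a long KV cache wins with split-K: it
    partitions L_kv across up to 128 CTAs instead of ~H_kv CTAs."""
    l_q = q_shape[-2]   # query sequence length
    l_kv = k_shape[-2]  # cached KV sequence length
    return l_q == 1 and l_kv > SPLITK_MIN_KV

# Full-attention decode at 128K context -> split-K kernel:
assert use_splitk((1, 4, 1, 512), (1, 4, 131072, 512))
# Sliding-window decode (ring buffer of 2048) -> standard SDPA:
assert not use_splitk((1, 16, 1, 256), (1, 16, 2048, 256))
```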
Change pack_cuda.py to store INT4 weights as IntxUnpackedToInt8Tensor (dequant+cuBLAS) instead of Int4TilePackedTo4dTensor (tinygemm). Add a use_tinygemm_linears source transform for decode optimization.

The export flow now exports prefill first (default dequant+cuBLAS, optimal for large M), then applies the tinygemm transform and exports decode (optimal for M=1). Prefill speedup: 12x vs tinygemm at T=2048 (2.6 ms vs 32 ms per linear). Decode unchanged (tinygemm, 68 us per linear at M=1).

pack_cuda.py no longer requires CUDA for packing. The tinygemm conversion moves to a model-agnostic source transform in backends/cuda/transforms/int4_linear_dispatch.py.
Force-pushed from d488474 to 5b54f50.
Text-only export of Gemma 4 31B-IT to ExecuTorch with INT4/INT8 weight quantization. Quantized weights use torchao's native tensor subclasses (Int4Tensor, IntxUnpackedToInt8Tensor) for serialization, aligning with the torchao ecosystem.
The quant/ package separates quantization into independent modules (recipe → quantize → pack).
Serialization uses torchao's safetensors integration (torchao.prototype.safetensors) — no custom format. Checkpoints are compatible with torchao's save_pretrained/load_pretrained and can be loaded by vLLM.
This framework is designed to be promoted and reused for Qwen 3.5 MoE and other models — adding a new model requires only a QuantRecipe and optionally a custom packer.
Quantization recipes: "default" (INT4 min_max linears + INT8 per-axis embedding) and "sensitive" (INT8 for edge-layer v_proj/down_proj, INT4 HQQ asymmetric elsewhere).
Dual-path INT4 linear dispatch: IntxUnpackedToInt8Tensor's F.linear dispatch dequantizes to bf16 and calls cuBLAS, optimal for prefill (12x faster than tinygemm at T=2048). For decode, a model-agnostic source transform (backends/cuda/transforms/int4_linear_dispatch.py) converts to Int4TilePackedTo4dTensor (tinygemm), optimal for M=1. Export flow: prefill first (dequant+cuBLAS), then tinygemm transform, then decode export. inference.py applies the tinygemm transform for fast eager decode.
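A sketch of what a use_tinygemm_linears-style source transform does: swap each INT4 linear weight from the dequant+cuBLAS subclass (good for prefill's large M) to the tinygemm-packed subclass (good for M=1 decode). The type check and converter are passed in as parameters because their exact torchao names and signatures are not shown on this page; this is not the PR's code.

```python
import torch.nn as nn

def use_tinygemm_linears(model: nn.Module, is_int4, convert) -> nn.Module:
    """Rewrite INT4 linear weights in place for decode export.

    is_int4(weight) -> bool and convert(weight) -> packed_weight stand in
    for torchao subclass checks/constructors (e.g. something producing an
    Int4TilePackedTo4dTensor); both are assumptions here."""
    for module in model.modules():
        if isinstance(module, nn.Linear) and is_int4(module.weight):
            module.weight = nn.Parameter(
                convert(module.weight), requires_grad=False)
    return model
```

This matches the export ordering described above: export prefill with the default dispatch, apply the transform, then export decode.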
Split-K flash-decoding: ReplaceEdgeOpWithTritonOpPass in the CUDA backend selects triton::sdpa_decode_splitk for SDPA nodes where L_q=1 and L_kv exceeds 2048. At 128K context, full-attention decode SDPA improves from 15.7ms/layer to 0.7ms/layer (22x). Sliding-window layers (ring buffer <= 2048) use standard triton::sdpa. No model code changes — the pass inspects Q/K shapes in the exported graph automatically.
GGUF support: inference.py --gguf and export.py --gguf load community-quantized GGUF files directly. Tied embed/lm_head is untied — embedding dequantized to bf16 for gather, lm_head keeps INT4 for matmul.
Ring-buffer KV cache: Sliding window layers use RingKVCache (2x window) instead of flat max_seq_len buffers. The C++ runner chunks long prompts automatically via get_max_prefill_chunk metadata. Chunked prefill produces identical logits to sequential (verified by test).
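A sketch of the chunking loop in Python; the real loop lives in the C++ runner and reads the chunk size from the exported get_max_prefill_chunk metadata, and the method names here are assumptions.

```python
def chunked_prefill(model, tokens, max_chunk: int):
    """Feed the prompt in ring-buffer-sized chunks, tracking the absolute
    position so KV writes land in the right slots; only the last chunk's
    logits matter, since decode starts from the final prompt token."""
    pos = 0
    logits = None
    while pos < len(tokens):
        chunk = tokens[pos : pos + max_chunk]
        logits = model.prefill(chunk, start_pos=pos)  # method name assumed
        pos += len(chunk)
    return logits
```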
Includes: C++ runner with BOS/EOS handling, chunked prefill, and #ifdef guards for non-CUDA builds; eager inference with torch.compile; unit and integration tests across quant/tests/, tests/, and backends/cuda/tests/.