Highlights
We are excited to announce the 0.12.0 release of torchao! This release adds a QAT + Axolotl integration and prototype MXFP/NVFP support on Blackwell GPUs!
QAT + Axolotl Integration
TorchAO’s QAT support has been integrated into Axolotl’s fine-tuning recipes! Check out the docs here or run it yourself using the following commands:
```bash
axolotl train examples/llama-3/3b-qat-fsdp2.yaml
axolotl quantize examples/llama-3/3b-qat-fsdp2.yaml
```
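Under the hood, the `train` step fine-tunes with torchao fake quantization inserted and the `quantize` step converts the fake-quantized modules into real quantized ones. Below is a minimal sketch of that flow using torchao's QAT quantizer API; the actual dtypes and group size Axolotl uses come from the YAML recipe above, so treat this as illustrative rather than the integration's literal code path:

```python
from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

# model: any torch.nn.Module, e.g. the Llama 3.2 3B checkpoint being fine-tuned
qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# 1. "axolotl train": wrap linear layers with fake-quantize ops, then fine-tune
model = qat_quantizer.prepare(model)
# ... run the usual fine-tuning loop on `model` ...

# 2. "axolotl quantize": swap fake-quantized modules for actually quantized ones
model = qat_quantizer.convert(model)
```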
Initial results for Llama3.2-3B by @SalmanMohammadi (axolotl-ai-cloud/axolotl#2590):
| Model/Metric | hellaswag acc | hellaswag acc_norm | wikitext bits_per_byte | wikitext byte_perplexity | wikitext word_perplexity |
|---|---|---|---|---|---|
| bfloat16 | 0.5552 | 0.7315 | 0.6410 | 1.5594 | 10.7591 |
| bfloat16 PTQ | 0.5393 | 0.7157 | 0.6613 | 1.5815 | 11.6033 |
| QAT PTQ | 0.5423 | 0.7180 | 0.6567 | 1.5764 | 11.4043 |
| Recovered (QAT PTQ) | 18.87% | 14.56% | 22.66% | 23.08% | 23.57% |
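The "Recovered" row reads as the fraction of the PTQ regression that QAT wins back relative to the bfloat16 baseline; this interpretation is inferred from the numbers rather than stated in the PR, but it reproduces the table exactly. For example, for hellaswag acc:

```python
# hellaswag acc values from the table above
bf16, bf16_ptq, qat_ptq = 0.5552, 0.5393, 0.5423

# fraction of the PTQ degradation that QAT recovers
recovered = (qat_ptq - bf16_ptq) / (bf16 - bf16_ptq)
print(f"{recovered:.2%}")  # 18.87%, matching the Recovered row
```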
[Prototype | API not finalized] MXFP and NVFP support on Blackwell GPUs
TorchAO now includes prototype support for NVFP4 (NVIDIA's 4-bit floating-point format) and Microscaling (MX) formats on NVIDIA's latest Blackwell GPU architecture. These formats enable efficient inference, achieving up to a 61% end-to-end performance improvement in vLLM on Qwen3 models and nearly 2x speedups for diffusion workloads.
To use:
```python
from torchao.quantization import quantize_
from torchao.prototype.mx_formats import (
    MXFPInferenceConfig,
    NVFP4InferenceConfig,
)

# Quantize the model with MXFP8 (quantize_ modifies the model in place)
quantize_(model, MXFPInferenceConfig(block_size=32))

# Or quantize the model to NVFP4 (without double scaling)
quantize_(model, NVFP4InferenceConfig())
```
Note: This is a prototype feature with APIs subject to change. Requires NVIDIA Blackwell GPUs (B200, 5090) with CUDA 12.8+.
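If you are unsure whether your GPU qualifies, here is a quick, illustrative capability check; the compute-capability values below are assumptions based on the GPUs named in the note (sm_100 for B200, sm_120 for the 5090):

```python
import torch

# Blackwell compute capabilities assumed here: sm_100 (B200) and sm_120 (RTX 5090)
major, minor = torch.cuda.get_device_capability()
if major not in (10, 12):
    print(f"Detected sm_{major}{minor}: MXFP8/NVFP4 fast paths need a Blackwell GPU and CUDA 12.8+")
```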
BC Breaking
- Remove preserve_zero and zero_point_domain from choose_qparams_affine (#2149)
- Rename qparams for tinygemm (#2344)
- Convert quant_primitives methods private (#2350)
- Delete Galore (#2397)
- Remove more Galore bits (#2417)
- Remove `sparsity/prototype/blocksparse` (#2205)
Deprecations
- Clean up prototype folder (#2232)
- Make float8 training's force_recompute_fp8_weight_in_bwd flag do nothing (#2356)
New Features
- Enabling MOE Quantization using linear decomposition (#2043)
- [PT2E][X86] Migrate fusion passes in Inductor to torchao (#2140)
- 2:4 activation sparsity packing kernels (#2012)
- Add subclass based method for inference w/ MXFP8 (#2132)
- Feat: Implementation of the DeepSeek blockwise quantization for fp8 tensors (#1763)
- Arm_inductor_quantizer for Pt2e quantization (#2139)
- Add mx_fp4 path (#2201)
- Add support for KleidiAI int4 kernels on aarch64 Linux (#2169)
- Add support for fbgemm int4 mm kernel (#2255)
- Enable fp16+int4 mixed precision path for int4 xpu path with int zero point (#2240)
- Enable range learning for QAT (#2033)
- Patch the _is_conv_node function (#2257)
- Add support for fbgemm fp8 kernels (#2276)
- Add Float8ActInt4WeightQATQuantizer (#2289)
- [float8] add _auto_filter_for_recipe to float8 (#2410)
- NVfp4 (#2408)
- [float8] Prevent quantize_affine_float8/dequantize_affine_float8 decomposed on inductor (#2379)
- [CPU] Enable DA8W4 on CPU (#2128)
- Add exportable coreml codebook quantization op (#2443)
- Add support for Int4GroupwisePreshuffleTensor for fbgemm (#2421)
Improvement
- Add serialization support for `AOPerModuleConfig` (#2186)
- Set eps in end-to-end QAT flow (#2180)
- Enable {conv3d, conv_transpose3d} + bn fusion in pt2e (#2212)
- Update GemLite to support vLLM V1 (#2199)
- [sparse] Add fp8 sparse gemm with rowwise scaling for activation sparsity (#2242)
- Patch the _is_conv_node function (#2223)
- Relax int4wo device mismatch error (#2254)
- Rename AOPerModuleConfig to ModuleFqnToConfig (#2243)
- [reland2][ROCm] preshuffled weight mm (#2207)
- GPTQ updates (#2235)
- Fix QAT range learning, ensure scales get gradients (#2280)
- Fix slicing and get_plain() in GemLite (#2288)
- Add slicing support for fbgemm fp8 and int4 (#2308)
- Add support for bmm and `to` for fbgemm Tensor (#2337)
- Add dynamic quantization support to gemlite layout (#2327)
- Test PARQ with torchao activation quantization (#2370)
- Update index.rst (#2395)
- Add inplace quantizer examples (#2345)
- Build mxfp4 kernel for sm120a (#2285)
- Enable to_mxfp8 cast for DTensor (#2420)
- Enable tensor parallelism for MXLinear (#2434)
- Graduate debug handle in torchao (#2452)
- Switch alignment to 8 for cutlass 4 upgrade (#2491)
- Mxfp8 training: add TP sharding strategy for dim1 kernel (#2436)
Bug Fixes
- [optim] Fix low-bit optim when used with FSDP2+CPUOffload (#2195)
- Fix Per Row scaling for inference (#2253)
- Fix benchmark_low_bit_adam.py reference (#2287)
- [optim] Fix bug when default dtype is BF16 (#2286)
- [sparse] marlin fixes (#2305)
- Fix ROCM test failures (#2362)
- [float8] Add fnuz fp8 dtypes to Float8Layout (#2351)
- Fixing ruff format for trunk (#2369)
- Fixing trunk - autoquant test failure (#2363)
- Remove torchao dependency from torchao build script (#2383)
- Fix torchao quantized model in fbcode (#2396)
- Gemlite generate.py fix (#2372)
- Fixes issue #156414: Fixes bug in implementation of _combine_histogram (Follow up) (#2418)
- TorchAO new observers (#2508)
- Fix tutorials (#2516)
Performance
- Add a triton kernel for swizzling (#2168)
Documentation
- Add blockwise fp8 gemm benchmarks to README (#2203)
- [float] document e2e training -> inference flow (#2190)
- Update Readme (#1526)
- Mark QAT range learning as prototype for now (#2272)
- Update float8 training readme to include time measurement (#2291)
- [BE/docs] Add float8 training api ref to docsite (#2313)
- Enable doc build to run on PRs (#2315)
- [BE] [docs] Add float8 pretraining tutorial to docsite (#2304)
- [BE/docs] Add fp8 rowwise perf table to float8 training readme (#2312)
- Update Quantization docs to show newer AOConfigs (#2317)
- Update QAT docs, highlight axolotl integration (#2266)
- Add static quant tutorial (#2047)
- Update README.md to include seamless v2 (#2355)
- Add Tutorial on E2E integration into VLLM and minimal Subclass (#2346)
- [docs] Replace deprecated configs with Config objects (#2375)
- Revamp README (#2374)
- Add pt2e tutorials to torchao doc page (#2384)
- Add part 2 of end-to-end tutorial: fine-tuning (#2394)
- Call out axolotl + QAT integration on README (#2442)
- Float8 readme: remove duplication (#2447)
- Float8 readme: add key features section (#2448)
- Update README.md to include Flux-Fast (#2457)
- Inference tutorial - Part 3 of e2e series (#2343)
- Update QAT README and API docstrings (#2465)
- Fix typo : whic -> which (#2495)
- Fix links for torchao tutorials (#2503)
- Fix docstrings for quantization API docs (#2471)
- Tutorial for benchmarking (#2499)
Developers
New Contributors
- @malfet made their first contribution in #2181
- @the-tuning-machine made their first contribution in #1763
- @choudhary-devang made their first contribution in #2139
- @vctrmn made their first contribution in #2169
- @yuguo68 made their first contribution in #2225
- @liangan1 made their first contribution in #2240
- @emmanuel-ferdman made their first contribution in #2250
- @odiemm-meta made their first contribution in #2328
- @lilianaairhart made their first contribution in #2360
- @Gasoonjia made their first contribution in #2390
- @zixi-qi made their first contribution in #2396
- @shiyang-weng made their first contribution in #2379
- @Akabbaj made their first contribution in #2418
- @mori360 made their first contribution in #2449
- @henrylhtsang made their first contribution in #2491
- @namgyu-youn made their first contribution in #2495
- @rohansjoshi made their first contribution in #2508
Full Changelog: v0.11.0...v0.12.0-rc2