Skip ComfyUI safetensors post-processing unless opted in (fix sharded FLUX export) (#1794)

jingyu-ml · claude · kevalmorabia97 · commit faaf9f470db6 · 2026-06-22T21:37:29.000-07:00
### What does this PR do?

**Type of change:** Bug fix

Fixes diffusers HuggingFace export (`export_hf_checkpoint`) failing with `NotImplementedError: Post-processing sharded safetensors is not supported` for large quantized diffusion models (e.g. the FP8 FLUX.1-dev transformer, ~12 GB, which exceeds the default 10 GB `max_shard_size` and is split into multiple `.safetensors` shards).

**Root cause:** `_postprocess_safetensors` runs for every quantized diffusers component. It exists only to build single-file deployment checkpoints (e.g. ComfyUI): merging with a base checkpoint, NVFP4 padding/swizzling, and embedding quant metadata in the safetensors header. ModelOpt reads none of this back (the diffusers reload uses `config.json`). But it embedded the header metadata **by default** (`enable_layerwise_quant_metadata=True`), and that default-on path hard-raised for any sharded checkpoint, so a plain `--format fp8` FLUX export failed at export time. (Introduced in #1195 / #911; not a transformers/diffusers API change.)

**Fix (make the post-processing opt-in):** `_postprocess_safetensors` now returns immediately (no-op) unless the caller opts into one of `merged_base_safetensor_path` / `padding_strategy` / `enable_swizzle_layout` / `enable_layerwise_quant_metadata` (the default of the last one is flipped `True → False`). A plain quantized diffusers export, including a sharded one, is left untouched and no longer fails; `config.json` still carries `quantization_config` for the diffusers-native reload. Behavior for callers that opt in is unchanged (sharded + merge/metadata remains unsupported, which is out of scope for this fix).
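As a rough illustration of the opt-in gate described above, the standalone sketch below mirrors the condition added to `_postprocess_safetensors`. The helper name `wants_postprocessing` is hypothetical and is not part of ModelOpt (the real change inlines the check); only the four keyword names and their defaults are taken from this PR.

```python
from typing import Any


def wants_postprocessing(**kwargs: Any) -> bool:
    """Return True if the caller opted into any single-file post-processing step.

    Hypothetical helper for illustration only; it reproduces the early-return
    condition from this PR as a reusable predicate.
    """
    return (
        kwargs.get("merged_base_safetensor_path") is not None
        or kwargs.get("padding_strategy") is not None
        or bool(kwargs.get("enable_swizzle_layout", False))
        # Default flipped from True to False by this change.
        or bool(kwargs.get("enable_layerwise_quant_metadata", False))
    )
```

With no keyword arguments the predicate is false, which corresponds to the new no-op path; passing any one of the four options makes it true and reaches the existing post-processing code, including its sharded-checkpoint guard.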
### Usage

```bash
# Default export: now succeeds (no ComfyUI post-processing):
python examples/diffusers/quantization/quantize.py --model flux-dev --format fp8 \
    --quantized-torch-ckpt-save-path flux-dev-fp8.pt --hf-ckpt-dir flux-dev-fp8 \
    --collect-method default --calib-size 128 --quantize-mha \
    --model-dtype BFloat16 --trt-high-precision-dtype BFloat16

# Opt into the ComfyUI single-file header metadata:
# ... --extra-param enable_layerwise_quant_metadata=true
```

### Testing

- CPU unit tests (`tests/unit/torch/export/test_export_diffusers.py`):
  - `test_postprocess_default_is_noop`: default (incl. a sharded checkpoint) injects nothing and does not raise.
  - `test_postprocess_single_file_metadata_when_opted_in`: opt-in single-file injection still works.
  - `test_postprocess_sharded_opt_in_raises`: an explicit opt-in on a sharded checkpoint still raises (documents the existing, out-of-scope limitation).
- GPU test (`tests/gpu/torch/export/test_export_diffusers.py::test_export_diffusers_sharded_default_no_header_metadata`): real FP8 tiny-FLUX, forced sharding, default → succeeds with a clean header.
- Verified on a GB200 in the dev container: 10/10 unit + 11/11 GPU diffusers export tests pass; the regression test fails on the original code (`NotImplementedError`) and passes with the fix. `ruff check` / `ruff format` clean.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅ for the diffusers-native reload (quant config still in `config.json`). ⚠️ Behavior change: the ComfyUI **safetensors-header** metadata is no longer written for a plain export; it is now opt-in (`enable_layerwise_quant_metadata=true`, or automatically when using merge/swizzle/padding).
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A, scoped as a minor diffusers export-path fix; no changelog entry
- Did you get Claude approval on this PR?: ❌, will run `/claude review`.

### Additional Information

Reported against ModelOpt 0.45.0rc0 / TRT-LLM 1.3.0rc17 with FLUX.1-dev FP8 export on RTX 6000 Ada. The safetensors post-processing was added in #1195 (ComfyUI single-file origin in #911).

## Summary by CodeRabbit

* **Changes**
  * Quantization metadata injection in safetensors exports is now opt-in rather than automatic, with the default behavior changed to exclude metadata.
  * Sharded FP8 model exports no longer include quantization header metadata by default.
* **Tests**
  * Added regression and unit tests validating quantization metadata opt-in behavior for safetensors exports.

Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
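To spot-check the "clean header" behavior on an existing export directory, a sketch along the following lines may help. It assumes the standard safetensors file layout (an 8-byte little-endian header length followed by a JSON header whose optional `__metadata__` entry holds the free-form string metadata) and uses only the standard library. The helper names `read_safetensors_metadata` and `shards_with_quant_metadata` are hypothetical; the tests in this PR use `safetensors.safe_open(...).metadata()` instead, which is the more direct route when the library is installed.

```python
import json
import struct
from pathlib import Path


def read_safetensors_metadata(path: Path) -> dict:
    """Return the ``__metadata__`` mapping from a safetensors header (or {}).

    Assumes the standard layout: 8-byte little-endian unsigned header length,
    then that many bytes of UTF-8 JSON describing the tensors.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header.get("__metadata__") or {}


def shards_with_quant_metadata(export_dir: Path) -> list[str]:
    """List shard file names whose header carries either quant metadata key."""
    keys = ("quantization_config", "_quantization_metadata")
    return [
        p.name
        for p in sorted(Path(export_dir).glob("*.safetensors"))
        if any(k in read_safetensors_metadata(p) for k in keys)
    ]
```

For a default export after this change, `shards_with_quant_metadata(export_dir)` would be expected to return an empty list; a non-empty result suggests the export was produced with one of the opt-in options.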
diff --git a/modelopt/torch/export/unified_export_hf.py b/modelopt/torch/export/unified_export_hf.py
@@ -168,6 +168,12 @@ def _postprocess_safetensors(
        ``_quantization_metadata`` so inference runtimes can detect and handle
        quantized layers.
 
+    All of these target single-file deployment runtimes (e.g. ComfyUI) and are
+    opt-in; ModelOpt itself reads the quant config from ``config.json`` on reload. If
+    the caller passes none of ``merged_base_safetensor_path``, ``padding_strategy``,
+    ``enable_swizzle_layout``, or ``enable_layerwise_quant_metadata``, this function
+    does nothing and leaves the standard exported checkpoint untouched.
+
     Args:
         export_dir: Directory containing the saved ``.safetensors`` file(s).
         pipe: The diffusion pipeline / model.  Used to infer the model type
@@ -181,11 +187,11 @@ def _postprocess_safetensors(
                 file to produce a single-file checkpoint compatible with ComfyUI.
                 Value should be the path to a full base model ``.safetensors``
                 file (e.g. ``"path/to/ltx-2-19b-dev.safetensors"``).
-            enable_layerwise_quant_metadata (bool, optional): When True
-                (default), includes per-layer ``_quantization_metadata`` in the
-                checkpoint metadata so that inference runtimes (e.g., ComfyUI)
-                can identify which layers are quantized and in what format. Set
-                to False to skip.
+            enable_layerwise_quant_metadata (bool, optional): When True, embeds
+                ``quantization_config`` and per-layer ``_quantization_metadata`` in the
+                safetensors header so single-file runtimes (e.g., ComfyUI) can identify
+                which layers are quantized and in what format. Defaults to False (no
+                header metadata; this alone leaves the export untouched).
             enable_swizzle_layout (bool, optional): When True, rearranges NVFP4
                 block scales from ModelOpt's flat layout to cuBLAS 2-D tiled
                 layout. Required for runtimes that consume cuBLAS block-scaled
@@ -198,10 +204,23 @@ def _postprocess_safetensors(
 
     """
     merged_base_safetensor_path: str | None = kwargs.get("merged_base_safetensor_path")
-    enable_layerwise_quant_metadata: bool = kwargs.get("enable_layerwise_quant_metadata", True)
+    enable_layerwise_quant_metadata: bool = kwargs.get("enable_layerwise_quant_metadata", False)
     enable_swizzle_layout: bool = kwargs.get("enable_swizzle_layout", False)
     padding_strategy: str | None = kwargs.get("padding_strategy")
 
+    # This post-processing only produces single-file deployment checkpoints (e.g.
+    # ComfyUI): merging with a base checkpoint, NVFP4 padding/swizzling, and embedding
+    # quant metadata in the safetensors header. None of it is read back by ModelOpt
+    # (the diffusers reload uses ``config.json``), so if the user has not opted into any
+    # of these options there is nothing to do — leave the exported checkpoint untouched.
+    if not (
+        merged_base_safetensor_path is not None
+        or padding_strategy is not None
+        or enable_swizzle_layout
+        or enable_layerwise_quant_metadata
+    ):
+        return
+
     safetensor_files = sorted(export_dir.glob("*.safetensors"))
     if not safetensor_files:
         return
diff --git a/tests/gpu/torch/export/test_export_diffusers.py b/tests/gpu/torch/export/test_export_diffusers.py
@@ -17,6 +17,7 @@
 
 import pytest
 from _test_utils.torch.diffusers_models import get_tiny_dit, get_tiny_flux, get_tiny_unet
+from safetensors import safe_open
 
 import modelopt.torch.quantization as mtq
 from modelopt.torch.export.diffusers_utils import generate_diffusion_dummy_inputs
@@ -28,6 +29,13 @@ def _load_config(config_path):
         return json.load(file)
 
 
+def _calib_with_dummy_inputs(m):
+    param = next(m.parameters())
+    dummy_inputs = generate_diffusion_dummy_inputs(m, param.device, param.dtype)
+    assert dummy_inputs is not None
+    m(**dummy_inputs)
+
+
 @pytest.mark.parametrize("model_factory", [get_tiny_unet, get_tiny_dit, get_tiny_flux])
 @pytest.mark.parametrize(
     ("config_id", "quant_cfg"),
@@ -78,3 +86,33 @@ def _calib_fn(m):
 
     config_data = _load_config(config_path)
     assert "quantization_config" in config_data
+
+
+def test_export_diffusers_sharded_default_no_header_metadata(tmp_path):
+    """A default (non-opt-in) sharded FP8 export succeeds and writes no header metadata.
+
+    Regression test for the FLUX FP8 export crash (NotImplementedError on sharded
+    safetensors). A tiny max_shard_size forces the tiny model to split into multiple
+    shards (+ index.json), reproducing the large-model path. With the header quant
+    metadata off by default, post-processing is a no-op: the export must succeed and
+    leave a clean safetensors header (the ComfyUI metadata is opt-in).
+    """
+    model = get_tiny_flux()
+    export_dir = tmp_path / "export_flux_fp8_sharded_default"
+
+    mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=_calib_with_dummy_inputs)
+
+    # Tiny shard size forces sharding even for this tiny model.
+    export_hf_checkpoint(model, export_dir=export_dir, max_shard_size="1KB")
+
+    assert list(export_dir.glob("*.safetensors.index.json")), (
+        "expected a sharded export (index.json) with a tiny max_shard_size"
+    )
+    shard_files = sorted(export_dir.glob("*.safetensors"))
+    assert len(shard_files) >= 2, "expected the model to split across multiple shards"
+
+    for shard in shard_files:
+        with safe_open(str(shard), framework="pt") as f:
+            md = f.metadata() or {}
+        assert "quantization_config" not in md, f"unexpected header metadata in {shard.name}"
+        assert "_quantization_metadata" not in md, f"unexpected header metadata in {shard.name}"
diff --git a/tests/unit/torch/export/test_export_diffusers.py b/tests/unit/torch/export/test_export_diffusers.py
@@ -26,17 +26,46 @@
 
 pytest.importorskip("diffusers")
 
+from safetensors import safe_open
+from safetensors.torch import save_file
+
 import modelopt.torch.export.unified_export_hf as unified_export_hf
 from modelopt.torch.export.convert_hf_config import convert_hf_quant_config_format
 from modelopt.torch.export.diffusers_utils import generate_diffusion_dummy_inputs
-from modelopt.torch.export.unified_export_hf import export_hf_checkpoint
+from modelopt.torch.export.unified_export_hf import _postprocess_safetensors, export_hf_checkpoint
 
 
 def _load_config(config_path):
     with open(config_path) as file:
         return json.load(file)
 
 
+def _write_sharded_checkpoint(export_dir, shards):
+    """Write ``shards`` (list of state-dict chunks) as sharded safetensors + index.json.
+
+    Mimics the layout produced by ``save_pretrained`` when a component is split across
+    multiple files because it exceeds ``max_shard_size``.
+    """
+    export_dir.mkdir(parents=True, exist_ok=True)
+    total = len(shards)
+    weight_map = {}
+    total_size = 0
+    for i, shard in enumerate(shards, start=1):
+        filename = f"diffusion_pytorch_model-{i:05d}-of-{total:05d}.safetensors"
+        save_file(shard, str(export_dir / filename))
+        for key, tensor in shard.items():
+            weight_map[key] = filename
+            total_size += tensor.numel() * tensor.element_size()
+    index = {"metadata": {"total_size": total_size}, "weight_map": weight_map}
+    with open(export_dir / "diffusion_pytorch_model.safetensors.index.json", "w") as file:
+        json.dump(index, file)
+
+
+def _read_safetensors_metadata(path):
+    with safe_open(str(path), framework="pt") as file:
+        return dict(file.metadata() or {})
+
+
 @pytest.mark.parametrize(
     "model_factory", [get_tiny_unet, get_tiny_dit, get_tiny_flux, get_tiny_flux2]
 )
@@ -117,3 +146,82 @@ def test_flux2_dummy_inputs_shape():
 
     # guidance_embeds defaults to True for Flux2
     assert "guidance" in inputs
+
+
+@pytest.mark.parametrize(
+    "opt_in_kwargs",
+    [
+        {"enable_layerwise_quant_metadata": True},
+        {"merged_base_safetensor_path": "/tmp/base.safetensors"},
+    ],
+)
+def test_postprocess_sharded_opt_in_raises(tmp_path, opt_in_kwargs):
+    """Opting into ComfyUI post-processing on a sharded checkpoint is unsupported.
+
+    Documents the existing limitation (out of scope for this fix). The bug fix is the
+    default no-op path (see ``test_postprocess_default_is_noop``); only an explicit
+    opt-in reaches this guard.
+    """
+    export_dir = tmp_path / "sharded_opt_in"
+    _write_sharded_checkpoint(
+        export_dir,
+        [
+            {"layer_a.weight": torch.zeros(4, 4), "layer_a.weight_scale": torch.ones(1)},
+            {"layer_b.weight": torch.zeros(4, 4), "layer_b.weight_scale": torch.ones(1)},
+        ],
+    )
+
+    with pytest.raises(NotImplementedError, match="sharded safetensors"):
+        _postprocess_safetensors(
+            export_dir,
+            hf_quant_config={"quant_algo": "FP8"},
+            **opt_in_kwargs,
+        )
+
+
+def test_postprocess_single_file_metadata_when_opted_in(tmp_path):
+    """With the opt-in flag, a non-sharded export injects quant config + per-layer metadata."""
+    export_dir = tmp_path / "single_file"
+    export_dir.mkdir(parents=True, exist_ok=True)
+    save_file(
+        {"layer_a.weight": torch.zeros(4, 4), "layer_a.weight_scale": torch.ones(1)},
+        str(export_dir / "diffusion_pytorch_model.safetensors"),
+    )
+
+    _postprocess_safetensors(
+        export_dir,
+        hf_quant_config={"quant_algo": "FP8"},
+        enable_layerwise_quant_metadata=True,
+    )
+
+    metadata = _read_safetensors_metadata(export_dir / "diffusion_pytorch_model.safetensors")
+    assert "quantization_config" in metadata
+    assert json.loads(metadata["_quantization_metadata"])["layers"] == {
+        "layer_a": {"format": "fp8"}
+    }
+
+
+def test_postprocess_default_is_noop(tmp_path):
+    """By default (no opt-in) nothing is written to the safetensors header.
+
+    The header quant metadata is a single-file deployment (e.g. ComfyUI) feature, so a
+    plain export must leave the checkpoint untouched. This no-op default is also what
+    keeps a default *sharded* export from reaching the unsupported-sharded path that
+    caused the original FP8 FLUX crash.
+    """
+    export_dir = tmp_path / "default_noop"
+    _write_sharded_checkpoint(
+        export_dir,
+        [
+            {"layer_a.weight": torch.zeros(4, 4), "layer_a.weight_scale": torch.ones(1)},
+            {"layer_b.weight": torch.zeros(4, 4), "layer_b.weight_scale": torch.ones(1)},
+        ],
+    )
+
+    # No opt-in kwargs: must not raise (even though sharded) and must inject nothing.
+    _postprocess_safetensors(export_dir, hf_quant_config={"quant_algo": "FP8"})
+
+    for shard in sorted(export_dir.glob("*.safetensors")):
+        metadata = _read_safetensors_metadata(shard)
+        assert "quantization_config" not in metadata
+        assert "_quantization_metadata" not in metadata