[PyTorch Debug] Add scale_inv_std stat and skip NVFP4 layers in LogFp8TensorStats by pggPL · Pull Request #3044 · NVIDIA/TransformerEngine

pggPL · 2026-05-26T13:12:27Z

Description

Addresses items 1 and 2 from #2801:

1. Adds `scale_inv_std` for every recipe (`fp8_delayed_scaling`, `fp8_current_scaling`,
   `mxfp8`, `fp8_block_scaling`, `nvfp4`) plus `_columnwise` variants. Min/max alone
   can hide serious clipping when scale spread is wide. Reduction across
   microbatches/ranks uses Welford (population variance, so `std=0` instead of NaN
   for the single-scalar delayed/current-scaling case). NVFP4 also gains
   `scale_inv_min` / `scale_inv_max`.

2. `LogFp8TensorStats` no longer crashes on layers using `NVFP4Quantizer` — bare
   stats are filtered with a warning, FP8-recipe-prefixed stats (e.g. `mxfp8_mse`
   for what-if) are preserved. Combined with #2296 and #2652, the BioNeMo ESM2
   YAML workaround referenced in the issue can now be removed.

Additionally, this PR fixes a pre-existing sign error in `compute_variance`
(parallel-axis Welford combination): the formula subtracted `(mean_i - mean)^2`
instead of adding, yielding negative variance whenever >=2 buffer groups had
different means (multi-microbatch or multi-rank reductions). Single-group
buffers were unaffected (`mean_i - mean = 0`), which is why existing `variance`
/ `std` stats never tripped this in tests. With `scale_inv_std` now reduced
across microbatches/ranks the bug becomes user-visible (`sqrt(negative)` -> NaN).

Item 3 from #2801 (per-block dump for MXFP8/NVFP4) is out of scope here.

Fixes #2801 (partial)

## Type of change

- [ ] Documentation change (change only to the documentation, either a fix or a new content)
- [x] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Infra/Build change
- [ ] Code refactoring

## Changes

- `stats_computation.py`: `add_scale_inv_stats` registers `scale_inv_std` plus
  Welford helpers; NVFP4 added to the registration loop.
- `stats_computation.py`: fix sign in `compute_variance` — Welford parallel
  combination needs `+ (mean_i - mean)^2`, not `-`.
- `log_fp8_tensor_stats.py`: `inspect_tensor` filters NVFP4-resolved bare stats
  with a warning instead of raising.
- `log_nvfp4_tensor_stats.py`: docstring + `supported_stats` updated for
  `scale_inv_min` / `scale_inv_max` / `scale_inv_std`.
- `test_log.py`: `scale_inv_std` added to `bare_stats`; numerics test validates
  min/max/std against `torch.std(scale_inv, unbiased=False)`.

# Checklist:

- [x] I have read and followed the [contributing guidelines](https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
- [x] The functionality is complete
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes

…8TensorStats (NVIDIA#2801) - Register scale_inv_std (plus helper variance/numel/sum buffers using Welford reduction) for all FP8 recipes and NVFP4 in add_scale_inv_stats. Population variance keeps std=0 for delayed/current scaling where scale_inv is a single scalar. - Also wire scale_inv_min/max/std for NVFP4 (was previously only FP8 recipes). - LogFp8TensorStats.inspect_tensor now filters bare stats on NVFP4 layers with a warning instead of raising, so dual LogFp8TensorStats + LogNvfp4TensorStats configs work with overlapping (or catch-all) layer regexes. Recipe-prefixed FP8 stats (e.g. mxfp8_mse) are preserved for what-if comparisons. - Numerics test extended to validate scale_inv_min/max/std against torch.std(scale_inv, unbiased=False). Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

for more information, see https://pre-commit.ci

pggPL · 2026-05-26T13:15:03Z

/te-ci pytorch

greptile-apps · 2026-05-26T13:17:43Z

Greptile Summary

This PR addresses two items from issue #2801: adding scale_inv_std (population std of per-block scale inverses) across all recipes including NVFP4, and making LogFp8TensorStats gracefully skip NVFP4 layers instead of crashing. It also fixes a pre-existing sign error in the parallel Welford variance combination (compute_variance) that would have produced negative variance in any multi-microbatch or multi-rank scenario where per-group means differed.

compute_variance sign fix (stats_computation.py): the combine formula now correctly adds n_i · (mean_i − mean)² to each group's M2; this also changes the combined output of the pre-existing variance/std stats from wrong-population to correct-sample variance in multi-group reductions.
scale_inv_std registration (stats_computation.py): variance, numel, and sum helper buffers are registered alongside the user-visible std stat; all helpers correctly use real_scale_inv (zero-filtered) so zero-padded MXFP8/NVFP4 scale_inv slots do not skew the Welford combination.
NVFP4 handling (log_fp8_tensor_stats.py, log_nvfp4_tensor_stats.py): bare NVFP4 stats are now dropped with a warning rather than raising; FP8-recipe-prefixed what-if stats (e.g. mxfp8_mse) are preserved.

Confidence Score: 4/5

Safe to merge after the open discussions from earlier review rounds are resolved; the new Welford implementation and zero-filtering are correct.

The Welford combination fix and the real_scale_inv zero-filtering are both mathematically correct and well-tested by the new microbatch-reduction tests. The scale_inv_max compute still bypasses zero-filtering (harmless in practice since real values are positive, but inconsistent with every other scale_inv stat). Both test helpers that build the reference expected value do so from unfiltered scale_inv tensors, which matches the implementation only because the chosen shapes need no padding, making the tests brittle to future shape changes.

stats_computation.py deserves a second look for the scale_inv_max filtering inconsistency; test_log.py reference-value construction in the two new microbatch-reduction tests should be verified against zero-padded shapes.

Important Files Changed

Filename	Overview
transformer_engine/debug/features/utils/stats_computation.py	Core change: fixes Welford sign error in compute_variance, adds unbiased parameter, and extends add_scale_inv_stats with variance/numel/sum helpers for distributed std reduction.
transformer_engine/debug/features/log_fp8_tensor_stats.py	NVFP4 crash fix: inspect_tensor now filters bare stats with a warning instead of raising; adds scale_inv_std to the supported stats list.
transformer_engine/debug/features/log_nvfp4_tensor_stats.py	Docstring and check_if_stat_is_supported updated to expose scale_inv_min/max/std with optional _columnwise suffix.
tests/pytorch/debug/test_log.py	Adds scale_inv_std to bare_stats and two new microbatch-reduction tests; test tensors are sized to avoid zero-padded scale_inv elements.

_{Reviews (6): Last reviewed commit: "[PyTorch Debug] Drop redundant variance/..." | Re-trigger Greptile}

Parallel-group variance is Sigma n_i*(var_i + (mean_i - mean)^2) / N - the between-group term must be added, not subtracted. Single-group buffers hide the bug (mean_i = mean_global so the term is 0); it surfaces with scale_inv_std reduced across microbatches/ranks, where negative variance flows into sqrt() and yields NaN. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

ptrendx · 2026-06-01T22:47:01Z


    def check_if_stat_is_supported(self, stat: str):
        """Returns True if stat is supported, raises ValueError otherwise."""
+        bare = stat[: -len("_columnwise")] if stat.endswith("_columnwise") else stat


mse_columnwise and underflow%_columnwise should not pass this function but they would with the logic as written.

I added them to exception, even though we should have stats for them I guess, but this is not scope of the PR.

…validation compute_variance/compute_std now combine on M2 with an explicit `unbiased` flag. Previously sample variances (torch.var default, used by variance/std) were fed into a population-style combine, so reductions across microbatches/ranks were silently wrong whenever group means differed. scale_inv_* combinators now pass unbiased=False to match their population feeders. Adds test_stats_computation_microbatch_reduction which sweeps the stats registry and checks every aux-free combinator against its own compute fn over the concatenation. NVFP4 check_if_stat_is_supported no longer accepts mse_columnwise / underflows%_columnwise: only scale_inv_* have a columnwise variant, the others are computed from the single quantized tensor and would crash downstream. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> # Conflicts: # transformer_engine/debug/features/utils/stats_computation.py

…tyle Collapse import/list/call reflows that were split by an accidental black run at the default line length (88) instead of the repo's 100; no functional change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

…d reduction test Route every scale_inv stat (min/max/std/variance/numel/sum) through a shared real_scale_inv() helper that strips the zero padding MXFP8/NVFP4 quantizers add, so padding no longer deflates the mean or corrupts the numel-weighted variance reduction across microbatches/ranks. Add a parametrized test driving the aux-dependent parallel-variance combine for scale_inv_std and checking it against std over the concatenated scale_inv values. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

for more information, see https://pre-commit.ci

…unbiased convention Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

…ch reduction test Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

pggPL · 2026-06-03T12:16:30Z

/te-ci pytorch L1

pggPL and others added 2 commits May 26, 2026 15:08

[pre-commit.ci] auto fixes from pre-commit.com hooks

441a793

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed May 26, 2026

View reviewed changes

Comment thread transformer_engine/debug/features/log_fp8_tensor_stats.py

ptrendx reviewed Jun 1, 2026

View reviewed changes

pggPL and others added 3 commits June 3, 2026 12:03

Merge remote-tracking branch 'upstream/main' into debug_log_fp_fixes

0478e3f

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> # Conflicts: # transformer_engine/debug/features/utils/stats_computation.py

greptile-apps Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread transformer_engine/debug/features/utils/stats_computation.py

pggPL and others added 4 commits June 3, 2026 13:56

[pre-commit.ci] auto fixes from pre-commit.com hooks

97d6d80

for more information, see https://pre-commit.ci

[PyTorch Debug] Document that compute_variance inputs must match the …

4cdf022

…unbiased convention Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

[PyTorch Debug] Drop redundant variance/std sanity assert in microbat…

f211f76

…ch reduction test Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PyTorch Debug] Add scale_inv_std stat and skip NVFP4 layers in LogFp8TensorStats#3044

[PyTorch Debug] Add scale_inv_std stat and skip NVFP4 layers in LogFp8TensorStats#3044
pggPL wants to merge 10 commits into
NVIDIA:mainfrom
pggPL:debug_log_fp_fixes

pggPL commented May 26, 2026 •

edited

Loading

Uh oh!

pggPL commented May 26, 2026

Uh oh!

greptile-apps Bot commented May 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

ptrendx Jun 1, 2026

Uh oh!

pggPL Jun 3, 2026

Uh oh!

Uh oh!

pggPL commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pggPL commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Uh oh!

pggPL commented May 26, 2026

Uh oh!

greptile-apps Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

Uh oh!

ptrendx Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

pggPL Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pggPL commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pggPL commented May 26, 2026 •

edited

Loading

greptile-apps Bot commented May 26, 2026 •

edited

Loading