Skip to content

[PyTorch Debug] Add scale_inv_std stat and skip NVFP4 layers in LogFp8TensorStats#3044

Open
pggPL wants to merge 10 commits into
NVIDIA:mainfrom
pggPL:debug_log_fp_fixes
Open

[PyTorch Debug] Add scale_inv_std stat and skip NVFP4 layers in LogFp8TensorStats#3044
pggPL wants to merge 10 commits into
NVIDIA:mainfrom
pggPL:debug_log_fp_fixes

Conversation

@pggPL
Copy link
Copy Markdown
Collaborator

@pggPL pggPL commented May 26, 2026

Description

Addresses items 1 and 2 from #2801:

1. Adds `scale_inv_std` for every recipe (`fp8_delayed_scaling`, `fp8_current_scaling`,
   `mxfp8`, `fp8_block_scaling`, `nvfp4`) plus `_columnwise` variants. Min/max alone
   can hide serious clipping when scale spread is wide. Reduction across
   microbatches/ranks uses Welford (population variance, so `std=0` instead of NaN
   for the single-scalar delayed/current-scaling case). NVFP4 also gains
   `scale_inv_min` / `scale_inv_max`.

2. `LogFp8TensorStats` no longer crashes on layers using `NVFP4Quantizer` — bare
   stats are filtered with a warning, FP8-recipe-prefixed stats (e.g. `mxfp8_mse`
   for what-if) are preserved. Combined with #2296 and #2652, the BioNeMo ESM2
   YAML workaround referenced in the issue can now be removed.

Additionally, this PR fixes a pre-existing sign error in `compute_variance`
(parallel-axis Welford combination): the formula subtracted `(mean_i - mean)^2`
instead of adding, yielding negative variance whenever >=2 buffer groups had
different means (multi-microbatch or multi-rank reductions). Single-group
buffers were unaffected (`mean_i - mean = 0`), which is why existing `variance`
/ `std` stats never tripped this in tests. With `scale_inv_std` now reduced
across microbatches/ranks the bug becomes user-visible (`sqrt(negative)` -> NaN).

Item 3 from #2801 (per-block dump for MXFP8/NVFP4) is out of scope here.

Fixes #2801 (partial)

## Type of change

- [ ] Documentation change (change only to the documentation, either a fix or a new content)
- [x] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Infra/Build change
- [ ] Code refactoring

## Changes

- `stats_computation.py`: `add_scale_inv_stats` registers `scale_inv_std` plus
  Welford helpers; NVFP4 added to the registration loop.
- `stats_computation.py`: fix sign in `compute_variance` — Welford parallel
  combination needs `+ (mean_i - mean)^2`, not `-`.
- `log_fp8_tensor_stats.py`: `inspect_tensor` filters NVFP4-resolved bare stats
  with a warning instead of raising.
- `log_nvfp4_tensor_stats.py`: docstring + `supported_stats` updated for
  `scale_inv_min` / `scale_inv_max` / `scale_inv_std`.
- `test_log.py`: `scale_inv_std` added to `bare_stats`; numerics test validates
  min/max/std against `torch.std(scale_inv, unbiased=False)`.

# Checklist:

- [x] I have read and followed the [contributing guidelines](https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
- [x] The functionality is complete
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes

pggPL and others added 2 commits May 26, 2026 15:08
…8TensorStats (NVIDIA#2801)

- Register scale_inv_std (plus helper variance/numel/sum buffers using
  Welford reduction) for all FP8 recipes and NVFP4 in
  add_scale_inv_stats. Population variance keeps std=0 for delayed/current
  scaling where scale_inv is a single scalar.
- Also wire scale_inv_min/max/std for NVFP4 (was previously only FP8
  recipes).
- LogFp8TensorStats.inspect_tensor now filters bare stats on NVFP4 layers
  with a warning instead of raising, so dual LogFp8TensorStats +
  LogNvfp4TensorStats configs work with overlapping (or catch-all) layer
  regexes. Recipe-prefixed FP8 stats (e.g. mxfp8_mse) are preserved for
  what-if comparisons.
- Numerics test extended to validate scale_inv_min/max/std against
  torch.std(scale_inv, unbiased=False).

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL
Copy link
Copy Markdown
Collaborator Author

pggPL commented May 26, 2026

/te-ci pytorch

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 26, 2026

Greptile Summary

This PR addresses two items from issue #2801: adding scale_inv_std (population std of per-block scale inverses) across all recipes including NVFP4, and making LogFp8TensorStats gracefully skip NVFP4 layers instead of crashing. It also fixes a pre-existing sign error in the parallel Welford variance combination (compute_variance) that would have produced negative variance in any multi-microbatch or multi-rank scenario where per-group means differed.

  • compute_variance sign fix (stats_computation.py): the combine formula now correctly adds n_i · (mean_i − mean)² to each group's M2; this also changes the combined output of the pre-existing variance/std stats from wrong-population to correct-sample variance in multi-group reductions.
  • scale_inv_std registration (stats_computation.py): variance, numel, and sum helper buffers are registered alongside the user-visible std stat; all helpers correctly use real_scale_inv (zero-filtered) so zero-padded MXFP8/NVFP4 scale_inv slots do not skew the Welford combination.
  • NVFP4 handling (log_fp8_tensor_stats.py, log_nvfp4_tensor_stats.py): bare NVFP4 stats are now dropped with a warning rather than raising; FP8-recipe-prefixed what-if stats (e.g. mxfp8_mse) are preserved.

Confidence Score: 4/5

Safe to merge after the open discussions from earlier review rounds are resolved; the new Welford implementation and zero-filtering are correct.

The Welford combination fix and the real_scale_inv zero-filtering are both mathematically correct and well-tested by the new microbatch-reduction tests. The scale_inv_max compute still bypasses zero-filtering (harmless in practice since real values are positive, but inconsistent with every other scale_inv stat). Both test helpers that build the reference expected value do so from unfiltered scale_inv tensors, which matches the implementation only because the chosen shapes need no padding, making the tests brittle to future shape changes.

stats_computation.py deserves a second look for the scale_inv_max filtering inconsistency; test_log.py reference-value construction in the two new microbatch-reduction tests should be verified against zero-padded shapes.

Important Files Changed

Filename Overview
transformer_engine/debug/features/utils/stats_computation.py Core change: fixes Welford sign error in compute_variance, adds unbiased parameter, and extends add_scale_inv_stats with variance/numel/sum helpers for distributed std reduction.
transformer_engine/debug/features/log_fp8_tensor_stats.py NVFP4 crash fix: inspect_tensor now filters bare stats with a warning instead of raising; adds scale_inv_std to the supported stats list.
transformer_engine/debug/features/log_nvfp4_tensor_stats.py Docstring and check_if_stat_is_supported updated to expose scale_inv_min/max/std with optional _columnwise suffix.
tests/pytorch/debug/test_log.py Adds scale_inv_std to bare_stats and two new microbatch-reduction tests; test tensors are sized to avoid zero-padded scale_inv elements.

Reviews (6): Last reviewed commit: "[PyTorch Debug] Drop redundant variance/..." | Re-trigger Greptile

Comment thread transformer_engine/debug/features/log_fp8_tensor_stats.py
Parallel-group variance is Sigma n_i*(var_i + (mean_i - mean)^2) / N -
the between-group term must be added, not subtracted. Single-group
buffers hide the bug (mean_i = mean_global so the term is 0); it
surfaces with scale_inv_std reduced across microbatches/ranks, where
negative variance flows into sqrt() and yields NaN.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

def check_if_stat_is_supported(self, stat: str):
"""Returns True if stat is supported, raises ValueError otherwise."""
bare = stat[: -len("_columnwise")] if stat.endswith("_columnwise") else stat
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mse_columnwise and underflow%_columnwise should not pass this function but they would with the logic as written.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added them to exception, even though we should have stats for them I guess, but this is not scope of the PR.

pggPL and others added 3 commits June 3, 2026 12:03
…validation

compute_variance/compute_std now combine on M2 with an explicit `unbiased` flag.
Previously sample variances (torch.var default, used by variance/std) were fed into
a population-style combine, so reductions across microbatches/ranks were silently
wrong whenever group means differed. scale_inv_* combinators now pass unbiased=False
to match their population feeders. Adds test_stats_computation_microbatch_reduction
which sweeps the stats registry and checks every aux-free combinator against its own
compute fn over the concatenation.

NVFP4 check_if_stat_is_supported no longer accepts mse_columnwise /
underflows%_columnwise: only scale_inv_* have a columnwise variant, the others are
computed from the single quantized tensor and would crash downstream.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

# Conflicts:
#	transformer_engine/debug/features/utils/stats_computation.py
…tyle

Collapse import/list/call reflows that were split by an accidental black run at
the default line length (88) instead of the repo's 100; no functional change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Comment thread transformer_engine/debug/features/utils/stats_computation.py
pggPL and others added 4 commits June 3, 2026 13:56
…d reduction test

Route every scale_inv stat (min/max/std/variance/numel/sum) through a shared
real_scale_inv() helper that strips the zero padding MXFP8/NVFP4 quantizers add,
so padding no longer deflates the mean or corrupts the numel-weighted variance
reduction across microbatches/ranks. Add a parametrized test driving the
aux-dependent parallel-variance combine for scale_inv_std and checking it
against std over the concatenated scale_inv values.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…unbiased convention

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
…ch reduction test

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL
Copy link
Copy Markdown
Collaborator Author

pggPL commented Jun 3, 2026

/te-ci pytorch L1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants