[PyTorch] Support `GroupedTensor` torch ops for DDP and distributed optimizer by ksivaman · Pull Request #2736 · NVIDIA/TransformerEngine

ksivaman · 2026-03-05T04:33:56Z

Description

As a follow-up to #2731, this PR adds support for the specific operations required for end-to-end execution using GroupedTensor. It also makes some minor optimizations and cleanups.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Implement expand, expand_as, and view for GroupedTensor.
Calculate tensor strides in C++ in order to avoid high CPU overhead.
Add requires_grad option during initialization.
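
To make the stride item above concrete, here is a minimal Python sketch of deriving row-major (contiguous) strides from a shape. The name `contiguous_strides` is hypothetical; the helper added by this PR is written in C++ and may differ in details, for example in how it treats zero-sized dimensions.

```python
from typing import Tuple


def contiguous_strides(shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Return row-major strides (in elements) for a tensor of the given shape.

    Hypothetical illustration only, not the PR's C++ implementation.
    """
    strides = [1] * len(shape)
    # Walk from the second-to-last dimension backwards: each stride is the
    # product of all dimension sizes to its right.
    for i in range(len(shape) - 2, -1, -1):
        # Clamp sizes to 1 so zero-sized dimensions do not zero out the
        # strides to their left (assumed here to match PyTorch's convention).
        strides[i] = strides[i + 1] * max(shape[i + 1], 1)
    return tuple(strides)


# A 2-D logical shape such as (first_dim, last_dim) = (4, 8):
print(contiguous_strides((4, 8)))  # (8, 1)
```

Doing this once at construction time, outside Python, avoids repeating the loop for every tensor wrapper that is created.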

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

ksivaman · 2026-03-05T04:35:07Z

/te-ci pytorch

greptile-apps · 2026-03-05T04:39:30Z

Greptile Summary

This PR extends GroupedTensor to support the subset of torch ops required by DDP and the distributed optimizer: identity expand/expand_as, flat view(-1), and proper detach/alias semantics. It also pre-computes contiguous strides in C++ (stride_from_shape) to reduce Python-side CPU overhead during tensor construction, and refactors GroupedTensorStorage.__init__ into a static _initialize_storage_fields helper so both the C++ construction path and the Python copy path share a single field-population routine.

The implementation is correct and safe for the documented DDP and distributed-optimizer use cases. The expand.default and expand_as.default handlers correctly implement identity operations via dedicated dispatch logic, stride passed from C++ is properly validated and consumed, and _GroupedIdentityFunc is well-scoped for hook plumbing. The C++ stride_from_shape helper is straightforward and correctly propagated to all quantizer creation sites.
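
As an illustration of what an "identity" expand means, the sketch below checks whether a requested expand shape leaves a tensor's shape unchanged. It assumes the handler only accepts requests that match the logical shape, with `-1` meaning "keep this dimension" as in `torch.Tensor.expand`. The function name `is_identity_expand` is hypothetical and does not appear in the PR.

```python
from typing import Sequence, Tuple


def is_identity_expand(shape: Tuple[int, ...], requested: Sequence[int]) -> bool:
    """Return True if expanding `shape` to `requested` would change nothing.

    Hypothetical helper for illustration; not taken from the PR.
    """
    # Extra leading dimensions would add broadcast axes, so ranks must match.
    if len(requested) != len(shape):
        return False
    # Each requested size must equal the current size, or be -1 ("unchanged").
    return all(r == -1 or r == s for r, s in zip(requested, shape))


print(is_identity_expand((4, 8), (4, 8)))   # True
print(is_identity_expand((4, 8), (-1, 8)))  # True
print(is_identity_expand((1, 8), (4, 8)))   # False: a real broadcast
```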

Confidence Score: 4/5

Safe to merge. Core functionality (identity expand, view(-1), detach, alias) is correctly implemented for DDP and distributed-optimizer workflows.
The PR implements well-defined operations with clear scope: identity expand/expand_as handlers are explicit and correct, view(-1) intentionally returns flat backing storage for optimizer flattening, and detach/alias semantics preserve metadata. The C++ stride optimization is straightforward. All field initialization and storage propagation is consistent. The implementation is focused, well-commented, and addresses the documented use case without introducing regressions.
No files require special attention. All changes are safe and focused on the documented use case.

Sequence Diagram

sequenceDiagram
    participant DDP as DDP / DistOptimizer
    participant GT as GroupedTensor
    participant Dispatch as __torch_dispatch__
    participant GIF as _GroupedIdentityFunc

    DDP->>GT: param.expand_as(param)
    GT->>GT: expand_as() override\n(other is self)
    GT->>GIF: _GroupedIdentityFunc.apply(self)
    GIF->>GT: forward: tensor.detach()
    GT->>Dispatch: detach.default
    Dispatch->>GT: make_wrapper_like(src, requires_grad=False)
    GT-->>GIF: detached GroupedTensor
    GIF-->>DDP: GroupedTensor with grad_fn (identity)

    DDP->>GT: param.detach().view(-1)
    GT->>Dispatch: detach.default
    Dispatch->>GT: make_wrapper_like(src, requires_grad=False)
    GT-->>DDP: detached GroupedTensor
    DDP->>GT: view(-1)
    GT->>Dispatch: view.default / _unsafe_view.default
    Dispatch->>GT: rowwise_data.view(-1)
    GT-->>DDP: flat 1-D tensor (raw backing storage)

Last reviewed commit: d6b758e

zhongbozhu

LGTM

zhongbozhu · 2026-03-05T04:49:13Z

+    return tuple(stride)
+
+
+class _GroupedIdentityFunc(torch.autograd.Function):


Why do we need this?

The Megatron-Core (mcore) distributed optimizer calls weight.expand_as(param) to force an autograd graph edge. This class is needed to implement that safely for GroupedTensor.
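
For readers unfamiliar with the pattern, here is a minimal sketch of an identity autograd function on plain `torch.Tensor` objects. It is not the PR's `_GroupedIdentityFunc`; the class name `IdentitySketch` is made up, and the example assumes only that the forward pass returns a detached alias and the backward pass forwards the gradient unchanged, as the sequence diagram above suggests.

```python
import torch


class IdentitySketch(torch.autograd.Function):
    """Hypothetical identity op: same data forward, same gradient backward."""

    @staticmethod
    def forward(ctx, tensor):
        # Return an alias that shares storage with the input. Autograd
        # attaches this Function's backward node to the output.
        return tensor.detach()

    @staticmethod
    def backward(ctx, grad_output):
        # Identity: pass the incoming gradient straight through.
        return grad_output


param = torch.ones(2, 3, requires_grad=True)
alias = IdentitySketch.apply(param)

# The alias is a non-leaf tensor with a grad_fn, so a graph edge back to
# `param` now exists and hooks can be attached along it.
assert alias.grad_fn is not None

alias.sum().backward()
print(param.grad)  # gradient of sum() w.r.t. param, routed through the identity
```

The point of the extra node is that the result has a `grad_fn` connected to the parameter, which is what code that walks the autograd graph from `expand_as` output relies on.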

zhongbozhu · 2026-03-05T04:50:21Z

  py::tuple args(0);
-  kwargs["shape"] = py::cast(std::vector<int64_t>{static_cast<int64_t>(logical_first_dim),
-                                                  static_cast<int64_t>(logical_last_dim)});
+  const std::vector<int64_t> grouped_shape = {static_cast<int64_t>(logical_first_dim),


Why do we need to specify that it's a grouped shape?

Since we subclass a torch.Tensor, what is the shape field of a grouped tensor?

The shape of the GroupedTensor is the logical shape.


zhongbozhu

Pending CI to merge

vthumbe1503 · 2026-03-05T05:47:14Z



-# For now, conservatively ban all shape manipulating ops.
+def _stride_from_shape(shape: Tuple[int, ...]) -> Tuple[int, ...]:


I think this function is already defined in a utils module that you can reuse.

vthumbe1503 · 2026-03-05T06:10:21Z

+            assert isinstance(src, GroupedTensor)
+            expanded_shape = tuple(args[1])


I am curious why all this torch dispatch logic is not needed for the MXFP8 tensor. That is, how does DDP work with discrete MXFP8 weights if MCore uses all these ops?

Same question for expand_as, view, unsafe_view

ksivaman added 2 commits March 5, 2026 04:00

Fix e2e execution of GroupedTensor in distributed settings

5222d75

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

Minor fixes

02c732d

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

ksivaman requested review from vthumbe1503 and zhongbozhu March 5, 2026 04:33

ksivaman added the MoE label Mar 5, 2026

greptile-apps Bot reviewed Mar 5, 2026

Comment thread transformer_engine/pytorch/tensor/grouped_tensor.py

Comment thread transformer_engine/pytorch/tensor/grouped_tensor.py

Comment thread transformer_engine/pytorch/tensor/storage/grouped_tensor_storage.py

zhongbozhu previously approved these changes Mar 5, 2026

fix

63dd8e5

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

ksivaman dismissed zhongbozhu’s stale review via 63dd8e5 March 5, 2026 05:02

Update transformer_engine/pytorch/tensor/storage/grouped_tensor_stora…

7f2c127

…ge.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

greptile-apps Bot reviewed Mar 5, 2026

Comment thread transformer_engine/pytorch/tensor/storage/grouped_tensor_storage.py Outdated

fix greptile commit

d6b758e

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

zhongbozhu approved these changes Mar 5, 2026

vthumbe1503 reviewed Mar 5, 2026

vthumbe1503 approved these changes Mar 5, 2026

ksivaman merged commit d9152b0 into NVIDIA:main Mar 5, 2026
9 of 12 checks passed

ksivaman deleted the fix_e2e_grouped_tensor branch March 5, 2026 07:57

ksivaman mentioned this pull request Mar 9, 2026

[PyTorch] Enable single grouped tensor weight in GroupedLinear #2463

Closed
