[PyTorch] Debug CPU offloading in grouped linear and grouped MLP#3047
Conversation
dba3531 to
53e6511
Compare
Greptile SummaryThis PR wires CPU activation offloading into the fused grouped-MLP forward path and into the standalone
Confidence Score: 3/5The fused-MLP offload path is largely correct, but the standalone GroupedLinear grouped-tensor path has an incomplete offload that silently fails to schedule columnwise activations for CPU transfer in the non-V1 path. The transformer_engine/pytorch/ops/basic/grouped_linear.py — the _fuser_forward_grouped_tensor method needs input_quantizer.internal = True before tex.group_quantize, mirroring what forward_grouped_mlp.py already does for fc1_input_quantizer. Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant ForwardGroupedMLP
participant GroupedLinear
participant cpu_offload
Caller->>ForwardGroupedMLP: fuser_forward(input_)
ForwardGroupedMLP->>ForwardGroupedMLP: "fc1_input_quantizer.internal = True"
ForwardGroupedMLP->>ForwardGroupedMLP: group_quantize → GroupedTensorStorage (fc1)
ForwardGroupedMLP->>ForwardGroupedMLP: FC1 GEMM + activation
ForwardGroupedMLP->>ForwardGroupedMLP: group_quantize → grouped_fc2_x (GroupedTensor, NVFP4)
ForwardGroupedMLP->>ForwardGroupedMLP: FC2 GEMM
ForwardGroupedMLP->>cpu_offload: start_offload(grouped_fc1_x, activation_in, grouped_fc2_x)
cpu_offload->>cpu_offload: prepare_for_saving → decompose GroupedTensorStorage
cpu_offload->>cpu_offload: mark component tensors with start_reload_event
cpu_offload->>cpu_offload: restore_from_saved
ForwardGroupedMLP->>cpu_offload: mark_activation_offload(...) [V1 path only]
cpu_offload->>cpu_offload: "get_data_tensors() → set activation_offloading=True"
ForwardGroupedMLP->>ForwardGroupedMLP: save_for_backward(grouped_fc1_x, grouped_fc2_x, ...)
Note over GroupedLinear: standalone (non-fused) path
Caller->>GroupedLinear: _fuser_forward_grouped_tensor
GroupedLinear->>GroupedLinear: "tex.group_quantize → GroupedTensor (no internal=True!)"
GroupedLinear->>cpu_offload: start_offload(grouped_x)
cpu_offload->>cpu_offload: "isinstance(GroupedTensor, Tensor)=True → plain-tensor branch"
cpu_offload->>cpu_offload: start_reload_event on wrapper only (columnwise_data NOT marked)
Reviews (14): Last reviewed commit: "Construct internal grouped tensors withi..." | Re-trigger Greptile |
2c59510 to
6e01d0a
Compare
There was a problem hiding this comment.
Overall looks good, but with one design suggestion.
Followup tasks after merging this PR:
- Enable activation checkpointing in the unfused grouped linear op.
- Update activation checkpointing to support v2 infrastructure from #1762, which is opt-out rather than opt-in.
cherry-pick NVIDIA#3047
6f5ef0a to
c14deb7
Compare
|
/te-ci pytorch L1 |
c14deb7 to
f25a1f5
Compare
| no_offload_fc1_activation = bool(getattr(fc1_op, "no_offload_activation", False)) | ||
| no_offload_moe_activation = bool(getattr(activation_op, "no_offload_activation", False)) |
There was a problem hiding this comment.
Fine, but we should avoid relying on these in Mcore. They should be considered as debugging tools and we don't make guarantees to maintain them.
If we do want to control this in Mcore, we should do it through proper, user-facing APIs. We can add it as an arg to the unfused ops, but we should also make sure the unfused ops actually respect it too.
There was a problem hiding this comment.
Yea, I also thought about the same thing, but didn't want to add too many changes. Now I add offload_activation for all related modules, now the consistence would be better.
There was a problem hiding this comment.
I don't like that this is ballooning the scope of this PR and adds a maintenance burden to so many ops. It just makes it that much more painful to implement a new fusion. I've reverted the new arg so that this PR just fixes the offloading bug in grouped MLP.
For a more general solution to selective offloading, I wonder if we can expand the existing APIs for CPU offloading. If you want to explicitly disable in one layer, what if we allowed nesting like this:
with get_cpu_offload_context(...):
x = layer0(x) # CPU offloading
with get_cpu_offload_context(enabled=False, ...):
x = layer1(x) # No CPU offloading
x = layer2(x) # CPU offloadingThis would work automatically, with no extra logic needed in every op.
| and isinstance(input_quantizer, NVFP4Quantizer) | ||
| ): | ||
| grouped_fc1_x = input_ | ||
| grouped_fc1_x = GroupedTensorStorage( |
There was a problem hiding this comment.
Now that we set the input quantizer with .internal = True, isn't it redundant to repack grouped_fc1_x into a GroupedTensorStorage?
There was a problem hiding this comment.
I think they are targeting different cases. The input quantizer with .interal= True```` takes effects on bf16 input, where we need to quantize it by fc1_input_quantizer. The second case is that the input is already a quantized fp8 tensor, where we need to repack it into a GroupedTensorStorage```.
There was a problem hiding this comment.
How could it already be quantized? The only way to create it is from the quantizer, either just now or from a previous step (e.g. with activation recompute). If the quantizer has .internal=True, it can only be GroupedTensorStorage.
If something is incorrectly producing GroupedTensor, then that's a bug. Fixing it here is papering over the real problem.
Actually, on second thought, it makes sense that input_ can be a GroupedTensor since it comes from outside the op. It would be useful to know what use-case hit this bug though. Activation recompute?
Really the root cause is that CPU offloading doesn't handle GroupedTensor gracefully, but that would be a more involved effort.
Signed-off-by: hongbinl <hongbinl@nvidia.com>
Signed-off-by: hongbinl <hongbinl@nvidia.com>
Signed-off-by: hongbinl <hongbinl@nvidia.com>
07f4c97 to
933d64b
Compare
- Revert per-module offload_activation API added in commits 376d28c and 933d64b; that belongs in a separate PR. - ops/basic/grouped_linear: add start_offload on input tensors before the GEMM, and mark_activation_offload / mark_not_offload in fuser_forward_save_ctx for both the split-quantize and grouped-tensor paths. - ops/fused/forward_grouped_mlp: remove no_offload_activation attribute lookups and the activation mark_not_offload calls that gated on them; add start_offload + mark_activation_offload for all saved activation tensors (grouped_fc1_x, activation_in, saved_grouped_fc2_x) and keep mark_not_offload only for weight tensors. Document why grouped_fc1_x is repacked into GroupedTensorStorage. - ops/basic/basic_linear: no change needed beyond the existing mark_activation_offload — unlike te.Linear there is no persistent weight cache, so the quantized weight workspace can be freely offloaded. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Tim Moon <tmoon@nvidia.com>
|
/te-ci pytorch |
for more information, see https://pre-commit.ci
GroupedTensor should only be used when exposed externally. Otherwise GroupedTensorStorage has less CPU overhead. There also seems to be some issue with CPU offloading that has not yet been root-caused. Signed-off-by: Tim Moon <tmoon@nvidia.com>
|
/te-ci pytorch |
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: