[PyTorch] Add op-level activation offload opt-out API#3108
Signed-off-by: hongbinl <hongbinl@nvidia.com>
Greptile Summary
This PR adds a per-op activation offload opt-out API to
Confidence Score: 5/5
The change is safe to merge. All production call sites remain guarded by is_cpu_offload_enabled(), the V2 _TE_do_not_offload flag is set before tensors reach the fuser's save_for_backward hook, and the V1 path correctly receives offload=False through mark_not_offload.
The core logic in op.py is correct and None-safe. The fused-op wiring in forward_grouped_mlp.py correctly delegates per-op marking before tensors are collected by the fuser. The only finding is a minor inefficiency where opted-out tensors still receive a start_offload CUDA event that is immediately discarded; this does not affect correctness or memory safety.
transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py: the start_offload call could be made to exclude opted-out tensors, but the current behavior is correct.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant BasicOperation
participant FusedOp as FusedOperation
participant Fuser as OperationFuser
participant CPUOffload as cpu_offload
User->>BasicOperation: set_activation_offloading(False)
Note over BasicOperation: activation_offloading = False
rect rgb(240, 248, 255)
Note over FusedOp: Forward pass
FusedOp->>CPUOffload: start_offload(fc1_x, act_in, fc2_x)
CPUOffload-->>FusedOp: records CUDA events on tensors
FusedOp->>BasicOperation: fc1_op.maybe_mark_activation_offload(fc1_x)
alt "activation_offloading == True"
BasicOperation->>CPUOffload: mark_activation_offload(fc1_x)
else "activation_offloading == False"
BasicOperation->>CPUOffload: mark_not_offload(fc1_x)
CPUOffload-->>BasicOperation: "sets _TE_do_not_offload=True on components"
end
FusedOp->>Fuser: "ctx.to_save = tensors"
end
rect rgb(255, 248, 240)
Note over Fuser: Fuser collects and saves
Fuser->>CPUOffload: "prepare_for_saving(*to_save)"
CPUOffload-->>Fuser: decomposed component tensors
Fuser->>Fuser: "func_ctx.save_for_backward(*components)"
Fuser->>CPUOffload: push_tensor(fc1_component)
CPUOffload-->>Fuser: "_TE_do_not_offload=True, return tensor (not offloaded)"
Fuser->>CPUOffload: push_tensor(act_component)
CPUOffload-->>Fuser: offloaded, return index
end
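The flow in the diagram can be approximated with a small pure-Python model. This is a toy sketch, not Transformer Engine's implementation: `ToyTensor`, `ToyOp`, `ToyOffloadHandler` and the `offload_marked` attribute are hypothetical stand-ins, and only the names `set_activation_offloading`, `maybe_mark_activation_offload`, `mark_activation_offload`, `mark_not_offload`, `push_tensor` and `_TE_do_not_offload` are taken from the diagram above.

```python
class ToyTensor:
    """Stand-in for a saved activation; not a torch.Tensor."""

    def __init__(self, name):
        self.name = name
        self.offload_marked = False      # hypothetical flag for the opt-in path
        self._TE_do_not_offload = False  # flag name taken from the diagram


def mark_activation_offload(*tensors):
    # Toy version: tag tensors as offload candidates, skipping None entries.
    for t in tensors:
        if t is not None:
            t.offload_marked = True


def mark_not_offload(*tensors):
    # Toy version: tag tensors so the handler keeps them on the device.
    for t in tensors:
        if t is not None:
            t._TE_do_not_offload = True


class ToyOp:
    """Hypothetical op carrying a per-op offload policy."""

    def __init__(self):
        self.activation_offloading = True  # assumed default: offload allowed

    def set_activation_offloading(self, enabled):
        self.activation_offloading = bool(enabled)

    def maybe_mark_activation_offload(self, *tensors):
        if self.activation_offloading:
            mark_activation_offload(*tensors)
        else:
            mark_not_offload(*tensors)


class ToyOffloadHandler:
    """Hypothetical handler mimicking the push_tensor branch in the diagram."""

    def __init__(self):
        self.cpu_copies = []

    def push_tensor(self, tensor):
        if tensor._TE_do_not_offload:
            return tensor                # opted out: hand the tensor back as-is
        self.cpu_copies.append(tensor)   # otherwise "offload" and return an index
        return len(self.cpu_copies) - 1


fc1_op, act_op = ToyOp(), ToyOp()
fc1_op.set_activation_offloading(False)  # opt fc1 out; act keeps the default

fc1_x, act_in = ToyTensor("fc1_x"), ToyTensor("act_in")
fc1_op.maybe_mark_activation_offload(fc1_x)
act_op.maybe_mark_activation_offload(act_in)

handler = ToyOffloadHandler()
saved = [handler.push_tensor(t) for t in (fc1_x, act_in)]
```

In this model the opted-out tensor comes back unchanged while the other is replaced by an index, which mirrors the two `push_tensor` replies in the diagram.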
Reviews (5): Last reviewed commit: "Patch activation offload test bound symb..."
from ..cpu_offload import (  # pylint: disable=import-outside-toplevel
    mark_activation_offload,
    mark_not_offload,
    start_offload,
)
Why can't you import it once (e.g. at the top of this file)? There is a non-zero CPU overhead from importing the already-imported module.
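The overhead mentioned here can be measured with a small stand-alone script. This is only an illustrative sketch using the standard-library `math` module in place of `cpu_offload`; the function names are hypothetical, and the absolute timings depend on the machine and Python version. Function-level imports are sometimes used to avoid circular imports; whether that applies to this file is not stated in the thread.

```python
import math  # hoisted: resolved once when this module is loaded
import timeit


def sqrt_local_import(x):
    # The import statement runs on every call, even though the module is
    # already cached in sys.modules.
    from math import sqrt
    return sqrt(x)


def sqrt_hoisted(x):
    # Uses the module-level import; no import machinery on the call path.
    return math.sqrt(x)


local_s = timeit.timeit(lambda: sqrt_local_import(2.0), number=100_000)
hoisted_s = timeit.timeit(lambda: sqrt_hoisted(2.0), number=100_000)
print(f"local import: {local_s:.4f}s, hoisted: {hoisted_s:.4f}s")
```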
if mark:
    mark_activation_offload(*tensors)
This would potentially mark the tensors multiple times, as all call sites just leave the default value here. Why do you combine these two functions rather than keeping them as two separate functions?
Good advice! We should leave start_offload() as it is.
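The double-marking concern can be illustrated with a toy counter. This is a sketch under assumptions: the functions below are simplified stand-ins that operate on tensor names, `start_offload_combined` is a hypothetical name for the combined variant with a `mark=True` default, and nothing here reflects the real function bodies in `cpu_offload`.

```python
from collections import Counter

mark_calls = Counter()  # how many times each tensor name has been marked


def mark_activation_offload(*names):
    # Toy marking: just count how often each name is marked.
    for name in names:
        mark_calls[name] += 1


def start_offload(*names):
    # Toy no-op; per the diagram, the real call records CUDA events on tensors.
    pass


def start_offload_combined(*names, mark=True):
    # Hypothetical combined variant: starting also marks unless told otherwise.
    start_offload(*names)
    if mark:
        mark_activation_offload(*names)


# Combined variant: the call site leaves mark at its default, and the
# op-level policy marks the same tensor again afterwards.
start_offload_combined("fc1_x")
mark_activation_offload("fc1_x")
combined_marks = mark_calls["fc1_x"]

# Separate functions: start_offload never marks, so marking happens once.
mark_calls.clear()
start_offload("fc1_x")
mark_activation_offload("fc1_x")
separate_marks = mark_calls["fc1_x"]
```

Keeping `start_offload()` free of marking leaves a single place where the offload decision is made.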
This is to support selective offloading for Nemotron model training. With the fused MLP, MCore doesn't know which tensor comes from expert_fc1, moe_act, or expert_fc2, so we need to expose an API to manually set the offloading strategy for different ops. cc @timmoon10
Ok, but does it actually matter to you that a specific tensor gets offloaded, rather than that the right amount of data gets offloaded?
Different activation tensors hold different amounts of data, so selectively offloading activations is how we control the amount of data that gets offloaded. If we offload too much, the offloading latency can be exposed; if we offload too little, we hit OOM. I think the essence of fine-grained offloading is to let users control the offloading of each tensor separately, so that we can make better perf tradeoffs.
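A back-of-the-envelope model of this tradeoff, as a sketch only: every number below (tensor sizes, transfer bandwidth, overlap window, GPU headroom) is made up for illustration, and `classify` is a hypothetical helper rather than anything in TE or MCore. Only the tensor names come from the diagram above.

```python
# Hypothetical saved-activation sizes per layer, in MiB.
activation_mib = {"fc1_x": 512, "act_in": 2048, "fc2_x": 1024}

BANDWIDTH_MIB_PER_MS = 20.0  # assumed device-to-host copy bandwidth
OVERLAP_WINDOW_MS = 100.0    # assumed compute time available to hide copies
GPU_HEADROOM_MIB = 2048      # assumed memory left for non-offloaded tensors


def classify(opted_out):
    """Classify an opt-out set as 'exposed latency', 'OOM risk', or 'ok'."""
    offloaded = sum(s for n, s in activation_mib.items() if n not in opted_out)
    resident = sum(s for n, s in activation_mib.items() if n in opted_out)
    if resident > GPU_HEADROOM_MIB:
        return "OOM risk"          # too little offloaded: activations do not fit
    if offloaded / BANDWIDTH_MIB_PER_MS > OVERLAP_WINDOW_MS:
        return "exposed latency"   # too much offloaded: copies outlast compute
    return "ok"


for opted_out in (set(), set(activation_mib), {"act_in"}):
    print(sorted(opted_out), "->", classify(opted_out))
```

With these made-up numbers, only a per-tensor choice (keeping `act_in` resident) lands between the two failure modes, which is the kind of control the op-level opt-out is meant to provide.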
Summary
Follow-up to #3047.
This PR adds an op-level activation offload policy for saved activation tensors so downstream fused grouped MLP users can opt individual TE ops out of activation offloading without changing the surrounding CPU offload context.
Testing