[ExecuTorch][WebGPU] Dynamic resize hooks for add and mul · Pull Request #20577 · pytorch/executorch

ghost · 2026-06-28T16:22:23Z

Stack from ghstack (oldest at bottom):

Make the elementwise add and mul ops serve any live shape from one graph.

Problem: aten.add.Tensor and aten.mul.Tensor baked their element count + param UBO(s) + output shape at build() for the max shape. On a dynamic-shape graph at a smaller live shape they would over-dispatch and leave the output sized at the max.

Solution:

Before: one fixed dispatch sized for the build-time shape.
After: each registers a resize hook on BOTH operands (the dynamic one may be either operand by arg order). The hook recomputes the live element count, rewrites the param UBO(s), updates the dispatch workgroup_count_x, and sets the output cur_dims. Inert until an operand is resized.
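A minimal sketch of what such a hook recomputes, for readers who have not seen the backend. The types `ElementwiseParams` and `Dispatch`, the function `on_operand_resized`, and the workgroup size of 64 are hypothetical stand-ins, not the backend's real definitions; the actual hook also writes the params to the GPU uniform buffer, which is omitted here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the op's param UBO contents and its dispatch record.
struct ElementwiseParams {
  uint32_t num_elements = 0;
};
struct Dispatch {
  uint32_t workgroup_count_x = 0;
};

// Assumed workgroup size; the real shader may use a different value.
constexpr uint32_t kWorkgroupSize = 64;

// Product of all dims, i.e. the live element count.
inline uint64_t numel(const std::vector<int64_t>& dims) {
  uint64_t n = 1;
  for (int64_t d : dims) {
    assert(d >= 0);
    n *= static_cast<uint64_t>(d);
  }
  return n;
}

// What the resize hook would recompute when an operand's live shape changes:
// the element count, the 1D workgroup count, and the output's current dims.
inline void on_operand_resized(
    const std::vector<int64_t>& live_dims,
    ElementwiseParams& params,
    Dispatch& dispatch,
    std::vector<int64_t>& out_cur_dims) {
  const uint64_t n = numel(live_dims);
  assert(n <= UINT32_MAX);
  params.num_elements = static_cast<uint32_t>(n);
  // Ceiling division so a partial final workgroup is still dispatched.
  dispatch.workgroup_count_x =
      static_cast<uint32_t>((n + kWorkgroupSize - 1) / kWorkgroupSize);
  out_cur_dims = live_dims;
}
```

Under these assumptions, a live shape of [1, 5, 64] needs 5 workgroups, where a build-time shape of [1, 128, 64] would have dispatched 128.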

Implementation:

add: out follows the larger operand (robust when one input is a static residual and the other is the dynamic-S tensor); rewrites AddParams.
mul: recomputes the broadcast output shape and rebuilds all three TensorMeta UBOs via fill_tensor_meta_broadcast.
Each keeps its uniform buffer(s) alive via own_uniform_buffer instead of releasing at build.
Mirrors Vulkan per-op resize_*_node (recompute sizes + dispatch each execute).

Constraints: Behavior-neutral on static graphs (the hook fires only when an operand's live shape differs from the max). No kernel/WGSL/numerics change.

Co-authored-with: Claude Code.
@exported-using-ghexport

Differential Revision: D109906093

[ghstack-poisoned]

pytorch-bot · 2026-06-28T16:22:27Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20577

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit e85fad2 with merge base 124b677:

NEW FAILURES - The following jobs have failed:

pull / test-arm-backend-no-driver (test_pytest_ops_tosa) / linux-job (gh)
RuntimeError: Command docker exec -t e59c7216df639173da38821b24cd9bad10c0e1147ad33a6798d4898ce56ad89b /exec failed with exit code 1
pull / test-binary-size-linux-gcc / linux-job (gh)
RuntimeError: Command docker exec -t 9e6f8bd0ebc186487ebbcf1e7c9169abdf1fe3a06eee98afa65336c87f601bb6 /exec failed with exit code 1

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / test-qnn-testsuite-linux / test-backend-linux (qnn, models) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-28T16:23:05Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

ghost · 2026-06-29T20:22:20Z

@claude review and check for any areas or opportunities for modularization

claude · 2026-06-29T20:23:03Z

Claude finished @JulianCloudNTH's task in 2m 16s.

Review: Dynamic resize hooks for `add` and `mul`

Read both changed files in full
Inspect WebGPUGraph resize-hook API + shared helpers (TensorMeta.h, WebGPUUtils.h)
Compare add vs mul resize logic for correctness
Identify modularization opportunities

The change is correct and well-scoped. The hooks key off cur_dims and set_cur_dims only re-dirties on an actual change, so the static path stays inert exactly as advertised, and propagate_resize converges. Ownership handoff (own_uniform_buffer instead of wgpuBufferRelease) is right — the dtor frees them, and the bind group still holds its own ref. The pattern matches the existing rms_norm resize hook, so it reads consistently with the rest of the backend.

A few correctness notes and the modularization opportunities you asked about.

Correctness / behavior notes

1. Both operands dirty → hook fires twice in one pass. Both add and mul register the same closure on in1_id and in2_id. In propagate_resize (WebGPUGraph.cpp:205), a single pass iterates all hooks against the processing set, so if both operands are resized in the same pass the closure runs twice (two redundant wgpuQueueWriteBuffer calls plus a workgroup recompute). The hook is idempotent, so results are correct; the second run is just wasted work. In practice usually only one operand is dynamic, so this rarely triggers, but it is worth a one-line comment acknowledging it.
2. add output = larger-by-numel operand is a heuristic, not broadcasting. The add kernel is pure elementwise over num_elements with no per-tensor stride/broadcast metadata, so add cannot broadcast mismatched shapes (e.g. [3,1] + [1,4]); only the "one operand is the full shape, the other is equal or a scalar-ish subset" case works. Picking n2 > n1 ? d2 : d1 preserves the static build's behavior, so this PR doesn't regress anything, but it inherits that latent limitation. Fine to leave; flagging so it's a conscious choice.
3. Uninitialized WebGPUTensor temporaries in the mul hook (mul/BinaryOp.cpp:187-190). ta/tb/to are default-constructed and only .dims is set; fill_tensor_meta_broadcast reads only .dims, so this is safe today, but it is an implicit coupling to that function's internals. See the overload suggestion below, which removes the temporaries entirely.
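To make the add limitation concrete, here is a small sketch contrasting the larger-by-numel rule with right-aligned broadcasting. The function names `add_out_dims` and `broadcast_dims` are illustrative only and are not taken from the PR; the numel rule follows the `n2 > n1 ? d2 : d1` expression quoted above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Dims = std::vector<int64_t>;

inline int64_t numel(const Dims& d) {
  int64_t n = 1;
  for (int64_t v : d) {
    n *= v;
  }
  return n;
}

// The rule described for add: the output takes the shape of whichever
// operand has more elements (ties go to the first operand).
inline Dims add_out_dims(const Dims& d1, const Dims& d2) {
  return numel(d2) > numel(d1) ? d2 : d1;
}

// Right-aligned broadcast: missing leading dims count as 1, and each output
// dim is the larger of the two extents.
inline Dims broadcast_dims(const Dims& a, const Dims& b) {
  const size_t r = std::max(a.size(), b.size());
  Dims out(r, 1);
  for (size_t i = 0; i < r; ++i) {
    const int64_t av = (i + a.size() < r) ? 1 : a[i - (r - a.size())];
    const int64_t bv = (i + b.size() < r) ? 1 : b[i - (r - b.size())];
    assert(av == bv || av == 1 || bv == 1);  // shapes must be compatible
    out[i] = std::max(av, bv);
  }
  return out;
}
```

For [1,7,64] and [1,1,64] both rules agree on [1,7,64]. For [3,1] and [1,4] the numel rule picks [1,4] (4 elements), while true broadcasting yields [3,4] (12 elements), which is the mismatch the note describes.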

Modularization opportunities

These two *_impl functions now share three reusable primitives that don't currently exist as helpers:

(a) Broadcast output-shape computation. The right-alignment loop in mul/BinaryOp.cpp:178-184 is a standalone PyTorch-broadcast rule with no dependency on WebGPU types. It belongs next to fill_tensor_meta_broadcast in TensorMeta.h:

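// TensorMeta.h
#include <algorithm> // std::max
#include <cstddef>   // size_t
#include <cstdint>   // int64_t
#include <vector>

// PyTorch-style broadcast: right-align both shapes, treat missing leading
// dims as 1, and take the larger extent per dim. Assumes the two shapes are
// already broadcast-compatible; this helper does not validate them.
inline std::vector<int64_t> broadcast_out_dims(
    const std::vector<int64_t>& a, const std::vector<int64_t>& b) {
  const size_t r = std::max(a.size(), b.size());
  std::vector<int64_t> out(r, 1);
  for (size_t i = 0; i < r; i++) {
    const int64_t av = (i + a.size() < r) ? 1 : a[i - (r - a.size())];
    const int64_t bv = (i + b.size() < r) ? 1 : b[i - (r - b.size())];
    out[i] = av > bv ? av : bv;
  }
  return out;
}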


(b) A fill_tensor_meta_broadcast overload taking const std::vector<int64_t>& so the mul hook stops fabricating WebGPUTensor temporaries:

inline void fill_tensor_meta_broadcast(
    const std::vector<int64_t>& dims, uint32_t out_ndim, TensorMeta* m);
// existing WebGPUTensor overload forwards to it via t.dims

This removes ta/tb/to and resolves note (3) above.
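One possible shape for that overload pair, as a sketch only. The `TensorMeta` layout below (`ndim`, `sizes`, `strides`, and the `kMaxDims` cap) is an assumption made for illustration, as is the stride-0 convention for broadcast dims; the real struct in TensorMeta.h may differ. `WebGPUTensor` is reduced to the single `dims` field the review says the function reads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kMaxDims = 8;  // assumed cap, not from the source

// Hypothetical layout; the real TensorMeta may have different fields.
struct TensorMeta {
  uint32_t ndim = 0;
  uint32_t sizes[kMaxDims] = {};
  uint32_t strides[kMaxDims] = {};
};

// Reduced stand-in: only the field that fill_tensor_meta_broadcast reads.
struct WebGPUTensor {
  std::vector<int64_t> dims;
};

// Dims-based overload: right-aligns dims into out_ndim slots, pads leading
// dims with 1, and gives size-1 dims a stride of 0 so they broadcast.
inline void fill_tensor_meta_broadcast(
    const std::vector<int64_t>& dims, uint32_t out_ndim, TensorMeta* m) {
  assert(m != nullptr);
  assert(dims.size() <= out_ndim && out_ndim <= kMaxDims);
  m->ndim = out_ndim;
  const size_t pad = out_ndim - dims.size();
  uint32_t stride = 1;
  for (size_t i = out_ndim; i-- > 0;) {
    const int64_t d = (i < pad) ? 1 : dims[i - pad];
    m->sizes[i] = static_cast<uint32_t>(d);
    m->strides[i] = (d == 1) ? 0u : stride;
    stride *= static_cast<uint32_t>(d);
  }
}

// Tensor overload forwards to the dims-based one, so callers that already
// hold a tensor keep working unchanged.
inline void fill_tensor_meta_broadcast(
    const WebGPUTensor& t, uint32_t out_ndim, TensorMeta* m) {
  fill_tensor_meta_broadcast(t.dims, out_ndim, m);
}
```

With this split, a resize hook could pass the operand's current dims directly and would not need to construct a tensor object just to carry a shape.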

(c) "Register the same hook on both operands." Both ops do the identical two-line dance:

graph.add_tensor_resize_hook(in1_id, fn);
graph.add_tensor_resize_hook(in2_id, fn);

A small overload add_tensor_resize_hook(std::initializer_list<int> ids, fn) on WebGPUGraph would document intent and also let propagate_resize dedup so the hook fires once even when both operands are dirty (fixes note 1). Lower priority than (a)/(b).
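A rough sketch of how a multi-id registration could dedup firing. `HookRegistry` and `run_hooks` are hypothetical names standing in for the relevant part of WebGPUGraph and propagate_resize; the real graph class tracks dirty tensors and convergence in ways this sketch does not model.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <initializer_list>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for the hook bookkeeping inside the graph.
class HookRegistry {
 public:
  using Hook = std::function<void()>;

  // Registers one hook under several tensor ids. The hook is stored once and
  // each id maps to its index, which is what makes dedup possible later.
  void add_tensor_resize_hook(std::initializer_list<int> ids, Hook fn) {
    assert(fn && "hook must be callable");
    const size_t index = hooks_.size();
    hooks_.push_back(std::move(fn));
    for (int id : ids) {
      by_tensor_[id].push_back(index);
    }
  }

  // Runs every hook attached to any dirty tensor, at most once per call,
  // even when several of its tensors are dirty in the same pass.
  void run_hooks(const std::vector<int>& dirty_ids) {
    std::unordered_set<size_t> fired;
    for (int id : dirty_ids) {
      const auto it = by_tensor_.find(id);
      if (it == by_tensor_.end()) {
        continue;
      }
      for (size_t index : it->second) {
        if (fired.insert(index).second) {
          hooks_[index]();
        }
      }
    }
  }

 private:
  std::vector<Hook> hooks_;
  std::unordered_map<int, std::vector<size_t>> by_tensor_;
};
```

Storing the closure once and indexing it per tensor id is the design choice that matters here: registering the same closure twice as two separate entries gives the pass no way to tell the two registrations apart.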

On a full binary-op base class: I'd not go further than (a)–(b) right now. add (single AddParams UBO, numel-max, no broadcast metadata) and mul (three TensorMeta UBOs, full broadcast) differ enough in their bodies that a shared binary_op_impl template would need so many hooks it'd be harder to read than the two explicit functions — which cuts against the repo's "choose simpler, more concise" guidance. Extracting the genuinely shared data helpers (a/b) is the right altitude; collapsing the control flow is not.

Summary

No blocking issues — correct and behavior-neutral on static graphs. Recommended: (a) and (b) to remove the duplicated broadcast math and the fabricated tensor temporaries; (c) optional. Notes 1–2 are worth a comment but need no code change.
· branch gh/JulianCloudNTH/69/head

[ghstack-poisoned]

SS-JIA

Review automatically exported from Phabricator review in Meta.

[ghstack-poisoned]

Pull Request resolved: #20577

**Make the elementwise add and mul ops serve any live shape from one graph.**

**Problem:** `aten.add.Tensor` and `aten.mul.Tensor` baked their element count + param UBO(s) + output shape at `build()` for the max shape. On a dynamic-shape graph at a smaller live shape they would over-dispatch and leave the output sized at the max.

**Solution:**
- Before: one fixed dispatch sized for the build-time shape.
- After: each registers a resize hook on BOTH operands (the dynamic one may be either operand by arg order). The hook recomputes the live element count, rewrites the param UBO(s), updates the dispatch `workgroup_count_x`, and sets the output `cur_dims`. Inert until an operand is resized.

**Implementation:**
- `add`: out follows the larger operand (robust when one input is a static residual and the other is the dynamic-S tensor); rewrites `AddParams`.
- `mul`: recomputes the broadcast output shape and rebuilds all three `TensorMeta` UBOs via `fill_tensor_meta_broadcast`.
- Each keeps its uniform buffer(s) alive via `own_uniform_buffer` instead of releasing at build.
- Mirrors Vulkan per-op `resize_*_node` (recompute sizes + dispatch each execute).

**Constraints:** Behavior-neutral on static graphs (the hook fires only when an operand's live shape differs from the max). No kernel/WGSL/numerics change.

Co-authored-with: Claude Code.
ghstack-source-id: 399812828
@exported-using-ghexport

Differential Revision: [D109906093](https://our.internmc.facebook.com/intern/diff/D109906093/)

Update

fd2a178

[ghstack-poisoned]

ghost temporarily deployed to cadence June 28, 2026 16:22 — with GitHub Actions Inactive

This was referenced Jun 28, 2026

[ExecuTorch][WebGPU] SymInt arithmetic ops (add/sub/mul/floordiv) for dynamic shapes #20573

Merged

[ExecuTorch][WebGPU] Dynamic tensor-shape resize engine core #20574

Merged

[ExecuTorch][WebGPU] Dynamic resize hooks for rms_norm, embedding, rope #20575

Merged

ghost mentioned this pull request Jun 28, 2026

[ExecuTorch][WebGPU] Dynamic resize hook for linear_q4gsw #20576

Merged

meta-cla Bot added the CLA Signed label Jun 28, 2026

Update

911cded

[ghstack-poisoned]

ghost temporarily deployed to cadence June 29, 2026 22:10 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jun 29, 2026

Update

f98a16f

[ghstack-poisoned]

ghost temporarily deployed to cadence June 30, 2026 02:46 — with GitHub Actions Inactive

Update

8ea5c50

[ghstack-poisoned]

ghost temporarily deployed to cadence June 30, 2026 21:10 — with GitHub Actions Inactive

ghost mentioned this pull request Jun 30, 2026

[ExecuTorch][WebGPU] 2D-fold mul + permute dispatch (lift 65535 1D cap) #20651

Merged

ghost mentioned this pull request Jun 30, 2026

[ExecuTorch][WebGPU] Use requiredFeatures instance API on native + emscripten Dawn #20652

Merged

SS-JIA approved these changes Jul 2, 2026


ghost mentioned this pull request Jul 2, 2026

[ExecuTorch][WebGPU] Convert remaining native tests to GTest #20706

Merged

Update

4619e90

[ghstack-poisoned]

ghost had a problem deploying to cadence July 3, 2026 20:28 — with GitHub Actions Error

ghost temporarily deployed to cadence July 3, 2026 20:28 — with GitHub Actions Inactive

Update

e85fad2

[ghstack-poisoned]

ghost temporarily deployed to cadence July 3, 2026 20:51 — with GitHub Actions Inactive

ghost temporarily deployed to cadence July 3, 2026 21:20 — with GitHub Actions Inactive

meta-codesync Bot merged commit 47070dd into gh/JulianCloudNTH/69/base Jul 4, 2026
179 of 183 checks passed

meta-codesync Bot deleted the gh/JulianCloudNTH/69/head branch July 4, 2026 17:05

meta-codesync Bot temporarily deployed to cherry-pick-bot July 4, 2026 17:05 Inactive

pytorchbot mentioned this pull request Jul 4, 2026

[ExecuTorch][WebGPU] Dynamic resize hooks for add and mul #20716

Merged

meta-codesync[bot] merged 6 commits into gh/JulianCloudNTH/69/base from gh/JulianCloudNTH/69/head
