[ExecuTorch][WebGPU] Dynamic resize hook for view_copy by pytorchbot · Pull Request #20718 · pytorch/executorch

pytorchbot · 2026-07-04T17:06:14Z

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #20579 by @JulianCloudNTH
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/71/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/71/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/70/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/71/orig

@diff-train-skip-merge

Pull Request resolved: #20579

**Make `view_copy` track the live sequence length under dynamic shapes.**

**Problem:** `view_copy` lowers to a flat DMA buffer copy (`add_buffer_copy`) sized at the build-time max shape. With one dynamic graph serving any seq-len S (prefill S=K, decode S=1), the copy moved the full max-S byte count and the output kept its max dims, so a downstream consumer read a live shape that was too large.

**Solution:** register a tensor resize hook on the input so the copy follows the live input numel (a view preserves numel).
- Before: `copy_nbytes` and the output dims are fixed at the serialized max.
- After: the hook recomputes the live numel from `cur_dims(in)`, scales the single dynamic output dim to preserve numel, sets the output `cur_dims`, and rewrites the Copy dispatch's `copy_nbytes`.

**Implementation:**
- Keep the existing DMA path (`Kind::Copy`); the hook only rewrites `copy_nbytes` via `dispatch_at`, no new kernel.
- Handle the aliased in/out fast path (no copy emitted) by still setting the output `cur_dims` so the resize cascade reaches consumers.
- Mirrors Vulkan's `view_buffer` contiguous fast path; numel-preserving like the other dynamic-shape op hooks.

**Constraints:** inert on a static graph (`cur_dims == dims`), so byte-identical to the prior behavior; fp32-only and numel-preserving invariants unchanged.

Co-authored-with: Claude Code.
ghstack-source-id: 399812833
@exported-using-ghexport
Differential Revision: [D109906098](https://our.internmc.facebook.com/intern/diff/D109906098/)
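The numel-preserving rescale described for the `view_copy` hook can be sketched as below. This is a minimal illustration, not the backend's code: `live_view_dims`, `numel_of`, and `copy_nbytes_fp32` are hypothetical names, and the sketch assumes exactly one dynamic output dim whose index is known.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Number of elements in a shape (product of all dims).
inline std::int64_t numel_of(const std::vector<std::int64_t>& dims) {
  std::int64_t n = 1;
  for (std::int64_t d : dims) n *= d;
  return n;
}

// Hypothetical helper: derive the live output dims of a view from the live
// input dims. All output dims keep their build-time (max) value except
// `dynamic_dim`, which is rescaled so the output numel equals the input numel.
// Returns nullopt when no integer extent can preserve numel.
inline std::optional<std::vector<std::int64_t>> live_view_dims(
    const std::vector<std::int64_t>& live_in_dims,
    const std::vector<std::int64_t>& max_out_dims,
    std::size_t dynamic_dim) {
  assert(dynamic_dim < max_out_dims.size());
  const std::int64_t live_numel = numel_of(live_in_dims);
  std::int64_t rest = 1;  // product of the static output dims
  for (std::size_t i = 0; i < max_out_dims.size(); ++i) {
    if (i != dynamic_dim) rest *= max_out_dims[i];
  }
  if (rest <= 0 || live_numel % rest != 0) return std::nullopt;
  std::vector<std::int64_t> out = max_out_dims;
  out[dynamic_dim] = live_numel / rest;
  return out;
}

// Byte count of an fp32 copy over the given shape (the op is fp32-only).
inline std::int64_t copy_nbytes_fp32(const std::vector<std::int64_t>& dims) {
  return numel_of(dims) * static_cast<std::int64_t>(sizeof(float));
}
```

For example, a view built at `[1, 128, 64] -> [128, 8, 8]` and run with a live input of `[1, 5, 64]` would yield output dims `[5, 8, 8]` and a 1280-byte copy; with the live dims equal to the max dims the result is unchanged, which matches the "inert on a static graph" constraint.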

pytorch-bot · 2026-07-04T17:06:18Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20718

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Pull Request resolved: #20580

**Make `sdpa_with_kv_cache` serve any live seq-len S from one graph (batched prefill S=K and decode S=1).**

**Problem:** the existing dynamic path only reacted to a live `input_pos` (decode), with S captured at build time. It rewrote the QK dispatch (which depends on `context_len`) but left `update_cache`, softmax, and AV sized for the build-time S. Under a dynamic seq-len S (one graph serving prefill and decode), `kv_numel`, the QK/AV tile grids, and the softmax row count all depend on S and were stale.

**Solution:** a single recompute hook driven by either a live S (q tensor resize) or a live `input_pos` (SymInt), recomputing every per-step quantity from the live shape.
- Before: hook keyed only on `input_pos`; recomputes ctx + QK count; S fixed.
- After: hook keyed on q (always) and `input_pos` (when SymInt); reads live S from `cur_dims(q)` and live pos, recomputes all five dispatches' counts + UBOs (`update_cache` K/V, QK, softmax, AV), and sets the output `cur_dims` to q's.

**Implementation:**
- Capture the `update_cache`/softmax/AV dispatch indices (previously only QK) so their workgroup counts can be rewritten per step.
- QK/AV workgroup counts use the landed register-tiled grids (`Hq*ceil(S/TM)*ceil(ctx-or-D/TN)`); softmax is one workgroup per `Hq*S` row.
- Register the hook on q unconditionally; it is inert until q is resized, so a static graph is byte-identical.
- Mirrors Vulkan `DynamicDispatchNode` (recompute workgroups per execute); scratch is sized at build (S=max, ctx=Cmax) so buffers never move and bind groups stay valid.

**Constraints:** fp32-only, batch=1, GQA, `is_causal=true`, `D%4==0` invariants unchanged; the static / decode-only paths are unaffected (the q hook never fires without a resize).

Co-authored-with: Claude Code.
ghstack-source-id: 399812834
@exported-using-ghexport
Differential Revision: [D109906097](https://our.internmc.facebook.com/intern/diff/D109906097/)
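The per-step workgroup counts quoted above (`Hq*ceil(S/TM)*ceil(ctx-or-D/TN)` for QK/AV, one workgroup per `Hq*S` row for softmax) can be sketched as plain arithmetic. The names below are hypothetical, the tile sizes `TM`/`TN` are passed in because the commit message does not state their values, and `ctx = input_pos + S` is an assumption about how the context length is derived.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the three shape-dependent dispatch counts.
struct SdpaCountsSketch {
  std::uint64_t qk;       // Hq * ceil(S/TM) * ceil(ctx/TN)
  std::uint64_t softmax;  // one workgroup per (head, query row)
  std::uint64_t av;       // Hq * ceil(S/TM) * ceil(D/TN)
};

constexpr std::uint64_t div_up(std::uint64_t a, std::uint64_t b) {
  return (a + b - 1) / b;
}

// Recompute the counts from the live sequence length S and the live position.
// Assumption: the context length is input_pos + S.
inline SdpaCountsSketch sdpa_counts(std::uint64_t Hq, std::uint64_t S,
                                    std::uint64_t input_pos, std::uint64_t D,
                                    std::uint64_t TM, std::uint64_t TN) {
  assert(TM > 0 && TN > 0);
  const std::uint64_t ctx = input_pos + S;
  const std::uint64_t row_tiles = div_up(S, TM);
  return SdpaCountsSketch{
      Hq * row_tiles * div_up(ctx, TN),
      Hq * S,
      Hq * row_tiles * div_up(D, TN),
  };
}
```

With illustrative values `Hq=32`, `D=64`, `TM=TN=4`, a decode step (`S=1`, `input_pos=7`) gives QK=64, softmax=32, AV=512, while an 8-token prefill from position 0 gives QK=128, softmax=256, AV=1024. The same graph needs different counts at each step, which is why a hook keyed only on `input_pos` left the non-QK dispatches stale.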

…t/end)

Pull Request resolved: #20581

**Make `slice_copy` support a dynamic gather range so the RoPE-freqs slice `[input_pos : input_pos + S]` works under one dynamic graph.**

**Problem:** the static slice handler read `start` via a scalar reader that throws on a SymInt and ignored `end` (output length baked AOT). The RoPE-freqs slice uses a SymInt `input_pos` for start and a live S for the range, so the static op could neither build nor resize for it.

**Solution:** read start/end as possibly-dynamic SymInts and add a resize hook that recomputes the gather offset and live output length each step.
- Before: `start` is a static scalar (SymInt throws); `end` ignored; output length fixed at the serialized max.
- After: `start`/`end` read via a SymInt-aware reader; a hook recomputes `out[dim] = (end - start + step - 1) / step`, rewrites `out_meta`/`in_meta`/`params` UBOs + the dispatch count, and sets the output `cur_dims`.

**Implementation:**
- Hook registered on the `start`/`end` value-ids when they are SymInts and on the input tensor always (inert until resized, so a static slice is byte-identical).
- Output/input `TensorMeta` rebuilt from live dims; `dim`/`step` stay static.
- Keep the uniforms alive via `own_uniform_buffer` so the hook can rewrite them.
- Mirrors Vulkan `resize_slice_copy_node`.

**Constraints:** fp32-only; `dim`/`step` static; numerics + layout unchanged; inert on a static graph.

NOTE (stacking): this diff sits on top of the in-review `slice_copy` op (D108793168); rebase onto it once that op lands on master.

Co-authored-with: Claude Code.
ghstack-source-id: 399812835
@exported-using-ghexport
Differential Revision: [D109906092](https://our.internmc.facebook.com/intern/diff/D109906092/)
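The live output length formula `out[dim] = (end - start + step - 1) / step` is a ceiling division of the range width by the step. A minimal sketch follows; `slice_out_len` is a hypothetical name, and returning 0 for an empty range is an assumption the commit message does not spell out.

```cpp
#include <cassert>
#include <cstdint>

// Length of the sliced dim for the half-open range [start, end) with a
// positive step: ceil((end - start) / step).
// Assumption: an empty or inverted range yields length 0.
inline std::int64_t slice_out_len(std::int64_t start, std::int64_t end,
                                  std::int64_t step) {
  assert(step > 0);
  if (end <= start) return 0;
  return (end - start + step - 1) / step;
}
```

For the RoPE-freqs slice `[input_pos : input_pos + S]` with `step = 1`, this reduces to `S` regardless of `input_pos`: a decode step at position 37 gives `slice_out_len(37, 38, 1) == 1`, and an 8-token prefill gives `slice_out_len(0, 8, 1) == 8`.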

…+ per-op resize)

Pull Request resolved: #20582

**End-to-end validation that one graph built at the upper-bound seq-len serves every smaller live shape, matching the torch golden.**

**Problem:** the dynamic-resize engine (allocate-at-max buffers + per-op resize hooks + output resize) had unit-level reasoning but no single oracle proving a graph built at S=MAX runs correctly at S<MAX without reallocating buffers (which would invalidate bind groups).

**Solution:** a native test that builds each toy model at S=MAX and runs it at several live S, asserting the output matches a torch-computed golden and that the output EValue is resized to the live shape.
- Cases A-D: dynamic + static `rms_norm` (resize shrinks the dispatch; one reused graph across S proves buffers never move; static path unchanged).
- Cases F-H: `rms(rms(x))` cascade, `rms(x)+x` (rms->add cascade), `rms(x)*x` (mul).
- Cases I-L: dynamic `linear_q4gsw` (GEMM at several M), `sdpa_with_kv_cache` (GQA prefill at several S), `embedding_q4gsw` (int64 ids), `apply_rotary_emb` (two outputs).
- Cases M-N: dynamic `sigmoid` (elementwise) and `select_copy(0, -1)` (negative index resolved against the live leading dim each call).
- Graph-reuse variants: every dynamic op above (`rms_norm` incl. a grow-first smallest→largest order, the `rms(rms(x))` cascade, `linear_q4gsw`, `embedding_q4gsw`, `apply_rotary_emb`, `sigmoid`, `select_copy`) also runs ONE loaded graph across multiple live shapes, proving buffers never move so bind groups stay valid across every resize.

**Implementation:**
- `test/ops/dynamic_shape/test_dynamic_shape_export.py` exports each toy model through `VulkanPartitioner` with a dynamic dim and writes per-S torch goldens; reuses the existing op-test helpers for quant/sdpa/embedding/rope.
- `test/native/test_dynamic_shape.cpp` loads each `.pte`, runs each live S, and compares at the per-op tolerance (rms 1e-3, quant 5e-3, sdpa 2e-3). Reuse tests split each per-op helper into load-once + run-at-shape so a single `Module` serves the whole shape sweep.
- Multi-output ops select their output by full shape, never numel.

**Constraints:** numerics computed with torch (no hand-rolled reference); toy models stay within the 65535 1D-dispatch cap; SDPA case is skipped gracefully if `sym_size.int`/`copy_` op coverage is incomplete (does not fail the suite).

Co-authored-with: Claude Code.
ghstack-source-id: 399812841
@exported-using-ghexport
Differential Revision: [D109906090](https://our.internmc.facebook.com/intern/diff/D109906090/)
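Two of the checks described above lend themselves to a short sketch: comparing against a golden at a per-op tolerance, and selecting an output by full shape rather than numel. The helper names are hypothetical and this is not the code in `test_dynamic_shape.cpp`; it only illustrates why shape matching is the safer selector.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using Shape = std::vector<std::int64_t>;

// Largest absolute element-wise difference. A size mismatch or a NaN
// difference is reported as +infinity so it can never pass a tolerance check.
inline float max_abs_err(const std::vector<float>& got,
                         const std::vector<float>& want) {
  if (got.size() != want.size()) return std::numeric_limits<float>::infinity();
  float worst = 0.0f;
  for (std::size_t i = 0; i < got.size(); ++i) {
    const float diff = std::fabs(got[i] - want[i]);
    if (std::isnan(diff)) return std::numeric_limits<float>::infinity();
    if (diff > worst) worst = diff;
  }
  return worst;
}

// Index of the first output whose full shape equals `want`, or -1.
// Matching on numel alone would be ambiguous when two outputs have the same
// element count but different layouts.
inline int find_output_by_shape(const std::vector<Shape>& output_shapes,
                                const Shape& want) {
  for (std::size_t i = 0; i < output_shapes.size(); ++i) {
    if (output_shapes[i] == want) return static_cast<int>(i);
  }
  return -1;
}
```

For instance, outputs shaped `[1, 4, 8, 16]` and `[1, 8, 4, 16]` both hold 512 elements, so a numel-based lookup could pick either one, while the shape-based lookup resolves `[1, 8, 4, 16]` to index 1.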

… (prefill path)

Pull Request resolved: #20583

**Lift the 65535 workgroup-per-dim dispatch cap so single-shot SDPA prefill runs at any sequence length.**

**Problem**: The WebGPU backend is 1D-dispatch-only and throws when a kernel's workgroup count exceeds the device per-dim limit (`maxComputeWorkgroupsPerDimension`, spec floor 65535). SDPA prefill QK exceeds it around S~362 (softmax/AV at S=2048), blocking single-shot / long-context prefill.

**Solution**: Fold a >limit 1D workgroup count into 2D; the shader reconstructs the linear index from `@builtin(num_workgroups)`.
- **Before**: `compute_1d_workgroup_count` throws if `count > limit`; dispatch `(count, 1, 1)`.
- **After**: `compute_2d_workgroup_count` returns `{count, 1}` (fast path) or a near-square `{x, y}` (`x = ceil(sqrt(count))` clamped to `limit`, `y = div_up(count, x)`); dispatch `(x, y, 1)`. A flat `{limit, div_up(count, limit)}` split would idle up to ~half the launched workgroups when `count` just exceeds `limit`; the near-square split holds the waste to `O(sqrt(count))` (e.g. 65536 -> `{256, 256}`, 0 inactive).

**Implementation**:
- `WgCount` + pure `fold_workgroup_count_2d` + `compute_2d_workgroup_count` in `WebGPUUtils.h` (device-free, unit-testable; `queried_max_workgroups` factored out of the 1D path)
- `WebGPUDispatch.workgroup_count_y` (default 1, declared last so existing aggregate inits are unchanged); both `dispatchWorkgroups` calls + the profiling record pass `(x, y, 1)`
- Per-kernel in-shader reconstruction: thread-form `idx = gid.x + gid.y*(num_workgroups.x*wg_size)` (QK/AV/add); row-form `row_idx = wid.x + wid.y*num_workgroups.x` (softmax; it keeps a `valid` predicate, not an early return, so `workgroupBarrier()`s stay uniform)
- `Sdpa.cpp`: QK/softmax/AV counts via the 2D helper; the dynamic-`input_pos` resize hook recomputes both x and y for QK
- Reference: ET-Vulkan dispatches over natural N-D extents (never folds a flat count nor guards the per-dim limit) and MLX `get_2d_grid_dims` packs whole tensor dims; for our flattened scalar count the near-square split is the correct no-shape-info analog (a pack-to-limit split would reproduce the idle-half waste)

**Constraints**:
- `y=1` fast path keeps every non-folded dispatch byte-identical to the prior 1D path
- Scope = prefill path only; `rms_norm`/`embedding`/`lm_head`/`update_cache` are row/token-indexed and never hit the cap, so they keep the 1D path
- Throws if a 3rd dispatch dimension would be needed; this is unreachable for real prefill (the `uint32` element guard fires first at S~11585)

Co-authored-with: Claude Code.
ghstack-source-id: 399812920
@exported-using-ghexport
Differential Revision: [D109517684](https://our.internmc.facebook.com/intern/diff/D109517684/)
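The near-square fold described above (`x = ceil(sqrt(count))` clamped to `limit`, `y = div_up(count, x)`) can be sketched as a device-free function. This is an illustration written from the commit message, not the contents of `WebGPUUtils.h`; `WgCountSketch`, `ceil_sqrt`, and `fold_near_square` are hypothetical names, and the exact exception type is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the {x, y} pair returned by the fold.
struct WgCountSketch {
  std::uint32_t x;
  std::uint32_t y;
};

// Smallest r with r * r >= n, corrected for floating-point rounding.
inline std::uint64_t ceil_sqrt(std::uint64_t n) {
  std::uint64_t r =
      static_cast<std::uint64_t>(std::sqrt(static_cast<long double>(n)));
  while (r * r < n) ++r;
  while (r > 0 && (r - 1) * (r - 1) >= n) --r;
  return r;
}

// Fold a flat workgroup count into a 2D grid whose dims both fit `limit`.
// Counts at or below the limit take the {count, 1} fast path.
inline WgCountSketch fold_near_square(std::uint64_t count, std::uint32_t limit) {
  assert(limit > 0);
  if (count <= limit) {
    return WgCountSketch{static_cast<std::uint32_t>(count), 1u};
  }
  std::uint64_t x = ceil_sqrt(count);
  if (x > limit) x = limit;
  const std::uint64_t y = (count + x - 1) / x;  // div_up(count, x)
  if (y > limit) {
    throw std::runtime_error("workgroup count needs a third dispatch dimension");
  }
  return WgCountSketch{static_cast<std::uint32_t>(x),
                       static_cast<std::uint32_t>(y)};
}
```

With `limit = 65535`, a count of 65536 folds to `{256, 256}` with no idle workgroups, matching the example in the commit message, and a count of 65537 folds to `{257, 256}`. Only counts above `limit * limit` reach the throw.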

…d unit test

Pull Request resolved: #20584

**Test coverage for the 2D dispatch fold, stacked above the cap-lift op.**

**Problem**: The 2D fold is load-bearing index math (a wrong `{x, y}` means out-of-bounds writes or dropped threads), and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested.

**Solution**: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family.
- **Before**: no coverage for >65535-workgroup dispatch; `llama1b_prefill_512`/`_2048` shapes threw at the cap
- **After**: `fold_workgroup_count_2d` unit-tested at the cap boundaries, and the two prefill shapes run as goldens

**Implementation**:
- `test/native/test_dispatch_2d.cpp`: device-free unit test for `utils::fold_workgroup_count_2d`: the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 (`{65535, 3}`) and S=2048 (`{65535, 33}`), and the needs-3rd-dimension throw; asserts each `{x, y}` covers `[0, count)`
- `llama1b_prefill_512` + `llama1b_prefill_2048` configs appended to the byte-mirrored `CONFIGS` (`test_sdpa.py`) and `kSdpaConfigs` (`test_webgpu_native.cpp`)
- Registers `webgpu_dispatch_2d_test` in CMake + the native CI script

**Constraints**:
- The Python/C++ config entries byte-mirror each other (kept in sync)
- `add` shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element `add` fold case is omitted as disproportionate

Co-authored-with: Claude Code.
ghstack-source-id: 399812923
@exported-using-ghexport
Differential Revision: [D109517683](https://our.internmc.facebook.com/intern/diff/D109517683/)
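The coverage property that the unit test is said to assert (each `{x, y}` covers `[0, count)`) can be expressed as a small checker that works for any split. The sketch below is an assumption about how such a check might look, with hypothetical names; it also reports how many launched workgroups fall past `count` and therefore do no work.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result of validating a folded grid against a flat count.
struct FoldCheckSketch {
  bool in_limits;      // both dims are in [1, limit]
  bool covers;         // x * y >= count, so every index in [0, count) is launched
  std::uint64_t idle;  // launched workgroups with linear index >= count
};

inline FoldCheckSketch check_fold(std::uint32_t x, std::uint32_t y,
                                  std::uint64_t count, std::uint32_t limit) {
  assert(limit > 0);
  const std::uint64_t launched = static_cast<std::uint64_t>(x) * y;
  FoldCheckSketch r{};
  r.in_limits = x >= 1 && y >= 1 && x <= limit && y <= limit;
  r.covers = launched >= count;
  r.idle = r.covers ? launched - count : 0;
  return r;
}
```

For a count of 65536 with `limit = 65535`, the grid `{256, 256}` covers the range with 0 idle workgroups, whereas a pack-to-limit grid `{65535, 2}` also covers it but leaves 65534 of its 131070 workgroups idle. A grid of `{256, 255}` fails the coverage check and would drop work.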

Pull Request resolved: #20651

**Lift the 65535 workgroup-per-dim cap for `mul` and `permute` so they run at any numel.**

`mul.Tensor` and `permute` still used `compute_1d_workgroup_count`, which throws once `numel / wg_size > 65535`, a limit hit by a realistic Llama-3.2-1B LoRA layer (`mul` over `[2048, 8192]` = 262k workgroups; `permute` of `[2048, 2048]` = 65536). `add`/`sub`/`div`/`fill`/`sdpa` already use the 2D fold; this brings `mul` + `permute` in line.

Key changes:
- `mul/BinaryOp.cpp`, `permute/Permute.cpp`: `compute_1d_workgroup_count` → `compute_2d_workgroup_count` (returns `utils::WgCount`); dispatch + resize hook now set both `workgroup_count_x` and `workgroup_count_y`.
- `binary_mul.wgsl`, `permute.wgsl`: `main` takes `@builtin(num_workgroups)`; flat index `gid.x + gid.y * (num_workgroups.x * wg_size)` (regenerated `*_wgsl.h`).

Mirrors the landed `add` op fold (`runtime/ops/add/{BinaryOp.cpp,binary_add.wgsl}`).

Co-authored-with: Claude Code.
ghstack-source-id: 399812930
@exported-using-ghexport
Differential Revision: [D110149677](https://our.internmc.facebook.com/intern/diff/D110149677/)
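The flat-index reconstruction `gid.x + gid.y * (num_workgroups.x * wg_size)` can be checked on the CPU by simulating a `(x, y, 1)` dispatch. The sketch below is a hypothetical simulation, not shader code; it assumes a workgroup size of `(wg_size, 1, 1)` and an `idx < numel` bounds guard in the kernel. A `wg_size` of 64 is an inference from the counts quoted above (`2048 * 2048 / 64 = 65536`), not a value stated in the source.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simulate dispatching (x, y, 1) workgroups of size (wg_size, 1, 1) and
// return true if every element index in [0, numel) is visited exactly once
// by the threads that pass the `idx < numel` guard.
inline bool covers_exactly_once(std::uint32_t x, std::uint32_t y,
                                std::uint32_t wg_size, std::uint64_t numel) {
  assert(wg_size > 0);
  std::vector<std::uint8_t> hits(numel, 0);
  // Threads per grid row: num_workgroups.x * wg_size.
  const std::uint64_t row_stride = static_cast<std::uint64_t>(x) * wg_size;
  for (std::uint64_t wy = 0; wy < y; ++wy) {
    for (std::uint64_t wx = 0; wx < x; ++wx) {
      for (std::uint64_t lid = 0; lid < wg_size; ++lid) {
        const std::uint64_t gid_x = wx * wg_size + lid;  // global_invocation_id.x
        const std::uint64_t gid_y = wy;                  // y workgroup size is 1
        const std::uint64_t idx = gid_x + gid_y * row_stride;
        if (idx >= numel) continue;          // out-of-range threads do nothing
        if (hits[idx]++ != 0) return false;  // an element was visited twice
      }
    }
  }
  for (std::uint8_t h : hits) {
    if (h != 1) return false;  // an element was never visited
  }
  return true;
}
```

Under these assumptions, the `[2048, 2048]` permute (4194304 elements, 65536 workgroups) is fully covered by a `{256, 256}` grid, while a `{255, 256}` grid launches too few threads and leaves the tail unvisited.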

…scripten Dawn

Pull Request resolved: #20652

**Key the `timedWaitAny` instance setup to the actual Dawn API instead of `__EMSCRIPTEN__`, so native-rig Dawn and emscripten/emdawnwebgpu use the modern `requiredFeatures` path and only the vendored Dawn uses the legacy `capabilities.*` path.**

The instance-descriptor setup was guarded by `#if defined(__EMSCRIPTEN__)`, which routed emscripten (emdawnwebgpu, emcc 4.0.19+) through the legacy `capabilities.*` API that no longer exists there. The guard now keys off the API actually present.

Key changes:
- `WebGPUDevice.cpp`: `#if defined(__EMSCRIPTEN__)` → `#if defined(WEBGPU_DAWN_INSTANCE_CAPABILITIES)`. The legacy `instance_desc.capabilities.*` path is taken only by the buck-vendored Dawn (which defines the macro); native cmake Dawn and emscripten leave it undefined and take the `requiredFeatures` / `WGPUInstanceFeatureName_TimedWaitAny` path.

Co-authored-with: Claude Code.
ghstack-source-id: 399812934
@exported-using-ghexport
Differential Revision: [D110149678](https://our.internmc.facebook.com/intern/diff/D110149678/)
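The guard change can be illustrated with stand-in types. Everything suffixed `Sketch` below is hypothetical and only mimics the shape of the two descriptor layouts named in the commit message; it is not the `webgpu.h` API and not the code in `WebGPUDevice.cpp`. The point is that the branch is chosen by a macro describing the API that is present, not by the platform.

```cpp
#include <cassert>
#include <cstddef>

#if defined(WEBGPU_DAWN_INSTANCE_CAPABILITIES)
// Stand-in for the legacy layout: a nested capabilities struct.
struct InstanceDescSketch {
  struct {
    bool timedWaitAnyEnable = false;
  } capabilities;
};
#else
// Stand-in for the modern layout: a list of required instance features.
enum class FeatureNameSketch { TimedWaitAny = 1 };
struct InstanceDescSketch {
  std::size_t requiredFeatureCount = 0;
  const FeatureNameSketch* requiredFeatures = nullptr;
};
#endif

// True when this translation unit was compiled against the legacy layout.
constexpr bool uses_legacy_path() {
#if defined(WEBGPU_DAWN_INSTANCE_CAPABILITIES)
  return true;
#else
  return false;
#endif
}

// Request timed-wait support using whichever layout is available.
inline void enable_timed_wait_any(InstanceDescSketch& desc) {
#if defined(WEBGPU_DAWN_INSTANCE_CAPABILITIES)
  desc.capabilities.timedWaitAnyEnable = true;
#else
  static const FeatureNameSketch kFeatures[] = {FeatureNameSketch::TimedWaitAny};
  desc.requiredFeatures = kFeatures;
  desc.requiredFeatureCount = 1;
#endif
}
```

A build that does not define `WEBGPU_DAWN_INSTANCE_CAPABILITIES` compiles only the feature-list branch, so it never references a `capabilities` member that its headers lack. That is the failure the old `__EMSCRIPTEN__` guard is described as causing.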

Pull Request resolved: #20706

Convert the remaining hand-rolled `int main()` + printf/`bool ok` native tests to GTest so the whole `backends/webgpu/test/` suite is uniform, filterable via `--gtest_filter`, and self-reporting (extends the GTest conversion already applied to `test_dynamic_shape`). The five converted files are a harness-only change (every test case, tensor shape, tolerance, artifact filename, and skip condition is preserved 1:1; only the pass/fail reporting mechanism changes), and this diff additionally wires the already-GTest `webgpu_dynamic_shape_test` into the CI runner so the dynamic-shape suite actually executes.

Key changes:
- `test/test_webgpu_native.cpp`, `test/native/test_dispatch_order.cpp`, `test/native/test_index.cpp`, `test/native/test_scratch_buffer.cpp`, `test/native/test_update_cache.cpp`: `main`+`printf`/`bool ok` accumulator → `TEST()` cases using `EXPECT_*`/`ASSERT_*`; each keeps a custom `main()` that brings up the WebGPU device once then `RUN_ALL_TESTS()` (device-absent still SKIPs by returning 0). `test_index`/`test_webgpu_native` use inclusive `EXPECT_LE(err, tol)` to match the original `err > tol` fail gate exactly.
- `CMakeLists.txt`: move every native-test target into the `if(TARGET GTest::gtest)` block, linking `GTest::gtest`.
- `scripts/test_webgpu_native_ci.sh`: add `-DEXECUTORCH_BUILD_TESTS=ON` to the native-test configure so the now-gtest-gated targets are defined, and wire `webgpu_dynamic_shape_test` into the runner: export its `.pte`s + goldens via `export_dynamic_shape_cases`, add it to the built/run target list behind the same `--target help` probe, and run it guarded (mirroring the `index` test).
- `test/test_build_webgpu.sh`: add `-DEXECUTORCH_BUILD_TESTS=ON` so the local build script (which builds the now-gtest-gated targets unconditionally) still finds them.

ghstack-source-id: 399812941
@exported-using-ghexport
Differential Revision: [D110536636](https://our.internmc.facebook.com/intern/diff/D110536636/)

pytorchbot temporarily deployed to cadence July 4, 2026 17:06 — with GitHub Actions Inactive

pytorchbot had a problem deploying to cadence July 4, 2026 17:06 — with GitHub Actions Error

meta-cla Bot added the CLA Signed label Jul 4, 2026

JCNTH added 8 commits July 4, 2026 10:34

ghost requested review from kirklandsign and larryliu0820 as code owners July 4, 2026 17:34

ghost merged commit ff0fc68 into gh/JulianCloudNTH/70/orig Jul 4, 2026
43 of 44 checks passed

ghost deleted the gh/JulianCloudNTH/71/orig branch July 4, 2026 17:34

ghost temporarily deployed to cadence July 4, 2026 17:34 — with GitHub Actions Inactive

ghost temporarily deployed to cadence July 4, 2026 18:03 — with GitHub Actions Inactive

This pull request was closed.

9 commits merged into gh/JulianCloudNTH/70/orig from gh/JulianCloudNTH/71/orig
