[ExecuTorch][WebGPU] Dynamic-shape integration test (allocate-at-max + per-op resize) by pytorchbot · Pull Request #20721 · pytorch/executorch

pytorchbot · 2026-07-04T17:06:35Z

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #20582 by @JulianCloudNTH
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/74/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/74/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/73/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/74/orig

@diff-train-skip-merge

…+ per-op resize)

Pull Request resolved: #20582

**End-to-end validation that one graph built at the upper-bound seq-len serves every smaller live shape, matching the torch golden.**

**Problem:** the dynamic-resize engine (allocate-at-max buffers + per-op resize hooks + output resize) had unit-level reasoning but no single oracle proving a graph built at S=MAX runs correctly at S<MAX without reallocating buffers (which would invalidate bind groups).

**Solution:** a native test that builds each toy model at S=MAX and runs it at several live S, asserting the output matches a torch-computed golden and that the output EValue is resized to the live shape.

- Cases A-D: dynamic + static `rms_norm` (resize shrinks the dispatch; one reused graph across S proves buffers never move; static path unchanged).
- Cases F-H: `rms(rms(x))` cascade, `rms(x)+x` (rms->add cascade), `rms(x)*x` (mul).
- Cases I-L: dynamic `linear_q4gsw` (GEMM at several M), `sdpa_with_kv_cache` (GQA prefill at several S), `embedding_q4gsw` (int64 ids), `apply_rotary_emb` (two outputs).
- Cases M-N: dynamic `sigmoid` (elementwise) and `select_copy(0, -1)` (negative index resolved against the live leading dim each call).
- Graph-reuse variants: every dynamic op above (`rms_norm` incl. a grow-first smallest→largest order, the `rms(rms(x))` cascade, `linear_q4gsw`, `embedding_q4gsw`, `apply_rotary_emb`, `sigmoid`, `select_copy`) also runs ONE loaded graph across multiple live shapes, proving buffers never move so bind groups stay valid across every resize.

**Implementation:**

- `test/ops/dynamic_shape/test_dynamic_shape_export.py` exports each toy model through `VulkanPartitioner` with a dynamic dim and writes per-S torch goldens; reuses the existing op-test helpers for quant/sdpa/embedding/rope.
- `test/native/test_dynamic_shape.cpp` loads each `.pte`, runs each live S, and compares at the per-op tolerance (rms 1e-3, quant 5e-3, sdpa 2e-3). Reuse tests split each per-op helper into load-once + run-at-shape so a single `Module` serves the whole shape sweep.
- Multi-output ops select their output by full shape, never numel.

**Constraints:** numerics computed with torch (no hand-rolled reference); toy models stay within the 65535 1D-dispatch cap; SDPA case is skipped gracefully if `sym_size.int`/`copy_` op coverage is incomplete (does not fail the suite).

Co-authored-with: Claude Code.

ghstack-source-id: 399812841
@exported-using-ghexport

Differential Revision: [D109906090](https://our.internmc.facebook.com/intern/diff/D109906090/)
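The allocate-at-max idea in the message above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the backend's code: the class `ToyDynamicOp`, its `resize` method, and the workgroup size of 64 are all hypothetical. It shows the property the test asserts: a resize changes the live shape and the dispatch size, while the buffer allocated at the maximum shape is never replaced.

```python
from dataclasses import dataclass, field

WG_SIZE = 64  # assumed workgroup size, for illustration only


def div_up(n: int, d: int) -> int:
    """Integer ceiling division."""
    return (n + d - 1) // d


@dataclass
class ToyDynamicOp:
    """Hypothetical stand-in for one elementwise op with a dynamic leading dim."""

    max_rows: int
    cols: int
    buffer: list = field(init=False)
    live_rows: int = field(init=False)
    workgroup_count: int = field(init=False)

    def __post_init__(self) -> None:
        # Allocate once, at the upper-bound shape.
        self.buffer = [0.0] * (self.max_rows * self.cols)
        self.resize(self.max_rows)

    def resize(self, live_rows: int) -> None:
        """Per-op resize hook: update the live shape and dispatch size only."""
        if not 1 <= live_rows <= self.max_rows:
            raise ValueError("live shape must fit inside the max allocation")
        self.live_rows = live_rows
        self.workgroup_count = div_up(live_rows * self.cols, WG_SIZE)

    @property
    def live_shape(self) -> tuple:
        return (self.live_rows, self.cols)
```

Because `resize` never touches `buffer`, anything that holds a reference to it (the analog of a bind group here) stays valid across a smallest-to-largest or largest-to-smallest sweep.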

pytorch-bot · 2026-07-04T17:06:38Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20721

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

… (prefill path)

Pull Request resolved: #20583

**Lift the 65535 workgroup-per-dim dispatch cap so single-shot SDPA prefill runs at any sequence length.**

**Problem**: The WebGPU backend is 1D-dispatch-only and throws when a kernel's workgroup count exceeds the device per-dim limit (`maxComputeWorkgroupsPerDimension`, spec floor 65535). SDPA prefill QK exceeds it around S~362 (softmax/AV at S=2048), blocking single-shot / long-context prefill.

**Solution**: Fold a >limit 1D workgroup count into 2D; the shader reconstructs the linear index from `@builtin(num_workgroups)`.

- **Before**: `compute_1d_workgroup_count` throws if `count > limit`; dispatch `(count, 1, 1)`.
- **After**: `compute_2d_workgroup_count` returns `{count, 1}` (fast path) or a near-square `{x, y}` (`x = ceil(sqrt(count))` clamped to `limit`, `y = div_up(count, x)`); dispatch `(x, y, 1)`. A flat `{limit, div_up(count, limit)}` split would idle up to ~half the launched workgroups when `count` just exceeds `limit`; the near-square split holds the waste to `O(sqrt(count))` (e.g. 65536 -> `{256, 256}`, 0 inactive).

**Implementation**:

- `WgCount` + pure `fold_workgroup_count_2d` + `compute_2d_workgroup_count` in `WebGPUUtils.h` (device-free, unit-testable; `queried_max_workgroups` factored out of the 1D path)
- `WebGPUDispatch.workgroup_count_y` (default 1, declared last so existing aggregate inits are unchanged); both `dispatchWorkgroups` calls + the profiling record pass `(x, y, 1)`
- Per-kernel in-shader reconstruction: thread-form `idx = gid.x + gid.y*(num_workgroups.x*wg_size)` (QK/AV/add); row-form `row_idx = wid.x + wid.y*num_workgroups.x` (softmax, which keeps a `valid` predicate, not an early return, so `workgroupBarrier()`s stay uniform)
- `Sdpa.cpp`: QK/softmax/AV counts via the 2D helper; the dynamic-`input_pos` resize hook recomputes both x and y for QK
- Reference: ET-Vulkan dispatches over natural N-D extents (never folds a flat count nor guards the per-dim limit) and MLX `get_2d_grid_dims` packs whole tensor dims; for our flattened scalar count the near-square split is the correct no-shape-info analog (a pack-to-limit split would reproduce the idle-half waste)

**Constraints**:

- `y=1` fast path keeps every non-folded dispatch byte-identical to the prior 1D path
- Scope = prefill path only; `rms_norm`/`embedding`/`lm_head`/`update_cache` are row/token-indexed and never hit the cap, so they keep the 1D path
- Throws if a 3rd dispatch dimension would be needed; this is unreachable for real prefill (the `uint32` element guard fires first at S~11585)

Co-authored-with: Claude Code.

ghstack-source-id: 399812920
@exported-using-ghexport

Differential Revision: [D109517684](https://our.internmc.facebook.com/intern/diff/D109517684/)
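The fold arithmetic described in this message can be sketched in a few lines. The following is a minimal Python illustration of the stated formulas, not the C++ in `WebGPUUtils.h`: the names `fold_near_square`, `fold_flat`, and `idle_workgroups` are hypothetical, and raising `ValueError` is an assumption standing in for the "throws if a 3rd dispatch dimension would be needed" behavior.

```python
import math

LIMIT = 65535  # spec floor for maxComputeWorkgroupsPerDimension


def div_up(n: int, d: int) -> int:
    """Integer ceiling division."""
    return (n + d - 1) // d


def fold_near_square(count: int, limit: int = LIMIT) -> tuple:
    """Near-square fold: x = ceil(sqrt(count)) clamped to limit, y = div_up(count, x)."""
    if count <= limit:
        return (count, 1)  # fast path: plain 1D dispatch
    x = min(math.isqrt(count - 1) + 1, limit)  # exact integer ceil(sqrt(count))
    y = div_up(count, x)
    if y > limit:
        # Assumed error type; the source only says this case throws.
        raise ValueError("count would need a 3rd dispatch dimension")
    return (x, y)


def fold_flat(count: int, limit: int = LIMIT) -> tuple:
    """Flat split {limit, div_up(count, limit)}, shown only for comparison."""
    if count <= limit:
        return (count, 1)
    return (limit, div_up(count, limit))


def idle_workgroups(count: int, xy: tuple) -> int:
    """Workgroups launched beyond the ones that have work to do."""
    return xy[0] * xy[1] - count
```

For `count = 65536`, `fold_near_square` returns `(256, 256)` with 0 idle workgroups, while `fold_flat` returns `(65535, 2)` and leaves 65534 of the 131070 launched workgroups idle, which is the "idle-half" waste the message refers to.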

…d unit test

Pull Request resolved: #20584

**Test coverage for the 2D dispatch fold, stacked above the cap-lift op.**

**Problem**: The 2D fold is load-bearing index math (a wrong `{x, y}` means out-of-bounds writes or dropped threads), and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested.

**Solution**: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family.

- **Before**: no coverage for >65535-workgroup dispatch; `llama1b_prefill_512`/`_2048` shapes threw at the cap
- **After**: `fold_workgroup_count_2d` unit-tested at the cap boundaries, and the two prefill shapes run as goldens

**Implementation**:

- `test/native/test_dispatch_2d.cpp`: device-free unit test for `utils::fold_workgroup_count_2d`: the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 (`{65535, 3}`) and S=2048 (`{65535, 33}`), and the needs-3rd-dimension throw; asserts each `{x, y}` covers `[0, count)`
- `llama1b_prefill_512` + `llama1b_prefill_2048` configs appended to the byte-mirrored `CONFIGS` (`test_sdpa.py`) and `kSdpaConfigs` (`test_webgpu_native.cpp`)
- Registers `webgpu_dispatch_2d_test` in CMake + the native CI script

**Constraints**:

- The Python/C++ config entries byte-mirror each other (kept in sync)
- `add` shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element `add` fold case is omitted as disproportionate

Co-authored-with: Claude Code.

ghstack-source-id: 399812923
@exported-using-ghexport

Differential Revision: [D109517683](https://our.internmc.facebook.com/intern/diff/D109517683/)
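The "covers `[0, count)`" property named above can be checked by brute force on small grids. This sketch is an assumption about what such a check might look like, not the GTest in `test_dispatch_2d.cpp`; the helpers `covered_indices` and `covers_exactly` are hypothetical. It models the row-form reconstruction `row_idx = wid.x + wid.y*num_workgroups.x` with a validity guard in place of an early return.

```python
def covered_indices(count: int, x: int, y: int) -> list:
    """Walk an (x, y) workgroup grid and collect the linear ids that pass the guard."""
    seen = []
    for wid_y in range(y):
        for wid_x in range(x):
            row_idx = wid_x + wid_y * x  # num_workgroups.x equals x
            if row_idx < count:  # plays the role of the shader's `valid` predicate
                seen.append(row_idx)
    return seen


def covers_exactly(count: int, x: int, y: int) -> bool:
    """True when every id in [0, count) is produced exactly once and nothing else is."""
    return sorted(covered_indices(count, x, y)) == list(range(count))
```

A grid that is too small fails the check (for example 10 items on a 3 x 3 grid drops the last id), while an oversized grid passes because the guard masks the surplus workgroups.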

Pull Request resolved: #20651

**Lift the 65535 workgroup-per-dim cap for `mul` and `permute` so they run at any numel.**

`mul.Tensor` and `permute` still used `compute_1d_workgroup_count`, which throws once `numel / wg_size > 65535`. A realistic Llama-3.2-1B LoRA layer hits this (`mul` over `[2048, 8192]` = 262k workgroups; `permute` of `[2048, 2048]` = 65536). `add`/`sub`/`div`/`fill`/`sdpa` already use the 2D fold; this brings `mul` + `permute` in line.

Key changes:

- `mul/BinaryOp.cpp`, `permute/Permute.cpp`: `compute_1d_workgroup_count` → `compute_2d_workgroup_count` (returns `utils::WgCount`); dispatch + resize hook now set both `workgroup_count_x` and `workgroup_count_y`.
- `binary_mul.wgsl`, `permute.wgsl`: `main` takes `@builtin(num_workgroups)`; flat index `gid.x + gid.y * (num_workgroups.x * wg_size)` (regenerated `*_wgsl.h`). Mirrors the landed `add` op fold (`runtime/ops/add/{BinaryOp.cpp,binary_add.wgsl}`).

Co-authored-with: Claude Code.

ghstack-source-id: 399812930
@exported-using-ghexport

Differential Revision: [D110149677](https://our.internmc.facebook.com/intern/diff/D110149677/)
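The two workgroup counts quoted above can be reproduced with a short calculation. The workgroup size of 64 is an assumption (the message does not state it, but it is the value consistent with both quoted counts), and `workgroups_needed` is a hypothetical helper name.

```python
LIMIT = 65535  # per-dimension workgroup cap named in the message
WG_SIZE = 64   # assumed workgroup size; consistent with the quoted counts


def workgroups_needed(shape: tuple, wg_size: int = WG_SIZE) -> int:
    """Number of 1D workgroups for one thread per element, rounded up."""
    numel = 1
    for dim in shape:
        numel *= dim
    return (numel + wg_size - 1) // wg_size


mul_count = workgroups_needed((2048, 8192))      # 262144, the "262k" above
permute_count = workgroups_needed((2048, 2048))  # 65536, one past the cap
```

Under that assumption both shapes exceed the cap, the `permute` case by exactly one workgroup, which is why a 1D-only dispatch throws for them.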

…scripten Dawn

Pull Request resolved: #20652

**Key the `timedWaitAny` instance setup to the actual Dawn API instead of `__EMSCRIPTEN__`, so native-rig Dawn and emscripten/emdawnwebgpu use the modern `requiredFeatures` path and only the vendored Dawn uses the legacy `capabilities.*` path.**

The instance-descriptor setup was guarded by `#if defined(__EMSCRIPTEN__)`, which routed emscripten (emdawnwebgpu, emcc 4.0.19+) through the legacy `capabilities.*` API that no longer exists there. The guard now keys off the API actually present.

Key changes:

- `WebGPUDevice.cpp`: `#if defined(__EMSCRIPTEN__)` → `#if defined(WEBGPU_DAWN_INSTANCE_CAPABILITIES)`. The legacy `instance_desc.capabilities.*` path is taken only by the buck-vendored Dawn (which defines the macro); native cmake Dawn and emscripten leave it undefined and take the `requiredFeatures` / `WGPUInstanceFeatureName_TimedWaitAny` path.

Co-authored-with: Claude Code.

ghstack-source-id: 399812934
@exported-using-ghexport

Differential Revision: [D110149678](https://our.internmc.facebook.com/intern/diff/D110149678/)

Pull Request resolved: #20706

Convert the remaining hand-rolled `int main()` + printf/`bool ok` native tests to GTest so the whole `backends/webgpu/test/` suite is uniform, filterable via `--gtest_filter`, and self-reporting (extends the GTest conversion already applied to `test_dynamic_shape`).

The five converted files are a harness-only change: every test case, tensor shape, tolerance, artifact filename, and skip condition is preserved 1:1, and only the pass/fail reporting mechanism changes. This diff additionally wires the already-GTest `webgpu_dynamic_shape_test` into the CI runner so the dynamic-shape suite actually executes.

Key changes:

- `test/test_webgpu_native.cpp`, `test/native/test_dispatch_order.cpp`, `test/native/test_index.cpp`, `test/native/test_scratch_buffer.cpp`, `test/native/test_update_cache.cpp`: `main`+`printf`/`bool ok` accumulator → `TEST()` cases using `EXPECT_*`/`ASSERT_*`; each keeps a custom `main()` that brings up the WebGPU device once then `RUN_ALL_TESTS()` (device-absent still SKIPs by returning 0). `test_index`/`test_webgpu_native` use inclusive `EXPECT_LE(err, tol)` to match the original `err > tol` fail gate exactly.
- `CMakeLists.txt`: move every native-test target into the `if(TARGET GTest::gtest)` block, linking `GTest::gtest`.
- `scripts/test_webgpu_native_ci.sh`: add `-DEXECUTORCH_BUILD_TESTS=ON` to the native-test configure so the now-gtest-gated targets are defined, and wire `webgpu_dynamic_shape_test` into the runner: export its `.pte`s + goldens via `export_dynamic_shape_cases`, add it to the built/run target list behind the same `--target help` probe, and run it guarded (mirroring the `index` test).
- `test/test_build_webgpu.sh`: add `-DEXECUTORCH_BUILD_TESTS=ON` so the local build script (which builds the now-gtest-gated targets unconditionally) still finds them.

ghstack-source-id: 399812941
@exported-using-ghexport

Differential Revision: [D110536636](https://our.internmc.facebook.com/intern/diff/D110536636/)

pytorchbot requested review from kirklandsign and larryliu0820 as code owners July 4, 2026 17:06

pytorchbot had a problem deploying to cadence July 4, 2026 17:06 — with GitHub Actions Error

pytorchbot temporarily deployed to cadence July 4, 2026 17:06 — with GitHub Actions Inactive

meta-cla Bot added the CLA Signed label Jul 4, 2026

JCNTH added 5 commits July 4, 2026 10:33

ghost merged commit ce95117 into gh/JulianCloudNTH/73/orig Jul 4, 2026
28 of 29 checks passed

ghost deleted the gh/JulianCloudNTH/74/orig branch July 4, 2026 17:33

ghost temporarily deployed to cadence July 4, 2026 17:33 — with GitHub Actions Inactive

ghost temporarily deployed to cadence July 4, 2026 18:03 — with GitHub Actions Inactive

This pull request was closed.

6 commits merged into gh/JulianCloudNTH/73/orig from gh/JulianCloudNTH/74/orig
