[ExecuTorch][WebGPU] Dynamic-shape integration test (allocate-at-max + per-op resize) · Pull Request #20582 · pytorch/executorch

ghost · 2026-06-28T16:22:47Z

Stack from ghstack (oldest at bottom):

End-to-end validation that one graph built at the upper-bound seq-len serves every smaller live shape, matching the torch golden.

Problem: the dynamic-resize engine (allocate-at-max buffers + per-op resize hooks + output resize) had unit-level reasoning but no single oracle proving a graph built at S=MAX runs correctly at S<MAX without reallocating buffers (which would invalidate bind groups).

Solution: a native test that builds each toy model at S=MAX and runs it at several live S, asserting the output matches a torch-computed golden and that the output EValue is resized to the live shape.

- Cases A-D: dynamic + static `rms_norm` (resize shrinks the dispatch; one reused graph across S proves buffers never move; static path unchanged).
- Cases F-H: `rms(rms(x))` cascade, `rms(x)+x` (rms->add cascade), `rms(x)*x` (mul).
- Cases I-L: dynamic `linear_q4gsw` (GEMM at several M), `sdpa_with_kv_cache` (GQA prefill at several S), `embedding_q4gsw` (int64 ids), `apply_rotary_emb` (two outputs).
- Cases M-N: dynamic `sigmoid` (elementwise) and `select_copy(0, -1)` (negative index resolved against the live leading dim each call).
- Graph-reuse variants: every dynamic op above (`rms_norm` incl. a grow-first smallest→largest order, the `rms(rms(x))` cascade, `linear_q4gsw`, `embedding_q4gsw`, `apply_rotary_emb`, `sigmoid`, `select_copy`) also runs ONE loaded graph across multiple live shapes, proving buffers never move so bind groups stay valid across every resize.
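
As a rough illustration of what the graph-reuse cases assert, here is a minimal stdlib-only sketch. `ToyGraph`, `S_MAX`, `D`, and `rms_norm_row` are hypothetical names invented for this example; it is a toy model of the allocate-at-max idea, not the WebGPU engine or the actual test code.

```python
import math

S_MAX, D = 8, 4  # hypothetical upper-bound seq-len and hidden size


def rms_norm_row(row, eps=1e-6):
    """Reference RMS norm over one row (no learned weight, for brevity)."""
    scale = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
    return [v * scale for v in row]


class ToyGraph:
    """Stand-in for a loaded graph: storage is sized once for S_MAX."""

    def __init__(self):
        self.buf = [0.0] * (S_MAX * D)  # allocate-at-max, never replaced
        self.live_s = S_MAX

    def run(self, x_rows):
        live_s = len(x_rows)
        assert live_s <= S_MAX, "live shape must not exceed the build shape"
        self.live_s = live_s  # "resize": only the logical shape changes
        for r, row in enumerate(x_rows):  # work shrinks to the live rows
            self.buf[r * D:(r + 1) * D] = rms_norm_row(row)
        return [self.buf[r * D:(r + 1) * D] for r in range(live_s)]

    def select_row(self, index):
        """Like select_copy(0, index): a negative index uses the live dim."""
        r = index + self.live_s if index < 0 else index
        return self.buf[r * D:(r + 1) * D]


graph = ToyGraph()
storage_id = id(graph.buf)
for s in (2, 8, 5):  # grow first, then shrink, on one loaded graph
    x = [[float(r * D + c + 1) for c in range(D)] for r in range(s)]
    out = graph.run(x)
    golden = [rms_norm_row(row) for row in x]
    assert len(out) == s                       # output resized to live shape
    assert out == golden                       # matches the reference
    assert graph.select_row(-1) == golden[-1]  # -1 means the last live row
    assert id(graph.buf) == storage_id         # storage never reallocated
```

The point of the sketch is the last assertion: the backing storage keeps its identity across every live shape, which is the property that keeps bind groups valid in the real engine.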

Implementation:

- `test/ops/dynamic_shape/test_dynamic_shape_export.py` exports each toy model through `VulkanPartitioner` with a dynamic dim and writes per-S torch goldens; reuses the existing op-test helpers for quant/sdpa/embedding/rope.
- `test/native/test_dynamic_shape.cpp` loads each `.pte`, runs each live S, and compares at the per-op tolerance (rms 1e-3, quant 5e-3, sdpa 2e-3). Reuse tests split each per-op helper into load-once + run-at-shape so a single `Module` serves the whole shape sweep.
- Multi-output ops select their output by full shape, never numel.
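
To show why shape is a safer key than numel, here is a small sketch. The output names and shapes are hypothetical, chosen only so the element counts tie; `find_output_by_shape` is an illustrative helper, not a function from the test.

```python
from math import prod


def find_output_by_shape(outputs, want_shape):
    """Return the single output whose shape equals want_shape exactly."""
    matches = [o for o in outputs if tuple(o["shape"]) == tuple(want_shape)]
    if len(matches) != 1:
        raise LookupError(
            f"expected one output of shape {want_shape}, got {len(matches)}"
        )
    return matches[0]


# Hypothetical two-output op: same element count, different layouts.
outputs = [
    {"name": "out_a", "shape": (1, 4, 8, 16)},
    {"name": "out_b", "shape": (1, 4, 2, 64)},
]
assert prod(outputs[0]["shape"]) == prod(outputs[1]["shape"])  # numel ties
picked = find_output_by_shape(outputs, (1, 4, 2, 64))
assert picked["name"] == "out_b"
```

Matching on numel alone would accept either output here, while the full-shape match is unambiguous and fails loudly when no output, or more than one, fits.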

Constraints: numerics computed with torch (no hand-rolled reference); toy models stay within the 65535 1D-dispatch cap; SDPA case is skipped gracefully if sym_size.int/copy_ op coverage is incomplete (does not fail the suite).
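
For a sense of scale on the dispatch cap, the arithmetic can be sketched as below. This assumes the cap counts workgroups along one dimension and that each invocation handles one element; the workgroup size of 64 is an illustrative assumption, not a value taken from the backend.

```python
MAX_WORKGROUPS_PER_DIM = 65535  # the 1D-dispatch cap cited above
WORKGROUP_SIZE = 64  # assumed invocations per workgroup, for illustration


def workgroups_1d(numel, workgroup_size=WORKGROUP_SIZE):
    """Ceil-divide the element count into workgroups along one dimension."""
    return (numel + workgroup_size - 1) // workgroup_size


def fits_1d_dispatch(numel, workgroup_size=WORKGROUP_SIZE):
    """True if a one-element-per-invocation dispatch stays under the cap."""
    return workgroups_1d(numel, workgroup_size) <= MAX_WORKGROUPS_PER_DIM


largest = MAX_WORKGROUPS_PER_DIM * WORKGROUP_SIZE  # 4,194,240 elements
assert fits_1d_dispatch(largest)
assert not fits_1d_dispatch(largest + 1)
```

Under these assumptions a toy tensor needs a few million elements before a single 1D dispatch overflows, so small test models stay comfortably inside the limit.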

Co-authored-with: Claude Code.
@exported-using-ghexport

Differential Revision: D109906090

[ghstack-poisoned]

pytorch-bot · 2026-06-28T16:22:51Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20582

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 85f0845 with merge base 124b677:

NEW FAILURE - The following job has failed:

pull / test-binary-size-linux-gcc / linux-job (gh)
RuntimeError: Command docker exec -t 804a97034239d92e3dcea3bbe01f66807bf8ea8937314bb50fa0ee51a596404c /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-28T16:23:29Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

ghost · 2026-06-29T20:21:35Z

@claude review and check for any areas or opportunities for modularization

claude · 2026-06-29T20:22:27Z

Claude finished @JulianCloudNTH's task in 1m 28s

Review: dynamic-shape integration test

Read changed files
Review for correctness & best practices
Identify modularization opportunities

Solid end-to-end test: building one graph at S=MAX and replaying smaller live shapes against a torch golden is exactly the right oracle for the allocate-at-max + per-op-resize engine, and selecting multi-output tensors by full shape (not numel) is the correct call. The graceful SKIP paths (no WebGPU device, SDPA op-coverage PENDING) keep the suite green where appropriate. Below are the modularization opportunities you asked about, plus a few minor notes.

Modularization — Python (`test_dynamic_shape_export.py`)

1. The lower → assert-delegated → write-buffer block is copy-pasted 5×. _export (lines 97-108) already does this, but _export_dynamic_linear, _export_dynamic_sdpa, _export_dynamic_embedding, and _export_dynamic_rope each re-implement the identical to_edge_transform_and_lower(...).to_executorch() + any(d.id == "VulkanBackend" ...) + open().write(et.buffer) sequence (e.g. lines 210-218, 253-261, 292-299, 335-343). Since the only thing that differs between callers is how the ExportedProgram is built, split _export so everyone shares the tail:

def _lower_and_write(ep, path: str) -> None:
    et = to_edge_transform_and_lower(ep, partitioner=[VulkanPartitioner()]).to_executorch()
    assert any(
        d.id == "VulkanBackend"
        for plan in et.executorch_program.execution_plan
        for d in plan.delegates
    ), f"Expected VulkanBackend delegate in {path}"
    with open(path, "wb") as f:
        f.write(et.buffer)
    print(f"Exported {path}")

_export then becomes torch.export.export(...) + _lower_and_write(ep, path), and the four specialized exporters drop ~6 duplicated lines each.

2. The tensor → little-endian f32 → file incantation appears ~10×. t.detach().cpu().numpy().astype("<f4").tofile(os.path.join(out_dir, ...)) is repeated in _write_goldens, _export_dynamic_linear, _export_dynamic_sdpa, _export_dynamic_embedding, and _export_dynamic_rope. A one-liner _write_f32(t, path) (and a sibling _write_i64 for the embedding idx at line 307) would centralize the dtype contract that the native side depends on, so a future dtype change is one edit instead of ten.

3. SDPA and RoPE golden loops are structurally identical (lines 266-270, 347-351): iterate [(name, tensor), ...] and dump each to {prefix}.S{s}.{name}.bin. Once _write_f32 exists, both collapse to a shared _write_named(prefix, s, pairs, out_dir).

Modularization — C++ (`test_dynamic_shape.cpp`)

4. The four op-specific checkers share a fixed skeleton. check_linear, check_sdpa, check_embedding, and check_rope each repeat: construct Module, load_forward() with FAIL/ok=false handling, read input bins, make_tensor_ptr, forward, result .ok()/isTensor guards, slice to numel, max_err, and the printf(... PASS/FAIL); ok = ok && pass tail. The per-op delta is really just (a) input tensor construction, (b) output selection, and (c) tolerance. Worth extracting the invariant pieces:

forward_and_check(got, golden, label, s, tol, ok) — the max_err + PASS/FAIL print + ok &= pass tail (currently duplicated in check_s, check_linear, check_embedding, check_rope, check_sdpa).
find_output_by_shape(result, predicate) — the shape-matching output-selection loop is written twice (SDPA lines 207-217, RoPE lines 309-321) with the same structure.

5. The main()-level "load fresh module per S and check_s" loop is repeated 4× for cases A/B, F, G, H (lines 360-369, 398-407, 410-419, 422-431) — identical except the .pte name and the S list. A helper:

void run_dyn_cases(const std::string& dir, const std::string& prefix,
                   std::initializer_list<int> ss, bool& ok) {
  for (int s : ss) {
    Module m(dir + "/" + prefix + ".pte");
    if (m.load_forward() != Error::Ok) { printf("  FAIL load %s.pte\n", prefix.c_str()); ok = false; return; }
    check_s(m, dir, prefix, s, ok);
  }
}

turns each case into a single call.

6. Templatize the binary reader. read_bin only handles float, so check_embedding (lines 246-255) hand-rolls the int64 read. A read_bin<T>(path) would remove that special case and keep one file-reading code path.
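
The suggestion itself is a C++ template; the same single-code-path idea is sketched here in Python to match the export side. `read_bin`'s `dtype` argument and the `_FORMATS` table are hypothetical and only illustrate dispatching on element type.

```python
import struct

# dtype name -> (struct format code, byte width); both little-endian below
_FORMATS = {"f32": ("f", 4), "i64": ("q", 8)}


def read_bin(path, dtype):
    """Read a whole file as little-endian values of the given dtype."""
    code, width = _FORMATS[dtype]
    with open(path, "rb") as f:
        raw = f.read()
    if len(raw) % width != 0:
        raise ValueError(
            f"{path}: {len(raw)} bytes is not a multiple of {width}"
        )
    return list(struct.unpack(f"<{len(raw) // width}{code}", raw))
```

A C++ `read_bin<T>` would play the same role, with `sizeof(T)` standing in for the width lookup, so the int64 ids and the float goldens go through one reader.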

Minor notes (non-blocking)

Magic tolerances. 1e-3f, 5e-3f, 2e-3f are inlined at each call site (lines 106, 155, 230, 270, 331). Named constants (kTolRms, kTolQuant, kTolSdpa) would document intent and pair naturally with the tol parameter in suggestion #4.
Header comment lists A–F but cases extend to L (lines 9-21). The block comment stops at case F while the suite runs through L — worth completing or trimming so it doesn't drift.
max_err returns 1e30f as a size-mismatch sentinel (line 57). It works because every comparison is < tol, but an explicit got.empty() || size mismatch → FAIL at the call site reads more clearly than relying on a magic large float comparing false.
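
The tolerance and sentinel notes can be read together as something like the sketch below, written in Python for brevity rather than the test's C++. The constant names mirror the ones suggested above; returning `None` on a mismatch is one possible way to make the failure explicit, not the only one.

```python
K_TOL_RMS, K_TOL_QUANT, K_TOL_SDPA = 1e-3, 5e-3, 2e-3  # values from the PR


def max_err(got, golden):
    """Largest absolute difference; None signals an unusable comparison."""
    if not got or len(got) != len(golden):
        return None  # explicit mismatch instead of a magic large float
    return max(abs(a - b) for a, b in zip(got, golden))


def check(got, golden, tol):
    """True only when the sizes agree and the max error is under tol."""
    err = max_err(got, golden)
    return err is not None and err < tol


assert check([1.0, 2.0], [1.0, 2.0005], K_TOL_RMS)
assert not check([1.0], [1.0, 2.0], K_TOL_RMS)  # size mismatch fails
assert not check([], [], K_TOL_RMS)              # empty output fails
```

The behaviour matches the sentinel version, since a mismatch still fails every tolerance, but the reason for the failure is visible at the call site.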

None of these are correctness blockers — the test logic itself looks right. The Python _lower_and_write extraction (#1) is the highest-value change; the C++ helpers (#4/#5) are next.

[ghstack-poisoned]

…+ per-op resize)
Pull Request resolved: #20582
ghstack-source-id: 399812841
Differential Revision: [D109906090](https://our.internmc.facebook.com/intern/diff/D109906090/)

Update

29ec110

[ghstack-poisoned]

ghost requested review from kirklandsign and larryliu0820 as code owners June 28, 2026 16:22

ghost temporarily deployed to cadence June 28, 2026 16:22 — with GitHub Actions Inactive

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 28, 2026

Update

0c7b54b

[ghstack-poisoned]

ghost temporarily deployed to cadence June 29, 2026 22:10 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jun 29, 2026

Update

9a9981f

[ghstack-poisoned]

ghost temporarily deployed to cadence June 30, 2026 02:46 — with GitHub Actions Inactive

Update

c3cd768

[ghstack-poisoned]

ghost temporarily deployed to cadence June 30, 2026 21:10 — with GitHub Actions Inactive

Update

766b9c8

[ghstack-poisoned]

ghost temporarily deployed to cadence July 2, 2026 23:14 — with GitHub Actions Inactive

ghost mentioned this pull request Jul 2, 2026

[ExecuTorch][WebGPU] Convert remaining native tests to GTest #20706

Merged

Update

7e10497

[ghstack-poisoned]

ghost had a problem deploying to cadence July 3, 2026 20:28 — with GitHub Actions Error

ghost temporarily deployed to cadence July 3, 2026 20:28 — with GitHub Actions Inactive

Update

49f913e

[ghstack-poisoned]

ghost temporarily deployed to cadence July 3, 2026 20:52 — with GitHub Actions Inactive

ghost had a problem deploying to cadence July 3, 2026 21:19 — with GitHub Actions Error

ghost requested a review from psiddh July 3, 2026 21:26

Update

85f0845

[ghstack-poisoned]

ghost temporarily deployed to cadence July 3, 2026 21:37 — with GitHub Actions Inactive

ghost temporarily deployed to cadence July 3, 2026 22:05 — with GitHub Actions Inactive

psiddh approved these changes Jul 4, 2026


meta-codesync Bot merged commit ce3dc23 into gh/JulianCloudNTH/74/base Jul 4, 2026
180 of 183 checks passed

meta-codesync Bot deleted the gh/JulianCloudNTH/74/head branch July 4, 2026 17:06

meta-codesync Bot temporarily deployed to cherry-pick-bot July 4, 2026 17:06 Inactive

pytorchbot mentioned this pull request Jul 4, 2026

[ExecuTorch][WebGPU] Dynamic-shape integration test (allocate-at-max + per-op resize) #20721

Merged

meta-codesync[bot] merged 9 commits into gh/JulianCloudNTH/74/base from gh/JulianCloudNTH/74/head
