[ExecuTorch][WebGPU] Dynamic tensor-shape resize engine core #20713
Merged
14 commits merged on Jul 4, 2026
Pull Request resolved: #20574

The WebGPU backend baked static tensor shapes at build time, so a dynamic `.pte` needed a separate graph for each shape (prefill vs. decode). This adds a tensor-shape resize engine mirroring Vulkan: tensors carry live `cur_dims` ≤ max, inputs resize per call, and a bounded fixpoint propagates tensor-level resize hooks.

**Key changes:**
- `WebGPUTensor`: add `cur_dims`/`cur_nbytes` (live sizes ≤ max allocation), initialized to max at build
- `WebGPUGraph`: `resize_input`/`set_cur_dims` validate that live dims fit max; `propagate_resize` runs tensor hooks for dirty shapes
- `update_symints_from_inputs` reads live `cur_dims`; adds a `sym_size.int` dim source path
- `copy_inputs` uploads only live bytes; `WebGPUBackend::execute` shrinks inputs and resizes outputs to live shapes

Static graphs stay byte-identical: `cur == max` forever, no hooks fire, no reallocations.

ghstack-source-id: 399812823
@exported-using-ghexport

Differential Revision: [D109906091](https://our.internmc.facebook.com/intern/diff/D109906091/)
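The engine described above can be pictured with a minimal sketch. This is not the backend's code: `Tensor`, `Graph`, the `dirty` flag, and the `max_passes` bound are hypothetical stand-ins chosen for illustration, and only the names `cur_dims`, `set_cur_dims`, and `propagate_resize` come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for WebGPUTensor.
struct Tensor {
  std::vector<int64_t> dims;      // max allocation, fixed at build
  std::vector<int64_t> cur_dims;  // live shape, elementwise <= dims
  bool dirty = false;             // set when cur_dims changes
};

// Hypothetical stand-in for WebGPUGraph.
struct Graph {
  std::vector<Tensor> tensors;
  // hooks[id] run when tensor `id` has a dirty shape.
  std::vector<std::vector<std::function<void(Graph&)>>> hooks;

  // Rejects a live shape that does not fit the max allocation.
  void set_cur_dims(std::size_t id, const std::vector<int64_t>& live) {
    Tensor& t = tensors.at(id);
    if (live.size() != t.dims.size()) {
      throw std::invalid_argument("rank mismatch");
    }
    for (std::size_t i = 0; i < live.size(); ++i) {
      if (live[i] < 0 || live[i] > t.dims[i]) {
        throw std::out_of_range("live dim does not fit max");
      }
    }
    if (live != t.cur_dims) {
      t.cur_dims = live;
      t.dirty = true;
    }
  }

  // Bounded fixpoint: fire hooks of dirty tensors until a pass fires none.
  void propagate_resize(int max_passes = 8) {
    assert(hooks.size() == tensors.size());
    for (int pass = 0; pass < max_passes; ++pass) {
      bool fired = false;
      for (std::size_t id = 0; id < tensors.size(); ++id) {
        if (!tensors[id].dirty) continue;
        tensors[id].dirty = false;
        fired = true;
        for (auto& hook : hooks[id]) hook(*this);
      }
      if (!fired) return;
    }
    for (const Tensor& t : tensors) {
      if (t.dirty) throw std::runtime_error("resize did not converge");
    }
  }
};
```

On a static graph nothing ever calls `set_cur_dims` with a shape different from the max, so no tensor becomes dirty and `propagate_resize` returns after one empty pass, which matches the "no hooks fire" property claimed above.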
Pull Request resolved: #20575

These ops (`rms_norm`, `embedding_q4gsw`, `apply_rotary_emb`) baked their dispatch count, param UBO, and output dims at `build()` for the max seq-len. On a dynamic-shape graph at a smaller live S they would over-dispatch and leave the output sized at the max, so the resize engine could not actually shrink them.

This adds tensor resize hooks to `rms_norm`, `embedding_q4gsw`, and `apply_rotary_emb`. When an input is resized, each hook recomputes the live row/token count, rewrites the param UBO, updates the dispatch `workgroup_count_x`, and sets the output's `cur_dims`. The hook is inert until a resize happens, so static graphs are byte-identical.

Implementation:
- `rms_norm`: recompute `num_rows` from live `cur_dims`; out dims follow the input.
- `embedding_q4gsw`: recompute `num_indices`/`total_blocks`; out dims = indices dims + `[embed_dim]`.
- `apply_rotary_emb`: `add_rope_dispatch` now returns its uniform handle; one hook rewrites both the xq and xk dispatches/UBOs for the live S and sets both outputs.
- Each keeps its uniform buffer alive via `own_uniform_buffer` (the hook rewrites it) instead of releasing it at build.

Mirrors Vulkan per-op `resize_*_node` (recompute sizes + dispatch each execute). No kernel/WGSL/numerics change. Behavior-neutral on static graphs (hook only fires when live dims differ from max). `quantized_linear` and SDPA resize hooks land in following diffs; `prepack` needs none (constants are fixed-size).

ghstack-source-id: 399812824
@exported-using-ghexport

Differential Revision: [D109906096](https://our.internmc.facebook.com/intern/diff/D109906096/)
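As a rough illustration of what such a hook recomputes for `rms_norm`, the sketch below derives the row count from the live dims. The struct layouts are hypothetical, and the one-workgroup-per-row mapping is an assumption made for the example, not something the description states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical param UBO layout and dispatch record.
struct RmsNormParams {
  uint32_t num_rows;
  uint32_t row_width;
};
struct Dispatch {
  uint32_t workgroup_count_x;
};

// Recomputes everything that depends on the live input shape.
void resize_rms_norm(const std::vector<int64_t>& in_cur_dims,
                     RmsNormParams& params,
                     Dispatch& dispatch,
                     std::vector<int64_t>& out_cur_dims) {
  assert(!in_cur_dims.empty());
  // Rows = product of every dim except the normalized (last) one.
  uint64_t rows = 1;
  for (std::size_t i = 0; i + 1 < in_cur_dims.size(); ++i) {
    rows *= static_cast<uint64_t>(in_cur_dims[i]);
  }
  params.num_rows = static_cast<uint32_t>(rows);
  params.row_width = static_cast<uint32_t>(in_cur_dims.back());
  // Assumption: one workgroup per row.
  dispatch.workgroup_count_x = params.num_rows;
  // Out dims follow the input.
  out_cur_dims = in_cur_dims;
}
```

For a graph built at `[1, 128, 2048]` and run at a live `[1, 5, 2048]`, this shrinks the dispatch from 128 rows to 5 without touching the buffer allocation.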
Pull Request resolved: #20576

**Make the 4-bit quantized linear serve any live M (rows) from one graph, so a dynamic prefill+decode graph computes correct-size outputs.**

**Problem:** `linear_q4gsw` baked its dispatch count, `params.M`, and output shape at `build()` for the max M. On a dynamic-shape graph at a smaller live M (e.g. decode M=1 vs. prefill M=S) it would over-dispatch and leave the output sized at the max.

**Solution:**
- Before: one fixed dispatch sized for the build-time M.
- After: a tensor resize hook on the input recomputes the live M from `cur_dims`, rewrites `params.M`, updates the dispatch `workgroup_count_x` for the SAME kernel chosen at build (bicol GEMV / shmem GEMM / register-tiled), and sets the output `cur_dims` (= input dims with the last dim replaced by N). Inert until the input is resized.

**Implementation:**
- The kernel select (bicol GEMV for M==1, else shmem GEMM for large K/N, else register-tiled) is fixed at build; the hook re-runs `compute_q4gsw_workgroup_count` for whichever of the three the build chose and rewrites the param UBO + output dims for the live M. It does not switch kernels (runtime M-switching is a separate optimization).
- `own_uniform_buffer` keeps the param UBO alive so the hook can rewrite it.
- Mirrors Vulkan `resize_q4gsw_linear_node` (recompute M-derived dispatch each execute).

**Constraints:** Behavior-neutral on static graphs (hook fires only when the input's live M differs from the max). No kernel/WGSL/numerics change. Runtime M-based kernel switching is deliberately out of scope (a later opt diff).

Co-authored-with: Claude Code.

ghstack-source-id: 399812825
@exported-using-ghexport

Differential Revision: [D109906094](https://our.internmc.facebook.com/intern/diff/D109906094/)
Pull Request resolved: #20577

**Make the elementwise add and mul ops serve any live shape from one graph.**

**Problem:** `aten.add.Tensor` and `aten.mul.Tensor` baked their element count + param UBO(s) + output shape at `build()` for the max shape. On a dynamic-shape graph at a smaller live shape they would over-dispatch and leave the output sized at the max.

**Solution:**
- Before: one fixed dispatch sized for the build-time shape.
- After: each registers a resize hook on BOTH operands (either operand may be the dynamic one, depending on argument order). The hook recomputes the live element count, rewrites the param UBO(s), updates the dispatch `workgroup_count_x`, and sets the output `cur_dims`. Inert until an operand is resized.

**Implementation:**
- `add`: out follows the larger operand (robust when one input is a static residual and the other is the dynamic-S tensor); rewrites `AddParams`.
- `mul`: recomputes the broadcast output shape and rebuilds all three `TensorMeta` UBOs via `fill_tensor_meta_broadcast`.
- Each keeps its uniform buffer(s) alive via `own_uniform_buffer` instead of releasing at build.
- Mirrors Vulkan per-op `resize_*_node` (recompute sizes + dispatch each execute).

**Constraints:** Behavior-neutral on static graphs (the hook fires only when an operand's live shape differs from the max). No kernel/WGSL/numerics change.

Co-authored-with: Claude Code.

ghstack-source-id: 399812828
@exported-using-ghexport

Differential Revision: [D109906093](https://our.internmc.facebook.com/intern/diff/D109906093/)
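The two output-shape rules above can be sketched as follows. This assumes standard right-aligned (NumPy-style) broadcasting for `mul`; the helper names `broadcast_shape`, `numel`, and `add_out_dims` are hypothetical and are not the backend's `fill_tensor_meta_broadcast`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Dims = std::vector<int64_t>;

int64_t numel(const Dims& d) {
  int64_t n = 1;
  for (int64_t v : d) n *= v;
  return n;
}

// mul: right-aligned broadcast of the two live shapes.
Dims broadcast_shape(const Dims& a, const Dims& b) {
  const std::size_t rank = std::max(a.size(), b.size());
  Dims out(rank, 1);
  for (std::size_t i = 0; i < rank; ++i) {
    // Walk from the last dim; missing leading dims count as 1.
    const int64_t da = i < a.size() ? a[a.size() - 1 - i] : 1;
    const int64_t db = i < b.size() ? b[b.size() - 1 - i] : 1;
    if (da != db && da != 1 && db != 1) {
      throw std::invalid_argument("shapes do not broadcast");
    }
    out[rank - 1 - i] = (da == 1) ? db : da;
  }
  return out;
}

// add: the output follows the operand with more live elements.
Dims add_out_dims(const Dims& a, const Dims& b) {
  return numel(a) >= numel(b) ? a : b;
}
```

Because the hook is registered on both operands, either rule is re-evaluated whichever side changes its live shape.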
Pull Request resolved: #20578

**Make sigmoid and select_copy serve any live shape from one graph; fix select's last-token index under dynamic shapes.**

**Problem:** Both ops baked their dispatch/params/output shape at `build()` for the max shape. `select_copy` was worse: a negative index (e.g. `-1` for the last token) was normalized against the build-time MAX dim, so at a smaller live S it selected a stale/zero position past the live data, producing wrong (often zero) output.

**Solution:**
- `sigmoid` (generic `add_unary_op`): a resize hook recomputes `num_elements`/dispatch and sets the output `cur_dims` (shape-preserving).
- `select_copy`: KEEP the raw (possibly negative) index at build; a resize hook re-resolves it against the LIVE dim, recomputes the output dims (= input minus `dim`), and rebuilds the out/in `TensorMeta` UBOs and the dispatch.
- Both keep their uniform buffer(s) alive via `own_uniform_buffer`.

**Implementation:**
- The select out/in meta is rebuilt from synthetic `WebGPUTensor{dims}` via `fill_tensor_meta` (reads only `.dims`).
- Mirrors Vulkan per-op `resize_*_node`.

**Constraints:** Behavior-neutral on static graphs (hooks fire only when an input's live shape differs from the max). No kernel/WGSL/numerics change.

Co-authored-with: Claude Code.

ghstack-source-id: 399812832
@exported-using-ghexport

Differential Revision: [D109906095](https://our.internmc.facebook.com/intern/diff/D109906095/)
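The negative-index bug is easy to reproduce in isolation. The sketch below uses hypothetical helper names (`resolve_index`, `select_out_dims`) and only illustrates why the raw index has to be kept and resolved against the live dim.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Resolves a possibly negative index against a given dim size.
int64_t resolve_index(int64_t raw_index, int64_t dim_size) {
  const int64_t idx = raw_index < 0 ? raw_index + dim_size : raw_index;
  if (idx < 0 || idx >= dim_size) {
    throw std::out_of_range("select index outside the dim");
  }
  return idx;
}

// Output dims of select: the input dims with `dim` removed.
std::vector<int64_t> select_out_dims(std::vector<int64_t> in_dims,
                                     std::size_t dim) {
  assert(dim < in_dims.size());
  in_dims.erase(in_dims.begin() + static_cast<std::ptrdiff_t>(dim));
  return in_dims;
}
```

With a graph built at a max leading dim of 128 and run at a live S of 5, `resolve_index(-1, 128)` gives 127 (past the live data), while `resolve_index(-1, 5)` gives 4, the actual last token.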
Pull Request resolved: #20579

**Make `view_copy` track the live sequence length under dynamic shapes.**

**Problem:** `view_copy` lowers to a flat DMA buffer copy (`add_buffer_copy`) sized at the build-time max shape. With one dynamic graph serving any seq-len S (prefill S=K, decode S=1), the copy moved the full max-S byte count and the output kept its max dims, so a downstream consumer read a live shape that was too large.

**Solution:** register a tensor resize hook on the input so the copy follows the live input numel (a view preserves numel).
- Before: `copy_nbytes` and the output dims are fixed at the serialized max.
- After: the hook recomputes the live numel from `cur_dims(in)`, scales the single dynamic output dim to preserve numel, sets the output `cur_dims`, and rewrites the Copy dispatch's `copy_nbytes`.

**Implementation:**
- Keep the existing DMA path (`Kind::Copy`); the hook only rewrites `copy_nbytes` via `dispatch_at`, no new kernel.
- Handle the aliased in/out fast path (no copy emitted) by still setting the output `cur_dims` so the resize cascade reaches consumers.
- Mirrors Vulkan's `view_buffer` contiguous fast path; numel-preserving like the other dynamic-shape op hooks.

**Constraints:** inert on a static graph (`cur_dims == dims`), so byte-identical to the prior behavior; fp32-only and numel-preserving invariants unchanged.

Co-authored-with: Claude Code.

ghstack-source-id: 399812833
@exported-using-ghexport

Differential Revision: [D109906098](https://our.internmc.facebook.com/intern/diff/D109906098/)
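A minimal sketch of the numel-preserving rescale follows. How the backend identifies the single dynamic output dim is not described, so here the caller passes it in as `dynamic_dim`; that parameter, the `ViewResize` struct, and the function name are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct ViewResize {
  std::vector<int64_t> out_cur_dims;
  uint64_t copy_nbytes;
};

// Scales one output dim so the view keeps the live input numel.
ViewResize resize_view(const std::vector<int64_t>& in_cur_dims,
                       const std::vector<int64_t>& out_max_dims,
                       std::size_t dynamic_dim) {
  assert(dynamic_dim < out_max_dims.size());
  int64_t live_numel = 1;
  for (int64_t d : in_cur_dims) live_numel *= d;
  // Product of the output dims that stay fixed.
  int64_t static_numel = 1;
  for (std::size_t i = 0; i < out_max_dims.size(); ++i) {
    if (i != dynamic_dim) static_numel *= out_max_dims[i];
  }
  if (static_numel == 0 || live_numel % static_numel != 0) {
    throw std::invalid_argument("live numel does not fit the view");
  }
  const int64_t live_dim = live_numel / static_numel;
  if (live_dim > out_max_dims[dynamic_dim]) {
    throw std::out_of_range("live dim exceeds the max allocation");
  }
  // fp32-only, per the stated constraint.
  ViewResize r{out_max_dims,
               static_cast<uint64_t>(live_numel) * sizeof(float)};
  r.out_cur_dims[dynamic_dim] = live_dim;
  return r;
}
```

For an input built at `[1, 128, 2048]` and viewed as `[1, 128, 32, 64]`, a live input of `[1, 5, 2048]` yields live output dims `[1, 5, 32, 64]` and a copy of 5 * 2048 * 4 bytes instead of the max-S byte count.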
Pull Request resolved: #20580

**Make `sdpa_with_kv_cache` serve any live seq-len S from one graph (batched prefill S=K and decode S=1).**

**Problem:** the existing dynamic path only reacted to a live `input_pos` (decode), with S captured at build time. It rewrote the QK dispatch (which depends on `context_len`) but left `update_cache`, softmax, and AV sized for the build-time S. Under a dynamic seq-len S (one graph serving prefill and decode), `kv_numel`, the QK/AV tile grids, and the softmax row count all depend on S and were stale.

**Solution:** a single recompute hook driven by either a live S (q tensor resize) or a live `input_pos` (SymInt), recomputing every per-step quantity from the live shape.
- Before: hook keyed only on `input_pos`; recomputes ctx + QK count; S fixed.
- After: hook keyed on q (always) and `input_pos` (when SymInt); reads live S from `cur_dims(q)` and live pos, recomputes all five dispatches' counts + UBOs (`update_cache` K/V, QK, softmax, AV), and sets the output `cur_dims` to q's.

**Implementation:**
- Capture the `update_cache`/softmax/AV dispatch indices (previously only QK) so their workgroup counts can be rewritten per step.
- QK/AV workgroup counts use the landed register-tiled grids (`Hq*ceil(S/TM)*ceil(ctx-or-D/TN)`); softmax is one workgroup per `Hq*S` row.
- Register the hook on q unconditionally; it is inert until q is resized, so a static graph is byte-identical.
- Mirrors Vulkan `DynamicDispatchNode` (recompute workgroups per execute); scratch is sized at build (S=max, ctx=Cmax) so buffers never move and bind groups stay valid.

**Constraints:** fp32-only, batch=1, GQA, `is_causal=true`, `D%4==0` invariants unchanged; the static / decode-only paths are unaffected (the q hook never fires without a resize).

Co-authored-with: Claude Code.

ghstack-source-id: 399812834
@exported-using-ghexport

Differential Revision: [D109906097](https://our.internmc.facebook.com/intern/diff/D109906097/)
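The grid formulas quoted above translate directly into arithmetic. In the sketch below, `ctx = input_pos + S` is an assumption about how the context length is derived, and the tile sizes `TM`/`TN` are passed in as parameters because the description does not state them; `SdpaCounts` and `sdpa_counts` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t div_up(uint64_t a, uint64_t b) {
  return (a + b - 1) / b;
}

struct SdpaCounts {
  uint64_t qk;
  uint64_t softmax;
  uint64_t av;
};

// Workgroup counts recomputed from the live S and input_pos.
SdpaCounts sdpa_counts(uint64_t Hq, uint64_t S, uint64_t input_pos,
                       uint64_t D, uint64_t TM, uint64_t TN) {
  assert(TM > 0 && TN > 0);
  const uint64_t ctx = input_pos + S;  // assumption: live context length
  const uint64_t row_tiles = div_up(S, TM);
  return SdpaCounts{
      Hq * row_tiles * div_up(ctx, TN),  // QK: Hq*ceil(S/TM)*ceil(ctx/TN)
      Hq * S,                            // softmax: one workgroup per row
      Hq * row_tiles * div_up(D, TN)};   // AV: Hq*ceil(S/TM)*ceil(D/TN)
}
```

With illustrative values `Hq=32`, `D=64`, `TM=TN=8`, a decode step at `S=1`, `input_pos=10` needs 64 QK, 32 softmax, and 256 AV workgroups, while a prefill at `S=128`, `input_pos=0` needs 8192, 4096, and 4096. Both are served by the same graph because only the counts change, not the buffers.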
…t/end)

Pull Request resolved: #20581

**Make `slice_copy` support a dynamic gather range so the RoPE-freqs slice `[input_pos : input_pos + S]` works under one dynamic graph.**

**Problem:** the static slice handler read `start` via a scalar reader that throws on a SymInt and ignored `end` (output length baked AOT). The RoPE-freqs slice uses a SymInt `input_pos` for start and a live S for the range, so the static op could neither build nor resize for it.

**Solution:** read start/end as possibly-dynamic SymInts and add a resize hook that recomputes the gather offset and live output length each step.
- Before: `start` is a static scalar (SymInt throws); `end` ignored; output length fixed at the serialized max.
- After: `start`/`end` read via a SymInt-aware reader; a hook recomputes `out[dim] = (end - start + step - 1) / step`, rewrites `out_meta`/`in_meta`/`params` UBOs + the dispatch count, and sets the output `cur_dims`.

**Implementation:**
- Hook registered on the `start`/`end` value-ids when they are SymInts and on the input tensor always (inert until resized, so a static slice is byte-identical).
- Output/input `TensorMeta` rebuilt from live dims; `dim`/`step` stay static.
- Keep the uniforms alive via `own_uniform_buffer` so the hook can rewrite them.
- Mirrors Vulkan `resize_slice_copy_node`.

**Constraints:** fp32-only; `dim`/`step` static; numerics + layout unchanged; inert on a static graph.

NOTE (stacking): this diff sits on top of the in-review `slice_copy` op (D108793168); rebase onto it once that op lands on master.

Co-authored-with: Claude Code.

ghstack-source-id: 399812835
@exported-using-ghexport

Differential Revision: [D109906092](https://our.internmc.facebook.com/intern/diff/D109906092/)
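The output-length formula above is a ceiling division of the range by the step. A small sketch, with hypothetical function names and an added assumption that an empty or inverted range yields length 0:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// out[dim] = (end - start + step - 1) / step, i.e. ceil((end - start) / step).
int64_t slice_out_len(int64_t start, int64_t end, int64_t step) {
  if (step <= 0) throw std::invalid_argument("step must be positive");
  if (end <= start) return 0;  // assumption: empty range gives length 0
  return (end - start + step - 1) / step;
}

// Live output dims: the input's live dims with `dim` replaced.
std::vector<int64_t> slice_out_dims(std::vector<int64_t> in_cur_dims,
                                    std::size_t dim, int64_t start,
                                    int64_t end, int64_t step) {
  assert(dim < in_cur_dims.size());
  in_cur_dims[dim] = slice_out_len(start, end, step);
  return in_cur_dims;
}
```

For the RoPE-freqs case with `start = input_pos`, `end = input_pos + S`, and `step = 1`, the length is exactly S, so a decode step at `input_pos = 10`, `S = 1` gathers one row starting at offset 10.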
…+ per-op resize)

Pull Request resolved: #20582

**End-to-end validation that one graph built at the upper-bound seq-len serves every smaller live shape, matching the torch golden.**

**Problem:** the dynamic-resize engine (allocate-at-max buffers + per-op resize hooks + output resize) had unit-level reasoning but no single oracle proving a graph built at S=MAX runs correctly at S<MAX without reallocating buffers (which would invalidate bind groups).

**Solution:** a native test that builds each toy model at S=MAX and runs it at several live S, asserting the output matches a torch-computed golden and that the output EValue is resized to the live shape.
- Cases A-D: dynamic + static `rms_norm` (resize shrinks the dispatch; one reused graph across S proves buffers never move; static path unchanged).
- Cases F-H: `rms(rms(x))` cascade, `rms(x)+x` (rms->add cascade), `rms(x)*x` (mul).
- Cases I-L: dynamic `linear_q4gsw` (GEMM at several M), `sdpa_with_kv_cache` (GQA prefill at several S), `embedding_q4gsw` (int64 ids), `apply_rotary_emb` (two outputs).
- Cases M-N: dynamic `sigmoid` (elementwise) and `select_copy(0, -1)` (negative index resolved against the live leading dim each call).
- Graph-reuse variants: every dynamic op above (`rms_norm` incl. a grow-first smallest→largest order, the `rms(rms(x))` cascade, `linear_q4gsw`, `embedding_q4gsw`, `apply_rotary_emb`, `sigmoid`, `select_copy`) also runs ONE loaded graph across multiple live shapes, proving buffers never move so bind groups stay valid across every resize.

**Implementation:**
- `test/ops/dynamic_shape/test_dynamic_shape_export.py` exports each toy model through `VulkanPartitioner` with a dynamic dim and writes per-S torch goldens; reuses the existing op-test helpers for quant/sdpa/embedding/rope.
- `test/native/test_dynamic_shape.cpp` loads each `.pte`, runs each live S, and compares at the per-op tolerance (rms 1e-3, quant 5e-3, sdpa 2e-3).
- Reuse tests split each per-op helper into load-once + run-at-shape so a single `Module` serves the whole shape sweep.
- Multi-output ops select their output by full shape, never numel.

**Constraints:** numerics computed with torch (no hand-rolled reference); toy models stay within the 65535 1D-dispatch cap; SDPA case is skipped gracefully if `sym_size.int`/`copy_` op coverage is incomplete (does not fail the suite).

Co-authored-with: Claude Code.

ghstack-source-id: 399812841
@exported-using-ghexport

Differential Revision: [D109906090](https://our.internmc.facebook.com/intern/diff/D109906090/)
… (prefill path)

Pull Request resolved: #20583

**Lift the 65535 workgroup-per-dim dispatch cap so single-shot SDPA prefill runs at any sequence length.**

**Problem**: The WebGPU backend is 1D-dispatch-only and throws when a kernel's workgroup count exceeds the device per-dim limit (`maxComputeWorkgroupsPerDimension`, spec floor 65535). SDPA prefill QK exceeds it around S~362 (softmax/AV at S=2048), blocking single-shot / long-context prefill.

**Solution**: Fold a >limit 1D workgroup count into 2D; the shader reconstructs the linear index from `@builtin(num_workgroups)`.
- **Before**: `compute_1d_workgroup_count` throws if `count > limit`; dispatch `(count, 1, 1)`.
- **After**: `compute_2d_workgroup_count` returns `{count, 1}` (fast path) or a near-square `{x, y}` (`x = ceil(sqrt(count))` clamped to `limit`, `y = div_up(count, x)`); dispatch `(x, y, 1)`. A flat `{limit, div_up(count, limit)}` split would idle up to ~half the launched workgroups when `count` just exceeds `limit`; the near-square split holds the waste to `O(sqrt(count))` (e.g. 65536 -> `{256, 256}`, 0 inactive).

**Implementation**:
- `WgCount` + pure `fold_workgroup_count_2d` + `compute_2d_workgroup_count` in `WebGPUUtils.h` (device-free, unit-testable; `queried_max_workgroups` factored out of the 1D path)
- `WebGPUDispatch.workgroup_count_y` (default 1, declared last so existing aggregate inits are unchanged); both `dispatchWorkgroups` calls + the profiling record pass `(x, y, 1)`
- Per-kernel in-shader reconstruction: thread-form `idx = gid.x + gid.y*(num_workgroups.x*wg_size)` (QK/AV/add); row-form `row_idx = wid.x + wid.y*num_workgroups.x` (softmax, which keeps a `valid` predicate rather than an early return so `workgroupBarrier()`s stay uniform)
- `Sdpa.cpp`: QK/softmax/AV counts via the 2D helper; the dynamic-`input_pos` resize hook recomputes both x and y for QK
- Reference: ET-Vulkan dispatches over natural N-D extents (never folds a flat count nor guards the per-dim limit) and MLX `get_2d_grid_dims` packs whole tensor dims; for our flattened scalar count the near-square split is the correct no-shape-info analog (a pack-to-limit split would reproduce the idle-half waste)

**Constraints**:
- `y=1` fast path keeps every non-folded dispatch byte-identical to the prior 1D path
- Scope = prefill path only; `rms_norm`/`embedding`/`lm_head`/`update_cache` are row/token-indexed and never hit the cap, so they keep the 1D path
- Throws if a 3rd dispatch dimension would be needed, which is unreachable for real prefill (the `uint32` element guard fires first at S~11585)

Co-authored-with: Claude Code.

ghstack-source-id: 399812920
@exported-using-ghexport

Differential Revision: [D109517684](https://our.internmc.facebook.com/intern/diff/D109517684/)
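The near-square fold can be sketched from the description alone. The function below is an illustrative reimplementation under that description, not the contents of `WebGPUUtils.h`: the name `fold_2d_sketch`, its signature, and the exception type are assumptions, and it does not guard against `x * x` overflow for counts near 2^64.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <stdexcept>

struct WgCount {
  uint32_t x;
  uint32_t y;
};

constexpr uint64_t div_up(uint64_t a, uint64_t b) {
  return (a + b - 1) / b;
}

// Folds a flat workgroup count into {x, y} with x, y <= limit.
WgCount fold_2d_sketch(uint64_t count, uint32_t limit) {
  assert(limit > 0);
  // Fast path: fits in one dimension, y stays 1.
  if (count <= limit) return {static_cast<uint32_t>(count), 1u};
  // x = ceil(sqrt(count)), corrected for floating-point rounding.
  uint64_t x = static_cast<uint64_t>(std::sqrt(static_cast<double>(count)));
  while (x * x < count) ++x;
  while (x > 1 && (x - 1) * (x - 1) >= count) --x;
  if (x > limit) x = limit;
  const uint64_t y = div_up(count, x);
  if (y > limit) {
    throw std::length_error("needs a third dispatch dimension");
  }
  return {static_cast<uint32_t>(x), static_cast<uint32_t>(y)};
}

// Row-form reconstruction: wid.x + wid.y * num_workgroups.x.
constexpr uint64_t linear_index(uint32_t wid_x, uint32_t wid_y,
                                uint32_t num_wg_x) {
  return wid_x + static_cast<uint64_t>(wid_y) * num_wg_x;
}
```

Since `x * y >= count`, some trailing workgroups can land past `count`; the shader has to mask those out, which is why the softmax kernel keeps a `valid` predicate. For `count = 65537` the fold gives `{257, 256}`, launching 65792 workgroups with 255 inactive, compared with `{65535, 2}` and 65533 inactive for a flat split.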
…d unit test

Pull Request resolved: #20584

**Test coverage for the 2D dispatch fold, stacked above the cap-lift op.**

**Problem**: The 2D fold is load-bearing index math (a wrong `{x, y}` means out-of-bounds writes or dropped threads), and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested.

**Solution**: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family.
- **Before**: no coverage for >65535-workgroup dispatch; `llama1b_prefill_512`/`_2048` shapes threw at the cap
- **After**: `fold_workgroup_count_2d` unit-tested at the cap boundaries, and the two prefill shapes run as goldens

**Implementation**:
- `test/native/test_dispatch_2d.cpp`: device-free unit test for `utils::fold_workgroup_count_2d` covering the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 (`{65535, 3}`) and S=2048 (`{65535, 33}`), and the needs-3rd-dimension throw; asserts each `{x, y}` covers `[0, count)`
- `llama1b_prefill_512` + `llama1b_prefill_2048` configs appended to the byte-mirrored `CONFIGS` (`test_sdpa.py`) and `kSdpaConfigs` (`test_webgpu_native.cpp`)
- Registers `webgpu_dispatch_2d_test` in CMake + the native CI script

**Constraints**:
- The Python/C++ config entries byte-mirror each other (kept in sync)
- `add` shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element `add` fold case is omitted as disproportionate

Co-authored-with: Claude Code.

ghstack-source-id: 399812923
@exported-using-ghexport

Differential Revision: [D109517683](https://our.internmc.facebook.com/intern/diff/D109517683/)
Pull Request resolved: #20651

**Lift the 65535 workgroup-per-dim cap for `mul` and `permute` so they run at any numel.**

`mul.Tensor` and `permute` still used `compute_1d_workgroup_count`, which throws once `numel / wg_size > 65535`. A realistic Llama-3.2-1B LoRA layer hits this (`mul` over `[2048, 8192]` = 262k workgroups; `permute` of `[2048, 2048]` = 65536). `add`/`sub`/`div`/`fill`/`sdpa` already use the 2D fold; this brings `mul` + `permute` in line.

Key changes:
- `mul/BinaryOp.cpp`, `permute/Permute.cpp`: `compute_1d_workgroup_count` → `compute_2d_workgroup_count` (returns `utils::WgCount`); dispatch + resize hook now set both `workgroup_count_x` and `workgroup_count_y`.
- `binary_mul.wgsl`, `permute.wgsl`: `main` takes `@builtin(num_workgroups)`; flat index `gid.x + gid.y * (num_workgroups.x * wg_size)` (regenerated `*_wgsl.h`).

Mirrors the landed `add` op fold (`runtime/ops/add/{BinaryOp.cpp,binary_add.wgsl}`).

Co-authored-with: Claude Code.

ghstack-source-id: 399812930
@exported-using-ghexport

Differential Revision: [D110149677](https://our.internmc.facebook.com/intern/diff/D110149677/)
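The two workgroup counts quoted above can be checked with a few lines of arithmetic. The workgroup size of 64 is an assumption inferred from those numbers (it is the value that reproduces both), and `workgroups_for` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 64 threads per workgroup, inferred from the quoted counts.
constexpr uint64_t kWgSize = 64;
constexpr uint64_t kPerDimCap = 65535;

// One thread per element, rounded up to whole workgroups.
constexpr uint64_t workgroups_for(uint64_t numel) {
  return (numel + kWgSize - 1) / kWgSize;
}

// mul over [2048, 8192]: 16,777,216 elements.
static_assert(workgroups_for(2048ull * 8192ull) == 262144, "mul count");
// permute of [2048, 2048]: 4,194,304 elements.
static_assert(workgroups_for(2048ull * 2048ull) == 65536, "permute count");
// The permute case is exactly one workgroup over the per-dim cap.
static_assert(workgroups_for(2048ull * 2048ull) == kPerDimCap + 1, "cap + 1");
```

The permute case shows why the cap matters even for modest tensors: a square `[2048, 2048]` input exceeds the 1D limit by a single workgroup.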
…scripten Dawn

Pull Request resolved: #20652

**Key the `timedWaitAny` instance setup to the actual Dawn API instead of `__EMSCRIPTEN__`, so native-rig Dawn and emscripten/emdawnwebgpu use the modern `requiredFeatures` path and only the vendored Dawn uses the legacy `capabilities.*` path.**

The instance-descriptor setup was guarded by `#if defined(__EMSCRIPTEN__)`, which routed emscripten (emdawnwebgpu, emcc 4.0.19+) through the legacy `capabilities.*` API that no longer exists there. The guard now keys off the API actually present.

Key changes:
- `WebGPUDevice.cpp`: `#if defined(__EMSCRIPTEN__)` → `#if defined(WEBGPU_DAWN_INSTANCE_CAPABILITIES)`. The legacy `instance_desc.capabilities.*` path is taken only by the buck-vendored Dawn (which defines the macro); native cmake Dawn and emscripten leave it undefined and take the `requiredFeatures` / `WGPUInstanceFeatureName_TimedWaitAny` path.

Co-authored-with: Claude Code.

ghstack-source-id: 399812934
@exported-using-ghexport

Differential Revision: [D110149678](https://our.internmc.facebook.com/intern/diff/D110149678/)
Pull Request resolved: #20706

Convert the remaining hand-rolled `int main()` + `printf`/`bool ok` native tests to GTest so the whole `backends/webgpu/test/` suite is uniform, filterable via `--gtest_filter`, and self-reporting (extends the GTest conversion already applied to `test_dynamic_shape`). The five converted files are a harness-only change: every test case, tensor shape, tolerance, artifact filename, and skip condition is preserved 1:1, and only the pass/fail reporting mechanism changes. This diff additionally wires the already-GTest `webgpu_dynamic_shape_test` into the CI runner so the dynamic-shape suite actually executes.

Key changes:
- `test/test_webgpu_native.cpp`, `test/native/test_dispatch_order.cpp`, `test/native/test_index.cpp`, `test/native/test_scratch_buffer.cpp`, `test/native/test_update_cache.cpp`: `main`+`printf`/`bool ok` accumulator → `TEST()` cases using `EXPECT_*`/`ASSERT_*`; each keeps a custom `main()` that brings up the WebGPU device once and then calls `RUN_ALL_TESTS()` (device-absent still SKIPs by returning 0). `test_index`/`test_webgpu_native` use inclusive `EXPECT_LE(err, tol)` to match the original `err > tol` fail gate exactly.
- `CMakeLists.txt`: move every native-test target into the `if(TARGET GTest::gtest)` block, linking `GTest::gtest`.
- `scripts/test_webgpu_native_ci.sh`: add `-DEXECUTORCH_BUILD_TESTS=ON` to the native-test configure so the now-gtest-gated targets are defined, and wire `webgpu_dynamic_shape_test` into the runner: export its `.pte`s + goldens via `export_dynamic_shape_cases`, add it to the built/run target list behind the same `--target help` probe, and run it guarded (mirroring the `index` test).
- `test/test_build_webgpu.sh`: add `-DEXECUTORCH_BUILD_TESTS=ON` so the local build script (which builds the now-gtest-gated targets unconditionally) still finds them.

ghstack-source-id: 399812941
@exported-using-ghexport

Differential Revision: [D110536636](https://our.internmc.facebook.com/intern/diff/D110536636/)
This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #20574 by @JulianCloudNTH
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/66/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/66/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/65/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/66/orig
@diff-train-skip-merge