
Rebase release/1.3 onto 0cafcb20 #19456

Merged

JacobSzwejbka merged 58 commits into release/1.3 from jakeszwe/release-1.3-rebase-0cafcb20 on May 11, 2026

Conversation


JacobSzwejbka (Contributor) commented May 11, 2026

Updates release/1.3 to 0cafcb2. The current release/1.3 head is an ancestor of that commit, so this is a fast-forward update by 58 commits.

perheld and others added 30 commits May 5, 2026 18:17
Several Arm operator tests were creating random inputs at module import
time. The Arm test seed is applied later by an autouse pytest fixture,
so those tensors were not actually controlled by ARM_TEST_SEED.

That made tests nondeterministic across fresh pytest processes and could
expose different quantization behavior from run to run. Generate the
affected inputs lazily inside each test case so the existing seed
fixture makes them reproducible and ARM_TEST_SEED=RANDOM can rerandomize
the intended data.
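For illustration, a minimal sketch of the before/after pattern (test name, tensor shape, and assertion are hypothetical stand-ins):

```python
import torch

# Before: the input was drawn at module import time, before the autouse
# seed fixture ran, so ARM_TEST_SEED had no effect on it:
# test_input = torch.randn(1, 3, 32, 32)

def test_some_arm_op():
    # After: draw the input inside the test body, after the autouse
    # fixture has seeded the RNG; reruns and ARM_TEST_SEED=RANDOM now
    # behave as intended.
    test_input = torch.randn(1, 3, 32, 32)
    assert test_input.shape == (1, 3, 32, 32)  # stand-in for the real checks
```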

Signed-off-by: Per Held <[email protected]>
Change-Id: Ic4414da5e84b7fb19275e04399634289b10a0a19
### Summary
pytorch/test-infra's setup-miniconda action pre-installs cmake=3.22
ninja=1.10 pkg-config=0.29 wheel=0.37 from the anaconda defaults channel
into the conda env it sets up for macOS jobs. Our own setup-conda.sh
then installs cmake=3.31.2 and friends from conda-forge into the same
env, and reconciling the two channels' transitive deps (e.g. zlib=1.2.13
vs libzlib>=1.3.1, rhash=1.4.3 vs rhash>=1.4.5) has been intermittently
failing the libmamba solver.

The companion test-infra PR exposes a default-packages input on
macos_job.yml. Pass an empty string from every macos_job.yml callsite in
this repo so that setup-miniconda no longer pollutes the conda env it
creates with defaults-channel packages we don't use, and our subsequent
conda-forge install resolves cleanly.

This change has no effect until the [test-infra
PR](pytorch/test-infra#8033) lands. Once it's
merged on test-infra@main, the workflows here pick it up automatically
because executorch tracks @main for all test-infra references.

Authored with Claude Code.

### Test plan
CI
## Summary

Adds an MLX delegate handler for `aten.roll`, mapping `torch.roll` onto
`mlx::core::roll` via a new `RollNode` in the schema. Replaces the
default decomposition (`index_select + arange + cat`) with a single
native kernel — needed by Swin Transformer's shift-window attention.
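As a sanity check of the mapping, a hedged parity sketch (this assumes the `mx.roll` Python binding for `mlx::core::roll`; the shape and shifts are illustrative):

```python
import mlx.core as mx
import numpy as np
import torch

x = torch.arange(12).reshape(3, 4)
ref = torch.roll(x, shifts=(1, -2), dims=(0, 1))          # what aten.roll computes
out = mx.roll(mx.array(x.numpy()), (1, -2), axis=(0, 1))  # the native kernel
assert np.array_equal(ref.numpy(), np.array(out))
```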

Flat roll (`dims=[]`) raises `NotImplementedError` for now; no known
consumer needs it yet.

Generated files (`MLXLoader.*`, `schema_generated.h`,
`mlx_graph_schema.py`, `_generated_serializers.py`,
`_generated_inspector.py`, `_generated/`) are regenerated from
`schema.fbs` by `backends/mlx/CMakeLists.txt` at build time and are
deliberately not committed.

Fixes #18919.

## Test plan

- `python backends/mlx/serialization/generate.py` — regenerates cleanly
with `RollNode` in all expected outputs.
- `lintrunner --skip MYPY --paths-cmd 'git diff --name-only
upstream/main'` — no issues.
- End-to-end `run_all_tests -k roll` not run locally (no executorch
build on this machine); relying on CI. Happy to push fixes if it finds
anything.

cc @metascroy

Co-authored-by: Ishan Godawatta <[email protected]>
The number of think tokens is a little flaky, and I don't think it's
super material for now, so relaxing the check.
### Summary
The QNN backend test workflows have been flaking because the download
from softwarecenter.qualcomm.com aborts mid-stream with `curl: (92)
HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR`, or returns a
short error body that curl treats as a successful 200 — letting unzip
choke on the not-a-zip with exit 9. The previous `curl --retry 3` only
covered a narrow set of transient errors and never validated the
archive, so neither failure was retried. Wrap the download in a
five-attempt loop using `curl --fail --retry-all-errors` and validate
each attempt with `unzip -t` before proceeding, with the on-disk file
size logged on failure so a tiny error body is unambiguous in the log.

Authored with Claude Code.

### Test plan
CI
…matmul (#19300)

Replace torch.allclose(atol/rtol) with an SNR (signal-to-noise ratio)
assertion across all int4_matmul / int4_matvec / dequant-vs-fused tests.

Why:
- test_prefill_short was flaking on CI (A10G) with max_abs_err=1.0000.
Root cause: bf16 GEMM with K=2048 reduction produces output magnitudes
up to ~200; at that scale, the bf16 ULP gap is 0.5-1.0. Triton fused
kernel and cuBLAS reduce in different orders (and Triton autotune picks
different tile configs on different hardware), so 1-ULP element-wise
differences are unavoidable. atol/rtol false-fails on these outliers;
SNR averages them out.
- atol/rtol thresholds also depend on size: a value tuned for K=2048 is
too loose for K=64 and too tight for K=4096. SNR is size-invariant
(||signal|| and ||noise|| both scale with sqrt(N) and sqrt(K), canceling
in the ratio).

What:
- Add _assert_snr(test_case, actual, expected, label) helper that
asserts 20*log10(||expected|| / ||actual-expected||) >= 50 dB.
- Replace 4 call sites: TestInt4Matmul, TestInt4Matvec (x2),
TestDequantThenMatmul.
- 50 dB ~ 0.3% RMS error: well below observed clean noise (80-90 dB) and
well above any real functional bug (<20 dB SNR for wrong stride /
flipped nibble / off-by-one group_idx / missing mask).
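
A minimal sketch of the `_assert_snr` helper described above, matching the stated formula and threshold (implementation details assumed):

```python
import math
import torch

SNR_THRESHOLD_DB = 50.0  # ~0.3% RMS error

def _assert_snr(test_case, actual, expected, label):
    # SNR in dB: 20 * log10(||expected|| / ||actual - expected||).
    signal = expected.float().norm().item()
    noise = (actual.float() - expected.float()).norm().item()
    if noise == 0.0:
        return  # bit-identical outputs: infinite SNR
    snr_db = 20.0 * math.log10(signal / noise)
    test_case.assertGreaterEqual(
        snr_db,
        SNR_THRESHOLD_DB,
        f"{label}: SNR {snr_db:.1f} dB is below {SNR_THRESHOLD_DB} dB",
    )
```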

Test plan:
  python -m pytest backends/cuda/tests/test_int4_matmul.py -v
  -> 35/35 passed
Summary:
D102325968 added an import of `executorch.backends.test.program_builder`
in `test_fuse_constant_ops_pass.py` but only updated the xplat
`targets.bzl` with the corresponding Buck dependency. The fbcode
`targets.bzl` was missing this dependency, causing test listing failures
for `fbcode//executorch/backends/arm/test:fuse_constant_ops_pass`.

Add `//executorch/backends/test:program_builder` to the deps list.

Differential Revision: D103456950




cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell
Disabling animations sometimes can't happen right after a slow boot.
Move the animation-disabling step into the script, where we can retry
it. Also did some general cleanup of other possible sources of flakiness.

Authored with codex
We really should retry the web requests in optimum instead of just
retrying the whole export here.
…ownloads (#19309)

### Summary

The Test ARM Backend workflow has been intermittently failing with
`curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR`
during the FVP corstone download from developer.arm.com's CDN. The
toolchain download in the same setup uses the same bare-curl pattern and
fails the same way when the CDN flakes. In both cases the previous flow
was a single `curl --output ...` followed by a fatal `verify_md5`, so
neither a transient HTTP/2 reset nor a short error body that curl treats
as a successful 200 was retried.

Factor out a `download_with_retry` helper in utils.sh that wraps the
download in a five-attempt outer loop using
`curl --fail --retry-all-errors` and validates each attempt against the
published MD5 before proceeding, with the on-disk file size logged on
failure for diagnosis. Switch verify_md5's mismatch path from `exit 2`
to `return 2` so the helper can treat a bad checksum as a retryable
failure; existing callers (`verify_md5 ... || exit 1`) keep the same
fatal-on-mismatch behavior since the function still returns non-zero on
a bad checksum.

Use the helper from both fvp_utils.sh and toolchain_utils.sh in place of
the bare `curl` + `verify_md5` pair.
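
For illustration, a Python rendering of the helper's control flow (the actual helper is a shell function in utils.sh; the curl flags match the description above, the function shape is assumed):

```python
import hashlib
import os
import subprocess
import sys

def download_with_retry(url: str, dest: str, expected_md5: str, attempts: int = 5) -> None:
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(
            ["curl", "--fail", "--retry-all-errors", "--retry", "3",
             "--location", "--output", dest, url]
        )
        if proc.returncode == 0:
            with open(dest, "rb") as f:
                if hashlib.md5(f.read()).hexdigest() == expected_md5:
                    return  # download verified against the published MD5
        # A bad checksum is retryable here, not fatal (cf. verify_md5's return 2).
        size = os.path.getsize(dest) if os.path.exists(dest) else 0
        print(f"attempt {attempt}/{attempts} failed (on-disk size: {size} bytes)",
              file=sys.stderr)
    raise RuntimeError(f"download failed after {attempts} attempts: {url}")
```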

Authored with Claude Code.

### Test plan
CI

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell
Differential Revision: D103690468

Pull Request resolved: #19286
Seeing small numerical flakiness causing diverging output. The output
is roughly equivalent, though. The current theory is that different CPU
architectures trigger different XNNPACK kernels, causing minor
differences in output.
With these giant 500+ op tests we often get a flaky 1-in-500 failure.
Just adding retries to make this a little less noisy.

Failure is something like FAILED
backends/cadence/aot/tests/test_replace_ops_passes.py::TestReplaceOpsPasses::test_replace_conv2d_with_linear
- AssertionError: Pass validation failed for pass
ReplaceTrivialConvWithLinear. Output tensor 0 differs by max
1.525879e-05. Expected rtol=2e-05, atol=1e-06. Original output:
tensor([[[[ 6.5604]],
)

### Summary
a79521b ("Add LongRoPE support and fp64 RoPE precompute for Phi-3 /
Phi-4 family") unconditionally moved hf_precompute_freqs_cis to fp64
cos/sin precompute with a final cast to fp32. That works for the Phi-4
device validation that motivated the commit, but it broke
test_static_attention.py::test_within_transformer on the Linux unittest
runners (pull, pull-editable, trunk-release have been 100% red since the
commit landed).

The test compares mha_transformer (built with use_hf_rope=False, taking
the pure-fp32 precompute_freqs_cis path) against static_transformer
(built with use_hf_rope=True, taking hf_precompute_freqs_cis) at
rtol=1e-3, with shared weights. Before a79521b, both paths produced
bit-identical fp32 cos/sin tables (verified empirically: 0/192 entries
differed). After the commit, HF cos/sin diverge from non-HF by ~1 ULP in
38/192 entries; that drift compounds across 4 transformer layers and
tips past rtol=1e-3 on the CI runners (Python 3.10, source-built torch).
Local Python 3.12 stayed just barely within tolerance, which is why
review missed it.

Gate the fp64 precompute on the property the original commit was
actually protecting: a non-trivial cos/sin scale being applied. That is
either LongRoPE active (Phi-3 / Phi-4 set short_factor and long_factor
via config) or an explicit attention_factor != 1.0 passed through. Both
cases preserve fp64; vanilla HF RoPE (Llama family, the test config)
goes back to fp32 throughout and re-establishes bit-identical agreement
with the non-HF path.
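
A sketch of the gating condition described above (helper name and config access are hypothetical):

```python
def _use_fp64_precompute(config, attention_factor: float = 1.0) -> bool:
    # LongRoPE (Phi-3 / Phi-4): short_factor / long_factor set via config.
    longrope_active = (
        getattr(config, "short_factor", None) is not None
        or getattr(config, "long_factor", None) is not None
    )
    # fp64 only when a non-trivial cos/sin scale is applied; vanilla HF
    # RoPE (Llama family) stays fp32 and matches the non-HF path bit-for-bit.
    return longrope_active or attention_factor != 1.0
```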

Authored with Claude Code.

### Test plan
CI
…es (#19314)

Summary:
The 6 type-trait checks below were defined as TEST_F(CUDAGuardTest, ...)
and TEST_F(CUDAStreamGuardTest, ...). Both fixtures' SetUp() calls
GTEST_SKIP() when no CUDA device is available, so on every test host
without an attached GPU these tests skip instead of running:

  CUDAGuardTest.CopyConstructorDeleted
  CUDAGuardTest.CopyAssignmentDeleted
  CUDAGuardTest.MoveAssignmentDeleted
  CUDAStreamGuardTest.CopyConstructorDeleted
  CUDAStreamGuardTest.CopyAssignmentDeleted
  CUDAStreamGuardTest.MoveAssignmentDeleted

Because they never produced a successful run (Passes: 0 across 173 / 23
runs, all skips), TestX auto-disabled them and they show up as DISABLED
on the executorch dashboard.

These are pure compile-time static_assert checks. They do not need a
CUDA device or any runtime state — if the file compiles, they pass.
Move them into a separate non-fixture test suite
(CUDAGuardCompileTimeTest /
CUDAStreamGuardCompileTimeTest) so they run unconditionally.

The remaining 15 fixture-based tests still need a real CUDA device and
will be addressed separately (fixing the gpu-remote-execution platform
deps so cudaGetDeviceCount returns a non-zero value).

Reviewed By: Gasoonjia

Differential Revision: D103937761
Summary:
`xplat/executorch/extension/training/examples/XOR/BUCK` invokes
`define_common_targets()` for both fbcode (`fbcode_target`) and xplat
(`non_fbcode_target`). The python targets in this example
(`model`, `export_model_lib`, `export_model`) depend on
`//caffe2:torch` and `//executorch/exir:lib`, neither of which is
defined as an xplat target — `xplat/executorch/exir/BUCK` only
declares the `:lib` target via `fbcode_target(...)`. As a result the
xplat configuration of
`fbsource//xplat/executorch/extension/training/examples/XOR:export_model`
fails analysis with:

  Unknown target `lib` from package `fbsource//xplat/executorch/exir`.
Did you mean one of the 0 targets in
fbsource//xplat/executorch/exir:BUCK?

This produced 218/218 BUILD_RULE failures on the
`fbsource//xplat/executorch/extension/training/examples/XOR:export_model`
target, with no successful run on record (linked to T168807700).

Wrap the three python rules with `if not is_xplat():` so they only
register when called from fbcode, matching the established precedent
in `xplat/executorch/kernels/portable/test/targets.bzl`. The
`train_xor` C++ binary continues to be defined for both cells since
its dependencies are xplat-compatible.
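
Schematically, the guard looks like this (a Starlark sketch; the helper names are hypothetical and rule bodies are elided):

```python
def define_common_targets():
    if not is_xplat():
        # fbcode-only: model / export_model_lib / export_model depend on
        # //caffe2:torch and //executorch/exir:lib, which have no xplat targets.
        _define_python_example_targets()
    # train_xor's deps are xplat-compatible, so it stays in both cells.
    _define_train_xor_binary()
```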

Differential Revision: D103951555
Similar to the cadence retries: the numerics tests are likely a little
too strict, and a little flaky.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell
The job is a little flaky because the runner sometimes doesn't have pip.
This adds it as an explicit dependency.
Differential Revision: D103935830

Pull Request resolved: #19318
Summary:
Forward fix on top of D103817917 (`Arm backend: Cleanup dim-order
and permute handling` — DiffTrain import of upstream PR #19278).

D103817917 reverted an internally-applied test split: the bool
permute case (`rank2_bool`) is U55-rejected and was already moved
out of the U55-delegating test on master into a separate
`test_data_suite_u55_reject` set with a dedicated
`test_permute_u55_INT_not_delegated` test using
`OpNotSupportedPipeline`. Upstream PR #19278 doesn't include that
split, so the DiffTrain import wipes it out and brings back the
combined `test_data_suite` + special-cased bool branch in
`test_permute_u55_INT`. That regresses CI:
`test_permute_u55_INT[rank2_bool]` is reported as a critical
LAND_BLOCKING failure on D103817917.

Re-apply the split so trunk returns to the clean form after
D103817917 lands:
- Add `OpNotSupportedPipeline` import.
- Move `rank2_bool` out of `test_data_suite` into a new
  `test_data_suite_u55_reject` dict.
- Drop the dead `if test_data[0].dtype == torch.bool: ...`
  workaround block from `test_permute_u55_INT` (no bool flows
  through this test anymore).
- Add `test_permute_u55_INT_not_delegated` parametrized over
  `test_data_suite_u55_reject`, exercising
  `OpNotSupportedPipeline` with `u55_subset=True`.

The `test_data_suite_u55` dict introduced by D103817917 (large
permutes that only U55 needs to exercise) is preserved unchanged.

Differential Revision: D103963260




cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell
Several macOS jobs have been timing out badly since this change,
blocking viable/strict. We could just increase the time limit, but until
I actually get a viable/strict bump I don't want to be stuck waiting 1-2
hours for these jobs to run, so I'm reverting to get back to the
30-40 minute runtimes.
Summary:

Fixes 3 `-Werror` diagnostics that broke the qualcomm llama runner build
on
`cfg:android-arm64-clang19-no-san` and disabled the following test infra
targets:

- `xplat/executorch/examples/qualcomm/oss_scripts/llama:runner_lib`
- `xplat/executorch/examples/qualcomm/oss_scripts/llama:runner_lib_static`
- `xplat/executorch/examples/qualcomm/oss_scripts/llama:qnn_llama_runner`
- `xplat/executorch/examples/qualcomm/oss_scripts/llama:qnn_llama_runner_static`

Three diagnostics fixed:

1. `-Wreorder-ctor` in `runner.cpp`: `attention_sink_rope_module_` is
declared as the 2nd field of `Runner<T>` (right after `module_`) but the
constructor initializer list appended it last, after `tokenizer_`. Moved
it to the correct position in the init list to match declaration order.
   Recent regression introduced in the attention-sink diff (#16574).

2. `-Woverloaded-virtual` in `lhd_token_generator.h` and
   `multimodal_lhd_token_generator.h`: the derived classes define a
`prepare_io(std::vector<uint64_t>, std::vector<int32_t>)` overload that
   hides the base class virtual `prepare_io(uint64_t, int64_t)`. Added a
   `using TokenGenerator<T>::prepare_io;` (and equivalent for the
multimodal hierarchy) declaration so the base virtual stays in scope and
the warning is silenced without changing behavior. Latent bug surfaced
   by the clang19 toolchain bump.

3. `-Wdelete-non-abstract-non-virtual-dtor` in `prompt_processor.h`:
   `PromptProcessor<T>` has virtual member functions but no virtual
   destructor, so deleting via `std::unique_ptr<PromptProcessor<T>>` in
   `Runner` was undefined behavior under strict warnings. Added
   `virtual ~PromptProcessor() = default;` mirroring the pattern already
   used in `TokenGenerator` (`token_generator.h`). Also transitively
   fixes `MultimodalPromptProcessor<T>`.

Reviewed By: rascani

Differential Revision: D103991803
Differential Revision: D104074211

Pull Request resolved: #19337
The job has been timing out since the first attempt.
Add a missing dimension check that was breaking a test.
Differential Revision: D103455766

Pull Request resolved: #19264
HuggingFace's Xet storage backend stalls mid-download on CI runners,
causing the 90-minute job timeout to fire before model weights finish
downloading. Force standard HTTP downloads instead.

(from debug logs in #19352)
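
For reference, one way to force the fallback (this assumes the `HF_HUB_DISABLE_XET` environment variable honored by huggingface_hub; the repo id is a placeholder):

```python
import os

# Must be set before huggingface_hub is imported / first used.
os.environ["HF_HUB_DISABLE_XET"] = "1"  # assumption: opts out of Xet-backed downloads

from huggingface_hub import snapshot_download

snapshot_download("org/model-name")  # placeholder repo id
```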
Summary:
The 13 XCTestCase methods in
`xplat/executorch/extension/llm/apple:ExecuTorchLLMTests`
(testLLaMA, testPhi4, testGemma, testLLaVA, testVoxtral and their
reset variants) regularly hit the 1800-second per-test ceiling
enforced by `fbobjc/Tools/xctest_runner` for the `long_running`
label. LLM inference on iOS-sim CPU (1B-class models,
128-768 token sequences, each test calls `generate()` twice)
routinely exceeds 30 minutes per test method, producing spurious
"Test timed out after 1800 seconds" flakes on the test-issues
dashboard for owner `ai_infra_mobile_platform`.

Per the runner formula
`TEST_CASE_TIMEOUT(60s) * label_multiplier * 3`:

| label          | multiplier | per-XCTestCase budget |
|----------------|-----------:|----------------------:|
| long_running   |        x10 |                 1800s |
| glacial (here) |        x30 |                 5400s |

Switching to `glacial` (the highest tier supported by the runner)
gives each test 90 minutes. Adding
`test_test_rule_timeout_ms = 14400000` sets the bundle-level
wall-clock budget to 4h, which is comfortable headroom for ~5
testcases at 90 min each plus xctest setup/teardown.
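
A quick check of the runner formula against the table:

```python
TEST_CASE_TIMEOUT_S = 60
LABEL_MULTIPLIER = {"long_running": 10, "glacial": 30}

# TEST_CASE_TIMEOUT(60s) * label_multiplier * 3
budgets = {label: TEST_CASE_TIMEOUT_S * mult * 3
           for label, mult in LABEL_MULTIPLIER.items()}
print(budgets)  # {'long_running': 1800, 'glacial': 5400}
```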

Note: this diff is unrelated to T269848646. T269848646 tracks a
separate cluster of 446 iOS-sim test-run *cancellations*
(`duration: 0.00`, "test execution was cancelled because the test
run was cancelled") that is owned by testinfra and is not
addressed here.

Reviewed By: shoumikhin

Differential Revision: D104147313
rascani and others added 11 commits May 8, 2026 17:59
Differential Revision: D104297130

Pull Request resolved: #19378
### Summary
It should work now that google/pthreadpool#92 is
merged.
Fixes #11723

## Summary

`torch.split` fails with `RuntimeError: Found a custom (non-ATen)
operator whose output has alias annotations` when used with
`to_edge_transform_and_lower` and a partitioner that requests op
preservation.

**Root cause**: `_remove_invalid_ops_for_not_decompose` relies on
`torchgen`'s `aliased_return_names()` to detect ops with aliased
returns. However, for ops returning lists of aliased tensors (e.g.,
`split.Tensor` returns `Tensor(a)[]`), `aliased_return_names()` returns
`[None]`, failing to detect the alias annotation. This lets
`split.Tensor` pass through into the `EDGE_DO_NOT_DECOMP` namespace,
where functionalization fails.

**Fix**: Add a fallback check using `op._schema.returns` directly, which
correctly reports `alias_info` on list return types. This also fixes the
same latent issue for `chunk.default` and `tensor_split.sections`.
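
A sketch of the fallback check (helper name hypothetical; per the description above, `op._schema.returns` exposes `alias_info` even for list return types):

```python
import torch

def _returns_alias(op: torch._ops.OpOverload) -> bool:
    # torchgen's aliased_return_names() yields [None] for list returns
    # like Tensor(a)[]; the schema itself still carries the annotation.
    return any(ret.alias_info is not None for ret in op._schema.returns)

# split.Tensor returns Tensor(a)[] and should now be filtered:
assert _returns_alias(torch.ops.aten.split.Tensor)
```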

## Test plan

- Added `test_remove_invalid_ops_filters_aliased_list_returns`
regression test
- Run: `pytest
exir/tests/test_passes.py::TestPasses::test_remove_invalid_ops_filters_aliased_list_returns
-xvs`
- Verified existing split-related test still passes:
`test_to_out_variant_singleon_tensor_list`
- Verified existing broken ops test still passes:
`test_compile_fix_broken_ops`

<details>
<summary>Before fix</summary>

```
==================== BEFORE FIX ====================
RESULT: FAILED
RuntimeError: Found a custom (non-ATen) operator whose output has alias annotations: EDGE_DO_NOT_DECOMP::split.Tensor(Tensor(a -> *) self, SymInt split_size, int dim=0) -> Tensor(a)[]. We only support functionalizing operators whose outputs do not have alias annotations (e.g. 'Tensor(a)' is a Tensor with an alias annotation whereas 'Tensor' is a Tensor without. The '(a)' is the alias annotation). The alias annotation specifies that the output Tensor shares storage with an input that has the same annotation. Please check if (1) the output needs to be an output (if not, don't return it), (2) if the output doesn't share storage with any inputs, then delete the alias annotation. (3) if the output indeed shares storage with an input, then add a .clone() before returning it to prevent storage sharing and then delete the alias annotation. Otherwise, please file an issue on GitHub.

While executing %split : [num_users=3] = call_function[target=torch.ops.EDGE_DO_NOT_DECOMP.split.Tensor](args = (%x, 2), kwargs = {})
Original traceback:
None
Use tlparse to see full graph. (https://github.com/pytorch/tlparse?tab=readme-ov-file#tlparse-parse-structured-pt2-logs)
```

</details>

<details>
<summary>After fix</summary>

```
==================== AFTER FIX ====================
WARNING:root:Op aten.split.Tensor was requested for preservation by partitioner.  This request is ignored because it aliases output.

Test 1: to_edge (no partitioner)
RESULT: SUCCESS - outputs match

Test 2: to_edge_transform_and_lower with split.Tensor preservation
RESULT: SUCCESS - split.Tensor correctly filtered from EDGE_DO_NOT_DECOMP
         (AttributeError from dummy partitioner partition(), not from split bug)

Test 3: _remove_invalid_ops_for_not_decompose filter check
  aten::split.Tensor                            -> FILTERED (correct)
  aten::chunk                                   -> FILTERED (correct)
  aten::tensor_split.sections                   -> FILTERED (correct)
```

</details>

<details>
<summary>Unit test output</summary>

```
$ pytest exir/tests/test_passes.py::TestPasses::test_remove_invalid_ops_filters_aliased_list_returns -xvs

============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-8.4.2
collected 1 item

exir/tests/test_passes.py::TestPasses::test_remove_invalid_ops_filters_aliased_list_returns PASSED

============================== 1 passed in 6.83s ===============================

$ pytest exir/tests/test_passes.py::TestPasses::test_to_out_variant_singleon_tensor_list -xvs
PASSED

$ pytest exir/tests/test_passes.py::TestPasses::test_compile_fix_broken_ops -xvs
PASSED
```

</details>

This PR was authored with the assistance of Claude.

---------

Signed-off-by: Lidang-Jiang <[email protected]>
Differential Revision: D103667823

Pull Request resolved: #19319
<< DO NOT EDIT BELOW THIS LINE >>
@diff-train-skip-merge
Differential Revision: D104429023

Pull Request resolved: #19400
Summary:
Three test methods in

`fbcode/executorch/kernels/portable/test/op_upsample_bilinear2d_aa_test.py`
have been auto-disabled as flaky on the test-issues dashboard
(owner ai_infra_mobile_platform):

- test_upsample_bilinear2d_aa_aten_parity_u8
- test_upsample_bilinear2d_aa_aggressive_downsampling
- test_upsample_bilinear2d_aa_align_corners_downsampling

Root cause: each test builds its input via `torch.randint(...)` or
`torch.randn(...)` with no seed pinned, so each run sees a different
sample. The configured `atol` was tight enough that on some draws the
ATen-vs-ExecuTorch divergence (driven by separable-vs-direct
anti-aliased interpolation differences) crossed the threshold and the
test flipped to FAIL. The kernel implementations themselves are not
changing across runs.

Fix:

1. Add `setUp(self): torch.manual_seed(0)` so every run sees the same
   input tensor and the same divergence, eliminating the run-to-run
   FAIL/PASS oscillation.
2. Bump two atol thresholds to cover the worst-case observed
   divergence with the now-pinned input:
   - u8 parity: 3.5 -> 5 (observed max abs error 4 / 255)
   - aggressive 4x downsampling: 0.4 -> 1.0 (observed max abs error
     ~0.59 for N(0,1) input)
3. The pre-existing `atol=0.25` on align_corners_downsampling is left
   unchanged - with seed 0 it now passes consistently.

The relaxed tolerances are still well below any change that would
indicate an actual kernel regression; the comprehensive C++ test
suite in `op_upsample_bilinear2d_aa_test.cpp` still validates the
kernel under tighter constraints.
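
Schematically (class and test names abbreviated; shape and assertion are illustrative):

```python
import unittest
import torch

class UpsampleBilinear2dAATest(unittest.TestCase):
    def setUp(self):
        # Pin the RNG so every run draws the same input and sees the
        # same ATen-vs-ExecuTorch divergence.
        torch.manual_seed(0)

    def test_aten_parity_u8(self):
        x = torch.randint(0, 256, (1, 3, 16, 16), dtype=torch.uint8)
        self.assertEqual(x.dtype, torch.uint8)  # stand-in for the real parity check
```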

Reviewed By: rascani

Differential Revision: D104150928
Fixes #19348

### Summary

- Add `extension/module` and `extension/tensor` headers to the Doxygen
inputs used by the runtime API reference.
- Expand the module namespace macros so Breathe can resolve the
documented extension classes with stable namespace names.
- Add runtime API reference sections for `Module`, `BundledModule`, and
the tensor extension namespace.

### Test plan

- `git diff --check origin/main..HEAD`
- `python -m py_compile docs/source/conf.py`
- `cd docs && doxygen source/Doxyfile`
- Isolated Breathe/Sphinx build of
`executorch-runtime-api-reference.rst` against the generated Doxygen XML
- Verified the rendered runtime API page contains the new Module
Extension and Tensor Extension entries

cc @mergennachin @AlannaBurke
## Context
The original K-loop did `tl.max(tl.abs(a))` + INT8 cast on every tile
(16 tiles × 16 rows = 256 reductions per program). Hoisting eliminates
this redundant work and halves activation HBM bandwidth in the GEMM
(bf16 → int8).

## Improvement
Pre-quantize activations to INT8 once into a dedicated buffer (with
per-row-per-tile FP32 scales) **before** the W4A8 batched MoE GEMMs,
instead of re-quantizing inside the K-loop on every tile.
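
In plain PyTorch terms, the hoisted quantization looks roughly like this (a sketch of the idea, not the Triton kernel; simplified to per-row scales where the kernel keeps per-row-per-tile scales):

```python
import torch

def prequantize_activations_int8(a: torch.Tensor):
    # One pass over the bf16 activations, done once before the batched
    # MoE GEMMs instead of tl.max(tl.abs(a)) + cast inside every K-tile.
    amax = a.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8)
    scales = (amax / 127.0).to(torch.float32)          # per-row fp32 scales
    q = torch.clamp((a / scales).round(), -127, 127).to(torch.int8)
    return q, scales

a = torch.randn(128, 2048, dtype=torch.bfloat16)
q, scales = prequantize_activations_int8(a)
deq = q.to(torch.float32) * scales                     # ~a, within int8 noise
```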

## Perf (1600-token prefill)

| Metric | Baseline (`gh/digantdesai/53/head`) | Optimized | Speedup |
|---|---|---|---|
| Prefill | 5727 tok/s (5296–5963) | **6171** tok/s (5941–6313) | **1.08×** |

## Correctness
7/7 microbenchmark configs (incl. qwen3.5-like M=128, K=2048, gs=128)
pass with relative diff <1.5% vs BF16 reference — within INT8
quantization noise.


cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell
Add `include_torch` parameter (default False) to
`define_custom_op_test_binary()`. None of the custom op test binaries
directly include torch/ATen/c10 headers, so libtorch was unnecessary
baggage. Dropping it reduces the q4gsw_linear_adreno binary from ~1 GB
to 74 MB.

Differential Revision: [D104456804](https://our.internmc.facebook.com/intern/diff/D104456804/)

ghstack-source-id: 379498992
Pull Request resolved: #19402
Adds infrastructure for querying GPU subgroup capabilities and pinning required subgroup size at pipeline creation time, sourced from the existing `SUBGROUP_SIZE` yaml template parameter. This is the foundation for writing subgroup-using shaders (e.g. cooperative GEMV variants) that remain portable across GPUs with different subgroup widths (Adreno=64, Mali=16, NVIDIA=32, etc.).

`PhysicalDevice` now chains `VkPhysicalDeviceSubgroupProperties` and `VkPhysicalDeviceSubgroupSizeControlProperties` into `vkGetPhysicalDeviceProperties2`, plus `VkPhysicalDeviceSubgroupSizeControlFeatures` into `vkGetPhysicalDeviceFeatures2`. The `Adapter` exposes accessors for subgroup_size, supported subgroup ops/stages, [min,max] subgroup size range, and whether the driver supports per-pipeline required subgroup size for the COMPUTE stage. `VK_EXT_subgroup_size_control` is added to the requested extension list and the size-control features are chained into device-create pNext when supported.

`ComputePipeline::Descriptor` gains a `required_subgroup_size` field that, when nonzero, chains `VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT` into pipeline creation (both the on-demand `retrieve` path and the batch `create_pipelines` path). The pipeline cache key includes the field so pipelines compiled for different subgroup widths cache independently. `ShaderInfo` carries the same field so it can be plumbed from shader yaml through to the pipeline descriptor.

The existing `SUBGROUP_SIZE` yaml template parameter is now the single source of truth: `gen_vulkan_spv.py` substitutes it into GLSL as before AND emits it as `ShaderInfo::required_subgroup_size`. At dispatch, `vkapi::resolve_required_subgroup_size` validates the value is within the adapter's `[min, max]` range and throws `ShaderNotSupportedError` if the extension is unsupported or the value is out of range, surfacing a clear failure rather than silently miscompiling a shader whose algorithm depends on the pinned subgroup width.

No shader yamls are modified by this change; subsequent commits opt their shaders into the pinning by declaring `SUBGROUP_SIZE` in their yamls.

Differential Revision: [D104456803](https://our.internmc.facebook.com/intern/diff/D104456803/)

ghstack-source-id: 379498994
Pull Request resolved: #19403

pytorch-bot Bot commented May 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19456

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 114 Pending, 3 Unrelated Failures

As of commit 0cafcb2 with merge base 0cafcb2:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label May 11, 2026
github-actions bot added the ciflow/trunk and module: arm labels May 11, 2026
JacobSzwejbka merged commit 7539fa1 into release/1.3 on May 11, 2026 (1026 of 1177 checks passed)
JacobSzwejbka deleted the jakeszwe/release-1.3-rebase-0cafcb20 branch May 11, 2026 16:43
JacobSzwejbka temporarily deployed to upload-benchmark-results May 13, 2026 03:31 with GitHub Actions