
Rebase release/1.3 onto 0cafcb20 #19456

Merged

JacobSzwejbka merged 58 commits into release/1.3 from jakeszwe/release-1.3-rebase-0cafcb20 on May 11, 2026

Conversation


JacobSzwejbka (Contributor) commented May 11, 2026

Updates release/1.3 to 0cafcb2. The current release/1.3 head is an ancestor of that commit, so this is a fast-forward update by 58 commits.

perheld and others added 30 commits May 5, 2026 18:17
Several Arm operator tests were creating random inputs at module import
time. The Arm test seed is applied later by an autouse pytest fixture,
so those tensors were not actually controlled by ARM_TEST_SEED.

That made tests nondeterministic across fresh pytest processes and could
expose different quantization behavior from run to run. Generate the
affected inputs lazily inside each test case so the existing seed
fixture makes them reproducible and ARM_TEST_SEED=RANDOM can rerandomize
the intended data.
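For illustration, a minimal sketch of the before/after pattern (test name, tensor shape, and assertion are hypothetical stand-ins):

```python
import torch

# Before: the input was drawn at module import time, before the autouse
# seed fixture ran, so ARM_TEST_SEED had no effect on it:
# test_input = torch.randn(1, 3, 32, 32)

def test_some_arm_op():
    # After: draw the input inside the test body, after the autouse
    # fixture has seeded the RNG; reruns and ARM_TEST_SEED=RANDOM now
    # behave as intended.
    test_input = torch.randn(1, 3, 32, 32)
    assert test_input.shape == (1, 3, 32, 32)  # stand-in for the real checks
```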

Signed-off-by: Per Held <[email protected]>
Change-Id: Ic4414da5e84b7fb19275e04399634289b10a0a19
### Summary
pytorch/test-infra's setup-miniconda action pre-installs cmake=3.22
ninja=1.10 pkg-config=0.29 wheel=0.37 from the anaconda defaults channel
into the conda env it sets up for macOS jobs. Our own setup-conda.sh
then installs cmake=3.31.2 and friends from conda-forge into the same
env, and reconciling the two channels' transitive deps (e.g. zlib=1.2.13
vs libzlib>=1.3.1, rhash=1.4.3 vs rhash>=1.4.5) has been intermittently
failing the libmamba solver.

The companion test-infra PR exposes a default-packages input on
macos_job.yml. Pass an empty string from every macos_job.yml callsite in
this repo so that setup-miniconda no longer pollutes the conda env it
creates with defaults-channel packages we don't use, and our subsequent
conda-forge install resolves cleanly.

This change has no effect until the [test-infra
PR](pytorch/test-infra#8033) lands. Once it's
merged on test-infra@main, the workflows here pick it up automatically
because executorch tracks @main for all test-infra references.

Authored with Claude Code.

### Test plan
CI
## Summary

Adds an MLX delegate handler for `aten.roll`, mapping `torch.roll` onto
`mlx::core::roll` via a new `RollNode` in the schema. Replaces the
default decomposition (`index_select + arange + cat`) with a single
native kernel — needed by Swin Transformer's shift-window attention.
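As a sanity check of the mapping, a hedged parity sketch (this assumes the `mx.roll` Python binding for `mlx::core::roll`; the shape and shifts are illustrative):

```python
import mlx.core as mx
import numpy as np
import torch

x = torch.arange(12).reshape(3, 4)
ref = torch.roll(x, shifts=(1, -2), dims=(0, 1))          # what aten.roll computes
out = mx.roll(mx.array(x.numpy()), (1, -2), axis=(0, 1))  # the native kernel
assert np.array_equal(ref.numpy(), np.array(out))
```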

Flat roll (`dims=[]`) raises `NotImplementedError` for now; no known
consumer needs it yet.

Generated files (`MLXLoader.*`, `schema_generated.h`,
`mlx_graph_schema.py`, `_generated_serializers.py`,
`_generated_inspector.py`, `_generated/`) are regenerated from
`schema.fbs` by `backends/mlx/CMakeLists.txt` at build time and are
deliberately not committed.

Fixes #18919.

## Test plan

- `python backends/mlx/serialization/generate.py` — regenerates cleanly
with `RollNode` in all expected outputs.
- `lintrunner --skip MYPY --paths-cmd 'git diff --name-only
upstream/main'` — no issues.
- End-to-end `run_all_tests -k roll` not run locally (no executorch
build on this machine); relying on CI. Happy to push fixes if it finds
anything.

cc @metascroy

Co-authored-by: Ishan Godawatta <[email protected]>
The number of think tokens is a little flaky, and I don't think it's
super material for now, so relaxing the check.
### Summary
The QNN backend test workflows have been flaking because the download
from softwarecenter.qualcomm.com aborts mid-stream with `curl: (92)
HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR`, or returns a
short error body that curl treats as a successful 200 — letting unzip
choke on the not-a-zip with exit 9. The previous `curl --retry 3` only
covered a narrow set of transient errors and never validated the
archive, so neither failure was retried. Wrap the download in a
five-attempt loop using `curl --fail --retry-all-errors` and validate
each attempt with `unzip -t` before proceeding, with the on-disk file
size logged on failure so a tiny error body is unambiguous in the log.

Authored with Claude Code.

### Test plan
CI
…matmul (#19300)

Replace torch.allclose(atol/rtol) with an SNR (signal-to-noise ratio)
assertion across all int4_matmul / int4_matvec / dequant-vs-fused tests.

Why:
- test_prefill_short was flaking on CI (A10G) with max_abs_err=1.0000.
Root cause: bf16 GEMM with K=2048 reduction produces output magnitudes
up to ~200; at that scale, the bf16 ULP gap is 0.5-1.0. Triton fused
kernel and cuBLAS reduce in different orders (and Triton autotune picks
different tile configs on different hardware), so 1-ULP element-wise
differences are unavoidable. atol/rtol false-fails on these outliers;
SNR averages them out.
- atol/rtol thresholds also depend on size: a value tuned for K=2048 is
too loose for K=64 and too tight for K=4096. SNR is size-invariant
(||signal|| and ||noise|| both scale with sqrt(N) and sqrt(K), canceling
in the ratio).

What:
- Add _assert_snr(test_case, actual, expected, label) helper that
asserts 20*log10(||expected|| / ||actual-expected||) >= 50 dB.
- Replace 4 call sites: TestInt4Matmul, TestInt4Matvec (x2),
TestDequantThenMatmul.
- 50 dB ~ 0.3% RMS error: well below observed clean noise (80-90 dB) and
well above any real functional bug (<20 dB SNR for wrong stride /
flipped nibble / off-by-one group_idx / missing mask).
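
A minimal sketch of the `_assert_snr` helper described above, matching the stated formula and threshold (implementation details assumed):

```python
import math
import torch

SNR_THRESHOLD_DB = 50.0  # ~0.3% RMS error

def _assert_snr(test_case, actual, expected, label):
    # SNR in dB: 20 * log10(||expected|| / ||actual - expected||).
    signal = expected.float().norm().item()
    noise = (actual.float() - expected.float()).norm().item()
    if noise == 0.0:
        return  # bit-identical outputs: infinite SNR
    snr_db = 20.0 * math.log10(signal / noise)
    test_case.assertGreaterEqual(
        snr_db,
        SNR_THRESHOLD_DB,
        f"{label}: SNR {snr_db:.1f} dB is below {SNR_THRESHOLD_DB} dB",
    )
```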

Test plan:
  python -m pytest backends/cuda/tests/test_int4_matmul.py -v
  -> 35/35 passed
Summary:
D102325968 added an import of `executorch.backends.test.program_builder`
in `test_fuse_constant_ops_pass.py` but only updated the xplat
`targets.bzl` with the corresponding Buck dependency. The fbcode
`targets.bzl` was missing this dependency, causing test listing failures
for `fbcode//executorch/backends/arm/test:fuse_constant_ops_pass`.

Add `//executorch/backends/test:program_builder` to the deps list.

Differential Revision: D103456950




cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell
Disabling animations sometimes can't happen right after a slow boot.
Move the animation-disabling step into the script, where we can retry
it. Also did some general cleanup of other possible sources of flakiness.

Authored with codex
We really should retry the web requests in optimum instead of just
retrying the whole export here.
…ownloads (#19309)

### Summary

The Test ARM Backend workflow has been intermittently failing with
`curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR`
during the FVP corstone download from developer.arm.com's CDN. The
toolchain download in the same setup uses the same bare-curl pattern and
fails the same way when the CDN flakes. In both cases the previous flow
was a single `curl --output ...` followed by a fatal `verify_md5`, so
neither a transient HTTP/2 reset nor a short error body that curl treats
as a successful 200 was retried.

Factor out a `download_with_retry` helper in utils.sh that wraps the
download in a five-attempt outer loop using
`curl --fail --retry-all-errors` and validates each attempt against the
published MD5 before proceeding, with the on-disk file size logged on
failure for diagnosis. Switch verify_md5's mismatch path from `exit 2`
to `return 2` so the helper can treat a bad checksum as a retryable
failure; existing callers (`verify_md5 ... || exit 1`) keep the same
fatal-on-mismatch behavior since the function still returns non-zero on
a bad checksum.

Use the helper from both fvp_utils.sh and toolchain_utils.sh in place of
the bare `curl` + `verify_md5` pair.
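
For illustration, a Python rendering of the helper's control flow (the actual helper is a shell function in utils.sh; the curl flags match the description above, the function shape is assumed):

```python
import hashlib
import os
import subprocess
import sys

def download_with_retry(url: str, dest: str, expected_md5: str, attempts: int = 5) -> None:
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(
            ["curl", "--fail", "--retry-all-errors", "--retry", "3",
             "--location", "--output", dest, url]
        )
        if proc.returncode == 0:
            with open(dest, "rb") as f:
                if hashlib.md5(f.read()).hexdigest() == expected_md5:
                    return  # download verified against the published MD5
        # A bad checksum is retryable here, not fatal (cf. verify_md5's return 2).
        size = os.path.getsize(dest) if os.path.exists(dest) else 0
        print(f"attempt {attempt}/{attempts} failed (on-disk size: {size} bytes)",
              file=sys.stderr)
    raise RuntimeError(f"download failed after {attempts} attempts: {url}")
```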

Authored with Claude Code.

### Test plan
CI

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell
Differential Revision: D103690468

Pull Request resolved: #19286
Seeing small numerical flakiness causing diverging output. The output
is roughly equivalent, though. The current theory is that different CPU
architectures trigger different XNNPACK kernels, causing minor
differences in output.
With these giant 500+ op tests we often get a flaky 1-in-500 failure.
Just adding retries to make this a little less noisy.

Failure is something like FAILED
backends/cadence/aot/tests/test_replace_ops_passes.py::TestReplaceOpsPasses::test_replace_conv2d_with_linear
- AssertionError: Pass validation failed for pass
ReplaceTrivialConvWithLinear. Output tensor 0 differs by max
1.525879e-05. Expected rtol=2e-05, atol=1e-06. Original output:
tensor([[[[ 6.5604]],
)

### Summary
a79521b ("Add LongRoPE support and fp64 RoPE precompute for Phi-3 /
Phi-4 family") unconditionally moved hf_precompute_freqs_cis to fp64
cos/sin precompute with a final cast to fp32. That works for the Phi-4
device validation that motivated the commit, but it broke
test_static_attention.py::test_within_transformer on the Linux unittest
runners (pull, pull-editable, trunk-release have been 100% red since the
commit landed).

The test compares mha_transformer (built with use_hf_rope=False, taking
the pure-fp32 precompute_freqs_cis path) against static_transformer
(built with use_hf_rope=True, taking hf_precompute_freqs_cis) at
rtol=1e-3, with shared weights. Before a79521b, both paths produced
bit-identical fp32 cos/sin tables (verified empirically: 0/192 entries
differed). After the commit, HF cos/sin diverge from non-HF by ~1 ULP in
38/192 entries; that drift compounds across 4 transformer layers and
tips past rtol=1e-3 on the CI runners (Python 3.10, source-built torch).
Local Python 3.12 stayed just barely within tolerance, which is why
review missed it.

Gate the fp64 precompute on the property the original commit was
actually protecting: a non-trivial cos/sin scale being applied. That is
either LongRoPE active (Phi-3 / Phi-4 set short_factor and long_factor
via config) or an explicit attention_factor != 1.0 passed through. Both
cases preserve fp64; vanilla HF RoPE (Llama family, the test config)
goes back to fp32 throughout and re-establishes bit-identical agreement
with the non-HF path.
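
A sketch of the gating condition described above (helper name and config access are hypothetical):

```python
def _use_fp64_precompute(config, attention_factor: float = 1.0) -> bool:
    # LongRoPE (Phi-3 / Phi-4): short_factor / long_factor set via config.
    longrope_active = (
        getattr(config, "short_factor", None) is not None
        or getattr(config, "long_factor", None) is not None
    )
    # fp64 only when a non-trivial cos/sin scale is applied; vanilla HF
    # RoPE (Llama family) stays fp32 and matches the non-HF path bit-for-bit.
    return longrope_active or attention_factor != 1.0
```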

Authored with Claude Code.

### Test plan
CI
…es (#19314)

Summary:
The 6 type-trait checks below were defined as TEST_F(CUDAGuardTest, ...)
and TEST_F(CUDAStreamGuardTest, ...). Both fixtures' SetUp() calls
GTEST_SKIP() when no CUDA device is available, so on every test host
without an attached GPU these tests skip instead of running:

  CUDAGuardTest.CopyConstructorDeleted
  CUDAGuardTest.CopyAssignmentDeleted
  CUDAGuardTest.MoveAssignmentDeleted
  CUDAStreamGuardTest.CopyConstructorDeleted
  CUDAStreamGuardTest.CopyAssignmentDeleted
  CUDAStreamGuardTest.MoveAssignmentDeleted

Because they never produced a successful run (Passes: 0 across 173 / 23
runs, all skips), TestX auto-disabled them and they show up as DISABLED
on the executorch dashboard.

These are pure compile-time static_assert checks. They do not need a
CUDA device or any runtime state — if the file compiles, they pass.
Move them into a separate non-fixture test suite
(CUDAGuardCompileTimeTest /
CUDAStreamGuardCompileTimeTest) so they run unconditionally.

The remaining 15 fixture-based tests still need a real CUDA device and
will be addressed separately (fixing the gpu-remote-execution platform
deps so cudaGetDeviceCount returns a non-zero value).

Reviewed By: Gasoonjia

Differential Revision: D103937761
Summary:
`xplat/executorch/extension/training/examples/XOR/BUCK` invokes
`define_common_targets()` for both fbcode (`fbcode_target`) and xplat
(`non_fbcode_target`). The python targets in this example
(`model`, `export_model_lib`, `export_model`) depend on
`//caffe2:torch` and `//executorch/exir:lib`, neither of which is
defined as an xplat target — `xplat/executorch/exir/BUCK` only
declares the `:lib` target via `fbcode_target(...)`. As a result the
xplat configuration of
`fbsource//xplat/executorch/extension/training/examples/XOR:export_model`
fails analysis with:

  Unknown target `lib` from package `fbsource//xplat/executorch/exir`.
Did you mean one of the 0 targets in
fbsource//xplat/executorch/exir:BUCK?

This produced 218/218 BUILD_RULE failures on the
`fbsource//xplat/executorch/extension/training/examples/XOR:export_model`
target, with no successful run on record (linked to T168807700).

Wrap the three python rules with `if not is_xplat():` so they only
register when called from fbcode, matching the established precedent
in `xplat/executorch/kernels/portable/test/targets.bzl`. The
`train_xor` C++ binary continues to be defined for both cells since
its dependencies are xplat-compatible.
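
Schematically, the guard looks like this (a Starlark sketch; the helper names are hypothetical and rule bodies are elided):

```python
def define_common_targets():
    if not is_xplat():
        # fbcode-only: model / export_model_lib / export_model depend on
        # //caffe2:torch and //executorch/exir:lib, which have no xplat targets.
        _define_python_example_targets()
    # train_xor's deps are xplat-compatible, so it stays in both cells.
    _define_train_xor_binary()
```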

Differential Revision: D103951555
Similar to the cadence retries: the numerics tests are likely a little
too strict, and a little flaky.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell
The job is a little flaky because the runner sometimes doesn't have pip.
This adds it as an explicit dependency.
Differential Revision: D103935830

Pull Request resolved: #19318
Summary:
Forward fix on top of D103817917 (`Arm backend: Cleanup dim-order
and permute handling` — DiffTrain import of upstream PR #19278).

D103817917 reverted an internally-applied test split: the bool
permute case (`rank2_bool`) is U55-rejected and was already moved
out of the U55-delegating test on master into a separate
`test_data_suite_u55_reject` set with a dedicated
`test_permute_u55_INT_not_delegated` test using
`OpNotSupportedPipeline`. Upstream PR #19278 doesn't include that
split, so the DiffTrain import wipes it out and brings back the
combined `test_data_suite` + special-cased bool branch in
`test_permute_u55_INT`. That regresses CI:
`test_permute_u55_INT[rank2_bool]` is reported as a critical
LAND_BLOCKING failure on D103817917.

Re-apply the split so trunk returns to the clean form after
D103817917 lands:
- Add `OpNotSupportedPipeline` import.
- Move `rank2_bool` out of `test_data_suite` into a new
  `test_data_suite_u55_reject` dict.
- Drop the dead `if test_data[0].dtype == torch.bool: ...`
  workaround block from `test_permute_u55_INT` (no bool flows
  through this test anymore).
- Add `test_permute_u55_INT_not_delegated` parametrized over
  `test_data_suite_u55_reject`, exercising
  `OpNotSupportedPipeline` with `u55_subset=True`.

The `test_data_suite_u55` dict introduced by D103817917 (large
permutes that only U55 needs to exercise) is preserved unchanged.

Differential Revision: D103963260




cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell
Several macOS jobs have been timing out badly since this change,
blocking viable/strict. We could just increase the time limit, but until
I actually get a viable/strict bump I don't want to be stuck waiting 1-2
hours for these jobs to run, so I'm reverting to get back to the
30-40 minute runtimes.
Summary:

Fixes 3 `-Werror` diagnostics that broke the qualcomm llama runner build
on
`cfg:android-arm64-clang19-no-san` and disabled the following test infra
targets:

- `xplat/executorch/examples/qualcomm/oss_scripts/llama:runner_lib`
- `xplat/executorch/examples/qualcomm/oss_scripts/llama:runner_lib_static`
- `xplat/executorch/examples/qualcomm/oss_scripts/llama:qnn_llama_runner`
- `xplat/executorch/examples/qualcomm/oss_scripts/llama:qnn_llama_runner_static`

Three diagnostics fixed:

1. `-Wreorder-ctor` in `runner.cpp`: `attention_sink_rope_module_` is
declared as the 2nd field of `Runner<T>` (right after `module_`) but the
constructor initializer list appended it last, after `tokenizer_`. Moved
it to the correct position in the init list to match declaration order.
   Recent regression introduced in the attention-sink diff (#16574).

2. `-Woverloaded-virtual` in `lhd_token_generator.h` and
   `multimodal_lhd_token_generator.h`: the derived classes define a
`prepare_io(std::vector<uint64_t>, std::vector<int32_t>)` overload that
   hides the base class virtual `prepare_io(uint64_t, int64_t)`. Added a
   `using TokenGenerator<T>::prepare_io;` (and equivalent for the
multimodal hierarchy) declaration so the base virtual stays in scope and
the warning is silenced without changing behavior. Latent bug surfaced
   by the clang19 toolchain bump.

3. `-Wdelete-non-abstract-non-virtual-dtor` in `prompt_processor.h`:
   `PromptProcessor<T>` has virtual member functions but no virtual
   destructor, so deleting via `std::unique_ptr<PromptProcessor<T>>` in
   `Runner` was undefined behavior under strict warnings. Added
   `virtual ~PromptProcessor() = default;` mirroring the pattern already
   used in `TokenGenerator` (`token_generator.h`). Also transitively
   fixes `MultimodalPromptProcessor<T>`.

Reviewed By: rascani

Differential Revision: D103991803
Differential Revision: D104074211

Pull Request resolved: #19337
The job has been timing out since the first attempt.
Add a missing dimension check that was breaking a test.
Differential Revision: D103455766

Pull Request resolved: #19264
HuggingFace's Xet storage backend stalls mid-download on CI runners,
causing the 90-minute job timeout to fire before model weights finish
downloading. Force standard HTTP downloads instead.

(from debug logs in #19352)
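
For reference, one way to force the fallback (this assumes the `HF_HUB_DISABLE_XET` environment variable honored by huggingface_hub; the repo id is a placeholder):

```python
import os

# Must be set before huggingface_hub is imported / first used.
os.environ["HF_HUB_DISABLE_XET"] = "1"  # assumption: opts out of Xet-backed downloads

from huggingface_hub import snapshot_download

snapshot_download("org/model-name")  # placeholder repo id
```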
Summary:
The 13 XCTestCase methods in
`xplat/executorch/extension/llm/apple:ExecuTorchLLMTests`
(testLLaMA, testPhi4, testGemma, testLLaVA, testVoxtral and their
reset variants) regularly hit the 1800-second per-test ceiling
enforced by `fbobjc/Tools/xctest_runner` for the `long_running`
label. LLM inference on iOS-sim CPU (1B-class models,
128-768 token sequences, each test calls `generate()` twice)
routinely exceeds 30 minutes per test method, producing spurious
"Test timed out after 1800 seconds" flakes on the test-issues
dashboard for owner `ai_infra_mobile_platform`.

Per the runner formula
`TEST_CASE_TIMEOUT(60s) * label_multiplier * 3`:

| label          | multiplier | per-XCTestCase budget |
|----------------|-----------:|----------------------:|
| long_running   |        x10 |                 1800s |
| glacial (here) |        x30 |                 5400s |

Switching to `glacial` (the highest tier supported by the runner)
gives each test 90 minutes. Adding
`test_test_rule_timeout_ms = 14400000` sets the bundle-level
wall-clock budget to 4h, which is comfortable headroom for ~5
testcases at 90 min each plus xctest setup/teardown.
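
A quick check of the runner formula against the table:

```python
TEST_CASE_TIMEOUT_S = 60
LABEL_MULTIPLIER = {"long_running": 10, "glacial": 30}

# TEST_CASE_TIMEOUT(60s) * label_multiplier * 3
budgets = {label: TEST_CASE_TIMEOUT_S * mult * 3
           for label, mult in LABEL_MULTIPLIER.items()}
print(budgets)  # {'long_running': 1800, 'glacial': 5400}
```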

Note: this diff is unrelated to T269848646. T269848646 tracks a
separate cluster of 446 iOS-sim test-run *cancellations*
(`duration: 0.00`, "test execution was cancelled because the test
run was cancelled") that is owned by testinfra and is not
addressed here.

Reviewed By: shoumikhin

Differential Revision: D104147313
rascani and others added 11 commits May 8, 2026 17:59
Differential Revision: D104297130

Pull Request resolved: #19378
### Summary
It should work now that google/pthreadpool#92 is
merged.
Fixes #11723

## Summary

`torch.split` fails with `RuntimeError: Found a custom (non-ATen)
operator whose output has alias annotations` when used with
`to_edge_transform_and_lower` and a partitioner that requests op
preservation.

**Root cause**: `_remove_invalid_ops_for_not_decompose` relies on
`torchgen`'s `aliased_return_names()` to detect ops with aliased
returns. However, for ops returning lists of aliased tensors (e.g.,
`split.Tensor` returns `Tensor(a)[]`), `aliased_return_names()` returns
`[None]`, failing to detect the alias annotation. This lets
`split.Tensor` pass through into the `EDGE_DO_NOT_DECOMP` namespace,
where functionalization fails.

**Fix**: Add a fallback check using `op._schema.returns` directly, which
correctly reports `alias_info` on list return types. This also fixes the
same latent issue for `chunk.default` and `tensor_split.sections`.
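
A sketch of the fallback check (helper name hypothetical; per the description above, `op._schema.returns` exposes `alias_info` even for list return types):

```python
import torch

def _returns_alias(op: torch._ops.OpOverload) -> bool:
    # torchgen's aliased_return_names() yields [None] for list returns
    # like Tensor(a)[]; the schema itself still carries the annotation.
    return any(ret.alias_info is not None for ret in op._schema.returns)

# split.Tensor returns Tensor(a)[] and should now be filtered:
assert _returns_alias(torch.ops.aten.split.Tensor)
```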

## Test plan

- Added `test_remove_invalid_ops_filters_aliased_list_returns`
regression test
- Run: `pytest
exir/tests/test_passes.py::TestPasses::test_remove_invalid_ops_filters_aliased_list_returns
-xvs`
- Verified existing split-related test still passes:
`test_to_out_variant_singleon_tensor_list`
- Verified existing broken ops test still passes:
`test_compile_fix_broken_ops`

<details>
<summary>Before fix</summary>

```
==================== BEFORE FIX ====================
RESULT: FAILED
RuntimeError: Found a custom (non-ATen) operator whose output has alias annotations: EDGE_DO_NOT_DECOMP::split.Tensor(Tensor(a -> *) self, SymInt split_size, int dim=0) -> Tensor(a)[]. We only support functionalizing operators whose outputs do not have alias annotations (e.g. 'Tensor(a)' is a Tensor with an alias annotation whereas 'Tensor' is a Tensor without. The '(a)' is the alias annotation). The alias annotation specifies that the output Tensor shares storage with an input that has the same annotation. Please check if (1) the output needs to be an output (if not, don't return it), (2) if the output doesn't share storage with any inputs, then delete the alias annotation. (3) if the output indeed shares storage with an input, then add a .clone() before returning it to prevent storage sharing and then delete the alias annotation. Otherwise, please file an issue on GitHub.

While executing %split : [num_users=3] = call_function[target=torch.ops.EDGE_DO_NOT_DECOMP.split.Tensor](args = (%x, 2), kwargs = {})
Original traceback:
None
Use tlparse to see full graph. (https://github.com/pytorch/tlparse?tab=readme-ov-file#tlparse-parse-structured-pt2-logs)
```

</details>

<details>
<summary>After fix</summary>

```
==================== AFTER FIX ====================
WARNING:root:Op aten.split.Tensor was requested for preservation by partitioner.  This request is ignored because it aliases output.

Test 1: to_edge (no partitioner)
RESULT: SUCCESS - outputs match

Test 2: to_edge_transform_and_lower with split.Tensor preservation
RESULT: SUCCESS - split.Tensor correctly filtered from EDGE_DO_NOT_DECOMP
         (AttributeError from dummy partitioner partition(), not from split bug)

Test 3: _remove_invalid_ops_for_not_decompose filter check
  aten::split.Tensor                            -> FILTERED (correct)
  aten::chunk                                   -> FILTERED (correct)
  aten::tensor_split.sections                   -> FILTERED (correct)
```

</details>

<details>
<summary>Unit test output</summary>

```
$ pytest exir/tests/test_passes.py::TestPasses::test_remove_invalid_ops_filters_aliased_list_returns -xvs

============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-8.4.2
collected 1 item

exir/tests/test_passes.py::TestPasses::test_remove_invalid_ops_filters_aliased_list_returns PASSED

============================== 1 passed in 6.83s ===============================

$ pytest exir/tests/test_passes.py::TestPasses::test_to_out_variant_singleon_tensor_list -xvs
PASSED

$ pytest exir/tests/test_passes.py::TestPasses::test_compile_fix_broken_ops -xvs
PASSED
```

</details>

This PR was authored with the assistance of Claude.

---------

Signed-off-by: Lidang-Jiang <[email protected]>
Differential Revision: D103667823

Pull Request resolved: #19319
<< DO NOT EDIT BELOW THIS LINE >>
@diff-train-skip-merge
Differential Revision: D104429023

Pull Request resolved: #19400
Summary:
Three test methods in

`fbcode/executorch/kernels/portable/test/op_upsample_bilinear2d_aa_test.py`
have been auto-disabled as flaky on the test-issues dashboard
(owner ai_infra_mobile_platform):

- test_upsample_bilinear2d_aa_aten_parity_u8
- test_upsample_bilinear2d_aa_aggressive_downsampling
- test_upsample_bilinear2d_aa_align_corners_downsampling

Root cause: each test builds its input via `torch.randint(...)` or
`torch.randn(...)` with no seed pinned, so each run sees a different
sample. The configured `atol` was tight enough that on some draws the
ATen-vs-ExecuTorch divergence (driven by separable-vs-direct
anti-aliased interpolation differences) crossed the threshold and the
test flipped to FAIL. The kernel implementations themselves are not
changing across runs.

Fix:

1. Add `setUp(self): torch.manual_seed(0)` so every run sees the same
   input tensor and the same divergence, eliminating the run-to-run
   FAIL/PASS oscillation.
2. Bump two atol thresholds to cover the worst-case observed
   divergence with the now-pinned input:
   - u8 parity: 3.5 -> 5 (observed max abs error 4 / 255)
   - aggressive 4x downsampling: 0.4 -> 1.0 (observed max abs error
     ~0.59 for N(0,1) input)
3. The pre-existing `atol=0.25` on align_corners_downsampling is left
   unchanged - with seed 0 it now passes consistently.

The relaxed tolerances are still well below any change that would
indicate an actual kernel regression; the comprehensive C++ test
suite in `op_upsample_bilinear2d_aa_test.cpp` still validates the
kernel under tighter constraints.
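
Schematically (class and test names abbreviated; shape and assertion are illustrative):

```python
import unittest
import torch

class UpsampleBilinear2dAATest(unittest.TestCase):
    def setUp(self):
        # Pin the RNG so every run draws the same input and sees the
        # same ATen-vs-ExecuTorch divergence.
        torch.manual_seed(0)

    def test_aten_parity_u8(self):
        x = torch.randint(0, 256, (1, 3, 16, 16), dtype=torch.uint8)
        self.assertEqual(x.dtype, torch.uint8)  # stand-in for the real parity check
```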

Reviewed By: rascani

Differential Revision: D104150928
Fixes #19348

### Summary

- Add `extension/module` and `extension/tensor` headers to the Doxygen
inputs used by the runtime API reference.
- Expand the module namespace macros so Breathe can resolve the
documented extension classes with stable namespace names.
- Add runtime API reference sections for `Module`, `BundledModule`, and
the tensor extension namespace.

### Test plan

- `git diff --check origin/main..HEAD`
- `python -m py_compile docs/source/conf.py`
- `cd docs && doxygen source/Doxyfile`
- Isolated Breathe/Sphinx build of
`executorch-runtime-api-reference.rst` against the generated Doxygen XML
- Verified the rendered runtime API page contains the new Module
Extension and Tensor Extension entries

cc @mergennachin @AlannaBurke
## Context
The original K-loop did `tl.max(tl.abs(a))` + INT8 cast on every tile
(16 tiles × 16 rows = 256 reductions per program). Hoisting eliminates
this redundant work and halves activation HBM bandwidth in the GEMM
(bf16 → int8).

## Improvement
Pre-quantize activations to INT8 once into a dedicated buffer (with
per-row-per-tile FP32 scales) **before** the W4A8 batched MoE GEMMs,
instead of re-quantizing inside the K-loop on every tile.
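
In plain PyTorch terms, the hoisted quantization looks roughly like this (a sketch of the idea, not the Triton kernel; simplified to per-row scales where the kernel keeps per-row-per-tile scales):

```python
import torch

def prequantize_activations_int8(a: torch.Tensor):
    # One pass over the bf16 activations, done once before the batched
    # MoE GEMMs instead of tl.max(tl.abs(a)) + cast inside every K-tile.
    amax = a.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8)
    scales = (amax / 127.0).to(torch.float32)          # per-row fp32 scales
    q = torch.clamp((a / scales).round(), -127, 127).to(torch.int8)
    return q, scales

a = torch.randn(128, 2048, dtype=torch.bfloat16)
q, scales = prequantize_activations_int8(a)
deq = q.to(torch.float32) * scales                     # ~a, within int8 noise
```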

## Perf (1600-token prefill)

| Metric | Baseline (`gh/digantdesai/53/head`) | Optimized | Speedup |
|---|---|---|---|
| Prefill | 5727 tok/s (5296–5963) | **6171** tok/s (5941–6313) | **1.08×** |

## Correctness
7/7 microbenchmark configs (incl. qwen3.5-like M=128, K=2048, gs=128)
pass with relative diff <1.5% vs BF16 reference — within INT8
quantization noise.


cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell
Add `include_torch` parameter (default False) to
`define_custom_op_test_binary()`. None of the custom op test binaries
directly include torch/ATen/c10 headers, so libtorch was unnecessary
baggage. Dropping it reduces the q4gsw_linear_adreno binary from ~1 GB
to 74 MB.

Differential Revision: [D104456804](https://our.internmc.facebook.com/intern/diff/D104456804/)

ghstack-source-id: 379498992
Pull Request resolved: #19402
Adds infrastructure for querying GPU subgroup capabilities and pinning required subgroup size at pipeline creation time, sourced from the existing `SUBGROUP_SIZE` yaml template parameter. This is the foundation for writing subgroup-using shaders (e.g. cooperative GEMV variants) that remain portable across GPUs with different subgroup widths (Adreno=64, Mali=16, NVIDIA=32, etc.).

`PhysicalDevice` now chains `VkPhysicalDeviceSubgroupProperties` and `VkPhysicalDeviceSubgroupSizeControlProperties` into `vkGetPhysicalDeviceProperties2`, plus `VkPhysicalDeviceSubgroupSizeControlFeatures` into `vkGetPhysicalDeviceFeatures2`. The `Adapter` exposes accessors for subgroup_size, supported subgroup ops/stages, [min,max] subgroup size range, and whether the driver supports per-pipeline required subgroup size for the COMPUTE stage. `VK_EXT_subgroup_size_control` is added to the requested extension list and the size-control features are chained into device-create pNext when supported.

`ComputePipeline::Descriptor` gains a `required_subgroup_size` field that, when nonzero, chains `VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT` into pipeline creation (both the on-demand `retrieve` path and the batch `create_pipelines` path). The pipeline cache key includes the field so pipelines compiled for different subgroup widths cache independently. `ShaderInfo` carries the same field so it can be plumbed from shader yaml through to the pipeline descriptor.

The existing `SUBGROUP_SIZE` yaml template parameter is now the single source of truth: `gen_vulkan_spv.py` substitutes it into GLSL as before AND emits it as `ShaderInfo::required_subgroup_size`. At dispatch, `vkapi::resolve_required_subgroup_size` validates the value is within the adapter's `[min, max]` range and throws `ShaderNotSupportedError` if the extension is unsupported or the value is out of range, surfacing a clear failure rather than silently miscompiling a shader whose algorithm depends on the pinned subgroup width.

No shader yamls are modified by this change; subsequent commits opt their shaders into the pinning by declaring `SUBGROUP_SIZE` in their yamls.

Differential Revision: [D104456803](https://our.internmc.facebook.com/intern/diff/D104456803/)

ghstack-source-id: 379498994
Pull Request resolved: #19403

pytorch-bot Bot commented May 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19456

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 114 Pending, 3 Unrelated Failures

As of commit 0cafcb2 with merge base 0cafcb2:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label May 11, 2026
github-actions bot added the ciflow/trunk and module: arm labels May 11, 2026
JacobSzwejbka merged commit 7539fa1 into release/1.3 on May 11, 2026 (1026 of 1177 checks passed)
JacobSzwejbka deleted the jakeszwe/release-1.3-rebase-0cafcb20 branch May 11, 2026 16:43
JacobSzwejbka temporarily deployed to upload-benchmark-results May 13, 2026 03:31 with GitHub Actions