
Bump iOS XCTest timeout for ExecuTorchLLMTests#19354

Open
psiddh wants to merge 2 commits into pytorch:main from psiddh:export-D104147313


@psiddh
Contributor

@psiddh psiddh commented May 7, 2026

Summary:
The 13 XCTestCase methods in
xplat/executorch/extension/llm/apple:ExecuTorchLLMTests
(testLLaMA, testPhi4, testGemma, testLLaVA, testVoxtral and their
reset variants) regularly hit the 1800-second per-test ceiling
enforced by fbobjc/Tools/xctest_runner for the long_running
label. LLM inference on iOS-sim CPU (1B-class models,
128-768 token sequences, each test calls generate() twice)
routinely exceeds 30 minutes per test method, producing spurious
"Test timed out after 1800 seconds" flakes on the test-issues
dashboard for owner ai_infra_mobile_platform.

Per the runner formula
TEST_CASE_TIMEOUT(60s) * label_multiplier * 3:

| label          | multiplier | per-XCTestCase budget |
|----------------|-----------:|----------------------:|
| long_running   |        x10 |                 1800s |
| glacial (here) |        x30 |                 5400s |
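
As a quick check of the table, the formula can be evaluated directly. This is only a sketch: the constant and function names below are hypothetical, and only the numbers (60 s base, x10 and x30 multipliers, trailing factor of 3) are taken from the description above.

```python
# Sketch of the per-test timeout formula quoted in the PR description.
# All names here are hypothetical; they are not the runner's real identifiers.

TEST_CASE_TIMEOUT_S = 60  # base per-test timeout in seconds (from the description)
FIXED_FACTOR = 3          # trailing "* 3" in the formula; its meaning is not stated

# Label multipliers as listed in the table.
LABEL_MULTIPLIER = {
    "long_running": 10,
    "glacial": 30,
}


def per_test_budget_s(label: str) -> int:
    """Return the per-XCTestCase budget in seconds for a given label."""
    return TEST_CASE_TIMEOUT_S * LABEL_MULTIPLIER[label] * FIXED_FACTOR


if __name__ == "__main__":
    for label in LABEL_MULTIPLIER:
        seconds = per_test_budget_s(label)
        print(f"{label}: {seconds} s ({seconds / 60:.0f} min)")
```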

Switching to glacial (the highest tier supported by the runner)
gives each test 90 minutes. Adding
test_test_rule_timeout_ms = 14400000 sets the bundle-level
wall-clock budget to 4h, which is comfortable headroom for ~5
testcases at 90 min each plus xctest setup/teardown.

Note: this diff is unrelated to T269848646. T269848646 tracks a
separate cluster of 446 iOS-sim test-run cancellations
(duration: 0.00, "test execution was cancelled because the test
run was cancelled") that is owned by testinfra and is not
addressed here.

Reviewed By: shoumikhin

Differential Revision: D104147313

Copilot AI review requested due to automatic review settings May 7, 2026 00:12
@pytorch-bot

pytorch-bot Bot commented May 7, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19354

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 17 Unrelated Failures

As of commit 879d3a8 with merge base af90130 (image):

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label May 7, 2026
@meta-codesync
Contributor

meta-codesync Bot commented May 7, 2026

@psiddh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104147313.

@github-actions

github-actions Bot commented May 7, 2026

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
`@pytorchbot label "release notes: none"`

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Contributor

Copilot AI left a comment

Pull request overview

This PR adjusts the Buck test configuration for the iOS LLM XCTest bundle to reduce spurious timeouts during long CPU-based simulator inference runs.

Changes:

  • Switches the test label from long_running to glacial to increase the per-XCTestCase timeout tier.
  • Sets a larger rule-level wall-clock timeout for the generated test bundle via test_test_rule_timeout_ms.

@psiddh psiddh force-pushed the export-D104147313 branch from 1a14fa8 to 24581b5 Compare May 7, 2026 00:18
@meta-codesync meta-codesync Bot changed the title Bump iOS XCTest timeout for ExecuTorchLLMTests Bump iOS XCTest timeout for ExecuTorchLLMTests (#19354) May 7, 2026
psiddh added a commit to psiddh/executorch that referenced this pull request May 7, 2026
Summary:

The 13 XCTestCase methods in
`xplat/executorch/extension/llm/apple:ExecuTorchLLMTests`
(testLLaMA, testPhi4, testGemma, testLLaVA, testVoxtral and their
reset variants) regularly hit the 1800-second per-test ceiling
enforced by `fbobjc/Tools/xctest_runner` for the `long_running`
label. LLM inference on iOS-sim CPU (1B-class models,
128-768 token sequences, each test calls `generate()` twice)
routinely exceeds 30 minutes per test method, producing spurious
"Test timed out after 1800 seconds" flakes on the test-issues
dashboard for owner `ai_infra_mobile_platform`.

Per the runner formula
`TEST_CASE_TIMEOUT(60s) * label_multiplier * 3`:

| label          | multiplier | per-XCTestCase budget |
|----------------|-----------:|----------------------:|
| long_running   |        x10 |                 1800s |
| glacial (here) |        x30 |                 5400s |

Switching to `glacial` (the highest tier supported by the runner)
gives each test 90 minutes. Adding
`test_test_rule_timeout_ms = 28800000` sets the bundle-level
wall-clock budget to 8h, which is comfortable headroom for ~5
testcases at 90 min each plus xctest setup/teardown.

Note: this diff is unrelated to T269848646. T269848646 tracks a
separate cluster of 446 iOS-sim test-run *cancellations*
(`duration: 0.00`, "test execution was cancelled because the test
run was cancelled") that is owned by testinfra and is not
addressed here.

Differential Revision: D104147313
@psiddh psiddh force-pushed the export-D104147313 branch from 24581b5 to b3b6a27 Compare May 7, 2026 00:45
@psiddh psiddh requested a review from shoumikhin May 7, 2026 00:45
psiddh added a commit to psiddh/executorch that referenced this pull request May 7, 2026
@psiddh psiddh force-pushed the export-D104147313 branch from b3b6a27 to 611181f Compare May 7, 2026 00:48
psiddh added a commit to psiddh/executorch that referenced this pull request May 7, 2026
@meta-codesync meta-codesync Bot changed the title Bump iOS XCTest timeout for ExecuTorchLLMTests (#19354) Bump iOS XCTest timeout for ExecuTorchLLMTests May 7, 2026
Copilot AI review requested due to automatic review settings May 7, 2026 03:31
@psiddh psiddh force-pushed the export-D104147313 branch from 611181f to 46906d4 Compare May 7, 2026 03:31
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread extension/llm/apple/BUCK Outdated
Co-authored-by: Copilot Autofix powered by AI <[email protected]>
Copilot AI review requested due to automatic review settings May 7, 2026 04:36
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread extension/llm/apple/BUCK
Comment on lines +24 to +27
# Rule-level wall-clock for the whole auto-generated test bundle:
# ExecuTorchLLMTests currently contains 13 XCTestCase methods, and
# individual methods can exceed 30 minutes on iOS-sim CPU. This 4h
# budget is intended as the total bundle/shard wall-clock, including

Labels

CLA Signed, fb-exported, meta-exported
