feat(e2e): parallel test runs via org pool #1215
Open: ralphbean wants to merge 19 commits into `main` from `feat/e2e-parallel-orgs`
- 4e149b3 feat(e2e): add org pool for parallel test runs (ralphbean)
- 2e20e80 docs: add ADR 0039 for org pool parallel e2e tests (ralphbean)
- f5ca637 fix(e2e): harden org pool locking and add acquireOrg tests (ralphbean)
- a4e1257 fix(e2e): address review findings for org pool (ralphbean)
- 8ab00a0 feat(e2e): use shared mint and fullsend-ai app set (ralphbean)
- cff31bc feat(e2e): replace programmatic install/uninstall with CLI subprocess (ralphbean)
- 806eca8 ci(e2e): add E2E_GCP_PROJECT_ID secret to workflow (ralphbean)
- 567b1ed ci(e2e): increase job timeout to 30 minutes (ralphbean)
- c924c10 chore(e2e): restrict org pool to halfsend-01 (only enrolled org) (ralphbean)
- e99cc66 fix(e2e): set CLI subprocess working directory to module root (ralphbean)
- af6d689 fix(e2e): assert roles instead of agents in config.yaml (ralphbean)
- d6a44ee fix(e2e): update defaultRoles to match current defaults (ralphbean)
- 574b5e8 feat(e2e): add setup script for new e2e pool orgs (ralphbean)
- 4ce4bb5 fix(e2e): expand workflow trigger paths and add halfsend-02 to org pool (ralphbean)
- ea62202 fix(e2e): increase default lock timeout to 20 minutes (ralphbean)
- d693556 feat(e2e): add keyless GCP auth via Workload Identity Federation (ralphbean)
- 9a6d507 fix(e2e): move service account to secret reference (ralphbean)
- 4e2968e fix(e2e): address review findings for org pool (ralphbean)
- 74963c0 feat(e2e): download triage run artifacts on failure (ralphbean)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| --- | ||
| title: "40. Org pool for parallel e2e tests" | ||
| status: Accepted | ||
| relates_to: | ||
| - testing-agents | ||
| topics: | ||
| - e2e | ||
| - ci | ||
| - parallelism | ||
| --- | ||
| | ||
| # 40. Org pool for parallel e2e tests | ||
| | ||
| Date: 2026-05-19 | ||
| | ||
| ## Status | ||
| | ||
| Accepted | ||
| | ||
| ## Context | ||
| | ||
| The e2e tests exercise the full admin install/uninstall flow against a live | ||
| GitHub org using Playwright browser automation | ||
| ([ADR 0010](0010-stored-session-for-e2e-browser-auth.md)). Each run creates | ||
| GitHub Apps, repos, secrets, variables, and enrollment PRs — then tears them | ||
| all down. These operations are destructive and non-reentrant: two concurrent | ||
| runs targeting the same org will collide on shared resources (`.fullsend` repo, | ||
| org secrets, app slugs) and fail unpredictably. | ||
| | ||
| With a single test org, CI runs are serialized. A push to `main` and an | ||
| in-flight PR both trigger e2e, but only one can proceed; the other waits or | ||
| fails. As the contributor count grows this becomes a bottleneck. | ||
| | ||
| ## Decision | ||
| | ||
| Maintain a pool of identically-configured GitHub orgs (currently `halfsend-01` | ||
| through `halfsend-06`). Each e2e run acquires exclusive access to one org | ||
| before proceeding, using a lightweight distributed lock implemented as a | ||
| purpose-built repo (`e2e-lock`) within each org. | ||
| | ||
| **Acquisition:** The test runner scans the pool in order, attempting to create | ||
| the `e2e-lock` repo in each org. Repo creation is atomic on GitHub — if it | ||
| succeeds, the caller holds the lock. A `README.md` in the lock repo contains | ||
| the run's UUID for ownership verification. If all orgs are locked, the runner | ||
| falls back to polling with a configurable timeout (`E2E_LOCK_TIMEOUT`, | ||
| default 20 minutes). | ||
| | ||
| **Release:** On test completion (pass or fail), the runner deletes the | ||
| `e2e-lock` repo, but only after verifying the UUID matches — preventing a | ||
| run from releasing another run's lock. | ||
| | ||
| **Staleness:** If a runner crashes without releasing its lock, the lock repo's | ||
| `created_at` timestamp provides an age signal. A fresh lock (under 1 minute | ||
| old) resets the wait timer; an old lock is assumed stale and can be | ||
| force-acquired by a subsequent run. | ||
| | ||
| Adding new orgs to the pool requires only provisioning the GitHub org (with | ||
| the shared test account as owner) and appending its name to the `orgPool` | ||
| slice in the test code. No architectural changes are needed. | ||
| | ||
| ## Consequences | ||
| | ||
| - Up to N e2e runs execute in parallel, where N is the pool size. | ||
| - Each org must be pre-provisioned with the test account as owner and a | ||
| `test-repo` for enrollment testing. | ||
| - A crashed run leaves a stale lock that self-heals via the age-based | ||
| staleness check. | ||
| - The single `botsend` test account and its stored browser session are shared | ||
| across all orgs; session export and PAT creation remain per-run. | ||
| - Pool expansion is an operational task (provision org, update one slice | ||
| literal), not an architectural change. |
[LOW] No workflow concurrency control after removing the global group
With only 2 active orgs, 3+ concurrent workflow runs will waste CI resources waiting for locks (up to 20 min each). Consider adding a concurrency group matching the pool size:
This cancels superseded commits on the same branch/PR while still allowing parallel runs across different PRs.
Flagged by 4/9 review agents