| 1 | +# Ephemeral Brev E2E Test Infrastructure |
| 2 | + |
| 3 | +## Overview & Objectives |
| 4 | + |
| 5 | +### Problem Statement |
| 6 | + |
| 7 | +Security-critical PRs (e.g., #156 — credential sanitization for migration snapshots) cannot be properly tested in CI today. The existing test infrastructure has gaps: |
| 8 | + |
| 9 | +- **Docker sandbox E2E** (PR workflow): Simulated environment using `node:22-slim`. No real OpenShell gateway, no actual sandbox isolation. |
| 10 | +- **Nightly cloud E2E** (GitHub-hosted runner): Tests live inference via NVIDIA API but runs on `ubuntu-latest` — no OpenShell, no sandbox creation, no migration flow. |
| 11 | +- **Unit tests**: PR #156's tests re-implement production logic locally rather than exercising the real `createSnapshotBundle()` code path. |
| 12 | + |
| 13 | +There is no automated way to test the full stack: host OpenClaw credentials → migration snapshot → credential sanitization → sandbox bundle verification. |
| 14 | + |
| 15 | +### Goals |
| 16 | + |
| 17 | +1. Enable maintainers to trigger full-stack E2E tests against any PR branch |
| 18 | +2. Use **ephemeral** Brev instances — create a fresh box per test run, tear it down when done |
| 19 | +3. Report results back to the PR (check run + comment) |
| 20 | +4. Zero persistent infrastructure to maintain between runs |
| 21 | + |
| 22 | +### Non-Goals |
| 23 | + |
| 24 | +- Replacing the existing Docker sandbox E2E or nightly cloud E2E (they remain as-is) |
| 25 | +- Making this a required check on all PRs (it's on-demand, triggered by maintainers) |
| 26 | +- GPU testing — OpenShell gateway runs fine on CPU (confirmed) |
| 27 | +- Local inference / vLLM — not needed for sandbox and migration testing |
| 28 | + |
| 29 | +## Current State Analysis |
| 30 | + |
| 31 | +### Confirmed: Brev CLI Supports Full Ephemeral Lifecycle |
| 32 | + |
| 33 | +All of the following have been validated: |
| 34 | + |
| 35 | +| Capability | Command | Confirmed | |
| 36 | +|-----------|---------|-----------| |
| 37 | +| Headless auth | `brev login --token <REFRESH_TOKEN>` | Yes — refresh token from `~/.brev/credentials.json` works | |
| 38 | +| Create instance | `brev create <name> --cpu 4x16` | Yes — non-interactive, returns when ready | |
| 39 | +| SSH config update | `brev refresh` | Yes — writes `~/.ssh/config` entries, resolves instance name to IP | |
| 40 | +| SSH access | `ssh <name> "..."` | Yes — works after `brev refresh` | |
| 41 | +| Delete instance | `brev delete <name>` | Yes — non-interactive | |
| 42 | + |
| 43 | +### Confirmed: CPU-Only Bootstrap Works |
| 44 | + |
| 45 | +Validated on `agent-sandbox-f41b14` (Ubuntu 22.04, CPU-only, no GPU): |
| 46 | + |
| 47 | +- `scripts/brev-setup.sh` installs Node.js v22, Docker, OpenShell v0.0.14 |
| 48 | +- OpenShell gateway starts without `--gpu` flag (CPU-only mode) |
| 49 | +- Sandbox created and reaches Ready state |
| 50 | +- No GPU driver, NVIDIA toolkit, or vLLM needed |
| 51 | + |
| 52 | +### Required Env Vars for Bootstrap |
| 53 | + |
| 54 | +| Variable | Purpose | |
| 55 | +|----------|---------| |
| 56 | +| `NVIDIA_API_KEY` | Inference provider configuration during onboarding | |
| 57 | +| `GITHUB_TOKEN` | `gh release download` of OpenShell binary from NVIDIA/OpenShell | |
| 58 | +| `NEMOCLAW_NON_INTERACTIVE=1` | Skip interactive prompts | |
| 59 | +| `NEMOCLAW_SANDBOX_NAME` | Sandbox name for non-interactive onboarding | |
| 60 | + |
| 61 | +### Existing Patterns to Reuse |
| 62 | + |
| 63 | +- `bin/nemoclaw.js` deploy function (lines 136-206): rsync, SSH poll loop, `.env` SCP pattern |
| 64 | +- `test/e2e/test-full-e2e.sh` (355 lines): 6-phase E2E test with pass/fail harness |
| 65 | +- `test/e2e-test.sh` test #8 (lines 123-212): calls `createSnapshotBundle()` via real TypeScript plugin |
| 66 | +- `pr-limit.yaml`: PR comment interaction pattern, maintainer allowlist |
| 67 | + |
| 68 | +## Architecture Design |
| 69 | + |
| 70 | +### Ephemeral Box Per Test Run |
| 71 | + |
| 72 | +```mermaid |
| 73 | +sequenceDiagram |
| 74 | + participant M as Maintainer |
| 75 | + participant GH as GitHub Actions (ubuntu-latest) |
| 76 | + participant BV as Brev (new instance) |
| 77 | + |
| 78 | + M->>GH: Comment /test-hardware on PR |
| 79 | + GH->>GH: Validate maintainer, extract branch |
| 80 | + GH->>GH: brev login --token $BREV_API_TOKEN |
| 81 | + GH->>BV: brev create pr-<number> --cpu 4x16 |
| 82 | + GH->>GH: brev refresh (SSH config updated) |
| 83 | + GH->>BV: rsync PR branch code |
| 84 | + GH->>BV: SSH: pass secrets + run brev-setup.sh |
| 85 | + GH->>BV: SSH: run test suite |
| 86 | + BV-->>GH: exit code + streamed logs |
| 87 | + GH->>GH: brev delete pr-<number> (always) |
| 88 | + GH->>GH: Post check run + PR comment |
| 89 | +``` |
| 90 | + |
| 91 | +**Why ephemeral (not persistent):** |
| 92 | +- No state to clean between runs — every test starts fresh |
| 93 | +- No idle cost — box only exists during the test |
| 94 | +- Reproducible — same bootstrap every time, no drift |
| 95 | +- Replaceable — no "the CI box is broken" problems |
| 96 | + |
| 97 | +**Why not a self-hosted runner:** |
| 98 | +- Security: fork PRs could execute arbitrary code on a long-lived runner host |
| 99 | +- Complexity: runner agent requires maintenance |
| 100 | +- Cost: must be always-on to pick up jobs |
| 101 | + |
| 102 | +### Auth Chain |
| 103 | + |
| 104 | +``` |
| 105 | +GitHub Actions runner (ubuntu-latest) |
| 106 | + ├── brev login --token $BREV_API_TOKEN (refresh token from ~/.brev/credentials.json) |
| 107 | + ├── brev create pr-<N> --cpu 4x16 (new ephemeral instance) |
| 108 | + ├── brev refresh (SSH config updated with new IP) |
| 109 | + └── ssh pr-<N> "export NVIDIA_API_KEY=... && export GITHUB_TOKEN=... && ..." |
| 110 | +``` |
| 111 | + |
| 112 | +Only two repository secrets are needed: `BREV_API_TOKEN` and `NVIDIA_API_KEY`. `GITHUB_TOKEN` is the default workflow token. |
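The last hop of the auth chain builds an `export VAR=... &&` prefix for the remote command. A minimal sketch of doing that with shell-safe quoting, so values containing spaces or `$` survive the remote shell. `remote_env_prefix` is a hypothetical helper name, and the sketch assumes the remote login shell is bash, because `printf %q` emits bash quoting:

```shell
#!/usr/bin/env bash
# Sketch only: remote_env_prefix is a hypothetical helper name.
# Assumes the remote login shell is bash (printf %q emits bash quoting).

# remote_env_prefix NAME...
# Prints "export NAME=<quoted value> && " for each named variable.
remote_env_prefix() {
  local name prefix=""
  for name in "$@"; do
    # %q quotes the value so the remote shell reads it back as one word.
    prefix+="export ${name}=$(printf '%q' "${!name}") && "
  done
  printf '%s' "$prefix"
}
```

A call might look like `ssh "$INSTANCE_NAME" "$(remote_env_prefix NVIDIA_API_KEY GITHUB_TOKEN) bash scripts/brev-setup.sh"`. The values still travel on the ssh command line, exactly as in the chain above; the sketch only addresses quoting.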
| 113 | + |
| 114 | +## Configuration & Deployment Changes |
| 115 | + |
| 116 | +### Secrets (GitHub Repository) |
| 117 | + |
| 118 | +| Secret | Purpose | How to obtain | |
| 119 | +|--------|---------|---------------| |
| 120 | +| `NVIDIA_API_KEY` | Inference config during onboarding | Already exists | |
| 121 | +| `BREV_API_TOKEN` | Brev CLI headless auth | `brev login` locally, then: `python3 -c "import json; print(json.load(open('$HOME/.brev/credentials.json'))['refresh_token'])"` | |
| 122 | + |
| 123 | +`GITHUB_TOKEN` is automatically available in workflows as `github.token`. |
| 124 | + |
| 125 | +### Dependencies |
| 126 | + |
| 127 | +None. The Brev CLI is installed on the GitHub runner via curl, and all scripts use only bash, ssh, and rsync. |
| 128 | + |
| 129 | +## Implementation Phases |
| 130 | + |
| 131 | +## Phase 1: Ephemeral Box Orchestration Script |
| 132 | + |
| 133 | +- **Description**: Single bash script that manages the full lifecycle of an ephemeral Brev instance — create, bootstrap, run tests, destroy. |
| 134 | +- **Core Functionality**: Create fresh Brev box, sync code, bootstrap, run test suite, capture results, tear down. |
| 135 | +- **Dependencies**: Brev CLI available on PATH, `BREV_API_TOKEN` and `NVIDIA_API_KEY` env vars set. |
| 136 | +- **New file**: `scripts/e2e-brev-test.sh` |
| 137 | +- **Design**: |
| 138 | + - Inputs (env vars): |
| 139 | + - `BREV_API_TOKEN` — for `brev login --token` |
| 140 | + - `NVIDIA_API_KEY` — passed to VM for bootstrap |
| 141 | + - `GITHUB_TOKEN` — passed to VM for OpenShell download |
| 142 | + - `INSTANCE_NAME` — e.g. `pr-156-test` (caller provides, derived from PR number) |
| 143 | + - `TEST_SUITE` — which test to run (default: `full`) |
| 144 | + - Steps: |
| 145 | + 1. `brev login --token $BREV_API_TOKEN` |
| 146 | + 2. `brev create $INSTANCE_NAME --cpu 4x16 --detached` |
| 147 | + 3. `brev refresh` then SSH poll loop (60 attempts, 5s apart) until `ssh $INSTANCE_NAME "echo ok"` succeeds |
| 148 | + 4. `rsync -az --delete --exclude node_modules --exclude .git --exclude dist --exclude .venv ./ $INSTANCE_NAME:/home/ubuntu/nemoclaw/` |
| 149 | + 5. SSH: `export NVIDIA_API_KEY=... && export GITHUB_TOKEN=... && export NEMOCLAW_NON_INTERACTIVE=1 && export NEMOCLAW_SANDBOX_NAME=e2e-test && bash scripts/brev-setup.sh` |
| 150 | + 6. SSH: run selected test script, tee output to local log file |
| 151 | + 7. Capture remote exit code |
| 152 | + 8. (always, via trap) `brev delete $INSTANCE_NAME` |
| 153 | + 9. Exit with captured code |
| 154 | + - Uses `trap` to ensure `brev delete` runs even on script failure/interrupt |
| 155 | + - Reuses patterns from `bin/nemoclaw.js` deploy function (lines 136-206) |
| 156 | +- **Unit Test Requirements**: |
| 157 | + |
| 158 | + Tests to Write: |
| 159 | + |
| 160 | + 1. **test_creates_instance_with_correct_cpu_spec**: Verify `brev create` called with `--cpu 4x16` |
| 161 | + 2. **test_ssh_poll_retries_on_failure**: Verify poll loop retries when SSH fails |
| 162 | + 3. **test_ssh_poll_fails_after_timeout**: Verify script exits non-zero after 60 failed attempts |
| 163 | + 4. **test_cleanup_runs_on_failure**: Verify `brev delete` is called even when test script fails |
| 164 | + 5. **test_cleanup_runs_on_interrupt**: Verify `brev delete` is called on SIGINT/SIGTERM |
| 165 | + 6. **test_exit_code_propagation**: Verify remote test failure becomes local exit code |
| 166 | + 7. **test_rsync_excludes_correct_dirs**: Verify node_modules, .git, dist, .venv excluded |
| 167 | + |
| 168 | +- **Acceptance Criteria**: |
| 169 | + - Running `scripts/e2e-brev-test.sh` creates a Brev instance, bootstraps it, runs tests, and deletes it |
| 170 | + - Instance is always deleted — even on test failure, bootstrap failure, or interrupt |
| 171 | + - Test output is streamed to stdout and saved to a local log file |
| 172 | + - Non-zero exit from remote test propagates to local exit code |
| 173 | + - `brev ls` shows no leftover instances after the script completes |
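The Phase 1 control flow (SSH poll loop, log capture with exit-code propagation, guaranteed teardown) can be sketched as below. This is an illustration under assumptions, not the contents of `scripts/e2e-brev-test.sh`: the function names, the `DELETE_CMD` override, the `e2e.log` file name, and the remote test path are all hypothetical. The probe and delete commands are injectable so the logic can be exercised without a real Brev instance.

```shell
#!/usr/bin/env bash
# Sketch only: every function name here is hypothetical.

# wait_for_ssh ATTEMPTS DELAY_SECONDS PROBE_CMD...
# Returns 0 as soon as PROBE_CMD succeeds, 1 after ATTEMPTS failures.
wait_for_ssh() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # No need to sleep after the final failed attempt.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# run_and_log LOG_FILE CMD...
# Streams CMD's output to stdout, saves a copy to LOG_FILE, and returns
# CMD's exit status rather than tee's.
run_and_log() {
  local log_file="$1" status
  shift
  "$@" 2>&1 | tee "$log_file"
  status="${PIPESTATUS[0]}"
  return "$status"
}

# cleanup is meant for `trap cleanup EXIT`: it deletes the instance and
# then re-exits with the status the script was already exiting with.
cleanup() {
  local code=$?
  if [ -n "${INSTANCE_NAME:-}" ]; then
    # Left unquoted on purpose so "brev delete" splits into command + arg.
    # DELETE_CMD is a hypothetical override used only for testing.
    ${DELETE_CMD:-brev delete} "$INSTANCE_NAME" || true
  fi
  exit "$code"
}

# main shows the wiring; a real script would end with: main "$@"
main() {
  trap cleanup EXIT
  trap 'exit 130' INT   # 128 + SIGINT; exiting fires the EXIT trap
  trap 'exit 143' TERM  # 128 + SIGTERM
  wait_for_ssh 60 5 ssh "$INSTANCE_NAME" "echo ok" || return 1
  # The remote path below is an assumption based on the rsync target.
  run_and_log e2e.log ssh "$INSTANCE_NAME" "bash nemoclaw/test/e2e/test-full-e2e.sh"
}
```

Routing SIGINT and SIGTERM through `exit` keeps the teardown logic in one place: the EXIT trap runs once for every way the script can end.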
| 174 | + |
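The Phase 1 unit tests need to observe how `brev` is called without creating real instances. One common approach for bash scripts is a PATH shim that records its arguments. A minimal sketch, where `make_brev_stub` and `BREV_STUB_LOG` are hypothetical names:

```shell
#!/usr/bin/env bash
# Sketch only: make_brev_stub and BREV_STUB_LOG are hypothetical names.

# make_brev_stub DIR
# Writes an executable fake `brev` into DIR. The fake appends its
# arguments to the file named by $BREV_STUB_LOG and exits 0.
make_brev_stub() {
  local dir="$1"
  mkdir -p "$dir"
  cat >"$dir/brev" <<'EOF'
#!/usr/bin/env bash
echo "$*" >>"${BREV_STUB_LOG:?BREV_STUB_LOG must be set}"
EOF
  chmod +x "$dir/brev"
}
```

A test such as `test_creates_instance_with_correct_cpu_spec` could then prepend the stub directory to `PATH`, run the script under test, and assert with `grep -q -- '--cpu 4x16' "$BREV_STUB_LOG"`. The `ssh` and `rsync` commands would need similar shims for the script to run end to end.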
| 175 | +## Phase 2: GitHub Workflow — Hardware E2E |
| 176 | + |
| 177 | +- **Description**: GitHub Actions workflow that triggers the ephemeral test run and reports results back to a PR. |
| 178 | +- **Core Functionality**: Trigger via dispatch or reusable call, run `e2e-brev-test.sh`, report results. |
| 179 | +- **Dependencies**: Phase 1 (script exists). `BREV_API_TOKEN` secret configured. |
| 180 | +- **New file**: `.github/workflows/hardware-e2e.yaml` |
| 181 | +- **Design**: |
| 182 | + - Triggers: `workflow_dispatch` (with inputs: branch, pr_number, test_suite) + `workflow_call` (same inputs, for reuse) |
| 183 | + - Permissions: `contents: read`, `checks: write`, `pull-requests: write` |
| 184 | + - Concurrency: single group `hardware-e2e`, cancel-in-progress |
| 185 | + - Guard: `if: github.repository == 'NVIDIA/NemoClaw'` |
| 186 | + - Timeout: 30 minutes (instance creation ~3 min + bootstrap ~10 min + tests ~15 min, about 28 min in total) |
| 187 | + - Steps: |
| 188 | + 1. Checkout `inputs.branch` |
| 189 | + 2. If `pr_number` provided: create GitHub check run (in_progress) via `gh api repos/.../check-runs` |
| 190 | + 3. Install Brev CLI: `curl -fsSL https://raw.githubusercontent.com/brevdev/brev-cli/main/bin/install-latest.sh | bash` |
| 191 | + 4. Run `scripts/e2e-brev-test.sh` with env vars: |
| 192 | + - `BREV_API_TOKEN: ${{ secrets.BREV_API_TOKEN }}` |
| 193 | + - `NVIDIA_API_KEY: ${{ secrets.NVIDIA_API_KEY }}` |
| 194 | + - `GITHUB_TOKEN: ${{ github.token }}` |
| 195 | + - `INSTANCE_NAME: pr-${{ inputs.pr_number || github.run_id }}` |
| 196 | + - `TEST_SUITE: ${{ inputs.test_suite }}` |
| 197 | + 5. (always) Update check run to completed with pass/fail conclusion |
| 198 | + 6. (always) Post PR comment: "Hardware E2E (suite): PASSED/FAILED on branch `X` [See logs](url)" |
| 199 | + 7. Upload logs artifact on failure |
| 200 | +- **Unit Test Requirements**: |
| 201 | + |
| 202 | + Tests to Write: |
| 203 | + |
| 204 | + 1. **test_workflow_dispatch_triggers_correctly**: Verify workflow accepts branch, pr_number, test_suite inputs |
| 205 | + 2. **test_check_run_created_for_pr**: Verify check run is created when pr_number is provided |
| 206 | + 3. **test_check_run_skipped_without_pr**: Verify no check run when pr_number is empty |
| 207 | + 4. **test_comment_posted_on_success**: Verify PR comment with PASSED status |
| 208 | + 5. **test_comment_posted_on_failure**: Verify PR comment with FAILED status and logs link |
| 209 | + |
| 210 | +- **Acceptance Criteria**: |
| 211 | + - `workflow_dispatch` from GitHub UI with `branch: main` runs the full cycle |
| 212 | + - PR check run shows in_progress → completed with correct conclusion |
| 213 | + - PR comment includes pass/fail, branch name, and link to workflow logs |
| 214 | + - No leftover Brev instances after the workflow completes |
| 215 | + |
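The reporting steps in Phase 2 reduce to three small mappings: the instance name fallback, the check-run conclusion, and the comment text. A minimal sketch with hypothetical function names; `success` and `failure` are values the check-runs API accepts for `conclusion`:

```shell
#!/usr/bin/env bash
# Sketch only: all function names here are hypothetical.

# instance_name PR_NUMBER RUN_ID
# Mirrors `pr-${{ inputs.pr_number || github.run_id }}` from the workflow.
instance_name() {
  local pr_number="$1" run_id="$2"
  printf 'pr-%s\n' "${pr_number:-$run_id}"
}

# check_conclusion EXIT_CODE
# Maps the test script's exit code to a check-run conclusion.
check_conclusion() {
  if [ "$1" -eq 0 ]; then
    echo success
  else
    echo failure
  fi
}

# comment_body SUITE EXIT_CODE BRANCH LOGS_URL
# Renders the PR comment line described in step 6.
comment_body() {
  local suite="$1" code="$2" branch="$3" url="$4" verdict=FAILED
  if [ "$code" -eq 0 ]; then
    verdict=PASSED
  fi
  printf 'Hardware E2E (%s): %s on branch `%s` [See logs](%s)\n' \
    "$suite" "$verdict" "$branch" "$url"
}
```

In the workflow, the output of `comment_body` could be passed to `gh pr comment "$PR_NUMBER" --body "..."`, and `check_conclusion` could supply the `conclusion` field when the check run is updated through `gh api`.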
| 216 | +## Phase 3: PR Comment Trigger [COMPLETED: 80a49d5] |
| 217 | + |
| 218 | +- **Description**: Workflow that lets named maintainers trigger hardware E2E via `/test-hardware` PR comment. |
| 219 | +- **Core Functionality**: Parse comment, validate author, dispatch hardware-e2e workflow. |
| 220 | +- **Dependencies**: Phase 2 (hardware-e2e workflow exists). |
| 221 | +- **New file**: `.github/workflows/hardware-e2e-trigger.yaml` |
| 222 | +- **Design**: |
| 223 | + - Trigger: `issue_comment` (created) |
| 224 | + - Guard: `github.event.issue.pull_request` exists AND comment contains `/test-hardware` |
| 225 | + - Maintainer allowlist (bash variable in workflow): `ericksoa kjw3 jacobtomlinson cv jyaunches` |
| 226 | + - Steps: |
| 227 | + 1. Check commenter login against allowlist |
| 228 | + 2. Extract PR head branch via `gh pr view` |
| 229 | + 3. `gh workflow run hardware-e2e.yaml -f branch=$BRANCH -f pr_number=$PR_NUMBER -f test_suite=all` |
| 230 | + - Non-maintainer comments silently ignored |
| 231 | +- **Unit Test Requirements**: |
| 232 | + |
| 233 | + Tests to Write: |
| 234 | + |
| 235 | + 1. **test_maintainer_triggers_workflow**: Verify allowed user's comment dispatches hardware-e2e |
| 236 | + 2. **test_non_maintainer_ignored**: Verify non-allowed user's comment does nothing |
| 237 | + 3. **test_non_pr_comment_ignored**: Verify comment on issue (not PR) does nothing |
| 238 | + 4. **test_correct_branch_extracted**: Verify PR head branch is passed to workflow dispatch |
| 239 | + |
| 240 | +- **Acceptance Criteria**: |
| 241 | + - Maintainer commenting `/test-hardware` triggers hardware-e2e with correct branch and PR number |
| 242 | + - Non-maintainer commenting `/test-hardware` does nothing (no error, no noise) |
| 243 | + - PR number is passed through so results report back to the correct PR |
| 244 | + |
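The Phase 3 guard and allowlist check can be sketched in bash as follows. The function names are hypothetical, and the sketch is not taken from the committed workflow. It matches whole logins only, since a substring test against the allowlist string would also accept a login such as `eric`.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.
# Allowlist copied from the Phase 3 design.
MAINTAINERS="ericksoa kjw3 jacobtomlinson cv jyaunches"

# is_maintainer LOGIN
# Exact, whole-word match against the allowlist.
is_maintainer() {
  local login="$1" candidate
  # Unquoted on purpose: split the allowlist on whitespace.
  for candidate in $MAINTAINERS; do
    if [ "$candidate" = "$login" ]; then
      return 0
    fi
  done
  return 1
}

# wants_hardware_test COMMENT_BODY
# True when the comment text contains /test-hardware.
wants_hardware_test() {
  case "$1" in
    *"/test-hardware"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

In a workflow step, the commenter login and comment body would typically be passed in as environment variables (from `github.event.comment.user.login` and `github.event.comment.body`) rather than interpolated into the script text, so that comment content cannot inject shell commands.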
| 245 | +## Phase 4: Clean the House |
| 246 | + |
| 247 | +- **Description**: Post-implementation cleanup and documentation. |
| 248 | +- **Tasks**: |
| 249 | + 1. **Remove Dead Code**: Remove any temporary test scripts created during development |
| 250 | + 2. **Update README.md**: Add section on hardware E2E — how to trigger (`/test-hardware`), what it tests, who can trigger, cost implications |
| 251 | + 3. **Update CONTRIBUTING.md**: Note that security PRs should be validated with `/test-hardware` before merge |
| 252 | + 4. **Document Brev token refresh**: How to regenerate `BREV_API_TOKEN` if it expires (re-run `brev login`, extract refresh token) |
| 253 | +- **Acceptance Criteria**: |
| 254 | + - No commented-out code blocks remain |
| 255 | + - Documentation reflects current state of infrastructure |
| 256 | + - Token refresh process is documented for future maintainers |
| 257 | + - All TODOs from implementation are resolved or documented |