
Commit 93e9f20

spec: warm pool E2E security test infrastructure
Adds specification for a warm-pool model that eliminates the 40+ min cold-start Docker build time from Brev E2E test runs.

Key design:
- Pre-warmed Brev instances using NemoClaw launchable
- Naming convention (e2e-warm-*) as state, no external tracking
- PR comment trigger (/test-brev <suites>) for maintainers
- Parallel test execution across pool instances
- Daily health check + 24h instance cycling
- Fallback to ephemeral mode if pool is empty

6 phases: pool script → warmer workflow → test runner → PR trigger → run security tests → cleanup

Also includes the original NVIDIA#813 hardware E2E spec for reference.

Refs: NVIDIA#813, NVIDIA#118, NVIDIA#119, NVIDIA#156
1 parent 4cfc3b3 commit 93e9f20

3 files changed

Lines changed: 788 additions & 1 deletion


.gitignore

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ Thumbs.db
 
 # Project-specific
 draft_newsletter_*
-specs/
+# specs/ — tracked for warm-pool-e2e spec
 vdr-notes/
 
 # Security: secrets, credentials, and keys
Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
# Ephemeral Brev E2E Test Infrastructure

## Overview & Objectives

### Problem Statement

Security-critical PRs (e.g., #156, credential sanitization for migration snapshots) cannot be properly tested in CI today. The existing test infrastructure has gaps:

- **Docker sandbox E2E** (PR workflow): Simulated environment using `node:22-slim`. No real OpenShell gateway, no actual sandbox isolation.
- **Nightly cloud E2E** (GitHub-hosted runner): Tests live inference via NVIDIA API but runs on `ubuntu-latest`, so there is no OpenShell, no sandbox creation, and no migration flow.
- **Unit tests**: PR #156's tests re-implement production logic locally rather than exercising the real `createSnapshotBundle()` code path.

There is no automated way to test the full stack: host OpenClaw credentials → migration snapshot → credential sanitization → sandbox bundle verification.

### Goals

1. Enable maintainers to trigger full-stack E2E tests against any PR branch
2. Use **ephemeral** Brev instances: create a fresh box per test run, tear it down when done
3. Report results back to the PR (check run + comment)
4. Zero persistent infrastructure to maintain between runs

### Non-Goals

- Replacing the existing Docker sandbox E2E or nightly cloud E2E (they remain as-is)
- Making this a required check on all PRs (it's on-demand, triggered by maintainers)
- GPU testing: OpenShell gateway runs fine on CPU (confirmed)
- Local inference / vLLM: not needed for sandbox and migration testing
## Current State Analysis

### Confirmed: Brev CLI Supports Full Ephemeral Lifecycle

All of the following have been validated:

| Capability | Command | Confirmed |
|-----------|---------|-----------|
| Headless auth | `brev login --token <REFRESH_TOKEN>` | Yes: refresh token from `~/.brev/credentials.json` works |
| Create instance | `brev create <name> --cpu 4x16` | Yes: non-interactive, returns when ready |
| SSH config update | `brev refresh` | Yes: writes `~/.ssh/config` entries, resolves instance name to IP |
| SSH access | `ssh <name> "..."` | Yes: works after `brev refresh` |
| Delete instance | `brev delete <name>` | Yes: non-interactive |

### Confirmed: CPU-Only Bootstrap Works

Validated on `agent-sandbox-f41b14` (Ubuntu 22.04, CPU-only, no GPU):

- `scripts/brev-setup.sh` installs Node.js v22, Docker, OpenShell v0.0.14
- OpenShell gateway starts without `--gpu` flag (CPU-only mode)
- Sandbox created and reaches Ready state
- No GPU driver, NVIDIA toolkit, or vLLM needed

### Required Env Vars for Bootstrap

| Variable | Purpose |
|----------|---------|
| `NVIDIA_API_KEY` | Inference provider configuration during onboarding |
| `GITHUB_TOKEN` | `gh release download` of OpenShell binary from NVIDIA/OpenShell |
| `NEMOCLAW_NON_INTERACTIVE=1` | Skip interactive prompts |
| `NEMOCLAW_SANDBOX_NAME` | Sandbox name for non-interactive onboarding |
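
A bootstrap wrapper could check the variables in the table above before doing any work. The following is a minimal sketch under that assumption: the `require_env` helper is hypothetical and is not part of `scripts/brev-setup.sh`. It relies on `printenv`, so it only sees exported variables.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repository): report required
# environment variables that are unset or empty.
require_env() {
  local missing=0 name
  for name in "$@"; do
    # printenv prints the value of an exported variable; empty output
    # means the variable is unset or set to the empty string.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing required env var: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call, left commented out so sourcing this file has no side effects:
# require_env NVIDIA_API_KEY GITHUB_TOKEN NEMOCLAW_NON_INTERACTIVE NEMOCLAW_SANDBOX_NAME || exit 1
```

Failing early on the runner is cheaper than discovering a missing key after the instance has been created and bootstrapped.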
### Existing Patterns to Reuse

- `bin/nemoclaw.js` deploy function (lines 136-206): rsync, SSH poll loop, `.env` SCP pattern
- `test/e2e/test-full-e2e.sh` (355 lines): 6-phase E2E test with pass/fail harness
- `test/e2e-test.sh` test #8 (lines 123-212): calls `createSnapshotBundle()` via real TypeScript plugin
- `pr-limit.yaml`: PR comment interaction pattern, maintainer allowlist
## Architecture Design

### Ephemeral Box Per Test Run

```mermaid
sequenceDiagram
    participant M as Maintainer
    participant GH as GitHub Actions (ubuntu-latest)
    participant BV as Brev (new instance)

    M->>GH: Comment /test-hardware on PR
    GH->>GH: Validate maintainer, extract branch
    GH->>GH: brev login --token $BREV_API_TOKEN
    GH->>BV: brev create pr-<number> --cpu 4x16
    GH->>GH: brev refresh (SSH config updated)
    GH->>BV: rsync PR branch code
    GH->>BV: SSH: pass secrets + run brev-setup.sh
    GH->>BV: SSH: run test suite
    BV-->>GH: exit code + streamed logs
    GH->>GH: brev delete pr-<number> (always)
    GH->>GH: Post check run + PR comment
```

**Why ephemeral (not persistent):**
- No state to clean between runs: every test starts fresh
- No idle cost: box only exists during the test
- Reproducible: same bootstrap every time, no drift
- Replaceable: no "the CI box is broken" problems

**Why not a self-hosted runner:**
- Security: fork PRs could execute arbitrary code
- Complexity: runner agent requires maintenance
- Cost: must be always-on to pick up jobs
### Auth Chain

```
GitHub Actions runner (ubuntu-latest)
  ├── brev login --token $BREV_API_TOKEN     (refresh token from ~/.brev/credentials.json)
  ├── brev create pr-<N> --cpu 4x16          (new ephemeral instance)
  ├── brev refresh                           (SSH config updated with new IP)
  └── ssh pr-<N> "export NVIDIA_API_KEY=... && export GITHUB_TOKEN=... && ..."
```

Only two repository secrets are needed: `BREV_API_TOKEN` and `NVIDIA_API_KEY`. `GITHUB_TOKEN` is the default workflow token.
## Configuration & Deployment Changes

### Secrets (GitHub Repository)

| Secret | Purpose | How to obtain |
|--------|---------|---------------|
| `NVIDIA_API_KEY` | Inference config during onboarding | Already exists |
| `BREV_API_TOKEN` | Brev CLI headless auth | `brev login` locally, then: `python3 -c "import json; print(json.load(open('$HOME/.brev/credentials.json'))['refresh_token'])"` |

`GITHUB_TOKEN` is automatically available in workflows as `github.token`.

### Dependencies

None. The Brev CLI is installed on the GitHub runner via curl, and all scripts use bash, ssh, and rsync.
## Implementation Phases

## Phase 1: Ephemeral Box Orchestration Script

- **Description**: Single bash script that manages the full lifecycle of an ephemeral Brev instance: create, bootstrap, run tests, destroy.
- **Core Functionality**: Create fresh Brev box, sync code, bootstrap, run test suite, capture results, tear down.
- **Dependencies**: Brev CLI available on PATH, `BREV_API_TOKEN` and `NVIDIA_API_KEY` env vars set.
- **New file**: `scripts/e2e-brev-test.sh`
- **Design**:
  - Inputs (env vars):
    - `BREV_API_TOKEN`: for `brev login --token`
    - `NVIDIA_API_KEY`: passed to VM for bootstrap
    - `GITHUB_TOKEN`: passed to VM for OpenShell download
    - `INSTANCE_NAME`: e.g. `pr-156-test` (caller provides, derived from PR number)
    - `TEST_SUITE`: which test to run (default: `full`)
  - Steps:
    1. `brev login --token $BREV_API_TOKEN`
    2. `brev create $INSTANCE_NAME --cpu 4x16 --detached`
    3. `brev refresh` then SSH poll loop (60 attempts, 5s apart) until `ssh $INSTANCE_NAME "echo ok"` succeeds
    4. `rsync -az --delete --exclude node_modules --exclude .git --exclude dist --exclude .venv ./ $INSTANCE_NAME:/home/ubuntu/nemoclaw/`
    5. SSH: `export NVIDIA_API_KEY=... && export GITHUB_TOKEN=... && export NEMOCLAW_NON_INTERACTIVE=1 && export NEMOCLAW_SANDBOX_NAME=e2e-test && bash scripts/brev-setup.sh`
    6. SSH: run selected test script, tee output to local log file
    7. Capture remote exit code
    8. (always, via trap) `brev delete $INSTANCE_NAME`
    9. Exit with captured code
  - Uses `trap` to ensure `brev delete` runs even on script failure/interrupt
  - Reuses patterns from `bin/nemoclaw.js` deploy function (lines 136-206)
- **Unit Test Requirements**:

  Tests to Write:

  1. **test_creates_instance_with_correct_cpu_spec**: Verify `brev create` called with `--cpu 4x16`
  2. **test_ssh_poll_retries_on_failure**: Verify poll loop retries when SSH fails
  3. **test_ssh_poll_fails_after_timeout**: Verify script exits non-zero after 60 failed attempts
  4. **test_cleanup_runs_on_failure**: Verify `brev delete` is called even when test script fails
  5. **test_cleanup_runs_on_interrupt**: Verify `brev delete` is called on SIGINT/SIGTERM
  6. **test_exit_code_propagation**: Verify remote test failure becomes local exit code
  7. **test_rsync_excludes_correct_dirs**: Verify node_modules, .git, dist, .venv excluded

- **Acceptance Criteria**:
  - Running `scripts/e2e-brev-test.sh` creates a Brev instance, bootstraps it, runs tests, and deletes it
  - Instance is always deleted, even on test failure, bootstrap failure, or interrupt
  - Test output is streamed to stdout and saved to a local log file
  - Non-zero exit from remote test propagates to local exit code
  - `brev ls` shows no leftover instances after the script completes
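
The SSH poll loop and the trap-based teardown from the Phase 1 design could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of `scripts/e2e-brev-test.sh`: the function names `wait_for_ssh` and `run_with_cleanup` are hypothetical, and the probe and cleanup commands are passed in as arguments so the logic can be exercised without a real Brev instance.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and argument order are illustrative only.

# wait_for_ssh ATTEMPTS DELAY CMD...
# Runs CMD up to ATTEMPTS times, sleeping DELAY seconds after each failed
# attempt. Returns 0 on the first success, 1 if every attempt fails.
wait_for_ssh() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "instance not reachable after $attempts attempts" >&2
  return 1
}

# run_with_cleanup CLEANUP CMD...
# Runs CMD in a subshell whose EXIT trap evaluates the CLEANUP string, so
# cleanup happens on success, failure, or SIGINT/SIGTERM. The subshell's
# exit status is CMD's exit status (130 if interrupted by a signal).
run_with_cleanup() {
  local cleanup="$1"
  shift
  (
    trap "$cleanup" EXIT
    trap 'exit 130' INT TERM
    "$@"
    exit "$?"
  )
}

# Example wiring with the values from the spec (commented out so that
# sourcing this file does not touch any real infrastructure):
# wait_for_ssh 60 5 ssh "$INSTANCE_NAME" "echo ok"
# run_with_cleanup 'brev delete "$INSTANCE_NAME"' ssh "$INSTANCE_NAME" "bash run-tests.sh"
```

Taking the commands as arguments is a design choice for testability: the unit tests listed above can substitute stubs for `ssh` and `brev delete`. The remote script name `run-tests.sh` in the commented example is a placeholder.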
## Phase 2: GitHub Workflow (Hardware E2E)

- **Description**: GitHub Actions workflow that triggers the ephemeral test run and reports results back to a PR.
- **Core Functionality**: Trigger via dispatch or reusable call, run `e2e-brev-test.sh`, report results.
- **Dependencies**: Phase 1 (script exists). `BREV_API_TOKEN` secret configured.
- **New file**: `.github/workflows/hardware-e2e.yaml`
- **Design**:
  - Triggers: `workflow_dispatch` (with inputs: branch, pr_number, test_suite) + `workflow_call` (same inputs, for reuse)
  - Permissions: `contents: read`, `checks: write`, `pull-requests: write`
  - Concurrency: single group `hardware-e2e`, cancel-in-progress
  - Guard: `if: github.repository == 'NVIDIA/NemoClaw'`
  - Timeout: 30 minutes (instance creation ~3min + bootstrap ~10min + tests ~15min)
  - Steps:
    1. Checkout `inputs.branch`
    2. If `pr_number` provided: create GitHub check run (in_progress) via `gh api repos/.../check-runs`
    3. Install Brev CLI: `curl -fsSL https://raw.githubusercontent.com/brevdev/brev-cli/main/bin/install-latest.sh | bash`
    4. Run `scripts/e2e-brev-test.sh` with env vars:
       - `BREV_API_TOKEN: ${{ secrets.BREV_API_TOKEN }}`
       - `NVIDIA_API_KEY: ${{ secrets.NVIDIA_API_KEY }}`
       - `GITHUB_TOKEN: ${{ github.token }}`
       - `INSTANCE_NAME: pr-${{ inputs.pr_number || github.run_id }}`
       - `TEST_SUITE: ${{ inputs.test_suite }}`
    5. (always) Update check run to completed with pass/fail conclusion
    6. (always) Post PR comment: "Hardware E2E (suite): PASSED/FAILED on branch `X` [See logs](url)"
    7. Upload logs artifact on failure
- **Unit Test Requirements**:

  Tests to Write:

  1. **test_workflow_dispatch_triggers_correctly**: Verify workflow accepts branch, pr_number, test_suite inputs
  2. **test_check_run_created_for_pr**: Verify check run is created when pr_number is provided
  3. **test_check_run_skipped_without_pr**: Verify no check run when pr_number is empty
  4. **test_comment_posted_on_success**: Verify PR comment with PASSED status
  5. **test_comment_posted_on_failure**: Verify PR comment with FAILED status and logs link

- **Acceptance Criteria**:
  - `workflow_dispatch` from GitHub UI with `branch: main` runs the full cycle
  - PR check run shows in_progress → completed with correct conclusion
  - PR comment includes pass/fail, branch name, and link to workflow logs
  - No leftover Brev instances after the workflow completes
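
The reporting steps in the Phase 2 design reduce to a few small mappings: script exit code to check-run conclusion, inputs to instance name, and result to comment text. A minimal sketch follows, assuming they are factored into shell helpers; the function names `conclusion_for`, `instance_name`, and `format_comment` are hypothetical and do not come from the workflow itself.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the reporting steps; names are illustrative.

# Map the test script's exit code to a check-run conclusion.
# "success" and "failure" are valid GitHub check-run conclusion values.
conclusion_for() {
  if [ "$1" -eq 0 ]; then
    echo success
  else
    echo failure
  fi
}

# Derive the instance name: prefer the PR number, fall back to the run ID,
# mirroring `pr-${{ inputs.pr_number || github.run_id }}`.
instance_name() {
  printf 'pr-%s\n' "${1:-$2}"
}

# Build the PR comment body in the format given in step 6.
# Arguments: SUITE BRANCH EXIT_CODE LOGS_URL
format_comment() {
  local suite="$1" branch="$2" code="$3" url="$4" status
  if [ "$code" -eq 0 ]; then
    status=PASSED
  else
    status=FAILED
  fi
  printf 'Hardware E2E (%s): %s on branch `%s` [See logs](%s)\n' \
    "$suite" "$status" "$branch" "$url"
}
```

In a workflow step the comment could then be posted with something like `gh pr comment "$PR_NUMBER" --body "$(format_comment "$TEST_SUITE" "$BRANCH" "$EXIT_CODE" "$RUN_URL")"`, where the variable names are placeholders for values the workflow would supply.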
## Phase 3: PR Comment Trigger [COMPLETED: 80a49d5]

- **Description**: Workflow that lets named maintainers trigger hardware E2E via `/test-hardware` PR comment.
- **Core Functionality**: Parse comment, validate author, dispatch hardware-e2e workflow.
- **Dependencies**: Phase 2 (hardware-e2e workflow exists).
- **New file**: `.github/workflows/hardware-e2e-trigger.yaml`
- **Design**:
  - Trigger: `issue_comment` (created)
  - Guard: `github.event.issue.pull_request` exists AND comment contains `/test-hardware`
  - Maintainer allowlist (bash variable in workflow): `ericksoa kjw3 jacobtomlinson cv jyaunches`
  - Steps:
    1. Check commenter login against allowlist
    2. Extract PR head branch via `gh pr view`
    3. `gh workflow run hardware-e2e.yaml -f branch=$BRANCH -f pr_number=$PR_NUMBER -f test_suite=all`
  - Non-maintainer comments silently ignored
- **Unit Test Requirements**:

  Tests to Write:

  1. **test_maintainer_triggers_workflow**: Verify allowed user's comment dispatches hardware-e2e
  2. **test_non_maintainer_ignored**: Verify non-allowed user's comment does nothing
  3. **test_non_pr_comment_ignored**: Verify comment on issue (not PR) does nothing
  4. **test_correct_branch_extracted**: Verify PR head branch is passed to workflow dispatch

- **Acceptance Criteria**:
  - Maintainer commenting `/test-hardware` triggers hardware-e2e with correct branch and PR number
  - Non-maintainer commenting `/test-hardware` does nothing (no error, no noise)
  - PR number is passed through so results report back to the correct PR
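
The allowlist check and comment guard described in the Phase 3 design could be sketched as follows. This assumes the allowlist lives in a space-separated bash variable, as the spec states; the function names `is_maintainer` and `has_trigger` are hypothetical, and the committed workflow may structure this differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the trigger guard; function names are illustrative.

# Space-separated allowlist, as listed in the spec.
MAINTAINERS="ericksoa kjw3 jacobtomlinson cv jyaunches"

# Return 0 only if the login exactly matches one allowlist entry.
# Whole-word comparison avoids substring matches (e.g. "c" vs "cv").
is_maintainer() {
  local login="$1" m
  for m in $MAINTAINERS; do
    if [ "$m" = "$login" ]; then
      return 0
    fi
  done
  return 1
}

# Return 0 if the comment body contains the /test-hardware command.
has_trigger() {
  case "$1" in
    */test-hardware*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A workflow step would call both and exit quietly when either fails, which matches the requirement that non-maintainer comments are silently ignored.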
## Phase 4: Clean the House

- **Description**: Post-implementation cleanup and documentation.
- **Tasks**:
  1. **Remove Dead Code**: Remove any temporary test scripts created during development
  2. **Update README.md**: Add section on hardware E2E covering how to trigger (`/test-hardware`), what it tests, who can trigger, and cost implications
  3. **Update CONTRIBUTING.md**: Note that security PRs should be validated with `/test-hardware` before merge
  4. **Document Brev token refresh**: How to regenerate `BREV_API_TOKEN` if it expires (re-run `brev login`, extract refresh token)
- **Acceptance Criteria**:
  - No commented-out code blocks remain
  - Documentation reflects current state of infrastructure
  - Token refresh process is documented for future maintainers
  - All TODOs from implementation are resolved or documented
