Skip to content

feat(benchmark): add multipass VM integration test suite#830

Open
pszymkowiak wants to merge 4 commits intodevelopfrom
feat/multipass-integration-tests
Open

feat(benchmark): add multipass VM integration test suite#830
pszymkowiak wants to merge 4 commits intodevelopfrom
feat/multipass-integration-tests

Conversation

@pszymkowiak
Copy link
Collaborator

Summary

  • Bun/TypeScript orchestrator for comprehensive RTK testing via multipass VM
  • Ubuntu 24.04 VM with cloud-init provisioning (Rust, Go, Node, Python, .NET, Terraform, Helm, etc.)
  • 103 tests, 11 phases, 100% pass rate

Test phases

Phase Tests Description
Build 3 cargo fmt, clippy, test
Rust commands 47 git, ls, grep, cargo, pytest, go, tsc, docker...
TOML filters 21 df, ps, shellcheck, hadolint, helm, terraform...
Rewrite engine 17 Basic rewrites, compound commands, shell builtins
Exit codes 4 Success + failure propagation
Token savings 4 git log 62%, ls 70%, log dedup 100%, read 92%
Pipes 3 stdout piping compatibility
Edge cases 3 Empty output, unicode, ANSI
Performance 1 Memory < 20MB
Concurrency 1 10 parallel executions

Usage

bun run scripts/benchmark/run.ts           # Full suite (~3 min, ~15 min first run)
bun run scripts/benchmark/run.ts --quick   # Skip perf/concurrency
bun run scripts/benchmark/run.ts --phase 3 # Single phase
bun run scripts/benchmark/cleanup.ts       # Delete VM
bun run scripts/benchmark/rebuild.ts       # Fast rebuild

Prerequisites

brew install multipass
# bun already required by project

Test plan

  • 103/103 tests passing (100%)
  • Token savings avg 81%
  • VM creation + cloud-init + build + tests in ~3 min (reuse VM)

Bun/TypeScript orchestrator that creates an Ubuntu 24.04 VM via multipass,
installs all dev tools (Rust, Go, Node, Python, .NET, Terraform, etc.),
builds RTK, and runs 103 tests across 11 phases:

- Cargo quality (fmt, clippy, test)
- 47 Rust built-in commands (git, ls, grep, cargo, pytest, go, tsc...)
- 21 TOML filter commands (df, ps, shellcheck, hadolint, helm...)
- Hook rewrite engine (17 rewrite assertions)
- Exit code preservation
- Token savings verification (avg 81%)
- Pipe compatibility
- Edge cases (unicode, ANSI, empty output)
- Performance (memory < 20MB)
- Concurrency (10 parallel executions)

Usage:
  bun run scripts/benchmark/run.ts           # Full suite (~3 min)
  bun run scripts/benchmark/run.ts --quick   # Skip perf/concurrency
  bun run scripts/benchmark/run.ts --phase 3 # Single phase
  bun run scripts/benchmark/cleanup.ts       # Delete VM
  bun run scripts/benchmark/rebuild.ts       # Fast rebuild

Prerequisites: brew install multipass, bun
Signed-off-by: Patrick szymkowiak <patrick.szymkowiak@innovtech.eu>
…lure

- Detect binary files (null bytes in first 8KB) before filtering
- Show clear message: [binary file] path (size) instead of empty output
- Fallback to raw content if filter produces empty output on non-empty file
- Prevents LLM from concluding a 70MB file is "empty" (was #822)

Fixes #822

Signed-off-by: Patrick szymkowiak <patrick.szymkowiak@innovtech.eu>
@FlorianBruniaux
Copy link
Collaborator

Good contribution — real Linux integration tests fill a gap that macOS unit tests can't cover. The VM reuse pattern is practical, cloud-init provisioning is thorough, and the phase-based architecture (vm/test/report separation) is clean. The testSavings approach (raw vs RTK byte count comparison) is the right way to verify token savings claims.

A few things worth addressing before merge:

Binary size limit changed without docs update

The PR bumps the limit to 8MB but CLAUDE.md specifies <5MB stripped as a release blocker. ARM Linux binaries being larger is a valid reason to adjust, but the constraint should be updated in CLAUDE.md explicitly — otherwise future contributors will be confused by the mismatch.

"any" exit code overuse weakens test value

"any" only checks that the process didn't die from a signal (exit < 128). Commands like cargo:build, cargo:test, cargo:clippy, go:test, go:build run against known-good test projects and should be expected to exit 0. Right now they'd pass even if RTK broke these commands entirely.

For Phase 2, the fmt/clippy comment says they may fail because of non-Rust files — but .ts files don't affect cargo fmt --all --check. Worth testing whether they actually pass before using "any" there.

Known RTK bug documented instead of fixed

// NOTE: rtk err swallows exit code (known bug) — test output only, not exit code
await testCmd("runner:err", `${RTK} err false`, "any");

This should be a separate tracked issue, not a permanent workaround in the test suite.

--phase NaN edge case

const phaseOnly = args.includes("--phase")
  ? parseInt(args[args.indexOf("--phase") + 1], 10)
  : null;

If --phase is the last arg, parseInt(undefined, 10) returns NaN. NaN === phase is always false, so no phases run silently. Add a Number.isNaN() guard.

failed <= 2 considered "READY FOR RELEASE"

Two failing tests shouldn't be an acceptable release state — it creates a budget for flaky tests to accumulate. Either fix the flaky tests or skip them explicitly with a reason.


No CI integration is understandable (multipass in GHA is non-trivial), but worth a follow-up issue to wire this into the pre-release workflow.

- Strict exit codes: cargo/python/go tests expect exact exit codes
  instead of "any" (catches real regressions)
- Binary size limit documented: 8MB for ARM Linux VM vs 5MB x86 stripped
- --phase NaN guard: error message instead of silent no-op
- Verdict: 0 failures = READY, any failure = NOT READY (no more "minor issues" budget)
- rtk err exit code bug tracked in #846

Signed-off-by: Patrick szymkowiak <patrick.szymkowiak@innovtech.eu>
@pszymkowiak
Copy link
Collaborator Author

Thanks for the thorough review @FlorianBruniaux — all 5 points addressed in e91a4dc:

  1. Binary size: documented why 8MB (ARM Linux VM vs 5MB x86 stripped target in CLAUDE.md)
  2. Exit codes: strict now — cargo:test expects 101, pytest expects 1, go:test expects 1, etc. No more blanket "any"
  3. rtk err bug: tracked in bug: rtk err swallows exit code (returns 0 instead of underlying exit code) #846 (referenced in code comment)
  4. --phase NaN: error + exit 1 instead of silent no-op
  5. Verdict: removed the failed <= 2 budget — any failure = NOT READY

Current run: 101/103 (98%) — the 2 failures are cargo fmt and cargo clippy on the develop source code (pre-existing formatting issue in src/read.rs), not the test infra.

Signed-off-by: Patrick szymkowiak <patrick.szymkowiak@innovtech.eu>
@aeppling
Copy link
Contributor

Hey

We are cleaning up the codebase and improving the project structure for better onboarding. As part of this effort, PR #826 reorganizes src/ from a flat layout into subfolders.

No logic changes — only file moves and import path updates.

What you need to do

Rebase your branch on develop when receiving this comment:

git fetch origin && git rebase origin/develop

Git detects renames automatically. If you get import conflicts, update the paths:

use crate::git;        // now: use crate::cmds::git::git;
use crate::tracking;   // now: use crate::core::tracking;
use crate::config;     // now: use crate::core::config;
use crate::init;       // now: use crate::hooks::init;
use crate::gain;       // now: use crate::analytics::gain;

Need help rebasing? Tag @aeppling

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants