feat(benchmark): add multipass VM integration test suite by pszymkowiak · Pull Request #830 · rtk-ai/rtk

pszymkowiak · 2026-03-25T19:47:59Z

Summary

Bun/TypeScript orchestrator for comprehensive RTK testing via multipass VM
Ubuntu 24.04 VM with cloud-init provisioning (Rust, Go, Node, Python, .NET, Terraform, Helm, etc.)
103 tests, 11 phases, 100% pass rate

Test phases

Phase	Tests	Description
Build	3	cargo fmt, clippy, test
Rust commands	47	git, ls, grep, cargo, pytest, go, tsc, docker...
TOML filters	21	df, ps, shellcheck, hadolint, helm, terraform...
Rewrite engine	17	Basic rewrites, compound commands, shell builtins
Exit codes	4	Success + failure propagation
Token savings	4	git log 62%, ls 70%, log dedup 100%, read 92%
Pipes	3	stdout piping compatibility
Edge cases	3	Empty output, unicode, ANSI
Performance	1	Memory < 20MB
Concurrency	1	10 parallel executions

Usage

bun run scripts/benchmark/run.ts           # Full suite (~3 min, ~15 min first run)
bun run scripts/benchmark/run.ts --quick   # Skip perf/concurrency
bun run scripts/benchmark/run.ts --phase 3 # Single phase
bun run scripts/benchmark/cleanup.ts       # Delete VM
bun run scripts/benchmark/rebuild.ts       # Fast rebuild

Prerequisites

brew install multipass
# bun already required by project

Test plan

103/103 tests passing (100%)
Token savings avg 81%
VM creation + cloud-init + build + tests in ~3 min (reuse VM)

Bun/TypeScript orchestrator that creates an Ubuntu 24.04 VM via multipass, installs all dev tools (Rust, Go, Node, Python, .NET, Terraform, etc.), builds RTK, and runs 103 tests across 11 phases: - Cargo quality (fmt, clippy, test) - 47 Rust built-in commands (git, ls, grep, cargo, pytest, go, tsc...) - 21 TOML filter commands (df, ps, shellcheck, hadolint, helm...) - Hook rewrite engine (17 rewrite assertions) - Exit code preservation - Token savings verification (avg 81%) - Pipe compatibility - Edge cases (unicode, ANSI, empty output) - Performance (memory < 20MB) - Concurrency (10 parallel executions) Usage: bun run scripts/benchmark/run.ts # Full suite (~3 min) bun run scripts/benchmark/run.ts --quick # Skip perf/concurrency bun run scripts/benchmark/run.ts --phase 3 # Single phase bun run scripts/benchmark/cleanup.ts # Delete VM bun run scripts/benchmark/rebuild.ts # Fast rebuild Prerequisites: brew install multipass, bun Signed-off-by: Patrick szymkowiak <patrick.szymkowiak@innovtech.eu>

…lure - Detect binary files (null bytes in first 8KB) before filtering - Show clear message: [binary file] path (size) instead of empty output - Fallback to raw content if filter produces empty output on non-empty file - Prevents LLM from concluding a 70MB file is "empty" (was #822) Fixes #822 Signed-off-by: Patrick szymkowiak <patrick.szymkowiak@innovtech.eu>

FlorianBruniaux · 2026-03-25T21:06:42Z

Good contribution — real Linux integration tests fill a gap that macOS unit tests can't cover. The VM reuse pattern is practical, cloud-init provisioning is thorough, and the phase-based architecture (vm/test/report separation) is clean. The testSavings approach (raw vs RTK byte count comparison) is the right way to verify token savings claims.

A few things worth addressing before merge:

Binary size limit changed without docs update

The PR bumps the limit to 8MB but CLAUDE.md specifies <5MB stripped as a release blocker. ARM Linux binaries being larger is a valid reason to adjust, but the constraint should be updated in CLAUDE.md explicitly — otherwise future contributors will be confused by the mismatch.

"any" exit code overuse weakens test value

"any" only checks that the process didn't die from a signal (exit < 128). Commands like cargo:build, cargo:test, cargo:clippy, go:test, go:build run against known-good test projects and should be expected to exit 0. Right now they'd pass even if RTK broke these commands entirely.

For Phase 2, the fmt/clippy comment says they may fail because of non-Rust files — but .ts files don't affect cargo fmt --all --check. Worth testing whether they actually pass before using "any" there.

Known RTK bug documented instead of fixed

// NOTE: rtk err swallows exit code (known bug) — test output only, not exit code
await testCmd("runner:err", `${RTK} err false`, "any");

This should be a separate tracked issue, not a permanent workaround in the test suite.

--phase NaN edge case

const phaseOnly = args.includes("--phase")
  ? parseInt(args[args.indexOf("--phase") + 1], 10)
  : null;

If --phase is the last arg, parseInt(undefined, 10) returns NaN. NaN === phase is always false, so no phases run silently. Add a Number.isNaN() guard.

failed <= 2 considered "READY FOR RELEASE"

Two failing tests shouldn't be an acceptable release state — it creates a budget for flaky tests to accumulate. Either fix the flaky tests or skip them explicitly with a reason.

No CI integration is understandable (multipass in GHA is non-trivial), but worth a follow-up issue to wire this into the pre-release workflow.

- Strict exit codes: cargo/python/go tests expect exact exit codes instead of "any" (catches real regressions) - Binary size limit documented: 8MB for ARM Linux VM vs 5MB x86 stripped - --phase NaN guard: error message instead of silent no-op - Verdict: 0 failures = READY, any failure = NOT READY (no more "minor issues" budget) - rtk err exit code bug tracked in #846 Signed-off-by: Patrick szymkowiak <patrick.szymkowiak@innovtech.eu>

pszymkowiak · 2026-03-26T08:02:11Z

Thanks for the thorough review @FlorianBruniaux — all 5 points addressed in e91a4dc:

Binary size: documented why 8MB (ARM Linux VM vs 5MB x86 stripped target in CLAUDE.md)
Exit codes: strict now — cargo:test expects 101, pytest expects 1, go:test expects 1, etc. No more blanket "any"
rtk err bug: tracked in bug: rtk err swallows exit code (returns 0 instead of underlying exit code) #846 (referenced in code comment)
--phase NaN: error + exit 1 instead of silent no-op
Verdict: removed the failed <= 2 budget — any failure = NOT READY

Current run: 101/103 (98%) — the 2 failures are cargo fmt and cargo clippy on the develop source code (pre-existing formatting issue in src/read.rs), not the test infra.

Signed-off-by: Patrick szymkowiak <patrick.szymkowiak@innovtech.eu>

aeppling · 2026-03-26T18:38:15Z

Hey

We are cleaning up the codebase and improving the project structure for better onboarding. As part of this effort, PR #826 reorganizes src/ from a flat layout into subfolders.

No logic changes — only file moves and import path updates.

What you need to do

Rebase your branch on develop when receiving this comment:

git fetch origin && git rebase origin/develop

Git detects renames automatically. If you get import conflicts, update the paths:

use crate::git;        // now: use crate::cmds::git::git;
use crate::tracking;   // now: use crate::core::tracking;
use crate::config;     // now: use crate::core::config;
use crate::init;       // now: use crate::hooks::init;
use crate::gain;       // now: use crate::analytics::gain;

Need help rebasing? Tag @aeppling

pszymkowiak requested a review from aeppling March 25, 2026 19:49

pszymkowiak assigned aeppling Mar 25, 2026

pszymkowiak mentioned this pull request Mar 26, 2026

bug: rtk err swallows exit code (returns 0 instead of underlying exit code) #846

Open

feat(benchmark): add Swift ecosystem tests (6 commands + savings)

758aea2

Signed-off-by: Patrick szymkowiak <patrick.szymkowiak@innovtech.eu>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(benchmark): add multipass VM integration test suite#830

feat(benchmark): add multipass VM integration test suite#830
pszymkowiak wants to merge 4 commits intodevelopfrom
feat/multipass-integration-tests

pszymkowiak commented Mar 25, 2026

Uh oh!

FlorianBruniaux commented Mar 25, 2026

Uh oh!

pszymkowiak commented Mar 26, 2026

Uh oh!

aeppling commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

pszymkowiak commented Mar 25, 2026

Summary

Test phases

Usage

Prerequisites

Test plan

Uh oh!

FlorianBruniaux commented Mar 25, 2026

Uh oh!

pszymkowiak commented Mar 26, 2026

Uh oh!

aeppling commented Mar 26, 2026

What you need to do

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants