BS Buster

Don't blame the model. Measure the harness.

47 deterministic checks | Passive observation | 4 harness targets | Ablation attribution

BS Buster is a CLI for evaluating AI agent harnesses. It measures the orchestration layer behind GitHub Copilot, Claude Code, Codex, and custom agent workflows so you can separate model problems from harness problems.

Best for: AI engineering teams, agent platform builders, developer tooling teams, and anyone comparing coding assistants in real workflows.

Install

Install locally in your project:

npm install bs-buster

That's it. No build step, no config files, no Docker. One command.

After installing, all commands use npx:

npx bs-buster init
npx bs-buster start
npx bs-buster stop
npx bs-buster report

Or install globally

npm install -g bs-buster
bs-buster init    # no npx needed when global

Insert coin. Bust BS.

Quick Start

Note: If you installed locally (not globally), you must use npx to run commands. Running bs-buster start directly won't work — use npx bs-buster start instead.

1. Initialize (one time)

npx bs-buster init

BS Buster Setup
Don't blame the model. Measure the harness.
───────────────────────────────────────────

Scanning for installed harnesses...

  [✓] Claude Code     Found .jsonl files in ~/.claude/projects (high confidence)
  [✓] GitHub Copilot  Found extension(s) in ~/.vscode-insiders (high confidence)
  [✗] OpenAI Codex    Not detected
  [✓] Generic         Filesystem watcher available for any harness

Select your primary harness:
  1) Claude Code (recommended)
  2) GitHub Copilot
  3) Generic

Enter number [1]: 2

Configuration saved to .bs-buster/config.json

init auto-detects Claude Code, Codex, and Copilot (including VS Code Insiders). It also checks your project root for .claude/, CLAUDE.md, and .github/copilot-instructions.md, then saves the config so later commands need no flags.

2. Observe

npx bs-buster start       # Starts background observer
# ... use your harness normally ...
npx bs-buster stop        # Stops observer, finalizes data

3. Report

npx bs-buster report      # Generates an HTML report

The report includes:

Overall score (0–100) and letter grade (A+ through F)
Per-pillar pass rates across 47 checks
Attribution analysis: harness failures vs. model failures vs. ambiguous
Observation coverage: what data the observer could and couldn't capture
Cross-pillar interactions: emergent capabilities between pillars
Ranked recommendations for improving your harness

Report formats: --format html (default), --format json, --format markdown
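
To illustrate how a 0–100 score could map onto a letter grade, here is a minimal sketch. The cutoffs are an assumption (a conventional academic scale), not BS Buster's documented thresholds, and `toLetterGrade` is a hypothetical name rather than part of the package.

```typescript
// Hypothetical grade scale: conventional academic cutoffs, NOT taken from BS Buster.
const GRADE_CUTOFFS: ReadonlyArray<readonly [number, string]> = [
  [97, "A+"], [93, "A"], [90, "A-"],
  [87, "B+"], [83, "B"], [80, "B-"],
  [77, "C+"], [73, "C"], [70, "C-"],
  [67, "D+"], [63, "D"], [60, "D-"],
];

// Map a 0-100 score to a letter grade; anything under the last cutoff is an F.
function toLetterGrade(score: number): string {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`score must be between 0 and 100, got ${score}`);
  }
  for (const [min, grade] of GRADE_CUTOFFS) {
    if (score >= min) return grade;
  }
  return "F";
}
```

With these assumed cutoffs, toLetterGrade(91.1) returns "A-" and toLetterGrade(83.7) returns "B", which agrees with the two grades reported in the case study in this README.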

[Animation: Ghostbusters-style attribution sequence in which model blame is diagnosed as a harness issue]

Run the checks. Find the actual culprit.

CLI Reference

npx bs-buster <command> [options]

| Command | Description |
| --- | --- |
| `npx bs-buster init` | Auto-detect harnesses, save config |
| `npx bs-buster start` | Start background observer |
| `npx bs-buster stop` | Stop observer, finalize events |
| `npx bs-buster status` | Check if observer is running |
| `npx bs-buster report` | Generate evaluation report |
| `npx bs-buster sessions` | List past observation sessions |

| Option | Description | Default |
| --- | --- | --- |
| `--target, -t` | `claude-code`, `codex`, `copilot`, `generic` | auto (from init) |
| `--dir, -d` | Workspace directory to watch | `.` |
| `--format, -f` | Report format: `html`, `json`, `markdown` | `html` |
| `--session, -s` | Session ID for report | latest |
| `--harness-name` | Label the harness being tested (e.g., "Ruflo Swarm") | auto |
| `--harness-desc` | Custom description for the harness in reports | auto |

# Override saved config:
npx bs-buster start --target codex --dir ./my-project
npx bs-buster report --session abc123 --format json
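
The override behaviour above implies a precedence order: built-in defaults, then the config saved by init, then command-line flags. A minimal sketch of that resolution follows. `resolveOptions` and the `Options` shape are hypothetical, and using `generic` as the fallback target when nothing is saved is an assumption; the `dir` and `format` defaults come from the options table.

```typescript
type Target = "claude-code" | "codex" | "copilot" | "generic";
type ReportFormat = "html" | "json" | "markdown";

interface Options {
  target: Target;
  dir: string;
  format: ReportFormat;
}

// "." and "html" are the documented defaults; "generic" is an assumed fallback.
const DEFAULTS: Options = { target: "generic", dir: ".", format: "html" };

// Highest priority first: CLI flags, then saved config, then defaults.
function resolveOptions(
  saved: Partial<Options>,
  flags: Partial<Options>,
): Options {
  return {
    target: flags.target ?? saved.target ?? DEFAULTS.target,
    dir: flags.dir ?? saved.dir ?? DEFAULTS.dir,
    format: flags.format ?? saved.format ?? DEFAULTS.format,
  };
}

// Mirrors: npx bs-buster start --target codex --dir ./my-project
const resolved = resolveOptions(
  { target: "copilot" },
  { target: "codex", dir: "./my-project" },
);
// resolved is { target: "codex", dir: "./my-project", format: "html" }
```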

The Problem

The AI industry has a hand-waving problem. When an agent fails — hallucinating, looping, destroying production data — the diagnosis is always the same: "the model needs to be smarter."

This is the model attribution error, and it is the most expensive misdiagnosis in AI engineering.

Same model, different harness = different outcomes. The model was fine. The orchestration wasn't.

"Your agent's reliability problem is not a model problem. We can prove it."

| The BS Claim | What People Say | What the Data Shows |
| --- | --- | --- |
| Model Blame | "We need a smarter model" | Same model, different harness = different outcomes |
| Benchmark Theater | "Our model scores 92% on SWE-bench" | Benchmarks test models in isolation. Production agents fail at the harness layer |
| Prompt Engineering | "We just need better prompts" | Prompts break on model updates. Harness guarantees persist across models |
| Context Window Copium | "We need a bigger context window" | You have 200K tokens and filled them with raw dumps. Lifecycle management was nonexistent |
| Scale Solves Everything | "Next year's model will fix it" | Next year's model still needs stopping rules, tool schemas, and a policy layer |
| Alignment Hand-Wave | "The model doesn't understand consequences" | You handed it `rm -rf` and `git push --force` with no permission gate |

The industry's default diagnosis, visualized.

Real-World Case Study: Harness Quality Matters

Two real harness evaluations. Same evaluation framework. Different orchestration maturity.

| Metric | Ruflo Swarm (A-) | ATV Agent Orchestrator (B) |
| --- | --- | --- |
| Underlying Assistant | Claude Code | GitHub Copilot |
| Overall Score | 91.1 / 100 | 83.7 / 100 |
| Grade | A- | B |
| Observation Coverage | 100% | 75% (no token tracking) |
| Turns / Tool Calls | 12 turns, 16 tool calls | 381 turns, 103 tool calls |
| Worst Pillar | Context Assembly at 87.5% (4 of 5 pillars at 100%) | Loop Discipline at 33.3% (4 of 5 pillars at 100%) |
| Attribution (harness vs model) | 0% harness failures; all minor issues attributed to model behavior | 0% harness failures; eval framework-specific checks were corrected |
| Infinite Loop Signature | Zero infinite loops, zero idle turns, explicit termination | 381 turns with 99% duplicate tool calls, 26 idle turns, no checkpoints |

Don't blame the model. Measure the harness. Both Claude Code and GitHub Copilot are capable coding assistants -- the difference in scores reflects harness orchestration maturity, not model quality. Ruflo's mature harness design -- iteration ceilings, idle detection, progress tracking -- prevented the failure modes that ATV's orchestrator still exhibits. The remaining gap is a harness engineering gap: loop discipline and checkpoint strategy, not model intelligence.
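
The loop-signature figures in the table (duplicate tool calls, idle turns) are the kind of metric that can be derived from a reconstructed turn log. The sketch below shows one way to compute them. The `Turn` and `ToolCall` shapes are hypothetical, and the definitions used here (a duplicate is a call whose name and arguments match an earlier call; an idle turn has no tool calls and no output) are assumptions, not BS Buster's documented rules.

```typescript
// Hypothetical shapes for illustration; BS Buster's real AgentOutput may differ.
interface ToolCall { name: string; args: string } // args as a serialized string
interface Turn { toolCalls: ToolCall[]; producedOutput: boolean }

interface LoopSignature {
  turns: number;
  toolCalls: number;
  duplicateRatio: number; // share of tool calls that repeat an earlier call
  idleTurns: number;      // turns with no tool calls and no output
}

function loopSignature(turns: Turn[]): LoopSignature {
  const seen = new Set<string>();
  let toolCalls = 0;
  let duplicates = 0;
  let idleTurns = 0;

  for (const turn of turns) {
    if (turn.toolCalls.length === 0 && !turn.producedOutput) idleTurns++;
    for (const call of turn.toolCalls) {
      toolCalls++;
      // Name plus arguments identifies a call; an exact repeat counts as a duplicate.
      const key = JSON.stringify([call.name, call.args]);
      if (seen.has(key)) duplicates++;
      else seen.add(key);
    }
  }

  return {
    turns: turns.length,
    toolCalls,
    duplicateRatio: toolCalls === 0 ? 0 : duplicates / toolCalls,
    idleTurns,
  };
}
```

Because the function reads only its input and keeps no state between calls, the same turn log always produces the same signature.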

Full reports: Ruflo Swarm (A-) | ATV Agent Orchestrator (B) | Comparative Analysis

When the harness is missing a policy layer, even HAL has receipts.

The Five Pillars

| # | Pillar | What It Measures | Checks |
| --- | --- | --- | --- |
| 1 | Context Assembly | System prompt, tool declarations, state injection | 7 |
| 2 | Tool Integrity | Schema validation, error messages, self-correction | 8 |
| 3 | Loop Discipline | Stopping rules, iteration ceilings, stall detection | 10 |
| 4 | Policy Enforcement | Permission gates, destructive action blocking | 8 |
| 5 | Context Lifecycle | Token management, compaction, delegation | 8 |
| - | Cross-Pillar | Emergent interactions between pillars | 6 |
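
As a quick sanity check on the table, the per-pillar counts sum to the 47 checks quoted throughout. The sketch below encodes the table as data and shows one way a per-pillar pass rate could be computed. `passRates` and `CheckResult` are hypothetical names, and computing the rate over only the checks that produced a result is an assumption.

```typescript
interface PillarSpec { name: string; checks: number }

// Check counts copied from the pillar table.
const PILLARS: PillarSpec[] = [
  { name: "Context Assembly", checks: 7 },
  { name: "Tool Integrity", checks: 8 },
  { name: "Loop Discipline", checks: 10 },
  { name: "Policy Enforcement", checks: 8 },
  { name: "Context Lifecycle", checks: 8 },
  { name: "Cross-Pillar", checks: 6 },
];

// 7 + 8 + 10 + 8 + 8 + 6 = 47
const totalChecks = PILLARS.reduce((sum, p) => sum + p.checks, 0);

// Hypothetical result record: which pillar a check belongs to and whether it passed.
interface CheckResult { pillar: string; pass: boolean }

// Pass rate per pillar as a percentage of the results supplied for that pillar.
function passRates(results: CheckResult[]): Map<string, number> {
  const tally = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const t = tally.get(r.pillar) ?? { passed: 0, total: 0 };
    t.total++;
    if (r.pass) t.passed++;
    tally.set(r.pillar, t);
  }
  const rates = new Map<string, number>();
  tally.forEach((t, pillar) => rates.set(pillar, (100 * t.passed) / t.total));
  return rates;
}
```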

How It Works

BS Buster uses passive observation — it watches your harness operate in its natural environment and reconstructs what happened from side effects. No synthetic sandboxes, no test doubles.

npx bs-buster start     →  Observer watches harness in background
  (use your harness normally)
npx bs-buster stop      →  Finalize event collection
npx bs-buster report    →  Reconstruct → 47 Checks → Attribution → Score → HTML Report

[Figure: evaluation flow from passive observation to scoring and reporting]

Observation Pipeline

fs.watch / file tailing
    → Observer emits ObserverEvents
        → EventCollector writes JSONL to disk
            → reconstructOutput() builds AgentOutput
                → evaluateObservation() runs 47 checks
                    → Scoring engine → HarnessReport

Every check is a pure function: (AgentOutput) → { pass, detail }. No model calls. No randomness. Reproducible across runs.
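
To make the pure-function shape concrete, here is a minimal sketch of what a Policy Enforcement check could look like. It is not one of the 47 shipped checks: the `AgentOutput` and `ToolCall` fields are simplified, hypothetical stand-ins, and `destructiveActionsGated` is an invented name. The two command patterns come from the `rm -rf` and `git push --force` example in The Problem.

```typescript
// Hypothetical, simplified shapes; the real AgentOutput has its own fields.
interface ToolCall { tool: string; command: string; approved: boolean }
interface AgentOutput { toolCalls: ToolCall[] }
interface CheckResult { pass: boolean; detail: string }

// Destructive commands named in this README: rm -rf and a forced git push.
const DESTRUCTIVE: RegExp[] = [
  /\brm\s+-rf\b/,
  /\bgit\s+push\b.*\s(--force|-f)\b/,
];

// Pure function: no model calls, no I/O, no randomness.
// The same AgentOutput always yields the same CheckResult.
function destructiveActionsGated(output: AgentOutput): CheckResult {
  const ungated = output.toolCalls.filter(
    (c) => !c.approved && DESTRUCTIVE.some((re) => re.test(c.command)),
  );
  return ungated.length === 0
    ? { pass: true, detail: "No ungated destructive commands observed" }
    : {
        pass: false,
        detail: `${ungated.length} destructive command(s) ran without approval`,
      };
}
```

The patterns carry no global flag, so `test` keeps no state between calls and repeated runs over the same input stay reproducible.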

Supported Harnesses

| Target | Observer Method | Captures |
| --- | --- | --- |
| Claude Code | JSONL tail + workspace `fs.watch` | Model responses, tool calls, token usage, file changes |
| Codex | JSON output tail + workspace `fs.watch` | Model responses, tool calls, token usage, file changes |
| Copilot | Workspace snapshot + `fs.watch` + `.git` watcher | File changes, git activity |
| Generic | Workspace `fs.watch` with glob patterns | File changes (any harness) |
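
Tailing a log that grows while the harness runs means a read can end in the middle of a line. The sketch below shows one common way to handle that for newline-delimited JSON: carry the unfinished tail over to the next chunk. `parseJsonlChunk` and `TailResult` are hypothetical names and are not part of the bs-buster API; how the real observers buffer input is not documented here.

```typescript
// Result of feeding one chunk of newly appended file content to the parser.
interface TailResult<T> {
  events: T[];       // objects parsed from complete lines
  remainder: string; // trailing partial line, to be prepended to the next chunk
  skipped: number;   // complete lines that were not valid JSON
}

function parseJsonlChunk<T = unknown>(
  remainder: string,
  chunk: string,
): TailResult<T> {
  const lines = (remainder + chunk).split("\n");
  // The last element is "" if the chunk ended on a newline, else an incomplete line.
  const rest = lines.pop() ?? "";
  const events: T[] = [];
  let skipped = 0;

  for (const line of lines) {
    const trimmed = line.trim(); // also drops a trailing \r from CRLF files
    if (trimmed === "") continue;
    try {
      events.push(JSON.parse(trimmed) as T);
    } catch {
      skipped++; // count malformed lines instead of aborting the whole tail
    }
  }

  return { events, remainder: rest, skipped };
}
```

A caller would keep the returned `remainder` and pass it back in with the next chunk, so an event split across two reads is parsed once it is complete.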

Programmatic API

import {
  evaluateObservation,
  reconstructOutput,
  EventCollector,
  generateHtmlReport,
} from "bs-buster";

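// Placeholder identifiers for illustration; substitute the values from your own session.
const sessionId = "abc123";
const target = "claude-code";
const harnessId = "my-harness"; // hypothetical label
const modelId = "my-model";     // hypothetical label

// Read collected events
const events = EventCollector.readEvents("path/to/session.events.jsonl");

// Reconstruct agent output from raw events
const result = reconstructOutput(events, sessionId, target, harnessId, modelId);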

// Run 47-check evaluation
const report = await evaluateObservation(
  result.output,
  "claude-code",
  result.observation_coverage
);

console.log(`Score: ${report.overall_score} (${report.overall_grade})`);

// Generate standalone HTML report
const html = generateHtmlReport({
  session_id: sessionId,
  target: "claude-code",
  observation_coverage: result.observation_coverage,
  warnings: result.warnings,
  evaluation: report,
  summary: { turns: 830, tool_calls: 976, total_tokens: 366570, duration_ms: 0 },
});

Project Structure

src/
├── cli/                                 # CLI entry point
│   ├── index.ts                         #   Argument parsing, 6 commands
│   ├── init-wizard.ts                   #   Guided setup with auto-detection
│   ├── harness-detector.ts              #   Scans system for installed harnesses
│   ├── html-report.ts                   #   Self-contained HTML report generator
│   ├── config.ts                        #   .bs-buster/config.json persistence
│   └── daemon.ts                        #   Background process with PID management
├── observer/                            # Passive observation engine
│   ├── types.ts                         #   HarnessTarget, ObserverEvent, HarnessObserver
│   ├── eval-bridge.ts                   #   Bridges observation → eval pipeline
│   ├── event-collector.ts               #   JSONL event writer/reader
│   ├── observer-registry.ts             #   Factory for target-specific observers
│   ├── observers/                       #   Per-harness observer implementations
│   │   ├── claude-code.observer.ts      #     JSONL tail + workspace watcher
│   │   ├── codex.observer.ts            #     JSON output tail + workspace watcher
│   │   ├── copilot.observer.ts          #     Workspace snapshot + fs.watch + git
│   │   └── generic.observer.ts          #     Glob-filtered filesystem watcher
│   └── reconstruction/
│       └── output-builder.ts            #   Events → AgentOutput reconstruction
└── eval/                                # Evaluation engine
    ├── types.ts                         #   40+ interfaces
    ├── check-registry.ts                #   Registration, lookup, lazy ESM loading
    ├── checks/                          #   47 deterministic checks
    ├── scoring/                         #   Weighted composite scoring + attribution
    ├── reporters/                       #   JSON & Markdown renderers
    └── harnesses/                       #   Ablation testing

Documentation

Document	What It Is
Agent Harness Whitepaper	The thesis: five pillars framework, failure modes, architectural patterns
BS Buster Philosophy	How 47 checks kill the model attribution error
Architecture	System architecture, module design, data flow
Eval Methodology	Phil Schmid adaptation for harness testing
Pillar Strategy	Per-pillar check design and metrics
Comparative Analysis	Real-world A- vs B harness comparison: Ruflo Swarm vs ATV Agent Orchestrator

Requirements

Node.js ≥ 18
One runtime dependency: zod (schema validation)
Works on macOS, Linux, and Windows

47 checks that don't lie.

npm · GitHub · Issues

MIT License
