feat: remove benchmark and harbor commands by jwm4 · Pull Request #396 · ambient-code/agentready

jwm4 · 2026-05-06T19:08:06Z

Summary

Removes agentready benchmark, agentready validate-assessor, and agentready harbor commands, which measured Claude Code performance on generic Terminal-Bench tasks unrelated to AgentReady and had statistical flaws (independence violations, insufficient power for plausible effect sizes)
Also removes the unregistered eval-harness CLI and its services, which shared the same tbench-based approach and were already inaccessible dead code
Cleans up all associated models, services, reporters, templates, tests, docs, and specs (47 files, 11,443 lines removed)
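
To make the "insufficient power" point concrete, here is a rough sketch of the textbook normal-approximation sample size for comparing two pass rates. The pass rates, significance level, and target power are illustrative assumptions, not figures from the removed benchmark, and the formula assumes independent runs, which is the very assumption the independence violations break.

```python
import math
from statistics import NormalDist


def runs_per_arm(p_base: float, p_treat: float,
                 alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate runs needed per arm to tell p_base from p_treat.

    Uses the two-sided, two-proportion z-test approximation and
    assumes every run is an independent observation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for target power
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    effect = p_treat - p_base
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: a 50% baseline pass rate and a 5-point improvement.
print(runs_per_arm(0.50, 0.55))
```

Under these assumed numbers the answer is roughly 1,560 runs per arm, and correlated runs would push the real requirement higher still.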

Closes #394

Test plan

agentready --help no longer shows benchmark or harbor commands
agentready assess . runs cleanly and produces a valid report
1,147 unit tests pass (7 pre-existing failures unrelated to this change)
black, isort, ruff all clean
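
The first check in the test plan could be automated along these lines. This is a minimal sketch that assumes the help output lists subcommands under a `Commands:` heading with two-space indentation, as Click-style CLIs do; the sample help text and the helper names are hypothetical, not taken from the repository.

```python
import re

# Command names this PR removes, per the summary above.
REMOVED_COMMANDS = {"benchmark", "validate-assessor", "harbor"}


def listed_commands(help_text: str) -> set[str]:
    """Collect command names from the 'Commands:' section of help text."""
    commands: set[str] = set()
    in_commands = False
    for line in help_text.splitlines():
        if line.strip() == "Commands:":
            in_commands = True
        elif in_commands and line and not line[0].isspace():
            in_commands = False  # an unindented line starts a new section
        elif in_commands:
            # Exactly two leading spaces: skips wrapped description lines.
            match = re.match(r" {2}([\w-]+)(?:\s|$)", line)
            if match:
                commands.add(match.group(1))
    return commands


def removed_commands_still_listed(help_text: str) -> set[str]:
    """Return any removed command that still shows up in the help text."""
    return listed_commands(help_text) & REMOVED_COMMANDS


# Hypothetical help output, for illustration only.
SAMPLE_HELP = """\
Usage: agentready [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  align   (description omitted)
  assess  (description omitted)
"""

print(removed_commands_still_listed(SAMPLE_HELP))  # prints set()
```

In a real test the help text would come from running `agentready --help` and capturing its output rather than from a hard-coded string.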

🤖 Generated with Claude Code under the supervision of Bill Murdock

The benchmark and harbor commands measured Claude Code performance on generic Terminal-Bench tasks unrelated to AgentReady, so they validated nothing about the tool's core claims. They also had statistical flaws (independence violations, insufficient power for plausible effect sizes).

Also removes the unregistered eval-harness CLI and its services, which shared the same tbench-based approach and were already inaccessible dead code.

Closes #394

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-05-06T19:08:20Z

Warning

`.coderabbit.yaml` has a parsing error

The CodeRabbit configuration file in this repository has a parsing error and default settings were used instead. Please fix the error(s) in the configuration file. You can initialize chat with CodeRabbit to get help with the configuration file.

💥 Parsing errors (1)

Validation error: String must contain at most 250 character(s) at "tone_instructions"

⚙️ Configuration instructions

Please see the configuration documentation for more information.
You can also validate your configuration using the online YAML validator.
If your editor has a YAML language server enabled, you can add the following line at the top of this file to enable auto-completion and validation: `# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json`
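
The parsing error above comes down to a length limit on `tone_instructions`. A quick way to see how far over a candidate value is, sketched under the assumption that the validator counts characters the same way Python's `len` does (which the error message does not confirm):

```python
# Limit quoted in the validation error above.
MAX_TONE_INSTRUCTIONS_CHARS = 250


def tone_instructions_overflow(value: str,
                               limit: int = MAX_TONE_INSTRUCTIONS_CHARS) -> int:
    """Return how many characters `value` exceeds `limit` by (0 if it fits)."""
    return max(0, len(value) - limit)


# Hypothetical value, for illustration only.
candidate = "Be concise and direct. " * 12
print(tone_instructions_overflow(candidate))
```

A result of 0 means the value fits; anything larger is the number of characters to trim.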

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: dc120156-b772-461e-acf1-078e7219b776

📥 Commits

Reviewing files that changed from the base of the PR and between facd1a2 and 1300c82.

📒 Files selected for processing (47)

README.md
docs/harbor-comparison-guide.md
patches/harbor-task-filtering-fix.patch
repos-for-benchmark.txt
specs/002-harbor-real-integration/DOUBLEAGENT_IMPACT.md
specs/002-harbor-real-integration/checklists/requirements.md
specs/002-harbor-real-integration/contracts/aggregation-output-schema.json
specs/002-harbor-real-integration/contracts/harbor-results-schema.json
specs/002-harbor-real-integration/data-model.md
specs/002-harbor-real-integration/plan.md
specs/002-harbor-real-integration/quickstart.md
specs/002-harbor-real-integration/research.md
specs/002-harbor-real-integration/spec.md
specs/002-harbor-real-integration/tasks.md
src/agentready/cli/benchmark.py
src/agentready/cli/eval_harness.py
src/agentready/cli/harbor.py
src/agentready/cli/main.py
src/agentready/models/harbor.py
src/agentready/reporters/harbor_markdown.py
src/agentready/services/eval_harness/__init__.py
src/agentready/services/eval_harness/aggregator.py
src/agentready/services/eval_harness/assessor_tester.py
src/agentready/services/eval_harness/baseline.py
src/agentready/services/eval_harness/batch_runner.py
src/agentready/services/eval_harness/dashboard_generator.py
src/agentready/services/eval_harness/harbor_config.py
src/agentready/services/eval_harness/tbench_runner.py
src/agentready/services/harbor/__init__.py
src/agentready/services/harbor/agent_toggler.py
src/agentready/services/harbor/comparer.py
src/agentready/services/harbor/dashboard_generator.py
src/agentready/services/harbor/result_parser.py
src/agentready/services/harbor/runner.py
src/agentready/templates/harbor_comparison.html.j2
src/agentready/utils/__init__.py
src/agentready/utils/preflight.py
tests/unit/services/harbor/__init__.py
tests/unit/services/harbor/test_assessor_state_toggler.py
tests/unit/test_cli_benchmark.py
tests/unit/test_cli_harbor.py
tests/unit/test_eval_harness_cli.py
tests/unit/test_eval_harness_services.py
tests/unit/test_harbor_config.py
tests/unit/test_harbor_models.py
tests/unit/test_harbor_services.py
tests/unit/utils/test_preflight.py

💤 Files with no reviewable changes (45)

src/agentready/cli/eval_harness.py
tests/unit/test_cli_benchmark.py
specs/002-harbor-real-integration/contracts/aggregation-output-schema.json
src/agentready/services/eval_harness/baseline.py
src/agentready/services/eval_harness/tbench_runner.py
src/agentready/cli/harbor.py
src/agentready/services/eval_harness/dashboard_generator.py
specs/002-harbor-real-integration/spec.md
docs/harbor-comparison-guide.md
specs/002-harbor-real-integration/quickstart.md
README.md
src/agentready/services/eval_harness/harbor_config.py
tests/unit/test_eval_harness_cli.py
tests/unit/test_cli_harbor.py
src/agentready/services/harbor/runner.py
src/agentready/services/harbor/dashboard_generator.py
specs/002-harbor-real-integration/DOUBLEAGENT_IMPACT.md
src/agentready/cli/benchmark.py
tests/unit/services/harbor/test_assessor_state_toggler.py
src/agentready/models/harbor.py
tests/unit/test_harbor_services.py
src/agentready/services/harbor/__init__.py
tests/unit/test_eval_harness_services.py
src/agentready/services/eval_harness/batch_runner.py
src/agentready/reporters/harbor_markdown.py
src/agentready/utils/preflight.py
tests/unit/utils/test_preflight.py
specs/002-harbor-real-integration/tasks.md
repos-for-benchmark.txt
specs/002-harbor-real-integration/data-model.md
src/agentready/services/eval_harness/aggregator.py
specs/002-harbor-real-integration/contracts/harbor-results-schema.json
src/agentready/utils/__init__.py
src/agentready/services/eval_harness/assessor_tester.py
src/agentready/services/eval_harness/__init__.py
tests/unit/test_harbor_config.py
src/agentready/services/harbor/agent_toggler.py
src/agentready/services/harbor/result_parser.py
specs/002-harbor-real-integration/research.md
specs/002-harbor-real-integration/checklists/requirements.md
patches/harbor-task-filtering-fix.patch
specs/002-harbor-real-integration/plan.md
src/agentready/services/harbor/comparer.py
src/agentready/cli/main.py
tests/unit/test_harbor_models.py

Walkthrough

This PR removes Harbor/Benchmark functionality from the codebase. The benchmark measures agent performance on unrelated generic coding tasks and has fundamental statistical issues (non-independent observations, insufficient statistical power). The PR deletes CLI commands, service modules, documentation, tests, and configuration related to Harbor benchmarking, then rewires the CLI to replace the benchmark command with a lightweight align command.

Changes

Harbor/Benchmark Feature Removal

| Layer | File(s) | Summary |
| --- | --- | --- |
| Data Models & Configuration | `src/agentready/models/harbor.py`, `src/agentready/services/eval_harness/harbor_config.py`, `repos-for-benchmark.txt` | HarborConfig, model definitions, and benchmark repository list are deleted. |
| Service Layer | `src/agentready/services/eval_harness/baseline.py`, `assessor_tester.py`, `batch_runner.py`, `aggregator.py`, `src/agentready/services/harbor/runner.py`, `comparer.py`, `agent_toggler.py`, `dashboard_generator.py`, `result_parser.py`, `src/agentready/services/harbor/__init__.py` | Benchmark orchestration, result parsing, statistical comparison, and dashboard generation modules are deleted. |
| CLI Command & Integration | `src/agentready/cli/benchmark.py`, `src/agentready/cli/main.py` | Benchmark CLI command (including `benchmark`, `_run_tbench`, `validate_assessor`) is removed. Main CLI is rewired to remove harbor from lazy-loading and replace eager benchmark import with align; assess-batch, experiment, and extract-skills are added to lazy-command mapping. |
| Documentation & Specifications | `README.md`, `docs/harbor-comparison-guide.md`, `specs/002-harbor-real-integration/*` | Harbor CLI installation guide is removed from README. Harbor integration specification directory (DOUBLEAGENT_IMPACT.md, plan.md, data-model.md, spec.md, tasks.md, research.md, checklists/requirements.md, contracts/harbor-results-schema.json) is deleted. |
| Tests, Utilities & Supporting Files | `tests/unit/test_cli_benchmark.py`, `test_cli_harbor.py`, `test_harbor_config.py`, `test_harbor_models.py`, `test_harbor_services.py`, `test_eval_harness_services.py`, `test_eval_harness_cli.py`, `tests/unit/services/harbor/test_assessor_state_toggler.py`, `src/agentready/utils/__init__.py`, `src/agentready/utils/preflight.py`, `src/agentready/services/eval_harness/__init__.py`, `src/agentready/reporters/harbor_markdown.py`, `src/agentready/cli/eval_harness.py`, `src/agentready/cli/harbor.py`, `src/agentready/services/eval_harness/tbench_runner.py`, `patches/harbor-task-filtering-fix.patch` | All Harbor-related and benchmark-related test files are deleted. Preflight utility exports for Harbor CLI checking and Terminal-Bench dataset validation are removed from utils module. Harbor reporter, Harbor CLI module, eval-harness CLI module, and Harbor task-filtering patch file are removed. |
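
The "lazy-command mapping" mentioned for the main CLI can be pictured as a name-to-import-path table whose modules are imported only when a command is invoked. The sketch below illustrates that general pattern with the standard library only; it is not the project's actual `main.py`, and the mapping entries are stand-ins chosen so the example runs anywhere.

```python
import importlib
from typing import Callable

# Hypothetical mapping: command name -> "module:attribute".
# Stdlib targets stand in for real command modules.
LAZY_COMMANDS = {
    "dumps": "json:dumps",
    "sqrt": "math:sqrt",
}


def load_command(name: str, mapping: dict[str, str] = LAZY_COMMANDS) -> Callable:
    """Import the module behind `name` only when the command is requested."""
    try:
        target = mapping[name]
    except KeyError:
        raise LookupError(f"unknown command: {name}") from None
    module_name, attribute = target.split(":", 1)
    module = importlib.import_module(module_name)  # deferred import happens here
    return getattr(module, attribute)


print(load_command("sqrt")(9.0))  # prints 3.0
```

With this shape, removing a command amounts to deleting its mapping entry and its module; the other entries are unaffected because nothing is imported until it is requested.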

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: remove benchmark and harbor commands' clearly and concisely describes the main change: removal of specific CLI commands, which aligns perfectly with the changeset. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the rationale for removing benchmark/harbor/eval-harness commands and referencing the associated cleanup of models, services, tests, and docs. |
| Linked Issues check | ✅ Passed | The PR successfully implements the core requirement from `#394`: removes benchmark and harbor commands due to measuring unrelated Terminal-Bench tasks with statistical flaws (independence violations and insufficient power for plausible effect sizes). |
| Out of Scope Changes check | ✅ Passed | All changes are in scope: removal of benchmark/harbor/eval-harness commands, associated CLI integration changes, and cleanup of related code, services, tests, specs, and documentation as justified in `#394`. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

github-actions · 2026-05-06T19:10:10Z

📈 Test Coverage Report

| Branch | Coverage |
| --- | --- |
| This PR | 75.4% |
| Main | 69.4% |
| Diff | ✅ +6% |

Coverage calculated from unit tests only

jwm4 requested a review from kami619 May 6, 2026 19:14

jwm4 wants to merge 1 commit into main from remove-benchmark-harbor
