adam-paterson
Contributor

@adam-paterson adam-paterson commented Sep 18, 2025

Summary

  • add standalone benchmarking project under benchmarks/terminal_bench_agent using uv
  • introduce Specify workflow mixin and OpenCode agent integration for Spec -> Plan -> Tasks runs
  • document benchmark usage in main README and detailed benchmarking README

TODO

  • Cleanup installed agent runner
  • Add ability to specify end user tool to run benchmarks with (claude, opencode, codex)
  • Fix "Current directory is not empty" notifications during benchmark test run
  • GitHub Actions for new releases

Testing

  • uv run tb run --dataset terminal-bench-core==head --task-id hello-world --agent-import-path specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent
  • uv run tb run --dataset deveval==head --task-id python-particle-swarm-optimization-implementation --agent-import-path specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent (fails at dataset fetch due to remote availability)
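
The `--agent-import-path` values above use a `module:Class` shape. As a rough illustration of how a string in that shape can be turned into a class object, here is a minimal sketch. `resolve_import_path` is a hypothetical helper written for this note, not Terminal Bench's actual loader, and the standard-library target is only a stand-in so the snippet runs without the benchmark package installed.

```python
import importlib


def resolve_import_path(import_path: str) -> type:
    """Resolve a 'package.module:ClassName' string to the class object."""
    module_name, sep, class_name = import_path.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected 'module:Class', got {import_path!r}")
    # Import the module part, then look up the class by attribute name.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in target from the standard library (assumption for illustration).
ordered_dict_cls = resolve_import_path("collections:OrderedDict")
print(ordered_dict_cls.__name__)
```

A typo in either half of the path surfaces as `ModuleNotFoundError` or `AttributeError`, which is distinct from the dataset fetch failure noted for the second command.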

asciicast

@Copilot Copilot AI review requested due to automatic review settings September 18, 2025 13:13
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds Terminal Bench benchmarking capabilities to the Spec Kit repository by introducing a standalone benchmarking project that integrates with the Specify workflow methodology. The implementation provides workflow-aware agents that can run the Spec -> Plan -> Tasks process in isolated Terminal Bench environments.

Key changes:

  • Standalone benchmarking project structure under benchmarks/terminal_bench_agent with uv project management
  • Workflow mixin and agent implementations for OpenCode and Claude Code integration
  • Documentation updates in main README and dedicated benchmarking guide

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

File | Description
benchmarks/terminal_bench_agent/specify_terminal_bench/agent.py | Core agent implementations with workflow mixin and provider-specific agents
benchmarks/terminal_bench_agent/pyproject.toml | Standalone uv project configuration for benchmarking dependencies
benchmarks/terminal_bench_agent/README.md | Detailed benchmarking setup and usage documentation
README.md | Main documentation updates with benchmarking section
benchmarks/terminal_bench_agent/specify_terminal_bench/__init__.py | Package exports for agent classes
Setup script templates | Shell scripts for OpenCode and Claude Code environment preparation
Configuration files | Git ignore patterns and code ownership assignments

        # through the raw task instruction and let the CLI command wrap it.
        return instruction

    def _run_agent_commands(self, instruction: str) -> list[TerminalCommand]:

Copilot AI Sep 18, 2025

Missing import for TerminalCommand. This class is used in the return type annotation and instantiated on line 154, but it's not imported at the top of the file.

Copilot uses AI. Check for mistakes.
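
To make the failure mode in this comment concrete: a name that is only used inside a method body is looked up when the method runs, so a missing import tends to surface at call time rather than at import time. The sketch below assumes a stand-in `TerminalCommand` dataclass, because the module that provides the real class is not shown in this thread; in the actual file the fix would be an import, not a local definition.

```python
from dataclasses import dataclass


def build_commands(instruction: str) -> list:
    # Mirrors the reviewed method: TerminalCommand is resolved only when called.
    return [TerminalCommand(command=instruction)]


def try_build(instruction: str) -> str:
    """Call build_commands and report whether the name lookup succeeded."""
    try:
        build_commands(instruction)
    except NameError as exc:
        return f"NameError: {exc}"
    return "ok"


# The name is not defined yet, so the call fails.
before = try_build("echo hello")


@dataclass
class TerminalCommand:
    """Hypothetical stand-in for the class the real module would import."""

    command: str


# With the name available, the same call succeeds.
after = try_build("echo hello")
print(before)
print(after)
```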

    def __init__(self, model_name: str | None = None, *args, **kwargs):
        super().__init__(model_name=model_name or self._DEFAULT_MODEL, *args, **kwargs)

    def _render_instruction(self, instruction: str) -> str:  # type: ignore[override]

Copilot AI Sep 18, 2025

The SpecifyOpenCodeWorkflowAgent overrides _render_instruction to bypass the workflow mixin's implementation, which contradicts the class inheritance design. Consider using composition instead of inheritance, or restructuring the mixin to make this override pattern more explicit.

Suggested change
- def _render_instruction(self, instruction: str) -> str: # type: ignore[override]
+ def render_instruction(self, instruction: str) -> str:
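
One way to make the override pattern explicit, as the comment asks, is to have the mixin expose a public hook plus a class-level switch, so a subclass opts out of wrapping without overriding a private method. This is a sketch under assumptions: the class names, the `wrap_with_workflow` flag, and the wrapped string format are all hypothetical and are not taken from the PR's `agent.py`.

```python
class SpecifyWorkflowMixin:
    """Hypothetical mixin that wraps an instruction in workflow stages."""

    # Subclasses flip this flag instead of overriding a private method.
    wrap_with_workflow = True

    def render_instruction(self, instruction: str) -> str:
        if not self.wrap_with_workflow:
            # Pass the raw task instruction through unchanged.
            return instruction
        return f"Spec -> Plan -> Tasks: {instruction}"


class WrappingAgent(SpecifyWorkflowMixin):
    """Uses the mixin's default behaviour."""


class PassThroughAgent(SpecifyWorkflowMixin):
    """Opts out explicitly; no override and no type-ignore comment needed."""

    wrap_with_workflow = False


print(WrappingAgent().render_instruction("create hello.txt"))
print(PassThroughAgent().render_instruction("create hello.txt"))
```

The trade-off is that the mixin now has to know about the opt-out, which is arguably the point: the bypass becomes part of its documented contract.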

Comment on lines +8 to +9
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash

Copilot AI Sep 18, 2025

Downloading and executing shell scripts directly from the internet without verification poses a security risk. Consider adding checksum verification or using a more secure installation method.

Suggested change
- curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash
+ NVM_VERSION="v0.40.2"
+ NVM_INSTALL_URL="https://raw.githubusercontent.com/nvm-sh/nvm/${NVM_VERSION}/install.sh"
+ NVM_INSTALL_SH="/tmp/nvm-install.sh"
+ NVM_INSTALL_SH_SHA256="e1e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2" # <-- Replace with actual SHA256 from nvm release
+ curl -fsSL "$NVM_INSTALL_URL" -o "$NVM_INSTALL_SH"
+ # sha256sum -c expects "<hash>  <file>" with two spaces between the fields
+ echo "${NVM_INSTALL_SH_SHA256}  $NVM_INSTALL_SH" | sha256sum -c -
+ bash "$NVM_INSTALL_SH"
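
The check in this suggestion boils down to hashing the downloaded bytes and comparing the digest against a pinned value before anything is executed. A minimal sketch of that comparison follows, using only the standard library. The script bytes and the pinned digest are made up for illustration; a real pin would be recorded from a trusted source, not computed from the download itself.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the lowercase hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_checksum(data: bytes, expected_sha256: str) -> bool:
    """Compare the digest of data with the pinned value."""
    actual = sha256_hex(data)
    # compare_digest avoids short-circuiting on the first differing character.
    return hmac.compare_digest(actual, expected_sha256.strip().lower())


# Illustrative payload standing in for a downloaded install script.
script = b"#!/bin/sh\necho installed\n"
# Assumption: in practice this value is pinned ahead of time from a trusted source.
pinned = sha256_hex(script)

print(matches_checksum(script, pinned))                # untouched download
print(matches_checksum(script + b"rm -rf /\n", pinned))  # tampered download
```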

Comment on lines +6 to +7
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash

Copilot AI Sep 18, 2025

Downloading and executing shell scripts directly from the internet without verification poses a security risk. Consider adding checksum verification or using a more secure installation method.

Suggested change
- curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash
+ # Download nvm install script
+ NVM_VERSION="v0.40.2"
+ NVM_INSTALL_URL="https://raw.githubusercontent.com/nvm-sh/nvm/${NVM_VERSION}/install.sh"
+ NVM_INSTALL_SCRIPT="/tmp/nvm-install.sh"
+ curl -fsSL "$NVM_INSTALL_URL" -o "$NVM_INSTALL_SCRIPT"
+ # Expected SHA256 checksum for nvm v0.40.2 install.sh
+ EXPECTED_SHA256="e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3" # <-- Replace with actual checksum
+ ACTUAL_SHA256="$(sha256sum "$NVM_INSTALL_SCRIPT" | awk '{print $1}')"
+ if [ "$ACTUAL_SHA256" != "$EXPECTED_SHA256" ]; then
+   echo "ERROR: Checksum verification failed for nvm install.sh!"
+   echo "Expected: $EXPECTED_SHA256"
+   echo "Actual: $ACTUAL_SHA256"
+   exit 1
+ fi
+ bash "$NVM_INSTALL_SCRIPT"

Comment on lines +1 to +2
#!/bin/bash
set -euo pipefail

Copilot AI Sep 18, 2025

The script uses set -euo pipefail but line 8 pipes curl output to bash, which could mask curl failures. Consider using intermediate error checking or separating the download and execution steps.
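
A note on the mechanics: with `pipefail` set, a failed `curl` does make the pipeline exit non-zero, but `bash` may already have run whatever bytes arrived before the failure, which is why separating the steps helps. The sketch below illustrates that separation in general terms: each step runs to completion and is checked before the next one starts. `run_steps` is a hypothetical helper, and the Python one-liners are stand-ins for the download and install commands so the example runs without network access.

```python
import subprocess
import sys


def run_steps(steps: list[list[str]]) -> int:
    """Run commands in order; return how many completed before a failure."""
    completed = 0
    for argv in steps:
        result = subprocess.run(argv, check=False)
        if result.returncode != 0:
            # Stop here so later steps never see a partial result.
            break
        completed += 1
    return completed


# Stand-ins (assumptions): a download that fails, one that succeeds, an installer.
failing_download = [sys.executable, "-c", "import sys; sys.exit(22)"]
ok_download = [sys.executable, "-c", "pass"]
run_installer = [sys.executable, "-c", "print('installer ran')"]

# The installer is never started when the download step fails.
steps_after_failure = run_steps([failing_download, run_installer])
# Both steps run when the download succeeds.
steps_after_success = run_steps([ok_download, run_installer])
print(steps_after_failure, steps_after_success)
```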

@adam-paterson adam-paterson changed the title from "Draft: Add Terminal Bench benchmarking workflow" to "🚧 [WIP] Add Terminal Bench benchmarking workflow" Sep 18, 2025