🚧 [WIP] Add Terminal Bench benchmarking workflow #339

adam-paterson · 2025-09-18T13:13:29Z

Summary

- Add a standalone benchmarking project under benchmarks/terminal_bench_agent, managed with uv
- Introduce a Specify workflow mixin and an OpenCode agent integration for Spec -> Plan -> Tasks runs
- Document benchmark usage in the main README and in a detailed benchmarking README

TODO

- Clean up the installed agent runner
- Add the ability to choose which end-user tool runs the benchmarks (claude, opencode, codex)
- Fix "Current directory is not empty" notifications during benchmark test runs
- GitHub Actions for new releases

Testing

uv run tb run --dataset terminal-bench-core==head --task-id hello-world --agent-import-path specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent
uv run tb run --dataset deveval==head --task-id python-particle-swarm-optimization-implementation --agent-import-path specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent (fails at dataset fetch due to remote availability)

See discussion github#159

Copilot

Pull Request Overview

This PR adds Terminal Bench benchmarking capabilities to the Spec Kit repository by introducing a standalone benchmarking project that integrates with the Specify workflow methodology. The implementation provides workflow-aware agents that can run the Spec -> Plan -> Tasks process in isolated Terminal Bench environments.

Key changes:

Standalone benchmarking project structure under benchmarks/terminal_bench_agent with uv project management
Workflow mixin and agent implementations for OpenCode and Claude Code integration
Documentation updates in main README and dedicated benchmarking guide

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `benchmarks/terminal_bench_agent/specify_terminal_bench/agent.py` | Core agent implementations with workflow mixin and provider-specific agents |
| `benchmarks/terminal_bench_agent/pyproject.toml` | Standalone uv project configuration for benchmarking dependencies |
| `benchmarks/terminal_bench_agent/README.md` | Detailed benchmarking setup and usage documentation |
| `README.md` | Main documentation updates with benchmarking section |
| `benchmarks/terminal_bench_agent/specify_terminal_bench/__init__.py` | Package exports for agent classes |
| Setup script templates | Shell scripts for OpenCode and Claude Code environment preparation |
| Configuration files | Git ignore patterns and code ownership assignments |


Copilot · 2025-09-18T13:15:27Z

benchmarks/terminal_bench_agent/specify_terminal_bench/agent.py

+        # through the raw task instruction and let the CLI command wrap it.
+        return instruction
+
+    def _run_agent_commands(self, instruction: str) -> list[TerminalCommand]:


Missing import for TerminalCommand. This class is used in the return type annotation and instantiated on line 154, but it's not imported at the top of the file.

Copilot · 2025-09-18T13:15:27Z

benchmarks/terminal_bench_agent/specify_terminal_bench/agent.py

+    def __init__(self, model_name: str | None = None, *args, **kwargs):
+        super().__init__(model_name=model_name or self._DEFAULT_MODEL, *args, **kwargs)
+
+    def _render_instruction(self, instruction: str) -> str:  # type: ignore[override]


The SpecifyOpenCodeWorkflowAgent overrides _render_instruction to bypass the workflow mixin's implementation, which contradicts the class inheritance design. Consider using composition instead of inheritance, or restructuring the mixin to make this override pattern more explicit.

Suggested change

-    def _render_instruction(self, instruction: str) -> str:  # type: ignore[override]
+    def render_instruction(self, instruction: str) -> str:
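One way to make the override pattern explicit, as the comment above asks, is to keep `_render_instruction` in the mixin and let subclasses opt out through a small hook. The sketch below is illustrative only: `SpecifyWorkflowMixin`, `PassThroughAgent`, `_wrap_with_workflow`, and the wrapper text are hypothetical names and values, not the classes in this PR.

```python
class SpecifyWorkflowMixin:
    """Hypothetical stand-in for the PR's workflow mixin."""

    def _render_instruction(self, instruction: str) -> str:
        # Template method: the mixin owns the rendering logic and
        # consults a hook instead of being overridden wholesale.
        if not self._wrap_with_workflow():
            return instruction
        return f"Spec -> Plan -> Tasks\n\n{instruction}"

    def _wrap_with_workflow(self) -> bool:
        # Default behaviour: wrap the task in the workflow preamble.
        return True


class PassThroughAgent(SpecifyWorkflowMixin):
    """Hypothetical agent whose CLI command wraps the raw instruction itself."""

    def _wrap_with_workflow(self) -> bool:
        # Explicit opt-out; no type: ignore[override] is needed because
        # the hook's signature matches the base class.
        return False
```

With this shape the subclass states its intent through the hook, and the mixin's method keeps a single signature.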

Copilot · 2025-09-18T13:15:28Z

benchmarks/terminal_bench_agent/specify_terminal_bench/opencode-setup.sh.j2

+curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash
+


Downloading and executing shell scripts directly from the internet without verification poses a security risk. Consider adding checksum verification or using a more secure installation method.

Suggested change

-curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash
+NVM_VERSION="v0.40.2"
+NVM_INSTALL_URL="https://raw.githubusercontent.com/nvm-sh/nvm/${NVM_VERSION}/install.sh"
+NVM_INSTALL_SH="/tmp/nvm-install.sh"
+NVM_INSTALL_SH_SHA256="e1e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2" # <-- Replace with actual SHA256 from nvm release
+curl -fsSL "$NVM_INSTALL_URL" -o "$NVM_INSTALL_SH"
+echo "${NVM_INSTALL_SH_SHA256}  $NVM_INSTALL_SH" | sha256sum -c -
+bash "$NVM_INSTALL_SH"

Copilot · 2025-09-18T13:15:28Z

benchmarks/terminal_bench_agent/specify_terminal_bench/claude-code-setup.sh.j2

+curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash
+


Downloading and executing shell scripts directly from the internet without verification poses a security risk. Consider adding checksum verification or using a more secure installation method.

Suggested change

-curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash
+# Download nvm install script
+NVM_VERSION="v0.40.2"
+NVM_INSTALL_URL="https://raw.githubusercontent.com/nvm-sh/nvm/${NVM_VERSION}/install.sh"
+NVM_INSTALL_SCRIPT="/tmp/nvm-install.sh"
+curl -fsSL "$NVM_INSTALL_URL" -o "$NVM_INSTALL_SCRIPT"
+# Expected SHA256 checksum for nvm v0.40.2 install.sh
+EXPECTED_SHA256="e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3" # <-- Replace with actual checksum
+ACTUAL_SHA256="$(sha256sum "$NVM_INSTALL_SCRIPT" | awk '{print $1}')"
+if [ "$ACTUAL_SHA256" != "$EXPECTED_SHA256" ]; then
+  echo "ERROR: Checksum verification failed for nvm install.sh!"
+  echo "Expected: $EXPECTED_SHA256"
+  echo "Actual:   $ACTUAL_SHA256"
+  exit 1
+fi
+bash "$NVM_INSTALL_SCRIPT"
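The download, verify, then execute idea behind both suggestions can also be sketched in Python, which may be convenient if the check ever moves out of the shell templates and into the agent code. This is a minimal illustration using only the standard library; `sha256_of` and `verify_sha256` are hypothetical helper names, and the expected digest must come from a trusted source, since the values in the suggestions are placeholders.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path: Path, expected_hex: str) -> None:
    """Raise ValueError unless the file's digest matches expected_hex."""
    actual = sha256_of(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(
            f"checksum mismatch for {path}: expected {expected_hex}, got {actual}"
        )
```

A caller would download the installer to a temporary path, call `verify_sha256(path, expected)`, and only then hand the file to `bash`.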

Copilot · 2025-09-18T13:15:28Z

benchmarks/terminal_bench_agent/specify_terminal_bench/opencode-setup.sh.j2

+#!/bin/bash
+set -euo pipefail


The script uses set -euo pipefail but line 8 pipes curl output to bash, which could mask curl failures. Consider using intermediate error checking or separating the download and execution steps.
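A small experiment can show what is at stake when a download is piped straight into a shell. In the sketch below, assuming `bash` is on the PATH, a subshell stands in for a `curl` that emits part of a script and then fails. With `pipefail` the overall pipeline does report the failure, but the downstream `bash` has already executed whatever it received. The script text and the `run_demo` name are illustrative, not taken from the PR.

```python
import subprocess

# The producer prints one command and then exits non-zero, imitating a
# download that fails partway through.
SCRIPT = """
set -euo pipefail
(echo 'echo ran-before-failure'; exit 1) | bash
echo 'not reached'
"""


def run_demo() -> subprocess.CompletedProcess:
    """Run the demo script in bash and capture its output and exit status."""
    return subprocess.run(
        ["bash", "-c", SCRIPT],
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    result = run_demo()
    print("exit status:", result.returncode)
    print("stdout:", result.stdout.strip())
```

Separating the download from the execution, as the suggestions above do, avoids running a partially received script at all.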

adam-paterson added 2 commits September 18, 2025 12:10

feat(benchmarks): initial terminal bench agent

dea522e

See discussion github#159

feat(benchmarks): add opencode runner with grok

891e563

Copilot AI review requested due to automatic review settings September 18, 2025 13:13

Copilot AI reviewed Sep 18, 2025


adam-paterson mentioned this pull request Sep 18, 2025

feat(cli): add --force flag for specify init #349

Closed

adam-paterson changed the title ~~Draft: Add Terminal Bench benchmarking workflow~~ 🚧 [WIP] Add Terminal Bench benchmarking workflow Sep 18, 2025


