🚧 [WIP] Add Terminal Bench benchmarking workflow #339
base: main
Pull Request Overview
This PR adds Terminal Bench benchmarking capabilities to the Spec Kit repository by introducing a standalone benchmarking project that integrates with the Specify workflow methodology. The implementation provides workflow-aware agents that can run the Spec -> Plan -> Tasks process in isolated Terminal Bench environments.
Key changes:
- Standalone benchmarking project structure under `benchmarks/terminal_bench_agent` with uv project management
- Workflow mixin and agent implementations for OpenCode and Claude Code integration
- Documentation updates in main README and dedicated benchmarking guide
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `benchmarks/terminal_bench_agent/specify_terminal_bench/agent.py` | Core agent implementations with workflow mixin and provider-specific agents |
| `benchmarks/terminal_bench_agent/pyproject.toml` | Standalone uv project configuration for benchmarking dependencies |
| `benchmarks/terminal_bench_agent/README.md` | Detailed benchmarking setup and usage documentation |
| `README.md` | Main documentation updates with benchmarking section |
| `benchmarks/terminal_bench_agent/specify_terminal_bench/__init__.py` | Package exports for agent classes |
| Setup script templates | Shell scripts for OpenCode and Claude Code environment preparation |
| Configuration files | Git ignore patterns and code ownership assignments |
        # through the raw task instruction and let the CLI command wrap it.
        return instruction

    def _run_agent_commands(self, instruction: str) -> list[TerminalCommand]:
Copilot AI, Sep 18, 2025
Missing import for `TerminalCommand`. This class is used in the return type annotation and instantiated on line 154, but it's not imported at the top of the file.
Copilot uses AI. Check for mistakes.
    def __init__(self, model_name: str | None = None, *args, **kwargs):
        super().__init__(model_name=model_name or self._DEFAULT_MODEL, *args, **kwargs)

    def _render_instruction(self, instruction: str) -> str:  # type: ignore[override]
Copilot AI, Sep 18, 2025
The `SpecifyOpenCodeWorkflowAgent` overrides `_render_instruction` to bypass the workflow mixin's implementation, which contradicts the class inheritance design. Consider using composition instead of inheritance, or restructuring the mixin to make this override pattern more explicit.
Suggested change:
-    def _render_instruction(self, instruction: str) -> str:  # type: ignore[override]
+    def render_instruction(self, instruction: str) -> str:
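One way to make the opt-out explicit is to let the mixin own the decision through a class-level flag, so that a pass-through agent flips a flag rather than overriding the method and silencing the type checker. The sketch below uses hypothetical stand-in classes: `BaseAgent`, `WorkflowMixin`, the `wrap_instruction` flag, and the preamble text are all assumptions for illustration, not names taken from the PR.

```python
class BaseAgent:
    """Hypothetical stand-in for the Terminal Bench agent base class."""

    def _render_instruction(self, instruction: str) -> str:
        return instruction


class WorkflowMixin:
    """Hypothetical stand-in for the PR's workflow mixin."""

    # Subclasses set this to False to pass the raw instruction through.
    wrap_instruction: bool = True
    # Placeholder preamble; the real workflow prompt is not shown in the PR excerpt.
    WORKFLOW_PREAMBLE = "Follow the Spec -> Plan -> Tasks workflow.\n\n"

    def _render_instruction(self, instruction: str) -> str:
        if not self.wrap_instruction:
            return instruction
        return self.WORKFLOW_PREAMBLE + instruction


class WrappedAgent(WorkflowMixin, BaseAgent):
    """Uses the mixin's default behaviour."""


class PassThroughAgent(WorkflowMixin, BaseAgent):
    """Lets the CLI command wrap the instruction instead of the mixin."""

    wrap_instruction = False


if __name__ == "__main__":
    print(repr(WrappedAgent()._render_instruction("build it")))
    print(repr(PassThroughAgent()._render_instruction("build it")))
```

With this shape the opt-out is visible in the class body and no `# type: ignore[override]` is needed, since the method signature never changes.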
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash
Copilot AI, Sep 18, 2025
Downloading and executing shell scripts directly from the internet without verification poses a security risk. Consider adding checksum verification or using a more secure installation method.
Suggested change:
- curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash
+ NVM_VERSION="v0.40.2"
+ NVM_INSTALL_URL="https://raw.githubusercontent.com/nvm-sh/nvm/${NVM_VERSION}/install.sh"
+ NVM_INSTALL_SH="/tmp/nvm-install.sh"
+ NVM_INSTALL_SH_SHA256="e1e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2" # <-- Replace with actual SHA256 from nvm release
+ curl -fsSL "$NVM_INSTALL_URL" -o "$NVM_INSTALL_SH"
+ echo "${NVM_INSTALL_SH_SHA256}  $NVM_INSTALL_SH" | sha256sum -c -
+ bash "$NVM_INSTALL_SH"
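For comparison, the verify-before-execute step can be sketched in Python with only the standard library. The helper names (`sha256_of`, `verify_checksum`) are illustrative and not part of the PR, and the expected digest must come from a trusted source rather than from the same download.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected_hex: str) -> None:
    """Raise ValueError if the file's digest differs from the expected one."""
    actual = sha256_of(path)
    if not hmac.compare_digest(actual, expected_hex.strip().lower()):
        raise ValueError(
            f"checksum mismatch for {path}: expected {expected_hex}, got {actual}"
        )
```

A caller would download the installer to a file, call `verify_checksum(path, expected)`, and only hand the file to `bash` if no exception is raised.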
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash
Copilot AI, Sep 18, 2025
Downloading and executing shell scripts directly from the internet without verification poses a security risk. Consider adding checksum verification or using a more secure installation method.
Suggested change:
- curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.2/install.sh | bash
+ # Download nvm install script
+ NVM_VERSION="v0.40.2"
+ NVM_INSTALL_URL="https://raw.githubusercontent.com/nvm-sh/nvm/${NVM_VERSION}/install.sh"
+ NVM_INSTALL_SCRIPT="/tmp/nvm-install.sh"
+ curl -fsSL "$NVM_INSTALL_URL" -o "$NVM_INSTALL_SCRIPT"
+ # Expected SHA256 checksum for nvm v0.40.2 install.sh
+ EXPECTED_SHA256="e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3e2e7e2e3" # <-- Replace with actual checksum
+ ACTUAL_SHA256="$(sha256sum "$NVM_INSTALL_SCRIPT" | awk '{print $1}')"
+ if [ "$ACTUAL_SHA256" != "$EXPECTED_SHA256" ]; then
+   echo "ERROR: Checksum verification failed for nvm install.sh!"
+   echo "Expected: $EXPECTED_SHA256"
+   echo "Actual: $ACTUAL_SHA256"
+   exit 1
+ fi
+ bash "$NVM_INSTALL_SCRIPT"
#!/bin/bash
set -euo pipefail
Copilot AI, Sep 18, 2025
The script uses `set -euo pipefail` but line 8 pipes curl output to bash, which could mask curl failures. Consider using intermediate error checking or separating the download and execution steps.
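The behaviour behind this comment can be checked with a small experiment. The sketch below assumes `bash` is on `PATH` and uses a brace group that prints one line of script and then fails, as a stand-in for a download that breaks partway through; `run_pipeline` and `FAILING_DOWNLOAD` are illustrative names, not part of the PR.

```python
import subprocess


def run_pipeline(pipeline: str, pipefail: bool) -> tuple[int, str]:
    """Run a bash pipeline and return (exit status, stdout)."""
    prefix = "set -o pipefail; " if pipefail else ""
    result = subprocess.run(
        ["bash", "-c", prefix + pipeline],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode, result.stdout


# The left side emits one line of script and then exits non-zero,
# imitating a producer that fails after sending partial output.
FAILING_DOWNLOAD = "{ echo 'echo ran-partial-script'; false; } | bash"

if __name__ == "__main__":
    for flag in (False, True):
        status, out = run_pipeline(FAILING_DOWNLOAD, pipefail=flag)
        print(f"pipefail={flag}: status={status}, output={out.strip()!r}")
```

Without `pipefail` the failure is hidden behind the consumer's exit status; with `pipefail` the status is non-zero, but the partial script has already been executed by the time the failure is reported. Downloading to a file first and running it only after the download succeeds avoids both problems.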
Summary
TODO
Testing
- `uv run tb run --dataset terminal-bench-core==head --task-id hello-world --agent-import-path specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent`
- `uv run tb run --dataset deveval==head --task-id python-particle-swarm-optimization-implementation --agent-import-path specify_terminal_bench.agent:SpecifyOpenCodeWorkflowAgent` (fails at dataset fetch due to remote availability)