TL;DR:
- Over the weekend, quite unexpectedly, I built a multi-agent AI system that ranks slightly above Claude Code on Stanford's TerminalBench leaderboard (12th place).
- This AI system consists of an orchestration agent that dispatches multiple explorer and coder agents to do all the work.
- The orchestrator explicitly defines what knowledge artifacts subagents must return, then reuses and synthesises these artifacts across future tasks - creating compound intelligence where each action builds meaningfully on previous discoveries.
The orchestrator acts as the brain of the operation - it receives the user's task but never touches code directly. Instead, it:
- Analyses the task and breaks it into focused subtasks
- Dispatches explorer agents to understand the system
- Delegates implementation work to coder agents with precise instructions
- Verifies all changes through additional explorer agents
- Maintains the context store with all discovered knowledge
The orchestrator's lack of direct code access forces proper delegation and verification patterns, leading to more strategic solutions.
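The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: `Subtask`, `plan`, `run_orchestrator`, and the `dispatch` callable are all hypothetical names standing in for the real orchestration machinery.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    description: str
    agent_type: str  # "explorer" or "coder"
    expected_contexts: list[str] = field(default_factory=list)


def plan(task: str) -> list[Subtask]:
    """Hypothetical decomposition: explore first, then implement, then verify."""
    return [
        Subtask(f"Explore the code relevant to: {task}", "explorer",
                ["repo_layout", "relevant_files"]),
        Subtask(f"Implement: {task}", "coder", ["implementation_report"]),
        Subtask(f"Verify the changes made for: {task}", "explorer",
                ["verification_result"]),
    ]


def run_orchestrator(task: str, dispatch) -> dict[str, str]:
    """The orchestrator never touches code itself; it only plans, dispatches,
    and collects the knowledge artifacts each subagent was asked to return."""
    context_store: dict[str, str] = {}
    for subtask in plan(task):
        returned = dispatch(subtask, dict(context_store))  # inject known contexts
        context_store.update(returned)                     # persist new artifacts
    return context_store
```

The key property is that `run_orchestrator` only ever sees the artifacts subagents return, never the code they read or wrote.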
For a full breakdown of this project's code structure, see here
TerminalBench is a brilliant benchmark created by Stanford and Laude Institute to quantify agents' ability to complete complex tasks in the terminal. My Orchestrator system achieved 12th place on the leaderboard, demonstrating competitive performance against leading AI coding assistants.
I ran the Orchestrator evaluations with both Claude-4-Sonnet and Qwen3-Coder-480B-A35B:
One of the most striking results is the number of tokens used by Sonnet-4 compared with Qwen3-Coder.
The table below shows the total tokens (input and output combined) processed across the TerminalBench evaluation run (5 attempts at 80 tasks = 400 trajectories).
| Model | Success Rate | Total Evaluation Cost | Token Usage |
|---|---|---|---|
| Claude Sonnet-4 | 36.0% | $263.56* | 93.2M tokens |
| Qwen3-Coder | 19.25% | $217.83 | 14.7M tokens |
*Claude Sonnet-4 costs reflect heavy caching usage, reducing actual API costs
While all agents use the same underlying LLM, each operates with its own context window, specialised system message, and distinct toolset. This creates functionally different agents optimised for their specific roles.
Orchestrator agent (system message summary):
Role: Strategic coordinator and persistent intelligence layer
Capabilities: Task decomposition, context management, subagent delegation
Tools: Task creation, subagent launching, context store management
Restrictions: Cannot read or modify code directly - operates purely at architectural level
The orchestrator maintains the complete picture across all tasks, tracking discoveries and progress. It crafts precise task descriptions that explicitly specify what contexts subagents should return, ensuring focused and valuable information gathering.
Trust Calibration Strategy:
The orchestrator employs adaptive delegation based on task complexity:
- Low Complexity Tasks: Grants extremely high autonomy to the coder agent for simple modifications and bug fixes
- Medium/Large Tasks: Maintains strong trust but uses iterative decomposition - breaking complex problems into atomic, verifiable steps
- Verification Philosophy: Uses explorer agents liberally to verify progress, especially when tasks involve critical functionality
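The trust-calibration idea can be made concrete with a small sketch. Everything here is an assumption for illustration: `Complexity` and `delegation_plan` are hypothetical names, and the actual orchestrator makes this judgment via its system prompt rather than hard-coded rules.

```python
from enum import Enum


class Complexity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def delegation_plan(task: str, complexity: Complexity) -> list[str]:
    """Hypothetical mapping from task complexity to delegation strategy."""
    if complexity is Complexity.LOW:
        # High autonomy: a single coder agent, no interim verification.
        return [f"coder: {task}"]
    # Iterative decomposition: atomic steps, each followed by verification.
    steps = [f"coder: step towards {task}", "explorer: verify previous step"]
    if complexity is Complexity.HIGH:
        # Investigate before touching anything on critical tasks.
        steps.insert(0, f"explorer: investigate system before touching {task}")
    return steps
```

The design choice this illustrates: verification cost is spent where failure is expensive, not uniformly across all tasks.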
Explorer agent (system message summary):
Role: Read-only investigation and verification specialist
Capabilities: System exploration, code analysis, test execution, verification
Tools: File reading, search operations (grep/glob), bash commands, temporary script creation
Restrictions: Cannot modify existing files - strictly read-only operations
Explorers gather intelligence about the codebase, verify implementations, and discover system behaviors. They create knowledge artifacts that eliminate redundant exploration for future agents.
Coder agent (system message summary):
Role: Implementation specialist with write access
Capabilities: Code creation/modification, refactoring, bug fixes, system changes
Tools: Full file operations (read/write/edit), bash commands, search operations
Restrictions: None - full system access for implementation tasks
Coders transform architectural vision into working code. They receive focused tasks with relevant contexts and implement solutions while maintaining code quality and conventions.
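The three agent profiles above differ only in their toolset and write permissions. A minimal sketch of that tool-gating, with assumed tool names (`read_file`, `write_file`, etc. are placeholders, not the project's actual tool identifiers):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    role: str
    tools: frozenset[str]
    can_write_code: bool


ORCHESTRATOR = AgentProfile(
    role="strategic coordinator",
    tools=frozenset({"create_task", "launch_subagent", "manage_context_store"}),
    can_write_code=False,  # cannot read or modify code directly
)
EXPLORER = AgentProfile(
    role="read-only investigator",
    tools=frozenset({"read_file", "grep", "glob", "bash"}),
    can_write_code=False,
)
CODER = AgentProfile(
    role="implementation specialist",
    tools=frozenset({"read_file", "write_file", "edit_file", "bash", "grep"}),
    can_write_code=True,
)


def allowed(profile: AgentProfile, tool: str) -> bool:
    """Tool-gating: an agent may only call tools in its profile."""
    return tool in profile.tools
```

Because the same underlying LLM powers all three, the profiles are what make them functionally different agents.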
I introduced a novel approach to multi-agent coordination through the Context Store - a persistent knowledge layer that transforms isolated agent actions into coherent problem-solving. Unlike traditional multi-agent systems where agents operate in isolation, my architecture enables sophisticated knowledge accumulation and sharing.
The Context Store Pattern:
- Orchestrator-Directed Discovery: The orchestrator explicitly specifies what contexts it needs from each subagent, ensuring focused and relevant information gathering and implementation reporting
- Knowledge Artifacts: Subagents create discrete, reusable context items based on the orchestrator's requirements
- Persistent Memory: Contexts persist across agent interactions, building a comprehensive system understanding
- Selective Injection: The orchestrator precisely injects relevant contexts into new tasks, eliminating redundant discovery and providing all the information a subagent needs to complete its respective task
- Compound Intelligence: Each action builds meaningfully on previous discoveries, creating compounding problem-solving capability
Key Benefits:
- Eliminates Redundant Work: Subagents never need to rediscover the same information twice
- Reduces Context Window Load: Agents receive only the specific contexts they need
- Enables Complex Solutions: Multi-step problems that no single agent could solve become tractable
- Maintains Focus: Each subagent operates with a clean, focused context window
This architecture ensures that every piece of discovered information becomes a permanent building block for future tasks, creating a system that genuinely learns and adapts throughout the problem-solving process.
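The Context Store pattern can be captured in a few lines. This is a minimal sketch of the assumed interface, not the project's implementation; the method names are hypothetical.

```python
class ContextStore:
    """Persistent knowledge layer: subagents deposit named artifacts,
    and the orchestrator selectively injects only the artifacts a new
    task needs."""

    def __init__(self) -> None:
        self._artifacts: dict[str, str] = {}

    def add(self, context_id: str, content: str) -> None:
        # Persists across agent interactions: once discovered, never rediscovered.
        self._artifacts[context_id] = content

    def inject(self, context_ids: list[str]) -> dict[str, str]:
        # Selective injection: only the requested contexts enter the new
        # subagent's window, keeping it clean and focused.
        return {cid: self._artifacts[cid] for cid in context_ids
                if cid in self._artifacts}
```

For example, an explorer might `add("repo_layout", ...)` once, and every later coder task receives it via `inject(["repo_layout"])` instead of re-exploring the tree.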
The orchestrator maintains a comprehensive task management system that tracks all subagent activities:
Core Functions:
- Progress Tracking: Monitors task status (pending, completed, failed) across potentially hundreds of coordinated actions
- Failure Recovery: Captures failure reasons to enable strategic adaptation and intelligent retries
- Workflow Orchestration: Maintains clear audit trails of what's been attempted, preventing redundant work
- Strategic Planning: Enables systematic decomposition of complex problems into verifiable subtasks
The task manager serves as the orchestrator's operational memory - while the context store holds discovered knowledge, the task manager tracks the journey of discovery itself. This dual-layer system ensures the orchestrator always knows both what it has learned AND how it learned it, enabling sophisticated multi-step solutions that build intelligently on previous attempts.
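A sketch of that operational memory, under the same caveat as before: `TaskManager` and its methods are assumed names illustrating the status tracking and failure capture described above.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    description: str
    status: str = "pending"  # pending | completed | failed
    failure_reason: str = ""


class TaskManager:
    """The orchestrator's operational memory: tracks the journey of
    discovery, while the context store holds what was discovered."""

    def __init__(self) -> None:
        self._tasks: dict[int, TaskRecord] = {}
        self._next_id = 0

    def create(self, description: str) -> int:
        task_id = self._next_id
        self._tasks[task_id] = TaskRecord(description)
        self._next_id += 1
        return task_id

    def complete(self, task_id: int) -> None:
        self._tasks[task_id].status = "completed"

    def fail(self, task_id: int, reason: str) -> None:
        # Capturing the reason is what enables intelligent retries.
        record = self._tasks[task_id]
        record.status, record.failure_reason = "failed", reason

    def audit_trail(self) -> list[tuple[str, str]]:
        # Clear record of what has been attempted, preventing redundant work.
        return [(t.description, t.status) for t in self._tasks.values()]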
One thing I noticed during early evaluations was that while the system was on track to complete extremely complex tasks, it would dispatch so many subagents along the way that the run hit the time limit.
The orchestrator therefore now employs a philosophy of time-efficient execution, recognising that wasted time usually stems from poor task specification rather than slow execution:
Prevention Principles:
- Front-Loading Precision: The orchestrator spends time crafting exact task descriptions rather than iterating on vague ones
- Context Completeness: Always over-provides context rather than under-providing, preventing subagents from rediscovering known information
- Explicit Expectations: Every task specifies exactly what contexts should be returned, eliminating unfocused exploration
- Tight Scoping: Defines clear boundaries - what to do AND what not to do, preventing scope creep
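The four principles above boil down to the shape of the task description itself. A hypothetical template (the section names and `build_task_description` helper are illustrative, not the project's actual prompt format):

```python
def build_task_description(goal: str, contexts: dict[str, str],
                           expected_contexts: list[str],
                           out_of_scope: list[str]) -> str:
    """Front-loads precision: precise goal, complete context,
    explicit expected artifacts, and tight scoping in one prompt."""
    lines = [f"GOAL: {goal}", "", "KNOWN CONTEXT:"]
    lines += [f"- {cid}: {content}" for cid, content in contexts.items()]
    lines += ["", "RETURN THESE CONTEXTS:"]
    lines += [f"- {cid}" for cid in expected_contexts]
    lines += ["", "OUT OF SCOPE (do not touch):"]
    lines += [f"- {item}" for item in out_of_scope]
    return "\n".join(lines)
```

Spending tokens here is cheaper than letting a subagent wander: a vague task burns an entire agent trajectory, while a precise one costs only a longer prompt.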
For dev:

```shell
uv sync
```

To run evals:

```shell
./run_terminal_bench_eval.sh
```

To quickly test various models, see /tests.