7 changes: 7 additions & 0 deletions environments/from_mcp_template/.env.example
@@ -0,0 +1,7 @@
# HUD API Configuration
# Get your API key from https://hud.so/account
HUD_API_KEY=""

# Anthropic API Configuration (optional)
# Required for using Claude agents - get from https://console.anthropic.com/
ANTHROPIC_API_KEY=""
12 changes: 12 additions & 0 deletions environments/from_mcp_template/Dockerfile
@@ -0,0 +1,12 @@
FROM public.ecr.aws/docker/library/python:3.11-bookworm

WORKDIR /app

# Install git for dependency installation
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

# Copy and install dependencies
COPY pyproject.toml ./
COPY controller/ ./controller/
COPY environment/ ./environment/
RUN pip install --no-cache-dir -e .
108 changes: 108 additions & 0 deletions environments/from_mcp_template/README.md
@@ -0,0 +1,108 @@
# test-test

## Environment design pattern
- Controller (think of this as the frontend in web development)
  - Creates the UX and manages the lifecycle of an app (in this case, for an agent)
  - Defines `mcp = MCPServer()` and registers `@mcp.tool` functions as the tools the agent can interact with
- Environment (think of this as the backend in web development)
  - Owns all long‑lived state of the environment and exposes the environment data structure
  - Exposes simple HTTP endpoints (`/health`, `/act`, `/reset`, `/state`)

IMPORTANT: Make sure all logs go to stderr, not stdout; stdout is reserved for MCP communication
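
The split above can be illustrated with a minimal sketch of the state an environment process might own. Everything here is an assumption for illustration: the `CounterEnvironment` class, its fields, and the HTTP verbs in the comments are hypothetical, and the HTTP layer itself is left out so the example stays dependency-free. Only the four endpoint names come from this README.

```python
from dataclasses import dataclass, field


@dataclass
class CounterEnvironment:
    """Hypothetical long-lived state owned by the environment process."""

    count: int = 0
    history: list[int] = field(default_factory=list)

    def health(self) -> dict:
        # Would back the /health endpoint (assumed GET).
        return {"status": "ok"}

    def act(self, amount: int = 1) -> dict:
        # Would back the /act endpoint (assumed POST): mutate state, return the new value.
        self.count += amount
        self.history.append(amount)
        return {"count": self.count}

    def reset(self) -> dict:
        # Would back the /reset endpoint (assumed POST): return to the initial state.
        self.count = 0
        self.history.clear()
        return {"ok": True}

    def state(self) -> dict:
        # Would back the /state endpoint (assumed GET): read-only snapshot.
        return {"count": self.count, "steps": len(self.history)}


env = CounterEnvironment()
env.act()
env.act(2)
print(env.state())
```

Keeping the state in one object like this means each HTTP handler can stay a one-line wrapper, and the controller never needs to cache anything.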

### Testing your environment
```bash
# 1. Configure your API keys (optional - only needed for evaluation)
# Edit .env file to add your HUD_API_KEY and ANTHROPIC_API_KEY

# 2. Start the environment (optional: with --inspector or --interactive)
hud dev --build --interactive

# 3. Choose your preferred way to test:

# Option A: Run the task with Claude (requires ANTHROPIC_API_KEY)
hud eval tasks.json --agent claude

# Option B: Interactive notebook test_env.ipynb (great for learning!)

# Option C: Simple Python script (runs all tasks from tasks.json)
python test_task.py
```

## Iterating on your environment
This is usually the process for making any environment better:
```bash
# 1. Start the environment and interact with it directly (or give MCP server to an agent):
hud dev --build --interactive

# 2. If the environment cannot start or fails inexplicably:
hud debug test_env:dev # Or your env name that appears when you run hud dev
# After fixing the error, go back to 1.

# 3. When the environment is in a stable state:
hud build
hud push # Requires docker login

# 4. Once the newest version is pushed, update your tasks to reference it, then run:
hud rl
# This is a good test of whether your environment and tasks are high quality!
```

## Layout
```
controller/
├── __init__.py   # mcp + shared HTTP client
├── __main__.py   # python -m controller → mcp.run()
├── hooks.py      # @mcp.initialize / @mcp.shutdown
└── tools.py      # @mcp.tool act / setup / evaluate

environment/
├── __init__.py
└── server.py     # FastAPI app: /health, /act, /reset, /state
```

## Publishing Your Environment

Once your environment is ready, you can share it with the community:

### 1. Push to Registry
```bash
# Build and push your environment (requires a Docker Hub login and a HUD API key)
hud build
hud push
```

### 2. Create a Dataset

Create a dataset on HuggingFace with your tasks:

**Option A: Upload manually**
1. Upload your `tasks.json` to HuggingFace
2. Make sure it's **public** to appear on leaderboards

**Option B: Use the SDK**
```python
from hud.datasets import save_tasks
import json

# Load your tasks
with open("tasks.json") as f:
    tasks = json.load(f)

# Push to HuggingFace
save_tasks(tasks, repo_id="your-org/your-dataset")
```

### 3. Run and Track Performance

```bash
# Run Claude on your benchmark
hud eval "your-org/your-dataset" --agent claude

# View results at:
# hud.so/leaderboards/your-org/your-dataset
```

**Note**: Only public HuggingFace datasets appear as leaderboards!

📚 Learn more: [Creating Benchmarks](https://docs.hud.so/evaluate-agents/create-benchmarks) | [Leaderboards](https://docs.hud.so/evaluate-agents/leaderboards)

16 changes: 16 additions & 0 deletions environments/from_mcp_template/controller/README.md
@@ -0,0 +1,16 @@
# Controller

Frontend for the agent: defines tools, minimal state, calls the environment over HTTP.

What to implement
- Shared client in `__init__.py` (one `httpx.AsyncClient`)
- Lifecycle in `hooks.py` (`@mcp.initialize`/`@mcp.shutdown`)
- Tools in `tools.py` (`@mcp.tool`) — keep logic thin; docstrings = descriptions

Run
```bash
hud run controller --transport http --reload
# Helper endpoints: http://localhost:8765/hud and /hud/tools
```

Principle: the controller is UX, not state. Keep long‑lived state in the environment.
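
A minimal sketch of that principle, under stated assumptions: `FakeEnvironmentClient` is a hypothetical stand-in for the shared HTTP client (it records calls instead of sending them), and the `{"action": ...}` payload shape is invented for illustration. The real tool would be registered with `@mcp.tool` and would use the actual client.

```python
import asyncio


class FakeEnvironmentClient:
    """Hypothetical stand-in for the shared HTTP client; records calls only."""

    def __init__(self):
        self.calls = []

    async def post(self, path, json=None):
        # Remember what was requested and echo it back as the "response".
        self.calls.append((path, json))
        return {"path": path, "echo": json}


async def act(client, action: str) -> dict:
    """Forward one action to the environment and return its reply unchanged."""
    # No counters, caches, or history live here: the environment owns all of that.
    return await client.post("/act", json={"action": action})


client = FakeEnvironmentClient()
reply = asyncio.run(act(client, "click"))
print(reply)
```

Because the tool holds no state, restarting the controller loses nothing, and the docstring doubles as the description the agent sees.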
27 changes: 27 additions & 0 deletions environments/from_mcp_template/controller/__init__.py
@@ -0,0 +1,27 @@
"""Controller package - registers hooks and tools."""

import sys
import os
import httpx
import logging
from hud.server import MCPServer

logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format="[%(levelname)s] %(asctime)s | %(name)s | %(message)s",
    force=True,  # Force all loggers to use stderr
)

# Suppress httpx INFO logs to avoid cluttering MCP protocol
httpx_logger = logging.getLogger("httpx")
httpx_logger.setLevel(logging.WARNING) # Only show warnings and errors
httpcore_logger = logging.getLogger("httpcore")
httpcore_logger.setLevel(logging.WARNING) # Only show warnings and errors

mcp = MCPServer()

# Import tools and hooks to register them with the server
from . import tools, hooks

__all__ = ["mcp"]
4 changes: 4 additions & 0 deletions environments/from_mcp_template/controller/__main__.py
@@ -0,0 +1,4 @@
from controller import mcp

if __name__ == "__main__":
    mcp.run()
4 changes: 4 additions & 0 deletions environments/from_mcp_template/controller/tools.py
@@ -0,0 +1,4 @@
"""Controller tools that call the environment API."""

from controller import mcp
from hud.tools.types import EvaluationResult
19 changes: 19 additions & 0 deletions environments/from_mcp_template/pyproject.toml
@@ -0,0 +1,19 @@
[project]
name = "test_test"
version = "0.1.0"
description = "A minimal HUD environment"
requires-python = ">=3.11"
dependencies = [ "hud-python==0.4.41" ]

[build-system]
requires = [ "hatchling",]
build-backend = "hatchling.build"

[tool.hud]
image = "test_test:dev"

[tool.hatch.metadata]
allow-direct-references = true

[tool.hatch.build.targets.wheel]
packages = [ "controller", "environment",]
13 changes: 13 additions & 0 deletions environments/from_mcp_template/tasks.json
@@ -0,0 +1,13 @@
[
  {
    "prompt": "Do something in this strange new environment",
    "mcp_config": {
      "local": {
        "url": "http://localhost:8765/mcp"
      }
    },
    "setup_tool": {},
    "integration_test_tool": {},
    "evaluate_tool": {}
  }
]
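
The template task above leaves its tool entries as empty placeholders. A small check like the following sketch could flag them before a run. The key names come from the file above; the rules themselves (which keys are mandatory, treating `{}` as "not filled in") and the `find_task_problems` helper are assumptions for illustration, not HUD's own validation.

```python
import json

# Assumption: these two keys are treated as mandatory for this sketch.
REQUIRED_KEYS = ("prompt", "mcp_config")


def find_task_problems(tasks):
    """Return human-readable problems found in a list of task dicts."""
    problems = []
    for index, task in enumerate(tasks):
        for key in REQUIRED_KEYS:
            if key not in task:
                problems.append(f"task {index}: missing '{key}'")
        prompt = task.get("prompt")
        if isinstance(prompt, str) and not prompt.strip():
            problems.append(f"task {index}: empty prompt")
        # An empty object means the template placeholder was never replaced.
        for key in ("setup_tool", "evaluate_tool"):
            if task.get(key) == {}:
                problems.append(f"task {index}: '{key}' is still an empty placeholder")
    return problems


sample = json.loads("""
[
  {
    "prompt": "Do something in this strange new environment",
    "mcp_config": {"local": {"url": "http://localhost:8765/mcp"}},
    "setup_tool": {},
    "integration_test_tool": {},
    "evaluate_tool": {}
  }
]
""")

for problem in find_task_problems(sample):
    print(problem)
```

Run against the template as committed, this reports the two unfilled tool entries, which is a useful reminder to define them before running `hud eval`.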