hud-evals · dylanbowman314 · Sep 30, 2025
diff --git a/environments/from_mcp_template/.env.example b/environments/from_mcp_template/.env.example
@@ -0,0 +1,7 @@
+# HUD API Configuration
+# Get your API key from https://hud.so/account
+HUD_API_KEY=""
+
+# Anthropic API Configuration (optional)
+# Required for using Claude agents - get from https://console.anthropic.com/
+ANTHROPIC_API_KEY=""
diff --git a/environments/from_mcp_template/Dockerfile b/environments/from_mcp_template/Dockerfile
@@ -0,0 +1,12 @@
+FROM public.ecr.aws/docker/library/python:3.11-bookworm
+
+WORKDIR /app
+
+# Install git for dependency installation
+RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
+
+# Copy and install dependencies
+COPY pyproject.toml ./
+COPY controller/ ./controller/
+COPY environment/ ./environment/
+RUN pip install --no-cache-dir -e .
diff --git a/environments/from_mcp_template/README.md b/environments/from_mcp_template/README.md
@@ -0,0 +1,108 @@
+# test-test
+
+## Environment design pattern
+- Controller (think of this as the frontend in web development)
+  - Creates the UX and manages the lifecycle of an app (in this case, for an agent)
+  - Defines `mcp = MCPServer()` and registers functions with `@mcp.tool` as tools the agent can call
+- Environment (think of this as the backend in web development)
+  - Owns all long‑lived state of the environment and exposes the environment data structure
+  - Exposes simple HTTP endpoints (`/health`, `/act`, `/reset`, `/state`)
+
+IMPORTANT: Make sure all logs go to stderr rather than stdout, which is reserved for MCP communication.
+
+### Testing your environment
+```bash
+# 1. Configure your API keys (optional - only needed for evaluation)
+# Edit .env file to add your HUD_API_KEY and ANTHROPIC_API_KEY
+
+# 2. Start the environment (optional: with --inspector or --interactive)
+hud dev --build --interactive
+
+# 3. Choose your preferred way to test:
+
+# Option A: Run the task with Claude (requires ANTHROPIC_API_KEY)
+hud eval tasks.json --agent claude
+
+# Option B: Interactive notebook test_env.ipynb (great for learning!)
+
+# Option C: Simple Python script (runs all tasks from tasks.json)
+python test_task.py
+```
+
+## Iterating on your environment
+This is usually the process for making any environment better:
+```bash
+# 1. Start the environment and interact with it directly (or give MCP server to an agent):
+hud dev --build --interactive
+
+# 2. If the environment cannot start or fails inexplicably:
+hud debug test_test:dev # Or whatever image name appears when you run hud dev
+# After fixing the error, go back to 1.
+
+# 3. When the environment is in a stable state:
+hud build
+hud push # Requires docker login
+
+# 4. Once the new version is pushed, update your tasks to reference it, then run:
+hud rl
+# This is a good test to see if your environment and tasks are high quality!
+```
+## Layout
+```
+controller/
+  __init__.py   # mcp + shared HTTP client
+  __main__.py   # python -m controller → mcp.run()
+  hooks.py      # @mcp.initialize / @mcp.shutdown
+  tools.py      # @mcp.tool act / setup / evaluate
+
+./environment
+  ├── __init__.py
+  └── server.py       # FastAPI app: /health, /act, /reset, /state
+```
+
+## Publishing Your Environment
+
+Once your environment is ready, you can share it with the community:
+
+### 1. Push to Registry
+```bash
+# Build and push your environment (requires Docker Hub login and HUD API key)
+hud build
+hud push
+```
+
+### 2. Create a Dataset
+
+Create a dataset on HuggingFace with your tasks:
+
+**Option A: Upload manually**
+1. Upload your `tasks.json` to HuggingFace
+2. Make sure it's **public** to appear on leaderboards
+
+**Option B: Use the SDK**
+```python
+from hud.datasets import save_tasks
+import json
+
+# Load your tasks
+with open("tasks.json") as f:
+    tasks = json.load(f)
+
+# Push to HuggingFace
+save_tasks(tasks, repo_id="your-org/your-dataset")
+```
+
+### 3. Run and Track Performance
+
+```bash
+# Run Claude on your benchmark
+hud eval "your-org/your-dataset" --agent claude
+
+# View results at:
+# hud.so/leaderboards/your-org/your-dataset
+```
+
+**Note**: Only public HuggingFace datasets appear as leaderboards!
+
+📚 Learn more: [Creating Benchmarks](https://docs.hud.so/evaluate-agents/create-benchmarks) | [Leaderboards](https://docs.hud.so/evaluate-agents/leaderboards)
+
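Note on the README above: it describes an environment that owns the long-lived state and exposes `/health`, `/act`, `/reset`, and `/state`, but no `environment/server.py` is shown in this diff. The following is a minimal, framework-free sketch of the state such endpoints might wrap. The class name `EnvironmentState`, its fields, and the HTTP verbs mentioned in the comments are all assumptions made for illustration; they are not taken from the HUD SDK or from this PR.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EnvironmentState:
    """Hypothetical long-lived state owned by the environment process."""

    step_count: int = 0
    history: list[str] = field(default_factory=list)

    def health(self) -> dict[str, Any]:
        # Would back /health (assumed GET): report liveness only.
        return {"status": "ok"}

    def act(self, action: str) -> dict[str, Any]:
        # Would back /act (assumed POST): apply one agent action and record it.
        self.step_count += 1
        self.history.append(action)
        return {"step_count": self.step_count}

    def reset(self) -> dict[str, Any]:
        # Would back /reset (assumed POST): return to the initial state.
        self.step_count = 0
        self.history.clear()
        return {"ok": True}

    def state(self) -> dict[str, Any]:
        # Would back /state (assumed GET): return a copy so callers
        # cannot mutate the internal history list.
        return {"step_count": self.step_count, "history": list(self.history)}


env = EnvironmentState()
env.act("click")
env.act("type")
snapshot = env.state()
env.reset()
```

Keeping the state in a plain object like this is one way to follow the "controller is UX, not state" principle: a web framework layer would only translate HTTP requests into these four method calls, and the controller's tools would only call the HTTP endpoints.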
diff --git a/environments/from_mcp_template/controller/README.md b/environments/from_mcp_template/controller/README.md
@@ -0,0 +1,16 @@
+# Controller
+
+Frontend for the agent: defines tools, minimal state, calls the environment over HTTP.
+
+What to implement
+- Shared client in `__init__.py` (one `httpx.AsyncClient`)
+- Lifecycle in `hooks.py` (`@mcp.initialize`/`@mcp.shutdown`)
+- Tools in `tools.py` (`@mcp.tool`) — keep logic thin; docstrings = descriptions
+
+Run
+```bash
+hud run controller --transport http --reload
+# Helper endpoints: http://localhost:8765/hud and /hud/tools
+```
+
+Principle: the controller is UX, not state. Keep long‑lived state in the environment.
diff --git a/environments/from_mcp_template/controller/__init__.py b/environments/from_mcp_template/controller/__init__.py
@@ -0,0 +1,27 @@
+"""Controller package - registers hooks and tools."""
+
+import sys
+import os
+import httpx
+import logging
+from hud.server import MCPServer
+
+logging.basicConfig(
+    stream=sys.stderr,
+    level=logging.INFO,
+    format="[%(levelname)s] %(asctime)s | %(name)s | %(message)s",
+    force=True,  # Force all loggers to use stderr
+)
+
+# Suppress httpx INFO logs to avoid cluttering MCP protocol
+httpx_logger = logging.getLogger("httpx")
+httpx_logger.setLevel(logging.WARNING)  # Only show warnings and errors
+httpcore_logger = logging.getLogger("httpcore")
+httpcore_logger.setLevel(logging.WARNING)  # Only show warnings and errors
+
+mcp = MCPServer()
+
+# Import tools and hooks to register them with the server
+from . import tools, hooks
+
+__all__ = ["mcp"]
diff --git a/environments/from_mcp_template/controller/__main__.py b/environments/from_mcp_template/controller/__main__.py
@@ -0,0 +1,4 @@
+from controller import mcp
+
+if __name__ == "__main__":
+    mcp.run()
diff --git a/environments/from_mcp_template/controller/tools.py b/environments/from_mcp_template/controller/tools.py
@@ -0,0 +1,4 @@
+"""Controller tools that call the environment API."""
+
+from controller import mcp
+from hud.tools.types import EvaluationResult
diff --git a/environments/from_mcp_template/pyproject.toml b/environments/from_mcp_template/pyproject.toml
@@ -0,0 +1,19 @@
+[project]
+name = "test_test"
+version = "0.1.0"
+description = "A minimal HUD environment"
+requires-python = ">=3.11"
+dependencies = [ "hud-python==0.4.41" ]
+
+[build-system]
+requires = [ "hatchling",]
+build-backend = "hatchling.build"
+
+[tool.hud]
+image = "test_test:dev"
+
+[tool.hatch.metadata]
+allow-direct-references = true
+
+[tool.hatch.build.targets.wheel]
+packages = [ "controller", "environment",]
diff --git a/environments/from_mcp_template/tasks.json b/environments/from_mcp_template/tasks.json
@@ -0,0 +1,13 @@
+[
+  {
+    "prompt": "Do something in this strange new environment",
+    "mcp_config": {
+      "local": {
+        "url": "http://localhost:8765/mcp"
+      }
+    },
+    "setup_tool": {},
+    "integration_test_tool": {},
+    "evaluate_tool": {}
+  }
+]
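Note on `tasks.json` above: the template leaves `setup_tool` and `evaluate_tool` as empty objects. A small pre-flight check along the following lines could flag such placeholders before running `hud eval`. This is only a sketch: the helper `find_task_problems` and the `EXPECTED_KEYS` set are hypothetical, the key list is copied from the template rather than from HUD's task schema, and whether HUD itself accepts an empty tool spec is not stated in this diff.

```python
import json

# Keys copied from the template tasks.json in this diff; NOT HUD's official schema.
EXPECTED_KEYS = {"prompt", "mcp_config", "setup_tool", "evaluate_tool"}


def find_task_problems(tasks: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    for index, task in enumerate(tasks):
        missing = EXPECTED_KEYS - set(task)
        if missing:
            problems.append(f"task {index}: missing keys {sorted(missing)}")
        # Flag tool specs that are present but still empty placeholders.
        for key in ("setup_tool", "evaluate_tool"):
            if key in task and not task[key]:
                problems.append(f"task {index}: {key} is empty")
    return problems


# The template task from this diff, inlined so the example is self-contained.
template_tasks = json.loads("""
[
  {
    "prompt": "Do something in this strange new environment",
    "mcp_config": {"local": {"url": "http://localhost:8765/mcp"}},
    "setup_tool": {},
    "integration_test_tool": {},
    "evaluate_tool": {}
  }
]
""")

for problem in find_task_problems(template_tasks):
    print(problem)
```

In practice the list would be loaded from the file with `json.load(open("tasks.json"))` instead of the inlined string.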