6 changes: 3 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
@@ -78,13 +78,13 @@ export OPENAI_API_KEY="<your-key>"
# Try the circle packing benchmark
uv sync --extra math
uv run skydiscover-run benchmarks/math/circle_packing/initial_program.py \
benchmarks/math/circle_packing/evaluator.py \
benchmarks/math/circle_packing/eval \
--config benchmarks/math/circle_packing/config.yaml \
--search evox \
--iterations 100

uv run skydiscover-run benchmarks/math/circle_packing/initial_program.py \
benchmarks/math/circle_packing/evaluator.py \
benchmarks/math/circle_packing/eval \
--config benchmarks/math/circle_packing/config.yaml \
--search adaevolve \
--iterations 100
@@ -136,7 +136,7 @@ SkyDiscover supports three evaluator formats — pick whichever fits your use ca
| Format | When to use | What you point `evaluation_file` at |
|:---|:---|:---|
| **Python function** | Simple tasks, no system deps | `evaluator.py` |
| **Containerized** | Custom deps, data files, isolation | `evaluator/` directory (must contain `Dockerfile` + `evaluate.sh`) |
| **Containerized** | Custom deps, data files, isolation | `eval/` directory (must contain `Dockerfile` + `evaluate.sh`) |
| **Harbor task** | External benchmark suites (AlgoTune, EvoEval, HumanEvalFix, BigCodeBench, LiveCodeBench, USACO, CRUSTBench, CodePDE, and more) | Task directory (must contain `instruction.md` + `tests/` + `environment/Dockerfile`) |

SkyDiscover auto-detects the format. See [`benchmarks/README.md`](benchmarks/README.md#adding-a-benchmark) for full setup instructions.
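
As a rough illustration of the auto-detection described in the table above, the check could be sketched as follows. This is not SkyDiscover's actual implementation: the function name `detect_evaluator_format` and the returned labels are hypothetical, and only the file and directory names come from the table.

```python
from pathlib import Path


def detect_evaluator_format(evaluation_file: str) -> str:
    """Guess the evaluator format from the path layout (hypothetical helper)."""
    path = Path(evaluation_file)

    # Python function format: a single .py file such as evaluator.py.
    if path.is_file() and path.suffix == ".py":
        return "python"

    if path.is_dir():
        # Containerized format: Dockerfile + evaluate.sh at the top level.
        if (path / "Dockerfile").is_file() and (path / "evaluate.sh").is_file():
            return "containerized"
        # Harbor task format: instruction.md + tests/ + environment/Dockerfile.
        if (
            (path / "instruction.md").is_file()
            and (path / "tests").is_dir()
            and (path / "environment" / "Dockerfile").is_file()
        ):
            return "harbor"

    raise ValueError(f"Unrecognized evaluator layout: {evaluation_file}")
```

Under these assumptions, pointing the helper at `benchmarks/math/circle_packing/eval` would report the containerized format only if that directory holds both a `Dockerfile` and an `evaluate.sh`.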
4 changes: 2 additions & 2 deletions benchmarks/ADRS/README.md
Original file line number Diff line number Diff line change
@@ -46,15 +46,15 @@ Given a set of database transactions with read/write dependencies on shared keys

Each benchmark directory contains:
- `initial_program.py` — the seed solution for evolution
- `evaluator.py` — the scoring function
- `eval/` — containerized evaluator directory
- `config.yaml` — run configuration

Run any benchmark from the repo root:

```bash
uv run skydiscover-run \
benchmarks/ADRS/cloudcast/initial_program.py \
benchmarks/ADRS/cloudcast/evaluator.py \
benchmarks/ADRS/cloudcast/eval \
-c benchmarks/ADRS/cloudcast/config.yaml \
-s [your_algorithm] \
-i 100
4 changes: 2 additions & 2 deletions benchmarks/ADRS/cloudcast/README.md
Original file line number Diff line number Diff line change
@@ -31,7 +31,7 @@ From the repo root:
```bash
uv run skydiscover-run \
benchmarks/ADRS/cloudcast/initial_program.py \
benchmarks/ADRS/cloudcast/evaluator.py \
benchmarks/ADRS/cloudcast/eval \
-c benchmarks/ADRS/cloudcast/config.yaml \
-s [your_algorithm] \
-i 100
@@ -42,7 +42,7 @@ uv run skydiscover-run \
| File | Description |
|------|-------------|
| `initial_program.py` | Baseline `search_algorithm` function to evolve |
| `evaluator.py` | Scores programs on total transfer cost across 5 network configs |
| `eval/` | Containerized evaluator — scores programs on total transfer cost across 5 network configs |
| `config.yaml` | Task-specific config (LLM, evaluator timeout, system prompt) |
| `simulator.py` | Broadcast cost simulator |
| `broadcast.py` | `BroadCastTopology` data structure |
2 changes: 1 addition & 1 deletion benchmarks/ADRS/cloudcast/config.yaml
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# CloudCast — Cloud Broadcast Optimization (NSDI'24)
# Usage: skydiscover-run initial_program.py evaluator.py -c config.yaml -s <strategy>
# Usage: skydiscover-run initial_program.py eval -c config.yaml -s <strategy>
language: python
diff_based_generation: true
max_iterations: 100
6 changes: 3 additions & 3 deletions benchmarks/ADRS/eplb/README.md
Original file line number Diff line number Diff line change
@@ -30,7 +30,7 @@ From the repo root:
```bash
uv run skydiscover-run \
benchmarks/ADRS/eplb/initial_program.py \
benchmarks/ADRS/eplb/evaluator.py \
benchmarks/ADRS/eplb/eval \
-c benchmarks/ADRS/eplb/config.yaml \
-s [your_algorithm] \
-i 100 \
@@ -40,7 +40,7 @@ uv run skydiscover-run \
Or from this directory:

```bash
uv run skydiscover-run initial_program.py evaluator.py \
uv run skydiscover-run initial_program.py eval \
-c config.yaml \
-s [your_algorithm] \
-i 100
@@ -57,7 +57,7 @@ python evaluate_best_program.py
| File | Description |
|------|-------------|
| `initial_program.py` | Baseline `rebalance_experts` function to evolve |
| `evaluator.py` | Scores programs on load-balance quality and execution speed |
| `eval/` | Containerized evaluator — scores programs on load-balance quality and execution speed |
| `config.yaml` | Task-specific config (LLM, evaluator timeout, system prompt) |
| `evaluate_best_program.py` | Standalone script to evaluate a saved best program |
| `expert-load.json` | Workload data (must be downloaded — see Setup) |
2 changes: 1 addition & 1 deletion benchmarks/ADRS/eplb/config.yaml
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# Expert Parallelism Load Balancer (EPLB) — MoE Expert Rearrangement
# Usage: skydiscover-run initial_program.py evaluator.py -c config.yaml -s <strategy>
# Usage: skydiscover-run initial_program.py eval -c config.yaml -s <strategy>
# NOTE: Requires expert-load.json — see README.md for download instructions.
language: python
diff_based_generation: true
4 changes: 2 additions & 2 deletions benchmarks/ADRS/llm_sql/README.md
Original file line number Diff line number Diff line change
@@ -31,7 +31,7 @@ From the repo root:
```bash
uv run skydiscover-run \
benchmarks/ADRS/llm_sql/initial_program.py \
benchmarks/ADRS/llm_sql/evaluator.py \
benchmarks/ADRS/llm_sql/eval \
-c benchmarks/ADRS/llm_sql/config.yaml \
-s [your_algorithm] \
-i 100
@@ -49,7 +49,7 @@ Combined score: `0.95 * average_hit_rate + 0.05 * (12 - min(12, avg_runtime)) /
| File | Description |
|------|-------------|
| `initial_program.py` | Baseline `Evolved` class with `reorder()` method to evolve |
| `evaluator.py` | Scores programs on prefix hit rate and runtime across 5 datasets |
| `eval/` | Containerized evaluator — scores programs on prefix hit rate and runtime across 5 datasets |
| `config.yaml` | Task-specific config (LLM, evaluator timeout, system prompt) |
| `solver.py` | Base `Algorithm` class and greedy baseline |
| `utils.py` | Prefix hit count evaluation utilities |
2 changes: 1 addition & 1 deletion benchmarks/ADRS/llm_sql/config.yaml
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# LLM SQL — Prompt Caching Column Reordering Optimization
# Usage: skydiscover-run initial_program.py evaluator.py -c config.yaml -s <strategy>
# Usage: skydiscover-run initial_program.py eval -c config.yaml -s <strategy>
language: python
diff_based_generation: true
max_iterations: 100
2 changes: 1 addition & 1 deletion benchmarks/ADRS/prism/config.yaml
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# Prism (GPU Model Placement) — Prompt Caching Column Reordering Optimization
# Usage: skydiscover-run initial_program.py evaluator.py -c config.yaml -s <strategy>
# Usage: skydiscover-run initial_program.py eval -c config.yaml -s <strategy>
language: python
diff_based_generation: true
max_iterations: 100
2 changes: 1 addition & 1 deletion benchmarks/ADRS/txn_scheduling/config.yaml
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# Transaction Scheduling — Minimize makespan for database workloads
# Usage: skydiscover-run initial_program.py evaluator.py -c config.yaml -s <strategy>
# Usage: skydiscover-run initial_program.py eval -c config.yaml -s <strategy>
language: python
diff_based_generation: true
max_iterations: 100
14 changes: 7 additions & 7 deletions benchmarks/README.md
Original file line number Diff line number Diff line change
@@ -31,13 +31,13 @@ export OPENAI_API_KEY="..."

# Containerized benchmark (recommended — evaluator runs in Docker)
uv run skydiscover-run benchmarks/math/circle_packing_rect/initial_program.py \
benchmarks/math/circle_packing_rect/evaluator \
benchmarks/math/circle_packing_rect/eval \
-c benchmarks/math/circle_packing_rect/config.yaml \
-s best_of_n -i 50

# Plain Python evaluator (runs on host)
uv run skydiscover-run benchmarks/math/circle_packing/initial_program.py \
benchmarks/math/circle_packing/evaluator.py \
benchmarks/math/circle_packing/eval \
-c benchmarks/math/circle_packing/config.yaml \
-s best_of_n -i 100
```
@@ -67,15 +67,15 @@ There are three ways to set up a benchmark: a **containerized evaluator** (recom
<task>/
├── initial_program.py # Starting solution
├── config.yaml # System prompt + search/evaluator settings
└── evaluator/ # Self-contained Docker benchmark
└── eval/ # Self-contained Docker benchmark
├── Dockerfile
├── evaluate.sh # Entrypoint (receives solution path + mode)
├── evaluator.py # Scoring logic
├── requirements.txt # Python dependencies
└── ... # Any other data/files the evaluator needs
```

The `evaluator/` directory is the Docker build context. Everything inside it gets copied into the image — data files, model weights, test fixtures, etc. SkyDiscover auto-detects this layout when `evaluation_file` points to a directory containing a `Dockerfile` and `evaluate.sh`.
The `eval/` directory is the Docker build context. Everything inside it gets copied into the image — data files, model weights, test fixtures, etc. SkyDiscover auto-detects this layout when `evaluation_file` points to a directory containing a `Dockerfile` and `evaluate.sh`.
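
To make the "build context" idea concrete, here is a minimal sketch of how a runner might assemble the Docker commands for such a directory. It is an assumption-laden illustration, not SkyDiscover's code: the helper names, the image tag, the `/solution` mount point, and the `"test"` mode string are all hypothetical. Only `docker build -t` and `docker run --rm -v` are standard Docker CLI usage.

```python
from pathlib import Path


def build_command(eval_dir: str, tag: str) -> list[str]:
    """Build an image using eval_dir as the Docker build context."""
    return ["docker", "build", "-t", tag, str(Path(eval_dir))]


def run_command(tag: str, solution_path: str, mode: str = "test") -> list[str]:
    """Run the image, mounting the candidate solution read-only.

    The arguments after the image tag are handed to the ENTRYPOINT
    (evaluate.sh), which is described as receiving the solution path and a mode.
    The mount location and the mode value are assumptions for illustration.
    """
    solution = Path(solution_path).resolve()
    in_container = f"/solution/{solution.name}"
    mount = f"{solution}:{in_container}:ro"
    return ["docker", "run", "--rm", "-v", mount, tag, in_container, mode]


if __name__ == "__main__":
    print(build_command("benchmarks/math/circle_packing_rect/eval", "circle-eval"))
    print(run_command("circle-eval", "initial_program.py"))
```

Because the whole directory is the build context, the `COPY . .` step in the Dockerfile can pick up data files and fixtures without any extra configuration.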

### Plain Python evaluator

@@ -154,7 +154,7 @@ ENTRYPOINT ["./evaluate.sh"]

If you have an existing `evaluate(program_path) -> dict` function, you can wrap it with the backwards-compatibility wrapper:

1. Copy `skydiscover/evaluation/wrapper.py` into your `evaluator/` directory.
1. Copy `skydiscover/evaluation/wrapper.py` into your `eval/` directory.
2. Add this to the bottom of your `evaluator.py`:

```python
@@ -167,11 +167,11 @@ The wrapper handles stdout redirection (so debug prints don't corrupt JSON), err

#### Running a containerized benchmark

Point `evaluation_file` at the `evaluator/` directory:
Point `evaluation_file` at the `eval/` directory:

```bash
skydiscover-run benchmarks/math/circle_packing_rect/initial_program.py \
benchmarks/math/circle_packing_rect/evaluator \
benchmarks/math/circle_packing_rect/eval \
-c benchmarks/math/circle_packing_rect/config.yaml \
-s best_of_n -i 50
```
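
For readers wondering what the backwards-compatibility wrapper mentioned earlier in this README does for an existing `evaluate(program_path) -> dict` function, the following sketch shows one way such a helper could behave. It is not the contents of `skydiscover/evaluation/wrapper.py`: the name `run_sketch`, its extra parameters, and the shape of the error payload are assumptions. Only the stdout redirection and the `overall_score` / `error` keys are taken from this PR.

```python
import contextlib
import io
import json
import sys


def run_sketch(evaluate, argv=None, out=None):
    """Call evaluate(program_path) and emit exactly one JSON object.

    Hypothetical stand-in for the real wrapper's run(): prints made by the
    evaluator are captured so they cannot corrupt the JSON written to `out`.
    """
    argv = sys.argv if argv is None else argv
    out = sys.stdout if out is None else out  # grab the real stdout first

    captured = io.StringIO()
    try:
        with contextlib.redirect_stdout(captured):
            result = evaluate(argv[1])
    except Exception as exc:
        # Assumed fallback shape, mirroring the evaluator's own error return.
        result = {"overall_score": 0.0, "error": str(exc)}

    json.dump(result, out)
    out.write("\n")
```

With a helper like this, an evaluator that prints debug output still produces a single parseable JSON line, and an exception becomes a zero score instead of a crashed container.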
4 changes: 2 additions & 2 deletions benchmarks/ale_bench/README.md
Original file line number Diff line number Diff line change
@@ -24,7 +24,7 @@ Run evolution on a single problem:
```bash
uv run skydiscover-run \
benchmarks/ale_bench/ale-bench-lite-problems/ahc025/initial_program.cpp \
benchmarks/ale_bench/ale-bench-lite-problems/ahc025/evaluator.py \
benchmarks/ale_bench/ale-bench-lite-problems/ahc025/eval \
-c benchmarks/ale_bench/ale-bench-lite-problems/ahc025/config.yaml \
--search evox \
-i 100
@@ -59,7 +59,7 @@ ale_bench/
├── ale-bench-lite-problems/
│ └── ahcXXX/
│ ├── initial_program.cpp # Starting C++ solution
│ ├── evaluator.py # Runs 50 public cases via ale_bench
│ ├── eval/ # Containerized evaluator (runs 50 public cases via ale_bench)
│ └── config.yaml # Search config (cpp, diff-based, 100 iterations)
├── ale_agent_best/
│ └── ahcXXX.cpp # Best known solutions (reference)
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# ALE-Bench ahc008 — AtCoder Heuristic Contest
# Usage: skydiscover-run initial_program.cpp evaluator.py -c config.yaml -s <strategy>
# Usage: skydiscover-run initial_program.cpp eval -c config.yaml -s <strategy>
language: cpp
diff_based_generation: true
max_iterations: 100
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
FROM python:3.12-slim

RUN apt-get update && apt-get install -y \
build-essential \
&& rm -rf /var/lib/apt/lists/*

WORKDIR /benchmark
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
RUN chmod +x evaluate.sh
ENTRYPOINT ["./evaluate.sh"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
#!/usr/bin/env bash
set -euo pipefail
python /benchmark/evaluator.py "$1"
Original file line number Diff line number Diff line change
@@ -62,4 +62,7 @@ def evaluate(program_path):
return {
"overall_score": 0.0,
"error": str(e),
}
}
if __name__ == "__main__":
from wrapper import run
run(evaluate)
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
ale-bench
ale-bench-eval
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# ALE-Bench ahc011 — AtCoder Heuristic Contest
# Usage: skydiscover-run initial_program.cpp evaluator.py -c config.yaml -s <strategy>
# Usage: skydiscover-run initial_program.cpp eval -c config.yaml -s <strategy>
language: cpp
diff_based_generation: true
max_iterations: 100
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
FROM python:3.12-slim

RUN apt-get update && apt-get install -y \
build-essential \
&& rm -rf /var/lib/apt/lists/*

WORKDIR /benchmark
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
RUN chmod +x evaluate.sh
ENTRYPOINT ["./evaluate.sh"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
#!/usr/bin/env bash
set -euo pipefail
python /benchmark/evaluator.py "$1"
Original file line number Diff line number Diff line change
@@ -62,4 +62,7 @@ def evaluate(program_path):
return {
"overall_score": 0.0,
"error": str(e),
}
}
if __name__ == "__main__":
from wrapper import run
run(evaluate)
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
ale-bench
ale-bench-eval
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# ALE-Bench ahc015 — AtCoder Heuristic Contest
# Usage: skydiscover-run initial_program.cpp evaluator.py -c config.yaml -s <strategy>
# Usage: skydiscover-run initial_program.cpp eval -c config.yaml -s <strategy>
language: cpp
diff_based_generation: true
max_iterations: 100
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
FROM python:3.12-slim

RUN apt-get update && apt-get install -y \
build-essential \
&& rm -rf /var/lib/apt/lists/*

WORKDIR /benchmark
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
RUN chmod +x evaluate.sh
ENTRYPOINT ["./evaluate.sh"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
#!/usr/bin/env bash
set -euo pipefail
python /benchmark/evaluator.py "$1"
Original file line number Diff line number Diff line change
@@ -62,4 +62,7 @@ def evaluate(program_path):
return {
"overall_score": 0.0,
"error": str(e),
}
}
if __name__ == "__main__":
from wrapper import run
run(evaluate)
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
ale-bench
ale-bench-eval
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# ALE-Bench ahc016 — AtCoder Heuristic Contest
# Usage: skydiscover-run initial_program.cpp evaluator.py -c config.yaml -s <strategy>
# Usage: skydiscover-run initial_program.cpp eval -c config.yaml -s <strategy>
language: cpp
diff_based_generation: true
max_iterations: 100
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
FROM python:3.12-slim

RUN apt-get update && apt-get install -y \
build-essential \
&& rm -rf /var/lib/apt/lists/*

WORKDIR /benchmark
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
RUN chmod +x evaluate.sh
ENTRYPOINT ["./evaluate.sh"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
#!/usr/bin/env bash
set -euo pipefail
python /benchmark/evaluator.py "$1"
Original file line number Diff line number Diff line change
@@ -62,4 +62,7 @@ def evaluate(program_path):
return {
"overall_score": 0.0,
"error": str(e),
}
}
if __name__ == "__main__":
from wrapper import run
run(evaluate)
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
ale-bench
ale-bench-eval
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# ALE-Bench ahc024 — AtCoder Heuristic Contest
# Usage: skydiscover-run initial_program.cpp evaluator.py -c config.yaml -s <strategy>
# Usage: skydiscover-run initial_program.cpp eval -c config.yaml -s <strategy>
language: cpp
diff_based_generation: true
max_iterations: 100
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
FROM python:3.12-slim

RUN apt-get update && apt-get install -y \
build-essential \
&& rm -rf /var/lib/apt/lists/*

WORKDIR /benchmark
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
RUN chmod +x evaluate.sh
ENTRYPOINT ["./evaluate.sh"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
#!/usr/bin/env bash
set -euo pipefail
python /benchmark/evaluator.py "$1"