Frontier-Eng: running the benchmark

Chinese: run_zh-CN.md

Framework-level commands live in frontier_eval/README.md. This page focuses on the released v1 problem set and the operator workflow around it.

0. Host requirements and what this repo automates

This repository automates the Python environments it owns. Host-level tools such as Docker and Octave, and some third-party benchmark assets, are covered only by the bootstrap helpers described below. GPU drivers and CUDA are never installed for you.

For a full v1 baseline sweep, assume you need:

| Requirement | Needed for | How to verify |
| --- | --- | --- |
| Linux shell environment with standard build tools and outbound network access | all setup paths | `python3 --version`, `git --version`, `curl --version` |
| NVIDIA GPU plus working CUDA runtime | `KernelEngineering/*`, `Aerodynamics/CarAerodynamicsSensing`, selected robotics tasks | `nvidia-smi` and a successful CUDA PyTorch import |
| Docker | `EngDesign` | `docker version` |
| Octave | `Astrodynamics/MannedLunarLanding` | `octave --version` |
| Task-local assets / checkpoints / third-party repos | selected tasks | check the task README for exact files and paths |
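The command-line checks in the table can be scripted. The following is a minimal sketch and not part of the repository: the function name `check_host_tools` and the grouping labels are illustrative assumptions. It only confirms that each tool is on PATH; it does not prove that CUDA or the Docker daemon actually works.

```python
import shutil

# Tools taken from the "How to verify" column above. The grouping labels
# are illustrative and are not names used by the repository.
HOST_TOOLS = {
    "all setup paths": ["python3", "git", "curl"],
    "GPU tasks": ["nvidia-smi"],
    "EngDesign": ["docker"],
    "Astrodynamics/MannedLunarLanding": ["octave"],
}


def check_host_tools(tools=HOST_TOOLS, which=shutil.which):
    """Return {tool: resolved path or None} for every listed tool."""
    found = {}
    for names in tools.values():
        for name in names:
            found[name] = which(name)
    return found


if __name__ == "__main__":
    results = check_host_tools()
    for group, names in HOST_TOOLS.items():
        for name in names:
            status = results[name] or "MISSING"
            print(f"{group:35s} {name:12s} {status}")
```

A missing tool only matters if you plan to run the tasks in its group, so treat the output as a checklist rather than a hard gate.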

Install host tools when needed

Examples below assume Ubuntu or another Debian-like Linux distribution.

If you want the repository to drive the common apt-based setup for you, use:

bash scripts/bootstrap/install_host_deps.sh --octave
bash scripts/bootstrap/install_host_deps.sh --docker --configure-docker-group

Octave

sudo apt-get update
sudo apt-get install -y octave
octave --version

Docker

If your distro packages are sufficient for local evaluation:

sudo apt-get update
sudo apt-get install -y docker.io
sudo usermod -aG docker "$USER"
docker version

Group membership changes apply only to new login sessions, so start a new shell (or log out and back in) before running docker without sudo. If you need the full upstream setup instead, follow the official Docker Engine guide: https://docs.docker.com/engine/install/.
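To confirm that the current shell has actually picked up the docker group, a small check like the one below can help. This is a sketch using the standard `grp` and `os` modules on Linux; the function name `in_group` is an illustrative assumption.

```python
import grp
import os


def in_group(name="docker"):
    """True if the current process already runs with the named group."""
    try:
        gid = grp.getgrnam(name).gr_gid
    except KeyError:
        # The group does not exist on this host at all.
        return False
    # os.getgroups() reflects the running session, not /etc/group, so it
    # stays stale until a new login shell is started after usermod.
    return gid in os.getgroups() or gid == os.getegid()


if __name__ == "__main__":
    if in_group("docker"):
        print("docker group is active in this session")
    else:
        print("docker group not active; start a new login shell or use sudo")
```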

CUDA / NVIDIA stack

This repository does not install GPU drivers or CUDA for you. Before running GPU-required tasks, verify:

nvidia-smi
python3 - <<'PY'
import torch
print(torch.cuda.is_available())
PY

If those checks fail, install a working NVIDIA driver / CUDA stack first, following your cluster or machine's standard setup procedure.
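The heredoc above raises an ImportError when PyTorch is not installed in the interpreter you happen to call. As a hedged alternative, the sketch below reports each layer separately instead of failing at the first missing piece. The function name `cuda_status` and the returned keys are assumptions made for illustration.

```python
import importlib.util
import shutil


def cuda_status():
    """Summarize the GPU stack without raising if pieces are missing."""
    status = {
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        "device_count": 0,
    }
    if status["torch_installed"]:
        try:
            import torch  # imported lazily so the check works without PyTorch

            status["cuda_available"] = bool(torch.cuda.is_available())
            if status["cuda_available"]:
                status["device_count"] = int(torch.cuda.device_count())
        except Exception as exc:  # a broken install should be reported, not fatal
            status["error"] = repr(exc)
    return status


if __name__ == "__main__":
    for key, value in cuda_status().items():
        print(f"{key}: {value}")
```

Run it with the interpreter that will execute the GPU tasks, since PyTorch may be installed in one environment and absent from another.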

External assets that are still manual

The repo now includes an asset bootstrap helper for benchmarks and optional algorithm repos:

python scripts/bootstrap/fetch_task_assets.py --list
python scripts/bootstrap/fetch_task_assets.py --target v1-baseline-assets
python scripts/bootstrap/fetch_task_assets.py --target shinkaevolve
python scripts/bootstrap/fetch_task_assets.py --target abmcts

v1-baseline-assets covers the benchmark-side bundles that are currently automated: PhySense assets for Aerodynamics/CarAerodynamicsSensing, the OpenProblems cache for SingleCellAnalysis/perturbation_prediction, and the upstream dc-rl checkout for SustainableDataCenterControl when the vendored tree is absent. For exact files and paths, and for anything the helper does not cover, use the task documentation listed below.

| Requirement | Affected tasks | Where to look |
| --- | --- | --- |
| `dc-rl` checkout and SustainDC assets | `SustainableDataCenterControl` | `benchmarks/SustainableDataCenterControl/README.md` and `hand_written_control/README.md` |
| PhySense dataset, checkpoints, and reference points | `Aerodynamics/CarAerodynamicsSensing` | `benchmarks/Aerodynamics/CarAerodynamicsSensing/README.md` |
| OpenProblems NeurIPS 2023 data cache | `SingleCellAnalysis/perturbation_prediction` | `benchmarks/SingleCellAnalysis/perturbation_prediction/README.md` |
| `openff-dev` runtime | `MolecularMechanics/*` | `bash scripts/bootstrap/install_openff_dev.sh`, then see `benchmarks/MolecularMechanics/README.md` |

1. Prepare the environments

From the repo root:

bash init.sh
bash scripts/env/setup_v1_task_envs.sh
source .venvs/frontier-eval-driver/bin/activate

By default, scripts/env/setup_v1_task_envs.sh does more than create Python environments for the released v1 problem set. It:

- creates the repo-owned uv environments
- installs common host tools with scripts/bootstrap/install_host_deps.sh
- fetches the current v1-baseline-assets bundle
- installs .venvs/openff-dev

So this step may require sudo, large downloads, and more wall-clock time than a plain Python environment setup.

That gives you:

- .venvs/frontier-eval-driver for the driver
- .venvs/frontier-v1-main for most CPU tasks
- .venvs/frontier-v1-summit for ReactionOptimisation/*
- .venvs/frontier-v1-sustaindc for SustainableDataCenterControl/*
- .venvs/frontier-v1-kernel for kernel/GPU runtimes
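A quick way to confirm that setup finished is to look for the interpreter in each environment. The sketch below assumes every environment follows the same layout as the driver (`.venvs/<name>/bin/python`), which this page states only for frontier-eval-driver. The function name `missing_envs` is illustrative.

```python
from pathlib import Path

# Environment names as listed above.
EXPECTED_ENVS = [
    "frontier-eval-driver",
    "frontier-v1-main",
    "frontier-v1-summit",
    "frontier-v1-sustaindc",
    "frontier-v1-kernel",
]


def missing_envs(repo_root=".", envs=EXPECTED_ENVS):
    """Return the environments whose bin/python is absent under .venvs/."""
    root = Path(repo_root) / ".venvs"
    return [name for name in envs if not (root / name / "bin" / "python").exists()]


if __name__ == "__main__":
    missing = missing_envs()
    print("all environments present" if not missing else f"missing: {missing}")
```

Run it from the repo root, or pass the repo path as the first argument to `missing_envs`.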

Before longer runs:

export PYTHONNOUSERSITE=1
export PYTHONUTF8=1
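Both variables are read once, at interpreter startup: PYTHONNOUSERSITE keeps the per-user site-packages directory off sys.path, and PYTHONUTF8=1 enables Python's UTF-8 mode. A small sketch for confirming that a freshly started interpreter picked them up (the function name `isolation_flags` is illustrative):

```python
import sys


def isolation_flags():
    """Report whether this interpreter started with the two settings active."""
    return {
        # True when PYTHONNOUSERSITE is set (or -s was passed):
        # the per-user site-packages directory is not added to sys.path.
        "no_user_site": bool(sys.flags.no_user_site),
        # True when UTF-8 mode is on (PYTHONUTF8=1 or -X utf8).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


if __name__ == "__main__":
    for name, active in isolation_flags().items():
        print(f"{name}: {'on' if active else 'off'}")
```

Exporting the variables inside an already running Python process has no effect on that process, so run the check in a new interpreter after the exports.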

2. Configure model access

Optimization runs need a working .env:

cp .env.example .env

Set at least:

- OPENAI_API_KEY (required)
- OPENAI_API_BASE (optional)
- OPENAI_MODEL (optional)

Baseline-only validation does not need an API key as long as you run with algorithm.iterations=0.
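Before starting an optimization run, it can be useful to confirm that .env actually defines the key. The sketch below assumes .env holds plain KEY=VALUE lines; how the framework itself loads the file is not described here, so treat this as a pre-flight check only. The helper names are illustrative.

```python
from pathlib import Path


def read_env_file(text):
    """Parse simple KEY=VALUE lines; skips blanks and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_required(values, required=("OPENAI_API_KEY",)):
    """Return required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text(encoding="utf-8") if env_path.exists() else ""
    missing = missing_required(read_env_file(text))
    print("ok" if not missing else f"missing in .env: {missing}")
```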

3. Run the released `v1` problem set

Standard batch run

bash scripts/batch/run_v1_batch.sh

This launches:

python -m frontier_eval.batch --matrix frontier_eval/conf/batch/v1.yaml

through the driver interpreter in .venvs/frontier-eval-driver.

Useful variants:

bash scripts/batch/run_v1_batch.sh --dry-run
bash scripts/batch/run_v1_batch.sh --tasks KernelEngineering/MLA
bash scripts/batch/run_v1_batch.sh --exclude-tasks engdesign
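If you launch batches from another Python tool, the variants above can be assembled programmatically. This sketch uses only the three flags shown on this page. Whether they can be combined in a single invocation is an assumption, and `batch_command` is an illustrative helper rather than a repository function.

```python
import shlex

BATCH_SCRIPT = "scripts/batch/run_v1_batch.sh"


def batch_command(tasks=None, exclude_tasks=None, dry_run=False):
    """Build the argv for run_v1_batch.sh using only the flags shown above."""
    cmd = ["bash", BATCH_SCRIPT]
    if dry_run:
        cmd.append("--dry-run")
    if tasks:
        cmd += ["--tasks", tasks]
    if exclude_tasks:
        cmd += ["--exclude-tasks", exclude_tasks]
    return cmd


if __name__ == "__main__":
    cmd = batch_command(tasks="KernelEngineering/MLA", dry_run=True)
    # Print rather than execute, so the sketch is safe to run anywhere.
    print(shlex.join(cmd))
    # To launch it for real from the repo root:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Starting with --dry-run is a cheap way to see what a filter selects before committing GPU time.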

Baseline-only validation

To verify the shipped baselines without any LLM calls:

bash scripts/batch/validate_v1_task_envs.sh

This runs the batch config for the released v1 problem set with algorithm.iterations=0 and splits validation into CPU, GPU, kernel, and engdesign subsets.

Important: this command validates the tasks only after their host prerequisites and external assets are already in place. It is not a promise that a fresh machine with only uv installed will automatically pass every task.

4. Important runtime knobs

- CUDA_VISIBLE_DEVICES: selects the GPU for GPU-heavy tasks
- GPU_DEVICES: GPU id used by scripts/batch/validate_v1_task_envs.sh
- DRIVER_ENV: defaults to frontier-eval-driver
- DRIVER_PY: explicit path to the driver Python, if you do not want the default .venvs/frontier-eval-driver/bin/python
- V1_MATRIX: overrides the matrix path
- ENGDESIGN_EVAL_MODE, ENGDESIGN_DOCKER_IMAGE: see benchmarks/EngDesign/README.md
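These knobs are ordinary environment variables, so a misspelled name is silently ignored. The sketch below builds an environment for a child process and rejects names that are not in the list above. `with_knobs` is an illustrative helper, and which script reads which variable is as described above, not verified here.

```python
import os

# Variable names copied from the list above.
KNOWN_KNOBS = {
    "CUDA_VISIBLE_DEVICES",
    "GPU_DEVICES",
    "DRIVER_ENV",
    "DRIVER_PY",
    "V1_MATRIX",
    "ENGDESIGN_EVAL_MODE",
    "ENGDESIGN_DOCKER_IMAGE",
}


def with_knobs(base=None, **knobs):
    """Return a copy of the environment with the given knobs applied."""
    unknown = sorted(set(knobs) - KNOWN_KNOBS)
    if unknown:
        raise ValueError(f"unknown runtime knob(s): {unknown}")
    env = dict(os.environ if base is None else base)
    env.update({name: str(value) for name, value in knobs.items()})
    return env


if __name__ == "__main__":
    env = with_knobs(GPU_DEVICES=0)
    print("GPU_DEVICES =", env["GPU_DEVICES"])
    # To run validation with this environment from the repo root:
    #   import subprocess
    #   subprocess.run(["bash", "scripts/batch/validate_v1_task_envs.sh"],
    #                  env=env, check=True)
```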

5. What a successful baseline sweep does and does not prove

A baseline-only run is valuable because it verifies:

- the Hydra config resolves correctly
- the benchmark runtime starts
- the evaluator can execute the shipped baseline
- metrics.json / artifacts.json handling is wired correctly

It does not prove that every benchmark is fully self-contained on a fresh machine. Some tasks still require external assets, Docker, Octave, CUDA, or benchmark-local data before the baseline can run successfully.

6. Output locations

Batch results are written under:

runs/batch/<run.name>/

Validation runs use:

runs/batch_validation/

Each task gets its own output directory, and aggregated summaries are stored in summary.jsonl.
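For a quick look at results, summary.jsonl can be read one JSON object per line. The sketch below makes no assumption about field names, which are not documented on this page, and it searches recursively because the exact location of summary.jsonl inside a run directory is also an assumption. `load_summaries` is an illustrative helper.

```python
import json
from pathlib import Path


def load_summaries(root="runs/batch"):
    """Yield (path, record) for each JSON line in any summary.jsonl under root."""
    root = Path(root)
    if not root.is_dir():
        return
    for path in sorted(root.rglob("summary.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # tolerate blank lines between records
                    yield path, json.loads(line)


if __name__ == "__main__":
    records = list(load_summaries())
    print(f"{len(records)} summary record(s) found")
    for path, record in records[:5]:
        # Show only the keys, since the schema is not specified here.
        keys = sorted(record) if isinstance(record, dict) else type(record).__name__
        print(path, keys)
```

Pass "runs/batch_validation" as the root to inspect validation runs instead.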
