Chinese: run_zh-CN.md
Framework-level commands live in frontier_eval/README.md. This page focuses on the released v1 problem set and the operator workflow around it.
This repository automates the Python environments it owns. It does not install host-level tools, GPU drivers, Docker, or large third-party benchmark assets for you.
For a full v1 baseline sweep, assume you need:
| Requirement | Needed for | How to verify |
|---|---|---|
| Linux shell environment with standard build tools and outbound network access | all setup paths | python3 --version, git --version, curl --version |
| NVIDIA GPU plus working CUDA runtime | KernelEngineering/*, Aerodynamics/CarAerodynamicsSensing, selected robotics tasks | nvidia-smi and a successful CUDA PyTorch import |
| Docker | EngDesign | docker version |
| Octave | Astrodynamics/MannedLunarLanding | octave --version |
| Task-local assets / checkpoints / third-party repos | selected tasks | check the task README for exact files and paths |
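The "How to verify" column can also be checked in one pass. The following is a minimal sketch, not part of the repository: the function name `missing_host_tools` and the grouping of tools by purpose are illustrative assumptions, and it only checks that each executable is on `PATH`, not that it works.

```python
import shutil

# Tool names are taken from the "How to verify" column above.
# The grouping keys are illustrative labels, not repository identifiers.
HOST_TOOLS = {
    "all setup paths": ["python3", "git", "curl"],
    "GPU tasks": ["nvidia-smi"],
    "EngDesign": ["docker"],
    "Astrodynamics/MannedLunarLanding": ["octave"],
}


def missing_host_tools(tools=HOST_TOOLS, which=shutil.which):
    """Return {purpose: [executables not found on PATH]} for purposes with gaps."""
    missing = {}
    for purpose, names in tools.items():
        absent = [name for name in names if which(name) is None]
        if absent:
            missing[purpose] = absent
    return missing


if __name__ == "__main__":
    for purpose, names in missing_host_tools().items():
        print(f"missing for {purpose}: {', '.join(names)}")
```

An empty result only means the executables were found. The CUDA PyTorch import and task-local assets still need their own checks.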
Examples below assume Ubuntu or another Debian-like Linux distribution.
If you want the repository to drive the common apt-based setup for you, use:
bash scripts/bootstrap/install_host_deps.sh --octave
bash scripts/bootstrap/install_host_deps.sh --docker --configure-docker-group

To install Octave manually instead:

sudo apt-get update
sudo apt-get install -y octave
octave --version

To install Docker manually, if your distro packages are sufficient for local evaluation:

sudo apt-get update
sudo apt-get install -y docker.io
sudo usermod -aG docker "$USER"
docker version

Start a new shell after changing group membership. If you need the full upstream setup instead, follow the official Docker Engine guide: docs.docker.com/engine/install.
This repository does not install GPU drivers or CUDA for you. Before running GPU-required tasks, verify:
nvidia-smi
python3 - <<'PY'
import torch
print(torch.cuda.is_available())
PY

If those checks fail, install a working NVIDIA driver / CUDA stack first, following your cluster or machine's standard setup procedure.
The repo now includes an asset bootstrap helper for benchmarks and optional algorithm repos:
python scripts/bootstrap/fetch_task_assets.py --list
python scripts/bootstrap/fetch_task_assets.py --target v1-baseline-assets
python scripts/bootstrap/fetch_task_assets.py --target shinkaevolve
python scripts/bootstrap/fetch_task_assets.py --target abmcts

v1-baseline-assets covers the currently automated benchmark-side bundles, such as PhySense assets for Aerodynamics/CarAerodynamicsSensing, the OpenProblems cache for SingleCellAnalysis/perturbation_prediction, and the upstream dc-rl checkout path for SustainableDataCenterControl when that vendored tree is absent.
| Requirement | Affected tasks | Where to look |
|---|---|---|
| dc-rl checkout and SustainDC assets | SustainableDataCenterControl | benchmarks/SustainableDataCenterControl/README.md and hand_written_control/README.md |
| PhySense dataset, checkpoints, and reference points | Aerodynamics/CarAerodynamicsSensing | benchmarks/Aerodynamics/CarAerodynamicsSensing/README.md |
| OpenProblems NeurIPS 2023 data cache | SingleCellAnalysis/perturbation_prediction | benchmarks/SingleCellAnalysis/perturbation_prediction/README.md |
| openff-dev runtime | MolecularMechanics/* | bash scripts/bootstrap/install_openff_dev.sh, then see benchmarks/MolecularMechanics/README.md |
From the repo root:
bash init.sh
bash scripts/env/setup_v1_task_envs.sh
source .venvs/frontier-eval-driver/bin/activate

scripts/env/setup_v1_task_envs.sh now bootstraps the released v1 problem set more aggressively by default:
- creates the repo-owned uv environments
- installs common host tools with scripts/bootstrap/install_host_deps.sh
- fetches the current v1-baseline-assets bundle
- installs .venvs/openff-dev
So this step may require sudo, large downloads, and more wall-clock time than a plain Python environment setup.
That gives you:
- .venvs/frontier-eval-driver for the driver
- .venvs/frontier-v1-main for most CPU tasks
- .venvs/frontier-v1-summit for ReactionOptimisation/*
- .venvs/frontier-v1-sustaindc for SustainableDataCenterControl/*
- .venvs/frontier-v1-kernel for kernel/GPU runtimes
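To confirm that setup produced the environments listed above, a short check like the following may help. It is a sketch under assumptions: the helper name `missing_envs` is hypothetical, and it assumes every environment exposes its interpreter at `bin/python`, which this page only states for the driver environment.

```python
from pathlib import Path

# Environment names as listed above.
V1_ENVS = [
    "frontier-eval-driver",
    "frontier-v1-main",
    "frontier-v1-summit",
    "frontier-v1-sustaindc",
    "frontier-v1-kernel",
]


def missing_envs(repo_root=".", envs=V1_ENVS):
    """Return the environment names whose interpreter is not present.

    Assumes the layout <repo_root>/.venvs/<name>/bin/python for every
    environment; adjust if your checkout differs.
    """
    venv_root = Path(repo_root) / ".venvs"
    return [name for name in envs if not (venv_root / name / "bin" / "python").exists()]


if __name__ == "__main__":
    for name in missing_envs():
        print(f"environment not found: .venvs/{name}")
```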
Before longer runs:
export PYTHONNOUSERSITE=1
export PYTHONUTF8=1

Optimization runs need a working .env:

cp .env.example .env

Set at least:
- OPENAI_API_KEY
- optionally OPENAI_API_BASE
- optionally OPENAI_MODEL
Baseline-only validation does not need an API key as long as you run with algorithm.iterations=0.
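The rule above (an API key is required only when `algorithm.iterations` is greater than zero) can be expressed as a small pre-run check. This is an illustrative sketch, not how the framework itself loads `.env`: the helper names are hypothetical and the parser handles only plain `KEY=value` lines.

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=value lines; comments and blank lines are skipped.

    Simplified on purpose: no `export` prefixes, no multi-line values.
    """
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def ready_for_run(path=".env", iterations=1):
    """Return True if the run can start with respect to the API key."""
    if iterations == 0:
        # Baseline-only validation makes no LLM calls, so no key is needed.
        return True
    env_path = Path(path)
    if not env_path.is_file():
        return False
    return bool(read_env_file(env_path).get("OPENAI_API_KEY"))
```

For example, `ready_for_run(".env", iterations=0)` is true even when no `.env` exists, while any positive iteration count requires a non-empty `OPENAI_API_KEY`.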
bash scripts/batch/run_v1_batch.sh

This launches:

python -m frontier_eval.batch --matrix frontier_eval/conf/batch/v1.yaml

through the driver interpreter in .venvs/frontier-eval-driver.
Useful variants:
bash scripts/batch/run_v1_batch.sh --dry-run
bash scripts/batch/run_v1_batch.sh --tasks KernelEngineering/MLA
bash scripts/batch/run_v1_batch.sh --exclude-tasks engdesign

To verify the shipped baselines without any LLM calls:

bash scripts/batch/validate_v1_task_envs.sh

This runs the batch config for the released v1 problem set with algorithm.iterations=0 and splits validation into CPU, GPU, kernel, and engdesign subsets.
Important: this command validates the tasks only after their host prerequisites and external assets are already in place. It is not a promise that a fresh machine with only uv installed will automatically pass every task.
- CUDA_VISIBLE_DEVICES: select the GPU for GPU-heavy tasks
- GPU_DEVICES: GPU id used by scripts/batch/validate_v1_task_envs.sh
- DRIVER_ENV: defaults to frontier-eval-driver
- DRIVER_PY: explicit path to the driver Python if you do not want to use the default .venvs/frontier-eval-driver/bin/python
- V1_MATRIX: override the matrix path
- ENGDESIGN_EVAL_MODE, ENGDESIGN_DOCKER_IMAGE: see benchmarks/EngDesign/README.md
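The interaction between `DRIVER_PY` and `DRIVER_ENV` can be pictured roughly as below. This is a sketch of one plausible resolution order, not the batch scripts' actual logic: the function name is hypothetical, and the assumption that `DRIVER_PY` takes precedence over `DRIVER_ENV` is inferred from the descriptions above rather than confirmed.

```python
import os
from pathlib import Path


def resolve_driver_python(environ=None, repo_root="."):
    """Pick the driver interpreter from environment overrides.

    Assumed precedence (not confirmed by the scripts themselves):
    1. DRIVER_PY, an explicit interpreter path
    2. .venvs/<DRIVER_ENV>/bin/python, with DRIVER_ENV defaulting to
       frontier-eval-driver
    """
    env = os.environ if environ is None else environ
    explicit = env.get("DRIVER_PY")
    if explicit:
        return Path(explicit)
    env_name = env.get("DRIVER_ENV", "frontier-eval-driver")
    return Path(repo_root) / ".venvs" / env_name / "bin" / "python"
```

With no overrides set, this returns `.venvs/frontier-eval-driver/bin/python` relative to the repo root, matching the default named above.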
A baseline-only run is valuable because it verifies:
- the Hydra config resolves correctly
- the benchmark runtime starts
- the evaluator can execute the shipped baseline
- metrics.json / artifacts.json handling is wired correctly
It does not prove that every benchmark is fully self-contained on a fresh machine. Some tasks still require external assets, Docker, Octave, CUDA, or benchmark-local data before the baseline can run successfully.
Batch results are written under:
runs/batch/<run.name>/
Validation runs use:
runs/batch_validation/
Each task gets its own output directory, and aggregated summaries are stored in summary.jsonl.
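Because `summary.jsonl` is a JSON Lines file, it can be inspected with a few lines of Python. The sketch below makes no assumption about the record schema and simply reports which keys each record carries. The helper name `load_summary` and the example path are hypothetical, and the location of `summary.jsonl` directly inside the run directory is an assumption.

```python
import json
from pathlib import Path


def load_summary(path):
    """Load a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Hypothetical path; substitute your own run name and check the location.
    summary_path = Path("runs/batch_validation/summary.jsonl")
    if summary_path.is_file():
        for index, record in enumerate(load_summary(summary_path)):
            keys = sorted(record) if isinstance(record, dict) else type(record).__name__
            print(index, keys)
```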