English | 简体中文
Frontier-Eng is a benchmark for generative optimization: agents iteratively edit runnable engineering code, get feedback from frozen verifiers, and improve under a fixed interaction budget. The paper's central claim is simple: engineering performance is usually about the best design you can discover within budget, not average pass rate. Frontier-Eng therefore focuses on:
- continuous reward signals instead of binary grading
- real verifiers and simulators instead of judge models
- improvement trajectories instead of single-shot answers
The benchmark currently covers 47 tasks across computing, quantum information, operations research, robotics and control, optics and communications, and physical sciences. The project homepage and paper frame it as a missing evaluation axis between pass/fail coding benchmarks and real engineering work: most engineering problems start from a feasible baseline and reward iterative improvement, not one-shot correctness.
The repository-owned Python environments are automated, but a full v1 baseline sweep still assumes a Linux host with the following capabilities already in place:
| Requirement | Needed for | Notes |
|---|---|---|
| Standard build tools and outbound network access | all setup paths | required for uv, Python wheels, and benchmark-local downloads |
| NVIDIA GPU + working CUDA stack | KernelEngineering/*, Aerodynamics/CarAerodynamicsSensing, selected robotics tasks |
not needed for CPU-only tasks |
| Docker | EngDesign |
needed only for that task family |
| Octave | Astrodynamics/MannedLunarLanding |
host-level tool, not installed by uv |
| Extra datasets / checkpoints / third-party repos | selected tasks such as SustainableDataCenterControl, CarAerodynamicsSensing, MolecularMechanics |
see task README files for exact paths |
If you want the task-by-task preparation details, including host-tool installation notes, read run.md before launching the full v1 problem set.
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"From the repository root:
bash init.sh
source .venvs/frontier-eval-driver/bin/activateThis creates the driver environment used to run python -m frontier_eval.
python -m frontier_eval task=smoke algorithm=openevolve algorithm.iterations=0This does not need an API key and is the fastest way to confirm the framework itself is wired correctly.
There are two layers of environments in this repository:
frontier-eval-driver: the driver environment.venvs/<runtime-name>: task runtime environments such asfrontier-v1-main,frontier-v1-kernel, andfrontier-v1-summit
To create the v1 task runtime environments used by the released v1 problem set:
bash scripts/env/setup_v1_task_envs.shThe runtime selector now uses:
task.runtime.env_name=<name>to prepend.venvs/<name>/bintoPATHtask.runtime.python_path=uv-env:<name>when a task must call a runtime interpreter directly
No LLM key is needed for baseline-only validation:
python -m frontier_eval \
task=unified \
task.benchmark=WirelessChannelSimulation/HighReliableSimulation \
algorithm=openevolve \
algorithm.iterations=0To validate the released v1 problem set without any model API calls:
bash scripts/batch/validate_v1_task_envs.shThat command runs the batch config for the v1 problem set with algorithm.iterations=0, which evaluates each task's shipped baseline instead of asking an LLM to improve it.
If you want the full v1 problem set with normal optimization runs later, see run.md.
- Framework commands and task onboarding:
frontier_eval/README.md - Batch-running the released
v1problem set:run.md - Full task list:
TASK_DETAILS.md - Archived best solutions from published runs:
baseline_archive/README.md
Detailed leaderboard: lab.einsia.ai/frontier-eng/leaderboard.html
| Rank | Model | Average Rank |
|---|---|---|
| 1 | GPT-5.4 | 3.54 |
| 2 | Claude Opus 4.6 | 3.63 |
| 3 | GLM-5 | 4.34 |
| 4 | DeepSeek V3.2 | 4.76 |
| 5 | Gemini 3.1 Pro Preview | 5.53 |
| 6 | Grok 4.20 | 5.82 |
| 7 | SEED 2.0 Pro | 5.86 |
| 8 | Qwen3 Coder Next | 6.71 |
Contribution guidelines live in CONTRIBUTING.md.
@article{chi2026frontier,
title={Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization},
author={Chi, Yizhe and Hong, Deyao and Jiang, Dapeng and Luo, Tianwei and Yang, Kaisen and Zhang, Boshi and Cao, Zhe and Fan, Xiaoyan and He, Bingxiang and Hao, Han and others},
journal={arXiv preprint arXiv:2604.12290},
year={2026}
}