Skip to content

lin-shui/Frontier-Engineering

 
 

Repository files navigation

Frontier-Eng

English | 简体中文

Homepage arXiv Discord

Frontier-Eng is a benchmark for generative optimization: agents iteratively edit runnable engineering code, get feedback from frozen verifiers, and improve under a fixed interaction budget. The paper's central claim is simple: engineering performance is usually about the best design you can discover within budget, not average pass rate. Frontier-Eng therefore focuses on:

  • continuous reward signals instead of binary grading
  • real verifiers and simulators instead of judge models
  • improvement trajectories instead of single-shot answers

The benchmark currently covers 47 tasks across computing, quantum information, operations research, robotics and control, optics and communications, and physical sciences. The project homepage and paper frame it as a missing evaluation axis between pass/fail coding benchmarks and real engineering work: most engineering problems start from a feasible baseline and reward iterative improvement, not one-shot correctness.

0. Host Requirements

The repository-owned Python environments are automated, but a full v1 baseline sweep still assumes a Linux host with the following capabilities already in place:

Requirement Needed for Notes
Standard build tools and outbound network access all setup paths required for uv, Python wheels, and benchmark-local downloads
NVIDIA GPU + working CUDA stack KernelEngineering/*, Aerodynamics/CarAerodynamicsSensing, selected robotics tasks not needed for CPU-only tasks
Docker EngDesign needed only for that task family
Octave Astrodynamics/MannedLunarLanding host-level tool, not installed by uv
Extra datasets / checkpoints / third-party repos selected tasks such as SustainableDataCenterControl, CarAerodynamicsSensing, MolecularMechanics see task README files for exact paths

If you want the task-by-task preparation details, including host-tool installation notes, read run.md before launching the full v1 problem set.

Quickstart

1. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

2. Create the driver environment

From the repository root:

bash init.sh
source .venvs/frontier-eval-driver/bin/activate

This creates the driver environment used to run python -m frontier_eval.

3. Run a smoke test

python -m frontier_eval task=smoke algorithm=openevolve algorithm.iterations=0

This does not need an API key and is the fastest way to confirm the framework itself is wired correctly.

Running Real Benchmarks

There are two layers of environments in this repository:

  • frontier-eval-driver: the driver environment
  • .venvs/<runtime-name>: task runtime environments such as frontier-v1-main, frontier-v1-kernel, and frontier-v1-summit

To create the v1 task runtime environments used by the released v1 problem set:

bash scripts/env/setup_v1_task_envs.sh

The runtime selector now uses:

  • task.runtime.env_name=<name> to prepend .venvs/<name>/bin to PATH
  • task.runtime.python_path=uv-env:<name> when a task must call a runtime interpreter directly

Single-task baseline run

No LLM key is needed for baseline-only validation:

python -m frontier_eval \
  task=unified \
  task.benchmark=WirelessChannelSimulation/HighReliableSimulation \
  algorithm=openevolve \
  algorithm.iterations=0

Full v1 baseline sweep

To validate the released v1 problem set without any model API calls:

bash scripts/batch/validate_v1_task_envs.sh

That command runs the batch config for the v1 problem set with algorithm.iterations=0, which evaluates each task's shipped baseline instead of asking an LLM to improve it.

If you want the full v1 problem set with normal optimization runs later, see run.md.

Where To Go Next

Leaderboard

Detailed leaderboard: lab.einsia.ai/frontier-eng/leaderboard.html

Rank Model Average Rank
1 GPT-5.4 3.54
2 Claude Opus 4.6 3.63
3 GLM-5 4.34
4 DeepSeek V3.2 4.76
5 Gemini 3.1 Pro Preview 5.53
6 Grok 4.20 5.82
7 SEED 2.0 Pro 5.86
8 Qwen3 Coder Next 6.71

Contributing

Contribution guidelines live in CONTRIBUTING.md.

Citation

@article{chi2026frontier,
  title={Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization},
  author={Chi, Yizhe and Hong, Deyao and Jiang, Dapeng and Luo, Tianwei and Yang, Kaisen and Zhang, Boshi and Cao, Zhe and Fan, Xiaoyan and He, Bingxiang and Hao, Han and others},
  journal={arXiv preprint arXiv:2604.12290},
  year={2026}
}

About

No description, website, or topics provided.

Resources

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 65.5%
  • KCL 28.4%
  • C++ 3.4%
  • C 1.1%
  • MATLAB 0.7%
  • Shell 0.5%
  • Other 0.4%