diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/README.md b/benchmarks/ComputerSystems/DuckDBIndexSelection/README.md
new file mode 100644
index 00000000..902079ad
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/README.md
@@ -0,0 +1,55 @@
+# DuckDB Index Selection
+
+Choose a subset of whitelisted DuckDB indexes for a frozen analytical lookup workload so that total runtime (index build plus workload execution) is minimized.
+
+## Why This Benchmark Matters
+
+This benchmark captures physical-design tuning on top of a stable DuckDB workload. Extra indexes can speed repeated queries, but they also add build and maintenance cost, so the goal is to choose the right subset rather than simply choosing more.
+
+Viewed computationally, this is a subset-selection problem over a fixed candidate set, with real execution time rather than a proxy metric deciding the score.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `select_indexes(workload_manifest)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/ComputerSystems/DuckDBIndexSelection/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/ComputerSystems/DuckDBIndexSelection/verification/evaluator.py \
+  benchmarks/ComputerSystems/DuckDBIndexSelection/scripts/init.py \
+  --metrics-out /tmp/DuckDBIndexSelection_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=ComputerSystems/DuckDBIndexSelection \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
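Note on the README above: its framing of the task as subset selection over a fixed candidate set can be made concrete. The manifest added later in this diff lists five candidate indexes, so there are 2^5 = 32 possible selections. The sketch below enumerates them. `all_subsets` is a hypothetical helper, not part of the benchmark, and nothing in the task files says whether timing every subset inside `select_indexes` is permitted.

```python
from itertools import combinations

# Candidate names copied from INDEX_CANDIDATES in runtime/duckdb_local_workload.py.
CANDIDATES = (
    "idx_customer_segment",
    "idx_lineitem_order",
    "idx_orders_cust",
    "idx_orders_date",
    "idx_orders_priority",
)


def all_subsets(names):
    """Yield every subset of `names` as a list, smallest subsets first."""
    ordered = sorted(names)
    for size in range(len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield list(combo)


subsets = list(all_subsets(CANDIDATES))
print(len(subsets))
```

Because the search space is this small, the hard part is likely timing noise between runs rather than the search itself; that is an observation about the setup, not a measured result.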
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/README_zh-CN.md b/benchmarks/ComputerSystems/DuckDBIndexSelection/README_zh-CN.md
new file mode 100644
index 00000000..f19c4d53
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/README_zh-CN.md
@@ -0,0 +1,55 @@
+# DuckDB 索引选择
+
+在冻结的 DuckDB 分析 workload 上，从白名单中挑选一组索引，尽量降低总执行时间。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 对应的是稳定 DuckDB workload 上的物理设计调优。额外索引确实可能加速重复查询，但也会带来构建和维护成本，所以关键不是“建得越多越好”，而是选对那一小部分。
+
+从计算角度看，这是一道固定候选集合上的子集选择题，而且分数来自真实执行时间，不是某个替代指标。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`select_indexes(workload_manifest)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/ComputerSystems/DuckDBIndexSelection/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/ComputerSystems/DuckDBIndexSelection/verification/evaluator.py \
+  benchmarks/ComputerSystems/DuckDBIndexSelection/scripts/init.py \
+  --metrics-out /tmp/DuckDBIndexSelection_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=ComputerSystems/DuckDBIndexSelection \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/Task.md b/benchmarks/ComputerSystems/DuckDBIndexSelection/Task.md
new file mode 100644
index 00000000..6d19deb2
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/Task.md
@@ -0,0 +1,51 @@
+# DuckDB Index Selection Task
+
+## Problem
+
+Choose a subset of whitelisted DuckDB indexes for a frozen analytical lookup workload so that total runtime (index build plus workload execution) is minimized.
+
+This benchmark captures physical-design tuning on top of a stable DuckDB workload. Extra indexes can speed repeated queries, but they also add build and maintenance cost, so the goal is to choose the right subset rather than simply choosing more.
+
+Viewed computationally, this is a subset-selection problem over a fixed candidate set, with real execution time rather than a proxy metric deciding the score.
+
+## What Is Frozen
+
+- The schema, local data generator, and workload manifest in `runtime/problem.py`.
+- The whitelist of legal index names in `workload_manifest["candidate_indexes"]`.
+- The timing protocol: timed index build, one untimed warm-up pass of the workload, then four timed workload executions.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def select_indexes(workload_manifest):
+    ...
+```
+
+Return a list or tuple of whitelisted index names; duplicate names are dropped. A dict whose `indexes` key holds such a list is also accepted.
+
+## Evaluation
+
+1. Build the frozen DuckDB database and load the manifest.
+2. Create the indexes you selected from the whitelist.
+3. Run the fixed lookup workload once as an untimed warm-up, then four more times with timing.
+4. Measure total candidate runtime (index build plus the four timed passes) and report no-index baseline numbers for context.
+
+## Metrics
+
+- `combined_score`: `-candidate_total_runtime_s`
+- `valid`: `1.0` only if every selected index name is legal and execution succeeds
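+- `candidate_total_runtime_s`: index build time plus timed workload time for the submission
+- `baseline_total_runtime_s`: the same total for the reference baseline, which selects no indexes
+- `candidate_setup_runtime_s`: time spent creating the selected indexes
+- `candidate_workload_runtime_s`: time spent in the four timed workload passes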
+
+## Invalid Submissions
+
+- `select_indexes(...)` is missing or crashes
+- The return value cannot be parsed into a list of names
+- Any selected name is outside the whitelist
+- Index creation or workload execution fails
+
+<!-- AI_GENERATED -->
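Note on the submission contract in Task.md above: a minimal sketch of a conforming candidate, assuming the manifest exposes `candidate_indexes` as shown later in this diff. The `PREFERRED` tuple is a hypothetical guess based on the workload's equality lookups on `o_custkey` and `l_orderkey`. It has not been measured and may not beat the empty baseline.

```python
# Hypothetical preference list: a guess from the workload notes, not a measured result.
PREFERRED = ("idx_orders_cust", "idx_lineitem_order")


def select_indexes(workload_manifest):
    """Return the preferred index names that appear in the manifest whitelist."""
    legal = set(workload_manifest.get("candidate_indexes", ()))
    # Filtering against the whitelist keeps the submission valid even if a name is absent.
    return [name for name in PREFERRED if name in legal]
```

Filtering against the whitelist means a missing name shrinks the selection instead of invalidating the run.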
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/Task_zh-CN.md b/benchmarks/ComputerSystems/DuckDBIndexSelection/Task_zh-CN.md
new file mode 100644
index 00000000..b8ad5524
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/Task_zh-CN.md
@@ -0,0 +1,51 @@
+# DuckDB 索引选择
+
+## 任务概览
+
+在冻结的 DuckDB 分析 workload 上，从白名单中挑选一组索引，尽量降低总执行时间。
+
+这个 benchmark 对应的是稳定 DuckDB workload 上的物理设计调优。额外索引确实可能加速重复查询，但也会带来构建和维护成本，所以关键不是“建得越多越好”，而是选对那一小部分。
+
+从计算角度看，这是一道固定候选集合上的子集选择题，而且分数来自真实执行时间，不是某个替代指标。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中冻结的 schema、本地数据生成逻辑和 workload manifest。
+- `workload_manifest["candidate_indexes"]` 中给出的合法索引白名单。
+- 固定的计时协议：先建索引（计时），再做一轮不计时的预热，然后重复执行 workload 四轮并计时。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def select_indexes(workload_manifest):
+    ...
+```
+
+返回白名单中的索引名列表；也接受带 `indexes` 字段的字典。
+
+## 评测流程
+
+1. 构建冻结的 DuckDB 数据库并加载 manifest。
+2. 创建你从白名单中选出的索引。
+3. 先执行一轮不计时的预热，再固定重复执行查询 workload 四次并计时。
+4. 统计候选总耗时，并同时给出无索引 baseline 作为参考。
+
+## 指标
+
+- `combined_score`：`-candidate_total_runtime_s`
+- `valid`：只有索引名合法且执行成功时才为 `1.0`
+- `candidate_total_runtime_s`
+- `baseline_total_runtime_s`
+- `candidate_setup_runtime_s`
+- `candidate_workload_runtime_s`
+
+## 判为无效的情况
+
+- 缺少 `select_indexes(...)`，或函数在评测中报错
+- 返回值无法解析为索引名列表
+- 任意一个索引名不在白名单中
+- 建索引或执行 workload 时失败
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/baseline/solution.py b/benchmarks/ComputerSystems/DuckDBIndexSelection/baseline/solution.py
new file mode 100644
index 00000000..56a5989c
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/baseline/solution.py
@@ -0,0 +1,5 @@
+from __future__ import annotations
+
+
+def select_indexes(workload_manifest):
+    return []
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/agent_files.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/candidate_destination.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/constraints.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/constraints.txt
new file mode 100644
index 00000000..88b1935c
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.
+Keep outputs valid and finite.
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/eval_command.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/eval_cwd.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/initial_program.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/readonly_files.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..8bb37291
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/readonly_files.txt
@@ -0,0 +1,5 @@
+baseline/solution.py
+runtime/problem.py
+runtime/duckdb_local_workload.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/references/source_manifest.md b/benchmarks/ComputerSystems/DuckDBIndexSelection/references/source_manifest.md
new file mode 100644
index 00000000..b5b0db78
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/references/source_manifest.md
@@ -0,0 +1,10 @@
+# Source Manifest
+
+- Upstream engine: `DuckDB`
+- Upstream lineage:
+  - DuckDB benchmark and TPC-H documentation
+  - DuckDB SQL and index support
+- Schema lineage: this benchmark uses a local frozen relational workload with `customer`, `orders`, and `lineitem` tables modeled after the TPC-H schema family.
+- Data provenance: rows are generated deterministically inside DuckDB from fixed SQL formulas and a fixed schema; this is a benchmark-local synthetic dataset, not official TPC-H `dbgen` output.
+- Authenticity note: the schema and workload lineage are traceable to official DuckDB/TPC-H benchmarking materials, but the data itself is a local frozen synthetic asset used because online extension-based generation was not reliable in this environment.
+- License lineage: DuckDB is released under the MIT License.
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/runtime/duckdb_local_workload.py b/benchmarks/ComputerSystems/DuckDBIndexSelection/runtime/duckdb_local_workload.py
new file mode 100644
index 00000000..a9134cbc
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/runtime/duckdb_local_workload.py
@@ -0,0 +1,419 @@
+from __future__ import annotations
+
+import math
+import time
+from typing import Any
+
+import duckdb
+
+
+CUSTOMER_COUNT = 20_000
+ORDER_COUNT = 120_000
+LINEITEM_COUNT = 600_000
+
+SEGMENTS = ("BUILDING", "AUTOMOBILE", "HOUSEHOLD", "FURNITURE", "MACHINERY")
+SHIPMODES = ("AIR", "MAIL", "RAIL", "TRUCK", "SHIP")
+
+CUSTOMER_KEYS = tuple(1 + ((i * 97) % CUSTOMER_COUNT) for i in range(1, 301))
+ORDER_KEYS = tuple(1 + ((i * 193) % ORDER_COUNT) for i in range(1, 301))
+
+
+INDEX_CANDIDATES = {
+    "idx_orders_cust": "CREATE INDEX idx_orders_cust ON orders(o_custkey)",
+    "idx_orders_date": "CREATE INDEX idx_orders_date ON orders(o_orderdate)",
+    "idx_lineitem_order": "CREATE INDEX idx_lineitem_order ON lineitem(l_orderkey)",
+    "idx_customer_segment": "CREATE INDEX idx_customer_segment ON customer(c_mktsegment)",
+    "idx_orders_priority": "CREATE INDEX idx_orders_priority ON orders(o_orderpriority)",
+}
+
+INDEX_WORKLOAD_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "candidate_indexes": tuple(sorted(INDEX_CANDIDATES)),
+    "workload_notes": (
+        "Repeated selective customer lookups on orders",
+        "Repeated selective order lookups on lineitem",
+        "Repeated priority-filtered joins from customer to orders",
+    ),
+    "repetitions": 4,
+}
+
+
+PREAGGREGATION_CANDIDATES = {
+    "agg_quarter_segment_revenue": (
+        "CREATE TABLE agg_quarter_segment_revenue AS "
+        "SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket, "
+        "       c.c_mktsegment AS segment, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2"
+    ),
+    "agg_month_shipmode_revenue": (
+        "CREATE TABLE agg_month_shipmode_revenue AS "
+        "SELECT date_trunc('month', l.l_shipdate) AS month_bucket, "
+        "       l.l_shipmode AS shipmode, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM lineitem l "
+        "GROUP BY 1, 2"
+    ),
+    "agg_customer_year_revenue": (
+        "CREATE TABLE agg_customer_year_revenue AS "
+        "SELECT year(o.o_orderdate) AS revenue_year, "
+        "       c.c_custkey, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2"
+    ),
+    "agg_unused_priority_only": (
+        "CREATE TABLE agg_unused_priority_only AS "
+        "SELECT o.o_orderpriority, count(*) AS order_count "
+        "FROM orders o "
+        "GROUP BY 1"
+    ),
+}
+
+PREAGGREGATION_WORKLOAD_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "candidate_preaggregations": tuple(sorted(PREAGGREGATION_CANDIDATES)),
+    "workload_notes": (
+        "Quarter revenue by customer segment",
+        "Monthly revenue by ship mode",
+        "Top customers by yearly revenue",
+    ),
+    "repetitions": 4,
+}
+
+
+ORIGINAL_QUERY_SQL = '''
+WITH revenue AS (
+  SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket,
+         c.c_mktsegment AS segment,
+         sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue
+  FROM customer c
+  JOIN orders o ON o.o_custkey = c.c_custkey
+  JOIN lineitem l ON l.l_orderkey = o.o_orderkey
+  WHERE c.c_mktsegment IN ('BUILDING', 'AUTOMOBILE', 'HOUSEHOLD')
+  GROUP BY 1, 2
+),
+order_counts AS (
+  SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket,
+         c.c_mktsegment AS segment,
+         count(DISTINCT o.o_orderkey) AS order_count
+  FROM customer c
+  JOIN orders o ON o.o_custkey = c.c_custkey
+  JOIN lineitem l ON l.l_orderkey = o.o_orderkey
+  WHERE c.c_mktsegment IN ('BUILDING', 'AUTOMOBILE', 'HOUSEHOLD')
+  GROUP BY 1, 2
+)
+SELECT r.quarter_bucket, r.segment, r.revenue, o.order_count
+FROM revenue r
+JOIN order_counts o USING (quarter_bucket, segment)
+ORDER BY quarter_bucket, segment
+'''.strip()
+
+QUERY_REWRITE_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "query_goal": "Fuse repeated scans of the same join into one grouped aggregation while preserving results and ordering.",
+    "result_order_required": True,
+    "repetitions": 4,
+}
+
+
+def build_connection() -> duckdb.DuckDBPyConnection:
+    con = duckdb.connect(database=":memory:")
+    con.execute("PRAGMA threads=1")
+    con.execute(
+        f"""
+        CREATE TABLE customer AS
+        SELECT i AS c_custkey,
+               'Customer #' || i AS c_name,
+               CASE i % 5
+                 WHEN 0 THEN 'BUILDING'
+                 WHEN 1 THEN 'AUTOMOBILE'
+                 WHEN 2 THEN 'HOUSEHOLD'
+                 WHEN 3 THEN 'FURNITURE'
+                 ELSE 'MACHINERY'
+               END AS c_mktsegment,
+               i % 25 AS c_nationkey
+        FROM range(1, {CUSTOMER_COUNT + 1}) t(i)
+        """
+    )
+    con.execute(
+        f"""
+        CREATE TABLE orders AS
+        SELECT i AS o_orderkey,
+               1 + ((i * 17) % {CUSTOMER_COUNT}) AS o_custkey,
+               DATE '1995-01-01' + (((i * 13) % 1460) * INTERVAL 1 DAY) AS o_orderdate,
+               100 + (((i * 37) % 100000) / 10.0) AS o_totalprice,
+               CASE i % 5
+                 WHEN 0 THEN '1-URGENT'
+                 WHEN 1 THEN '2-HIGH'
+                 WHEN 2 THEN '3-MEDIUM'
+                 WHEN 3 THEN '4-NOT SPECIFIED'
+                 ELSE '5-LOW'
+               END AS o_orderpriority
+        FROM range(1, {ORDER_COUNT + 1}) t(i)
+        """
+    )
+    con.execute(
+        f"""
+        CREATE TABLE lineitem AS
+        SELECT i AS l_lineitemkey,
+               1 + ((i * 7) % {ORDER_COUNT}) AS l_orderkey,
+               1 + ((i * 11) % 50000) AS l_partkey,
+               1 + ((i * 13) % 10000) AS l_suppkey,
+               1 + ((i * 5) % 50) AS l_quantity,
+               10 + (((i * 19) % 100000) / 20.0) AS l_extendedprice,
+               (((i * 3) % 10) / 100.0) AS l_discount,
+               DATE '1995-01-01' + (((i * 29) % 1460) * INTERVAL 1 DAY) AS l_shipdate,
+               CASE i % 5
+                 WHEN 0 THEN 'AIR'
+                 WHEN 1 THEN 'MAIL'
+                 WHEN 2 THEN 'RAIL'
+                 WHEN 3 THEN 'TRUCK'
+                 ELSE 'SHIP'
+               END AS l_shipmode
+        FROM range(1, {LINEITEM_COUNT + 1}) t(i)
+        """
+    )
+    return con
+
+
+def normalize_name_list(value: Any, key: str) -> list[str]:
+    if isinstance(value, dict):
+        if key not in value:
+            raise ValueError(f"missing {key}")
+        value = value[key]
+    if not isinstance(value, (list, tuple)):
+        raise ValueError(f"{key} must be a list or tuple")
+    out: list[str] = []
+    seen = set()
+    for item in value:
+        name = str(item)
+        if name not in seen:
+            out.append(name)
+            seen.add(name)
+    return out
+
+
+def compare_results(lhs: list[tuple[Any, ...]], rhs: list[tuple[Any, ...]], tol: float = 1e-6) -> bool:
+    if len(lhs) != len(rhs):
+        return False
+    for left_row, right_row in zip(lhs, rhs):
+        if len(left_row) != len(right_row):
+            return False
+        for left_value, right_value in zip(left_row, right_row):
+            if isinstance(left_value, float) or isinstance(right_value, float):
+                if not math.isfinite(float(left_value)) or not math.isfinite(float(right_value)):
+                    return False
+                if abs(float(left_value) - float(right_value)) > tol:
+                    return False
+            else:
+                if left_value != right_value:
+                    return False
+    return True
+
+
+def _report_quarter_segment(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT quarter_bucket, segment, revenue "
+            "FROM agg_quarter_segment_revenue "
+            "ORDER BY quarter_bucket, segment"
+        ).fetchall()
+    return con.execute(
+        "SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket, "
+        "       c.c_mktsegment AS segment, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2 "
+        "ORDER BY quarter_bucket, segment"
+    ).fetchall()
+
+
+def _report_month_shipmode(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT month_bucket, shipmode, revenue "
+            "FROM agg_month_shipmode_revenue "
+            "WHERE month_bucket >= DATE '1997-01-01' "
+            "ORDER BY month_bucket, shipmode"
+        ).fetchall()
+    return con.execute(
+        "SELECT date_trunc('month', l.l_shipdate) AS month_bucket, "
+        "       l.l_shipmode AS shipmode, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM lineitem l "
+        "WHERE l.l_shipdate >= DATE '1997-01-01' "
+        "GROUP BY 1, 2 "
+        "ORDER BY month_bucket, shipmode"
+    ).fetchall()
+
+
+def _report_customer_year(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT revenue_year, c_custkey, revenue "
+            "FROM agg_customer_year_revenue "
+            "WHERE revenue_year = 1998 "
+            "ORDER BY revenue DESC, c_custkey "
+            "LIMIT 100"
+        ).fetchall()
+    return con.execute(
+        "SELECT year(o.o_orderdate) AS revenue_year, "
+        "       c.c_custkey, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2 "
+        "HAVING year(o.o_orderdate) = 1998 "
+        "ORDER BY revenue DESC, c.c_custkey "
+        "LIMIT 100"
+    ).fetchall()
+
+
+def run_index_workload(con: duckdb.DuckDBPyConnection) -> float:
+    start_time = time.perf_counter()
+    for customer_key in CUSTOMER_KEYS:
+        con.execute(
+            "SELECT sum(o_totalprice) "
+            "FROM orders "
+            "WHERE o_custkey = ? AND o_orderdate >= DATE '1997-01-01'",
+            [customer_key],
+        ).fetchone()
+    for order_key in ORDER_KEYS:
+        con.execute(
+            "SELECT sum(l_extendedprice * (1 - l_discount)) "
+            "FROM lineitem "
+            "WHERE l_orderkey = ?",
+            [order_key],
+        ).fetchone()
+    for customer_key in CUSTOMER_KEYS[:120]:
+        con.execute(
+            "SELECT count(*) "
+            "FROM customer c "
+            "JOIN orders o ON c.c_custkey = o.o_custkey "
+            "WHERE c.c_custkey = ? AND o.o_orderpriority = '1-URGENT'",
+            [customer_key],
+        ).fetchone()
+    return time.perf_counter() - start_time
+
+
+def measure_index_design(selected_indexes: list[str]) -> dict[str, float | int]:
+    unknown = [name for name in selected_indexes if name not in INDEX_CANDIDATES]
+    if unknown:
+        raise ValueError(f"unknown index names: {unknown}")
+    con = build_connection()
+    start_setup = time.perf_counter()
+    for name in selected_indexes:
+        con.execute(INDEX_CANDIDATES[name])
+    setup_runtime = time.perf_counter() - start_setup
+    run_index_workload(con)  # untimed warm-up pass; its elapsed time is discarded
+    workload_runtime = 0.0
+    for _ in range(int(INDEX_WORKLOAD_MANIFEST["repetitions"])):
+        workload_runtime += run_index_workload(con)
+    return {
+        "setup_runtime_s": float(setup_runtime),
+        "workload_runtime_s": float(workload_runtime),
+        "total_runtime_s": float(setup_runtime + workload_runtime),
+        "selected_index_count": len(selected_indexes),
+    }
+
+
+def measure_query_rewrite(sql: str) -> dict[str, Any]:
+    sql = str(sql).strip()
+    if not sql:
+        raise ValueError("query must not be empty")
+    baseline_con = build_connection()
+    candidate_con = build_connection()
+    baseline_rows = baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()
+    candidate_rows = candidate_con.execute(sql).fetchall()
+    if not compare_results(candidate_rows, baseline_rows):
+        raise ValueError("candidate query result does not match the baseline result")
+
+    baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()
+    baseline_start = time.perf_counter()
+    for _ in range(int(QUERY_REWRITE_MANIFEST["repetitions"])):
+        baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()
+    baseline_runtime = time.perf_counter() - baseline_start
+
+    candidate_con.execute(sql).fetchall()
+    candidate_start = time.perf_counter()
+    for _ in range(int(QUERY_REWRITE_MANIFEST["repetitions"])):
+        candidate_rows = candidate_con.execute(sql).fetchall()
+    candidate_runtime = time.perf_counter() - candidate_start
+
+    return {
+        "baseline_runtime_s": float(baseline_runtime),
+        "candidate_runtime_s": float(candidate_runtime),
+        "row_count": len(candidate_rows),
+    }
+
+
+def _run_preaggregation_reports(con: duckdb.DuckDBPyConnection, selected: set[str]) -> tuple[float, tuple[list[tuple[Any, ...]], ...]]:
+    start_time = time.perf_counter()
+    result_a = _report_quarter_segment(con, "agg_quarter_segment_revenue" in selected)
+    result_b = _report_month_shipmode(con, "agg_month_shipmode_revenue" in selected)
+    result_c = _report_customer_year(con, "agg_customer_year_revenue" in selected)
+    runtime = time.perf_counter() - start_time
+    return runtime, (result_a, result_b, result_c)
+
+
+def measure_preaggregation_design(selected_preaggregations: list[str]) -> dict[str, float | int]:
+    unknown = [name for name in selected_preaggregations if name not in PREAGGREGATION_CANDIDATES]
+    if unknown:
+        raise ValueError(f"unknown pre-aggregation names: {unknown}")
+    if not selected_preaggregations:
+        con = build_connection()
+        _run_preaggregation_reports(con, set())
+        repeated_runtime = 0.0
+        for _ in range(int(PREAGGREGATION_WORKLOAD_MANIFEST["repetitions"])):
+            extra_runtime, _ = _run_preaggregation_reports(con, set())
+            repeated_runtime += extra_runtime
+        return {
+            "setup_runtime_s": 0.0,
+            "candidate_workload_runtime_s": float(repeated_runtime),
+            "candidate_total_runtime_s": float(repeated_runtime),
+            "baseline_total_runtime_s": float(repeated_runtime),
+            "selected_preaggregation_count": 0,
+        }
+    baseline_con = build_connection()
+    candidate_con = build_connection()
+    start_setup = time.perf_counter()
+    for name in selected_preaggregations:
+        candidate_con.execute(PREAGGREGATION_CANDIDATES[name])
+    setup_runtime = time.perf_counter() - start_setup
+
+    _, baseline_results = _run_preaggregation_reports(baseline_con, set())
+    _, candidate_results = _run_preaggregation_reports(candidate_con, set(selected_preaggregations))
+    if any(not compare_results(left, right) for left, right in zip(candidate_results, baseline_results)):
+        raise ValueError("candidate pre-aggregation selection changed the query results")
+
+    _run_preaggregation_reports(baseline_con, set())
+    _run_preaggregation_reports(candidate_con, set(selected_preaggregations))
+
+    repeated_baseline_runtime = 0.0
+    for _ in range(int(PREAGGREGATION_WORKLOAD_MANIFEST["repetitions"])):
+        extra_runtime, _ = _run_preaggregation_reports(baseline_con, set())
+        repeated_baseline_runtime += extra_runtime
+
+    repeated_candidate_runtime = 0.0
+    for _ in range(int(PREAGGREGATION_WORKLOAD_MANIFEST["repetitions"])):
+        extra_runtime, _ = _run_preaggregation_reports(candidate_con, set(selected_preaggregations))
+        repeated_candidate_runtime += extra_runtime
+
+    candidate_total_runtime = setup_runtime + repeated_candidate_runtime
+    baseline_total_runtime = repeated_baseline_runtime
+    return {
+        "setup_runtime_s": float(setup_runtime),
+        "candidate_workload_runtime_s": float(repeated_candidate_runtime),
+        "candidate_total_runtime_s": float(candidate_total_runtime),
+        "baseline_total_runtime_s": float(baseline_total_runtime),
+        "selected_preaggregation_count": len(selected_preaggregations),
+    }
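Note on `measure_index_design` in the file above: its timing protocol is a timed setup phase, one untimed warm-up pass, then a fixed number of timed repetitions. The sketch below restates that shape without DuckDB. `time_design` and its parameters are hypothetical names for illustration, and it times each pass from the outside, whereas `run_index_workload` times itself.

```python
import time


def time_design(setup_steps, workload, repetitions=4):
    """Time setup once, run one untimed warm-up, then sum `repetitions` timed passes."""
    start = time.perf_counter()
    for step in setup_steps:
        step()
    setup_runtime = time.perf_counter() - start

    workload()  # warm-up pass: executed but excluded from the timing

    workload_runtime = 0.0
    for _ in range(repetitions):
        begin = time.perf_counter()
        workload()
        workload_runtime += time.perf_counter() - begin

    return {
        "setup_runtime_s": setup_runtime,
        "workload_runtime_s": workload_runtime,
        "total_runtime_s": setup_runtime + workload_runtime,
    }


# Usage with stand-in callables that only count how often they run.
calls = {"setup": 0, "workload": 0}


def fake_setup():
    calls["setup"] += 1


def fake_workload():
    calls["workload"] += 1


result = time_design([fake_setup, fake_setup], fake_workload)
```

With the default of four repetitions the workload runs five times in total, but only four of those passes count toward the score.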
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/runtime/problem.py b/benchmarks/ComputerSystems/DuckDBIndexSelection/runtime/problem.py
new file mode 100644
index 00000000..70236d19
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/runtime/problem.py
@@ -0,0 +1,14 @@
+from __future__ import annotations
+
+from .duckdb_local_workload import INDEX_WORKLOAD_MANIFEST, measure_index_design, normalize_name_list
+
+
+WORKLOAD_MANIFEST = dict(INDEX_WORKLOAD_MANIFEST)
+
+
+def load_instance():
+    return dict(WORKLOAD_MANIFEST)
+
+
+def evaluate_selection(selection):
+    return measure_index_design(normalize_name_list(selection, "indexes"))
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/scripts/init.py b/benchmarks/ComputerSystems/DuckDBIndexSelection/scripts/init.py
new file mode 100644
index 00000000..cf1bad8b
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/scripts/init.py
@@ -0,0 +1,44 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.ComputerSystems.DuckDBIndexSelection.baseline.solution import select_indexes as _baseline_select_indexes
+    from benchmarks.ComputerSystems.DuckDBIndexSelection.runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+except ModuleNotFoundError:
+    from baseline.solution import select_indexes as _baseline_select_indexes
+    from runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+
+
+# EVOLVE-BLOCK-START
+def select_indexes(workload_manifest):
+    return _baseline_select_indexes(workload_manifest)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    print(evaluate_selection(select_indexes(WORKLOAD_MANIFEST)))
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/verification/evaluator.py b/benchmarks/ComputerSystems/DuckDBIndexSelection/verification/evaluator.py
new file mode 100644
index 00000000..7800a6e5
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/verification/evaluator.py
@@ -0,0 +1,90 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.ComputerSystems.DuckDBIndexSelection.baseline.solution import select_indexes as baseline_select_indexes
+    from benchmarks.ComputerSystems.DuckDBIndexSelection.runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+except ModuleNotFoundError:
+    from baseline.solution import select_indexes as baseline_select_indexes
+    from runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+
+
+def evaluate(program_path: str):
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_total_runtime_s": 0.0,
+        "baseline_total_runtime_s": 0.0,
+        "candidate_setup_runtime_s": 0.0,
+        "candidate_workload_runtime_s": 0.0,
+    }
+    artifacts = {}
+    # Load and run the candidate inside the guard so import-time crashes are reported as invalid.
+    try:
+        namespace = runpy.run_path(str(Path(program_path).expanduser().resolve()), run_name="candidate_program")
+        select_indexes = namespace.get("select_indexes")
+        if not callable(select_indexes):
+            raise TypeError("candidate must define select_indexes(workload_manifest)")
+        baseline = evaluate_selection(baseline_select_indexes(WORKLOAD_MANIFEST))
+        candidate = evaluate_selection(select_indexes(WORKLOAD_MANIFEST))
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+    candidate_total = float(candidate["total_runtime_s"])
+    baseline_total = float(baseline["total_runtime_s"])
+    if not math.isfinite(candidate_total) or candidate_total <= 0:
+        artifacts["error_message"] = "candidate runtime is invalid"
+        return metrics, artifacts
+    metrics["valid"] = 1.0
+    metrics["candidate_total_runtime_s"] = candidate_total
+    metrics["baseline_total_runtime_s"] = baseline_total
+    metrics["candidate_setup_runtime_s"] = float(candidate["setup_runtime_s"])
+    metrics["candidate_workload_runtime_s"] = float(candidate["workload_runtime_s"])
+    metrics["combined_score"] = -candidate_total
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
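
The evaluator above reports `combined_score` as the negated candidate runtime, so a higher score means a faster design. A minimal sketch of how the emitted metrics could be summarized follows. `summarize_metrics` and its output keys are hypothetical helpers, not part of the benchmark; only the input keys come from the evaluator.

```python
import math


def summarize_metrics(metrics):
    """Derive a speedup figure from an evaluator-style metrics dict.

    Hypothetical helper: only the input keys come from the evaluator.
    """
    candidate = float(metrics["candidate_total_runtime_s"])
    baseline = float(metrics["baseline_total_runtime_s"])
    valid = metrics.get("valid", 0.0) == 1.0 and math.isfinite(candidate) and candidate > 0
    return {
        "valid": valid,
        # A speedup above 1.0 means the candidate beat the baseline selection.
        "speedup_vs_baseline": baseline / candidate if valid else None,
    }


# Invented numbers, shaped like the evaluator's JSON output.
example = {
    "combined_score": -2.0,
    "valid": 1.0,
    "candidate_total_runtime_s": 2.0,
    "baseline_total_runtime_s": 3.0,
}
print(summarize_metrics(example))
```

Invalid runs keep `candidate_total_runtime_s` at `0.0`, so the sketch returns `None` for the speedup rather than dividing by zero.
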
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/verification/requirements.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/verification/requirements.txt
new file mode 100644
index 00000000..8a6ba6a1
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBIndexSelection/verification/requirements.txt
@@ -0,0 +1 @@
+duckdb
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/README.md b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/README.md
new file mode 100644
index 00000000..06a001f8
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/README.md
@@ -0,0 +1,55 @@
+# DuckDB Pre-Aggregation Selection
+
+Choose a subset of whitelisted pre-aggregation tables for a frozen DuckDB reporting workload so that total runtime (setup plus repeated report execution) is minimized.
+
+## Why This Benchmark Matters
+
+This benchmark models a common warehouse decision: which summary tables are worth materializing for a recurring reporting workload. The wrong choice wastes storage and refresh time; the right choice reduces repeated scan and aggregation cost.
+
+Algorithmically, it is a materialized-view selection problem over a fixed candidate set, scored by real query execution under result-equivalence checks.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `select_preaggregations(workload_manifest)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: entry points for the frozen instance, validator, and metrics helpers (implemented in `runtime/duckdb_local_workload.py`)
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/ComputerSystems/DuckDBPreAggregationSelection/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/ComputerSystems/DuckDBPreAggregationSelection/verification/evaluator.py \
+  benchmarks/ComputerSystems/DuckDBPreAggregationSelection/scripts/init.py \
+  --metrics-out /tmp/DuckDBPreAggregationSelection_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=ComputerSystems/DuckDBPreAggregationSelection \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/README_zh-CN.md b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/README_zh-CN.md
new file mode 100644
index 00000000..98b1d6ea
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/README_zh-CN.md
@@ -0,0 +1,55 @@
+# DuckDB 预聚合选择
+
+在冻结的 DuckDB 报表 workload 上，从白名单中选择一组预聚合表，尽量降低总执行时间。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 对应的是数据仓库里很常见的一类决策：面对重复报表 workload，到底哪些汇总表值得物化。选错了会浪费存储和刷新成本，选对了才能真正降低重复扫描与聚合开销。
+
+从算法角度看，它是固定候选集合上的物化视图选择题，而且分数来自真实查询执行，同时要满足严格结果一致性。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`select_preaggregations(workload_manifest)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/ComputerSystems/DuckDBPreAggregationSelection/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/ComputerSystems/DuckDBPreAggregationSelection/verification/evaluator.py \
+  benchmarks/ComputerSystems/DuckDBPreAggregationSelection/scripts/init.py \
+  --metrics-out /tmp/DuckDBPreAggregationSelection_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=ComputerSystems/DuckDBPreAggregationSelection \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/Task.md b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/Task.md
new file mode 100644
index 00000000..5715b0e5
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/Task.md
@@ -0,0 +1,51 @@
+# DuckDB Pre-Aggregation Selection Task
+
+## Problem
+
+Choose a subset of whitelisted pre-aggregation tables for a frozen DuckDB reporting workload so that total runtime (setup plus repeated report execution) is minimized.
+
+This benchmark models a common warehouse decision: which summary tables are worth materializing for a recurring reporting workload. The wrong choice wastes storage and refresh time; the right choice reduces repeated scan and aggregation cost.
+
+Algorithmically, it is a materialized-view selection problem over a fixed candidate set, scored by real query execution under result-equivalence checks.
+
+## What Is Frozen
+
+- The schema, local data generator, and reporting workload, defined in `runtime/duckdb_local_workload.py` and exposed through `runtime/problem.py`.
+- The whitelist of legal summary tables in `workload_manifest["candidate_preaggregations"]`.
+- The correctness audit and the timing protocol used to compare setup plus repeated report execution.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def select_preaggregations(workload_manifest):
+    ...
+```
+
+Return a list or tuple of whitelisted pre-aggregation names; duplicate names are ignored. A dict whose `preaggregations` key holds such a list is also accepted.
+
+## Evaluation
+
+1. Build the frozen DuckDB database and load the workload manifest.
+2. Create the pre-aggregation tables you selected from the whitelist.
+3. Run the fixed reporting queries and verify result equivalence.
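+4. Measure total candidate runtime (table setup plus the timed repetitions of the reports, which follow untimed verification and warm-up passes) and report the no-materialization baseline for context.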
+
+## Metrics
+
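+- `combined_score`: `-candidate_total_runtime_s`, so higher is better
+- `valid`: `1.0` only if all selected names are legal and query results stay unchanged
+- `candidate_total_runtime_s`: setup time plus workload time for the selected pre-aggregations
+- `baseline_total_runtime_s`: workload time with no pre-aggregations materialized
+- `candidate_setup_runtime_s`: time spent creating the selected pre-aggregation tables
+- `candidate_workload_runtime_s`: time spent in the timed repetitions of the fixed reports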
+
+## Invalid Submissions
+
+- `select_preaggregations(...)` is missing or crashes
+- The return value cannot be parsed into a list of names
+- Any selected name is outside the whitelist
+- Materialization or query execution fails, or results change
+
+<!-- AI_GENERATED -->
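
The submission contract in `Task.md` above can be met by a function like the following minimal sketch. It keeps every whitelisted name except `agg_unused_priority_only`, which none of the three reports in `runtime/duckdb_local_workload.py` reads. Whether the remaining tables repay their setup cost is an assumption that only a timed run can confirm. The `SKIP` set is a hypothetical heuristic, not the reference solution.

```python
# Hypothetical heuristic: a table that no fixed report reads only adds setup cost.
SKIP = frozenset({"agg_unused_priority_only"})


def select_preaggregations(workload_manifest):
    """Return whitelisted pre-aggregation names, minus the skipped ones."""
    whitelist = workload_manifest["candidate_preaggregations"]
    return [name for name in whitelist if name not in SKIP]


# Stand-in manifest holding only the key this sketch needs.
manifest = {
    "candidate_preaggregations": (
        "agg_customer_year_revenue",
        "agg_month_shipmode_revenue",
        "agg_quarter_segment_revenue",
        "agg_unused_priority_only",
    ),
}
print(select_preaggregations(manifest))
```

Filtering the manifest's own whitelist, rather than hard-coding the names to return, keeps every returned name legal by construction.
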
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/Task_zh-CN.md b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/Task_zh-CN.md
new file mode 100644
index 00000000..0a9007d8
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/Task_zh-CN.md
@@ -0,0 +1,51 @@
+# DuckDB 预聚合选择
+
+## 任务概览
+
+在冻结的 DuckDB 报表 workload 上，从白名单中选择一组预聚合表，尽量降低总执行时间。
+
+这个 benchmark 对应的是数据仓库里很常见的一类决策：面对重复报表 workload，到底哪些汇总表值得物化。选错了会浪费存储和刷新成本，选对了才能真正降低重复扫描与聚合开销。
+
+从算法角度看，它是固定候选集合上的物化视图选择题，而且分数来自真实查询执行，同时要满足严格结果一致性。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中冻结的 schema、本地数据生成逻辑和报表 workload。
+- `workload_manifest["candidate_preaggregations"]` 中给出的合法预聚合白名单。
+- 固定的正确性校验与计时协议，统计建表开销和重复报表执行开销。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def select_preaggregations(workload_manifest):
+    ...
+```
+
+返回白名单中的预聚合名列表；也接受带 `preaggregations` 字段的字典。
+
+## 评测流程
+
+1. 构建冻结的 DuckDB 数据库并加载 workload manifest。
+2. 创建你从白名单中选出的预聚合表。
+3. 执行固定报表查询并校验结果是否一致。
+4. 统计候选总耗时，并同时给出不物化任何汇总表的 baseline。
+
+## 指标
+
+- `combined_score`：`-candidate_total_runtime_s`
+- `valid`：只有名字合法且查询结果不变时才为 `1.0`
+- `candidate_total_runtime_s`
+- `baseline_total_runtime_s`
+- `candidate_setup_runtime_s`
+- `candidate_workload_runtime_s`
+
+## 判为无效的情况
+
+- 缺少 `select_preaggregations(...)`，或函数在评测中报错
+- 返回值无法解析为预聚合名列表
+- 任意一个名字不在白名单中
+- 物化失败、查询执行失败，或结果发生变化
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/baseline/solution.py b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/baseline/solution.py
new file mode 100644
index 00000000..e6966c2f
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/baseline/solution.py
@@ -0,0 +1,5 @@
+from __future__ import annotations
+
+
+def select_preaggregations(workload_manifest):
+    return []
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/agent_files.txt b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/candidate_destination.txt b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/constraints.txt b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/constraints.txt
new file mode 100644
index 00000000..88b1935c
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.
+Keep outputs valid and finite.
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/eval_command.txt b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/eval_cwd.txt b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/initial_program.txt b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/readonly_files.txt b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..8bb37291
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/frontier_eval/readonly_files.txt
@@ -0,0 +1,5 @@
+baseline/solution.py
+runtime/problem.py
+runtime/duckdb_local_workload.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/references/source_manifest.md b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/references/source_manifest.md
new file mode 100644
index 00000000..36093907
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/references/source_manifest.md
@@ -0,0 +1,10 @@
+# Source Manifest
+
+- Upstream engine: `DuckDB`
+- Upstream lineage:
+  - DuckDB benchmark and TPC-H documentation
+  - DuckDB SQL execution on analytical reporting queries
+- Schema lineage: this benchmark uses a local frozen relational workload with `customer`, `orders`, and `lineitem` tables modeled after the TPC-H schema family.
+- Data provenance: rows are generated deterministically inside DuckDB from fixed SQL formulas and a fixed schema; this is a benchmark-local synthetic dataset, not official TPC-H `dbgen` output.
+- Authenticity note: the reporting queries and schema family are traceable to official analytical benchmark patterns, while the candidate pre-aggregations are benchmark-local frozen physical-design options.
+- License lineage: DuckDB is released under the MIT License.
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/runtime/duckdb_local_workload.py b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/runtime/duckdb_local_workload.py
new file mode 100644
index 00000000..a9134cbc
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/runtime/duckdb_local_workload.py
@@ -0,0 +1,419 @@
+from __future__ import annotations
+
+import math
+import time
+from typing import Any
+
+import duckdb
+
+
+CUSTOMER_COUNT = 20_000
+ORDER_COUNT = 120_000
+LINEITEM_COUNT = 600_000
+
+SEGMENTS = ("BUILDING", "AUTOMOBILE", "HOUSEHOLD", "FURNITURE", "MACHINERY")
+SHIPMODES = ("AIR", "MAIL", "RAIL", "TRUCK", "SHIP")
+
+CUSTOMER_KEYS = tuple(1 + ((i * 97) % CUSTOMER_COUNT) for i in range(1, 301))
+ORDER_KEYS = tuple(1 + ((i * 193) % ORDER_COUNT) for i in range(1, 301))
+
+
+INDEX_CANDIDATES = {
+    "idx_orders_cust": "CREATE INDEX idx_orders_cust ON orders(o_custkey)",
+    "idx_orders_date": "CREATE INDEX idx_orders_date ON orders(o_orderdate)",
+    "idx_lineitem_order": "CREATE INDEX idx_lineitem_order ON lineitem(l_orderkey)",
+    "idx_customer_segment": "CREATE INDEX idx_customer_segment ON customer(c_mktsegment)",
+    "idx_orders_priority": "CREATE INDEX idx_orders_priority ON orders(o_orderpriority)",
+}
+
+INDEX_WORKLOAD_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "candidate_indexes": tuple(sorted(INDEX_CANDIDATES)),
+    "workload_notes": (
+        "Repeated selective customer lookups on orders",
+        "Repeated selective order lookups on lineitem",
+        "Repeated priority-filtered joins from customer to orders",
+    ),
+    "repetitions": 4,
+}
+
+
+PREAGGREGATION_CANDIDATES = {
+    "agg_quarter_segment_revenue": (
+        "CREATE TABLE agg_quarter_segment_revenue AS "
+        "SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket, "
+        "       c.c_mktsegment AS segment, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2"
+    ),
+    "agg_month_shipmode_revenue": (
+        "CREATE TABLE agg_month_shipmode_revenue AS "
+        "SELECT date_trunc('month', l.l_shipdate) AS month_bucket, "
+        "       l.l_shipmode AS shipmode, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM lineitem l "
+        "GROUP BY 1, 2"
+    ),
+    "agg_customer_year_revenue": (
+        "CREATE TABLE agg_customer_year_revenue AS "
+        "SELECT year(o.o_orderdate) AS revenue_year, "
+        "       c.c_custkey, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2"
+    ),
+    "agg_unused_priority_only": (
+        "CREATE TABLE agg_unused_priority_only AS "
+        "SELECT o.o_orderpriority, count(*) AS order_count "
+        "FROM orders o "
+        "GROUP BY 1"
+    ),
+}
+
+PREAGGREGATION_WORKLOAD_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "candidate_preaggregations": tuple(sorted(PREAGGREGATION_CANDIDATES)),
+    "workload_notes": (
+        "Quarter revenue by customer segment",
+        "Monthly revenue by ship mode",
+        "Top customers by yearly revenue",
+    ),
+    "repetitions": 4,
+}
+
+
+ORIGINAL_QUERY_SQL = '''
+WITH revenue AS (
+  SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket,
+         c.c_mktsegment AS segment,
+         sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue
+  FROM customer c
+  JOIN orders o ON o.o_custkey = c.c_custkey
+  JOIN lineitem l ON l.l_orderkey = o.o_orderkey
+  WHERE c.c_mktsegment IN ('BUILDING', 'AUTOMOBILE', 'HOUSEHOLD')
+  GROUP BY 1, 2
+),
+order_counts AS (
+  SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket,
+         c.c_mktsegment AS segment,
+         count(DISTINCT o.o_orderkey) AS order_count
+  FROM customer c
+  JOIN orders o ON o.o_custkey = c.c_custkey
+  JOIN lineitem l ON l.l_orderkey = o.o_orderkey
+  WHERE c.c_mktsegment IN ('BUILDING', 'AUTOMOBILE', 'HOUSEHOLD')
+  GROUP BY 1, 2
+)
+SELECT r.quarter_bucket, r.segment, r.revenue, o.order_count
+FROM revenue r
+JOIN order_counts o USING (quarter_bucket, segment)
+ORDER BY quarter_bucket, segment
+'''.strip()
+
+QUERY_REWRITE_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "query_goal": "Fuse repeated scans of the same join into one grouped aggregation while preserving results and ordering.",
+    "result_order_required": True,
+    "repetitions": 4,
+}
+
+
+def build_connection() -> duckdb.DuckDBPyConnection:
+    con = duckdb.connect(database=":memory:")
+    con.execute("PRAGMA threads=1")
+    con.execute(
+        f"""
+        CREATE TABLE customer AS
+        SELECT i AS c_custkey,
+               'Customer #' || i AS c_name,
+               CASE i % 5
+                 WHEN 0 THEN 'BUILDING'
+                 WHEN 1 THEN 'AUTOMOBILE'
+                 WHEN 2 THEN 'HOUSEHOLD'
+                 WHEN 3 THEN 'FURNITURE'
+                 ELSE 'MACHINERY'
+               END AS c_mktsegment,
+               i % 25 AS c_nationkey
+        FROM range(1, {CUSTOMER_COUNT + 1}) t(i)
+        """
+    )
+    con.execute(
+        f"""
+        CREATE TABLE orders AS
+        SELECT i AS o_orderkey,
+               1 + ((i * 17) % {CUSTOMER_COUNT}) AS o_custkey,
+               DATE '1995-01-01' + (((i * 13) % 1460) * INTERVAL 1 DAY) AS o_orderdate,
+               100 + (((i * 37) % 100000) / 10.0) AS o_totalprice,
+               CASE i % 5
+                 WHEN 0 THEN '1-URGENT'
+                 WHEN 1 THEN '2-HIGH'
+                 WHEN 2 THEN '3-MEDIUM'
+                 WHEN 3 THEN '4-NOT SPECIFIED'
+                 ELSE '5-LOW'
+               END AS o_orderpriority
+        FROM range(1, {ORDER_COUNT + 1}) t(i)
+        """
+    )
+    con.execute(
+        f"""
+        CREATE TABLE lineitem AS
+        SELECT i AS l_lineitemkey,
+               1 + ((i * 7) % {ORDER_COUNT}) AS l_orderkey,
+               1 + ((i * 11) % 50000) AS l_partkey,
+               1 + ((i * 13) % 10000) AS l_suppkey,
+               1 + ((i * 5) % 50) AS l_quantity,
+               10 + (((i * 19) % 100000) / 20.0) AS l_extendedprice,
+               (((i * 3) % 10) / 100.0) AS l_discount,
+               DATE '1995-01-01' + (((i * 29) % 1460) * INTERVAL 1 DAY) AS l_shipdate,
+               CASE i % 5
+                 WHEN 0 THEN 'AIR'
+                 WHEN 1 THEN 'MAIL'
+                 WHEN 2 THEN 'RAIL'
+                 WHEN 3 THEN 'TRUCK'
+                 ELSE 'SHIP'
+               END AS l_shipmode
+        FROM range(1, {LINEITEM_COUNT + 1}) t(i)
+        """
+    )
+    return con
+
+
+def normalize_name_list(value: Any, key: str) -> list[str]:
+    if isinstance(value, dict):
+        if key not in value:
+            raise ValueError(f"missing {key}")
+        value = value[key]
+    if not isinstance(value, (list, tuple)):
+        raise ValueError(f"{key} must be a list or tuple")
+    out: list[str] = []
+    seen = set()
+    for item in value:
+        name = str(item)
+        if name not in seen:
+            out.append(name)
+            seen.add(name)
+    return out
+
+
+def compare_results(lhs: list[tuple[Any, ...]], rhs: list[tuple[Any, ...]], tol: float = 1e-6) -> bool:
+    if len(lhs) != len(rhs):
+        return False
+    for left_row, right_row in zip(lhs, rhs):
+        if len(left_row) != len(right_row):
+            return False
+        for left_value, right_value in zip(left_row, right_row):
+            if isinstance(left_value, float) or isinstance(right_value, float):
+                if not math.isfinite(float(left_value)) or not math.isfinite(float(right_value)):
+                    return False
+                if abs(float(left_value) - float(right_value)) > tol:
+                    return False
+            else:
+                if left_value != right_value:
+                    return False
+    return True
+
+
+def _report_quarter_segment(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT quarter_bucket, segment, revenue "
+            "FROM agg_quarter_segment_revenue "
+            "ORDER BY quarter_bucket, segment"
+        ).fetchall()
+    return con.execute(
+        "SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket, "
+        "       c.c_mktsegment AS segment, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2 "
+        "ORDER BY quarter_bucket, segment"
+    ).fetchall()
+
+
+def _report_month_shipmode(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT month_bucket, shipmode, revenue "
+            "FROM agg_month_shipmode_revenue "
+            "WHERE month_bucket >= DATE '1997-01-01' "
+            "ORDER BY month_bucket, shipmode"
+        ).fetchall()
+    return con.execute(
+        "SELECT date_trunc('month', l.l_shipdate) AS month_bucket, "
+        "       l.l_shipmode AS shipmode, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM lineitem l "
+        "WHERE l.l_shipdate >= DATE '1997-01-01' "
+        "GROUP BY 1, 2 "
+        "ORDER BY month_bucket, shipmode"
+    ).fetchall()
+
+
+def _report_customer_year(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT revenue_year, c_custkey, revenue "
+            "FROM agg_customer_year_revenue "
+            "WHERE revenue_year = 1998 "
+            "ORDER BY revenue DESC, c_custkey "
+            "LIMIT 100"
+        ).fetchall()
+    return con.execute(
+        "SELECT year(o.o_orderdate) AS revenue_year, "
+        "       c.c_custkey, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2 "
+        "HAVING year(o.o_orderdate) = 1998 "
+        "ORDER BY revenue DESC, c.c_custkey "
+        "LIMIT 100"
+    ).fetchall()
+
+
+def run_index_workload(con: duckdb.DuckDBPyConnection) -> float:
+    start_time = time.perf_counter()
+    for customer_key in CUSTOMER_KEYS:
+        con.execute(
+            "SELECT sum(o_totalprice) "
+            "FROM orders "
+            "WHERE o_custkey = ? AND o_orderdate >= DATE '1997-01-01'",
+            [customer_key],
+        ).fetchone()
+    for order_key in ORDER_KEYS:
+        con.execute(
+            "SELECT sum(l_extendedprice * (1 - l_discount)) "
+            "FROM lineitem "
+            "WHERE l_orderkey = ?",
+            [order_key],
+        ).fetchone()
+    for customer_key in CUSTOMER_KEYS[:120]:
+        con.execute(
+            "SELECT count(*) "
+            "FROM customer c "
+            "JOIN orders o ON c.c_custkey = o.o_custkey "
+            "WHERE c.c_custkey = ? AND o.o_orderpriority = '1-URGENT'",
+            [customer_key],
+        ).fetchone()
+    return time.perf_counter() - start_time
+
+
+def measure_index_design(selected_indexes: list[str]) -> dict[str, float | int]:
+    unknown = [name for name in selected_indexes if name not in INDEX_CANDIDATES]
+    if unknown:
+        raise ValueError(f"unknown index names: {unknown}")
+    con = build_connection()
+    start_setup = time.perf_counter()
+    for name in selected_indexes:
+        con.execute(INDEX_CANDIDATES[name])
+    setup_runtime = time.perf_counter() - start_setup
+    run_index_workload(con)  # untimed warm-up pass before the measured repetitions
+    workload_runtime = 0.0
+    for _ in range(int(INDEX_WORKLOAD_MANIFEST["repetitions"])):
+        workload_runtime += run_index_workload(con)
+    return {
+        "setup_runtime_s": float(setup_runtime),
+        "workload_runtime_s": float(workload_runtime),
+        "total_runtime_s": float(setup_runtime + workload_runtime),
+        "selected_index_count": len(selected_indexes),
+    }
+
+
+def measure_query_rewrite(sql: str) -> dict[str, Any]:
+    sql = str(sql).strip()
+    if not sql:
+        raise ValueError("query must not be empty")
+    baseline_con = build_connection()
+    candidate_con = build_connection()
+    baseline_rows = baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()
+    candidate_rows = candidate_con.execute(sql).fetchall()
+    if not compare_results(candidate_rows, baseline_rows):
+        raise ValueError("candidate query result does not match the baseline result")
+
+    baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()  # untimed warm-up run
+    baseline_start = time.perf_counter()
+    for _ in range(int(QUERY_REWRITE_MANIFEST["repetitions"])):
+        baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()
+    baseline_runtime = time.perf_counter() - baseline_start
+
+    candidate_con.execute(sql).fetchall()  # untimed warm-up run
+    candidate_start = time.perf_counter()
+    for _ in range(int(QUERY_REWRITE_MANIFEST["repetitions"])):
+        candidate_rows = candidate_con.execute(sql).fetchall()
+    candidate_runtime = time.perf_counter() - candidate_start
+
+    return {
+        "baseline_runtime_s": float(baseline_runtime),
+        "candidate_runtime_s": float(candidate_runtime),
+        "row_count": len(candidate_rows),
+    }
+
+
+def _run_preaggregation_reports(con: duckdb.DuckDBPyConnection, selected: set[str]) -> tuple[float, tuple[list[tuple[Any, ...]], ...]]:
+    start_time = time.perf_counter()
+    result_a = _report_quarter_segment(con, "agg_quarter_segment_revenue" in selected)
+    result_b = _report_month_shipmode(con, "agg_month_shipmode_revenue" in selected)
+    result_c = _report_customer_year(con, "agg_customer_year_revenue" in selected)
+    runtime = time.perf_counter() - start_time
+    return runtime, (result_a, result_b, result_c)
+
+
+def measure_preaggregation_design(selected_preaggregations: list[str]) -> dict[str, float | int]:
+    unknown = [name for name in selected_preaggregations if name not in PREAGGREGATION_CANDIDATES]
+    if unknown:
+        raise ValueError(f"unknown pre-aggregation names: {unknown}")
+    if not selected_preaggregations:
+        con = build_connection()
+        _run_preaggregation_reports(con, set())  # untimed warm-up pass
+        repeated_runtime = 0.0
+        for _ in range(int(PREAGGREGATION_WORKLOAD_MANIFEST["repetitions"])):
+            extra_runtime, _ = _run_preaggregation_reports(con, set())
+            repeated_runtime += extra_runtime
+        return {
+            "setup_runtime_s": 0.0,
+            "candidate_workload_runtime_s": float(repeated_runtime),
+            "candidate_total_runtime_s": float(repeated_runtime),
+            "baseline_total_runtime_s": float(repeated_runtime),
+            "selected_preaggregation_count": 0,
+        }
+    baseline_con = build_connection()
+    candidate_con = build_connection()
+    start_setup = time.perf_counter()
+    for name in selected_preaggregations:
+        candidate_con.execute(PREAGGREGATION_CANDIDATES[name])
+    setup_runtime = time.perf_counter() - start_setup
+
+    _, baseline_results = _run_preaggregation_reports(baseline_con, set())
+    _, candidate_results = _run_preaggregation_reports(candidate_con, set(selected_preaggregations))
+    if any(not compare_results(left, right) for left, right in zip(candidate_results, baseline_results)):
+        raise ValueError("candidate pre-aggregation selection changed the query results")
+
+    _run_preaggregation_reports(baseline_con, set())  # untimed warm-up pass (baseline)
+    _run_preaggregation_reports(candidate_con, set(selected_preaggregations))  # untimed warm-up pass (candidate)
+
+    repeated_baseline_runtime = 0.0
+    for _ in range(int(PREAGGREGATION_WORKLOAD_MANIFEST["repetitions"])):
+        extra_runtime, _ = _run_preaggregation_reports(baseline_con, set())
+        repeated_baseline_runtime += extra_runtime
+
+    repeated_candidate_runtime = 0.0
+    for _ in range(int(PREAGGREGATION_WORKLOAD_MANIFEST["repetitions"])):
+        extra_runtime, _ = _run_preaggregation_reports(candidate_con, set(selected_preaggregations))
+        repeated_candidate_runtime += extra_runtime
+
+    candidate_total_runtime = setup_runtime + repeated_candidate_runtime
+    baseline_total_runtime = repeated_baseline_runtime
+    return {
+        "setup_runtime_s": float(setup_runtime),
+        "candidate_workload_runtime_s": float(repeated_candidate_runtime),
+        "candidate_total_runtime_s": float(candidate_total_runtime),
+        "baseline_total_runtime_s": float(baseline_total_runtime),
+        "selected_preaggregation_count": len(selected_preaggregations),
+    }
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/runtime/problem.py b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/runtime/problem.py
new file mode 100644
index 00000000..b70a9607
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/runtime/problem.py
@@ -0,0 +1,14 @@
+from __future__ import annotations
+
+from .duckdb_local_workload import PREAGGREGATION_WORKLOAD_MANIFEST, measure_preaggregation_design, normalize_name_list
+
+
+WORKLOAD_MANIFEST = dict(PREAGGREGATION_WORKLOAD_MANIFEST)
+
+
+def load_instance():
+    return dict(WORKLOAD_MANIFEST)
+
+
+def evaluate_selection(selection):
+    return measure_preaggregation_design(normalize_name_list(selection, "preaggregations"))
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/scripts/init.py b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/scripts/init.py
new file mode 100644
index 00000000..93cf2a4d
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/scripts/init.py
@@ -0,0 +1,44 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.ComputerSystems.DuckDBPreAggregationSelection.baseline.solution import select_preaggregations as _baseline_select_preaggregations
+    from benchmarks.ComputerSystems.DuckDBPreAggregationSelection.runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+except ModuleNotFoundError:
+    from baseline.solution import select_preaggregations as _baseline_select_preaggregations
+    from runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+
+
+# EVOLVE-BLOCK-START
+def select_preaggregations(workload_manifest):
+    return _baseline_select_preaggregations(workload_manifest)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    print(evaluate_selection(select_preaggregations(WORKLOAD_MANIFEST)))
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/verification/evaluator.py b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/verification/evaluator.py
new file mode 100644
index 00000000..21252be1
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/verification/evaluator.py
@@ -0,0 +1,90 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.ComputerSystems.DuckDBPreAggregationSelection.baseline.solution import select_preaggregations as baseline_select_preaggregations
+    from benchmarks.ComputerSystems.DuckDBPreAggregationSelection.runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+except ModuleNotFoundError:
+    from baseline.solution import select_preaggregations as baseline_select_preaggregations
+    from runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+
+
+def evaluate(program_path: str):
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_total_runtime_s": 0.0,
+        "baseline_total_runtime_s": 0.0,
+        "candidate_setup_runtime_s": 0.0,
+        "candidate_workload_runtime_s": 0.0,
+    }
+    artifacts = {}
+    try:
+        namespace = runpy.run_path(str(Path(program_path).expanduser().resolve()), run_name="candidate_program")
+        select_preaggregations = namespace.get("select_preaggregations")
+        if not callable(select_preaggregations):
+            artifacts["error_message"] = "candidate must define select_preaggregations(workload_manifest)"
+            return metrics, artifacts
+        baseline = evaluate_selection(baseline_select_preaggregations(WORKLOAD_MANIFEST))
+        candidate = evaluate_selection(select_preaggregations(WORKLOAD_MANIFEST))
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+    candidate_total = float(candidate["candidate_total_runtime_s"])
+    baseline_total = float(candidate["baseline_total_runtime_s"])
+    if not math.isfinite(candidate_total) or candidate_total <= 0:
+        artifacts["error_message"] = "candidate runtime is invalid"
+        return metrics, artifacts
+    metrics["valid"] = 1.0
+    metrics["candidate_total_runtime_s"] = candidate_total
+    metrics["baseline_total_runtime_s"] = baseline_total
+    metrics["candidate_setup_runtime_s"] = float(candidate["setup_runtime_s"])
+    metrics["candidate_workload_runtime_s"] = float(candidate["candidate_workload_runtime_s"])
+    metrics["combined_score"] = -candidate_total
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/verification/requirements.txt b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/verification/requirements.txt
new file mode 100644
index 00000000..8a6ba6a1
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBPreAggregationSelection/verification/requirements.txt
@@ -0,0 +1 @@
+duckdb
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/README.md b/benchmarks/ComputerSystems/DuckDBQueryRewrite/README.md
new file mode 100644
index 00000000..aa810cbd
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/README.md
@@ -0,0 +1,55 @@
+# DuckDB Query Rewrite
+
+Rewrite one frozen analytical DuckDB query so that results stay identical while runtime decreases.
+
+## Why This Benchmark Matters
+
+This benchmark stands in for real SQL performance tuning, where engineers often cannot change upstream product logic but can still rewrite a slow analytical query. Runtime matters, but only after semantic equivalence is preserved exactly.
+
+From a CS point of view, this is a semantics-preserving program transformation problem where the “program” happens to be SQL.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `rewrite_query(sql, workload_manifest)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/ComputerSystems/DuckDBQueryRewrite/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/ComputerSystems/DuckDBQueryRewrite/verification/evaluator.py \
+  benchmarks/ComputerSystems/DuckDBQueryRewrite/scripts/init.py \
+  --metrics-out /tmp/DuckDBQueryRewrite_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=ComputerSystems/DuckDBQueryRewrite \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/README_zh-CN.md b/benchmarks/ComputerSystems/DuckDBQueryRewrite/README_zh-CN.md
new file mode 100644
index 00000000..9e664307
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/README_zh-CN.md
@@ -0,0 +1,55 @@
+# DuckDB 查询重写
+
+重写一条冻结的 DuckDB 分析 SQL，在保证结果完全一致的前提下尽量缩短运行时间。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 对应的是真实 SQL 调优场景：工程师往往不能改上游产品逻辑，但仍然可以通过重写一条慢查询来提速。这里性能当然重要，但前提永远是语义必须完全不变。
+
+从计算机视角看，这是一道保持语义不变的程序变换题，只不过这里被优化的“程序”是 SQL。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`rewrite_query(sql, workload_manifest)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/ComputerSystems/DuckDBQueryRewrite/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/ComputerSystems/DuckDBQueryRewrite/verification/evaluator.py \
+  benchmarks/ComputerSystems/DuckDBQueryRewrite/scripts/init.py \
+  --metrics-out /tmp/DuckDBQueryRewrite_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=ComputerSystems/DuckDBQueryRewrite \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/Task.md b/benchmarks/ComputerSystems/DuckDBQueryRewrite/Task.md
new file mode 100644
index 00000000..32f5f941
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/Task.md
@@ -0,0 +1,50 @@
+# DuckDB Query Rewrite Task
+
+## Problem
+
+Rewrite one frozen analytical DuckDB query so that results stay identical while runtime decreases.
+
+This benchmark stands in for real SQL performance tuning, where engineers often cannot change upstream product logic but can still rewrite a slow analytical query. Runtime matters, but only after semantic equivalence is preserved exactly.
+
+From a CS point of view, this is a semantics-preserving program transformation problem where the “program” happens to be SQL.
+
+## What Is Frozen
+
+- The schema, local data generator, original SQL, and workload manifest in `runtime/problem.py`.
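+- The exact-result equivalence check used by the evaluator (same rows in the same order; floating-point values compared within an absolute tolerance of `1e-6`).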
+- The repeated timing protocol for the candidate and baseline queries.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def rewrite_query(sql, workload_manifest):
+    ...
+```
+
+Return a rewritten SQL string. A dict with key `sql` is also accepted.
+
+## Evaluation
+
+1. Build the frozen DuckDB database and execute the original SQL to obtain the reference result.
+2. Execute your rewritten SQL and check exact result equivalence.
+3. If the results match, run each query once as a warm-up, then time repeated runs of both the original and the candidate query.
+4. Report candidate runtime together with the original-query baseline for context.
+
+## Metrics
+
+- `combined_score`: `-candidate_runtime_s`
+- `valid`: `1.0` only if the rewritten query preserves results exactly
+- `candidate_runtime_s`
+- `baseline_runtime_s`
+- `row_count`
+
+## Invalid Submissions
+
+- `rewrite_query(...)` is missing or crashes
+- The return value is not a SQL string or a dict with `sql`
+- The rewritten query fails to execute
+- The rewritten query changes the result set
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/Task_zh-CN.md b/benchmarks/ComputerSystems/DuckDBQueryRewrite/Task_zh-CN.md
new file mode 100644
index 00000000..5ce563e7
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/Task_zh-CN.md
@@ -0,0 +1,50 @@
+# DuckDB 查询重写
+
+## 任务概览
+
+重写一条冻结的 DuckDB 分析 SQL，在保证结果完全一致的前提下尽量缩短运行时间。
+
+这个 benchmark 对应的是真实 SQL 调优场景：工程师往往不能改上游产品逻辑，但仍然可以通过重写一条慢查询来提速。这里性能当然重要，但前提永远是语义必须完全不变。
+
+从计算机视角看，这是一道保持语义不变的程序变换题，只不过这里被优化的“程序”是 SQL。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中冻结的 schema、本地数据生成逻辑、原始 SQL 和 workload manifest。
+- 评测器使用的精确结果等价性校验。
+- 用于比较候选查询与 baseline 查询的固定重复计时协议。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def rewrite_query(sql, workload_manifest):
+    ...
+```
+
+返回重写后的 SQL 字符串；也接受带 `sql` 字段的字典。
+
+## 评测流程
+
+1. 构建冻结的 DuckDB 数据库，并先执行原始 SQL 得到参考结果。
+2. 执行你重写后的 SQL，并做精确结果等价校验。
+3. 只有结果一致时，才会继续对候选查询做重复计时。
+4. 输出候选运行时间，并同时给出原始查询的 baseline。
+
+## 指标
+
+- `combined_score`：`-candidate_runtime_s`
+- `valid`：只有重写后的查询精确保留结果时才为 `1.0`
+- `candidate_runtime_s`
+- `baseline_runtime_s`
+- `row_count`
+
+## 判为无效的情况
+
+- 缺少 `rewrite_query(...)`，或函数在评测中报错
+- 返回值不是 SQL 字符串，也不是带 `sql` 字段的字典
+- 重写后的查询无法执行
+- 重写后的查询改变了结果集
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/baseline/solution.py b/benchmarks/ComputerSystems/DuckDBQueryRewrite/baseline/solution.py
new file mode 100644
index 00000000..0d6b6bde
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/baseline/solution.py
@@ -0,0 +1,5 @@
+from __future__ import annotations
+
+
+def rewrite_query(sql, workload_manifest):
+    return sql
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/agent_files.txt b/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/candidate_destination.txt b/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/constraints.txt b/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/constraints.txt
new file mode 100644
index 00000000..88b1935c
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.
+Keep outputs valid and finite.
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/eval_command.txt b/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/eval_cwd.txt b/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/initial_program.txt b/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/readonly_files.txt b/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..8bb37291
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/frontier_eval/readonly_files.txt
@@ -0,0 +1,5 @@
+baseline/solution.py
+runtime/problem.py
+runtime/duckdb_local_workload.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/references/source_manifest.md b/benchmarks/ComputerSystems/DuckDBQueryRewrite/references/source_manifest.md
new file mode 100644
index 00000000..43dc6c9b
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/references/source_manifest.md
@@ -0,0 +1,10 @@
+# Source Manifest
+
+- Upstream engine: `DuckDB`
+- Upstream lineage:
+  - DuckDB benchmark and TPC-H documentation
+  - DuckDB SQL optimizer and query execution model
+- Schema lineage: this benchmark uses a local frozen relational workload with `customer`, `orders`, and `lineitem` tables modeled after the TPC-H schema family.
+- Data provenance: rows are generated deterministically inside DuckDB from fixed SQL formulas and a fixed schema; this is a benchmark-local synthetic dataset, not official TPC-H `dbgen` output.
+- Authenticity note: the workload shape is traceable to official DuckDB/TPC-H analytical reporting patterns, while the exact query instance is a benchmark-local frozen SQL task chosen to expose meaningful rewrite opportunities.
+- License lineage: DuckDB is released under the MIT License.
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/runtime/duckdb_local_workload.py b/benchmarks/ComputerSystems/DuckDBQueryRewrite/runtime/duckdb_local_workload.py
new file mode 100644
index 00000000..a9134cbc
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/runtime/duckdb_local_workload.py
@@ -0,0 +1,419 @@
+from __future__ import annotations
+
+import math
+import time
+from typing import Any
+
+import duckdb
+
+
+CUSTOMER_COUNT = 20_000
+ORDER_COUNT = 120_000
+LINEITEM_COUNT = 600_000
+
+SEGMENTS = ("BUILDING", "AUTOMOBILE", "HOUSEHOLD", "FURNITURE", "MACHINERY")
+SHIPMODES = ("AIR", "MAIL", "RAIL", "TRUCK", "SHIP")
+
+CUSTOMER_KEYS = tuple(1 + ((i * 97) % CUSTOMER_COUNT) for i in range(1, 301))
+ORDER_KEYS = tuple(1 + ((i * 193) % ORDER_COUNT) for i in range(1, 301))
+
+
+INDEX_CANDIDATES = {
+    "idx_orders_cust": "CREATE INDEX idx_orders_cust ON orders(o_custkey)",
+    "idx_orders_date": "CREATE INDEX idx_orders_date ON orders(o_orderdate)",
+    "idx_lineitem_order": "CREATE INDEX idx_lineitem_order ON lineitem(l_orderkey)",
+    "idx_customer_segment": "CREATE INDEX idx_customer_segment ON customer(c_mktsegment)",
+    "idx_orders_priority": "CREATE INDEX idx_orders_priority ON orders(o_orderpriority)",
+}
+
+INDEX_WORKLOAD_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "candidate_indexes": tuple(sorted(INDEX_CANDIDATES)),
+    "workload_notes": (
+        "Repeated selective customer lookups on orders",
+        "Repeated selective order lookups on lineitem",
+        "Repeated priority-filtered joins from customer to orders",
+    ),
+    "repetitions": 4,
+}
+
+
+PREAGGREGATION_CANDIDATES = {
+    "agg_quarter_segment_revenue": (
+        "CREATE TABLE agg_quarter_segment_revenue AS "
+        "SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket, "
+        "       c.c_mktsegment AS segment, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2"
+    ),
+    "agg_month_shipmode_revenue": (
+        "CREATE TABLE agg_month_shipmode_revenue AS "
+        "SELECT date_trunc('month', l.l_shipdate) AS month_bucket, "
+        "       l.l_shipmode AS shipmode, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM lineitem l "
+        "GROUP BY 1, 2"
+    ),
+    "agg_customer_year_revenue": (
+        "CREATE TABLE agg_customer_year_revenue AS "
+        "SELECT year(o.o_orderdate) AS revenue_year, "
+        "       c.c_custkey, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2"
+    ),
+    "agg_unused_priority_only": (
+        "CREATE TABLE agg_unused_priority_only AS "
+        "SELECT o.o_orderpriority, count(*) AS order_count "
+        "FROM orders o "
+        "GROUP BY 1"
+    ),
+}
+
+PREAGGREGATION_WORKLOAD_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "candidate_preaggregations": tuple(sorted(PREAGGREGATION_CANDIDATES)),
+    "workload_notes": (
+        "Quarter revenue by customer segment",
+        "Monthly revenue by ship mode",
+        "Top customers by yearly revenue",
+    ),
+    "repetitions": 4,
+}
+
+
+ORIGINAL_QUERY_SQL = '''
+WITH revenue AS (
+  SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket,
+         c.c_mktsegment AS segment,
+         sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue
+  FROM customer c
+  JOIN orders o ON o.o_custkey = c.c_custkey
+  JOIN lineitem l ON l.l_orderkey = o.o_orderkey
+  WHERE c.c_mktsegment IN ('BUILDING', 'AUTOMOBILE', 'HOUSEHOLD')
+  GROUP BY 1, 2
+),
+order_counts AS (
+  SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket,
+         c.c_mktsegment AS segment,
+         count(DISTINCT o.o_orderkey) AS order_count
+  FROM customer c
+  JOIN orders o ON o.o_custkey = c.c_custkey
+  JOIN lineitem l ON l.l_orderkey = o.o_orderkey
+  WHERE c.c_mktsegment IN ('BUILDING', 'AUTOMOBILE', 'HOUSEHOLD')
+  GROUP BY 1, 2
+)
+SELECT r.quarter_bucket, r.segment, r.revenue, o.order_count
+FROM revenue r
+JOIN order_counts o USING (quarter_bucket, segment)
+ORDER BY quarter_bucket, segment
+'''.strip()
+
+QUERY_REWRITE_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "query_goal": "Fuse repeated scans of the same join into one grouped aggregation while preserving results and ordering.",
+    "result_order_required": True,
+    "repetitions": 4,
+}
+
+
+def build_connection() -> duckdb.DuckDBPyConnection:
+    con = duckdb.connect(database=":memory:")
+    con.execute("PRAGMA threads=1")
+    con.execute(
+        f"""
+        CREATE TABLE customer AS
+        SELECT i AS c_custkey,
+               'Customer #' || i AS c_name,
+               CASE i % 5
+                 WHEN 0 THEN 'BUILDING'
+                 WHEN 1 THEN 'AUTOMOBILE'
+                 WHEN 2 THEN 'HOUSEHOLD'
+                 WHEN 3 THEN 'FURNITURE'
+                 ELSE 'MACHINERY'
+               END AS c_mktsegment,
+               i % 25 AS c_nationkey
+        FROM range(1, {CUSTOMER_COUNT + 1}) t(i)
+        """
+    )
+    con.execute(
+        f"""
+        CREATE TABLE orders AS
+        SELECT i AS o_orderkey,
+               1 + ((i * 17) % {CUSTOMER_COUNT}) AS o_custkey,
+               DATE '1995-01-01' + (((i * 13) % 1460) * INTERVAL 1 DAY) AS o_orderdate,
+               100 + (((i * 37) % 100000) / 10.0) AS o_totalprice,
+               CASE i % 5
+                 WHEN 0 THEN '1-URGENT'
+                 WHEN 1 THEN '2-HIGH'
+                 WHEN 2 THEN '3-MEDIUM'
+                 WHEN 3 THEN '4-NOT SPECIFIED'
+                 ELSE '5-LOW'
+               END AS o_orderpriority
+        FROM range(1, {ORDER_COUNT + 1}) t(i)
+        """
+    )
+    con.execute(
+        f"""
+        CREATE TABLE lineitem AS
+        SELECT i AS l_lineitemkey,
+               1 + ((i * 7) % {ORDER_COUNT}) AS l_orderkey,
+               1 + ((i * 11) % 50000) AS l_partkey,
+               1 + ((i * 13) % 10000) AS l_suppkey,
+               1 + ((i * 5) % 50) AS l_quantity,
+               10 + (((i * 19) % 100000) / 20.0) AS l_extendedprice,
+               (((i * 3) % 10) / 100.0) AS l_discount,
+               DATE '1995-01-01' + (((i * 29) % 1460) * INTERVAL 1 DAY) AS l_shipdate,
+               CASE i % 5
+                 WHEN 0 THEN 'AIR'
+                 WHEN 1 THEN 'MAIL'
+                 WHEN 2 THEN 'RAIL'
+                 WHEN 3 THEN 'TRUCK'
+                 ELSE 'SHIP'
+               END AS l_shipmode
+        FROM range(1, {LINEITEM_COUNT + 1}) t(i)
+        """
+    )
+    return con
+
+
+def normalize_name_list(value: Any, key: str) -> list[str]:
+    if isinstance(value, dict):
+        if key not in value:
+            raise ValueError(f"missing {key}")
+        value = value[key]
+    if not isinstance(value, (list, tuple)):
+        raise ValueError(f"{key} must be a list or tuple")
+    out: list[str] = []
+    seen = set()
+    for item in value:
+        name = str(item)
+        if name not in seen:
+            out.append(name)
+            seen.add(name)
+    return out
+
+
+def compare_results(lhs: list[tuple[Any, ...]], rhs: list[tuple[Any, ...]], tol: float = 1e-6) -> bool:
+    if len(lhs) != len(rhs):
+        return False
+    for left_row, right_row in zip(lhs, rhs):
+        if len(left_row) != len(right_row):
+            return False
+        for left_value, right_value in zip(left_row, right_row):
+            if isinstance(left_value, float) or isinstance(right_value, float):
+                if not math.isfinite(float(left_value)) or not math.isfinite(float(right_value)):
+                    return False
+                if abs(float(left_value) - float(right_value)) > tol:
+                    return False
+            else:
+                if left_value != right_value:
+                    return False
+    return True
+
+
+def _report_quarter_segment(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT quarter_bucket, segment, revenue "
+            "FROM agg_quarter_segment_revenue "
+            "ORDER BY quarter_bucket, segment"
+        ).fetchall()
+    return con.execute(
+        "SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket, "
+        "       c.c_mktsegment AS segment, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2 "
+        "ORDER BY quarter_bucket, segment"
+    ).fetchall()
+
+
+def _report_month_shipmode(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT month_bucket, shipmode, revenue "
+            "FROM agg_month_shipmode_revenue "
+            "WHERE month_bucket >= DATE '1997-01-01' "
+            "ORDER BY month_bucket, shipmode"
+        ).fetchall()
+    return con.execute(
+        "SELECT date_trunc('month', l.l_shipdate) AS month_bucket, "
+        "       l.l_shipmode AS shipmode, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM lineitem l "
+        "WHERE l.l_shipdate >= DATE '1997-01-01' "
+        "GROUP BY 1, 2 "
+        "ORDER BY month_bucket, shipmode"
+    ).fetchall()
+
+
+def _report_customer_year(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT revenue_year, c_custkey, revenue "
+            "FROM agg_customer_year_revenue "
+            "WHERE revenue_year = 1998 "
+            "ORDER BY revenue DESC, c_custkey "
+            "LIMIT 100"
+        ).fetchall()
+    return con.execute(
+        "SELECT year(o.o_orderdate) AS revenue_year, "
+        "       c.c_custkey, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2 "
+        "HAVING year(o.o_orderdate) = 1998 "
+        "ORDER BY revenue DESC, c.c_custkey "
+        "LIMIT 100"
+    ).fetchall()
+
+
+def run_index_workload(con: duckdb.DuckDBPyConnection) -> float:
+    start_time = time.perf_counter()
+    for customer_key in CUSTOMER_KEYS:
+        con.execute(
+            "SELECT sum(o_totalprice) "
+            "FROM orders "
+            "WHERE o_custkey = ? AND o_orderdate >= DATE '1997-01-01'",
+            [customer_key],
+        ).fetchone()
+    for order_key in ORDER_KEYS:
+        con.execute(
+            "SELECT sum(l_extendedprice * (1 - l_discount)) "
+            "FROM lineitem "
+            "WHERE l_orderkey = ?",
+            [order_key],
+        ).fetchone()
+    for customer_key in CUSTOMER_KEYS[:120]:
+        con.execute(
+            "SELECT count(*) "
+            "FROM customer c "
+            "JOIN orders o ON c.c_custkey = o.o_custkey "
+            "WHERE c.c_custkey = ? AND o.o_orderpriority = '1-URGENT'",
+            [customer_key],
+        ).fetchone()
+    return time.perf_counter() - start_time
+
+
+def measure_index_design(selected_indexes: list[str]) -> dict[str, float | int]:
+    unknown = [name for name in selected_indexes if name not in INDEX_CANDIDATES]
+    if unknown:
+        raise ValueError(f"unknown index names: {unknown}")
+    con = build_connection()
+    start_setup = time.perf_counter()
+    for name in selected_indexes:
+        con.execute(INDEX_CANDIDATES[name])
+    setup_runtime = time.perf_counter() - start_setup
+    run_index_workload(con)
+    workload_runtime = 0.0
+    for _ in range(int(INDEX_WORKLOAD_MANIFEST["repetitions"])):
+        workload_runtime += run_index_workload(con)
+    return {
+        "setup_runtime_s": float(setup_runtime),
+        "workload_runtime_s": float(workload_runtime),
+        "total_runtime_s": float(setup_runtime + workload_runtime),
+        "selected_index_count": len(selected_indexes),
+    }
+
+
+def measure_query_rewrite(sql: str) -> dict[str, Any]:
+    sql = str(sql).strip()
+    if not sql:
+        raise ValueError("query must not be empty")
+    baseline_con = build_connection()
+    candidate_con = build_connection()
+    baseline_rows = baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()
+    candidate_rows = candidate_con.execute(sql).fetchall()
+    if not compare_results(candidate_rows, baseline_rows):
+        raise ValueError("candidate query result does not match the baseline result")
+
+    baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()
+    baseline_start = time.perf_counter()
+    for _ in range(int(QUERY_REWRITE_MANIFEST["repetitions"])):
+        baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()
+    baseline_runtime = time.perf_counter() - baseline_start
+
+    candidate_con.execute(sql).fetchall()
+    candidate_start = time.perf_counter()
+    for _ in range(int(QUERY_REWRITE_MANIFEST["repetitions"])):
+        candidate_rows = candidate_con.execute(sql).fetchall()
+    candidate_runtime = time.perf_counter() - candidate_start
+
+    return {
+        "baseline_runtime_s": float(baseline_runtime),
+        "candidate_runtime_s": float(candidate_runtime),
+        "row_count": len(candidate_rows),
+    }
+
+
+def _run_preaggregation_reports(con: duckdb.DuckDBPyConnection, selected: set[str]) -> tuple[float, tuple[list[tuple[Any, ...]], ...]]:
+    start_time = time.perf_counter()
+    result_a = _report_quarter_segment(con, "agg_quarter_segment_revenue" in selected)
+    result_b = _report_month_shipmode(con, "agg_month_shipmode_revenue" in selected)
+    result_c = _report_customer_year(con, "agg_customer_year_revenue" in selected)
+    runtime = time.perf_counter() - start_time
+    return runtime, (result_a, result_b, result_c)
+
+
+def measure_preaggregation_design(selected_preaggregations: list[str]) -> dict[str, float | int]:
+    unknown = [name for name in selected_preaggregations if name not in PREAGGREGATION_CANDIDATES]
+    if unknown:
+        raise ValueError(f"unknown pre-aggregation names: {unknown}")
+    if not selected_preaggregations:
+        con = build_connection()
+        _run_preaggregation_reports(con, set())
+        repeated_runtime = 0.0
+        for _ in range(int(PREAGGREGATION_WORKLOAD_MANIFEST["repetitions"])):
+            extra_runtime, _ = _run_preaggregation_reports(con, set())
+            repeated_runtime += extra_runtime
+        return {
+            "setup_runtime_s": 0.0,
+            "candidate_workload_runtime_s": float(repeated_runtime),
+            "candidate_total_runtime_s": float(repeated_runtime),
+            "baseline_total_runtime_s": float(repeated_runtime),
+            "selected_preaggregation_count": 0,
+        }
+    baseline_con = build_connection()
+    candidate_con = build_connection()
+    start_setup = time.perf_counter()
+    for name in selected_preaggregations:
+        candidate_con.execute(PREAGGREGATION_CANDIDATES[name])
+    setup_runtime = time.perf_counter() - start_setup
+
+    _, baseline_results = _run_preaggregation_reports(baseline_con, set())
+    _, candidate_results = _run_preaggregation_reports(candidate_con, set(selected_preaggregations))
+    if any(not compare_results(left, right) for left, right in zip(candidate_results, baseline_results)):
+        raise ValueError("candidate pre-aggregation selection changed the query results")
+
+    _run_preaggregation_reports(baseline_con, set())
+    _run_preaggregation_reports(candidate_con, set(selected_preaggregations))
+
+    repeated_baseline_runtime = 0.0
+    for _ in range(int(PREAGGREGATION_WORKLOAD_MANIFEST["repetitions"])):
+        extra_runtime, _ = _run_preaggregation_reports(baseline_con, set())
+        repeated_baseline_runtime += extra_runtime
+
+    repeated_candidate_runtime = 0.0
+    for _ in range(int(PREAGGREGATION_WORKLOAD_MANIFEST["repetitions"])):
+        extra_runtime, _ = _run_preaggregation_reports(candidate_con, set(selected_preaggregations))
+        repeated_candidate_runtime += extra_runtime
+
+    candidate_total_runtime = setup_runtime + repeated_candidate_runtime
+    baseline_total_runtime = repeated_baseline_runtime
+    return {
+        "setup_runtime_s": float(setup_runtime),
+        "candidate_workload_runtime_s": float(repeated_candidate_runtime),
+        "candidate_total_runtime_s": float(candidate_total_runtime),
+        "baseline_total_runtime_s": float(baseline_total_runtime),
+        "selected_preaggregation_count": len(selected_preaggregations),
+    }
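Review note: `INDEX_CANDIDATES` whitelists only five indexes, so there are 2^5 = 32 possible subsets and an exhaustive search is feasible. The sketch below is a minimal illustration, not the benchmark's baseline. `all_subsets`, `best_subset`, and `toy_cost` are hypothetical names, and `toy_cost` is an invented cost model standing in for `measure_index_design(...)["total_runtime_s"]`; its numbers are not measurements.

```python
from itertools import combinations
from typing import Callable, Iterable, Iterator

# Candidate names copied from INDEX_CANDIDATES in this diff.
CANDIDATES = (
    "idx_customer_segment",
    "idx_lineitem_order",
    "idx_orders_cust",
    "idx_orders_date",
    "idx_orders_priority",
)


def all_subsets(names: Iterable[str]) -> Iterator[list[str]]:
    """Yield every subset of names, smallest subsets first."""
    names = tuple(names)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            yield list(subset)


def best_subset(names: Iterable[str], cost: Callable[[list[str]], float]) -> list[str]:
    """Return the subset with the lowest cost; ties go to the first subset seen."""
    return min(all_subsets(names), key=cost)


def toy_cost(selected: list[str]) -> float:
    """Hypothetical cost model (assumption, not measured data).

    Every index costs 1.0 to build, and two of them are assumed to save
    3.0 of workload time each.
    """
    helpful = {"idx_orders_cust", "idx_lineitem_order"}
    build = 1.0 * len(selected)
    workload = 10.0 - 3.0 * len(helpful.intersection(selected))
    return build + workload


if __name__ == "__main__":
    print(best_subset(CANDIDATES, toy_cost))
```

With a real timing function in place of `toy_cost`, each subset would likely need several measurements, because wall-clock runtimes are noisy and close subsets can swap order between runs.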
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/runtime/problem.py b/benchmarks/ComputerSystems/DuckDBQueryRewrite/runtime/problem.py
new file mode 100644
index 00000000..81e8a337
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/runtime/problem.py
@@ -0,0 +1,18 @@
+from __future__ import annotations
+
+from .duckdb_local_workload import ORIGINAL_QUERY_SQL, QUERY_REWRITE_MANIFEST, measure_query_rewrite
+
+
+WORKLOAD_MANIFEST = dict(QUERY_REWRITE_MANIFEST)
+
+
+def load_instance():
+    return {"sql": ORIGINAL_QUERY_SQL, "manifest": dict(WORKLOAD_MANIFEST)}
+
+
+def evaluate_query(value):
+    if isinstance(value, dict):
+        if "sql" not in value:
+            raise ValueError("missing sql")
+        value = value["sql"]
+    return measure_query_rewrite(str(value))
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/scripts/init.py b/benchmarks/ComputerSystems/DuckDBQueryRewrite/scripts/init.py
new file mode 100644
index 00000000..c0f51b39
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/scripts/init.py
@@ -0,0 +1,44 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.ComputerSystems.DuckDBQueryRewrite.baseline.solution import rewrite_query as _baseline_rewrite_query
+    from benchmarks.ComputerSystems.DuckDBQueryRewrite.runtime.problem import ORIGINAL_QUERY_SQL, WORKLOAD_MANIFEST, evaluate_query
+except ModuleNotFoundError:
+    from baseline.solution import rewrite_query as _baseline_rewrite_query
+    from runtime.problem import ORIGINAL_QUERY_SQL, WORKLOAD_MANIFEST, evaluate_query
+
+
+# EVOLVE-BLOCK-START
+def rewrite_query(sql, workload_manifest):
+    return _baseline_rewrite_query(sql, workload_manifest)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    print(evaluate_query(rewrite_query(ORIGINAL_QUERY_SQL, WORKLOAD_MANIFEST)))
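Review note: `QUERY_REWRITE_MANIFEST["query_goal"]` asks for the two CTEs of `ORIGINAL_QUERY_SQL` to be fused into one grouped aggregation. The sketch below checks that idea for result equivalence on a toy dataset. It rests on several assumptions: it uses the standard-library `sqlite3` module instead of DuckDB, replaces `date_trunc('quarter', ...)` with a precomputed text column `o_quarter`, and uses invented rows. It says nothing about runtime in DuckDB.

```python
import sqlite3

# Same shape as ORIGINAL_QUERY_SQL: the three-way join is scanned twice.
ORIGINAL_SHAPE = """
WITH revenue AS (
  SELECT o.o_quarter AS quarter_bucket, c.c_mktsegment AS segment,
         sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue
  FROM customer c
  JOIN orders o ON o.o_custkey = c.c_custkey
  JOIN lineitem l ON l.l_orderkey = o.o_orderkey
  WHERE c.c_mktsegment IN ('BUILDING', 'AUTOMOBILE', 'HOUSEHOLD')
  GROUP BY o.o_quarter, c.c_mktsegment
),
order_counts AS (
  SELECT o.o_quarter AS quarter_bucket, c.c_mktsegment AS segment,
         count(DISTINCT o.o_orderkey) AS order_count
  FROM customer c
  JOIN orders o ON o.o_custkey = c.c_custkey
  JOIN lineitem l ON l.l_orderkey = o.o_orderkey
  WHERE c.c_mktsegment IN ('BUILDING', 'AUTOMOBILE', 'HOUSEHOLD')
  GROUP BY o.o_quarter, c.c_mktsegment
)
SELECT r.quarter_bucket, r.segment, r.revenue, oc.order_count
FROM revenue r
JOIN order_counts oc USING (quarter_bucket, segment)
ORDER BY r.quarter_bucket, r.segment
"""

# Fused shape: one pass over the join computes both aggregates.
FUSED_SHAPE = """
SELECT o.o_quarter AS quarter_bucket, c.c_mktsegment AS segment,
       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue,
       count(DISTINCT o.o_orderkey) AS order_count
FROM customer c
JOIN orders o ON o.o_custkey = c.c_custkey
JOIN lineitem l ON l.l_orderkey = o.o_orderkey
WHERE c.c_mktsegment IN ('BUILDING', 'AUTOMOBILE', 'HOUSEHOLD')
GROUP BY o.o_quarter, c.c_mktsegment
ORDER BY quarter_bucket, segment
"""


def build_toy_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with invented rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customer (c_custkey INTEGER, c_mktsegment TEXT)")
    con.execute("CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER, o_quarter TEXT)")
    con.execute("CREATE TABLE lineitem (l_orderkey INTEGER, l_extendedprice REAL, l_discount REAL)")
    con.executemany("INSERT INTO customer VALUES (?, ?)",
                    [(1, "BUILDING"), (2, "AUTOMOBILE"), (3, "FURNITURE")])
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                    [(10, 1, "1995-Q1"), (11, 1, "1995-Q1"), (12, 2, "1995-Q2"), (13, 3, "1995-Q1")])
    con.executemany("INSERT INTO lineitem VALUES (?, ?, ?)",
                    [(10, 100.0, 0.0), (10, 200.0, 0.5), (11, 50.0, 0.0),
                     (12, 80.0, 0.25), (13, 999.0, 0.0)])
    return con


if __name__ == "__main__":
    con = build_toy_db()
    print(con.execute(FUSED_SHAPE).fetchall())
```

The toy prices and discounts are exactly representable in binary floating point, so the two shapes can be compared with plain equality here. On the real workload the evaluator compares results through `compare_results` with a tolerance.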
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/verification/evaluator.py b/benchmarks/ComputerSystems/DuckDBQueryRewrite/verification/evaluator.py
new file mode 100644
index 00000000..ac90231c
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/verification/evaluator.py
@@ -0,0 +1,88 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.ComputerSystems.DuckDBQueryRewrite.baseline.solution import rewrite_query as baseline_rewrite_query
+    from benchmarks.ComputerSystems.DuckDBQueryRewrite.runtime.problem import ORIGINAL_QUERY_SQL, WORKLOAD_MANIFEST, evaluate_query
+except ModuleNotFoundError:
+    from baseline.solution import rewrite_query as baseline_rewrite_query
+    from runtime.problem import ORIGINAL_QUERY_SQL, WORKLOAD_MANIFEST, evaluate_query
+
+
+def evaluate(program_path: str):
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_runtime_s": 0.0,
+        "baseline_runtime_s": 0.0,
+        "row_count": 0.0,
+    }
+    artifacts = {}
+    try:
+        # Load the candidate inside the try block so import-time crashes are reported as invalid runs.
+        namespace = runpy.run_path(str(Path(program_path).expanduser().resolve()), run_name="candidate_program")
+        rewrite_query = namespace.get("rewrite_query")
+        if not callable(rewrite_query):
+            raise TypeError("candidate must define rewrite_query(sql, workload_manifest)")
+        baseline = evaluate_query(baseline_rewrite_query(ORIGINAL_QUERY_SQL, WORKLOAD_MANIFEST))
+        candidate = evaluate_query(rewrite_query(ORIGINAL_QUERY_SQL, WORKLOAD_MANIFEST))
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+    candidate_runtime = float(candidate["candidate_runtime_s"])
+    baseline_runtime = float(baseline["candidate_runtime_s"])  # runtime of the baseline rewrite, which was measured as a candidate
+    if not math.isfinite(candidate_runtime) or candidate_runtime <= 0:
+        artifacts["error_message"] = "candidate runtime is invalid"
+        return metrics, artifacts
+    metrics["valid"] = 1.0
+    metrics["candidate_runtime_s"] = candidate_runtime
+    metrics["baseline_runtime_s"] = baseline_runtime
+    metrics["row_count"] = float(candidate["row_count"])
+    metrics["combined_score"] = -candidate_runtime
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
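Review note: the scoring rule in `evaluate` reduces to a small pure function. The sketch below restates it for illustration; `score_from_runtime` and `INVALID_SCORE` are hypothetical names, not part of the evaluator.

```python
import math

# Sentinel copied from the evaluator's default "combined_score".
INVALID_SCORE = -1e18


def score_from_runtime(candidate_runtime_s: float) -> dict[str, float]:
    """Map a measured runtime to the score fields the evaluator reports.

    A runtime that is non-finite or not strictly positive is treated as
    invalid; otherwise the score is the negated runtime, so faster
    candidates score higher.
    """
    runtime = float(candidate_runtime_s)
    if not math.isfinite(runtime) or runtime <= 0:
        return {"combined_score": INVALID_SCORE, "valid": 0.0}
    return {"combined_score": -runtime, "valid": 1.0}


if __name__ == "__main__":
    print(score_from_runtime(0.25))
    print(score_from_runtime(float("nan")))
```

The baseline runtime is reported next to the candidate runtime but does not enter `combined_score`.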
diff --git a/benchmarks/ComputerSystems/DuckDBQueryRewrite/verification/requirements.txt b/benchmarks/ComputerSystems/DuckDBQueryRewrite/verification/requirements.txt
new file mode 100644
index 00000000..8a6ba6a1
--- /dev/null
+++ b/benchmarks/ComputerSystems/DuckDBQueryRewrite/verification/requirements.txt
@@ -0,0 +1 @@
+duckdb
diff --git a/benchmarks/ComputerSystems/duckdb_local_workload.py b/benchmarks/ComputerSystems/duckdb_local_workload.py
new file mode 100644
index 00000000..0ea8a0c6
--- /dev/null
+++ b/benchmarks/ComputerSystems/duckdb_local_workload.py
@@ -0,0 +1,405 @@
+from __future__ import annotations
+
+import math
+import time
+from typing import Any
+
+import duckdb
+
+
+CUSTOMER_COUNT = 20_000
+ORDER_COUNT = 120_000
+LINEITEM_COUNT = 600_000
+
+SEGMENTS = ("BUILDING", "AUTOMOBILE", "HOUSEHOLD", "FURNITURE", "MACHINERY")
+SHIPMODES = ("AIR", "MAIL", "RAIL", "TRUCK", "SHIP")
+
+CUSTOMER_KEYS = tuple(1 + ((i * 97) % CUSTOMER_COUNT) for i in range(1, 301))
+ORDER_KEYS = tuple(1 + ((i * 193) % ORDER_COUNT) for i in range(1, 301))
+
+
+INDEX_CANDIDATES = {
+    "idx_orders_cust": "CREATE INDEX idx_orders_cust ON orders(o_custkey)",
+    "idx_orders_date": "CREATE INDEX idx_orders_date ON orders(o_orderdate)",
+    "idx_lineitem_order": "CREATE INDEX idx_lineitem_order ON lineitem(l_orderkey)",
+    "idx_customer_segment": "CREATE INDEX idx_customer_segment ON customer(c_mktsegment)",
+    "idx_orders_priority": "CREATE INDEX idx_orders_priority ON orders(o_orderpriority)",
+}
+
+INDEX_WORKLOAD_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "candidate_indexes": tuple(sorted(INDEX_CANDIDATES)),
+    "workload_notes": (
+        "Repeated selective customer lookups on orders",
+        "Repeated selective order lookups on lineitem",
+        "Repeated priority-filtered joins from customer to orders",
+    ),
+    "repetitions": 4,
+}
+
+
+PREAGGREGATION_CANDIDATES = {
+    "agg_quarter_segment_revenue": (
+        "CREATE TABLE agg_quarter_segment_revenue AS "
+        "SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket, "
+        "       c.c_mktsegment AS segment, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2"
+    ),
+    "agg_month_shipmode_revenue": (
+        "CREATE TABLE agg_month_shipmode_revenue AS "
+        "SELECT date_trunc('month', l.l_shipdate) AS month_bucket, "
+        "       l.l_shipmode AS shipmode, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM lineitem l "
+        "GROUP BY 1, 2"
+    ),
+    "agg_customer_year_revenue": (
+        "CREATE TABLE agg_customer_year_revenue AS "
+        "SELECT year(o.o_orderdate) AS revenue_year, "
+        "       c.c_custkey, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2"
+    ),
+    "agg_unused_priority_only": (
+        "CREATE TABLE agg_unused_priority_only AS "
+        "SELECT o.o_orderpriority, count(*) AS order_count "
+        "FROM orders o "
+        "GROUP BY 1"
+    ),
+}
+
+PREAGGREGATION_WORKLOAD_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "candidate_preaggregations": tuple(sorted(PREAGGREGATION_CANDIDATES)),
+    "workload_notes": (
+        "Quarter revenue by customer segment",
+        "Monthly revenue by ship mode",
+        "Top customers by yearly revenue",
+    ),
+    "repetitions": 4,
+}
+
+
+ORIGINAL_QUERY_SQL = '''
+WITH revenue AS (
+  SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket,
+         c.c_mktsegment AS segment,
+         sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue
+  FROM customer c
+  JOIN orders o ON o.o_custkey = c.c_custkey
+  JOIN lineitem l ON l.l_orderkey = o.o_orderkey
+  WHERE c.c_mktsegment IN ('BUILDING', 'AUTOMOBILE', 'HOUSEHOLD')
+  GROUP BY 1, 2
+),
+order_counts AS (
+  SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket,
+         c.c_mktsegment AS segment,
+         count(DISTINCT o.o_orderkey) AS order_count
+  FROM customer c
+  JOIN orders o ON o.o_custkey = c.c_custkey
+  JOIN lineitem l ON l.l_orderkey = o.o_orderkey
+  WHERE c.c_mktsegment IN ('BUILDING', 'AUTOMOBILE', 'HOUSEHOLD')
+  GROUP BY 1, 2
+)
+SELECT r.quarter_bucket, r.segment, r.revenue, o.order_count
+FROM revenue r
+JOIN order_counts o USING (quarter_bucket, segment)
+ORDER BY quarter_bucket, segment
+'''.strip()
+
+QUERY_REWRITE_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "query_goal": "Fuse repeated scans of the same join into one grouped aggregation while preserving results and ordering.",
+    "result_order_required": True,
+    "repetitions": 4,
+}
+
+
+def build_connection() -> duckdb.DuckDBPyConnection:
+    con = duckdb.connect(database=":memory:")
+    con.execute("PRAGMA threads=1")
+    con.execute(
+        f"""
+        CREATE TABLE customer AS
+        SELECT i AS c_custkey,
+               'Customer #' || i AS c_name,
+               CASE i % 5
+                 WHEN 0 THEN 'BUILDING'
+                 WHEN 1 THEN 'AUTOMOBILE'
+                 WHEN 2 THEN 'HOUSEHOLD'
+                 WHEN 3 THEN 'FURNITURE'
+                 ELSE 'MACHINERY'
+               END AS c_mktsegment,
+               i % 25 AS c_nationkey
+        FROM range(1, {CUSTOMER_COUNT + 1}) t(i)
+        """
+    )
+    con.execute(
+        f"""
+        CREATE TABLE orders AS
+        SELECT i AS o_orderkey,
+               1 + ((i * 17) % {CUSTOMER_COUNT}) AS o_custkey,
+               DATE '1995-01-01' + (((i * 13) % 1460) * INTERVAL 1 DAY) AS o_orderdate,
+               100 + (((i * 37) % 100000) / 10.0) AS o_totalprice,
+               CASE i % 5
+                 WHEN 0 THEN '1-URGENT'
+                 WHEN 1 THEN '2-HIGH'
+                 WHEN 2 THEN '3-MEDIUM'
+                 WHEN 3 THEN '4-NOT SPECIFIED'
+                 ELSE '5-LOW'
+               END AS o_orderpriority
+        FROM range(1, {ORDER_COUNT + 1}) t(i)
+        """
+    )
+    con.execute(
+        f"""
+        CREATE TABLE lineitem AS
+        SELECT i AS l_lineitemkey,
+               1 + ((i * 7) % {ORDER_COUNT}) AS l_orderkey,
+               1 + ((i * 11) % 50000) AS l_partkey,
+               1 + ((i * 13) % 10000) AS l_suppkey,
+               1 + ((i * 5) % 50) AS l_quantity,
+               10 + (((i * 19) % 100000) / 20.0) AS l_extendedprice,
+               (((i * 3) % 10) / 100.0) AS l_discount,
+               DATE '1995-01-01' + (((i * 29) % 1460) * INTERVAL 1 DAY) AS l_shipdate,
+               CASE i % 5
+                 WHEN 0 THEN 'AIR'
+                 WHEN 1 THEN 'MAIL'
+                 WHEN 2 THEN 'RAIL'
+                 WHEN 3 THEN 'TRUCK'
+                 ELSE 'SHIP'
+               END AS l_shipmode
+        FROM range(1, {LINEITEM_COUNT + 1}) t(i)
+        """
+    )
+    return con
+
+
+def normalize_name_list(value: Any, key: str) -> list[str]:
+    if isinstance(value, dict):
+        if key not in value:
+            raise ValueError(f"missing {key}")
+        value = value[key]
+    if not isinstance(value, (list, tuple)):
+        raise ValueError(f"{key} must be a list or tuple")
+    out: list[str] = []
+    seen = set()
+    for item in value:
+        name = str(item)
+        if name not in seen:
+            out.append(name)
+            seen.add(name)
+    return out
+
+
+def compare_results(lhs: list[tuple[Any, ...]], rhs: list[tuple[Any, ...]], tol: float = 1e-6) -> bool:
+    if len(lhs) != len(rhs):
+        return False
+    for left_row, right_row in zip(lhs, rhs):
+        if len(left_row) != len(right_row):
+            return False
+        for left_value, right_value in zip(left_row, right_row):
+            if isinstance(left_value, float) or isinstance(right_value, float):
+                if not math.isfinite(float(left_value)) or not math.isfinite(float(right_value)):
+                    return False
+                if abs(float(left_value) - float(right_value)) > tol:
+                    return False
+            else:
+                if left_value != right_value:
+                    return False
+    return True
+
+
+def _report_quarter_segment(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT quarter_bucket, segment, revenue "
+            "FROM agg_quarter_segment_revenue "
+            "ORDER BY quarter_bucket, segment"
+        ).fetchall()
+    return con.execute(
+        "SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket, "
+        "       c.c_mktsegment AS segment, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2 "
+        "ORDER BY quarter_bucket, segment"
+    ).fetchall()
+
+
+def _report_month_shipmode(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT month_bucket, shipmode, revenue "
+            "FROM agg_month_shipmode_revenue "
+            "WHERE month_bucket >= DATE '1997-01-01' "
+            "ORDER BY month_bucket, shipmode"
+        ).fetchall()
+    return con.execute(
+        "SELECT date_trunc('month', l.l_shipdate) AS month_bucket, "
+        "       l.l_shipmode AS shipmode, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM lineitem l "
+        "WHERE l.l_shipdate >= DATE '1997-01-01' "
+        "GROUP BY 1, 2 "
+        "ORDER BY month_bucket, shipmode"
+    ).fetchall()
+
+
+def _report_customer_year(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT revenue_year, c_custkey, revenue "
+            "FROM agg_customer_year_revenue "
+            "WHERE revenue_year = 1998 "
+            "ORDER BY revenue DESC, c_custkey "
+            "LIMIT 100"
+        ).fetchall()
+    return con.execute(
+        "SELECT year(o.o_orderdate) AS revenue_year, "
+        "       c.c_custkey, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2 "
+        "HAVING year(o.o_orderdate) = 1998 "
+        "ORDER BY revenue DESC, c.c_custkey "
+        "LIMIT 100"
+    ).fetchall()
+
+
+def run_index_workload(con: duckdb.DuckDBPyConnection) -> float:
+    start_time = time.perf_counter()
+    for customer_key in CUSTOMER_KEYS:
+        con.execute(
+            "SELECT sum(o_totalprice) "
+            "FROM orders "
+            "WHERE o_custkey = ? AND o_orderdate >= DATE '1997-01-01'",
+            [customer_key],
+        ).fetchone()
+    for order_key in ORDER_KEYS:
+        con.execute(
+            "SELECT sum(l_extendedprice * (1 - l_discount)) "
+            "FROM lineitem "
+            "WHERE l_orderkey = ?",
+            [order_key],
+        ).fetchone()
+    for customer_key in CUSTOMER_KEYS[:120]:
+        con.execute(
+            "SELECT count(*) "
+            "FROM customer c "
+            "JOIN orders o ON c.c_custkey = o.o_custkey "
+            "WHERE c.c_custkey = ? AND o.o_orderpriority = '1-URGENT'",
+            [customer_key],
+        ).fetchone()
+    return time.perf_counter() - start_time
+
+
+def measure_index_design(selected_indexes: list[str]) -> dict[str, float | int]:
+    unknown = [name for name in selected_indexes if name not in INDEX_CANDIDATES]
+    if unknown:
+        raise ValueError(f"unknown index names: {unknown}")
+    con = build_connection()
+    start_setup = time.perf_counter()
+    for name in selected_indexes:
+        con.execute(INDEX_CANDIDATES[name])
+    setup_runtime = time.perf_counter() - start_setup
+    run_index_workload(con)
+    workload_runtime = 0.0
+    for _ in range(int(INDEX_WORKLOAD_MANIFEST["repetitions"])):
+        workload_runtime += run_index_workload(con)
+    return {
+        "setup_runtime_s": float(setup_runtime),
+        "workload_runtime_s": float(workload_runtime),
+        "total_runtime_s": float(setup_runtime + workload_runtime),
+        "selected_index_count": len(selected_indexes),
+    }
+
+
+def measure_query_rewrite(sql: str) -> dict[str, Any]:
+    sql = str(sql).strip()
+    if not sql:
+        raise ValueError("query must not be empty")
+    baseline_con = build_connection()
+    candidate_con = build_connection()
+    baseline_rows = baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()
+    candidate_rows = candidate_con.execute(sql).fetchall()
+    if not compare_results(candidate_rows, baseline_rows):
+        raise ValueError("candidate query result does not match the baseline result")
+
+    baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()
+    baseline_start = time.perf_counter()
+    for _ in range(int(QUERY_REWRITE_MANIFEST["repetitions"])):
+        baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()
+    baseline_runtime = time.perf_counter() - baseline_start
+
+    candidate_con.execute(sql).fetchall()
+    candidate_start = time.perf_counter()
+    for _ in range(int(QUERY_REWRITE_MANIFEST["repetitions"])):
+        candidate_rows = candidate_con.execute(sql).fetchall()
+    candidate_runtime = time.perf_counter() - candidate_start
+
+    return {
+        "baseline_runtime_s": float(baseline_runtime),
+        "candidate_runtime_s": float(candidate_runtime),
+        "row_count": len(candidate_rows),
+    }
+
+
+def _run_preaggregation_reports(con: duckdb.DuckDBPyConnection, selected: set[str]) -> tuple[float, tuple[list[tuple[Any, ...]], ...]]:
+    start_time = time.perf_counter()
+    result_a = _report_quarter_segment(con, "agg_quarter_segment_revenue" in selected)
+    result_b = _report_month_shipmode(con, "agg_month_shipmode_revenue" in selected)
+    result_c = _report_customer_year(con, "agg_customer_year_revenue" in selected)
+    runtime = time.perf_counter() - start_time
+    return runtime, (result_a, result_b, result_c)
+
+
+def measure_preaggregation_design(selected_preaggregations: list[str]) -> dict[str, float | int]:
+    unknown = [name for name in selected_preaggregations if name not in PREAGGREGATION_CANDIDATES]
+    if unknown:
+        raise ValueError(f"unknown pre-aggregation names: {unknown}")
+    baseline_con = build_connection()
+    candidate_con = build_connection()
+    start_setup = time.perf_counter()
+    for name in selected_preaggregations:
+        candidate_con.execute(PREAGGREGATION_CANDIDATES[name])
+    setup_runtime = time.perf_counter() - start_setup
+
+    _, baseline_results = _run_preaggregation_reports(baseline_con, set())
+    _, candidate_results = _run_preaggregation_reports(candidate_con, set(selected_preaggregations))
+    if any(not compare_results(left, right) for left, right in zip(candidate_results, baseline_results)):
+        raise ValueError("candidate pre-aggregation selection changed the query results")
+
+    _run_preaggregation_reports(baseline_con, set())
+    _run_preaggregation_reports(candidate_con, set(selected_preaggregations))
+
+    repeated_baseline_runtime = 0.0
+    for _ in range(int(PREAGGREGATION_WORKLOAD_MANIFEST["repetitions"])):
+        extra_runtime, _ = _run_preaggregation_reports(baseline_con, set())
+        repeated_baseline_runtime += extra_runtime
+
+    repeated_candidate_runtime = 0.0
+    for _ in range(int(PREAGGREGATION_WORKLOAD_MANIFEST["repetitions"])):
+        extra_runtime, _ = _run_preaggregation_reports(candidate_con, set(selected_preaggregations))
+        repeated_candidate_runtime += extra_runtime
+
+    candidate_total_runtime = setup_runtime + repeated_candidate_runtime
+    baseline_total_runtime = repeated_baseline_runtime
+    return {
+        "setup_runtime_s": float(setup_runtime),
+        "candidate_workload_runtime_s": float(repeated_candidate_runtime),
+        "candidate_total_runtime_s": float(candidate_total_runtime),
+        "baseline_total_runtime_s": float(baseline_total_runtime),
+        "selected_preaggregation_count": len(selected_preaggregations),
+    }
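Review note: `measure_index_design`, `measure_query_rewrite`, and `measure_preaggregation_design` all follow the same pattern: one untimed warm-up run, then a fixed number of timed repetitions whose durations are summed. A minimal sketch of that pattern, assuming a hypothetical helper name `time_repeated` that does not exist in the module:

```python
import time
from typing import Callable


def time_repeated(fn: Callable[[], object], repetitions: int) -> float:
    """Run fn once untimed, then return the summed wall-clock time of the timed runs."""
    fn()  # warm-up run, excluded from the measurement
    total = 0.0
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total


if __name__ == "__main__":
    # Stand-in workload; the real code executes DuckDB queries instead.
    elapsed = time_repeated(lambda: sum(range(100_000)), repetitions=4)
    print(f"4 timed repetitions took {elapsed:.6f} s")
```

Summing rather than averaging matches the module, which adds setup time and total workload time into a single `total_runtime_s`.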
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/README.md b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/README.md
new file mode 100644
index 00000000..6555acba
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/README.md
@@ -0,0 +1,55 @@
+# Dynamic-Current Minimum-Time Routing
+
+Route a ship across a frozen coastal grid while minimizing travel time under deterministic current and depth constraints.
+
+## Why This Benchmark Matters
+
+This benchmark stands in for channel navigation and port-access planning. A fast route improves schedule reliability, but the shortest geometric route can be illegal or slow once current assistance and draft limits matter.
+
+Algorithmically, it is a constrained shortest-path problem on a fixed grid graph with physics-induced edge costs.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `solve(instance)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/verification/evaluator.py \
+  benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/scripts/init.py \
+  --metrics-out /tmp/DynamicCurrentMinimumTimeRouting_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/DynamicCurrentMinimumTimeRouting \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/README_zh-CN.md b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/README_zh-CN.md
new file mode 100644
index 00000000..ebbe4130
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/README_zh-CN.md
@@ -0,0 +1,55 @@
+# 动态流场最短航时船舶路径规划
+
+在冻结的沿海栅格上规划船舶航线，利用确定性流场并满足最小水深约束，使总航时尽量短。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 可以看作航道通行和港口进出规划的代理问题。更快的路线意味着更好的时刻可靠性，但一旦把流场增益和吃水限制考虑进去，几何最短路径往往既不合法，也不一定最快。
+
+从算法角度看，它是在固定栅格图上的受约束最短路问题，只不过边代价会受到物理场影响。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`solve(instance)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/verification/evaluator.py \
+  benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/scripts/init.py \
+  --metrics-out /tmp/DynamicCurrentMinimumTimeRouting_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/DynamicCurrentMinimumTimeRouting \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/Task.md b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/Task.md
new file mode 100644
index 00000000..b24e011c
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/Task.md
@@ -0,0 +1,52 @@
+# Dynamic-Current Minimum-Time Routing Task
+
+## Problem
+
+Route a ship across a frozen coastal grid while minimizing travel time under deterministic current and depth constraints.
+
+This benchmark stands in for channel navigation and port-access planning. A fast route improves schedule reliability, but the shortest geometric route can be illegal or slow once current assistance and draft limits matter.
+
+Algorithmically, it is a constrained shortest-path problem on a fixed grid graph with physics-induced edge costs.
+
+## What Is Frozen
+
+- The land mask, water cells, deterministic current field, and depth field in `runtime/problem.py`.
+- The start cell, goal cell, minimum draft requirement, and four-neighbor movement rule.
+- The travel-time computation and the reference metrics reported for baseline and Dijkstra-style routes.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def solve(instance):
+    ...
+```
+
+Return either a list of grid cells or a dict with key `path`. The path must start at `instance["start"]`, end at `instance["goal"]`, move only between four-neighbor adjacent cells (no diagonal steps), and stay on water cells with depth at least `instance["min_depth"]`.
+
+## Evaluation
+
+1. Load the frozen routing instance from `runtime/problem.py`.
+2. Validate the returned path against the start/end cells, adjacency rule, land mask, and minimum-depth constraint.
+3. Compute total travel time and hop count along the path.
+4. Report candidate time together with baseline and reference metrics for context.
+
+## Metrics
+
+- `combined_score`: `-candidate_time_h`
+- `valid`: `1.0` only if the route is feasible
+- `candidate_time_h`
+- `baseline_time_h`
+- `reference_time_h`
+- `candidate_hops`
+- `baseline_hops`
+
+## Invalid Submissions
+
+- `solve(...)` is missing or crashes
+- The returned value cannot be parsed into a path
+- The path has the wrong start or goal, contains a non-adjacent move, or enters land/shallow water
+- Any reported metric becomes non-finite
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/Task_zh-CN.md b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/Task_zh-CN.md
new file mode 100644
index 00000000..74461021
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/Task_zh-CN.md
@@ -0,0 +1,52 @@
+# 动态流场最短航时船舶路径规划
+
+## 任务概览
+
+在冻结的沿海栅格上规划船舶航线，利用确定性流场并满足最小水深约束，使总航时尽量短。
+
+这个 benchmark 可以看作航道通行和港口进出规划的代理问题。更快的路线意味着更好的时刻可靠性，但一旦把流场增益和吃水限制考虑进去，几何最短路径往往既不合法，也不一定最快。
+
+从算法角度看，它是在固定栅格图上的受约束最短路问题，只不过边代价会受到物理场影响。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中冻结的陆地掩码、水域格点、确定性流场和水深场。
+- 起点、终点、最小吃水要求，以及四邻接移动规则。
+- 固定的航时计算方式，以及 baseline 与参考路径的报告指标。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def solve(instance):
+    ...
+```
+
+返回一个网格坐标列表，或带 `path` 字段的字典。路径必须从 `instance["start"]` 出发，到达 `instance["goal"]`，每一步只走相邻格点，并始终停留在深度不小于 `instance["min_depth"]` 的可航行水域。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 载入冻结的航线实例。
+2. 检查返回路径的起终点、相邻移动规则、陆地掩码和最小水深约束。
+3. 计算整条路径的总航时与步数。
+4. 输出候选航时，并同时给出 baseline 和参考指标作对照。
+
+## 指标
+
+- `combined_score`：`-candidate_time_h`
+- `valid`：只有航线可行时才为 `1.0`
+- `candidate_time_h`
+- `baseline_time_h`
+- `reference_time_h`
+- `candidate_hops`
+- `baseline_hops`
+
+## 判为无效的情况
+
+- 缺少 `solve(...)`，或函数在评测中报错
+- 返回值无法解析为路径
+- 路径起终点错误、包含非相邻移动，或进入陆地/浅水区域
+- 任意报告指标出现非有限值
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/baseline/solution.py b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/baseline/solution.py
new file mode 100644
index 00000000..bd102573
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/baseline/solution.py
@@ -0,0 +1,10 @@
+from __future__ import annotations
+
+try:
+    from benchmarks.OperationsResearch.DynamicCurrentMinimumTimeRouting.runtime.problem import baseline_path
+except ModuleNotFoundError:
+    from runtime.problem import baseline_path
+
+
+def solve(instance):
+    return baseline_path()
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/agent_files.txt b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/candidate_destination.txt b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/constraints.txt b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/constraints.txt
new file mode 100644
index 00000000..88b1935c
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.
+Keep outputs valid and finite.
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/eval_command.txt b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/eval_cwd.txt b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/initial_program.txt b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/readonly_files.txt b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..75978e1f
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/frontier_eval/readonly_files.txt
@@ -0,0 +1,4 @@
+baseline/solution.py
+runtime/problem.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/references/source_manifest.md b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/references/source_manifest.md
new file mode 100644
index 00000000..b934d160
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/references/source_manifest.md
@@ -0,0 +1,9 @@
+# Source Manifest
+
+- Upstream lineage:
+  - TU Delft CITG `HALEM` repository and README
+  - Time-optimal ship routing with dynamic currents, variable velocity, and minimum-water-depth constraints
+- License lineage: upstream code lineage is MIT.
+- Data provenance: this benchmark does not vendor upstream hydrographic files. It uses a benchmark-local synthetic coastal grid, synthetic current field, and synthetic depth raster generated directly in `runtime/problem.py`.
+- Authenticity note: the routing objective and minimum-depth constraint follow official HALEM lineage, while the environmental data is a frozen synthetic stand-in for offline reproducibility.
+- Transformation path: no external preprocessing pipeline exists. All fields are generated from fixed formulas and constants inside the benchmark runtime.
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/runtime/problem.py b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/runtime/problem.py
new file mode 100644
index 00000000..57303e8a
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/runtime/problem.py
@@ -0,0 +1,195 @@
+from __future__ import annotations
+
+from collections import deque
+import math
+from typing import Any
+
+
+WIDTH = 20
+HEIGHT = 10
+START = (1, 4)
+GOAL = (18, 4)
+MIN_DEPTH = 2.5
+
+
+def is_land(cell: tuple[int, int]) -> bool:
+    x, y = cell
+    return 8 <= x <= 12 and 2 <= y <= 6
+
+
+def depth_at(cell: tuple[int, int]) -> float:
+    x, y = cell
+    if is_land(cell):
+        return 0.0
+    depth = 3.8
+    if y == 1 and 7 <= x <= 13:
+        depth = 2.7
+    if y == 6 and 2 <= x <= 5:
+        depth = 2.2
+    if y == 7 and 3 <= x <= 6:
+        depth = 2.4
+    return depth
+
+
+def is_navigable(cell: tuple[int, int]) -> bool:
+    x, y = cell
+    return 0 <= x < WIDTH and 0 <= y < HEIGHT and not is_land(cell) and depth_at(cell) >= MIN_DEPTH
+
+
+def _render_grid() -> tuple[str, ...]:
+    rows = []
+    for y in range(HEIGHT):
+        chars = []
+        for x in range(WIDTH):
+            cell = (x, y)
+            if cell == START:
+                chars.append("S")
+            elif cell == GOAL:
+                chars.append("G")
+            elif is_land(cell):
+                chars.append("#")
+            elif depth_at(cell) < MIN_DEPTH:
+                chars.append("~")
+            else:
+                chars.append(".")
+        rows.append("".join(chars))
+    return tuple(rows)
+
+
+GRID = _render_grid()
+
+
+def current_at(cell: tuple[int, int]) -> tuple[float, float]:
+    x, y = cell
+    ripple = 0.03 * math.sin(0.4 * x)
+    if y <= 2:
+        return (-0.36 + ripple, 0.01 * math.cos(0.3 * x))
+    if y >= 7:
+        return (0.44 + ripple, -0.01 * math.cos(0.3 * x))
+    return (-0.05 + ripple, 0.02 * math.sin(0.2 * x))
+
+
+def _field_to_rows(field_fn) -> tuple[tuple[Any, ...], ...]:
+    rows = []
+    for y in range(HEIGHT):
+        row = []
+        for x in range(WIDTH):
+            value = field_fn((x, y))
+            if isinstance(value, tuple):
+                row.append(tuple(round(v, 4) for v in value))
+            else:
+                row.append(round(float(value), 4))
+        rows.append(tuple(row))
+    return tuple(rows)
+
+
+CURRENT_FIELD = _field_to_rows(current_at)
+DEPTH_FIELD = _field_to_rows(depth_at)
+
+
+def load_instance() -> dict[str, Any]:
+    return {
+        "grid": GRID,
+        "start": START,
+        "goal": GOAL,
+        "current_field": CURRENT_FIELD,
+        "depth_field": DEPTH_FIELD,
+        "min_depth": MIN_DEPTH,
+        "objective": "time",
+    }
+
+
+def _to_cell(value: Any) -> tuple[int, int]:
+    if not isinstance(value, (tuple, list)) or len(value) != 2:
+        raise ValueError("cell must be a length-2 sequence")
+    return int(round(float(value[0]))), int(round(float(value[1])))
+
+
+def extract_path(value: Any) -> list[tuple[int, int]]:
+    if isinstance(value, dict):
+        if "path" not in value:
+            raise ValueError("missing path")
+        value = value["path"]
+    path = [_to_cell(cell) for cell in value]
+    if not path:
+        raise ValueError("path is empty")
+    return path
+
+
+def neighbors(cell: tuple[int, int], directions=((0, -1), (1, 0), (0, 1), (-1, 0))) -> list[tuple[int, int]]:
+    x, y = cell
+    result = []
+    for dx, dy in directions:
+        nxt = (x + dx, y + dy)
+        if is_navigable(nxt):
+            result.append(nxt)
+    return result
+
+
+def validate_path(value: Any) -> list[tuple[int, int]]:
+    path = extract_path(value)
+    if path[0] != START:
+        raise ValueError("path must start at START")
+    if path[-1] != GOAL:
+        raise ValueError("path must end at GOAL")
+    for cell in path:
+        if not is_navigable(cell):
+            raise ValueError("path enters land, leaves the map, or violates minimum depth")
+    for prev, curr in zip(path, path[1:]):
+        dx = abs(curr[0] - prev[0])
+        dy = abs(curr[1] - prev[1])
+        if dx + dy != 1:
+            raise ValueError("path contains a non-adjacent move")
+    return path
+
+
+def _leg_time(prev: tuple[int, int], curr: tuple[int, int]) -> float:
+    dx = curr[0] - prev[0]
+    dy = curr[1] - prev[1]
+    current_u, current_v = current_at(prev)
+    current_along = current_u * dx + current_v * dy
+    depth = depth_at(curr)
+    shallow_penalty = max(0.0, 3.0 - depth) * 0.22
+    speed = max(0.25, 1.0 + 0.9 * current_along - shallow_penalty)
+    return 1.0 / speed
+
+
+def route_metrics(value: Any) -> dict[str, float]:
+    path = validate_path(value)
+    total_time_h = 0.0
+    for prev, curr in zip(path, path[1:]):
+        total_time_h += _leg_time(prev, curr)
+    return {
+        "time_h": float(total_time_h),
+        "hops": float(len(path) - 1),
+    }
+
+
+def _retrace(parent, node):
+    path = []
+    current = node
+    while current is not None:
+        path.append(current)
+        current = parent[current]
+    return path[::-1]
+
+
+def baseline_path() -> list[tuple[int, int]]:
+    queue = deque([START])
+    parent = {START: None}
+    while queue:
+        current = queue.popleft()
+        if current == GOAL:
+            return _retrace(parent, current)
+        for nxt in neighbors(current):
+            if nxt not in parent:
+                parent[nxt] = current
+                queue.append(nxt)
+    raise RuntimeError("baseline path not found")
+
+
+BASELINE_PATH = baseline_path()
+BASELINE_TIME_H = route_metrics(BASELINE_PATH)["time_h"]
+BASELINE_HOPS = route_metrics(BASELINE_PATH)["hops"]
+REFERENCE_TIME_H = 20.012194145529936
+REFERENCE_HOPS = 23.0
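Reviewer note on `runtime/problem.py`: `baseline_path()` is a breadth-first search, so it minimizes hop count rather than travel time. A minimal sketch of a time-weighted alternative follows. It assumes the instance layout returned by `load_instance()` (fields indexed `[y][x]`) and re-derives the per-leg cost from the rounded `current_field` and `depth_field` values, so its times may differ slightly from what `route_metrics` reports. The helper name `leg_time` is hypothetical and not part of the benchmark.

```python
import heapq


def leg_time(instance, prev, curr):
    # ASSUMPTION: mirrors the per-leg cost in runtime/problem.py, but reads the
    # rounded fields from the instance instead of calling current_at/depth_at.
    dx, dy = curr[0] - prev[0], curr[1] - prev[1]
    u, v = instance["current_field"][prev[1]][prev[0]]
    depth = instance["depth_field"][curr[1]][curr[0]]
    shallow_penalty = max(0.0, 3.0 - depth) * 0.22
    speed = max(0.25, 1.0 + 0.9 * (u * dx + v * dy) - shallow_penalty)
    return 1.0 / speed


def solve(instance):
    grid = instance["grid"]
    height, width = len(grid), len(grid[0])
    start, goal = tuple(instance["start"]), tuple(instance["goal"])

    def navigable(cell):
        x, y = cell
        return (
            0 <= x < width
            and 0 <= y < height
            and grid[y][x] not in "#~"  # '#' is land, '~' is too shallow
            and instance["depth_field"][y][x] >= instance["min_depth"]
        )

    # Dijkstra is applicable because every leg time is strictly positive
    # (speed is clamped to at least 0.25).
    best = {start: 0.0}
    parent = {start: None}
    heap = [(0.0, start)]
    while heap:
        t, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if t > best[cell]:
            continue  # stale heap entry
        for dx, dy in ((0, -1), (1, 0), (0, 1), (-1, 0)):
            nxt = (cell[0] + dx, cell[1] + dy)
            if not navigable(nxt):
                continue
            cand = t + leg_time(instance, cell, nxt)
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                parent[nxt] = cell
                heapq.heappush(heap, (cand, nxt))
    if goal not in parent:
        raise RuntimeError("no feasible route")

    # Walk the parent links back from the goal, then reverse.
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]
```

Under the benchmark constraints such a function would live inside the `EVOLVE-BLOCK` of `scripts/init.py`. Whether it matches `REFERENCE_TIME_H` has not been checked here.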
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/scripts/init.py b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/scripts/init.py
new file mode 100644
index 00000000..48dc97ba
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/scripts/init.py
@@ -0,0 +1,45 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.DynamicCurrentMinimumTimeRouting.baseline.solution import solve as _baseline_solve
+    from benchmarks.OperationsResearch.DynamicCurrentMinimumTimeRouting.runtime.problem import load_instance, route_metrics
+except ModuleNotFoundError:
+    from baseline.solution import solve as _baseline_solve
+    from runtime.problem import load_instance, route_metrics
+
+
+# EVOLVE-BLOCK-START
+def solve(instance):
+    return _baseline_solve(instance)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    result = solve(load_instance())
+    print(route_metrics(result))
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/verification/evaluator.py b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/verification/evaluator.py
new file mode 100644
index 00000000..91b03525
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/verification/evaluator.py
@@ -0,0 +1,93 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.DynamicCurrentMinimumTimeRouting.baseline.solution import solve as baseline_solve
+    from benchmarks.OperationsResearch.DynamicCurrentMinimumTimeRouting.runtime.problem import BASELINE_HOPS, BASELINE_TIME_H, REFERENCE_TIME_H, load_instance, route_metrics
+except ModuleNotFoundError:
+    from baseline.solution import solve as baseline_solve
+    from runtime.problem import BASELINE_HOPS, BASELINE_TIME_H, REFERENCE_TIME_H, load_instance, route_metrics
+
+
+def evaluate(program_path: str):
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_time_h": 0.0,
+        "baseline_time_h": float(BASELINE_TIME_H),
+        "reference_time_h": float(REFERENCE_TIME_H),
+        "candidate_hops": 0.0,
+        "baseline_hops": float(BASELINE_HOPS),
+    }
+    artifacts = {}
+    namespace = runpy.run_path(str(Path(program_path).expanduser().resolve()), run_name="candidate_program")
+    solve_fn = namespace.get("solve")
+    if not callable(solve_fn):
+        artifacts["error_message"] = "candidate must define solve(instance)"
+        return metrics, artifacts
+
+    instance = load_instance()
+    try:
+        baseline_metrics = route_metrics(baseline_solve(instance))
+        candidate_metrics = route_metrics(solve_fn(instance))
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+
+    candidate_time_h = float(candidate_metrics["time_h"])
+    if not math.isfinite(candidate_time_h) or candidate_time_h <= 0:
+        artifacts["error_message"] = "candidate time is invalid"
+        return metrics, artifacts
+
+    metrics["valid"] = 1.0
+    metrics["candidate_time_h"] = candidate_time_h
+    metrics["candidate_hops"] = float(candidate_metrics["hops"])
+    metrics["baseline_time_h"] = float(baseline_metrics["time_h"])
+    metrics["baseline_hops"] = float(baseline_metrics["hops"])
+    metrics["combined_score"] = -candidate_time_h
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/verification/requirements.txt b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/verification/requirements.txt
new file mode 100644
index 00000000..8b137891
--- /dev/null
+++ b/benchmarks/OperationsResearch/DynamicCurrentMinimumTimeRouting/verification/requirements.txt
@@ -0,0 +1 @@
+
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/README.md b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/README.md
new file mode 100644
index 00000000..7a6490c2
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/README.md
@@ -0,0 +1,55 @@
+# EOQ with All-Units Discounts
+
+Choose an order quantity for frozen EOQ cases with all-units discounts and minimize average annual cost.
+
+## Why This Benchmark Matters
+
+All-units discounts appear in packaging, chemicals, and contract manufacturing. Crossing a breakpoint changes the unit price of every unit in the order, so picking the wrong price region can significantly raise annual spend.
+
+This is a frozen piecewise optimization problem with regime switches. The output is still a single scalar `Q`, but the objective changes discontinuously when the chosen price region changes.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `solve(instance)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/verification/evaluator.py \
+  benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/scripts/init.py \
+  --metrics-out /tmp/EOQWithAllUnitsDiscounts_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/EOQWithAllUnitsDiscounts \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/README_zh-CN.md b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/README_zh-CN.md
new file mode 100644
index 00000000..7e0161e3
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/README_zh-CN.md
@@ -0,0 +1,55 @@
+# 全量折扣 EOQ
+
+在冻结的 EOQ 实例上选择订货量，并在存在全量折扣时最小化平均年成本。
+
+## 这个 Benchmark 在测什么
+
+全量折扣在包装、化工和合同制造里很常见。一旦跨过价格断点，订单里的每一个单位都会按更低单价计费，所以选错价格区间会显著拉高年度支出。
+
+从算法角度看，这是一个带分段切换的冻结优化问题。输出仍然只是一个标量 `Q`，但目标函数会在价格区间切换时出现不连续变化。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`solve(instance)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/verification/evaluator.py \
+  benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/scripts/init.py \
+  --metrics-out /tmp/EOQWithAllUnitsDiscounts_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/EOQWithAllUnitsDiscounts \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/Task.md b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/Task.md
new file mode 100644
index 00000000..15387cbc
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/Task.md
@@ -0,0 +1,49 @@
+# EOQ with All-Units Discounts Task
+
+## Problem
+
+Choose an order quantity for frozen EOQ cases with all-units discounts and minimize average annual cost.
+
+All-units discounts appear in packaging, chemicals, and contract manufacturing. Crossing a breakpoint changes the unit price of every unit in the order, so picking the wrong price region can significantly raise annual spend.
+
+This is a frozen piecewise optimization problem with regime switches. The output is still a single scalar `Q`, but the objective changes discontinuously when the chosen price region changes.
+
+## What Is Frozen
+
+- The deterministic EOQ case table and cost model in `runtime/problem.py`.
+- The price-break schedule, demand, holding-cost, and order-cost parameters for every case.
+- The evaluator loop that averages cost across all frozen cases.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def solve(instance):
+    ...
+```
+
+Return either a raw numeric order quantity or a dict with key `order_quantity`.
+
+## Evaluation
+
+1. Load the frozen case set from `runtime/problem.py`.
+2. Run the reference baseline on every case for diagnostics.
+3. Run your `solve(instance)` on every case and parse the returned order quantity.
+4. Convert that quantity into feasibility and annual cost, then average cost across all cases.
+
+## Metrics
+
+- `combined_score`: `-avg_cost`
+- `valid`: `1.0` only if every case is feasible and every output is finite
+- `avg_cost`
+- `avg_cost_ratio`: average `baseline_cost / candidate_cost` for diagnostics
+
+## Invalid Submissions
+
+- `solve(...)` is missing or crashes
+- The returned value cannot be parsed into an order quantity
+- Any order quantity is infeasible or non-finite
+- Any case evaluation produces a non-finite metric
+
+<!-- AI_GENERATED -->
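Reviewer note on the EOQ task contract: the "piecewise objective with regime switches" can be made concrete with a small candidate-enumeration sketch. It assumes the textbook all-units cost, unit cost times demand plus fixed cost times demand over `Q` plus holding-rate times unit cost times `Q / 2`. The benchmark's own `_cost` in `runtime/problem.py` is authoritative and may differ. The helper name `annual_cost` is hypothetical, and the field names are taken from the frozen case table.

```python
import math


def annual_cost(instance, q):
    # ASSUMPTION: textbook all-units formula, not copied from the benchmark runtime.
    region = 0
    for idx, bp in enumerate(instance["breakpoints"]):
        if q >= bp:
            region = idx  # last breakpoint not exceeding q sets the price
    c = instance["unit_costs"][region]
    lam = instance["demand_rate"]
    ordering = instance["fixed_cost"] * lam / q
    holding = instance["holding_cost_rate"] * c * q / 2.0
    return c * lam + ordering + holding


def solve(instance):
    fixed = instance["fixed_cost"]
    rate = instance["holding_cost_rate"]
    lam = instance["demand_rate"]
    breakpoints = instance["breakpoints"]

    # Every positive breakpoint is a candidate: the cost jumps down there.
    candidates = [bp for bp in breakpoints if bp > 0]

    # Each region's unconstrained EOQ is a candidate only if it falls
    # inside that region's own quantity interval.
    for j, c in enumerate(instance["unit_costs"]):
        q = math.sqrt(2.0 * fixed * lam / (rate * c))
        lo = breakpoints[j]
        hi = breakpoints[j + 1] if j + 1 < len(breakpoints) else math.inf
        if lo <= q < hi:
            candidates.append(q)

    return min(candidates, key=lambda q: annual_cost(instance, q))
```

The enumeration stays tiny because, within a single price region, the assumed cost is convex in `Q`, so only the in-region EOQ and the region's lower breakpoint can be optimal.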
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/Task_zh-CN.md b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/Task_zh-CN.md
new file mode 100644
index 00000000..c8c8473e
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/Task_zh-CN.md
@@ -0,0 +1,49 @@
+# EOQ 全量折扣优化
+
+## 任务概览
+
+在冻结的 EOQ 案例上选择订货量，在全量折扣定价下尽量降低平均年成本。
+
+全量折扣在包装、化工和代工采购里很常见。一旦跨过某个价格断点，整笔订单的每一件都会换成新的单价，所以选错区间会直接放大年度采购成本。
+
+从计算角度看，这是一个冻结的小型分段优化问题。输出虽然只有一个订货量 `Q`，但一旦落入不同价格区间，目标函数就会发生不连续变化。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中冻结的 EOQ 案例表和成本模型。
+- 每个案例对应的价格断点、需求、持有成本和订货成本参数。
+- 对全体冻结案例取平均成本的评测循环。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def solve(instance):
+    ...
+```
+
+返回一个数值型订货量，或带 `order_quantity` 字段的字典。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 载入冻结案例集。
+2. 对每个案例运行参考 baseline，用于诊断对照。
+3. 在每个案例上运行你的 `solve(instance)`，并解析返回的订货量。
+4. 把订货量换算成可行性与年成本，最后对所有案例求平均。
+
+## 指标
+
+- `combined_score`：`-avg_cost`
+- `valid`：只有所有案例都可行且输出有限时才为 `1.0`
+- `avg_cost`
+- `avg_cost_ratio`：用于诊断的平均 `baseline_cost / candidate_cost`
+
+## 判为无效的情况
+
+- 缺少 `solve(...)`，或函数在评测中报错
+- 返回值无法解析为订货量
+- 任意案例的订货量不可行，或不是有限值
+- 任意案例的评测指标出现非有限值
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/baseline/solution.py b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/baseline/solution.py
new file mode 100644
index 00000000..d85a43cf
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/baseline/solution.py
@@ -0,0 +1,30 @@
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            parent_s = str(parent)
+            if parent_s not in sys.path:
+                sys.path.insert(0, parent_s)
+            return
+    benchmark_root = here.parents[1]
+    benchmark_root_s = str(benchmark_root)
+    if benchmark_root_s not in sys.path:
+        sys.path.insert(0, benchmark_root_s)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.EOQWithAllUnitsDiscounts.runtime.problem import solve_baseline as solve
+except ModuleNotFoundError:
+    from runtime.problem import solve_baseline as solve
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/agent_files.txt b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/candidate_destination.txt b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/constraints.txt b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/constraints.txt
new file mode 100644
index 00000000..35ca1548
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, or `verification/`.
+Return a finite and feasible solution for every frozen case.
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/eval_command.txt b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/eval_cwd.txt b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/initial_program.txt b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/readonly_files.txt b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..75978e1f
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/frontier_eval/readonly_files.txt
@@ -0,0 +1,4 @@
+baseline/solution.py
+runtime/problem.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/references/source_manifest.md b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/references/source_manifest.md
new file mode 100644
index 00000000..a38984d6
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/references/source_manifest.md
@@ -0,0 +1,9 @@
+# Source Manifest
+
+- Upstream library: `Stockpyl`
+- Upstream lineage:
+  - `stockpyl.eoq.economic_order_quantity_with_all_units_discounts`
+  - all-units discount EOQ formulas as documented in standard inventory theory references used by Stockpyl
+- Data provenance: this benchmark does not use an external dataset. It uses benchmark-local frozen numeric instances defined in `runtime/problem.py`.
+- Transformation path: no preprocessing pipeline; the parameter tables are authored directly in the benchmark runtime.
+- License lineage: Stockpyl is released under the MIT License.
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/runtime/problem.py b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/runtime/problem.py
new file mode 100644
index 00000000..329f6406
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/runtime/problem.py
@@ -0,0 +1,150 @@
+from __future__ import annotations
+
+import math
+from typing import Any
+
+from scipy.stats import norm, poisson
+from stockpyl.eoq import (
+    economic_order_quantity,
+    economic_order_quantity_with_all_units_discounts,
+    economic_order_quantity_with_incremental_discounts,
+)
+from stockpyl.rq import (
+    r_q_cost,
+    r_q_cost_poisson,
+    r_q_eil_approximation,
+    r_q_eoqss_approximation,
+    r_q_loss_function_approximation,
+    r_q_poisson_exact,
+)
+
+CASES = [
+    {
+        "fixed_cost": 8.0,
+        "holding_cost_rate": 0.225,
+        "demand_rate": 1300.0,
+        "breakpoints": [
+            0.0,
+            350.0,
+            700.0
+        ],
+        "unit_costs": [
+            0.5,
+            0.47,
+            0.44
+        ]
+    },
+    {
+        "fixed_cost": 10.0,
+        "holding_cost_rate": 0.18,
+        "demand_rate": 2200.0,
+        "breakpoints": [
+            0.0,
+            300.0,
+            900.0
+        ],
+        "unit_costs": [
+            0.82,
+            0.79,
+            0.73
+        ]
+    },
+    {
+        "fixed_cost": 12.0,
+        "holding_cost_rate": 0.2,
+        "demand_rate": 1700.0,
+        "breakpoints": [
+            0.0,
+            500.0,
+            1000.0
+        ],
+        "unit_costs": [
+            1.1,
+            1.03,
+            0.98
+        ]
+    },
+    {
+        "fixed_cost": 6.0,
+        "holding_cost_rate": 0.16,
+        "demand_rate": 2400.0,
+        "breakpoints": [
+            0.0,
+            250.0,
+            600.0
+        ],
+        "unit_costs": [
+            0.42,
+            0.39,
+            0.36
+        ]
+    }
+]
+SAMPLE_INSTANCE = CASES[0]
+
+
+def _to_float(value: Any) -> float:
+    value = float(value)
+    if not math.isfinite(value):
+        raise ValueError("non-finite numeric value")
+    return value
+
+
+def _extract_order_quantity(solution: Any) -> float:
+    if isinstance(solution, dict):
+        if "order_quantity" not in solution:
+            raise ValueError("missing order_quantity")
+        return _to_float(solution["order_quantity"])
+    return _to_float(solution)
+
+
+def _extract_rq(solution: Any) -> tuple[int, int]:
+    if isinstance(solution, dict):
+        if "reorder_point" not in solution or "order_quantity" not in solution:
+            raise ValueError("missing reorder_point/order_quantity")
+        r = int(round(_to_float(solution["reorder_point"])))
+        q = int(round(_to_float(solution["order_quantity"])))
+        return r, q
+    if isinstance(solution, (tuple, list)) and len(solution) == 2:
+        r = int(round(_to_float(solution[0])))
+        q = int(round(_to_float(solution[1])))
+        return r, q
+    raise ValueError("solution must be a dict or length-2 tuple/list")
+
+def _region(instance: dict[str, Any], q: float) -> int:
+    region = 0
+    for idx, bp in enumerate(instance["breakpoints"]):
+        if q >= bp:
+            region = idx
+    return region
+
+
+def _cost(instance: dict[str, Any], q: float) -> float:
+    region = _region(instance, q)
+    unit_cost = instance["unit_costs"][region]
+    return (
+        unit_cost * instance["demand_rate"]
+        + instance["fixed_cost"] * instance["demand_rate"] / q
+        + instance["holding_cost_rate"] * unit_cost * q / 2.0
+    )
+
+
+def solve_baseline(instance: dict[str, Any]) -> dict[str, float | int]:
+    q, region, cost = economic_order_quantity_with_all_units_discounts(
+        instance["fixed_cost"],
+        instance["holding_cost_rate"],
+        instance["demand_rate"],
+        list(instance["breakpoints"]),
+        list(instance["unit_costs"]),
+    )
+    return {"order_quantity": float(q), "region": int(region), "cost": float(cost)}
+
+
+def evaluate_solution(instance: dict[str, Any], solution: Any) -> dict[str, float | bool]:
+    try:
+        q = _extract_order_quantity(solution)
+    except Exception:
+        return {"valid": False, "cost": float("inf")}
+    if q <= 0:
+        return {"valid": False, "cost": float("inf")}
+    return {"valid": True, "cost": float(_cost(instance, q)), "order_quantity": float(q)}
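Note on the all-units cost model in `_cost` above: it can be exercised without Stockpyl. The following is a minimal sketch, assuming the instance layout used in `CASES` and unit costs that decrease from tier to tier. The names `SAMPLE`, `all_units_cost`, and `best_all_units_quantity` are hypothetical and not part of the benchmark; the search is the textbook one-candidate-per-tier enumeration, not necessarily what Stockpyl does internally.

```python
import math

# Copied from CASES[0] in the all-units runtime/problem.py.
SAMPLE = {
    "fixed_cost": 8.0,
    "holding_cost_rate": 0.225,
    "demand_rate": 1300.0,
    "breakpoints": [0.0, 350.0, 700.0],
    "unit_costs": [0.5, 0.47, 0.44],
}


def all_units_cost(instance, q):
    """Annual cost when the whole order is priced at the tier containing q."""
    region = 0
    for idx, bp in enumerate(instance["breakpoints"]):
        if q >= bp:
            region = idx
    c = instance["unit_costs"][region]
    d = instance["demand_rate"]
    return (
        c * d
        + instance["fixed_cost"] * d / q
        + instance["holding_cost_rate"] * c * q / 2.0
    )


def best_all_units_quantity(instance):
    """Return (q, cost) over one candidate per tier (hypothetical helper)."""
    k = instance["fixed_cost"]
    i = instance["holding_cost_rate"]
    d = instance["demand_rate"]
    bps = instance["breakpoints"]
    best = None
    for j, c in enumerate(instance["unit_costs"]):
        q = math.sqrt(2.0 * k * d / (i * c))  # unconstrained EOQ at price c
        upper = bps[j + 1] if j + 1 < len(bps) else math.inf
        if q >= upper:
            # Assumes decreasing unit costs: the next breakpoint is cheaper.
            continue
        q = max(q, bps[j])  # clamp up to the tier's lower breakpoint
        cost = all_units_cost(instance, q)
        if best is None or cost < best[1]:
            best = (q, cost)
    return best


if __name__ == "__main__":
    print(best_all_units_quantity(SAMPLE))
```

On this sample the cheapest candidate sits on the last breakpoint rather than at any tier's unconstrained EOQ, which is the boundary effect that the `q >= bp` lookup in `_region` makes possible.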
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/scripts/init.py b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/scripts/init.py
new file mode 100644
index 00000000..4bca7fba
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/scripts/init.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.EOQWithAllUnitsDiscounts.baseline.solution import solve as _baseline_solve
+except ModuleNotFoundError:
+    from baseline.solution import solve as _baseline_solve
+
+
+# EVOLVE-BLOCK-START
+def solve(instance):
+    return _baseline_solve(instance)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.OperationsResearch.EOQWithAllUnitsDiscounts.runtime.problem import SAMPLE_INSTANCE
+    except ModuleNotFoundError:
+        from runtime.problem import SAMPLE_INSTANCE
+    print(solve(SAMPLE_INSTANCE))
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/verification/evaluator.py b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/verification/evaluator.py
new file mode 100644
index 00000000..4d349851
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/verification/evaluator.py
@@ -0,0 +1,109 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    repo_root = _repo_root()
+    benchmark_root = _benchmark_root()
+    for p in (repo_root, benchmark_root):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.EOQWithAllUnitsDiscounts.runtime.problem import CASES, evaluate_solution
+    from benchmarks.OperationsResearch.EOQWithAllUnitsDiscounts.baseline.solution import solve as baseline_solve
+except ModuleNotFoundError:
+    from runtime.problem import CASES, evaluate_solution
+    from baseline.solution import solve as baseline_solve
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "avg_cost": 0.0,
+        "avg_cost_ratio": 0.0,
+        "num_cases": 0.0,
+    }
+    artifacts: dict[str, str] = {}
+
+    program = Path(program_path).expanduser().resolve()
+    namespace = runpy.run_path(str(program), run_name="candidate_program")
+    solve = namespace.get("solve")
+    if not callable(solve):
+        artifacts["error_message"] = "candidate file must define solve(instance)"
+        return metrics, artifacts
+
+    total_cost = 0.0
+    total_ratio = 0.0
+    for idx, case in enumerate(CASES):
+        baseline_solution = baseline_solve(case)
+        baseline_eval = evaluate_solution(case, baseline_solution)
+        if not baseline_eval["valid"]:
+            artifacts["error_message"] = f"internal baseline invalid on case {idx}"
+            return metrics, artifacts
+
+        try:
+            candidate_solution = solve(case)
+            candidate_eval = evaluate_solution(case, candidate_solution)
+        except Exception:
+            artifacts["error_message"] = f"candidate exception on case {idx}\n{traceback.format_exc()}"
+            return metrics, artifacts
+
+        if not candidate_eval["valid"]:
+            artifacts["error_message"] = f"candidate infeasible on case {idx}"
+            return metrics, artifacts
+
+        ratio = baseline_eval["cost"] / candidate_eval["cost"]
+        total_cost += candidate_eval["cost"]
+        total_ratio += ratio
+
+    n = float(len(CASES))
+    metrics["valid"] = 1.0
+    metrics["num_cases"] = n
+    metrics["avg_cost"] = total_cost / n
+    metrics["avg_cost_ratio"] = total_ratio / n
+    metrics["combined_score"] = -metrics["avg_cost"]
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+
+    metrics, artifacts = evaluate(args.program)
+    metrics_path = Path(args.metrics_out)
+    metrics_path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/verification/requirements.txt b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/verification/requirements.txt
new file mode 100644
index 00000000..513852e8
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithAllUnitsDiscounts/verification/requirements.txt
@@ -0,0 +1,3 @@
+stockpyl @ git+https://github.com/LarrySnyder/stockpyl.git
+numpy
+scipy
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/README.md b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/README.md
new file mode 100644
index 00000000..b58b132f
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/README.md
@@ -0,0 +1,55 @@
+# EOQ with Incremental Discounts
+
+Choose an order quantity for frozen EOQ cases with incremental discounts and minimize average annual cost.
+
+## Why This Benchmark Matters
+
+Incremental discount contracts are common in industrial purchasing: only the units beyond each breakpoint get the lower price. Correctly reasoning about the cumulative tiered purchase cost matters just as much as choosing a good order size.
+
+From a CS angle, this is a small frozen search problem, but the cost accounting is cumulative across tiers rather than a simple breakpoint lookup.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `solve(instance)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/verification/evaluator.py \
+  benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/scripts/init.py \
+  --metrics-out /tmp/EOQWithIncrementalDiscounts_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/EOQWithIncrementalDiscounts \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/README_zh-CN.md b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/README_zh-CN.md
new file mode 100644
index 00000000..b4e28071
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/README_zh-CN.md
@@ -0,0 +1,55 @@
+# 增量折扣 EOQ
+
+在冻结的 EOQ 实例上选择订货量，并在存在增量折扣时最小化平均年成本。
+
+## 这个 Benchmark 在测什么
+
+增量折扣合同在工业采购里很常见：只有超过各个断点的那部分单位，才会按更低单价计费。要得到好的订货量，不仅要选对 `Q`，还要正确处理分层累积的采购成本。
+
+从 CS 角度看，这仍然是一个小型冻结搜索问题，只不过成本核算不是简单地查价格区间，而是要沿各个折扣层逐段累积。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`solve(instance)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/verification/evaluator.py \
+  benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/scripts/init.py \
+  --metrics-out /tmp/EOQWithIncrementalDiscounts_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/EOQWithIncrementalDiscounts \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/Task.md b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/Task.md
new file mode 100644
index 00000000..98fd98d3
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/Task.md
@@ -0,0 +1,49 @@
+# EOQ with Incremental Discounts Task
+
+## Problem
+
+Choose an order quantity for frozen EOQ cases with incremental discounts and minimize average annual cost.
+
+Incremental discount contracts are common in industrial purchasing: only the units beyond each breakpoint get the lower price. Correctly reasoning about the cumulative tiered purchase cost matters just as much as choosing a good order size.
+
+From a CS angle, this is a small frozen search problem, but the cost accounting is cumulative across tiers rather than a simple breakpoint lookup.
+
+## What Is Frozen
+
+- The deterministic EOQ case table and incremental-discount cost model in `runtime/problem.py`.
+- The tier boundaries and price schedule for every frozen case.
+- The evaluator loop that averages annual cost across the entire case set.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def solve(instance):
+    ...
+```
+
+Return either a raw numeric order quantity or a dict with key `order_quantity`.
+
+## Evaluation
+
+1. Load the frozen case set from `runtime/problem.py`.
+2. Run the reference baseline on every case for diagnostics.
+3. Run your `solve(instance)` on every case and parse the returned order quantity.
+4. Convert that quantity into feasibility and annual cost under the incremental schedule, then average cost across cases.
+
+## Metrics
+
+- `combined_score`: `-avg_cost`
+- `valid`: `1.0` only if every case is feasible and every output is finite
+- `avg_cost`: mean annual cost of the candidate across all frozen cases
+- `avg_cost_ratio`: average `baseline_cost / candidate_cost` for diagnostics
+
+## Invalid Submissions
+
+- `solve(...)` is missing or crashes
+- The returned value cannot be parsed into an order quantity
+- Any order quantity is infeasible or non-finite
+- Any case evaluation produces a non-finite metric
+
+<!-- AI_GENERATED -->
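Note on the Metrics and Invalid Submissions sections above: the aggregation they describe can be sketched in a few lines. This is an illustrative sketch only. The helper name `aggregate_metrics` is hypothetical, and the explicit finiteness check is an assumption about how the "non-finite metric" rule would be enforced; it is not taken from `verification/evaluator.py`.

```python
import math


def aggregate_metrics(baseline_costs, candidate_costs):
    """Combine per-case annual costs into the documented metrics (hypothetical helper)."""
    invalid = {
        "combined_score": -1e18,
        "valid": 0.0,
        "avg_cost": 0.0,
        "avg_cost_ratio": 0.0,
        "num_cases": 0.0,
    }
    if not candidate_costs or len(baseline_costs) != len(candidate_costs):
        return invalid
    # Assumption: one non-finite or non-positive cost invalidates the whole run.
    if any(not math.isfinite(c) or c <= 0 for c in candidate_costs):
        return invalid
    n = float(len(candidate_costs))
    avg_cost = sum(candidate_costs) / n
    avg_ratio = sum(b / c for b, c in zip(baseline_costs, candidate_costs)) / n
    return {
        "combined_score": -avg_cost,  # higher is better, so cost is negated
        "valid": 1.0,
        "avg_cost": avg_cost,
        "avg_cost_ratio": avg_ratio,  # above 1.0 means cheaper than the baseline
        "num_cases": n,
    }


if __name__ == "__main__":
    print(aggregate_metrics([100.0, 200.0], [80.0, 200.0]))
```

Because `combined_score` is the negated mean of raw costs, cases with large absolute costs dominate the score; `avg_cost_ratio` is the scale-free diagnostic.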
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/Task_zh-CN.md b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/Task_zh-CN.md
new file mode 100644
index 00000000..a6a0edc2
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/Task_zh-CN.md
@@ -0,0 +1,49 @@
+# EOQ 增量折扣优化
+
+## 任务概览
+
+在冻结的 EOQ 案例上选择订货量，在增量折扣定价下尽量降低平均年成本。
+
+增量折扣在工业采购里也很常见：只有超过某个断点之后的那部分边际单位，才会享受更低单价。这里不仅要选好订货量，还要把累计分层采购成本算对。
+
+从计算角度看，这依然是一个冻结的小型搜索问题，但成本不是简单查区间，而是按分层区间做累计计算。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中冻结的 EOQ 案例表和增量折扣成本模型。
+- 每个冻结案例的区间边界与价格分层表。
+- 对整组案例平均年成本的固定评测循环。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def solve(instance):
+    ...
+```
+
+返回一个数值型订货量，或带 `order_quantity` 字段的字典。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 载入冻结案例集。
+2. 对每个案例运行参考 baseline，用于诊断对照。
+3. 在每个案例上运行你的 `solve(instance)`，并解析返回的订货量。
+4. 按照增量折扣规则把订货量换算成可行性与年成本，再对全体案例求平均。
+
+## 指标
+
+- `combined_score`：`-avg_cost`
+- `valid`：只有所有案例都可行且输出有限时才为 `1.0`
+- `avg_cost`
+- `avg_cost_ratio`：用于诊断的平均 `baseline_cost / candidate_cost`
+
+## 判为无效的情况
+
+- 缺少 `solve(...)`，或函数在评测中报错
+- 返回值无法解析为订货量
+- 任意案例的订货量不可行，或不是有限值
+- 任意案例的评测指标出现非有限值
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/baseline/solution.py b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/baseline/solution.py
new file mode 100644
index 00000000..6f07504a
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/baseline/solution.py
@@ -0,0 +1,30 @@
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            parent_s = str(parent)
+            if parent_s not in sys.path:
+                sys.path.insert(0, parent_s)
+            return
+    benchmark_root = here.parents[1]
+    benchmark_root_s = str(benchmark_root)
+    if benchmark_root_s not in sys.path:
+        sys.path.insert(0, benchmark_root_s)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.EOQWithIncrementalDiscounts.runtime.problem import solve_baseline as solve
+except ModuleNotFoundError:
+    from runtime.problem import solve_baseline as solve
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/agent_files.txt b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/candidate_destination.txt b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/constraints.txt b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/constraints.txt
new file mode 100644
index 00000000..35ca1548
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, or `verification/`.
+Return a finite and feasible solution for every frozen case.
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/eval_command.txt b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/eval_cwd.txt b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/initial_program.txt b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/readonly_files.txt b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..75978e1f
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/frontier_eval/readonly_files.txt
@@ -0,0 +1,4 @@
+baseline/solution.py
+runtime/problem.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/references/source_manifest.md b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/references/source_manifest.md
new file mode 100644
index 00000000..aa4d2b54
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/references/source_manifest.md
@@ -0,0 +1,9 @@
+# Source Manifest
+
+- Upstream library: `Stockpyl`
+- Upstream lineage:
+  - `stockpyl.eoq.economic_order_quantity_with_incremental_discounts`
+  - incremental discount EOQ formulas as documented in standard inventory theory references used by Stockpyl
+- Data provenance: this benchmark does not use an external dataset. It uses benchmark-local frozen numeric instances defined in `runtime/problem.py`.
+- Transformation path: no preprocessing pipeline; the parameter tables are authored directly in the benchmark runtime.
+- License lineage: Stockpyl is released under the MIT License.
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/runtime/problem.py b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/runtime/problem.py
new file mode 100644
index 00000000..f40c6c81
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/runtime/problem.py
@@ -0,0 +1,160 @@
+from __future__ import annotations
+
+import math
+from typing import Any
+
+from scipy.stats import norm, poisson
+from stockpyl.eoq import (
+    economic_order_quantity,
+    economic_order_quantity_with_all_units_discounts,
+    economic_order_quantity_with_incremental_discounts,
+)
+from stockpyl.rq import (
+    r_q_cost,
+    r_q_cost_poisson,
+    r_q_eil_approximation,
+    r_q_eoqss_approximation,
+    r_q_loss_function_approximation,
+    r_q_poisson_exact,
+)
+
+CASES = [
+    {
+        "fixed_cost": 150.0,
+        "holding_cost_rate": 0.25,
+        "demand_rate": 2400.0,
+        "breakpoints": [
+            0.0,
+            300.0,
+            600.0
+        ],
+        "unit_costs": [
+            100.0,
+            90.0,
+            80.0
+        ]
+    },
+    {
+        "fixed_cost": 60.0,
+        "holding_cost_rate": 0.18,
+        "demand_rate": 3000.0,
+        "breakpoints": [
+            0.0,
+            200.0,
+            400.0
+        ],
+        "unit_costs": [
+            15.0,
+            14.0,
+            12.5
+        ]
+    },
+    {
+        "fixed_cost": 90.0,
+        "holding_cost_rate": 0.22,
+        "demand_rate": 1600.0,
+        "breakpoints": [
+            0.0,
+            250.0,
+            550.0
+        ],
+        "unit_costs": [
+            24.0,
+            22.5,
+            21.0
+        ]
+    },
+    {
+        "fixed_cost": 45.0,
+        "holding_cost_rate": 0.15,
+        "demand_rate": 4200.0,
+        "breakpoints": [
+            0.0,
+            500.0,
+            1200.0
+        ],
+        "unit_costs": [
+            9.0,
+            8.7,
+            8.2
+        ]
+    }
+]
+SAMPLE_INSTANCE = CASES[0]
+
+
+def _to_float(value: Any) -> float:
+    value = float(value)
+    if not math.isfinite(value):
+        raise ValueError("non-finite numeric value")
+    return value
+
+
+def _extract_order_quantity(solution: Any) -> float:
+    if isinstance(solution, dict):
+        if "order_quantity" not in solution:
+            raise ValueError("missing order_quantity")
+        return _to_float(solution["order_quantity"])
+    return _to_float(solution)
+
+
+def _extract_rq(solution: Any) -> tuple[int, int]:
+    if isinstance(solution, dict):
+        if "reorder_point" not in solution or "order_quantity" not in solution:
+            raise ValueError("missing reorder_point/order_quantity")
+        r = int(round(_to_float(solution["reorder_point"])))
+        q = int(round(_to_float(solution["order_quantity"])))
+        return r, q
+    if isinstance(solution, (tuple, list)) and len(solution) == 2:
+        r = int(round(_to_float(solution[0])))
+        q = int(round(_to_float(solution[1])))
+        return r, q
+    raise ValueError("solution must be a dict or length-2 tuple/list")
+
+def _c_bar(instance: dict[str, Any], region: int) -> float:
+    if region == 0:
+        return 0.0
+    breakpoints = instance["breakpoints"]
+    unit_costs = instance["unit_costs"]
+    return sum(unit_costs[i] * (breakpoints[i + 1] - breakpoints[i]) for i in range(region)) - unit_costs[region] * breakpoints[region]
+
+
+def _region(instance: dict[str, Any], q: float) -> int:
+    region = 0
+    for idx, bp in enumerate(instance["breakpoints"]):
+        if q >= bp:
+            region = idx
+    return region
+
+
+def _cost(instance: dict[str, Any], q: float) -> float:
+    region = _region(instance, q)
+    unit_cost = instance["unit_costs"][region]
+    c_bar = _c_bar(instance, region)
+    return (
+        unit_cost * instance["demand_rate"]
+        + instance["holding_cost_rate"] * c_bar / 2.0
+        + (instance["fixed_cost"] + c_bar) * instance["demand_rate"] / q
+        + instance["holding_cost_rate"] * unit_cost * q / 2.0
+    )
+
+
+def solve_baseline(instance: dict[str, Any]) -> dict[str, float | int]:
+    q, region, cost = economic_order_quantity_with_incremental_discounts(
+        instance["fixed_cost"],
+        instance["holding_cost_rate"],
+        instance["demand_rate"],
+        list(instance["breakpoints"]),
+        list(instance["unit_costs"]),
+    )
+    return {"order_quantity": float(q), "region": int(region), "cost": float(cost)}
+
+
+def evaluate_solution(instance: dict[str, Any], solution: Any) -> dict[str, float | bool]:
+    try:
+        q = _extract_order_quantity(solution)
+    except Exception:
+        return {"valid": False, "cost": float("inf")}
+    if q <= 0:
+        return {"valid": False, "cost": float("inf")}
+    return {"valid": True, "cost": float(_cost(instance, q)), "order_quantity": float(q)}
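Note on the incremental cost model in `_c_bar` and `_cost` above: because only the units beyond each breakpoint get the lower price, the cost is continuous in the order quantity, and a candidate solver only needs to test each tier's own stationary point. The following is a minimal sketch, assuming the instance layout used in `CASES` and decreasing unit costs. The names `SAMPLE`, `c_bar`, `incremental_cost`, and this `solve` are hypothetical illustrations of the submission contract, not the Stockpyl implementation.

```python
import math

# Copied from CASES[0] in the incremental runtime/problem.py.
SAMPLE = {
    "fixed_cost": 150.0,
    "holding_cost_rate": 0.25,
    "demand_rate": 2400.0,
    "breakpoints": [0.0, 300.0, 600.0],
    "unit_costs": [100.0, 90.0, 80.0],
}


def c_bar(instance, region):
    """Extra purchase cost carried over from the more expensive lower tiers."""
    bps = instance["breakpoints"]
    cs = instance["unit_costs"]
    lower_tiers = sum(cs[i] * (bps[i + 1] - bps[i]) for i in range(region))
    return lower_tiers - cs[region] * bps[region]


def incremental_cost(instance, q):
    """Annual cost of ordering q units under the incremental schedule."""
    region = 0
    for idx, bp in enumerate(instance["breakpoints"]):
        if q >= bp:
            region = idx
    c = instance["unit_costs"][region]
    cb = c_bar(instance, region)
    d = instance["demand_rate"]
    h = instance["holding_cost_rate"]
    return c * d + h * cb / 2.0 + (instance["fixed_cost"] + cb) * d / q + h * c * q / 2.0


def solve(instance):
    """Pick the cheapest tier whose stationary point lies inside that tier."""
    k = instance["fixed_cost"]
    h = instance["holding_cost_rate"]
    d = instance["demand_rate"]
    bps = instance["breakpoints"]
    best = None
    for j, c in enumerate(instance["unit_costs"]):
        q = math.sqrt(2.0 * (k + c_bar(instance, j)) * d / (h * c))
        upper = bps[j + 1] if j + 1 < len(bps) else math.inf
        if not (bps[j] <= q < upper):
            continue  # stationary point is not realizable in this tier
        cost = incremental_cost(instance, q)
        if best is None or cost < best[1]:
            best = (q, cost)
    if best is None:
        raise ValueError("no realizable tier found")
    return {"order_quantity": best[0]}


if __name__ == "__main__":
    print(solve(SAMPLE))
```

For the sample case the first and last tiers are realizable and the middle one is not; the last tier wins because its lower unit price outweighs the larger carry-over term.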
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/scripts/init.py b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/scripts/init.py
new file mode 100644
index 00000000..499b71f5
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/scripts/init.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.EOQWithIncrementalDiscounts.baseline.solution import solve as _baseline_solve
+except ModuleNotFoundError:
+    from baseline.solution import solve as _baseline_solve
+
+
+# EVOLVE-BLOCK-START
+def solve(instance):
+    return _baseline_solve(instance)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.OperationsResearch.EOQWithIncrementalDiscounts.runtime.problem import SAMPLE_INSTANCE
+    except ModuleNotFoundError:
+        from runtime.problem import SAMPLE_INSTANCE
+    print(solve(SAMPLE_INSTANCE))
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/verification/evaluator.py b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/verification/evaluator.py
new file mode 100644
index 00000000..92f426ce
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/verification/evaluator.py
@@ -0,0 +1,109 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    repo_root = _repo_root()
+    benchmark_root = _benchmark_root()
+    for p in (repo_root, benchmark_root):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.EOQWithIncrementalDiscounts.runtime.problem import CASES, evaluate_solution
+    from benchmarks.OperationsResearch.EOQWithIncrementalDiscounts.baseline.solution import solve as baseline_solve
+except ModuleNotFoundError:
+    from runtime.problem import CASES, evaluate_solution
+    from baseline.solution import solve as baseline_solve
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "avg_cost": 0.0,
+        "avg_cost_ratio": 0.0,
+        "num_cases": 0.0,
+    }
+    artifacts: dict[str, str] = {}
+
+    program = Path(program_path).expanduser().resolve()
+    namespace = runpy.run_path(str(program), run_name="candidate_program")
+    solve = namespace.get("solve")
+    if not callable(solve):
+        artifacts["error_message"] = "candidate file must define solve(instance)"
+        return metrics, artifacts
+
+    total_cost = 0.0
+    total_ratio = 0.0
+    for idx, case in enumerate(CASES):
+        baseline_solution = baseline_solve(case)
+        baseline_eval = evaluate_solution(case, baseline_solution)
+        if not baseline_eval["valid"]:
+            artifacts["error_message"] = f"internal baseline invalid on case {idx}"
+            return metrics, artifacts
+
+        try:
+            candidate_solution = solve(case)
+            candidate_eval = evaluate_solution(case, candidate_solution)
+        except Exception:
+            artifacts["error_message"] = f"candidate exception on case {idx}\n{traceback.format_exc()}"
+            return metrics, artifacts
+
+        if not candidate_eval["valid"]:
+            artifacts["error_message"] = f"candidate infeasible on case {idx}"
+            return metrics, artifacts
+
+        ratio = baseline_eval["cost"] / candidate_eval["cost"]
+        total_cost += candidate_eval["cost"]
+        total_ratio += ratio
+
+    n = float(len(CASES))
+    metrics["valid"] = 1.0
+    metrics["num_cases"] = n
+    metrics["avg_cost"] = total_cost / n
+    metrics["avg_cost_ratio"] = total_ratio / n
+    metrics["combined_score"] = -metrics["avg_cost"]
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+
+    metrics, artifacts = evaluate(args.program)
+    metrics_path = Path(args.metrics_out)
+    metrics_path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/verification/requirements.txt b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/verification/requirements.txt
new file mode 100644
index 00000000..513852e8
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithIncrementalDiscounts/verification/requirements.txt
@@ -0,0 +1,3 @@
+stockpyl @ git+https://github.com/LarrySnyder/stockpyl.git
+numpy
+scipy
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/README.md b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/README.md
new file mode 100644
index 00000000..b013d48f
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/README.md
@@ -0,0 +1,55 @@
+# EOQ with Minimum Order Quantity
+
+Choose an order quantity for frozen deterministic EOQ cases with a hard minimum order quantity and minimize average annual cost.
+
+## Why This Benchmark Matters
+
+Supplier MOQs are a routine constraint in procurement. They change working-capital usage and warehouse occupancy, and they often push the feasible optimum onto a boundary that a naive EOQ formula misses.
+
+This is a small constrained optimization problem over a frozen analytic cost model. The important part is boundary-aware decision logic, not systems integration.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `solve(instance)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen case table, baseline solver, and per-case cost evaluator
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/verification/evaluator.py \
+  benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/scripts/init.py \
+  --metrics-out /tmp/EOQWithMinimumOrderQuantity_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/EOQWithMinimumOrderQuantity \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/README_zh-CN.md b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/README_zh-CN.md
new file mode 100644
index 00000000..2e9d7f78
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/README_zh-CN.md
@@ -0,0 +1,55 @@
+# 带最小起订量的 EOQ
+
+在冻结的确定性 EOQ 实例上选择订货量，并在存在硬性最小起订量时最小化平均年成本。
+
+## 这个 Benchmark 在测什么
+
+供应商 MOQ 是采购里非常常见的约束。它会直接影响营运资金占用和仓储压力，也经常把最优解从经典 EOQ 公式的内部点推到可行域边界上。
+
+从算法角度看，这是一个建立在冻结解析成本模型上的小型约束优化问题。关键不在系统集成，而在于是否能正确处理边界和约束条件。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`solve(instance)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/verification/evaluator.py \
+  benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/scripts/init.py \
+  --metrics-out /tmp/EOQWithMinimumOrderQuantity_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/EOQWithMinimumOrderQuantity \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/Task.md b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/Task.md
new file mode 100644
index 00000000..390a99db
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/Task.md
@@ -0,0 +1,49 @@
+# EOQ with Minimum Order Quantity Task
+
+## Problem
+
+Choose an order quantity for frozen deterministic EOQ cases with a hard minimum order quantity and minimize average annual cost.
+
+Supplier MOQs are a routine constraint in procurement. They change working-capital usage and warehouse occupancy, and they often push the feasible optimum onto a boundary that a naive EOQ formula misses.
+
+This is a small constrained optimization problem over a frozen analytic cost model. The important part is boundary-aware decision logic, not systems integration.
+
+## What Is Frozen
+
+- The deterministic EOQ case table and annual-cost model in `runtime/problem.py`.
+- The demand, setup cost, holding cost, and MOQ parameters for every frozen case.
+- The evaluator loop that averages candidate cost across all cases.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def solve(instance):
+    ...
+```
+
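+`instance` is a dict with keys `fixed_cost`, `holding_cost_rate`, `demand_rate`, and `minimum_order_quantity`. Return either a raw numeric order quantity or a dict with key `order_quantity`; the quantity must be finite, positive, and at least `minimum_order_quantity`.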
+
+## Evaluation
+
+1. Load the frozen case set from `runtime/problem.py`.
+2. Run the reference baseline on every case for diagnostics.
+3. Run your `solve(instance)` on every case and parse the returned order quantity.
+4. Check the MOQ constraint, compute annual cost, and average cost across all cases.
+
+## Metrics
+
+- `combined_score`: `-avg_cost`
+- `valid`: `1.0` only if every case is feasible and every output is finite
+- `avg_cost`
+- `avg_cost_ratio`: average `baseline_cost / candidate_cost` for diagnostics
+
+## Invalid Submissions
+
+- `solve(...)` is missing or crashes
+- The returned value cannot be parsed into an order quantity
+- Any order quantity violates the MOQ or is non-finite
+- Any case evaluation produces a non-finite metric
+
+<!-- AI_GENERATED -->
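The annual-cost model that `Task.md` refers to can be sketched in plain Python. This is a minimal sketch, assuming the frozen model is the classic EOQ cost `fixed_cost * demand_rate / Q + holding_cost_rate * Q / 2`, which is what `stockpyl.eoq.economic_order_quantity` is understood to evaluate. The helper names `annual_cost` and `clamp_to_moq` and the example instance are illustrative and not part of the benchmark.

```python
import math


def annual_cost(fixed_cost, holding_cost_rate, demand_rate, order_quantity):
    """Classic EOQ annual cost: ordering cost plus average holding cost (assumed model)."""
    ordering = fixed_cost * demand_rate / order_quantity
    holding = holding_cost_rate * order_quantity / 2.0
    return ordering + holding


def clamp_to_moq(instance):
    """Unconstrained EOQ, raised to the MOQ when the MOQ binds.

    The cost is convex in the order quantity, so the feasible point nearest
    to the unconstrained minimizer is also the constrained minimizer.
    """
    q_star = math.sqrt(
        2.0 * instance["fixed_cost"] * instance["demand_rate"] / instance["holding_cost_rate"]
    )
    return {"order_quantity": max(q_star, instance["minimum_order_quantity"])}


# Hypothetical instance using the same keys as the frozen cases.
example = {
    "fixed_cost": 8.0,
    "holding_cost_rate": 0.225,
    "demand_rate": 1300.0,
    "minimum_order_quantity": 400.0,
}
solution = clamp_to_moq(example)
cost = annual_cost(
    example["fixed_cost"],
    example["holding_cost_rate"],
    example["demand_rate"],
    solution["order_quantity"],
)
print(solution, round(cost, 2))
```

Under this assumption the clamp has the same shape as `solve_baseline` in `runtime/problem.py`, which takes the larger of the unconstrained EOQ and the MOQ.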
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/Task_zh-CN.md b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/Task_zh-CN.md
new file mode 100644
index 00000000..f4badb81
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/Task_zh-CN.md
@@ -0,0 +1,49 @@
+# EOQ 最小起订量优化
+
+## 任务概览
+
+在冻结的确定性 EOQ 案例上选择订货量，在硬性最小起订量约束下尽量降低平均年成本。
+
+供应商 MOQ 是采购里非常常见的硬约束。它会直接改变占用资金和仓储压力，而且经常把最优解推到边界位置，简单套一个 EOQ 公式往往会错。
+
+从计算角度看，它是在冻结解析成本模型上的一个小型约束优化问题。难点在于边界意识，而不是系统集成。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中冻结的确定性 EOQ 案例表和年成本模型。
+- 每个冻结案例的需求、订货成本、持有成本和 MOQ 参数。
+- 对所有案例平均候选成本的固定评测循环。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def solve(instance):
+    ...
+```
+
+返回一个数值型订货量，或带 `order_quantity` 字段的字典。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 载入冻结案例集。
+2. 对每个案例运行参考 baseline，用于诊断对照。
+3. 在每个案例上运行你的 `solve(instance)`，并解析返回的订货量。
+4. 检查 MOQ 约束，计算年成本，并对全体案例求平均。
+
+## 指标
+
+- `combined_score`：`-avg_cost`
+- `valid`：只有所有案例都可行且输出有限时才为 `1.0`
+- `avg_cost`
+- `avg_cost_ratio`：用于诊断的平均 `baseline_cost / candidate_cost`
+
+## 判为无效的情况
+
+- 缺少 `solve(...)`，或函数在评测中报错
+- 返回值无法解析为订货量
+- 任意案例的订货量违反 MOQ 约束，或不是有限值
+- 任意案例的评测指标出现非有限值
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/baseline/solution.py b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/baseline/solution.py
new file mode 100644
index 00000000..3bde3a83
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/baseline/solution.py
@@ -0,0 +1,30 @@
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            parent_s = str(parent)
+            if parent_s not in sys.path:
+                sys.path.insert(0, parent_s)
+            return
+    benchmark_root = here.parents[1]
+    benchmark_root_s = str(benchmark_root)
+    if benchmark_root_s not in sys.path:
+        sys.path.insert(0, benchmark_root_s)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.EOQWithMinimumOrderQuantity.runtime.problem import solve_baseline as solve
+except ModuleNotFoundError:
+    from runtime.problem import solve_baseline as solve
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/agent_files.txt b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/candidate_destination.txt b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/constraints.txt b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/constraints.txt
new file mode 100644
index 00000000..35ca1548
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, or `verification/`.
+Return a finite and feasible solution for every frozen case.
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/eval_command.txt b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/eval_cwd.txt b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/initial_program.txt b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/readonly_files.txt b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..75978e1f
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/frontier_eval/readonly_files.txt
@@ -0,0 +1,4 @@
+baseline/solution.py
+runtime/problem.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/references/source_manifest.md b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/references/source_manifest.md
new file mode 100644
index 00000000..e0637fe7
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/references/source_manifest.md
@@ -0,0 +1,9 @@
+# Source Manifest
+
+- Upstream library: `Stockpyl`
+- Upstream lineage:
+  - `stockpyl.eoq.economic_order_quantity`
+  - deterministic EOQ formulas as documented in standard inventory theory references used by Stockpyl
+- Data provenance: this benchmark does not use an external dataset. It uses benchmark-local frozen numeric instances defined in `runtime/problem.py`.
+- Transformation path: no preprocessing pipeline; the parameter tables are authored directly in the benchmark runtime.
+- License lineage: Stockpyl is released under the MIT License.
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/runtime/problem.py b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/runtime/problem.py
new file mode 100644
index 00000000..40c1433f
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/runtime/problem.py
@@ -0,0 +1,88 @@
+from __future__ import annotations
+
+import math
+from typing import Any
+
+from stockpyl.eoq import economic_order_quantity
+
+CASES = [
+    {
+        "fixed_cost": 8.0,
+        "holding_cost_rate": 0.225,
+        "demand_rate": 1300.0,
+        "minimum_order_quantity": 80.0
+    },
+    {
+        "fixed_cost": 14.0,
+        "holding_cost_rate": 0.18,
+        "demand_rate": 1800.0,
+        "minimum_order_quantity": 140.0
+    },
+    {
+        "fixed_cost": 11.0,
+        "holding_cost_rate": 0.25,
+        "demand_rate": 950.0,
+        "minimum_order_quantity": 100.0
+    },
+    {
+        "fixed_cost": 6.0,
+        "holding_cost_rate": 0.16,
+        "demand_rate": 2200.0,
+        "minimum_order_quantity": 120.0
+    }
+]
+SAMPLE_INSTANCE = CASES[0]
+
+
+def _to_float(value: Any) -> float:
+    value = float(value)
+    if not math.isfinite(value):
+        raise ValueError("non-finite numeric value")
+    return value
+
+
+def _extract_order_quantity(solution: Any) -> float:
+    if isinstance(solution, dict):
+        if "order_quantity" not in solution:
+            raise ValueError("missing order_quantity")
+        return _to_float(solution["order_quantity"])
+    return _to_float(solution)
+
+
+def _extract_rq(solution: Any) -> tuple[int, int]:
+    if isinstance(solution, dict):
+        if "reorder_point" not in solution or "order_quantity" not in solution:
+            raise ValueError("missing reorder_point/order_quantity")
+        r = int(round(_to_float(solution["reorder_point"])))
+        q = int(round(_to_float(solution["order_quantity"])))
+        return r, q
+    if isinstance(solution, (tuple, list)) and len(solution) == 2:
+        r = int(round(_to_float(solution[0])))
+        q = int(round(_to_float(solution[1])))
+        return r, q
+    raise ValueError("solution must be a dict or length-2 tuple/list")
+
+def solve_baseline(instance: dict[str, float]) -> dict[str, float]:
+    q_star, _ = economic_order_quantity(
+        instance["fixed_cost"],
+        instance["holding_cost_rate"],
+        instance["demand_rate"],
+    )
+    q = max(q_star, instance["minimum_order_quantity"])
+    return {"order_quantity": float(q)}
+
+
+def evaluate_solution(instance: dict[str, float], solution: Any) -> dict[str, float | bool]:
+    try:
+        q = _extract_order_quantity(solution)
+    except Exception:
+        return {"valid": False, "cost": float("inf")}
+    if q < instance["minimum_order_quantity"] or q <= 0:
+        return {"valid": False, "cost": float("inf")}
+    _, cost = economic_order_quantity(
+        instance["fixed_cost"],
+        instance["holding_cost_rate"],
+        instance["demand_rate"],
+        order_quantity=q,
+    )
+    return {"valid": True, "cost": float(cost), "order_quantity": float(q)}
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/scripts/init.py b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/scripts/init.py
new file mode 100644
index 00000000..1799b742
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/scripts/init.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.EOQWithMinimumOrderQuantity.baseline.solution import solve as _baseline_solve
+except ModuleNotFoundError:
+    from baseline.solution import solve as _baseline_solve
+
+
+# EVOLVE-BLOCK-START
+def solve(instance):
+    return _baseline_solve(instance)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.OperationsResearch.EOQWithMinimumOrderQuantity.runtime.problem import SAMPLE_INSTANCE
+    except ModuleNotFoundError:
+        from runtime.problem import SAMPLE_INSTANCE
+    print(solve(SAMPLE_INSTANCE))
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/verification/evaluator.py b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/verification/evaluator.py
new file mode 100644
index 00000000..3d36e06f
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/verification/evaluator.py
@@ -0,0 +1,109 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    repo_root = _repo_root()
+    benchmark_root = _benchmark_root()
+    for p in (repo_root, benchmark_root):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.EOQWithMinimumOrderQuantity.runtime.problem import CASES, evaluate_solution
+    from benchmarks.OperationsResearch.EOQWithMinimumOrderQuantity.baseline.solution import solve as baseline_solve
+except ModuleNotFoundError:
+    from runtime.problem import CASES, evaluate_solution
+    from baseline.solution import solve as baseline_solve
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "avg_cost": 0.0,
+        "avg_cost_ratio": 0.0,
+        "num_cases": 0.0,
+    }
+    artifacts: dict[str, str] = {}
+
+    program = Path(program_path).expanduser().resolve()
+    namespace = runpy.run_path(str(program), run_name="candidate_program")
+    solve = namespace.get("solve")
+    if not callable(solve):
+        artifacts["error_message"] = "candidate file must define solve(instance)"
+        return metrics, artifacts
+
+    total_cost = 0.0
+    total_ratio = 0.0
+    for idx, case in enumerate(CASES):
+        baseline_solution = baseline_solve(case)
+        baseline_eval = evaluate_solution(case, baseline_solution)
+        if not baseline_eval["valid"]:
+            artifacts["error_message"] = f"internal baseline invalid on case {idx}"
+            return metrics, artifacts
+
+        try:
+            candidate_solution = solve(case)
+            candidate_eval = evaluate_solution(case, candidate_solution)
+        except Exception:
+            artifacts["error_message"] = f"candidate exception on case {idx}\n{traceback.format_exc()}"
+            return metrics, artifacts
+
+        if not candidate_eval["valid"] or not math.isfinite(candidate_eval["cost"]):
+            artifacts["error_message"] = f"candidate infeasible or non-finite cost on case {idx}"
+            return metrics, artifacts
+
+        ratio = baseline_eval["cost"] / candidate_eval["cost"]
+        total_cost += candidate_eval["cost"]
+        total_ratio += ratio
+
+    n = float(len(CASES))
+    metrics["valid"] = 1.0
+    metrics["num_cases"] = n
+    metrics["avg_cost"] = total_cost / n
+    metrics["avg_cost_ratio"] = total_ratio / n
+    metrics["combined_score"] = -metrics["avg_cost"]
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+
+    metrics, artifacts = evaluate(args.program)
+    metrics_path = Path(args.metrics_out)
+    metrics_path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/verification/requirements.txt b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/verification/requirements.txt
new file mode 100644
index 00000000..513852e8
--- /dev/null
+++ b/benchmarks/OperationsResearch/EOQWithMinimumOrderQuantity/verification/requirements.txt
@@ -0,0 +1,3 @@
+stockpyl @ git+https://github.com/LarrySnyder/stockpyl.git
+numpy
+scipy
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/README.md b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/README.md
new file mode 100644
index 00000000..bf41eb44
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/README.md
@@ -0,0 +1,55 @@
+# FT10 Dispatching Rule Optimization
+
+Design a greedy dispatching rule for the canonical FT10 Fisher-Thompson 10x10 job shop and minimize makespan.
+
+## Why This Benchmark Matters
+
+This benchmark stands in for online shop-floor dispatching, where lightweight priority rules are still used because they are easy to deploy and can materially change throughput and overtime.
+
+You are not returning a full schedule. You are writing the priority function inside a frozen scheduler, so the task is policy design under a fixed simulator.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `score_operation(operation, state)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/verification/evaluator.py \
+  benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/scripts/init.py \
+  --metrics-out /tmp/FT10DispatchingRuleOptimization_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/FT10DispatchingRuleOptimization \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/README_zh-CN.md b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/README_zh-CN.md
new file mode 100644
index 00000000..0d3f620c
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/README_zh-CN.md
@@ -0,0 +1,55 @@
+# FT10 派工规则优化
+
+为经典 FT10 Fisher-Thompson 10x10 作业车间设计一个贪心派工规则，并最小化 makespan。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 对应的是车间里的在线派工问题。轻量级优先级规则今天仍然被广泛使用，因为它们部署简单，但对吞吐、延误和加班成本的影响却很大。
+
+你并不是直接返回一份完整排程，而是在一个冻结调度器内部编写优先级函数，所以这个任务本质上是“固定模拟器里的策略设计”。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`score_operation(operation, state)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/verification/evaluator.py \
+  benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/scripts/init.py \
+  --metrics-out /tmp/FT10DispatchingRuleOptimization_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/FT10DispatchingRuleOptimization \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/Task.md b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/Task.md
new file mode 100644
index 00000000..30eb3869
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/Task.md
@@ -0,0 +1,50 @@
+# FT10 Dispatching Rule Optimization Task
+
+## Problem
+
+Design a greedy dispatching rule for the canonical FT10 Fisher-Thompson 10x10 job shop and minimize makespan.
+
+This benchmark stands in for online shop-floor dispatching, where lightweight priority rules are still used because they are easy to deploy and can materially change throughput and overtime.
+
+You are not returning a full schedule. You are writing the priority function inside a frozen scheduler, so the task is policy design under a fixed simulator.
+
+## What Is Frozen
+
+- The canonical `ft10` instance and the known optimum `930`.
+- The schedule builder, feasibility logic, and tie-handling protocol in `runtime/problem.py`.
+- The rule that only operations with the earliest feasible start time are compared by your score.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def score_operation(operation, state):
+    ...
+```
+
+Return any finite scalar priority. Among operations tied on earliest feasible start time, larger scores are scheduled first.
+
+## Evaluation
+
+1. Load the canonical `ft10` instance from `runtime/problem.py`.
+2. Start from an empty schedule and repeatedly collect the next unscheduled operation from each job.
+3. Among operations tied on earliest feasible start time, pick the one with the highest `score_operation(...)`.
+4. Build a complete schedule, compute the candidate makespan, and report it alongside the baseline makespan and the relative gap to the known optimum `930`.
+
+## Metrics
+
+- `combined_score`: `-candidate_makespan`
+- `valid`: `1.0` only if a complete feasible schedule is produced
+- `candidate_makespan`
+- `baseline_makespan`
+- `relative_gap_to_optimum`
+
+## Invalid Submissions
+
+- `score_operation(...)` is missing or crashes
+- The returned priority is non-finite
+- The induced schedule is infeasible or incomplete
+- Evaluation fails before a valid makespan is produced
+
+<!-- AI_GENERATED -->
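The submission contract in `Task.md` asks for a finite scalar priority where larger values are scheduled first. A minimal sketch of such a rule follows. It assumes `operation` exposes the `duration` and `remaining_job_work` keys that `baseline/solution.py` reads, and it ignores `state` because its contents are not shown in this patch. The weight `TIE_WEIGHT` and the toy operations are illustrative assumptions, not FT10 data.

```python
import math

# Illustrative weight (an assumption): small enough that duration mostly matters
# only between operations whose remaining job work is nearly equal.
TIE_WEIGHT = 1e-3


def score_operation(operation, state):
    """Most-work-remaining first, with a mild preference for shorter operations."""
    score = float(operation["remaining_job_work"]) - TIE_WEIGHT * float(operation["duration"])
    if not math.isfinite(score):
        raise ValueError("priority must be a finite scalar")
    return score


# Toy candidates assumed to be tied on earliest feasible start time (hypothetical values).
tied = [
    {"job_id": 0, "duration": 29.0, "remaining_job_work": 395.0},
    {"job_id": 1, "duration": 43.0, "remaining_job_work": 510.0},
    {"job_id": 2, "duration": 91.0, "remaining_job_work": 510.0},
]
# Larger score wins, matching the tie-handling rule described in the task contract.
chosen = max(tied, key=lambda op: score_operation(op, state=None))
print(chosen["job_id"])
```

Folding both criteria into one float keeps the return value a scalar, as the contract states, instead of relying on tuple comparison.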
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/Task_zh-CN.md b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/Task_zh-CN.md
new file mode 100644
index 00000000..4ffce90c
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/Task_zh-CN.md
@@ -0,0 +1,50 @@
+# FT10 派工规则优化
+
+## 任务概览
+
+为经典 FT10 Fisher-Thompson 10x10 作业车间设计贪心派工规则，并尽量缩短 makespan。
+
+这个 benchmark 对应的是车间在线派工场景。现实里，轻量级优先级规则依然很常用，因为它们易于部署，而且真的会影响吞吐、延误和加班。
+
+你并不是直接输出完整排程，而是在一个冻结调度器内部写优先级函数，所以这道题本质上是在固定模拟器里的策略设计。
+
+## 哪些部分是冻结的
+
+- 经典 `ft10` 实例，以及已知最优值 `930`。
+- `runtime/problem.py` 中冻结的调度构造器、可行性逻辑和 tie-breaking 协议。
+- 只有最早可开工的操作才会交给你的评分函数比较的这条规则。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def score_operation(operation, state):
+    ...
+```
+
+返回任意有限标量优先级。在最早可开工时间相同的候选操作里，分数更高的会被优先调度。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 载入经典 `ft10` 实例。
+2. 从空排程开始，反复收集每个 job 的下一个未调度操作。
+3. 在最早可开工时间相同的操作中，选择 `score_operation(...)` 最高的那个。
+4. 构造完整排程，计算候选 makespan，并同时报告 baseline 与相对最优差距。
+
+## 指标
+
+- `combined_score`：`-candidate_makespan`
+- `valid`：只有生成完整可行排程时才为 `1.0`
+- `candidate_makespan`
+- `baseline_makespan`
+- `relative_gap_to_optimum`
+
+## 判为无效的情况
+
+- 缺少 `score_operation(...)`，或函数在评测中报错
+- 返回的优先级不是有限值
+- 诱导出的排程不可行，或没有排完整
+- 在得到有效 makespan 之前评测就失败
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/baseline/solution.py b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/baseline/solution.py
new file mode 100644
index 00000000..8742a90b
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/baseline/solution.py
@@ -0,0 +1,9 @@
+from __future__ import annotations
+
+
+def score_operation(operation, state):
+    return (
+        -float(operation["duration"]),
+        -float(operation["remaining_job_work"]),
+        -float(operation["job_id"]),
+    )
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/agent_files.txt b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/candidate_destination.txt b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/constraints.txt b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/constraints.txt
new file mode 100644
index 00000000..10b71922
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/constraints.txt
@@ -0,0 +1,6 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files in `runtime/`, `verification/`, `references/`, or `baseline/`.
+For dispatch tasks, define `score_operation(operation, state)`.
+For neighborhood tasks, define `score_move(move, state)`.
+Return only finite scalar scores.
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/eval_command.txt b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/eval_cwd.txt b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/initial_program.txt b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/readonly_files.txt b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..44d55c3c
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/frontier_eval/readonly_files.txt
@@ -0,0 +1,5 @@
+runtime/problem.py
+runtime/instance.json
+verification/evaluator.py
+baseline/solution.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/references/source_manifest.md b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/references/source_manifest.md
new file mode 100644
index 00000000..c017c20b
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/references/source_manifest.md
@@ -0,0 +1,11 @@
+# Source Manifest
+
+- Canonical instance: `ft10`
+- Upstream package: `job_shop_lib`
+- Upstream file: `job_shop_lib/benchmarking/benchmark_instances.json`
+- Canonical optimum recorded in upstream metadata: `930`
+- Original academic provenance:
+  - `ft10`: Fisher and Thompson, in Muth and Thompson (eds.), *Industrial Scheduling*, Prentice-Hall, 1963.
+  - `la16`: Lawrence benchmark set, 1984.
+
+This benchmark vendors only the specific frozen instance JSON required for evaluation.
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/runtime/instance.json b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/runtime/instance.json
new file mode 100644
index 00000000..bbf75fac
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/runtime/instance.json
@@ -0,0 +1,253 @@
+{
+  "name": "ft10",
+  "duration_matrix": [
+    [
+      29,
+      78,
+      9,
+      36,
+      49,
+      11,
+      62,
+      56,
+      44,
+      21
+    ],
+    [
+      43,
+      90,
+      75,
+      11,
+      69,
+      28,
+      46,
+      46,
+      72,
+      30
+    ],
+    [
+      91,
+      85,
+      39,
+      74,
+      90,
+      10,
+      12,
+      89,
+      45,
+      33
+    ],
+    [
+      81,
+      95,
+      71,
+      99,
+      9,
+      52,
+      85,
+      98,
+      22,
+      43
+    ],
+    [
+      14,
+      6,
+      22,
+      61,
+      26,
+      69,
+      21,
+      49,
+      72,
+      53
+    ],
+    [
+      84,
+      2,
+      52,
+      95,
+      48,
+      72,
+      47,
+      65,
+      6,
+      25
+    ],
+    [
+      46,
+      37,
+      61,
+      13,
+      32,
+      21,
+      32,
+      89,
+      30,
+      55
+    ],
+    [
+      31,
+      86,
+      46,
+      74,
+      32,
+      88,
+      19,
+      48,
+      36,
+      79
+    ],
+    [
+      76,
+      69,
+      76,
+      51,
+      85,
+      11,
+      40,
+      89,
+      26,
+      74
+    ],
+    [
+      85,
+      13,
+      61,
+      7,
+      64,
+      76,
+      47,
+      52,
+      90,
+      45
+    ]
+  ],
+  "machines_matrix": [
+    [
+      0,
+      1,
+      2,
+      3,
+      4,
+      5,
+      6,
+      7,
+      8,
+      9
+    ],
+    [
+      0,
+      2,
+      4,
+      9,
+      3,
+      1,
+      6,
+      5,
+      7,
+      8
+    ],
+    [
+      1,
+      0,
+      3,
+      2,
+      8,
+      5,
+      7,
+      6,
+      9,
+      4
+    ],
+    [
+      1,
+      2,
+      0,
+      4,
+      6,
+      8,
+      7,
+      3,
+      9,
+      5
+    ],
+    [
+      2,
+      0,
+      1,
+      5,
+      3,
+      4,
+      8,
+      7,
+      9,
+      6
+    ],
+    [
+      2,
+      1,
+      5,
+      3,
+      8,
+      9,
+      0,
+      6,
+      4,
+      7
+    ],
+    [
+      1,
+      0,
+      3,
+      2,
+      6,
+      5,
+      9,
+      8,
+      7,
+      4
+    ],
+    [
+      2,
+      0,
+      1,
+      5,
+      4,
+      6,
+      8,
+      9,
+      7,
+      3
+    ],
+    [
+      0,
+      1,
+      3,
+      5,
+      2,
+      9,
+      6,
+      7,
+      4,
+      8
+    ],
+    [
+      1,
+      0,
+      2,
+      6,
+      8,
+      9,
+      5,
+      3,
+      4,
+      7
+    ]
+  ],
+  "metadata": {
+    "optimum": 930,
+    "upper_bound": 930,
+    "lower_bound": 930,
+    "reference": "J.F. Muth, G.L. Thompson. 'Industrial scheduling.', Englewood Cliffs, NJ, Prentice-Hall, 1963."
+  }
+}
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/runtime/problem.py b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/runtime/problem.py
new file mode 100644
index 00000000..5bcfd5af
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/runtime/problem.py
@@ -0,0 +1,270 @@
+from __future__ import annotations
+
+import copy
+import json
+import math
+from pathlib import Path
+from typing import Any
+
+
+INSTANCE_PATH = Path(__file__).resolve().with_name("instance.json")
+KNOWN_OPTIMUM = 930
+
+
+def load_instance() -> dict[str, Any]:
+    return json.loads(INSTANCE_PATH.read_text(encoding="utf-8"))
+
+
+def relative_gap(value: float, optimum: float) -> float:
+    return float((value - optimum) / optimum)
+
+
+def baseline_dispatch_score(operation: dict[str, Any], state: dict[str, Any]):
+    return (
+        -float(operation["duration"]),
+        -float(operation["remaining_job_work"]),
+        -float(operation["job_id"]),
+    )
+
+
+def baseline_move_score(move: dict[str, Any], state: dict[str, Any]):
+    return (
+        float(move["delta_duration"]),
+        -float(move["machine_position"]),
+        -float(move["machine_id"]),
+    )
+
+
+def _build_operation_tables(instance: dict[str, Any]) -> tuple[list[list[int]], list[list[int]], dict[tuple[int, int], tuple[int, int]]]:
+    durations = instance["duration_matrix"]
+    machines = instance["machines_matrix"]
+    op_map: dict[tuple[int, int], tuple[int, int]] = {}
+    for j, row in enumerate(machines):
+        for k, machine in enumerate(row):
+            op_map[(j, k)] = (machine, durations[j][k])
+    return durations, machines, op_map
+
+
+def schedule_with_dispatch(instance: dict[str, Any], score_operation) -> dict[str, Any]:
+    durations, machines, _ = _build_operation_tables(instance)
+    num_jobs = len(durations)
+    num_machines = len(durations[0])
+    job_next = [0] * num_jobs
+    job_ready = [0] * num_jobs
+    machine_ready = [0] * num_machines
+    scheduled_ops: list[dict[str, Any]] = []
+
+    total_ops = num_jobs * num_machines
+    while len(scheduled_ops) < total_ops:
+        candidates: list[dict[str, Any]] = []
+        for job_id in range(num_jobs):
+            op_index = job_next[job_id]
+            if op_index >= num_machines:
+                continue
+            machine_id = machines[job_id][op_index]
+            duration = durations[job_id][op_index]
+            earliest_start = max(job_ready[job_id], machine_ready[machine_id])
+            remaining_job_work = sum(durations[job_id][op_index:])
+            remaining_job_ops = num_machines - op_index
+            candidates.append(
+                {
+                    "job_id": job_id,
+                    "op_index": op_index,
+                    "machine_id": machine_id,
+                    "duration": duration,
+                    "earliest_start": earliest_start,
+                    "remaining_job_work": remaining_job_work,
+                    "remaining_job_ops": remaining_job_ops,
+                }
+            )
+        min_start = min(op["earliest_start"] for op in candidates)  # earliest feasible start among each job's next operation
+        ready = [op for op in candidates if op["earliest_start"] == min_start]  # only these are offered to the scorer
+        state = {
+            "step": len(scheduled_ops),
+            "job_ready_times": tuple(job_ready),
+            "machine_ready_times": tuple(machine_ready),
+            "current_makespan": max(max(job_ready), max(machine_ready)),
+        }
+        scored: list[tuple[Any, dict[str, Any]]] = []
+        for op in ready:
+            score = score_operation(op, state)
+            scored.append((score, op))
+        scored.sort(
+            key=lambda item: (
+                item[0],
+                -item[1]["duration"],
+                -item[1]["remaining_job_work"],
+                -item[1]["job_id"],
+            ),
+            reverse=True,  # highest score first; ties go to shorter duration, less remaining work, lower job_id
+        )
+        chosen = scored[0][1]
+        start = chosen["earliest_start"]
+        end = start + chosen["duration"]
+        scheduled = dict(chosen)
+        scheduled["start"] = start
+        scheduled["end"] = end
+        scheduled_ops.append(scheduled)
+        job_ready[chosen["job_id"]] = end
+        machine_ready[chosen["machine_id"]] = end
+        job_next[chosen["job_id"]] += 1
+
+    return {
+        "valid": True,
+        "schedule": scheduled_ops,
+        "makespan": max(op["end"] for op in scheduled_ops),
+        "machine_sequences": machine_sequences_from_schedule(instance, scheduled_ops),
+    }
+
+
+def machine_sequences_from_schedule(instance: dict[str, Any], schedule: list[dict[str, Any]]) -> list[list[tuple[int, int]]]:
+    num_machines = len(instance["machines_matrix"][0])
+    sequences: list[list[tuple[int, int, int, int]]] = [[] for _ in range(num_machines)]
+    for op in schedule:
+        sequences[op["machine_id"]].append((op["start"], op["job_id"], op["op_index"], op["end"]))
+    out: list[list[tuple[int, int]]] = []
+    for machine_ops in sequences:
+        machine_ops.sort()
+        out.append([(job_id, op_index) for _, job_id, op_index, _ in machine_ops])
+    return out
+
+
+def build_schedule_from_machine_sequences(instance: dict[str, Any], machine_sequences: list[list[tuple[int, int]]]) -> dict[str, Any]:
+    durations, machines, op_map = _build_operation_tables(instance)
+    num_jobs = len(durations)
+    num_machines = len(durations[0])
+    machine_pred: dict[tuple[int, int], tuple[int, int] | None] = {}
+    for seq in machine_sequences:
+        for idx, op in enumerate(seq):
+            machine_pred[op] = seq[idx - 1] if idx > 0 else None
+
+    scheduled: dict[tuple[int, int], dict[str, Any]] = {}
+    total_ops = num_jobs * num_machines
+    while len(scheduled) < total_ops:
+        progress = False
+        for job_id in range(num_jobs):
+            for op_index in range(num_machines):
+                op = (job_id, op_index)
+                if op in scheduled:
+                    continue
+                job_prev = (job_id, op_index - 1) if op_index > 0 else None
+                mach_prev = machine_pred.get(op)
+                if job_prev is not None and job_prev not in scheduled:
+                    continue
+                if mach_prev is not None and mach_prev not in scheduled:
+                    continue
+                machine_id, duration = op_map[op]
+                start = 0
+                if job_prev is not None:
+                    start = max(start, scheduled[job_prev]["end"])
+                if mach_prev is not None:
+                    start = max(start, scheduled[mach_prev]["end"])
+                scheduled[op] = {
+                    "job_id": job_id,
+                    "op_index": op_index,
+                    "machine_id": machine_id,
+                    "duration": duration,
+                    "start": start,
+                    "end": start + duration,
+                }
+                progress = True
+        if not progress:  # nothing became schedulable in a full pass: machine orders conflict with job order
+            return {"valid": False, "schedule": [], "makespan": float("inf"), "machine_sequences": machine_sequences}
+
+    schedule = list(scheduled.values())
+    schedule.sort(key=lambda item: (item["start"], item["machine_id"], item["job_id"], item["op_index"]))
+    return {
+        "valid": True,
+        "schedule": schedule,
+        "makespan": max(op["end"] for op in schedule),
+        "machine_sequences": machine_sequences,
+    }
+
+
+def initial_machine_sequences(instance: dict[str, Any]) -> list[list[tuple[int, int]]]:
+    baseline = schedule_with_dispatch(instance, baseline_dispatch_score)
+    return baseline["machine_sequences"]
+
+
+def generate_adjacent_moves(instance: dict[str, Any], current: dict[str, Any]) -> list[dict[str, Any]]:
+    durations, machines, _ = _build_operation_tables(instance)
+    schedule_by_op = {
+        (op["job_id"], op["op_index"]): op
+        for op in current["schedule"]
+    }
+    moves: list[dict[str, Any]] = []
+    for machine_id, seq in enumerate(current["machine_sequences"]):
+        for pos in range(len(seq) - 1):
+            a = seq[pos]
+            b = seq[pos + 1]
+            a_sched = schedule_by_op[a]
+            b_sched = schedule_by_op[b]
+            moves.append(
+                {
+                    "machine_id": machine_id,
+                    "machine_position": pos,
+                    "op_a": {
+                        "job_id": a[0],
+                        "op_index": a[1],
+                        "duration": durations[a[0]][a[1]],
+                        "start": a_sched["start"],
+                        "end": a_sched["end"],
+                    },
+                    "op_b": {
+                        "job_id": b[0],
+                        "op_index": b[1],
+                        "duration": durations[b[0]][b[1]],
+                        "start": b_sched["start"],
+                        "end": b_sched["end"],
+                    },
+                    "delta_duration": durations[a[0]][a[1]] - durations[b[0]][b[1]],
+                    "current_makespan": current["makespan"],
+                }
+            )
+    return moves
+
+
+def apply_adjacent_swap(machine_sequences: list[list[tuple[int, int]]], machine_id: int, position: int) -> list[list[tuple[int, int]]]:
+    new_sequences = copy.deepcopy(machine_sequences)
+    new_sequences[machine_id][position], new_sequences[machine_id][position + 1] = (
+        new_sequences[machine_id][position + 1],
+        new_sequences[machine_id][position],
+    )
+    return new_sequences
+
+
+def run_local_search(instance: dict[str, Any], score_move, max_iterations: int = 50) -> dict[str, Any]:
+    current = schedule_with_dispatch(instance, baseline_dispatch_score)
+    if not current["valid"]:
+        return current
+
+    for iteration in range(max_iterations):
+        moves = generate_adjacent_moves(instance, current)
+        state = {
+            "iteration": iteration,
+            "current_makespan": current["makespan"],
+        }
+        scored = []
+        for move in moves:
+            score = score_move(move, state)
+            scored.append((score, move))
+        scored.sort(
+            key=lambda item: (
+                item[0],
+                item[1]["delta_duration"],
+                -item[1]["machine_position"],
+            ),
+            reverse=True,  # highest score first; ties go to larger delta_duration, then lower machine_position
+        )
+        improved = False
+        for _, move in scored:
+            new_sequences = apply_adjacent_swap(current["machine_sequences"], move["machine_id"], move["machine_position"])
+            candidate = build_schedule_from_machine_sequences(instance, new_sequences)
+            if candidate["valid"] and candidate["makespan"] < current["makespan"]:  # first-improvement acceptance
+                current = candidate
+                improved = True
+                break
+        if not improved:
+            break
+
+    return current
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/scripts/init.py b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/scripts/init.py
new file mode 100644
index 00000000..f02b6252
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/scripts/init.py
@@ -0,0 +1,48 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.FT10DispatchingRuleOptimization.baseline.solution import score_operation as _baseline_score_operation
+except ModuleNotFoundError:
+    from baseline.solution import score_operation as _baseline_score_operation
+
+
+# EVOLVE-BLOCK-START
+def score_operation(operation, state):
+    return _baseline_score_operation(operation, state)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.OperationsResearch.FT10DispatchingRuleOptimization.runtime.problem import load_instance, schedule_with_dispatch
+    except ModuleNotFoundError:
+        from runtime.problem import load_instance, schedule_with_dispatch
+    instance = load_instance()
+    result = schedule_with_dispatch(instance, score_operation)
+    print(result["makespan"])
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/verification/evaluator.py b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/verification/evaluator.py
new file mode 100644
index 00000000..40d15135
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/verification/evaluator.py
@@ -0,0 +1,125 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.FT10DispatchingRuleOptimization.runtime.problem import (
+        KNOWN_OPTIMUM,
+        baseline_dispatch_score,
+        baseline_move_score,
+        load_instance,
+        relative_gap,
+        run_local_search,
+        schedule_with_dispatch,
+    )
+except ModuleNotFoundError:
+    from runtime.problem import (
+        KNOWN_OPTIMUM,
+        baseline_dispatch_score,
+        baseline_move_score,
+        load_instance,
+        relative_gap,
+        run_local_search,
+        schedule_with_dispatch,
+    )
+
+
+TASK_KIND = "dispatch"
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_makespan": 0.0,
+        "baseline_makespan": 0.0,
+        "relative_gap_to_optimum": 0.0,
+    }
+    artifacts: dict[str, str] = {}
+
+    program = Path(program_path).expanduser().resolve()
+    instance = load_instance()
+
+    try:  # candidate import-time errors are recorded in artifacts instead of crashing the evaluator
+        namespace = runpy.run_path(str(program), run_name="candidate_program")
+        if TASK_KIND == "dispatch":
+            score_fn = namespace.get("score_operation")
+            if not callable(score_fn):
+                raise RuntimeError("candidate must define score_operation(operation, state)")
+            baseline = schedule_with_dispatch(instance, baseline_dispatch_score)
+            candidate = schedule_with_dispatch(instance, score_fn)
+        else:
+            score_fn = namespace.get("score_move")
+            if not callable(score_fn):
+                raise RuntimeError("candidate must define score_move(move, state)")
+            max_iterations = int(namespace.get("MAX_ITERATIONS", 50))
+            baseline = run_local_search(instance, baseline_move_score, max_iterations=50)
+            candidate = run_local_search(instance, score_fn, max_iterations=max_iterations)
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+
+    if not baseline["valid"]:
+        artifacts["error_message"] = "internal baseline produced an invalid schedule"
+        return metrics, artifacts
+    if not candidate["valid"]:
+        artifacts["error_message"] = "candidate produced an invalid schedule"
+        return metrics, artifacts
+
+    makespan = float(candidate["makespan"])
+    baseline_makespan = float(baseline["makespan"])
+    if not math.isfinite(makespan) or makespan <= 0:
+        artifacts["error_message"] = "candidate makespan is invalid"
+        return metrics, artifacts
+
+    metrics["valid"] = 1.0
+    metrics["candidate_makespan"] = makespan
+    metrics["baseline_makespan"] = baseline_makespan
+    metrics["relative_gap_to_optimum"] = relative_gap(makespan, KNOWN_OPTIMUM)
+    metrics["combined_score"] = -makespan
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/verification/requirements.txt b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/verification/requirements.txt
new file mode 100644
index 00000000..4adfed0b
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10DispatchingRuleOptimization/verification/requirements.txt
@@ -0,0 +1 @@
+ortools
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/README.md b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/README.md
new file mode 100644
index 00000000..9b001095
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/README.md
@@ -0,0 +1,55 @@
+# FT10 Neighborhood Move Selection
+
+Rank adjacent-swap moves for a frozen local-search shell on the canonical FT10 job shop and minimize makespan.
+
+## Why This Benchmark Matters
+
+This benchmark models schedule refinement under a limited search budget. The scheduler already has a feasible incumbent; what matters is which neighboring move it tries first.
+
+You are controlling move ranking inside a fixed local-search loop rather than searching the schedule space end to end yourself.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `score_move(move, state)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/verification/evaluator.py \
+  benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/scripts/init.py \
+  --metrics-out /tmp/FT10NeighborhoodMoveSelection_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/FT10NeighborhoodMoveSelection \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/README_zh-CN.md b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/README_zh-CN.md
new file mode 100644
index 00000000..10d23061
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/README_zh-CN.md
@@ -0,0 +1,55 @@
+# FT10 邻域移动选择
+
+在经典 FT10 作业车间的冻结局部搜索壳层里，对相邻交换动作排序，并最小化 makespan。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 对应的是有限搜索预算下的排程改进问题。调度器已经有一个可行的当前解，真正重要的是它下一步优先尝试哪个邻域动作。
+
+你控制的是固定局部搜索循环里的动作排序，而不是自己从头到尾搜索整个排程空间。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`score_move(move, state)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/verification/evaluator.py \
+  benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/scripts/init.py \
+  --metrics-out /tmp/FT10NeighborhoodMoveSelection_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/FT10NeighborhoodMoveSelection \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/Task.md b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/Task.md
new file mode 100644
index 00000000..36fcfd90
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/Task.md
@@ -0,0 +1,53 @@
+# FT10 Neighborhood Move Selection Task
+
+## Problem
+
+Rank adjacent-swap moves for a frozen local-search shell on the canonical FT10 job shop and minimize makespan.
+
+This benchmark models schedule refinement under a limited search budget. The scheduler already has a feasible incumbent; what matters is which neighboring move it tries first.
+
+You are controlling move ranking inside a fixed local-search loop rather than searching the schedule space end to end yourself.
+
+## What Is Frozen
+
+- The canonical `ft10` instance and the known optimum `930`.
+- The baseline SPT (shortest processing time) dispatch schedule used as the incumbent.
+- The adjacent-swap move generator and first-improving acceptance rule in `runtime/problem.py`.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+MAX_ITERATIONS = 50
+
+
+def score_move(move, state):
+    ...
+```
+
+Define `score_move(move, state)` and return a finite scalar, or a tuple of finite scalars compared lexicographically as the baseline does; larger scores are tried first. You may also set `MAX_ITERATIONS` to any positive integer to change the search budget.
+
+## Evaluation
+
+1. Load the canonical `ft10` instance (`runtime/instance.json`) through `runtime/problem.py`.
+2. Start from the frozen baseline dispatch schedule.
+3. Repeatedly generate adjacent machine-order swap moves, rank them by `score_move(...)`, and apply the first improving move.
+4. Stop when no improving move exists or `MAX_ITERATIONS` is reached, then report candidate makespan and diagnostics.
+
+## Metrics
+
+- `combined_score`: `-candidate_makespan`
+- `valid`: `1.0` only if a complete feasible schedule is produced
+- `candidate_makespan`
+- `baseline_makespan`
+- `relative_gap_to_optimum`
+
+## Invalid Submissions
+
+- `score_move(...)` is missing or crashes
+- The returned move score is non-finite
+- `MAX_ITERATIONS` is invalid or evaluation fails before a valid schedule is built
+- The induced schedule becomes infeasible
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/Task_zh-CN.md b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/Task_zh-CN.md
new file mode 100644
index 00000000..5dbe8859
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/Task_zh-CN.md
@@ -0,0 +1,53 @@
+# FT10 邻域移动选择
+
+## 任务概览
+
+为经典 FT10 作业车间上的冻结局部搜索壳层排序相邻交换动作，并尽量缩短 makespan。
+
+这个 benchmark 对应的是有限搜索预算下的排程改进问题。系统已经有一个可行初始解，真正关键的是它下一步先尝试哪个邻域动作。
+
+你控制的是固定局部搜索循环里的动作排序，而不是自己从头到尾搜索整个排程空间。
+
+## 哪些部分是冻结的
+
+- 经典 `ft10` 实例，以及已知最优值 `930`。
+- 作为初始 incumbent 的 baseline SPT 派工排程。
+- `runtime/problem.py` 中冻结的相邻交换邻域生成器和 first-improving 接受规则。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+MAX_ITERATIONS = 50
+
+
+def score_move(move, state):
+    ...
+```
+
+定义 `score_move(move, state)` 并返回任意有限标量；分数更高的动作会被优先尝试。你也可以把 `MAX_ITERATIONS` 设成任意正整数，以调整搜索预算。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 载入经典 `ft10` 实例。
+2. 从冻结的 baseline 派工排程开始。
+3. 反复生成相邻机器顺序交换动作，按 `score_move(...)` 排序，并应用第一个能改进的动作。
+4. 当不存在改进动作或达到 `MAX_ITERATIONS` 时停止，并输出候选 makespan 与诊断指标。
+
+## 指标
+
+- `combined_score`：`-candidate_makespan`
+- `valid`：只有生成完整可行排程时才为 `1.0`
+- `candidate_makespan`
+- `baseline_makespan`
+- `relative_gap_to_optimum`
+
+## 判为无效的情况
+
+- 缺少 `score_move(...)`，或函数在评测中报错
+- 返回的动作分数不是有限值
+- `MAX_ITERATIONS` 不合法，或在得到有效排程之前评测就失败
+- 诱导出的排程变得不可行
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/baseline/solution.py b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/baseline/solution.py
new file mode 100644
index 00000000..bf5ef33a
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/baseline/solution.py
@@ -0,0 +1,11 @@
+from __future__ import annotations
+
+MAX_ITERATIONS = 50
+
+
+def score_move(move, state):
+    return (
+        float(move["delta_duration"]),
+        -float(move["machine_position"]),
+        -float(move["machine_id"]),
+    )
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/agent_files.txt b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/candidate_destination.txt b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/constraints.txt b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/constraints.txt
new file mode 100644
index 00000000..10b71922
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/constraints.txt
@@ -0,0 +1,6 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files in `runtime/`, `verification/`, `references/`, or `baseline/`.
+For dispatch tasks, define `score_operation(operation, state)`.
+For neighborhood tasks, define `score_move(move, state)`.
+Return only finite scores (a scalar, or a tuple of scalars as in `baseline/solution.py`).
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/eval_command.txt b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/eval_cwd.txt b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/initial_program.txt b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/readonly_files.txt b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..44d55c3c
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/frontier_eval/readonly_files.txt
@@ -0,0 +1,5 @@
+runtime/problem.py
+runtime/instance.json
+verification/evaluator.py
+baseline/solution.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/references/source_manifest.md b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/references/source_manifest.md
new file mode 100644
index 00000000..c017c20b
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/references/source_manifest.md
@@ -0,0 +1,11 @@
+# Source Manifest
+
+- Canonical instance: `ft10`
+- Upstream package: `job_shop_lib`
+- Upstream file: `job_shop_lib/benchmarking/benchmark_instances.json`
+- Canonical optimum recorded in upstream metadata: `930`
+- Original academic provenance:
+  - `ft10`: Fisher and Thompson, in Muth and Thompson (eds.), *Industrial Scheduling*, Prentice-Hall, 1963.
+  - `la16`: Lawrence benchmark set, 1984.
+
+This benchmark vendors only the specific frozen instance JSON required for evaluation.
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/runtime/instance.json b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/runtime/instance.json
new file mode 100644
index 00000000..bbf75fac
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/runtime/instance.json
@@ -0,0 +1,253 @@
+{
+  "name": "ft10",
+  "duration_matrix": [
+    [
+      29,
+      78,
+      9,
+      36,
+      49,
+      11,
+      62,
+      56,
+      44,
+      21
+    ],
+    [
+      43,
+      90,
+      75,
+      11,
+      69,
+      28,
+      46,
+      46,
+      72,
+      30
+    ],
+    [
+      91,
+      85,
+      39,
+      74,
+      90,
+      10,
+      12,
+      89,
+      45,
+      33
+    ],
+    [
+      81,
+      95,
+      71,
+      99,
+      9,
+      52,
+      85,
+      98,
+      22,
+      43
+    ],
+    [
+      14,
+      6,
+      22,
+      61,
+      26,
+      69,
+      21,
+      49,
+      72,
+      53
+    ],
+    [
+      84,
+      2,
+      52,
+      95,
+      48,
+      72,
+      47,
+      65,
+      6,
+      25
+    ],
+    [
+      46,
+      37,
+      61,
+      13,
+      32,
+      21,
+      32,
+      89,
+      30,
+      55
+    ],
+    [
+      31,
+      86,
+      46,
+      74,
+      32,
+      88,
+      19,
+      48,
+      36,
+      79
+    ],
+    [
+      76,
+      69,
+      76,
+      51,
+      85,
+      11,
+      40,
+      89,
+      26,
+      74
+    ],
+    [
+      85,
+      13,
+      61,
+      7,
+      64,
+      76,
+      47,
+      52,
+      90,
+      45
+    ]
+  ],
+  "machines_matrix": [
+    [
+      0,
+      1,
+      2,
+      3,
+      4,
+      5,
+      6,
+      7,
+      8,
+      9
+    ],
+    [
+      0,
+      2,
+      4,
+      9,
+      3,
+      1,
+      6,
+      5,
+      7,
+      8
+    ],
+    [
+      1,
+      0,
+      3,
+      2,
+      8,
+      5,
+      7,
+      6,
+      9,
+      4
+    ],
+    [
+      1,
+      2,
+      0,
+      4,
+      6,
+      8,
+      7,
+      3,
+      9,
+      5
+    ],
+    [
+      2,
+      0,
+      1,
+      5,
+      3,
+      4,
+      8,
+      7,
+      9,
+      6
+    ],
+    [
+      2,
+      1,
+      5,
+      3,
+      8,
+      9,
+      0,
+      6,
+      4,
+      7
+    ],
+    [
+      1,
+      0,
+      3,
+      2,
+      6,
+      5,
+      9,
+      8,
+      7,
+      4
+    ],
+    [
+      2,
+      0,
+      1,
+      5,
+      4,
+      6,
+      8,
+      9,
+      7,
+      3
+    ],
+    [
+      0,
+      1,
+      3,
+      5,
+      2,
+      9,
+      6,
+      7,
+      4,
+      8
+    ],
+    [
+      1,
+      0,
+      2,
+      6,
+      8,
+      9,
+      5,
+      3,
+      4,
+      7
+    ]
+  ],
+  "metadata": {
+    "optimum": 930,
+    "upper_bound": 930,
+    "lower_bound": 930,
+    "reference": "J.F. Muth, G.L. Thompson. 'Industrial scheduling.', Englewood Cliffs, NJ, Prentice-Hall, 1963."
+  }
+}
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/runtime/problem.py b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/runtime/problem.py
new file mode 100644
index 00000000..5bcfd5af
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/runtime/problem.py
@@ -0,0 +1,270 @@
+from __future__ import annotations
+
+import copy
+import json
+import math
+from pathlib import Path
+from typing import Any
+
+
+INSTANCE_PATH = Path(__file__).resolve().with_name("instance.json")
+KNOWN_OPTIMUM = 930
+
+
+def load_instance() -> dict[str, Any]:
+    return json.loads(INSTANCE_PATH.read_text(encoding="utf-8"))
+
+
+def relative_gap(value: float, optimum: float) -> float:
+    return float((value - optimum) / optimum)
+
+
+def baseline_dispatch_score(operation: dict[str, Any], state: dict[str, Any]):
+    return (
+        -float(operation["duration"]),
+        -float(operation["remaining_job_work"]),
+        -float(operation["job_id"]),
+    )
+
+
+def baseline_move_score(move: dict[str, Any], state: dict[str, Any]):
+    return (
+        float(move["delta_duration"]),
+        -float(move["machine_position"]),
+        -float(move["machine_id"]),
+    )
+
+
+def _build_operation_tables(instance: dict[str, Any]) -> tuple[list[list[int]], list[list[int]], dict[tuple[int, int], tuple[int, int]]]:
+    durations = instance["duration_matrix"]
+    machines = instance["machines_matrix"]
+    op_map: dict[tuple[int, int], tuple[int, int]] = {}
+    for j, row in enumerate(machines):
+        for k, machine in enumerate(row):
+            op_map[(j, k)] = (machine, durations[j][k])
+    return durations, machines, op_map
+
+
+def schedule_with_dispatch(instance: dict[str, Any], score_operation) -> dict[str, Any]:
+    durations, machines, _ = _build_operation_tables(instance)
+    num_jobs = len(durations)
+    num_machines = len(durations[0])
+    job_next = [0] * num_jobs
+    job_ready = [0] * num_jobs
+    machine_ready = [0] * num_machines
+    scheduled_ops: list[dict[str, Any]] = []
+
+    total_ops = num_jobs * num_machines
+    while len(scheduled_ops) < total_ops:
+        candidates: list[dict[str, Any]] = []
+        for job_id in range(num_jobs):
+            op_index = job_next[job_id]
+            if op_index >= num_machines:
+                continue
+            machine_id = machines[job_id][op_index]
+            duration = durations[job_id][op_index]
+            earliest_start = max(job_ready[job_id], machine_ready[machine_id])
+            remaining_job_work = sum(durations[job_id][op_index:])
+            remaining_job_ops = num_machines - op_index
+            candidates.append(
+                {
+                    "job_id": job_id,
+                    "op_index": op_index,
+                    "machine_id": machine_id,
+                    "duration": duration,
+                    "earliest_start": earliest_start,
+                    "remaining_job_work": remaining_job_work,
+                    "remaining_job_ops": remaining_job_ops,
+                }
+            )
+        min_start = min(op["earliest_start"] for op in candidates)
+        ready = [op for op in candidates if op["earliest_start"] == min_start]
+        state = {
+            "step": len(scheduled_ops),
+            "job_ready_times": tuple(job_ready),
+            "machine_ready_times": tuple(machine_ready),
+            "current_makespan": max(max(job_ready), max(machine_ready)),
+        }
+        scored: list[tuple[Any, dict[str, Any]]] = []
+        for op in ready:
+            score = score_operation(op, state)
+            scored.append((score, op))
+        scored.sort(
+            key=lambda item: (
+                item[0],
+                -item[1]["duration"],
+                -item[1]["remaining_job_work"],
+                -item[1]["job_id"],
+            ),
+            reverse=True,
+        )
+        chosen = scored[0][1]
+        start = chosen["earliest_start"]
+        end = start + chosen["duration"]
+        scheduled = dict(chosen)
+        scheduled["start"] = start
+        scheduled["end"] = end
+        scheduled_ops.append(scheduled)
+        job_ready[chosen["job_id"]] = end
+        machine_ready[chosen["machine_id"]] = end
+        job_next[chosen["job_id"]] += 1
+
+    return {
+        "valid": True,
+        "schedule": scheduled_ops,
+        "makespan": max(op["end"] for op in scheduled_ops),
+        "machine_sequences": machine_sequences_from_schedule(instance, scheduled_ops),
+    }
+
+
+def machine_sequences_from_schedule(instance: dict[str, Any], schedule: list[dict[str, Any]]) -> list[list[tuple[int, int]]]:
+    num_machines = len(instance["machines_matrix"][0])
+    sequences: list[list[tuple[int, int, int, int]]] = [[] for _ in range(num_machines)]
+    for op in schedule:
+        sequences[op["machine_id"]].append((op["start"], op["job_id"], op["op_index"], op["end"]))
+    out: list[list[tuple[int, int]]] = []
+    for machine_ops in sequences:
+        machine_ops.sort()
+        out.append([(job_id, op_index) for _, job_id, op_index, _ in machine_ops])
+    return out
+
+
+def build_schedule_from_machine_sequences(instance: dict[str, Any], machine_sequences: list[list[tuple[int, int]]]) -> dict[str, Any]:
+    durations, machines, op_map = _build_operation_tables(instance)
+    num_jobs = len(durations)
+    num_machines = len(durations[0])
+    machine_pred: dict[tuple[int, int], tuple[int, int] | None] = {}
+    for seq in machine_sequences:
+        for idx, op in enumerate(seq):
+            machine_pred[op] = seq[idx - 1] if idx > 0 else None
+
+    scheduled: dict[tuple[int, int], dict[str, Any]] = {}
+    total_ops = num_jobs * num_machines
+    while len(scheduled) < total_ops:
+        progress = False
+        for job_id in range(num_jobs):
+            for op_index in range(num_machines):
+                op = (job_id, op_index)
+                if op in scheduled:
+                    continue
+                job_prev = (job_id, op_index - 1) if op_index > 0 else None
+                mach_prev = machine_pred.get(op)
+                if job_prev is not None and job_prev not in scheduled:
+                    continue
+                if mach_prev is not None and mach_prev not in scheduled:
+                    continue
+                machine_id, duration = op_map[op]
+                start = 0
+                if job_prev is not None:
+                    start = max(start, scheduled[job_prev]["end"])
+                if mach_prev is not None:
+                    start = max(start, scheduled[mach_prev]["end"])
+                scheduled[op] = {
+                    "job_id": job_id,
+                    "op_index": op_index,
+                    "machine_id": machine_id,
+                    "duration": duration,
+                    "start": start,
+                    "end": start + duration,
+                }
+                progress = True
+        if not progress:
+            return {"valid": False, "schedule": [], "makespan": float("inf"), "machine_sequences": machine_sequences}
+
+    schedule = list(scheduled.values())
+    schedule.sort(key=lambda item: (item["start"], item["machine_id"], item["job_id"], item["op_index"]))
+    return {
+        "valid": True,
+        "schedule": schedule,
+        "makespan": max(op["end"] for op in schedule),
+        "machine_sequences": machine_sequences,
+    }
+
+
+def initial_machine_sequences(instance: dict[str, Any]) -> list[list[tuple[int, int]]]:
+    baseline = schedule_with_dispatch(instance, baseline_dispatch_score)
+    return baseline["machine_sequences"]
+
+
+def generate_adjacent_moves(instance: dict[str, Any], current: dict[str, Any]) -> list[dict[str, Any]]:
+    durations, machines, _ = _build_operation_tables(instance)
+    schedule_by_op = {
+        (op["job_id"], op["op_index"]): op
+        for op in current["schedule"]
+    }
+    moves: list[dict[str, Any]] = []
+    for machine_id, seq in enumerate(current["machine_sequences"]):
+        for pos in range(len(seq) - 1):
+            a = seq[pos]
+            b = seq[pos + 1]
+            a_sched = schedule_by_op[a]
+            b_sched = schedule_by_op[b]
+            moves.append(
+                {
+                    "machine_id": machine_id,
+                    "machine_position": pos,
+                    "op_a": {
+                        "job_id": a[0],
+                        "op_index": a[1],
+                        "duration": durations[a[0]][a[1]],
+                        "start": a_sched["start"],
+                        "end": a_sched["end"],
+                    },
+                    "op_b": {
+                        "job_id": b[0],
+                        "op_index": b[1],
+                        "duration": durations[b[0]][b[1]],
+                        "start": b_sched["start"],
+                        "end": b_sched["end"],
+                    },
+                    "delta_duration": durations[a[0]][a[1]] - durations[b[0]][b[1]],
+                    "current_makespan": current["makespan"],
+                }
+            )
+    return moves
+
+
+def apply_adjacent_swap(machine_sequences: list[list[tuple[int, int]]], machine_id: int, position: int) -> list[list[tuple[int, int]]]:
+    new_sequences = copy.deepcopy(machine_sequences)
+    new_sequences[machine_id][position], new_sequences[machine_id][position + 1] = (
+        new_sequences[machine_id][position + 1],
+        new_sequences[machine_id][position],
+    )
+    return new_sequences
+
+
+def run_local_search(instance: dict[str, Any], score_move, max_iterations: int = 50) -> dict[str, Any]:
+    current = schedule_with_dispatch(instance, baseline_dispatch_score)
+    if not current["valid"]:
+        return current
+
+    for iteration in range(max_iterations):
+        moves = generate_adjacent_moves(instance, current)
+        state = {
+            "iteration": iteration,
+            "current_makespan": current["makespan"],
+        }
+        scored = []
+        for move in moves:
+            score = score_move(move, state)
+            scored.append((score, move))
+        scored.sort(
+            key=lambda item: (
+                item[0],
+                item[1]["delta_duration"],
+                -item[1]["machine_position"],
+            ),
+            reverse=True,
+        )
+        improved = False
+        for _, move in scored:
+            new_sequences = apply_adjacent_swap(current["machine_sequences"], move["machine_id"], move["machine_position"])
+            candidate = build_schedule_from_machine_sequences(instance, new_sequences)
+            if candidate["valid"] and candidate["makespan"] < current["makespan"]:
+                current = candidate
+                improved = True
+                break
+        if not improved:
+            break
+
+    return current
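Reading `run_local_search` above: moves are sorted by score in descending order, and the first swap that strictly lowers the makespan is accepted, so a higher score means "try this swap earlier". The following is a minimal sketch of an alternative `score_move`, not the benchmark's method. It assumes only the move fields built in `generate_adjacent_moves` (`op_a`, `op_b`, `delta_duration`, `machine_position`) and ranks back-to-back pairs first, on the heuristic assumption that a pair separated by idle time is less likely to sit on the critical path. The `tight` and `loose` dicts are made-up illustration data.

```python
from __future__ import annotations

MAX_ITERATIONS = 50


def score_move(move, state):
    """Hypothetical scorer: try swaps of back-to-back operations first."""
    op_a, op_b = move["op_a"], move["op_b"]
    # 1.0 when op_b starts exactly when op_a ends (no idle gap on the machine).
    contiguous = 1.0 if op_b["start"] == op_a["end"] else 0.0
    # Same tuple shape as the baseline scorer; every component is a finite float.
    return (
        contiguous,
        float(move["delta_duration"]),
        -float(move["machine_position"]),
    )


# Illustration data only: two synthetic moves with the fields used above.
tight = {
    "machine_id": 1,
    "machine_position": 0,
    "op_a": {"start": 0, "end": 5},
    "op_b": {"start": 5, "end": 8},
    "delta_duration": 2,
}
loose = {
    "machine_id": 1,
    "machine_position": 1,
    "op_a": {"start": 0, "end": 5},
    "op_b": {"start": 9, "end": 12},
    "delta_duration": 2,
}

# Mirrors the descending sort in run_local_search: the contiguous pair comes first.
ranked = sorted([loose, tight], key=lambda m: score_move(m, {}), reverse=True)
```

Whether this ordering actually beats the baseline on `ft10` is untested here; only the evaluator's reported makespan can say.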
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/scripts/init.py b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/scripts/init.py
new file mode 100644
index 00000000..777d19f6
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/scripts/init.py
@@ -0,0 +1,51 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.FT10NeighborhoodMoveSelection.baseline.solution import MAX_ITERATIONS as _baseline_MAX_ITERATIONS, score_move as _baseline_score_move
+except ModuleNotFoundError:
+    from baseline.solution import MAX_ITERATIONS as _baseline_MAX_ITERATIONS, score_move as _baseline_score_move
+
+
+# EVOLVE-BLOCK-START
+MAX_ITERATIONS = _baseline_MAX_ITERATIONS
+
+
+def score_move(move, state):
+    return _baseline_score_move(move, state)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.OperationsResearch.FT10NeighborhoodMoveSelection.runtime.problem import load_instance, run_local_search
+    except ModuleNotFoundError:
+        from runtime.problem import load_instance, run_local_search
+    instance = load_instance()
+    result = run_local_search(instance, score_move, MAX_ITERATIONS)
+    print(result["makespan"])
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/verification/evaluator.py b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/verification/evaluator.py
new file mode 100644
index 00000000..9d32c18a
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/verification/evaluator.py
@@ -0,0 +1,125 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.FT10NeighborhoodMoveSelection.runtime.problem import (
+        KNOWN_OPTIMUM,
+        baseline_dispatch_score,
+        baseline_move_score,
+        load_instance,
+        relative_gap,
+        run_local_search,
+        schedule_with_dispatch,
+    )
+except ModuleNotFoundError:
+    from runtime.problem import (
+        KNOWN_OPTIMUM,
+        baseline_dispatch_score,
+        baseline_move_score,
+        load_instance,
+        relative_gap,
+        run_local_search,
+        schedule_with_dispatch,
+    )
+
+
+TASK_KIND = "move"
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_makespan": 0.0,
+        "baseline_makespan": 0.0,
+        "relative_gap_to_optimum": 0.0,
+    }
+    artifacts: dict[str, str] = {}
+
+    program = Path(program_path).expanduser().resolve()
+    instance = load_instance()
+
+    try:
+        namespace = runpy.run_path(str(program), run_name="candidate_program")
+        if TASK_KIND == "dispatch":
+            score_fn = namespace.get("score_operation")
+            if not callable(score_fn):
+                raise RuntimeError("candidate must define score_operation(operation, state)")
+            baseline = schedule_with_dispatch(instance, baseline_dispatch_score)
+            candidate = schedule_with_dispatch(instance, score_fn)
+        else:
+            score_fn = namespace.get("score_move")
+            if not callable(score_fn):
+                raise RuntimeError("candidate must define score_move(move, state)")
+            max_iterations = int(namespace.get("MAX_ITERATIONS", 50))
+            baseline = run_local_search(instance, baseline_move_score, max_iterations=50)
+            candidate = run_local_search(instance, score_fn, max_iterations=max_iterations)
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+
+    if not baseline["valid"]:
+        artifacts["error_message"] = "internal baseline produced an invalid schedule"
+        return metrics, artifacts
+    if not candidate["valid"]:
+        artifacts["error_message"] = "candidate produced an invalid schedule"
+        return metrics, artifacts
+
+    makespan = float(candidate["makespan"])
+    baseline_makespan = float(baseline["makespan"])
+    if not math.isfinite(makespan) or makespan <= 0:
+        artifacts["error_message"] = "candidate makespan is invalid"
+        return metrics, artifacts
+
+    metrics["valid"] = 1.0
+    metrics["candidate_makespan"] = makespan
+    metrics["baseline_makespan"] = baseline_makespan
+    metrics["relative_gap_to_optimum"] = relative_gap(makespan, KNOWN_OPTIMUM)
+    metrics["combined_score"] = -makespan
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
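The task contract lists non-finite move scores as invalid, but the evaluator above passes `score_fn` straight to `run_local_search` without inspecting what it returns. One way such a check could be layered on is sketched below. `ensure_finite` is a hypothetical helper, not part of the evaluator, and it assumes a score is either a single number or a tuple or list of numbers, as in the baseline.

```python
from __future__ import annotations

import math


def ensure_finite(score_fn):
    """Wrap a scorer so that any non-finite score raises ValueError (hypothetical helper)."""

    def wrapped(move, state):
        score = score_fn(move, state)
        # Treat a scalar as a one-element tuple so both shapes share one check.
        parts = score if isinstance(score, (tuple, list)) else (score,)
        for part in parts:
            if not isinstance(part, (int, float)) or not math.isfinite(part):
                raise ValueError(f"non-finite or non-numeric score component: {part!r}")
        return score

    return wrapped


# A finite tuple passes through unchanged; a NaN scalar is rejected.
ok = ensure_finite(lambda move, state: (1.0, -2))
bad = ensure_finite(lambda move, state: float("nan"))
```

Because the evaluator already catches exceptions raised during the search, a wrapper like this would turn a non-finite score into an invalid result rather than a silently mis-sorted move list.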
diff --git a/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/verification/requirements.txt b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/verification/requirements.txt
new file mode 100644
index 00000000..4adfed0b
--- /dev/null
+++ b/benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/verification/requirements.txt
@@ -0,0 +1 @@
+ortools
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/README.md b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/README.md
new file mode 100644
index 00000000..461c90ea
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/README.md
@@ -0,0 +1,55 @@
+# Fuel-Minimizing Ship Weather Routing
+
+Route a ship across a frozen coastal grid while minimizing fuel consumption under deterministic wind and current fields.
+
+## Why This Benchmark Matters
+
+This benchmark stands in for weather-aware voyage planning. The shortest geometric route is rarely the cheapest once headwind, crosswind, and current penalties are folded into the fuel model.
+
+It is a constrained routing problem on a fixed grid graph whose edge costs are induced by environmental fields.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `solve(instance)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/verification/evaluator.py \
+  benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/scripts/init.py \
+  --metrics-out /tmp/FuelMinimizingShipWeatherRouting_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/FuelMinimizingShipWeatherRouting \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/README_zh-CN.md b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/README_zh-CN.md
new file mode 100644
index 00000000..58715f1c
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/README_zh-CN.md
@@ -0,0 +1,55 @@
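+# Fuel-Minimizing Ship Weather Routing
+
+Route a ship across a frozen coastal grid while minimizing fuel consumption under deterministic wind and current fields.
+
+## What This Benchmark Measures
+
+This benchmark models weather-aware route planning. Once headwind, crosswind, and current effects are folded into the fuel model, the geometrically shortest path is usually no longer the most fuel-efficient route.
+
+Algorithmically, it is a constrained path-planning problem on a fixed grid graph, except that the edge costs are induced by environmental fields.
+
+## The File You Actually Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `solve(instance)`
+
+## Where to Look First
+
+- `Task_zh-CN.md`: Chinese task contract and scoring rules
+- `Task.md`: English task description
+- `runtime/problem.py`: frozen instance, validation logic, and metrics helpers
+- `baseline/solution.py`: baseline implementation
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment Setup
+
+From the repository root: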
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/verification/requirements.txt
+```
+
+## Quick Run
+
+From the repository root:
+
+```bash
+python benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/verification/evaluator.py \
+  benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/scripts/init.py \
+  --metrics-out /tmp/FuelMinimizingShipWeatherRouting_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/FuelMinimizingShipWeatherRouting \
+  algorithm.iterations=0
+```
+
+If you need to specify an interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/Task.md b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/Task.md
new file mode 100644
index 00000000..ee83ee8c
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/Task.md
@@ -0,0 +1,52 @@
+# Fuel-Minimizing Ship Weather Routing Task
+
+## Problem
+
+Route a ship across a frozen coastal grid while minimizing fuel consumption under deterministic wind and current fields.
+
+This benchmark stands in for weather-aware voyage planning. The shortest geometric route is rarely the cheapest once headwind, crosswind, and current penalties are folded into the fuel model.
+
+It is a constrained routing problem on a fixed grid graph whose edge costs are induced by environmental fields.
+
+## What Is Frozen
+
+- The coastal land mask, water cells, deterministic wind field, and deterministic current field in `runtime/problem.py`.
+- The start cell, goal cell, and the rule that paths move only between adjacent navigable cells.
+- The fuel and travel-time model used to score the returned route.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def solve(instance):
+    ...
+```
+
+Return either a list of grid cells or a dict with key `path`. The path must start at `instance["start"]`, end at `instance["goal"]`, move only between adjacent cells, and stay on navigable water cells.
+
+## Evaluation
+
+1. Load the frozen routing instance from `runtime/problem.py`.
+2. Validate the returned path against the start/end cells, adjacency rule, and land mask.
+3. Compute total fuel and travel time along the route.
+4. Report candidate fuel together with baseline and reference metrics for context.
+
+## Metrics
+
+- `combined_score`: `-candidate_fuel`
+- `valid`: `1.0` only if the route is feasible
+- `candidate_fuel`
+- `baseline_fuel`
+- `reference_fuel`
+- `candidate_time_h`
+- `baseline_time_h`
+
+## Invalid Submissions
+
+- `solve(...)` is missing or crashes
+- The returned value cannot be parsed into a path
+- The path has the wrong start or goal, contains a non-adjacent move, or touches land
+- Any reported metric becomes non-finite
+
+<!-- AI_GENERATED -->
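Since the task is described as routing on a fixed grid graph with field-induced edge costs, a shortest-path search is a natural starting point. The sketch below is a generic Dijkstra search, not the benchmark's baseline or reference solver. It assumes 4-neighbor adjacency and non-negative edge costs. The `is_water` and `edge_cost` callables and the toy grid are hypothetical stand-ins, because the real instance layout and fuel model live in `runtime/problem.py` and are not shown here.

```python
from __future__ import annotations

import heapq


def dijkstra_path(start, goal, is_water, edge_cost):
    """Cheapest path from start to goal over 4-neighbor water cells.

    is_water(cell) and edge_cost(a, b) are caller-supplied (hypothetical) hooks;
    edge_cost must be non-negative for Dijkstra to be correct.
    """
    best = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        cost, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if cost > best[cell]:
            continue  # stale heap entry, a cheaper route was already found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not is_water(nxt):
                continue
            new_cost = cost + edge_cost(cell, nxt)
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(heap, (new_cost, nxt))
    if goal not in best:
        return []  # goal unreachable
    # Walk parents back from the goal, then reverse into start-to-goal order.
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


# Toy 3x3 grid: the centre cell is land, and moving east costs more
# (a made-up stand-in for a headwind penalty).
water = {(r, c) for r in range(3) for c in range(3)} - {(1, 1)}


def toy_cost(a, b):
    return 2.0 if b[1] > a[1] else 1.0


path = dijkstra_path((0, 0), (2, 2), water.__contains__, toy_cost)
```

A real `solve(instance)` would have to derive `is_water` and `edge_cost` from the frozen fields and confirm the adjacency rule against the validator before returning the path.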
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/Task_zh-CN.md b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/Task_zh-CN.md
new file mode 100644
index 00000000..295dbe7f
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/Task_zh-CN.md
@@ -0,0 +1,52 @@
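+# Fuel-Minimizing Ship Weather Routing
+
+## Task Overview
+
+Route a ship across a frozen coastal grid while minimizing fuel consumption under deterministic wind and current fields.
+
+This benchmark models weather-aware route planning. Once headwind, crosswind, and current effects are folded into the fuel model, the geometrically shortest path is usually no longer the most fuel-efficient route.
+
+Algorithmically, it is a constrained path-planning problem on a fixed grid graph, except that the edge costs are induced by environmental fields.
+
+## What Is Frozen
+
+- The frozen coastal land mask, water cells, deterministic wind field, and deterministic current field in `runtime/problem.py`.
+- The start cell, the goal cell, and the rule that paths move only between adjacent navigable cells.
+- The fuel and travel-time model used to score the returned route.
+
+## Submission Interface
+
+Submit one Python file that defines: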
+
+```python
+def solve(instance):
+    ...
+```
+
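+Return either a list of grid coordinates or a dict with a `path` field. The path must start at `instance["start"]`, reach `instance["goal"]`, move only between adjacent cells at every step, and stay within navigable water throughout.
+
+## Evaluation Procedure
+
+1. Load the frozen routing instance from `runtime/problem.py`.
+2. Check the returned path's start and goal cells, the adjacent-move rule, and the land mask.
+3. Compute total fuel and total travel time for the whole route.
+4. Report the candidate fuel consumption together with baseline and reference metrics for comparison.
+
+## Metrics
+
+- `combined_score`: `-candidate_fuel`
+- `valid`: `1.0` only if the route is feasible
+- `candidate_fuel`
+- `baseline_fuel`
+- `reference_fuel`
+- `candidate_time_h`
+- `baseline_time_h`
+
+## Invalid Submissions
+
+- `solve(...)` is missing or crashes during evaluation
+- The returned value cannot be parsed into a path
+- The path has the wrong start or goal, contains a non-adjacent move, or touches land
+- Any reported metric is non-finite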
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/baseline/solution.py b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/baseline/solution.py
new file mode 100644
index 00000000..40f8f494
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/baseline/solution.py
@@ -0,0 +1,10 @@
+from __future__ import annotations
+
+try:
+    from benchmarks.OperationsResearch.FuelMinimizingShipWeatherRouting.runtime.problem import baseline_path
+except ModuleNotFoundError:
+    from runtime.problem import baseline_path
+
+
+def solve(instance):
+    return baseline_path()
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/agent_files.txt b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/candidate_destination.txt b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/constraints.txt b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/constraints.txt
new file mode 100644
index 00000000..88b1935c
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.
+Keep outputs valid and finite.
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/eval_command.txt b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/eval_cwd.txt b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/initial_program.txt b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/readonly_files.txt b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..75978e1f
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/frontier_eval/readonly_files.txt
@@ -0,0 +1,4 @@
+baseline/solution.py
+runtime/problem.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/references/source_manifest.md b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/references/source_manifest.md
new file mode 100644
index 00000000..69ee9b1f
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/references/source_manifest.md
@@ -0,0 +1,9 @@
+# Source Manifest
+
+- Upstream lineage:
+  - 52North `WeatherRoutingTool` repository and README
+  - Fuel-aware ship routing under weather-dependent operating conditions
+- License lineage: upstream code lineage is MIT.
+- Data provenance: this benchmark does not redistribute upstream weather rasters. Instead it uses a benchmark-local synthetic coastal grid and deterministic wind/current fields generated directly in `runtime/problem.py`.
+- Authenticity note: the problem formulation follows the lineage of the 52North weather-routing tool listed above, while the environment data is a frozen synthetic stand-in chosen for offline reproducibility.
+- Transformation path: no external preprocessing pipeline exists. The map, land mask, current field, and wind field are generated from fixed formulas and constants inside the benchmark runtime.
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/runtime/problem.py b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/runtime/problem.py
new file mode 100644
index 00000000..96b24e71
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/runtime/problem.py
@@ -0,0 +1,193 @@
+from __future__ import annotations
+
+from collections import deque
+import math
+from typing import Any
+
+
+WIDTH = 20
+HEIGHT = 10
+START = (1, 4)
+GOAL = (18, 4)
+
+
+def is_land(cell: tuple[int, int]) -> bool:
+    x, y = cell
+    return 8 <= x <= 12 and 2 <= y <= 6
+
+
+def is_water(cell: tuple[int, int]) -> bool:
+    x, y = cell
+    return 0 <= x < WIDTH and 0 <= y < HEIGHT and not is_land(cell)
+
+
+def _render_grid() -> tuple[str, ...]:
+    rows = []
+    for y in range(HEIGHT):
+        chars = []
+        for x in range(WIDTH):
+            cell = (x, y)
+            if cell == START:
+                chars.append("S")
+            elif cell == GOAL:
+                chars.append("G")
+            elif is_land(cell):
+                chars.append("#")
+            else:
+                chars.append(".")
+        rows.append("".join(chars))
+    return tuple(rows)
+
+
+GRID = _render_grid()
+
+
+def current_at(cell: tuple[int, int]) -> tuple[float, float]:
+    x, y = cell
+    east = 0.04 * math.sin(0.45 * x)
+    north = 0.02 * math.cos(0.35 * x)
+    if y <= 2:
+        return (-0.32 + east, north)
+    if y >= 6:
+        return (0.26 + east, -north)
+    return (0.04 + east, 0.01 * math.sin(0.25 * x))
+
+
+def wind_at(cell: tuple[int, int]) -> tuple[float, float]:
+    x, y = cell
+    side = 0.04 * math.sin(0.3 * x)
+    if y <= 2:
+        return (-0.60, side)
+    if y >= 6:
+        return (0.22, -side)
+    return (-0.08, 0.02 * math.cos(0.2 * x))
+
+
+def _field_to_rows(field_fn) -> tuple[tuple[tuple[float, float], ...], ...]:
+    rows = []
+    for y in range(HEIGHT):
+        row = []
+        for x in range(WIDTH):
+            row.append(tuple(round(v, 4) for v in field_fn((x, y))))
+        rows.append(tuple(row))
+    return tuple(rows)
+
+
+CURRENT_FIELD = _field_to_rows(current_at)
+WIND_FIELD = _field_to_rows(wind_at)
+
+
+def load_instance() -> dict[str, Any]:
+    return {
+        "grid": GRID,
+        "start": START,
+        "goal": GOAL,
+        "current_field": CURRENT_FIELD,
+        "wind_field": WIND_FIELD,
+        "objective": "fuel",
+    }
+
+
+def _to_cell(value: Any) -> tuple[int, int]:
+    if not isinstance(value, (tuple, list)) or len(value) != 2:
+        raise ValueError("cell must be a length-2 sequence")
+    return int(round(float(value[0]))), int(round(float(value[1])))
+
+
+def extract_path(value: Any) -> list[tuple[int, int]]:
+    if isinstance(value, dict):
+        if "path" not in value:
+            raise ValueError("missing path")
+        value = value["path"]
+    path = [_to_cell(cell) for cell in value]
+    if not path:
+        raise ValueError("path is empty")
+    return path
+
+
+def neighbors(cell: tuple[int, int], directions=((0, -1), (1, 0), (0, 1), (-1, 0))) -> list[tuple[int, int]]:
+    x, y = cell
+    result = []
+    for dx, dy in directions:
+        nxt = (x + dx, y + dy)
+        if is_water(nxt):
+            result.append(nxt)
+    return result
+
+
+def validate_path(value: Any) -> list[tuple[int, int]]:
+    path = extract_path(value)
+    if path[0] != START:
+        raise ValueError("path must start at START")
+    if path[-1] != GOAL:
+        raise ValueError("path must end at GOAL")
+    for cell in path:
+        if not is_water(cell):
+            raise ValueError("path enters land or leaves the map")
+    for prev, curr in zip(path, path[1:]):
+        dx = abs(curr[0] - prev[0])
+        dy = abs(curr[1] - prev[1])
+        if dx + dy != 1:
+            raise ValueError("path contains a non-adjacent move")
+    return path
+
+
+def _leg_metrics(prev: tuple[int, int], curr: tuple[int, int]) -> tuple[float, float]:
+    dx = curr[0] - prev[0]
+    dy = curr[1] - prev[1]
+    current_u, current_v = current_at(prev)
+    wind_u, wind_v = wind_at(prev)
+    current_along = current_u * dx + current_v * dy
+    wind_along = wind_u * dx + wind_v * dy
+    headwind = max(0.0, -wind_along)
+    crosswind = abs(-dy * wind_u + dx * wind_v)
+    speed = max(0.35, 1.0 + 0.65 * current_along - 0.45 * headwind)
+    leg_time_h = 1.0 / speed
+    fuel_rate = 1.05 + 0.55 * headwind + 0.20 * crosswind + 0.25 * max(0.0, -current_along)
+    leg_fuel = leg_time_h * fuel_rate
+    return leg_fuel, leg_time_h
+
+
+def route_metrics(value: Any) -> dict[str, float]:
+    path = validate_path(value)
+    total_fuel = 0.0
+    total_time_h = 0.0
+    for prev, curr in zip(path, path[1:]):
+        leg_fuel, leg_time_h = _leg_metrics(prev, curr)
+        total_fuel += leg_fuel
+        total_time_h += leg_time_h
+    return {
+        "fuel": float(total_fuel),
+        "time_h": float(total_time_h),
+        "hops": float(len(path) - 1),
+    }
+
+
+def _retrace(parent, node):
+    path = []
+    current = node
+    while current is not None:
+        path.append(current)
+        current = parent[current]
+    return path[::-1]
+
+
+def baseline_path() -> list[tuple[int, int]]:
+    queue = deque([START])
+    parent = {START: None}
+    while queue:
+        current = queue.popleft()
+        if current == GOAL:
+            return _retrace(parent, current)
+        for nxt in neighbors(current):
+            if nxt not in parent:
+                parent[nxt] = current
+                queue.append(nxt)
+    raise RuntimeError("baseline path not found")
+
+
+BASELINE_PATH = baseline_path()
+BASELINE_FUEL = route_metrics(BASELINE_PATH)["fuel"]
+BASELINE_TIME_H = route_metrics(BASELINE_PATH)["time_h"]
+REFERENCE_FUEL = 21.839377308460037
+REFERENCE_TIME_H = 20.501439186435814
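Review note on the fuel model above: every leg has strictly positive fuel cost (the fuel rate is at least 1.05 and speed is clamped to at least 0.35), so a shortest-path search over per-leg fuel applies. The following is a minimal sketch, not part of this patch, of a candidate `solve(instance)` using Dijkstra's algorithm. It assumes the leg cost can be recomputed from `instance["current_field"]` and `instance["wind_field"]`, indexed as `[y][x]`. The helper name `leg_fuel` is hypothetical. The instance fields are rounded to four decimals while the evaluator calls `current_at` and `wind_at` directly, so the costs here only approximate the scored fuel.

```python
import heapq


def leg_fuel(instance, prev, curr):
    """Approximate fuel for one move, mirroring the formula in `_leg_metrics`."""
    dx, dy = curr[0] - prev[0], curr[1] - prev[1]
    px, py = prev
    # Assumption: fields are stored row-major, i.e. field[y][x].
    current_u, current_v = instance["current_field"][py][px]
    wind_u, wind_v = instance["wind_field"][py][px]
    current_along = current_u * dx + current_v * dy
    headwind = max(0.0, -(wind_u * dx + wind_v * dy))
    crosswind = abs(-dy * wind_u + dx * wind_v)
    speed = max(0.35, 1.0 + 0.65 * current_along - 0.45 * headwind)
    fuel_rate = 1.05 + 0.55 * headwind + 0.20 * crosswind + 0.25 * max(0.0, -current_along)
    return fuel_rate / speed


def solve(instance):
    """Dijkstra over water cells with per-leg fuel as the edge weight."""
    grid = instance["grid"]
    start, goal = tuple(instance["start"]), tuple(instance["goal"])
    height, width = len(grid), len(grid[0])
    best = {start: 0.0}
    parent = {start: None}
    heap = [(0.0, start)]
    while heap:
        cost, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if cost > best[cell]:
            continue  # stale heap entry
        x, y = cell
        for dx, dy in ((0, -1), (1, 0), (0, 1), (-1, 0)):
            nxt = (x + dx, y + dy)
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue
            if grid[nxt[1]][nxt[0]] == "#":
                continue  # land cell
            new_cost = cost + leg_fuel(instance, cell, nxt)
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(heap, (new_cost, nxt))
    if goal not in parent:
        raise RuntimeError("goal unreachable")
    # Walk the parent links back from the goal to recover the route.
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]
```

By contrast, the `baseline_path()` in this file is a breadth-first search, so it minimizes hop count and ignores wind and current entirely.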
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/scripts/init.py b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/scripts/init.py
new file mode 100644
index 00000000..fe6b6069
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/scripts/init.py
@@ -0,0 +1,45 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.FuelMinimizingShipWeatherRouting.baseline.solution import solve as _baseline_solve
+    from benchmarks.OperationsResearch.FuelMinimizingShipWeatherRouting.runtime.problem import load_instance, route_metrics
+except ModuleNotFoundError:
+    from baseline.solution import solve as _baseline_solve
+    from runtime.problem import load_instance, route_metrics
+
+
+# EVOLVE-BLOCK-START
+def solve(instance):
+    return _baseline_solve(instance)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    result = solve(load_instance())
+    print(route_metrics(result))
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/verification/evaluator.py b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/verification/evaluator.py
new file mode 100644
index 00000000..dc6ffdc3
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/verification/evaluator.py
@@ -0,0 +1,94 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.FuelMinimizingShipWeatherRouting.baseline.solution import solve as baseline_solve
+    from benchmarks.OperationsResearch.FuelMinimizingShipWeatherRouting.runtime.problem import BASELINE_FUEL, BASELINE_TIME_H, REFERENCE_FUEL, load_instance, route_metrics
+except ModuleNotFoundError:
+    from baseline.solution import solve as baseline_solve
+    from runtime.problem import BASELINE_FUEL, BASELINE_TIME_H, REFERENCE_FUEL, load_instance, route_metrics
+
+
+def evaluate(program_path: str):
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_fuel": 0.0,
+        "baseline_fuel": float(BASELINE_FUEL),
+        "reference_fuel": float(REFERENCE_FUEL),
+        "candidate_time_h": 0.0,
+        "baseline_time_h": float(BASELINE_TIME_H),
+    }
+    artifacts = {}
+    instance = load_instance()
+    try:
+        # Load the candidate inside the try block so import-time errors are reported as invalid.
+        namespace = runpy.run_path(str(Path(program_path).expanduser().resolve()), run_name="candidate_program")
+        solve_fn = namespace.get("solve")
+        if not callable(solve_fn):
+            raise TypeError("candidate must define solve(instance)")
+        baseline_metrics = route_metrics(baseline_solve(instance))
+        candidate_metrics = route_metrics(solve_fn(instance))
+    except Exception:
+        # Any failure leaves the default invalid metrics in place.
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+
+    candidate_fuel = float(candidate_metrics["fuel"])
+    candidate_time_h = float(candidate_metrics["time_h"])
+    if not (math.isfinite(candidate_fuel) and math.isfinite(candidate_time_h)) or candidate_fuel <= 0:
+        artifacts["error_message"] = "candidate fuel or time is invalid"
+        return metrics, artifacts
+
+    metrics["valid"] = 1.0
+    metrics["candidate_fuel"] = candidate_fuel
+    metrics["candidate_time_h"] = candidate_time_h
+    metrics["baseline_fuel"] = float(baseline_metrics["fuel"])
+    metrics["baseline_time_h"] = float(baseline_metrics["time_h"])
+    metrics["combined_score"] = -candidate_fuel
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/verification/requirements.txt b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/verification/requirements.txt
new file mode 100644
index 00000000..8b137891
--- /dev/null
+++ b/benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/verification/requirements.txt
@@ -0,0 +1 @@
+
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/README.md b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/README.md
new file mode 100644
index 00000000..b0f7315e
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/README.md
@@ -0,0 +1,55 @@
+# LA16 Dispatching Rule Optimization
+
+Design a greedy dispatching rule for the canonical LA16 Lawrence 10x10 job shop and minimize makespan.
+
+## Why This Benchmark Matters
+
+This benchmark poses the same policy-design problem as FT10, applied to the canonical LA16 instance with its own bottleneck structure. Even small changes to the local scoring rule can produce large differences in makespan.
+
+As in FT10, you write a local priority function that a fixed scheduler calls; you do not construct the schedule yourself.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `score_operation(operation, state)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/verification/evaluator.py \
+  benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/scripts/init.py \
+  --metrics-out /tmp/LA16DispatchingRuleOptimization_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/LA16DispatchingRuleOptimization \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/README_zh-CN.md b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/README_zh-CN.md
new file mode 100644
index 00000000..3138ad3b
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/README_zh-CN.md
@@ -0,0 +1,55 @@
+# LA16 派工规则优化
+
+为经典 LA16 Lawrence 10x10 作业车间设计一个贪心派工规则，并最小化 makespan。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 和 FT10 的策略设计问题类似，但实例换成了经典的 LA16 瓶颈结构。即便只是局部评分函数的细小变化，也可能带来很大的吞吐差异。
+
+你仍然是在一个冻结调度器内部编写局部优先级函数，而不是自己显式构造整份排程。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`score_operation(operation, state)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/verification/evaluator.py \
+  benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/scripts/init.py \
+  --metrics-out /tmp/LA16DispatchingRuleOptimization_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/LA16DispatchingRuleOptimization \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/Task.md b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/Task.md
new file mode 100644
index 00000000..75d810bb
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/Task.md
@@ -0,0 +1,50 @@
+# LA16 Dispatching Rule Optimization Task
+
+## Problem
+
+Design a greedy dispatching rule for the canonical LA16 Lawrence 10x10 job shop and minimize makespan.
+
+This benchmark poses the same policy-design problem as FT10, applied to the canonical LA16 instance with its own bottleneck structure. Even small changes to the local scoring rule can produce large differences in makespan.
+
+As in FT10, you write a local priority function that a fixed scheduler calls; you do not construct the schedule yourself.
+
+## What Is Frozen
+
+- The canonical `la16` instance and the known optimum `945`.
+- The schedule builder, feasibility logic, and tie-handling protocol in `runtime/problem.py`.
+- The rule that only operations with the earliest feasible start time are compared by your score.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def score_operation(operation, state):
+    ...
+```
+
+Return any finite scalar priority. Among operations tied on earliest feasible start time, larger scores are scheduled first.
+
+## Evaluation
+
+1. Load the canonical `la16` instance from `runtime/problem.py`.
+2. Start from an empty schedule and repeatedly collect the next unscheduled operation from each job.
+3. Among operations tied on earliest feasible start time, pick the one with the highest `score_operation(...)`.
+4. Build a complete schedule, compute the candidate makespan, and report it alongside the baseline makespan and the relative gap to the known optimum.
+
+## Metrics
+
+- `combined_score`: `-candidate_makespan`
+- `valid`: `1.0` only if a complete feasible schedule is produced
+- `candidate_makespan`
+- `baseline_makespan`
+- `relative_gap_to_optimum`
+
+## Invalid Submissions
+
+- `score_operation(...)` is missing or crashes
+- The returned priority is non-finite
+- The induced schedule is infeasible or incomplete
+- Evaluation fails before a valid makespan is produced
+
+<!-- AI_GENERATED -->
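Review note: a minimal sketch of a submission that satisfies this contract, not part of this patch. It implements a most-work-remaining rule with a small shortest-duration tie-break, using the `remaining_job_work` and `duration` fields that `schedule_with_dispatch` in `runtime/problem.py` places on each candidate operation. The `0.001` weight is an arbitrary illustrative choice, and nothing here claims that this rule beats the baseline on LA16.

```python
def score_operation(operation, state):
    """Most-work-remaining priority; larger scores are scheduled first."""
    # Prefer operations whose job still has the most processing time left.
    remaining = float(operation["remaining_job_work"])
    # Illustrative tie-break (assumed weight): slightly prefer shorter operations.
    return remaining - 0.001 * float(operation["duration"])
```

The function ignores `state`, which keeps the score a pure function of the operation. A rule that reads `state["machine_ready_times"]` could additionally penalize operations heading for a congested machine.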
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/Task_zh-CN.md b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/Task_zh-CN.md
new file mode 100644
index 00000000..abe8a7b3
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/Task_zh-CN.md
@@ -0,0 +1,50 @@
+# LA16 派工规则优化
+
+## 任务概览
+
+为经典 LA16 Lawrence 10x10 作业车间设计贪心派工规则，并尽量缩短 makespan。
+
+这个 benchmark 和 FT10 一样，都是策略设计问题，只不过实例换成了经典 LA16。局部评分规则的细小变化，依然可能带来显著的吞吐差异。
+
+你依然是在冻结调度器内部写局部优先级函数，而不是自己手工构造整张排程。
+
+## 哪些部分是冻结的
+
+- 经典 `la16` 实例，以及已知最优值 `945`。
+- `runtime/problem.py` 中冻结的调度构造器、可行性逻辑和 tie-breaking 协议。
+- 只有最早可开工的操作才会交给你的评分函数比较的这条规则。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def score_operation(operation, state):
+    ...
+```
+
+返回任意有限标量优先级。在最早可开工时间相同的候选操作里，分数更高的会被优先调度。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 载入经典 `la16` 实例。
+2. 从空排程开始，反复收集每个 job 的下一个未调度操作。
+3. 在最早可开工时间相同的操作中，选择 `score_operation(...)` 最高的那个。
+4. 构造完整排程，计算候选 makespan，并同时报告 baseline 与相对最优差距。
+
+## 指标
+
+- `combined_score`：`-candidate_makespan`
+- `valid`：只有生成完整可行排程时才为 `1.0`
+- `candidate_makespan`
+- `baseline_makespan`
+- `relative_gap_to_optimum`
+
+## 判为无效的情况
+
+- 缺少 `score_operation(...)`，或函数在评测中报错
+- 返回的优先级不是有限值
+- 诱导出的排程不可行，或没有排完整
+- 在得到有效 makespan 之前评测就失败
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/baseline/solution.py b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/baseline/solution.py
new file mode 100644
index 00000000..8742a90b
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/baseline/solution.py
@@ -0,0 +1,9 @@
+from __future__ import annotations
+
+
+def score_operation(operation, state):
+    return (
+        -float(operation["duration"]),
+        -float(operation["remaining_job_work"]),
+        -float(operation["job_id"]),
+    )
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/agent_files.txt b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/candidate_destination.txt b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/constraints.txt b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/constraints.txt
new file mode 100644
index 00000000..10b71922
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/constraints.txt
@@ -0,0 +1,6 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files in `runtime/`, `verification/`, `references/`, or `baseline/`.
+For dispatch tasks, define `score_operation(operation, state)`.
+For neighborhood tasks, define `score_move(move, state)`.
+Return only finite scalar scores.
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/eval_command.txt b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/eval_cwd.txt b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/initial_program.txt b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/readonly_files.txt b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..44d55c3c
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/frontier_eval/readonly_files.txt
@@ -0,0 +1,5 @@
+runtime/problem.py
+runtime/instance.json
+verification/evaluator.py
+baseline/solution.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/references/source_manifest.md b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/references/source_manifest.md
new file mode 100644
index 00000000..69751f24
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/references/source_manifest.md
@@ -0,0 +1,11 @@
+# Source Manifest
+
+- Canonical instance: `la16`
+- Upstream package: `job_shop_lib`
+- Upstream file: `job_shop_lib/benchmarking/benchmark_instances.json`
+- Canonical optimum recorded in upstream metadata: `945`
+- Original academic provenance:
+  - `ft10`: Fisher and Thompson, *Industrial Scheduling*, 1963.
+  - `la16`: Lawrence benchmark set, 1984.
+
+This benchmark vendors only the specific frozen instance JSON required for evaluation.
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/runtime/instance.json b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/runtime/instance.json
new file mode 100644
index 00000000..5619e07e
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/runtime/instance.json
@@ -0,0 +1,253 @@
+{
+  "name": "la16",
+  "duration_matrix": [
+    [
+      21,
+      71,
+      16,
+      52,
+      26,
+      34,
+      53,
+      21,
+      55,
+      95
+    ],
+    [
+      55,
+      31,
+      98,
+      79,
+      12,
+      66,
+      42,
+      77,
+      77,
+      39
+    ],
+    [
+      34,
+      64,
+      62,
+      19,
+      92,
+      79,
+      43,
+      54,
+      83,
+      37
+    ],
+    [
+      87,
+      69,
+      87,
+      38,
+      24,
+      83,
+      41,
+      93,
+      77,
+      60
+    ],
+    [
+      98,
+      44,
+      25,
+      75,
+      43,
+      49,
+      96,
+      77,
+      17,
+      79
+    ],
+    [
+      35,
+      76,
+      28,
+      10,
+      61,
+      9,
+      95,
+      35,
+      7,
+      95
+    ],
+    [
+      16,
+      59,
+      46,
+      91,
+      43,
+      50,
+      52,
+      59,
+      28,
+      27
+    ],
+    [
+      45,
+      87,
+      41,
+      20,
+      54,
+      43,
+      14,
+      9,
+      39,
+      71
+    ],
+    [
+      33,
+      37,
+      66,
+      33,
+      26,
+      8,
+      28,
+      89,
+      42,
+      78
+    ],
+    [
+      69,
+      81,
+      94,
+      96,
+      27,
+      69,
+      45,
+      78,
+      74,
+      84
+    ]
+  ],
+  "machines_matrix": [
+    [
+      1,
+      6,
+      9,
+      8,
+      7,
+      2,
+      0,
+      4,
+      3,
+      5
+    ],
+    [
+      4,
+      2,
+      5,
+      9,
+      0,
+      7,
+      1,
+      8,
+      6,
+      3
+    ],
+    [
+      3,
+      2,
+      8,
+      1,
+      4,
+      9,
+      7,
+      6,
+      0,
+      5
+    ],
+    [
+      1,
+      3,
+      2,
+      7,
+      8,
+      9,
+      6,
+      0,
+      5,
+      4
+    ],
+    [
+      2,
+      0,
+      5,
+      6,
+      7,
+      1,
+      4,
+      9,
+      3,
+      8
+    ],
+    [
+      2,
+      3,
+      5,
+      9,
+      4,
+      6,
+      0,
+      8,
+      1,
+      7
+    ],
+    [
+      3,
+      2,
+      0,
+      1,
+      9,
+      8,
+      6,
+      5,
+      4,
+      7
+    ],
+    [
+      1,
+      0,
+      3,
+      4,
+      6,
+      9,
+      8,
+      5,
+      2,
+      7
+    ],
+    [
+      4,
+      2,
+      8,
+      5,
+      3,
+      7,
+      1,
+      6,
+      9,
+      0
+    ],
+    [
+      8,
+      9,
+      2,
+      4,
+      3,
+      0,
+      7,
+      6,
+      1,
+      5
+    ]
+  ],
+  "metadata": {
+    "optimum": 945,
+    "upper_bound": 945,
+    "lower_bound": 945,
+    "reference": "S. Lawrence. 'Resource constrained project scheduling: an experimental investigation of heuristic scheduling techniques (Supplement).', Graduate School of Industrial Administration. Pittsburgh, Pennsylvania, Carnegie-Mellon University, 1984."
+  }
+}
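Review note on the layout of the instance JSON above: `duration_matrix[j][k]` and `machines_matrix[j][k]` give the processing time and the machine of the k-th operation of job j, which is how `_build_operation_tables` in `runtime/problem.py` reads them. The sketch below, using the hypothetical helper name `makespan_lower_bound`, computes the standard job-shop lower bound: no schedule can finish before the longest job's total work or the most loaded machine's total work. It is shown on a toy instance rather than on LA16.

```python
def makespan_lower_bound(instance):
    """Max of the longest job's total work and the heaviest machine load."""
    durations = instance["duration_matrix"]
    machines = instance["machines_matrix"]
    # Operations of one job run in sequence, so a job needs at least its row sum.
    job_bound = max(sum(row) for row in durations)
    # A machine processes one operation at a time, so it needs at least its total load.
    machine_load = {}
    for job_durations, job_machines in zip(durations, machines):
        for duration, machine in zip(job_durations, job_machines):
            machine_load[machine] = machine_load.get(machine, 0) + duration
    return max(job_bound, max(machine_load.values()))


toy = {"duration_matrix": [[3, 2], [2, 4]], "machines_matrix": [[0, 1], [1, 0]]}
print(makespan_lower_bound(toy))  # 7: machine 0 carries 3 + 4 units of work
```

Passing the dictionary returned by `load_instance()` would give the corresponding bound for LA16 itself.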
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/runtime/problem.py b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/runtime/problem.py
new file mode 100644
index 00000000..f36b492a
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/runtime/problem.py
@@ -0,0 +1,270 @@
+from __future__ import annotations
+
+import copy
+import json
+import math
+from pathlib import Path
+from typing import Any
+
+
+INSTANCE_PATH = Path(__file__).resolve().with_name("instance.json")
+KNOWN_OPTIMUM = 945
+
+
+def load_instance() -> dict[str, Any]:
+    return json.loads(INSTANCE_PATH.read_text(encoding="utf-8"))
+
+
+def relative_gap(value: float, optimum: float) -> float:
+    return float((value - optimum) / optimum)
+
+
+def baseline_dispatch_score(operation: dict[str, Any], state: dict[str, Any]):
+    return (
+        -float(operation["duration"]),
+        -float(operation["remaining_job_work"]),
+        -float(operation["job_id"]),
+    )
+
+
+def baseline_move_score(move: dict[str, Any], state: dict[str, Any]):
+    return (
+        float(move["delta_duration"]),
+        -float(move["machine_position"]),
+        -float(move["machine_id"]),
+    )
+
+
+def _build_operation_tables(instance: dict[str, Any]) -> tuple[list[list[int]], list[list[int]], dict[tuple[int, int], tuple[int, int]]]:
+    durations = instance["duration_matrix"]
+    machines = instance["machines_matrix"]
+    op_map: dict[tuple[int, int], tuple[int, int]] = {}
+    for j, row in enumerate(machines):
+        for k, machine in enumerate(row):
+            op_map[(j, k)] = (machine, durations[j][k])
+    return durations, machines, op_map
+
+
+def schedule_with_dispatch(instance: dict[str, Any], score_operation) -> dict[str, Any]:
+    durations, machines, _ = _build_operation_tables(instance)
+    num_jobs = len(durations)
+    num_machines = len(durations[0])
+    job_next = [0] * num_jobs
+    job_ready = [0] * num_jobs
+    machine_ready = [0] * num_machines
+    scheduled_ops: list[dict[str, Any]] = []
+
+    total_ops = num_jobs * num_machines
+    while len(scheduled_ops) < total_ops:
+        candidates: list[dict[str, Any]] = []
+        for job_id in range(num_jobs):
+            op_index = job_next[job_id]
+            if op_index >= num_machines:
+                continue
+            machine_id = machines[job_id][op_index]
+            duration = durations[job_id][op_index]
+            earliest_start = max(job_ready[job_id], machine_ready[machine_id])
+            remaining_job_work = sum(durations[job_id][op_index:])
+            remaining_job_ops = num_machines - op_index
+            candidates.append(
+                {
+                    "job_id": job_id,
+                    "op_index": op_index,
+                    "machine_id": machine_id,
+                    "duration": duration,
+                    "earliest_start": earliest_start,
+                    "remaining_job_work": remaining_job_work,
+                    "remaining_job_ops": remaining_job_ops,
+                }
+            )
+        min_start = min(op["earliest_start"] for op in candidates)
+        ready = [op for op in candidates if op["earliest_start"] == min_start]
+        state = {
+            "step": len(scheduled_ops),
+            "job_ready_times": tuple(job_ready),
+            "machine_ready_times": tuple(machine_ready),
+            "current_makespan": max(max(job_ready), max(machine_ready)),
+        }
+        scored: list[tuple[Any, dict[str, Any]]] = []
+        for op in ready:
+            score = score_operation(op, state)
+            scored.append((score, op))
+        scored.sort(
+            key=lambda item: (
+                item[0],
+                -item[1]["duration"],
+                -item[1]["remaining_job_work"],
+                -item[1]["job_id"],
+            ),
+            reverse=True,
+        )
+        chosen = scored[0][1]
+        start = chosen["earliest_start"]
+        end = start + chosen["duration"]
+        scheduled = dict(chosen)
+        scheduled["start"] = start
+        scheduled["end"] = end
+        scheduled_ops.append(scheduled)
+        job_ready[chosen["job_id"]] = end
+        machine_ready[chosen["machine_id"]] = end
+        job_next[chosen["job_id"]] += 1
+
+    return {
+        "valid": True,
+        "schedule": scheduled_ops,
+        "makespan": max(op["end"] for op in scheduled_ops),
+        "machine_sequences": machine_sequences_from_schedule(instance, scheduled_ops),
+    }
+
+
+def machine_sequences_from_schedule(instance: dict[str, Any], schedule: list[dict[str, Any]]) -> list[list[tuple[int, int]]]:
+    num_machines = len(instance["machines_matrix"][0])
+    sequences: list[list[tuple[int, int, int, int]]] = [[] for _ in range(num_machines)]
+    for op in schedule:
+        sequences[op["machine_id"]].append((op["start"], op["job_id"], op["op_index"], op["end"]))
+    out: list[list[tuple[int, int]]] = []
+    for machine_ops in sequences:
+        machine_ops.sort()
+        out.append([(job_id, op_index) for _, job_id, op_index, _ in machine_ops])
+    return out
+
+
+def build_schedule_from_machine_sequences(instance: dict[str, Any], machine_sequences: list[list[tuple[int, int]]]) -> dict[str, Any]:
+    durations, machines, op_map = _build_operation_tables(instance)
+    num_jobs = len(durations)
+    num_machines = len(durations[0])
+    machine_pred: dict[tuple[int, int], tuple[int, int] | None] = {}
+    for seq in machine_sequences:
+        for idx, op in enumerate(seq):
+            machine_pred[op] = seq[idx - 1] if idx > 0 else None
+
+    scheduled: dict[tuple[int, int], dict[str, Any]] = {}
+    total_ops = num_jobs * num_machines
+    while len(scheduled) < total_ops:
+        progress = False
+        for job_id in range(num_jobs):
+            for op_index in range(num_machines):
+                op = (job_id, op_index)
+                if op in scheduled:
+                    continue
+                job_prev = (job_id, op_index - 1) if op_index > 0 else None
+                mach_prev = machine_pred.get(op)
+                if job_prev is not None and job_prev not in scheduled:
+                    continue
+                if mach_prev is not None and mach_prev not in scheduled:
+                    continue
+                machine_id, duration = op_map[op]
+                start = 0
+                if job_prev is not None:
+                    start = max(start, scheduled[job_prev]["end"])
+                if mach_prev is not None:
+                    start = max(start, scheduled[mach_prev]["end"])
+                scheduled[op] = {
+                    "job_id": job_id,
+                    "op_index": op_index,
+                    "machine_id": machine_id,
+                    "duration": duration,
+                    "start": start,
+                    "end": start + duration,
+                }
+                progress = True
+        if not progress:
+            return {"valid": False, "schedule": [], "makespan": float("inf"), "machine_sequences": machine_sequences}
+
+    schedule = list(scheduled.values())
+    schedule.sort(key=lambda item: (item["start"], item["machine_id"], item["job_id"], item["op_index"]))
+    return {
+        "valid": True,
+        "schedule": schedule,
+        "makespan": max(op["end"] for op in schedule),
+        "machine_sequences": machine_sequences,
+    }
+
+
+def initial_machine_sequences(instance: dict[str, Any]) -> list[list[tuple[int, int]]]:
+    baseline = schedule_with_dispatch(instance, baseline_dispatch_score)
+    return baseline["machine_sequences"]
+
+
+def generate_adjacent_moves(instance: dict[str, Any], current: dict[str, Any]) -> list[dict[str, Any]]:
+    durations, machines, _ = _build_operation_tables(instance)
+    schedule_by_op = {
+        (op["job_id"], op["op_index"]): op
+        for op in current["schedule"]
+    }
+    moves: list[dict[str, Any]] = []
+    for machine_id, seq in enumerate(current["machine_sequences"]):
+        for pos in range(len(seq) - 1):
+            a = seq[pos]
+            b = seq[pos + 1]
+            a_sched = schedule_by_op[a]
+            b_sched = schedule_by_op[b]
+            moves.append(
+                {
+                    "machine_id": machine_id,
+                    "machine_position": pos,
+                    "op_a": {
+                        "job_id": a[0],
+                        "op_index": a[1],
+                        "duration": durations[a[0]][a[1]],
+                        "start": a_sched["start"],
+                        "end": a_sched["end"],
+                    },
+                    "op_b": {
+                        "job_id": b[0],
+                        "op_index": b[1],
+                        "duration": durations[b[0]][b[1]],
+                        "start": b_sched["start"],
+                        "end": b_sched["end"],
+                    },
+                    "delta_duration": durations[a[0]][a[1]] - durations[b[0]][b[1]],
+                    "current_makespan": current["makespan"],
+                }
+            )
+    return moves
+
+
+def apply_adjacent_swap(machine_sequences: list[list[tuple[int, int]]], machine_id: int, position: int) -> list[list[tuple[int, int]]]:
+    new_sequences = copy.deepcopy(machine_sequences)
+    new_sequences[machine_id][position], new_sequences[machine_id][position + 1] = (
+        new_sequences[machine_id][position + 1],
+        new_sequences[machine_id][position],
+    )
+    return new_sequences
+
+
+def run_local_search(instance: dict[str, Any], score_move, max_iterations: int = 50) -> dict[str, Any]:
+    current = schedule_with_dispatch(instance, baseline_dispatch_score)
+    if not current["valid"]:
+        return current
+
+    for iteration in range(max_iterations):
+        moves = generate_adjacent_moves(instance, current)
+        state = {
+            "iteration": iteration,
+            "current_makespan": current["makespan"],
+        }
+        scored = []
+        for move in moves:
+            score = score_move(move, state)
+            scored.append((score, move))
+        scored.sort(
+            key=lambda item: (
+                item[0],
+                item[1]["delta_duration"],
+                -item[1]["machine_position"],
+            ),
+            reverse=True,
+        )
+        improved = False
+        for _, move in scored:
+            new_sequences = apply_adjacent_swap(current["machine_sequences"], move["machine_id"], move["machine_position"])
+            candidate = build_schedule_from_machine_sequences(instance, new_sequences)
+            if candidate["valid"] and candidate["makespan"] < current["makespan"]:
+                current = candidate
+                improved = True
+                break
+        if not improved:
+            break
+
+    return current
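Review note: `build_schedule_from_machine_sequences` above reports `valid: False` when a pass over all operations schedules nothing, which is what happens when the machine orders and the job orders form a cycle. The sketch below is a compact, self-contained restatement of that idea on a toy 2-job, 2-machine instance. The function name `evaluate_sequences` and the toy data are assumptions made for illustration; they are not part of the repository.

```python
from __future__ import annotations


def evaluate_sequences(
    durations: list[list[int]],
    machine_sequences: list[list[tuple[int, int]]],
) -> int | None:
    """Return the makespan, or None if the machine orders contain a cycle.

    Hypothetical helper: operations are (job_id, op_index) pairs, and
    machine_sequences[m] lists the operations on machine m in processing order.
    """
    # Predecessor of each operation on its machine (None for the first one).
    machine_pred: dict[tuple[int, int], tuple[int, int] | None] = {}
    for seq in machine_sequences:
        for idx, op in enumerate(seq):
            machine_pred[op] = seq[idx - 1] if idx > 0 else None

    end: dict[tuple[int, int], int] = {}
    total_ops = sum(len(row) for row in durations)
    while len(end) < total_ops:
        progress = False
        for job_id, row in enumerate(durations):
            for op_index, duration in enumerate(row):
                op = (job_id, op_index)
                if op in end:
                    continue
                preds = []
                if op_index > 0:
                    preds.append((job_id, op_index - 1))
                if machine_pred.get(op) is not None:
                    preds.append(machine_pred[op])
                if any(p not in end for p in preds):
                    continue  # a predecessor is still unscheduled
                start = max((end[p] for p in preds), default=0)
                end[op] = start + duration
                progress = True
        if not progress:
            return None  # no operation became schedulable: cyclic orders
    return max(end.values())


# Toy routing (assumed): job 0 visits machine 0 then 1; job 1 visits machine 1 then 0.
toy_durations = [[3, 2], [2, 4]]
feasible = [[(0, 0), (1, 1)], [(1, 0), (0, 1)]]
cyclic = [[(1, 1), (0, 0)], [(0, 1), (1, 0)]]
print(evaluate_sequences(toy_durations, feasible))  # 7
print(evaluate_sequences(toy_durations, cyclic))    # None
```

In the cyclic case every operation waits on another unscheduled operation, so an adjacent swap that produces such an order is rejected by the search shell rather than evaluated.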
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/scripts/init.py b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/scripts/init.py
new file mode 100644
index 00000000..4ef149c5
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/scripts/init.py
@@ -0,0 +1,48 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.LA16DispatchingRuleOptimization.baseline.solution import score_operation as _baseline_score_operation
+except ModuleNotFoundError:
+    from baseline.solution import score_operation as _baseline_score_operation
+
+
+# EVOLVE-BLOCK-START
+def score_operation(operation, state):
+    return _baseline_score_operation(operation, state)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.OperationsResearch.LA16DispatchingRuleOptimization.runtime.problem import load_instance, schedule_with_dispatch
+    except ModuleNotFoundError:
+        from runtime.problem import load_instance, schedule_with_dispatch
+    instance = load_instance()
+    result = schedule_with_dispatch(instance, score_operation)
+    print(result["makespan"])
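Review note: the evolve block above simply delegates to the baseline. As an illustration of what a replacement could look like, the sketch below implements the classic most-work-remaining (MWKR) dispatching rule using only fields that `schedule_with_dispatch` places in each candidate operation (`remaining_job_work`, `duration`, `job_id`). It is an example of the interface, not a claim that this rule beats the baseline on LA16.

```python
def score_operation(operation, state):
    """Most-work-remaining rule: larger scores are dispatched first.

    Primary key: total processing time left in the job (more is better).
    Tie-breaks: shorter current operation, then lower job id.
    """
    return (
        float(operation["remaining_job_work"]),
        -float(operation["duration"]),
        -float(operation["job_id"]),
    )


# Two hand-made candidates with the same fields the runtime provides.
op_a = {"job_id": 0, "duration": 10, "remaining_job_work": 50}
op_b = {"job_id": 1, "duration": 5, "remaining_job_work": 80}
chosen = max([op_a, op_b], key=lambda op: score_operation(op, {}))
print(chosen["job_id"])  # 1
```

The `state` argument is unused here; the runtime also exposes `step`, `job_ready_times`, `machine_ready_times`, and `current_makespan` for rules that need them.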
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/verification/evaluator.py b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/verification/evaluator.py
new file mode 100644
index 00000000..d3601c09
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/verification/evaluator.py
@@ -0,0 +1,125 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.LA16DispatchingRuleOptimization.runtime.problem import (
+        KNOWN_OPTIMUM,
+        baseline_dispatch_score,
+        baseline_move_score,
+        load_instance,
+        relative_gap,
+        run_local_search,
+        schedule_with_dispatch,
+    )
+except ModuleNotFoundError:
+    from runtime.problem import (
+        KNOWN_OPTIMUM,
+        baseline_dispatch_score,
+        baseline_move_score,
+        load_instance,
+        relative_gap,
+        run_local_search,
+        schedule_with_dispatch,
+    )
+
+
+TASK_KIND = "dispatch"
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_makespan": 0.0,
+        "baseline_makespan": 0.0,
+        "relative_gap_to_optimum": 0.0,
+    }
+    artifacts: dict[str, str] = {}
+
+    program = Path(program_path).expanduser().resolve()
+    instance = load_instance()
+    try:
+        # Run the candidate inside the try so load-time crashes yield invalid metrics.
+        namespace = runpy.run_path(str(program), run_name="candidate_program")
+        if TASK_KIND == "dispatch":
+            score_fn = namespace.get("score_operation")
+            if not callable(score_fn):
+                raise RuntimeError("candidate must define score_operation(operation, state)")
+            baseline = schedule_with_dispatch(instance, baseline_dispatch_score)
+            candidate = schedule_with_dispatch(instance, score_fn)
+        else:
+            score_fn = namespace.get("score_move")
+            if not callable(score_fn):
+                raise RuntimeError("candidate must define score_move(move, state)")
+            max_iterations = int(namespace.get("MAX_ITERATIONS", 50))
+            baseline = run_local_search(instance, baseline_move_score, max_iterations=50)
+            candidate = run_local_search(instance, score_fn, max_iterations=max_iterations)
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+
+    if not baseline["valid"]:
+        artifacts["error_message"] = "internal baseline produced an invalid schedule"
+        return metrics, artifacts
+    if not candidate["valid"]:
+        artifacts["error_message"] = "candidate produced an invalid schedule"
+        return metrics, artifacts
+
+    makespan = float(candidate["makespan"])
+    baseline_makespan = float(baseline["makespan"])
+    if not math.isfinite(makespan) or makespan <= 0:
+        artifacts["error_message"] = "candidate makespan is invalid"
+        return metrics, artifacts
+
+    metrics["valid"] = 1.0
+    metrics["candidate_makespan"] = makespan
+    metrics["baseline_makespan"] = baseline_makespan
+    metrics["relative_gap_to_optimum"] = relative_gap(makespan, KNOWN_OPTIMUM)
+    metrics["combined_score"] = -makespan
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
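Review note: the evaluator above passes the candidate's scoring function straight to the scheduler and never inspects the returned value, so a NaN or infinite score is not rejected at this layer. One possible guard is sketched below. `require_finite` is a hypothetical helper, not something the repository defines; it accepts a scalar or a tuple of scalars, since the baselines return tuples.

```python
import math
from typing import Any, Callable


def require_finite(score_fn: Callable[[dict, dict], Any]) -> Callable[[dict, dict], Any]:
    """Wrap a scoring function so that non-finite scores raise ValueError."""

    def wrapped(item: dict, state: dict) -> Any:
        score = score_fn(item, state)
        # Treat a scalar as a one-element tuple so both shapes share one check.
        parts = score if isinstance(score, tuple) else (score,)
        for part in parts:
            if not isinstance(part, (int, float)) or not math.isfinite(part):
                raise ValueError(f"score must be finite, got {score!r}")
        return score

    return wrapped


guarded = require_finite(lambda op, state: (-float(op["duration"]), 0.0))
print(guarded({"duration": 5}, {}))  # (-5.0, 0.0)
```

Because the evaluator already wraps scheduling in `try`/`except Exception`, passing `require_finite(score_fn)` instead of `score_fn` would turn a non-finite score into an invalid result with a traceback in `artifacts.json`.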
diff --git a/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/verification/requirements.txt b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/verification/requirements.txt
new file mode 100644
index 00000000..4adfed0b
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16DispatchingRuleOptimization/verification/requirements.txt
@@ -0,0 +1 @@
+ortools
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/README.md b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/README.md
new file mode 100644
index 00000000..bc5286cd
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/README.md
@@ -0,0 +1,55 @@
+# LA16 Neighborhood Move Selection
+
+Rank adjacent-swap moves for a frozen local-search shell on the canonical LA16 job shop and minimize makespan.
+
+## Why This Benchmark Matters
+
+This benchmark targets schedule refinement on LA16 under a limited search budget. The search shell is fixed; only your move-ranking policy can change its trajectory.
+
+You are tuning heuristic control inside a fixed combinatorial optimizer rather than emitting a schedule from scratch.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `score_move(move, state)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/verification/evaluator.py \
+  benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/scripts/init.py \
+  --metrics-out /tmp/LA16NeighborhoodMoveSelection_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/LA16NeighborhoodMoveSelection \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/README_zh-CN.md b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/README_zh-CN.md
new file mode 100644
index 00000000..5b6d723c
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/README_zh-CN.md
@@ -0,0 +1,55 @@
+# LA16 邻域移动选择
+
+在经典 LA16 作业车间的冻结局部搜索壳层里，对相邻交换动作排序，并最小化 makespan。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 对应的是 LA16 实例上的排程细化问题，搜索预算有限，搜索壳层固定，唯一能改变搜索轨迹的只有你的邻域动作排序策略。
+
+你调的是一个固定组合优化器里的启发式控制逻辑，而不是自己从零输出一份排程。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`score_move(move, state)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/verification/evaluator.py \
+  benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/scripts/init.py \
+  --metrics-out /tmp/LA16NeighborhoodMoveSelection_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/LA16NeighborhoodMoveSelection \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/Task.md b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/Task.md
new file mode 100644
index 00000000..4652999c
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/Task.md
@@ -0,0 +1,53 @@
+# LA16 Neighborhood Move Selection Task
+
+## Problem
+
+Rank adjacent-swap moves for a frozen local-search shell on the canonical LA16 job shop and minimize makespan.
+
+This benchmark targets schedule refinement on LA16 under a limited search budget. The search shell is fixed; only your move-ranking policy can change its trajectory.
+
+You are tuning heuristic control inside a fixed combinatorial optimizer rather than emitting a schedule from scratch.
+
+## What Is Frozen
+
+- The canonical `la16` instance and the known optimum `945`.
+- The baseline SPT dispatch schedule used as the incumbent.
+- The adjacent-swap move generator and first-improving acceptance rule in `runtime/problem.py`.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+MAX_ITERATIONS = 50
+
+
+def score_move(move, state):
+    ...
+```
+
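+Define `score_move(move, state)` and return a finite scalar, or a tuple of finite scalars as `baseline/solution.py` does; moves with larger scores are tried first. Return the same type for every move so that scores stay comparable. You may also set `MAX_ITERATIONS` to any positive integer if you want to change the search budget.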
+
+## Evaluation
+
+1. Load the canonical `la16` instance from `runtime/instance.json` via `load_instance()` in `runtime/problem.py`.
+2. Start from the frozen baseline dispatch schedule.
+3. Repeatedly generate adjacent machine-order swap moves, rank them by `score_move(...)`, and apply the first improving move.
+4. Stop when no improving move exists or `MAX_ITERATIONS` is reached, then report candidate makespan and diagnostics.
+
+## Metrics
+
+- `combined_score`: `-candidate_makespan`
+- `valid`: `1.0` only if a complete feasible schedule is produced
+- `candidate_makespan`
+- `baseline_makespan`
+- `relative_gap_to_optimum`
+
+## Invalid Submissions
+
+- `score_move(...)` is missing or crashes
+- The returned move score is non-finite
+- `MAX_ITERATIONS` is invalid or evaluation fails before a valid schedule is built
+- The induced schedule becomes infeasible
+
+<!-- AI_GENERATED -->
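Review note: a minimal sketch of a submission that satisfies the contract above, using only fields that `generate_adjacent_moves` and `run_local_search` in `runtime/problem.py` provide (`move["op_b"]["end"]`, `move["delta_duration"]`, `state["current_makespan"]`). The heuristic itself, preferring swaps close to the end of the schedule, is an assumption chosen for illustration and is not claimed to improve on the baseline.

```python
import math

MAX_ITERATIONS = 50


def score_move(move, state):
    """Return a single finite float; larger scores are tried first."""
    makespan = float(state["current_makespan"])
    # Slack is 0 when the second operation of the pair finishes last.
    slack = makespan - float(move["op_b"]["end"])
    # Small slack dominates; delta_duration (duration of op_a minus op_b)
    # acts as a weak tie-break favouring swaps that move the shorter op forward.
    return -slack + 0.01 * float(move["delta_duration"])


# Hand-made moves with the same shape the runtime produces.
late = {"op_b": {"end": 100}, "delta_duration": 10}
early = {"op_b": {"end": 90}, "delta_duration": 10}
state = {"iteration": 0, "current_makespan": 100}
assert score_move(late, state) > score_move(early, state)
assert math.isfinite(score_move(late, state))
```

The shell still accepts only swaps that strictly reduce the makespan, so the score affects which improving move is found first, not whether a worsening move is taken.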
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/Task_zh-CN.md b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/Task_zh-CN.md
new file mode 100644
index 00000000..4bf64535
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/Task_zh-CN.md
@@ -0,0 +1,53 @@
+# LA16 邻域移动选择
+
+## 任务概览
+
+为经典 LA16 作业车间上的冻结局部搜索壳层排序相邻交换动作，并尽量缩短 makespan。
+
+这个 benchmark 对应的是 LA16 上有限搜索预算下的排程改进。搜索壳层本身是冻结的，只有你的动作排序策略能改变搜索轨迹。
+
+你调的是一个固定组合优化器里的启发式控制策略，而不是从头输出一张排程。
+
+## 哪些部分是冻结的
+
+- 经典 `la16` 实例，以及已知最优值 `945`。
+- 作为初始 incumbent 的 baseline SPT 派工排程。
+- `runtime/problem.py` 中冻结的相邻交换邻域生成器和 first-improving 接受规则。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+MAX_ITERATIONS = 50
+
+
+def score_move(move, state):
+    ...
+```
+
+定义 `score_move(move, state)` 并返回任意有限标量；分数更高的动作会被优先尝试。你也可以把 `MAX_ITERATIONS` 设成任意正整数，以调整搜索预算。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 载入经典 `la16` 实例。
+2. 从冻结的 baseline 派工排程开始。
+3. 反复生成相邻机器顺序交换动作，按 `score_move(...)` 排序，并应用第一个能改进的动作。
+4. 当不存在改进动作或达到 `MAX_ITERATIONS` 时停止，并输出候选 makespan 与诊断指标。
+
+## 指标
+
+- `combined_score`：`-candidate_makespan`
+- `valid`：只有生成完整可行排程时才为 `1.0`
+- `candidate_makespan`
+- `baseline_makespan`
+- `relative_gap_to_optimum`
+
+## 判为无效的情况
+
+- 缺少 `score_move(...)`，或函数在评测中报错
+- 返回的动作分数不是有限值
+- `MAX_ITERATIONS` 不合法，或在得到有效排程之前评测就失败
+- 诱导出的排程变得不可行
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/baseline/solution.py b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/baseline/solution.py
new file mode 100644
index 00000000..bf5ef33a
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/baseline/solution.py
@@ -0,0 +1,11 @@
+from __future__ import annotations
+
+MAX_ITERATIONS = 50
+
+
+def score_move(move, state):
+    return (
+        float(move["delta_duration"]),
+        -float(move["machine_position"]),
+        -float(move["machine_id"]),
+    )
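Review note: the baseline returns a tuple, and the search shell sorts scores with `reverse=True`, so tuples are compared element by element with larger values first. The short sketch below shows the resulting order on three hand-made moves; the move dictionaries are illustrative and contain only the keys the baseline reads.

```python
def score_move(move, state):
    # Same expression as the baseline: larger delta_duration first,
    # then lower machine_position, then lower machine_id.
    return (
        float(move["delta_duration"]),
        -float(move["machine_position"]),
        -float(move["machine_id"]),
    )


moves = [
    {"delta_duration": 5, "machine_position": 2, "machine_id": 0},
    {"delta_duration": 5, "machine_position": 0, "machine_id": 3},
    {"delta_duration": 9, "machine_position": 4, "machine_id": 1},
]
ranked = sorted(moves, key=lambda m: score_move(m, {}), reverse=True)
order = [(m["delta_duration"], m["machine_position"]) for m in ranked]
print(order)  # [(9, 4), (5, 0), (5, 2)]
```

The negated fields are what make "lower position" and "lower machine id" win ties under a descending sort.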
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/agent_files.txt b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/candidate_destination.txt b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/constraints.txt b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/constraints.txt
new file mode 100644
index 00000000..10b71922
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/constraints.txt
@@ -0,0 +1,6 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files in `runtime/`, `verification/`, `references/`, or `baseline/`.
+For dispatch tasks, define `score_operation(operation, state)`.
+For neighborhood tasks, define `score_move(move, state)`.
+Return only finite scalar scores.
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/eval_command.txt b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/eval_cwd.txt b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/initial_program.txt b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/readonly_files.txt b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..44d55c3c
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/frontier_eval/readonly_files.txt
@@ -0,0 +1,5 @@
+runtime/problem.py
+runtime/instance.json
+verification/evaluator.py
+baseline/solution.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/references/source_manifest.md b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/references/source_manifest.md
new file mode 100644
index 00000000..69751f24
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/references/source_manifest.md
@@ -0,0 +1,11 @@
+# Source Manifest
+
+- Canonical instance: `la16`
+- Upstream package: `job_shop_lib`
+- Upstream file: `job_shop_lib/benchmarking/benchmark_instances.json`
+- Canonical optimum recorded in upstream metadata: `945`
+- Original academic provenance:
+  - `ft10`: Fisher and Thompson, *Industrial Scheduling*, 1963.
+  - `la16`: Lawrence benchmark set, 1984.
+
+This benchmark vendors only the specific frozen instance JSON required for evaluation.
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/runtime/instance.json b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/runtime/instance.json
new file mode 100644
index 00000000..5619e07e
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/runtime/instance.json
@@ -0,0 +1,253 @@
+{
+  "name": "la16",
+  "duration_matrix": [
+    [
+      21,
+      71,
+      16,
+      52,
+      26,
+      34,
+      53,
+      21,
+      55,
+      95
+    ],
+    [
+      55,
+      31,
+      98,
+      79,
+      12,
+      66,
+      42,
+      77,
+      77,
+      39
+    ],
+    [
+      34,
+      64,
+      62,
+      19,
+      92,
+      79,
+      43,
+      54,
+      83,
+      37
+    ],
+    [
+      87,
+      69,
+      87,
+      38,
+      24,
+      83,
+      41,
+      93,
+      77,
+      60
+    ],
+    [
+      98,
+      44,
+      25,
+      75,
+      43,
+      49,
+      96,
+      77,
+      17,
+      79
+    ],
+    [
+      35,
+      76,
+      28,
+      10,
+      61,
+      9,
+      95,
+      35,
+      7,
+      95
+    ],
+    [
+      16,
+      59,
+      46,
+      91,
+      43,
+      50,
+      52,
+      59,
+      28,
+      27
+    ],
+    [
+      45,
+      87,
+      41,
+      20,
+      54,
+      43,
+      14,
+      9,
+      39,
+      71
+    ],
+    [
+      33,
+      37,
+      66,
+      33,
+      26,
+      8,
+      28,
+      89,
+      42,
+      78
+    ],
+    [
+      69,
+      81,
+      94,
+      96,
+      27,
+      69,
+      45,
+      78,
+      74,
+      84
+    ]
+  ],
+  "machines_matrix": [
+    [
+      1,
+      6,
+      9,
+      8,
+      7,
+      2,
+      0,
+      4,
+      3,
+      5
+    ],
+    [
+      4,
+      2,
+      5,
+      9,
+      0,
+      7,
+      1,
+      8,
+      6,
+      3
+    ],
+    [
+      3,
+      2,
+      8,
+      1,
+      4,
+      9,
+      7,
+      6,
+      0,
+      5
+    ],
+    [
+      1,
+      3,
+      2,
+      7,
+      8,
+      9,
+      6,
+      0,
+      5,
+      4
+    ],
+    [
+      2,
+      0,
+      5,
+      6,
+      7,
+      1,
+      4,
+      9,
+      3,
+      8
+    ],
+    [
+      2,
+      3,
+      5,
+      9,
+      4,
+      6,
+      0,
+      8,
+      1,
+      7
+    ],
+    [
+      3,
+      2,
+      0,
+      1,
+      9,
+      8,
+      6,
+      5,
+      4,
+      7
+    ],
+    [
+      1,
+      0,
+      3,
+      4,
+      6,
+      9,
+      8,
+      5,
+      2,
+      7
+    ],
+    [
+      4,
+      2,
+      8,
+      5,
+      3,
+      7,
+      1,
+      6,
+      9,
+      0
+    ],
+    [
+      8,
+      9,
+      2,
+      4,
+      3,
+      0,
+      7,
+      6,
+      1,
+      5
+    ]
+  ],
+  "metadata": {
+    "optimum": 945,
+    "upper_bound": 945,
+    "lower_bound": 945,
+    "reference": "S. Lawrence. 'Resource constrained project scheduling: an experimental investigation of heuristic scheduling techniques (Supplement).', Graduate School of Industrial Administration. Pittsburgh, Pennsylvania, Carnegie-Mellon University, 1984."
+  }
+}
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/runtime/problem.py b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/runtime/problem.py
new file mode 100644
index 00000000..f36b492a
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/runtime/problem.py
@@ -0,0 +1,270 @@
+from __future__ import annotations
+
+import copy
+import json
+import math
+from pathlib import Path
+from typing import Any
+
+
+INSTANCE_PATH = Path(__file__).resolve().with_name("instance.json")
+KNOWN_OPTIMUM = 945
+
+
+def load_instance() -> dict[str, Any]:
+    return json.loads(INSTANCE_PATH.read_text(encoding="utf-8"))
+
+
+def relative_gap(value: float, optimum: float) -> float:
+    return float((value - optimum) / optimum)
+
+
+def baseline_dispatch_score(operation: dict[str, Any], state: dict[str, Any]):
+    return (
+        -float(operation["duration"]),
+        -float(operation["remaining_job_work"]),
+        -float(operation["job_id"]),
+    )
+
+
+def baseline_move_score(move: dict[str, Any], state: dict[str, Any]):
+    return (
+        float(move["delta_duration"]),
+        -float(move["machine_position"]),
+        -float(move["machine_id"]),
+    )
+
+
+def _build_operation_tables(instance: dict[str, Any]) -> tuple[list[list[int]], list[list[int]], dict[tuple[int, int], tuple[int, int]]]:
+    durations = instance["duration_matrix"]
+    machines = instance["machines_matrix"]
+    op_map: dict[tuple[int, int], tuple[int, int]] = {}
+    for j, row in enumerate(machines):
+        for k, machine in enumerate(row):
+            op_map[(j, k)] = (machine, durations[j][k])
+    return durations, machines, op_map
+
+
+def schedule_with_dispatch(instance: dict[str, Any], score_operation) -> dict[str, Any]:
+    durations, machines, _ = _build_operation_tables(instance)
+    num_jobs = len(durations)
+    num_machines = len(durations[0])
+    job_next = [0] * num_jobs
+    job_ready = [0] * num_jobs
+    machine_ready = [0] * num_machines
+    scheduled_ops: list[dict[str, Any]] = []
+
+    total_ops = num_jobs * num_machines
+    while len(scheduled_ops) < total_ops:
+        candidates: list[dict[str, Any]] = []
+        for job_id in range(num_jobs):
+            op_index = job_next[job_id]
+            if op_index >= num_machines:
+                continue
+            machine_id = machines[job_id][op_index]
+            duration = durations[job_id][op_index]
+            earliest_start = max(job_ready[job_id], machine_ready[machine_id])
+            remaining_job_work = sum(durations[job_id][op_index:])
+            remaining_job_ops = num_machines - op_index
+            candidates.append(
+                {
+                    "job_id": job_id,
+                    "op_index": op_index,
+                    "machine_id": machine_id,
+                    "duration": duration,
+                    "earliest_start": earliest_start,
+                    "remaining_job_work": remaining_job_work,
+                    "remaining_job_ops": remaining_job_ops,
+                }
+            )
+        min_start = min(op["earliest_start"] for op in candidates)
+        ready = [op for op in candidates if op["earliest_start"] == min_start]
+        state = {
+            "step": len(scheduled_ops),
+            "job_ready_times": tuple(job_ready),
+            "machine_ready_times": tuple(machine_ready),
+            "current_makespan": max(max(job_ready), max(machine_ready)),
+        }
+        scored: list[tuple[Any, dict[str, Any]]] = []
+        for op in ready:
+            score = score_operation(op, state)
+            scored.append((score, op))
+        scored.sort(
+            key=lambda item: (
+                item[0],
+                -item[1]["duration"],
+                -item[1]["remaining_job_work"],
+                -item[1]["job_id"],
+            ),
+            reverse=True,
+        )
+        chosen = scored[0][1]
+        start = chosen["earliest_start"]
+        end = start + chosen["duration"]
+        scheduled = dict(chosen)
+        scheduled["start"] = start
+        scheduled["end"] = end
+        scheduled_ops.append(scheduled)
+        job_ready[chosen["job_id"]] = end
+        machine_ready[chosen["machine_id"]] = end
+        job_next[chosen["job_id"]] += 1
+
+    return {
+        "valid": True,
+        "schedule": scheduled_ops,
+        "makespan": max(op["end"] for op in scheduled_ops),
+        "machine_sequences": machine_sequences_from_schedule(instance, scheduled_ops),
+    }
+
+
+def machine_sequences_from_schedule(instance: dict[str, Any], schedule: list[dict[str, Any]]) -> list[list[tuple[int, int]]]:
+    num_machines = len(instance["machines_matrix"][0])
+    sequences: list[list[tuple[int, int, int, int]]] = [[] for _ in range(num_machines)]
+    for op in schedule:
+        sequences[op["machine_id"]].append((op["start"], op["job_id"], op["op_index"], op["end"]))
+    out: list[list[tuple[int, int]]] = []
+    for machine_ops in sequences:
+        machine_ops.sort()
+        out.append([(job_id, op_index) for _, job_id, op_index, _ in machine_ops])
+    return out
+
+
+def build_schedule_from_machine_sequences(instance: dict[str, Any], machine_sequences: list[list[tuple[int, int]]]) -> dict[str, Any]:
+    durations, machines, op_map = _build_operation_tables(instance)
+    num_jobs = len(durations)
+    num_machines = len(durations[0])
+    machine_pred: dict[tuple[int, int], tuple[int, int] | None] = {}
+    for seq in machine_sequences:
+        for idx, op in enumerate(seq):
+            machine_pred[op] = seq[idx - 1] if idx > 0 else None
+
+    scheduled: dict[tuple[int, int], dict[str, Any]] = {}
+    total_ops = num_jobs * num_machines
+    while len(scheduled) < total_ops:
+        progress = False
+        for job_id in range(num_jobs):
+            for op_index in range(num_machines):
+                op = (job_id, op_index)
+                if op in scheduled:
+                    continue
+                job_prev = (job_id, op_index - 1) if op_index > 0 else None
+                mach_prev = machine_pred.get(op)
+                if job_prev is not None and job_prev not in scheduled:
+                    continue
+                if mach_prev is not None and mach_prev not in scheduled:
+                    continue
+                machine_id, duration = op_map[op]
+                start = 0
+                if job_prev is not None:
+                    start = max(start, scheduled[job_prev]["end"])
+                if mach_prev is not None:
+                    start = max(start, scheduled[mach_prev]["end"])
+                scheduled[op] = {
+                    "job_id": job_id,
+                    "op_index": op_index,
+                    "machine_id": machine_id,
+                    "duration": duration,
+                    "start": start,
+                    "end": start + duration,
+                }
+                progress = True
+        if not progress:
+            return {"valid": False, "schedule": [], "makespan": float("inf"), "machine_sequences": machine_sequences}
+
+    schedule = list(scheduled.values())
+    schedule.sort(key=lambda item: (item["start"], item["machine_id"], item["job_id"], item["op_index"]))
+    return {
+        "valid": True,
+        "schedule": schedule,
+        "makespan": max(op["end"] for op in schedule),
+        "machine_sequences": machine_sequences,
+    }
+
+
+def initial_machine_sequences(instance: dict[str, Any]) -> list[list[tuple[int, int]]]:
+    baseline = schedule_with_dispatch(instance, baseline_dispatch_score)
+    return baseline["machine_sequences"]
+
+
+def generate_adjacent_moves(instance: dict[str, Any], current: dict[str, Any]) -> list[dict[str, Any]]:
+    durations, machines, _ = _build_operation_tables(instance)
+    schedule_by_op = {
+        (op["job_id"], op["op_index"]): op
+        for op in current["schedule"]
+    }
+    moves: list[dict[str, Any]] = []
+    for machine_id, seq in enumerate(current["machine_sequences"]):
+        for pos in range(len(seq) - 1):
+            a = seq[pos]
+            b = seq[pos + 1]
+            a_sched = schedule_by_op[a]
+            b_sched = schedule_by_op[b]
+            moves.append(
+                {
+                    "machine_id": machine_id,
+                    "machine_position": pos,
+                    "op_a": {
+                        "job_id": a[0],
+                        "op_index": a[1],
+                        "duration": durations[a[0]][a[1]],
+                        "start": a_sched["start"],
+                        "end": a_sched["end"],
+                    },
+                    "op_b": {
+                        "job_id": b[0],
+                        "op_index": b[1],
+                        "duration": durations[b[0]][b[1]],
+                        "start": b_sched["start"],
+                        "end": b_sched["end"],
+                    },
+                    "delta_duration": durations[a[0]][a[1]] - durations[b[0]][b[1]],
+                    "current_makespan": current["makespan"],
+                }
+            )
+    return moves
+
+
+def apply_adjacent_swap(machine_sequences: list[list[tuple[int, int]]], machine_id: int, position: int) -> list[list[tuple[int, int]]]:
+    new_sequences = copy.deepcopy(machine_sequences)
+    new_sequences[machine_id][position], new_sequences[machine_id][position + 1] = (
+        new_sequences[machine_id][position + 1],
+        new_sequences[machine_id][position],
+    )
+    return new_sequences
+
+
+def run_local_search(instance: dict[str, Any], score_move, max_iterations: int = 50) -> dict[str, Any]:
+    current = schedule_with_dispatch(instance, baseline_dispatch_score)
+    if not current["valid"]:
+        return current
+
+    for iteration in range(max_iterations):
+        moves = generate_adjacent_moves(instance, current)
+        state = {
+            "iteration": iteration,
+            "current_makespan": current["makespan"],
+        }
+        scored = []
+        for move in moves:
+            score = score_move(move, state)
+            scored.append((score, move))
+        scored.sort(
+            key=lambda item: (
+                item[0],
+                item[1]["delta_duration"],
+                -item[1]["machine_position"],
+            ),
+            reverse=True,
+        )
+        improved = False
+        for _, move in scored:
+            new_sequences = apply_adjacent_swap(current["machine_sequences"], move["machine_id"], move["machine_position"])
+            candidate = build_schedule_from_machine_sequences(instance, new_sequences)
+            if candidate["valid"] and candidate["makespan"] < current["makespan"]:
+                current = candidate
+                improved = True
+                break
+        if not improved:
+            break
+
+    return current
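Reviewer sketch (not part of the patch): `run_local_search` above sorts moves by descending score and accepts the first swap that lowers the makespan, so a candidate `score_move` only needs to return a comparable value where larger means "try earlier". The keys read below (`op_a`, `op_b`, `delta_duration`, `current_makespan`) are the ones built by `generate_adjacent_moves`; the ranking heuristic itself is an assumption for illustration and is not the baseline.

```python
from __future__ import annotations

from typing import Any


def score_move(move: dict[str, Any], state: dict[str, Any]) -> tuple[float, float, float]:
    """Hypothetical scorer: larger tuples are tried first by the search loop."""
    op_a, op_b = move["op_a"], move["op_b"]
    # 1.0 when op_b starts exactly when op_a ends (no idle gap between them),
    # used here as a cheap proxy for the pair sitting in a busy machine block.
    contiguous = 1.0 if op_b["start"] == op_a["end"] else 0.0
    # Fraction of the current makespan at which op_b finishes.
    lateness = op_b["end"] / max(1.0, float(state["current_makespan"]))
    return (contiguous, lateness, float(move["delta_duration"]))


def make_move(a_start: int, a_dur: int, b_start: int, b_dur: int) -> dict[str, Any]:
    """Build a toy move dict with the same shape as generate_adjacent_moves output."""
    return {
        "machine_id": 0,
        "machine_position": 0,
        "op_a": {"job_id": 0, "op_index": 0, "duration": a_dur,
                 "start": a_start, "end": a_start + a_dur},
        "op_b": {"job_id": 1, "op_index": 0, "duration": b_dur,
                 "start": b_start, "end": b_start + b_dur},
        "delta_duration": a_dur - b_dur,
        "current_makespan": 100,
    }


state = {"iteration": 0, "current_makespan": 100}
tight = make_move(0, 10, 10, 5)    # op_b starts right after op_a
loose = make_move(50, 10, 70, 20)  # idle gap between the two operations
ranked = sorted([loose, tight], key=lambda m: score_move(m, state), reverse=True)
```

Returning a tuple keeps ties deterministic, which matters because the search stops at the first improving swap rather than the best one.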
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/scripts/init.py b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/scripts/init.py
new file mode 100644
index 00000000..4352b72e
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/scripts/init.py
@@ -0,0 +1,51 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.LA16NeighborhoodMoveSelection.baseline.solution import MAX_ITERATIONS as _baseline_MAX_ITERATIONS, score_move as _baseline_score_move
+except ModuleNotFoundError:
+    from baseline.solution import MAX_ITERATIONS as _baseline_MAX_ITERATIONS, score_move as _baseline_score_move
+
+
+# EVOLVE-BLOCK-START
+MAX_ITERATIONS = _baseline_MAX_ITERATIONS
+
+
+def score_move(move, state):
+    return _baseline_score_move(move, state)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.OperationsResearch.LA16NeighborhoodMoveSelection.runtime.problem import load_instance, run_local_search
+    except ModuleNotFoundError:
+        from runtime.problem import load_instance, run_local_search
+    instance = load_instance()
+    result = run_local_search(instance, score_move, MAX_ITERATIONS)
+    print(result["makespan"])
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/verification/evaluator.py b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/verification/evaluator.py
new file mode 100644
index 00000000..42a186ad
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/verification/evaluator.py
@@ -0,0 +1,125 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.LA16NeighborhoodMoveSelection.runtime.problem import (
+        KNOWN_OPTIMUM,
+        baseline_dispatch_score,
+        baseline_move_score,
+        load_instance,
+        relative_gap,
+        run_local_search,
+        schedule_with_dispatch,
+    )
+except ModuleNotFoundError:
+    from runtime.problem import (
+        KNOWN_OPTIMUM,
+        baseline_dispatch_score,
+        baseline_move_score,
+        load_instance,
+        relative_gap,
+        run_local_search,
+        schedule_with_dispatch,
+    )
+
+
+TASK_KIND = "move"
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_makespan": 0.0,
+        "baseline_makespan": 0.0,
+        "relative_gap_to_optimum": 0.0,
+    }
+    artifacts: dict[str, str] = {}
+
+    program = Path(program_path).expanduser().resolve()
+    instance = load_instance()
+
+    try:
+        namespace = runpy.run_path(str(program), run_name="candidate_program")
+        if TASK_KIND == "dispatch":
+            score_fn = namespace.get("score_operation")
+            if not callable(score_fn):
+                raise RuntimeError("candidate must define score_operation(operation, state)")
+            baseline = schedule_with_dispatch(instance, baseline_dispatch_score)
+            candidate = schedule_with_dispatch(instance, score_fn)
+        else:
+            score_fn = namespace.get("score_move")
+            if not callable(score_fn):
+                raise RuntimeError("candidate must define score_move(move, state)")
+            max_iterations = int(namespace.get("MAX_ITERATIONS", 50))
+            baseline = run_local_search(instance, baseline_move_score, max_iterations=50)
+            candidate = run_local_search(instance, score_fn, max_iterations=max_iterations)
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+
+    if not baseline["valid"]:
+        artifacts["error_message"] = "internal baseline produced an invalid schedule"
+        return metrics, artifacts
+    if not candidate["valid"]:
+        artifacts["error_message"] = "candidate produced an invalid schedule"
+        return metrics, artifacts
+
+    makespan = float(candidate["makespan"])
+    baseline_makespan = float(baseline["makespan"])
+    if not math.isfinite(makespan) or makespan <= 0:
+        artifacts["error_message"] = "candidate makespan is invalid"
+        return metrics, artifacts
+
+    metrics["valid"] = 1.0
+    metrics["candidate_makespan"] = makespan
+    metrics["baseline_makespan"] = baseline_makespan
+    metrics["relative_gap_to_optimum"] = relative_gap(makespan, KNOWN_OPTIMUM)
+    metrics["combined_score"] = -makespan
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/verification/requirements.txt b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/verification/requirements.txt
new file mode 100644
index 00000000..4adfed0b
--- /dev/null
+++ b/benchmarks/OperationsResearch/LA16NeighborhoodMoveSelection/verification/requirements.txt
@@ -0,0 +1 @@
+ortools
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/README.md b/benchmarks/OperationsResearch/NormalRQServiceLevel95/README.md
new file mode 100644
index 00000000..ff21658e
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/README.md
@@ -0,0 +1,55 @@
+# Normal (r,Q) with 95% Service-Level Constraint
+
+Choose `(r, Q)` policies for frozen Normal-demand inventory cases with a hard 95% service-level target and minimize average cost.
+
+## Why This Benchmark Matters
+
+This benchmark captures policy tuning near a service-level boundary. Small changes in reorder point can materially change stockout risk and working capital when the target is fixed around 95%.
+
+Algorithmically, it is a small constrained discrete optimization problem over a frozen probabilistic model.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `solve(instance)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/NormalRQServiceLevel95/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/OperationsResearch/NormalRQServiceLevel95/verification/evaluator.py \
+  benchmarks/OperationsResearch/NormalRQServiceLevel95/scripts/init.py \
+  --metrics-out /tmp/NormalRQServiceLevel95_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/NormalRQServiceLevel95 \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/README_zh-CN.md b/benchmarks/OperationsResearch/NormalRQServiceLevel95/README_zh-CN.md
new file mode 100644
index 00000000..8d62eeef
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/README_zh-CN.md
@@ -0,0 +1,55 @@
+# 正态需求 `(r,Q)` 95% 服务水平约束
+
+为冻结的正态需求库存实例选择 `(r, Q)` 策略，并在满足硬性 95% 服务水平目标的前提下最小化平均成本。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 对应的是在服务水平边界附近做库存策略调优。补货点的微小变化，就可能在 95% 左右这个固定目标附近明显改变缺货风险和资金占用。
+
+从算法角度看，它是一个建立在冻结概率模型上的小型离散约束优化问题。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`solve(instance)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/NormalRQServiceLevel95/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/OperationsResearch/NormalRQServiceLevel95/verification/evaluator.py \
+  benchmarks/OperationsResearch/NormalRQServiceLevel95/scripts/init.py \
+  --metrics-out /tmp/NormalRQServiceLevel95_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/NormalRQServiceLevel95 \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/Task.md b/benchmarks/OperationsResearch/NormalRQServiceLevel95/Task.md
new file mode 100644
index 00000000..271e844d
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/Task.md
@@ -0,0 +1,49 @@
+# Normal (r,Q) with 95% Service-Level Constraint Task
+
+## Problem
+
+Choose `(r, Q)` policies for frozen Normal-demand inventory cases with a hard 95% service-level target and minimize average cost.
+
+This benchmark captures policy tuning near a service-level boundary. Small changes in reorder point can materially change stockout risk and working capital when the target is fixed around 95%.
+
+Algorithmically, it is a small constrained discrete optimization problem over a frozen probabilistic model.
+
+## What Is Frozen
+
+- The Normal-demand case table, service-level target, and cost model in `runtime/problem.py`.
+- The candidate-pair audit used to check service-level feasibility.
+- The evaluator loop that averages candidate cost across all frozen cases.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def solve(instance):
+    ...
+```
+
+Return either a 2-tuple `(reorder_point, order_quantity)` or a dict with keys `reorder_point` and `order_quantity`.
+
+## Evaluation
+
+1. Load the frozen case set from `runtime/problem.py`.
+2. Run the reference baseline on every case for diagnostics.
+3. Run your `solve(instance)` on every case and parse the returned `(r, Q)` pair.
+4. Check the hard service-level constraint, compute annual cost, and average cost across all cases.
+
+## Metrics
+
+- `combined_score`: `-avg_cost`
+- `valid`: `1.0` only if every case is feasible and every output is finite
+- `avg_cost`
+- `avg_cost_ratio`: average `baseline_cost / candidate_cost` for diagnostics
+
+## Invalid Submissions
+
+- `solve(...)` is missing or crashes
+- The returned value cannot be parsed into an `(r, Q)` pair
+- Any case misses its service-level target (`target_csl` in `runtime/problem.py`, nominally 95%) or returns non-finite values
+- Any case evaluation produces a non-finite metric
+
+<!-- AI_GENERATED -->
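Reviewer sketch (not part of the patch): one way to satisfy the submission contract is to pick the smallest integer reorder point that meets the case's `target_csl` and pair it with a plain EOQ order quantity. This assumes the evaluator checks feasibility with the same lead-time Normal model as `_service_level` in `runtime/problem.py`. Using textbook EOQ for `Q` is a swapped-in simplification, not the baseline's Stockpyl approximations, and it is not claimed to be cost-optimal. The helper name `service_level` is hypothetical.

```python
from __future__ import annotations

import math
from statistics import NormalDist


def service_level(instance: dict[str, float], r: int) -> float:
    """Cycle service level for reorder point r under Normal lead-time demand."""
    mean_lt = instance["demand_mean"] * instance["lead_time"]
    sd_lt = instance["demand_sd"] * math.sqrt(instance["lead_time"])
    return NormalDist(mean_lt, sd_lt).cdf(r)


def solve(instance: dict[str, float]) -> dict[str, int]:
    mean_lt = instance["demand_mean"] * instance["lead_time"]
    sd_lt = instance["demand_sd"] * math.sqrt(instance["lead_time"])
    # Quantile of the standard Normal at the required service level.
    z = NormalDist().inv_cdf(instance["target_csl"])
    r = math.ceil(mean_lt + z * sd_lt)
    # Guard against floating-point edge cases right at the boundary.
    while service_level(instance, r) < instance["target_csl"]:
        r += 1
    # Classic EOQ: sqrt(2 * fixed_cost * demand_rate / holding_cost).
    eoq = math.sqrt(
        2.0 * instance["fixed_cost"] * instance["demand_mean"] / instance["holding_cost"]
    )
    return {"reorder_point": r, "order_quantity": max(1, round(eoq))}


# First frozen case from runtime/problem.py, copied here so the sketch runs alone.
case = {
    "holding_cost": 0.18,
    "stockout_cost": 0.7,
    "fixed_cost": 4.0,
    "demand_mean": 1300.0,
    "demand_sd": 120.0,
    "lead_time": 0.05,
    "target_csl": 0.95,
}
solution = solve(case)
```

Reading `target_csl` from the instance rather than hard-coding 0.95 keeps the sketch feasible on cases whose target differs from the nominal value.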
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/Task_zh-CN.md b/benchmarks/OperationsResearch/NormalRQServiceLevel95/Task_zh-CN.md
new file mode 100644
index 00000000..67b21fb3
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/Task_zh-CN.md
@@ -0,0 +1,49 @@
+# 正态需求 (r,Q) 95% 服务水平优化
+
+## 任务概览
+
+为冻结的正态需求库存案例选择 `(r, Q)` 策略，在硬性的 95% 服务水平约束下尽量降低平均成本。
+
+这个 benchmark 的难点就在服务水平边界附近。目标固定在 95% 左右时，补货点稍微变一点，就可能同时影响缺货风险和占用资金。
+
+从算法角度看，它是在冻结概率模型上的一个小型离散约束优化问题。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中冻结的正态需求案例表、服务水平目标和成本模型。
+- 用于校验服务水平可行性的候选 `(r, Q)` 审核逻辑。
+- 对所有冻结案例平均候选成本的评测循环。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def solve(instance):
+    ...
+```
+
+返回一个二元组 `(reorder_point, order_quantity)`，或带 `reorder_point` 和 `order_quantity` 字段的字典。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 载入冻结案例集。
+2. 对每个案例运行参考 baseline，用于诊断对照。
+3. 在每个案例上运行你的 `solve(instance)`，并解析返回的 `(r, Q)`。
+4. 检查硬性服务水平约束，计算年成本，并对全体案例求平均。
+
+## 指标
+
+- `combined_score`：`-avg_cost`
+- `valid`：只有所有案例都可行且输出有限时才为 `1.0`
+- `avg_cost`
+- `avg_cost_ratio`：用于诊断的平均 `baseline_cost / candidate_cost`
+
+## 判为无效的情况
+
+- 缺少 `solve(...)`，或函数在评测中报错
+- 返回值无法解析为 `(r, Q)`
+- 任意案例没有达到 95% 服务水平目标，或返回了非有限值
+- 任意案例的评测指标出现非有限值
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/baseline/solution.py b/benchmarks/OperationsResearch/NormalRQServiceLevel95/baseline/solution.py
new file mode 100644
index 00000000..54ca68eb
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/baseline/solution.py
@@ -0,0 +1,30 @@
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            parent_s = str(parent)
+            if parent_s not in sys.path:
+                sys.path.insert(0, parent_s)
+            return
+    benchmark_root = here.parents[1]
+    benchmark_root_s = str(benchmark_root)
+    if benchmark_root_s not in sys.path:
+        sys.path.insert(0, benchmark_root_s)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.NormalRQServiceLevel95.runtime.problem import solve_baseline as solve
+except ModuleNotFoundError:
+    from runtime.problem import solve_baseline as solve
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/agent_files.txt b/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/candidate_destination.txt b/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/constraints.txt b/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/constraints.txt
new file mode 100644
index 00000000..35ca1548
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, or `verification/`.
+Return a finite and feasible solution for every frozen case.
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/eval_command.txt b/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/eval_cwd.txt b/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/initial_program.txt b/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/readonly_files.txt b/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..75978e1f
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/frontier_eval/readonly_files.txt
@@ -0,0 +1,4 @@
+baseline/solution.py
+runtime/problem.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/references/source_manifest.md b/benchmarks/OperationsResearch/NormalRQServiceLevel95/references/source_manifest.md
new file mode 100644
index 00000000..6cb9ce17
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/references/source_manifest.md
@@ -0,0 +1,11 @@
+# Source Manifest
+
+- Upstream library: `Stockpyl`
+- Upstream lineage:
+  - `stockpyl.rq.r_q_eil_approximation`
+  - `stockpyl.rq.r_q_eoqss_approximation`
+  - `stockpyl.rq.r_q_loss_function_approximation`
+  - single-echelon `(r,Q)` formulas for Normal demand used in Stockpyl
+- Data provenance: this benchmark does not use an external dataset. It uses benchmark-local frozen numeric instances defined in `runtime/problem.py`.
+- Transformation path: no preprocessing pipeline; the parameter tables are authored directly in the benchmark runtime.
+- License lineage: Stockpyl is released under the MIT License.
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/runtime/problem.py b/benchmarks/OperationsResearch/NormalRQServiceLevel95/runtime/problem.py
new file mode 100644
index 00000000..3028aea3
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/runtime/problem.py
@@ -0,0 +1,159 @@
+from __future__ import annotations
+
+import math
+from typing import Any
+
+from scipy.stats import norm, poisson
+from stockpyl.eoq import (
+    economic_order_quantity,
+    economic_order_quantity_with_all_units_discounts,
+    economic_order_quantity_with_incremental_discounts,
+)
+from stockpyl.rq import (
+    r_q_cost,
+    r_q_cost_poisson,
+    r_q_eil_approximation,
+    r_q_eoqss_approximation,
+    r_q_loss_function_approximation,
+    r_q_poisson_exact,
+)
+
+CASES = [
+    {
+        "holding_cost": 0.18,
+        "stockout_cost": 0.7,
+        "fixed_cost": 4.0,
+        "demand_mean": 1300.0,
+        "demand_sd": 120.0,
+        "lead_time": 0.05,
+        "target_csl": 0.95
+    },
+    {
+        "holding_cost": 0.2,
+        "stockout_cost": 0.85,
+        "fixed_cost": 5.5,
+        "demand_mean": 950.0,
+        "demand_sd": 90.0,
+        "lead_time": 0.08,
+        "target_csl": 0.95
+    },
+    {
+        "holding_cost": 0.16,
+        "stockout_cost": 0.92,
+        "fixed_cost": 6.0,
+        "demand_mean": 1500.0,
+        "demand_sd": 170.0,
+        "lead_time": 0.04,
+        "target_csl": 0.97
+    },
+    {
+        "holding_cost": 0.24,
+        "stockout_cost": 1.25,
+        "fixed_cost": 7.0,
+        "demand_mean": 720.0,
+        "demand_sd": 75.0,
+        "lead_time": 0.12,
+        "target_csl": 0.95
+    }
+]
+SAMPLE_INSTANCE = CASES[0]
+
+
+def _to_float(value: Any) -> float:
+    value = float(value)
+    if not math.isfinite(value):
+        raise ValueError("non-finite numeric value")
+    return value
+
+
+def _extract_order_quantity(solution: Any) -> float:
+    if isinstance(solution, dict):
+        if "order_quantity" not in solution:
+            raise ValueError("missing order_quantity")
+        return _to_float(solution["order_quantity"])
+    return _to_float(solution)
+
+
+def _extract_rq(solution: Any) -> tuple[int, int]:
+    if isinstance(solution, dict):
+        if "reorder_point" not in solution or "order_quantity" not in solution:
+            raise ValueError("missing reorder_point/order_quantity")
+        r = int(round(_to_float(solution["reorder_point"])))
+        q = int(round(_to_float(solution["order_quantity"])))
+        return r, q
+    if isinstance(solution, (tuple, list)) and len(solution) == 2:
+        r = int(round(_to_float(solution[0])))
+        q = int(round(_to_float(solution[1])))
+        return r, q
+    raise ValueError("solution must be a dict or length-2 tuple/list")
+
+def _service_level(instance: dict[str, float], r: int) -> float:
+    mean_lt = instance["demand_mean"] * instance["lead_time"]
+    sd_lt = instance["demand_sd"] * math.sqrt(instance["lead_time"])
+    z = (r - mean_lt) / sd_lt
+    return float(norm.cdf(z))
+
+
+def _candidate_pairs(instance: dict[str, float]) -> list[tuple[int, int]]:
+    pairs: list[tuple[int, int]] = []
+    for fn in (r_q_eil_approximation, r_q_eoqss_approximation, r_q_loss_function_approximation):
+        result = fn(
+            instance["holding_cost"],
+            instance["stockout_cost"],
+            instance["fixed_cost"],
+            instance["demand_mean"],
+            instance["demand_sd"],
+            instance["lead_time"],
+        )
+        if len(result) >= 2:
+            r = int(round(float(result[0])))
+            q = max(1, int(round(float(result[1]))))
+            pairs.append((r, q))
+    return pairs
+
+
+def solve_baseline(instance: dict[str, float]) -> dict[str, float]:
+    best = None
+    for r, q in _candidate_pairs(instance):
+        while _service_level(instance, r) < instance["target_csl"]:
+            r += 1
+        cost = r_q_cost(
+            r,
+            q,
+            instance["holding_cost"],
+            instance["stockout_cost"],
+            instance["fixed_cost"],
+            instance["demand_mean"],
+            instance["demand_sd"],
+            instance["lead_time"],
+        )
+        candidate = (float(cost), int(r), int(q))
+        if best is None or candidate < best:
+            best = candidate
+    if best is None:
+        raise RuntimeError("no feasible baseline candidate")
+    _, r, q = best
+    return {"reorder_point": r, "order_quantity": q}
+
+
+def evaluate_solution(instance: dict[str, float], solution: Any) -> dict[str, float | bool]:
+    try:
+        r, q = _extract_rq(solution)
+    except Exception:
+        return {"valid": False, "cost": float("inf")}
+    if q <= 0:
+        return {"valid": False, "cost": float("inf")}
+    csl = _service_level(instance, r)
+    if csl < instance["target_csl"]:
+        return {"valid": False, "cost": float("inf")}
+    cost = r_q_cost(
+        r,
+        q,
+        instance["holding_cost"],
+        instance["stockout_cost"],
+        instance["fixed_cost"],
+        instance["demand_mean"],
+        instance["demand_sd"],
+        instance["lead_time"],
+    )
+    return {"valid": True, "cost": float(cost), "reorder_point": int(r), "order_quantity": int(q), "service_level": float(csl)}
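
The feasibility check in `_service_level` above has a closed form: lead-time demand is treated as normal with mean `demand_mean * lead_time` and standard deviation `demand_sd * sqrt(lead_time)`. The smallest feasible integer reorder point can therefore be computed directly instead of stepping `r` upward one unit at a time. The following is a minimal sketch under stated assumptions: it uses only the standard library (`statistics.NormalDist` stands in for `scipy.stats.norm`), and the helper names `service_level` and `min_feasible_reorder_point` are hypothetical, not part of the benchmark.

```python
import math
from statistics import NormalDist


def service_level(instance, r):
    """Cycle service level P(lead-time demand <= r) under the normal model."""
    mean_lt = instance["demand_mean"] * instance["lead_time"]
    sd_lt = instance["demand_sd"] * math.sqrt(instance["lead_time"])
    return NormalDist(mean_lt, sd_lt).cdf(r)


def min_feasible_reorder_point(instance):
    """Smallest integer r whose service level meets instance["target_csl"]."""
    mean_lt = instance["demand_mean"] * instance["lead_time"]
    sd_lt = instance["demand_sd"] * math.sqrt(instance["lead_time"])
    z = NormalDist().inv_cdf(instance["target_csl"])
    r = math.ceil(mean_lt + z * sd_lt)
    # Guard against floating-point rounding at the boundary.
    while service_level(instance, r) < instance["target_csl"]:
        r += 1
    while service_level(instance, r - 1) >= instance["target_csl"]:
        r -= 1
    return r


# One of the frozen cases from the table above (only the fields used here).
case = {
    "demand_mean": 1300.0,
    "demand_sd": 120.0,
    "lead_time": 0.05,
    "target_csl": 0.95,
}
print(min_feasible_reorder_point(case))
```

This only finds the feasibility boundary for `r`. Choosing `Q`, and deciding whether a larger `r` lowers total cost, still requires the cost model that the evaluator takes from Stockpyl.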
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/scripts/init.py b/benchmarks/OperationsResearch/NormalRQServiceLevel95/scripts/init.py
new file mode 100644
index 00000000..9df56f3b
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/scripts/init.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.NormalRQServiceLevel95.baseline.solution import solve as _baseline_solve
+except ModuleNotFoundError:
+    from baseline.solution import solve as _baseline_solve
+
+
+# EVOLVE-BLOCK-START
+def solve(instance):
+    return _baseline_solve(instance)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.OperationsResearch.NormalRQServiceLevel95.runtime.problem import SAMPLE_INSTANCE
+    except ModuleNotFoundError:
+        from runtime.problem import SAMPLE_INSTANCE
+    print(solve(SAMPLE_INSTANCE))
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/verification/evaluator.py b/benchmarks/OperationsResearch/NormalRQServiceLevel95/verification/evaluator.py
new file mode 100644
index 00000000..7b6d6969
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/verification/evaluator.py
@@ -0,0 +1,109 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    repo_root = _repo_root()
+    benchmark_root = _benchmark_root()
+    for p in (repo_root, benchmark_root):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.NormalRQServiceLevel95.runtime.problem import CASES, evaluate_solution
+    from benchmarks.OperationsResearch.NormalRQServiceLevel95.baseline.solution import solve as baseline_solve
+except ModuleNotFoundError:
+    from runtime.problem import CASES, evaluate_solution
+    from baseline.solution import solve as baseline_solve
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "avg_cost": 0.0,
+        "avg_cost_ratio": 0.0,
+        "num_cases": 0.0,
+    }
+    artifacts: dict[str, str] = {}
+
+    program = Path(program_path).expanduser().resolve()
+    namespace = runpy.run_path(str(program), run_name="candidate_program")
+    solve = namespace.get("solve")
+    if not callable(solve):
+        artifacts["error_message"] = "candidate file must define solve(instance)"
+        return metrics, artifacts
+
+    total_cost = 0.0
+    total_ratio = 0.0
+    for idx, case in enumerate(CASES):
+        baseline_solution = baseline_solve(case)
+        baseline_eval = evaluate_solution(case, baseline_solution)
+        if not baseline_eval["valid"]:
+            artifacts["error_message"] = f"internal baseline invalid on case {idx}"
+            return metrics, artifacts
+
+        try:
+            candidate_solution = solve(case)
+            candidate_eval = evaluate_solution(case, candidate_solution)
+        except Exception:
+            artifacts["error_message"] = f"candidate exception on case {idx}\n{traceback.format_exc()}"
+            return metrics, artifacts
+
+        if not candidate_eval["valid"] or not math.isfinite(candidate_eval["cost"]):
+            artifacts["error_message"] = f"candidate infeasible or non-finite on case {idx}"
+            return metrics, artifacts
+
+        ratio = baseline_eval["cost"] / candidate_eval["cost"]
+        total_cost += candidate_eval["cost"]
+        total_ratio += ratio
+
+    n = float(len(CASES))
+    metrics["valid"] = 1.0
+    metrics["num_cases"] = n
+    metrics["avg_cost"] = total_cost / n
+    metrics["avg_cost_ratio"] = total_ratio / n
+    metrics["combined_score"] = -metrics["avg_cost"]
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+
+    metrics, artifacts = evaluate(args.program)
+    metrics_path = Path(args.metrics_out)
+    metrics_path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/OperationsResearch/NormalRQServiceLevel95/verification/requirements.txt b/benchmarks/OperationsResearch/NormalRQServiceLevel95/verification/requirements.txt
new file mode 100644
index 00000000..513852e8
--- /dev/null
+++ b/benchmarks/OperationsResearch/NormalRQServiceLevel95/verification/requirements.txt
@@ -0,0 +1,3 @@
+stockpyl @ git+https://github.com/LarrySnyder/stockpyl.git
+numpy
+scipy
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/README.md b/benchmarks/OperationsResearch/PoissonRQServiceLevel/README.md
new file mode 100644
index 00000000..d35fcc39
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/README.md
@@ -0,0 +1,55 @@
+# Poisson (r,Q) with Service-Level Constraint
+
+Choose `(r, Q)` policies for frozen Poisson-demand inventory cases with a hard service-level target and minimize average cost.
+
+## Why This Benchmark Matters
+
+This benchmark models replenishment for spare parts and MRO inventory, where demand arrives as discrete events and service commitments still matter. Good policies cut stockouts without overspending on safety stock.
+
+It is a small stochastic-policy tuning problem: the evaluator freezes the demand model and cost accounting, and your code only chooses the `(r, Q)` pair.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `solve(instance)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/PoissonRQServiceLevel/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/OperationsResearch/PoissonRQServiceLevel/verification/evaluator.py \
+  benchmarks/OperationsResearch/PoissonRQServiceLevel/scripts/init.py \
+  --metrics-out /tmp/PoissonRQServiceLevel_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/PoissonRQServiceLevel \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/README_zh-CN.md b/benchmarks/OperationsResearch/PoissonRQServiceLevel/README_zh-CN.md
new file mode 100644
index 00000000..9a5dd411
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/README_zh-CN.md
@@ -0,0 +1,55 @@
+# 泊松需求 `(r,Q)` 服务水平约束
+
+为冻结的泊松需求库存实例选择 `(r, Q)` 策略，并在满足硬性服务水平目标的前提下最小化平均成本。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 对应的是备件和 MRO 库存补货问题，需求以离散事件的形式到来，但仍然需要满足服务承诺。好的策略既要减少缺货，也不能在安全库存上花费过多。
+
+它本质上是一个小型随机库存策略调优问题：评测器冻结了需求模型和成本核算，而你的代码只负责选择 `(r, Q)`。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`solve(instance)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/OperationsResearch/PoissonRQServiceLevel/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/OperationsResearch/PoissonRQServiceLevel/verification/evaluator.py \
+  benchmarks/OperationsResearch/PoissonRQServiceLevel/scripts/init.py \
+  --metrics-out /tmp/PoissonRQServiceLevel_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/PoissonRQServiceLevel \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/Task.md b/benchmarks/OperationsResearch/PoissonRQServiceLevel/Task.md
new file mode 100644
index 00000000..8ec00c12
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/Task.md
@@ -0,0 +1,49 @@
+# Poisson (r,Q) with Service-Level Constraint Task
+
+## Problem
+
+Choose `(r, Q)` policies for frozen Poisson-demand inventory cases with a hard service-level target and minimize average cost.
+
+This benchmark models replenishment for spare parts and MRO inventory, where demand arrives as discrete events and service commitments still matter. Good policies cut stockouts without overspending on safety stock.
+
+It is a small stochastic-policy tuning problem: the evaluator freezes the demand model and cost accounting, and your code only chooses the `(r, Q)` pair.
+
+## What Is Frozen
+
+- The Poisson-demand case table, service-level target, and cost model in `runtime/problem.py`.
+- The feasibility audit used to check whether a returned `(r, Q)` pair meets the target.
+- The evaluator loop that averages candidate cost across all frozen cases.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def solve(instance):
+    ...
+```
+
+Return either a 2-tuple `(reorder_point, order_quantity)` or a dict with keys `reorder_point` and `order_quantity`.
+
+## Evaluation
+
+1. Load the frozen case set from `runtime/problem.py`.
+2. Run the reference baseline on every case for diagnostics.
+3. Run your `solve(instance)` on every case and parse the returned `(r, Q)` pair.
+4. Check the hard service-level constraint, compute annual cost, and average cost across all cases.
+
+## Metrics
+
+- `combined_score`: `-avg_cost`
+- `valid`: `1.0` only if every case is feasible and every output is finite
+- `avg_cost`
+- `avg_cost_ratio`: average `baseline_cost / candidate_cost` for diagnostics
+
+## Invalid Submissions
+
+- `solve(...)` is missing or crashes
+- The returned value cannot be parsed into an `(r, Q)` pair
+- Any case misses the service-level target or returns non-finite values
+- Any case evaluation produces a non-finite metric
+
+<!-- AI_GENERATED -->
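
The submission contract and service-level rule above can be illustrated with a small standalone `solve(instance)`. This is a hedged sketch, not the reference baseline: it picks the smallest `r` whose Poisson lead-time CDF meets `target_csl`, and it sets `Q` to the classical EOQ value, which is an assumption made here for illustration (the baseline in `runtime/problem.py` calls `r_q_poisson_exact` instead). The helper `poisson_cdf` is hypothetical and uses only the standard library.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), summed term by term.

    Adequate for the lead-time means in the frozen cases (below 100);
    exp(-mean) underflows for means above roughly 745.
    """
    if k < 0:
        return 0.0
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return min(total, 1.0)


def solve(instance):
    """Return a feasible (r, Q) policy as a dict, per the submission contract."""
    mean_lt = instance["demand_mean"] * instance["lead_time"]
    r = 0
    while poisson_cdf(r, mean_lt) < instance["target_csl"]:
        r += 1
    # Assumption: Q from the EOQ formula sqrt(2 * K * D / h), not from the baseline.
    eoq = math.sqrt(
        2.0 * instance["fixed_cost"] * instance["demand_mean"] / instance["holding_cost"]
    )
    return {"reorder_point": r, "order_quantity": max(1, round(eoq))}


# First frozen case from runtime/problem.py.
SAMPLE = {
    "holding_cost": 0.18,
    "stockout_cost": 0.7,
    "fixed_cost": 4.0,
    "demand_mean": 1300.0,
    "lead_time": 0.05,
    "target_csl": 0.95,
}
print(solve(SAMPLE))
```

A policy built this way is feasible by construction, but nothing here evaluates cost, so it may score better or worse than the baseline.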
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/Task_zh-CN.md b/benchmarks/OperationsResearch/PoissonRQServiceLevel/Task_zh-CN.md
new file mode 100644
index 00000000..a97e803f
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/Task_zh-CN.md
@@ -0,0 +1,49 @@
+# 泊松需求 (r,Q) 服务水平优化
+
+## 任务概览
+
+为冻结的泊松需求库存案例选择 `(r, Q)` 策略，在硬性的服务水平约束下尽量降低平均成本。
+
+这个 benchmark 对应的是备件和 MRO 库存补货，需求是离散事件到达，但服务承诺依然很重要。好的策略既要减少缺货，也不能因为过度安全库存而多花钱。
+
+从计算角度看，它是一个小型随机策略调优问题：评测器冻结了需求模型和成本核算，而你只需要选出 `(r, Q)`。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中冻结的泊松需求案例表、服务水平目标和成本模型。
+- 用于检查返回 `(r, Q)` 是否满足目标的可行性审核逻辑。
+- 对所有冻结案例平均候选成本的评测循环。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def solve(instance):
+    ...
+```
+
+返回一个二元组 `(reorder_point, order_quantity)`，或带 `reorder_point` 和 `order_quantity` 字段的字典。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 载入冻结案例集。
+2. 对每个案例运行参考 baseline，用于诊断对照。
+3. 在每个案例上运行你的 `solve(instance)`，并解析返回的 `(r, Q)`。
+4. 检查硬性服务水平约束，计算年成本，并对全体案例求平均。
+
+## 指标
+
+- `combined_score`：`-avg_cost`
+- `valid`：只有所有案例都可行且输出有限时才为 `1.0`
+- `avg_cost`
+- `avg_cost_ratio`：用于诊断的平均 `baseline_cost / candidate_cost`
+
+## 判为无效的情况
+
+- 缺少 `solve(...)`，或函数在评测中报错
+- 返回值无法解析为 `(r, Q)`
+- 任意案例没有达到服务水平目标，或返回了非有限值
+- 任意案例的评测指标出现非有限值
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/baseline/solution.py b/benchmarks/OperationsResearch/PoissonRQServiceLevel/baseline/solution.py
new file mode 100644
index 00000000..9c900226
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/baseline/solution.py
@@ -0,0 +1,30 @@
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            parent_s = str(parent)
+            if parent_s not in sys.path:
+                sys.path.insert(0, parent_s)
+            return
+    benchmark_root = here.parents[1]
+    benchmark_root_s = str(benchmark_root)
+    if benchmark_root_s not in sys.path:
+        sys.path.insert(0, benchmark_root_s)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.PoissonRQServiceLevel.runtime.problem import solve_baseline as solve
+except ModuleNotFoundError:
+    from runtime.problem import solve_baseline as solve
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/agent_files.txt b/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/candidate_destination.txt b/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/constraints.txt b/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/constraints.txt
new file mode 100644
index 00000000..35ca1548
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, or `verification/`.
+Return a finite and feasible solution for every frozen case.
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/eval_command.txt b/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/eval_cwd.txt b/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/initial_program.txt b/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/readonly_files.txt b/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..75978e1f
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/frontier_eval/readonly_files.txt
@@ -0,0 +1,4 @@
+baseline/solution.py
+runtime/problem.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/references/source_manifest.md b/benchmarks/OperationsResearch/PoissonRQServiceLevel/references/source_manifest.md
new file mode 100644
index 00000000..2c8533b5
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/references/source_manifest.md
@@ -0,0 +1,9 @@
+# Source Manifest
+
+- Upstream library: `Stockpyl`
+- Upstream lineage:
+  - `stockpyl.rq.r_q_poisson_exact`
+  - single-echelon `(r,Q)` formulas for Poisson demand used in Stockpyl
+- Data provenance: this benchmark does not use an external dataset. It uses benchmark-local frozen numeric instances defined in `runtime/problem.py`.
+- Transformation path: no preprocessing pipeline; the parameter tables are authored directly in the benchmark runtime.
+- License lineage: Stockpyl is released under the MIT License.
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/runtime/problem.py b/benchmarks/OperationsResearch/PoissonRQServiceLevel/runtime/problem.py
new file mode 100644
index 00000000..94918ae2
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/runtime/problem.py
@@ -0,0 +1,125 @@
+from __future__ import annotations
+
+import math
+from typing import Any
+
+from scipy.stats import norm, poisson
+from stockpyl.eoq import (
+    economic_order_quantity,
+    economic_order_quantity_with_all_units_discounts,
+    economic_order_quantity_with_incremental_discounts,
+)
+from stockpyl.rq import (
+    r_q_cost,
+    r_q_cost_poisson,
+    r_q_eil_approximation,
+    r_q_eoqss_approximation,
+    r_q_loss_function_approximation,
+    r_q_poisson_exact,
+)
+
+CASES = [
+    {
+        "holding_cost": 0.18,
+        "stockout_cost": 0.7,
+        "fixed_cost": 4.0,
+        "demand_mean": 1300.0,
+        "lead_time": 0.05,
+        "target_csl": 0.95
+    },
+    {
+        "holding_cost": 0.25,
+        "stockout_cost": 0.95,
+        "fixed_cost": 6.0,
+        "demand_mean": 900.0,
+        "lead_time": 0.1,
+        "target_csl": 0.95
+    },
+    {
+        "holding_cost": 0.14,
+        "stockout_cost": 0.8,
+        "fixed_cost": 5.0,
+        "demand_mean": 1500.0,
+        "lead_time": 0.04,
+        "target_csl": 0.97
+    },
+    {
+        "holding_cost": 0.22,
+        "stockout_cost": 1.1,
+        "fixed_cost": 7.0,
+        "demand_mean": 700.0,
+        "lead_time": 0.12,
+        "target_csl": 0.95
+    }
+]
+SAMPLE_INSTANCE = CASES[0]
+
+
+def _to_float(value: Any) -> float:
+    value = float(value)
+    if not math.isfinite(value):
+        raise ValueError("non-finite numeric value")
+    return value
+
+
+def _extract_order_quantity(solution: Any) -> float:
+    if isinstance(solution, dict):
+        if "order_quantity" not in solution:
+            raise ValueError("missing order_quantity")
+        return _to_float(solution["order_quantity"])
+    return _to_float(solution)
+
+
+def _extract_rq(solution: Any) -> tuple[int, int]:
+    if isinstance(solution, dict):
+        if "reorder_point" not in solution or "order_quantity" not in solution:
+            raise ValueError("missing reorder_point/order_quantity")
+        r = int(round(_to_float(solution["reorder_point"])))
+        q = int(round(_to_float(solution["order_quantity"])))
+        return r, q
+    if isinstance(solution, (tuple, list)) and len(solution) == 2:
+        r = int(round(_to_float(solution[0])))
+        q = int(round(_to_float(solution[1])))
+        return r, q
+    raise ValueError("solution must be a dict or length-2 tuple/list")
+
+def _service_level(instance: dict[str, float], r: int) -> float:
+    mean_lt = instance["demand_mean"] * instance["lead_time"]
+    return float(poisson.cdf(r, mean_lt))
+
+
+def solve_baseline(instance: dict[str, float]) -> dict[str, float]:
+    r, q, _ = r_q_poisson_exact(
+        instance["holding_cost"],
+        instance["stockout_cost"],
+        instance["fixed_cost"],
+        instance["demand_mean"],
+        instance["lead_time"],
+    )
+    r = int(round(r))
+    q = max(1, int(round(q)))
+    while _service_level(instance, r) < instance["target_csl"]:
+        r += 1
+    return {"reorder_point": r, "order_quantity": q}
+
+
+def evaluate_solution(instance: dict[str, float], solution: Any) -> dict[str, float | bool]:
+    try:
+        r, q = _extract_rq(solution)
+    except Exception:
+        return {"valid": False, "cost": float("inf")}
+    if q <= 0:
+        return {"valid": False, "cost": float("inf")}
+    csl = _service_level(instance, r)
+    if csl < instance["target_csl"]:
+        return {"valid": False, "cost": float("inf")}
+    cost = r_q_cost_poisson(
+        r,
+        q,
+        instance["holding_cost"],
+        instance["stockout_cost"],
+        instance["fixed_cost"],
+        instance["demand_mean"],
+        instance["lead_time"],
+    )
+    return {"valid": True, "cost": float(cost), "reorder_point": int(r), "order_quantity": int(q), "service_level": float(csl)}
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/scripts/init.py b/benchmarks/OperationsResearch/PoissonRQServiceLevel/scripts/init.py
new file mode 100644
index 00000000..f303127d
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/scripts/init.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.PoissonRQServiceLevel.baseline.solution import solve as _baseline_solve
+except ModuleNotFoundError:
+    from baseline.solution import solve as _baseline_solve
+
+
+# EVOLVE-BLOCK-START
+def solve(instance):
+    return _baseline_solve(instance)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.OperationsResearch.PoissonRQServiceLevel.runtime.problem import SAMPLE_INSTANCE
+    except ModuleNotFoundError:
+        from runtime.problem import SAMPLE_INSTANCE
+    print(solve(SAMPLE_INSTANCE))
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/verification/evaluator.py b/benchmarks/OperationsResearch/PoissonRQServiceLevel/verification/evaluator.py
new file mode 100644
index 00000000..e94b49df
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/verification/evaluator.py
@@ -0,0 +1,109 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    repo_root = _repo_root()
+    benchmark_root = _benchmark_root()
+    for p in (repo_root, benchmark_root):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.PoissonRQServiceLevel.runtime.problem import CASES, evaluate_solution
+    from benchmarks.OperationsResearch.PoissonRQServiceLevel.baseline.solution import solve as baseline_solve
+except ModuleNotFoundError:
+    from runtime.problem import CASES, evaluate_solution
+    from baseline.solution import solve as baseline_solve
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "avg_cost": 0.0,
+        "avg_cost_ratio": 0.0,
+        "num_cases": 0.0,
+    }
+    artifacts: dict[str, str] = {}
+
+    program = Path(program_path).expanduser().resolve()
+    namespace = runpy.run_path(str(program), run_name="candidate_program")
+    solve = namespace.get("solve")
+    if not callable(solve):
+        artifacts["error_message"] = "candidate file must define solve(instance)"
+        return metrics, artifacts
+
+    total_cost = 0.0
+    total_ratio = 0.0
+    for idx, case in enumerate(CASES):
+        baseline_solution = baseline_solve(case)
+        baseline_eval = evaluate_solution(case, baseline_solution)
+        if not baseline_eval["valid"]:
+            artifacts["error_message"] = f"internal baseline invalid on case {idx}"
+            return metrics, artifacts
+
+        try:
+            candidate_solution = solve(case)
+            candidate_eval = evaluate_solution(case, candidate_solution)
+        except Exception:
+            artifacts["error_message"] = f"candidate exception on case {idx}\n{traceback.format_exc()}"
+            return metrics, artifacts
+
+        if not candidate_eval["valid"] or not math.isfinite(candidate_eval["cost"]):
+            artifacts["error_message"] = f"candidate infeasible or non-finite on case {idx}"
+            return metrics, artifacts
+
+        ratio = baseline_eval["cost"] / candidate_eval["cost"]
+        total_cost += candidate_eval["cost"]
+        total_ratio += ratio
+
+    n = float(len(CASES))
+    metrics["valid"] = 1.0
+    metrics["num_cases"] = n
+    metrics["avg_cost"] = total_cost / n
+    metrics["avg_cost_ratio"] = total_ratio / n
+    metrics["combined_score"] = -metrics["avg_cost"]
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+
+    metrics, artifacts = evaluate(args.program)
+    metrics_path = Path(args.metrics_out)
+    metrics_path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
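
The scoring arithmetic in `evaluate` above can be isolated from file loading and the Stockpyl cost calls. The sketch below takes per-case baseline and candidate costs and produces the same metric keys. The function name `aggregate` is hypothetical, and rejecting non-positive candidate costs is an assumption added here to keep the ratio well defined; the evaluator shown above does not perform that check.

```python
import math


def aggregate(baseline_costs, candidate_costs):
    """Combine per-case costs into the evaluator's metric dictionary."""
    metrics = {
        "combined_score": -1e18,
        "valid": 0.0,
        "avg_cost": 0.0,
        "avg_cost_ratio": 0.0,
        "num_cases": 0.0,
    }
    if not candidate_costs or len(baseline_costs) != len(candidate_costs):
        return metrics
    # Assumption: a non-finite or non-positive candidate cost invalidates the run.
    if any(not math.isfinite(c) or c <= 0.0 for c in candidate_costs):
        return metrics

    n = float(len(candidate_costs))
    ratios = [b / c for b, c in zip(baseline_costs, candidate_costs)]
    metrics["valid"] = 1.0
    metrics["num_cases"] = n
    metrics["avg_cost"] = sum(candidate_costs) / n
    metrics["avg_cost_ratio"] = sum(ratios) / n
    # Lower average cost is better, so the score is its negation.
    metrics["combined_score"] = -metrics["avg_cost"]
    return metrics


print(aggregate([10.0, 20.0], [5.0, 20.0]))
```

Because `combined_score` depends only on `avg_cost`, `avg_cost_ratio` is purely diagnostic: a candidate that halves the cost on a cheap case and matches the baseline on an expensive one gains less score than the ratio alone suggests.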
diff --git a/benchmarks/OperationsResearch/PoissonRQServiceLevel/verification/requirements.txt b/benchmarks/OperationsResearch/PoissonRQServiceLevel/verification/requirements.txt
new file mode 100644
index 00000000..513852e8
--- /dev/null
+++ b/benchmarks/OperationsResearch/PoissonRQServiceLevel/verification/requirements.txt
@@ -0,0 +1,3 @@
+stockpyl @ git+https://github.com/LarrySnyder/stockpyl.git
+numpy
+scipy
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/README.md b/benchmarks/Robotics/GridPathPlanningWithObstacles/README.md
new file mode 100644
index 00000000..d8a8a9c6
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/README.md
@@ -0,0 +1,55 @@
+# Grid Path Planning with Obstacles
+
+Plan a collision-free path on a frozen 2D occupancy grid with static obstacles and keep path cost low.
+
+## Why This Benchmark Matters
+
+This benchmark mirrors warehouse-like navigation with blocked aisles and shelves. A shorter valid path reduces cycle time, battery use, and congestion.
+
+It is a graph-search problem on a frozen grid map. The evaluator already defines the graph, legality checks, and cost function; you only supply the path.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `plan_path(grid, start, goal)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/Robotics/GridPathPlanningWithObstacles/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/Robotics/GridPathPlanningWithObstacles/verification/evaluator.py \
+  benchmarks/Robotics/GridPathPlanningWithObstacles/scripts/init.py \
+  --metrics-out /tmp/GridPathPlanningWithObstacles_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=Robotics/GridPathPlanningWithObstacles \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/README_zh-CN.md b/benchmarks/Robotics/GridPathPlanningWithObstacles/README_zh-CN.md
new file mode 100644
index 00000000..697b9bf3
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/README_zh-CN.md
@@ -0,0 +1,55 @@
+# 带障碍栅格路径规划
+
+在冻结的二维占据栅格上规划一条无碰撞路径，并尽量降低路径代价。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 对应的是类似仓库巷道的单机器人导航场景。更短且合法的路径能直接减少周期时间、电量消耗和拥堵。
+
+从计算角度看，它就是冻结栅格图上的搜索问题。图结构、合法性检查和代价函数都已经定义好，你只需要给出路径。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`plan_path(grid, start, goal)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/Robotics/GridPathPlanningWithObstacles/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/Robotics/GridPathPlanningWithObstacles/verification/evaluator.py \
+  benchmarks/Robotics/GridPathPlanningWithObstacles/scripts/init.py \
+  --metrics-out /tmp/GridPathPlanningWithObstacles_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=Robotics/GridPathPlanningWithObstacles \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/Task.md b/benchmarks/Robotics/GridPathPlanningWithObstacles/Task.md
new file mode 100644
index 00000000..f2257ab8
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/Task.md
@@ -0,0 +1,50 @@
+# Grid Path Planning with Obstacles Task
+
+## Problem
+
+Plan a collision-free path on a frozen 2D occupancy grid with static obstacles and keep path cost low.
+
+This benchmark mirrors warehouse-like navigation with blocked aisles and shelves. A shorter valid path reduces cycle time, battery use, and congestion.
+
+It is a graph-search problem on a frozen grid map. The evaluator already defines the graph, legality checks, and cost function; you only supply the path.
+
+## What Is Frozen
+
+- The occupancy grid, start cell, goal cell, and path validator in `runtime/problem.py`.
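+- The movement rule: each step must stay in free space and move to a 4-connected neighbor (up, down, left, or right). The validator also accepts a repeated cell, which still adds one to the path cost.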
+- The baseline path and the shortest-path reference cost reported for context.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def plan_path(grid, start, goal):
+    ...
+```
+
+Return a path as a sequence of `(x, y)` cells. A dict with key `path` is also accepted.
+
+## Evaluation
+
+1. Load the frozen grid, start, and goal from `runtime/problem.py`.
+2. Validate the returned path against the start/end cells, adjacency rule, and obstacle mask.
+3. Compute candidate path cost as the number of cells in the path minus one, i.e. the number of steps taken.
+4. Report candidate cost together with baseline and shortest-path reference costs.
+
+## Metrics
+
+- `combined_score`: `-candidate_cost`
+- `valid`: `1.0` only if the path passes validation and its cost is finite and positive
+- `candidate_cost`
+- `baseline_cost`
+- `reference_cost`
+
+## Invalid Submissions
+
+- `plan_path(...)` is missing or crashes
+- The returned value cannot be parsed into a path
+- The path has the wrong start or goal, contains a non-adjacent move, or enters an obstacle
+- The candidate cost is non-finite or not positive
+
+<!-- AI_GENERATED -->
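Note on the submission contract in `Task.md`: a minimal sketch of one way to satisfy it, using plain breadth-first search over 4-connected free cells. This is not the benchmark's baseline (which is greedy best-first). The `GRID` copy below assumes that `S` and `G` are replaced by free cells before the planner is called, as `_parse_grid()` in `runtime/problem.py` does.

```python
from collections import deque

# Copy of the frozen map with S and G replaced by '.' (assumption: this is
# the shape of the grid argument passed to plan_path).
GRID = (
    "####################",
    "#.........####.....#",
    "#..###..#.#.#..##..#",
    "#...#....##.......##",
    "#...#.#.......##..##",
    "#.#.#......#...###.#",
    "##......#.......#..#",
    "#................#.#",
    "#....##.#.......#..#",
    "#.........#....#.#.#",
    "##..#.#.#..##...#..#",
    "#.......##.........#",
    "#..##.......#...##.#",
    "####################",
)
START, GOAL = (1, 1), (18, 12)


def plan_path(grid, start, goal):
    """Breadth-first search; returns a list of (x, y) cells or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        x, y = current
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= ny < len(grid) and 0 <= nx < len(grid[0])
            if inside and grid[ny][nx] != "#" and nxt not in parent:
                parent[nxt] = current
                queue.append(nxt)
    return None


path = plan_path(GRID, START, GOAL)
print(len(path) - 1)  # 28 steps, equal to the Manhattan distance from START to GOAL
```

Because every step costs one, breadth-first search returns a minimum-cost path on this grid, and the result agrees with `REFERENCE_COST = 28` in `runtime/problem.py`.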
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/Task_zh-CN.md b/benchmarks/Robotics/GridPathPlanningWithObstacles/Task_zh-CN.md
new file mode 100644
index 00000000..5b1a9baf
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/Task_zh-CN.md
@@ -0,0 +1,50 @@
+# 带障碍栅格路径规划
+
+## 任务概览
+
+在冻结的二维占据栅格上规划一条无碰撞路径，并尽量降低路径代价。
+
+这个 benchmark 对应的是类似仓库巷道的单机器人导航场景。更短且合法的路径能直接减少周期时间、电量消耗和拥堵。
+
+从计算角度看，它就是冻结栅格图上的搜索问题。图结构、合法性检查和代价函数都已经定义好，你只需要给出路径。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中冻结的占据栅格、起点、终点和路径校验器。
+- 固定的移动规则：每一步都必须在空闲区域内，并且只能在相邻格点之间移动。
+- 用于对照的 baseline 路径和最短路参考代价。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def plan_path(grid, start, goal):
+    ...
+```
+
+返回一条由 `(x, y)` 单元组成的路径序列；也接受带 `path` 字段的字典。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 载入冻结的栅格、起点和终点。
+2. 检查返回路径的起终点、相邻移动规则和障碍掩码。
+3. 按路径长度减一的方式计算候选路径代价。
+4. 输出候选代价，并同时给出 baseline 与最短路参考代价。
+
+## 指标
+
+- `combined_score`：`-candidate_cost`
+- `valid`：只有路径有限且无碰撞时才为 `1.0`
+- `candidate_cost`
+- `baseline_cost`
+- `reference_cost`
+
+## 判为无效的情况
+
+- 缺少 `plan_path(...)`，或函数在评测中报错
+- 返回值无法解析为路径
+- 路径起终点错误、包含非相邻移动，或进入障碍物
+- 任意报告指标出现非有限值
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/baseline/solution.py b/benchmarks/Robotics/GridPathPlanningWithObstacles/baseline/solution.py
new file mode 100644
index 00000000..9c5a264a
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/baseline/solution.py
@@ -0,0 +1,10 @@
+from __future__ import annotations
+
+try:
+    from benchmarks.Robotics.GridPathPlanningWithObstacles.runtime.problem import baseline_plan
+except ModuleNotFoundError:
+    from runtime.problem import baseline_plan
+
+
+def plan_path(grid, start, goal):
+    return baseline_plan()
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/agent_files.txt b/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/candidate_destination.txt b/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/constraints.txt b/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/constraints.txt
new file mode 100644
index 00000000..ea087e19
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.
+Return finite, collision-free paths.
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/eval_command.txt b/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/eval_cwd.txt b/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/initial_program.txt b/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/readonly_files.txt b/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..75978e1f
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/frontier_eval/readonly_files.txt
@@ -0,0 +1,4 @@
+baseline/solution.py
+runtime/problem.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/references/source_manifest.md b/benchmarks/Robotics/GridPathPlanningWithObstacles/references/source_manifest.md
new file mode 100644
index 00000000..666c56bf
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/references/source_manifest.md
@@ -0,0 +1,9 @@
+# Source Manifest
+
+- Upstream algorithm lineage: `motion-planners`
+- Upstream files:
+  - `motion_planners/search.py`
+- Frozen map provenance: locally frozen synthetic occupancy grid with a fixed start and goal.
+- Authenticity note: the algorithm family is traceable to the upstream repository, but the map itself is a benchmark-local synthetic asset rather than a real sensor map or an upstream canonical data file.
+- License lineage: `motion-planners` is released under the MIT License.
+- Provenance class: fixed synthetic grid with official algorithm lineage.
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/runtime/problem.py b/benchmarks/Robotics/GridPathPlanningWithObstacles/runtime/problem.py
new file mode 100644
index 00000000..8d61d217
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/runtime/problem.py
@@ -0,0 +1,172 @@
+from __future__ import annotations
+
+from heapq import heappop, heappush
+import random
+from typing import Any
+
+GRID = (
+    '####################',
+    '#S........####.....#',
+    '#..###..#.#.#..##..#',
+    '#...#....##.......##',
+    '#...#.#.......##..##',
+    '#.#.#......#...###.#',
+    '##......#.......#..#',
+    '#................#.#',
+    '#....##.#.......#..#',
+    '#.........#....#.#.#',
+    '##..#.#.#..##...#..#',
+    '#.......##.........#',
+    '#..##.......#...##G#',
+    '####################',
+)
+BASELINE_KIND = "greedy"
+BASELINE_SEED = 0
+BASELINE_ITERATIONS = 0
+
+
+def _parse_grid() -> tuple[tuple[str, ...], tuple[int, int], tuple[int, int]]:
+    start = None
+    goal = None
+    rows = []
+    for y, row in enumerate(GRID):
+        new_row = []
+        for x, cell in enumerate(row):
+            if cell == 'S':
+                start = (x, y)
+                new_row.append('.')
+            elif cell == 'G':
+                goal = (x, y)
+                new_row.append('.')
+            else:
+                new_row.append(cell)
+        rows.append(''.join(new_row))
+    if start is None or goal is None:
+        raise ValueError('grid must contain both S and G')
+    return tuple(rows), start, goal
+
+
+FREE_GRID, START, GOAL = _parse_grid()
+
+
+def load_instance() -> dict[str, Any]:
+    return {'grid': FREE_GRID, 'start': START, 'goal': GOAL}
+
+
+def _to_cell(value: Any) -> tuple[int, int]:
+    if not isinstance(value, (tuple, list)) or len(value) != 2:
+        raise ValueError('cell must be a length-2 sequence')
+    return int(round(float(value[0]))), int(round(float(value[1])))
+
+
+def _extract_path(value: Any) -> list[tuple[int, int]]:
+    if isinstance(value, dict):
+        if 'path' not in value:
+            raise ValueError('missing path')
+        value = value['path']
+    path = [_to_cell(cell) for cell in value]
+    if not path:
+        raise ValueError('path is empty')
+    return path
+
+
+def is_free(cell: tuple[int, int]) -> bool:
+    x, y = cell
+    return 0 <= y < len(FREE_GRID) and 0 <= x < len(FREE_GRID[0]) and FREE_GRID[y][x] != '#'
+
+
+def neighbors(cell: tuple[int, int]) -> list[tuple[int, int]]:
+    x, y = cell
+    result = []
+    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
+        candidate = (nx, ny)
+        if is_free(candidate):
+            result.append(candidate)
+    return result
+
+
+def _retrace(parent, node):
+    path = []
+    current = node
+    while current is not None:
+        path.append(current)
+        current = parent[current]
+    return path[::-1]
+
+
+def greedy_best_first_path(grid, start, goal):
+    frontier = [(abs(start[0] - goal[0]) + abs(start[1] - goal[1]), start)]
+    parent = {start: None}
+    visited = set()
+    while frontier:
+        _, current = heappop(frontier)
+        if current in visited:
+            continue
+        visited.add(current)
+        if current == goal:
+            return _retrace(parent, current)
+        for nxt in neighbors(current):
+            if nxt in visited or nxt in parent:
+                continue
+            parent[nxt] = current
+            h = abs(nxt[0] - goal[0]) + abs(nxt[1] - goal[1])
+            heappush(frontier, (h, nxt))
+    return None
+
+
+def rrt_path(grid, start, goal, seed, iterations, goal_probability=0.2):
+    rng = random.Random(seed)
+    free_cells = [(x, y) for y, row in enumerate(grid) for x, cell in enumerate(row) if cell != '#']
+    parent = {start: None}
+    nodes = [start]
+    for _ in range(iterations):
+        target = goal if rng.random() < goal_probability else rng.choice(free_cells)
+        nearest = min(nodes, key=lambda cell: abs(cell[0] - target[0]) + abs(cell[1] - target[1]))
+        candidates = neighbors(nearest)
+        rng.shuffle(candidates)
+        candidates.sort(key=lambda cell: abs(cell[0] - target[0]) + abs(cell[1] - target[1]))
+        for nxt in candidates:
+            if nxt in parent:
+                continue
+            parent[nxt] = nearest
+            nodes.append(nxt)
+            if nxt == goal:
+                return _retrace(parent, nxt)
+            break
+    return None
+
+
+def baseline_plan():
+    if BASELINE_KIND == 'greedy':
+        path = greedy_best_first_path(FREE_GRID, START, GOAL)
+    elif BASELINE_KIND == 'rrt':
+        path = rrt_path(FREE_GRID, START, GOAL, BASELINE_SEED, BASELINE_ITERATIONS)
+    else:
+        raise ValueError(f'unsupported baseline kind: {BASELINE_KIND}')
+    if path is None:
+        raise RuntimeError('baseline planner failed to find a path')
+    return path
+
+
+def validate_path(path_value: Any):
+    path = _extract_path(path_value)
+    if path[0] != START:
+        raise ValueError('path does not start at START')
+    if path[-1] != GOAL:
+        raise ValueError('path does not end at GOAL')
+    for cell in path:
+        if not is_free(cell):
+            raise ValueError('path enters an obstacle or leaves the grid')
+    for previous, current in zip(path, path[1:]):
+        dx = abs(previous[0] - current[0])
+        dy = abs(previous[1] - current[1])
+        if dx + dy not in {0, 1}:
+            raise ValueError('path contains a non-adjacent move')
+    return path
+
+
+def path_cost(path_value: Any) -> int:
+    return len(validate_path(path_value)) - 1
+
+
+REFERENCE_COST = 28
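Note on `validate_path` and `path_cost`: the adjacency check accepts `dx + dy` of `0` as well as `1`, so a repeated cell is legal but still adds one to the cost. The sketch below replays that rule on a hypothetical 3x3 toy map. `TOY_GRID` and `toy_path_cost` are illustrative names, not part of the benchmark.

```python
from typing import Sequence

Cell = tuple[int, int]

# Hypothetical toy map, not the frozen benchmark grid.
TOY_GRID = (
    "...",
    ".#.",
    "...",
)


def toy_path_cost(grid: Sequence[str], path: Sequence[Cell], start: Cell, goal: Cell) -> int:
    """Apply the same kinds of checks as the validator: endpoints, free cells, step size."""
    if not path or path[0] != start or path[-1] != goal:
        raise ValueError("wrong start or goal")
    for x, y in path:
        inside = 0 <= y < len(grid) and 0 <= x < len(grid[0])
        if not inside or grid[y][x] == "#":
            raise ValueError("path enters an obstacle or leaves the grid")
    for (px, py), (cx, cy) in zip(path, path[1:]):
        # A step of size 0 (repeat) or 1 (4-connected move) is accepted.
        if abs(px - cx) + abs(py - cy) not in (0, 1):
            raise ValueError("non-adjacent move")
    return len(path) - 1


direct = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
with_wait = [(0, 0), (1, 0), (1, 0), (2, 0), (2, 1), (2, 2)]

print(toy_path_cost(TOY_GRID, direct, (0, 0), (2, 2)))     # 4
print(toy_path_cost(TOY_GRID, with_wait, (0, 0), (2, 2)))  # 5
```

Since `combined_score` is `-candidate_cost`, a submission gains nothing by repeating cells; diagonal moves are rejected because their `dx + dy` is `2`.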
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/scripts/init.py b/benchmarks/Robotics/GridPathPlanningWithObstacles/scripts/init.py
new file mode 100644
index 00000000..965f32f0
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/scripts/init.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.Robotics.GridPathPlanningWithObstacles.baseline.solution import plan_path as _baseline_plan_path
+except ModuleNotFoundError:
+    from baseline.solution import plan_path as _baseline_plan_path
+
+
+# EVOLVE-BLOCK-START
+def plan_path(grid, start, goal):
+    return _baseline_plan_path(grid, start, goal)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.Robotics.GridPathPlanningWithObstacles.runtime.problem import GOAL, FREE_GRID, START, path_cost
+    except ModuleNotFoundError:
+        from runtime.problem import GOAL, FREE_GRID, START, path_cost
+    print(path_cost(plan_path(FREE_GRID, START, GOAL)))
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/verification/evaluator.py b/benchmarks/Robotics/GridPathPlanningWithObstacles/verification/evaluator.py
new file mode 100644
index 00000000..24e25168
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/verification/evaluator.py
@@ -0,0 +1,84 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / 'benchmarks').is_dir() and (parent / 'frontier_eval').is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.Robotics.GridPathPlanningWithObstacles.baseline.solution import plan_path as baseline_plan_path
+    from benchmarks.Robotics.GridPathPlanningWithObstacles.runtime.problem import GOAL, FREE_GRID, REFERENCE_COST, START, path_cost
+except ModuleNotFoundError:
+    from baseline.solution import plan_path as baseline_plan_path
+    from runtime.problem import GOAL, FREE_GRID, REFERENCE_COST, START, path_cost
+
+
+def evaluate(program_path: str):
+    metrics = {
+        'combined_score': -1e18,
+        'valid': 0.0,
+        'candidate_cost': 0.0,
+        'baseline_cost': 0.0,
+        'reference_cost': float(REFERENCE_COST),
+    }
+    artifacts = {}
+    try:
+        # Load the candidate inside the try block so import-time crashes are reported as invalid.
+        namespace = runpy.run_path(str(Path(program_path).expanduser().resolve()), run_name='candidate_program')
+        plan_path_fn = namespace.get('plan_path')
+        if not callable(plan_path_fn):
+            raise TypeError('candidate must define plan_path(grid, start, goal)')
+        baseline_cost = float(path_cost(baseline_plan_path(FREE_GRID, START, GOAL)))
+        candidate_cost = float(path_cost(plan_path_fn(FREE_GRID, START, GOAL)))
+    except Exception:
+        artifacts['error_message'] = traceback.format_exc()
+        return metrics, artifacts
+    if not math.isfinite(candidate_cost) or candidate_cost <= 0:
+        artifacts['error_message'] = 'candidate cost is invalid'
+        return metrics, artifacts
+    metrics['valid'] = 1.0
+    metrics['candidate_cost'] = candidate_cost
+    metrics['baseline_cost'] = baseline_cost
+    metrics['combined_score'] = -candidate_cost
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument('program')
+    parser.add_argument('--metrics-out', default='metrics.json')
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding='utf-8')
+    if artifacts:
+        Path('artifacts.json').write_text(json.dumps(artifacts, indent=2), encoding='utf-8')
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == '__main__':
+    main()
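Note on consuming the evaluator output: the metrics file contains `combined_score`, `valid`, `candidate_cost`, `baseline_cost`, and `reference_cost`. A small sketch of how a caller might derive gap diagnostics from it follows. `summarize` is a hypothetical helper, and the payload values are illustrative, not measured results.

```python
import json


def summarize(metrics: dict) -> dict:
    """Derive simple gap diagnostics from the metric keys the evaluator writes."""
    if metrics.get("valid") != 1.0:
        # Invalid runs keep the sentinel score and zeroed costs, so gaps are meaningless.
        return {"valid": False}
    candidate = metrics["candidate_cost"]
    return {
        "valid": True,
        "gap_to_reference": candidate - metrics["reference_cost"],
        "improvement_over_baseline": metrics["baseline_cost"] - candidate,
    }


# Hypothetical payload shaped like metrics.json; the numbers are made up for illustration.
example = json.loads(
    '{"combined_score": -30.0, "valid": 1.0, "candidate_cost": 30.0,'
    ' "baseline_cost": 34.0, "reference_cost": 28.0}'
)
print(summarize(example))
```

A `gap_to_reference` of zero would mean the candidate matches the shortest-path reference cost.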
diff --git a/benchmarks/Robotics/GridPathPlanningWithObstacles/verification/requirements.txt b/benchmarks/Robotics/GridPathPlanningWithObstacles/verification/requirements.txt
new file mode 100644
index 00000000..8b137891
--- /dev/null
+++ b/benchmarks/Robotics/GridPathPlanningWithObstacles/verification/requirements.txt
@@ -0,0 +1 @@
+
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/README.md b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/README.md
new file mode 100644
index 00000000..9bbe82d2
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/README.md
@@ -0,0 +1,55 @@
+# Multi-Robot Prioritized Planning
+
+Plan collision-free paths for three robots on a frozen grid while minimizing total path cost.
+
+## Why This Benchmark Matters
+
+This benchmark models small-fleet coordination in shared aisles. Good path sets reduce blocking and deadlocks without inflating overall travel cost.
+
+This is small-scale multi-agent path finding: single-agent shortest paths are easy, but coordinating several paths without vertex or edge conflicts is the real challenge.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `plan_paths(grid, starts, goals)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/Robotics/MultiRobotPrioritizedPlanning/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/Robotics/MultiRobotPrioritizedPlanning/verification/evaluator.py \
+  benchmarks/Robotics/MultiRobotPrioritizedPlanning/scripts/init.py \
+  --metrics-out /tmp/MultiRobotPrioritizedPlanning_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=Robotics/MultiRobotPrioritizedPlanning \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/README_zh-CN.md b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/README_zh-CN.md
new file mode 100644
index 00000000..8d37ad17
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/README_zh-CN.md
@@ -0,0 +1,55 @@
+# 多机器人优先级路径规划
+
+在冻结的栅格地图上为 3 台机器人规划无碰撞路径，并尽量降低总路径代价。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 对应的是共享巷道里的小规模机器人协同规划。好的路径集合既要避免互相阻塞和死锁，也不能把总路程拉得太长。
+
+从计算角度看，它是一个小规模 multi-agent path finding 问题。单机器人最短路不难，真正的难点是让多条路径同时避开顶点冲突和边交换冲突。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`plan_paths(grid, starts, goals)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/Robotics/MultiRobotPrioritizedPlanning/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/Robotics/MultiRobotPrioritizedPlanning/verification/evaluator.py \
+  benchmarks/Robotics/MultiRobotPrioritizedPlanning/scripts/init.py \
+  --metrics-out /tmp/MultiRobotPrioritizedPlanning_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=Robotics/MultiRobotPrioritizedPlanning \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/Task.md b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/Task.md
new file mode 100644
index 00000000..7e768763
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/Task.md
@@ -0,0 +1,51 @@
+# Multi-Robot Prioritized Planning Task
+
+## Problem
+
+Plan collision-free paths for three robots on a frozen grid while minimizing total path cost.
+
+This benchmark models small-fleet coordination in shared aisles. Good path sets reduce blocking and deadlocks without inflating overall travel cost.
+
+This is small-scale multi-agent path finding: single-agent shortest paths are easy, but coordinating several paths without vertex or edge conflicts is the real challenge.
+
+## What Is Frozen
+
+- The occupancy grid, the three start-goal pairs, and the collision checker in `runtime/problem.py`.
+- The rule that at each time step a robot may move to an adjacent cell or wait in place, and that no two robots may occupy the same cell (vertex collision) or swap cells along an edge (edge-swap collision).
+- The baseline prioritized planner and the individual-path lower bound reported for context.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def plan_paths(grid, starts, goals):
+    ...
+```
+
+Return a list of paths, one per robot. A dict with key `paths` is also accepted.
+
+## Evaluation
+
+1. Load the frozen grid, starts, and goals from `runtime/problem.py`.
+2. Validate every robot path mechanically, including starts, goals, adjacency-or-wait moves, and obstacle checks.
+3. Check joint execution for vertex collisions and edge-swap collisions across time.
+4. Report total path cost, makespan, baseline total cost, and the lower-bound diagnostic.
+
+## Metrics
+
+- `combined_score`: `-candidate_total_cost`
+- `valid`: `1.0` only if all robot paths are collision-free
+- `candidate_total_cost`
+- `baseline_total_cost`
+- `candidate_makespan`
+- `lower_bound_total_cost`
+
+## Invalid Submissions
+
+- `plan_paths(...)` is missing or crashes
+- The returned value cannot be parsed into one path per robot
+- Any robot path has the wrong start or goal, contains an illegal move, or enters an obstacle
+- The joint path set contains a vertex collision or an edge-swap collision
+
+<!-- AI_GENERATED -->
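Note on the joint collision rule in `Task.md`: a minimal sketch of checking vertex and edge-swap collisions across time. It assumes a robot that has finished its path stays parked on its last cell. The frozen checker in `runtime/problem.py` may handle that case differently, and `find_conflict` is an illustrative name, not the benchmark's API.

```python
from typing import Optional, Sequence

Cell = tuple[int, int]


def find_conflict(paths: Sequence[Sequence[Cell]]) -> Optional[tuple[str, int, int, int]]:
    """Return (kind, robot_i, robot_j, time) for the first conflict, or None."""
    horizon = max(len(p) for p in paths)

    def at(path: Sequence[Cell], t: int) -> Cell:
        # Assumption: a finished robot waits on its final cell.
        return path[min(t, len(path) - 1)]

    for t in range(horizon):
        for i in range(len(paths)):
            for j in range(i + 1, len(paths)):
                a_now, b_now = at(paths[i], t), at(paths[j], t)
                if a_now == b_now:
                    return ("vertex", i, j, t)
                if t + 1 < horizon:
                    a_next, b_next = at(paths[i], t + 1), at(paths[j], t + 1)
                    # Edge swap: the two robots exchange cells in one step.
                    if a_next == b_now and b_next == a_now:
                        return ("edge", i, j, t)
    return None


swap = [[(0, 0), (1, 0)], [(1, 0), (0, 0)]]
meet = [[(0, 0), (1, 0)], [(2, 0), (1, 0)]]
clear = [[(0, 0), (1, 0)], [(0, 1), (1, 1)]]

print(find_conflict(swap))   # ('edge', 0, 1, 0)
print(find_conflict(meet))   # ('vertex', 0, 1, 1)
print(find_conflict(clear))  # None
```

Checking pairs at every time step is quadratic in the number of robots, which is negligible for the three-robot instance here.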
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/Task_zh-CN.md b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/Task_zh-CN.md
new file mode 100644
index 00000000..d3debae8
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/Task_zh-CN.md
@@ -0,0 +1,51 @@
+# 多机器人优先级路径规划
+
+## 任务概览
+
+在冻结的栅格地图上为 3 台机器人规划无碰撞路径，并尽量降低总路径代价。
+
+这个 benchmark 对应的是共享巷道里的小规模机器人协同规划。好的路径集合既要避免互相阻塞和死锁，也不能把总路程拉得太长。
+
+从计算角度看，它是一个小规模 multi-agent path finding 问题。单机器人最短路不难，真正的难点是让多条路径同时避开顶点冲突和边交换冲突。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中冻结的占据栅格、3 组起终点以及冲突检查器。
+- 固定规则：每台机器人每一步可以移动到相邻格点，也可以原地等待，但全体路径必须同时避开顶点冲突和边交换冲突。
+- 用于对照的 baseline prioritized planner，以及各机器人独立最短路的下界。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def plan_paths(grid, starts, goals):
+    ...
+```
+
+返回一个路径列表，每台机器人对应一条；也接受带 `paths` 字段的字典。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 载入冻结的栅格、起点和终点。
+2. 逐条检查机器人路径，包括起终点、相邻/等待移动规则和障碍检查。
+3. 再按时间维度联合检查顶点冲突和边交换冲突。
+4. 输出总路径代价、makespan、baseline 总代价和理论下界诊断。
+
+## 指标
+
+- `combined_score`：`-candidate_total_cost`
+- `valid`：只有所有机器人路径都无冲突时才为 `1.0`
+- `candidate_total_cost`
+- `baseline_total_cost`
+- `candidate_makespan`
+- `lower_bound_total_cost`
+
+## 判为无效的情况
+
+- 缺少 `plan_paths(...)`，或函数在评测中报错
+- 返回值无法解析为每台机器人各一条路径
+- 任意机器人路径起终点错误、包含非法移动，或进入障碍物
+- 联合路径集合中出现顶点冲突或边交换冲突
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/baseline/solution.py b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/baseline/solution.py
new file mode 100644
index 00000000..89b7325a
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/baseline/solution.py
@@ -0,0 +1,10 @@
+from __future__ import annotations
+
+try:
+    from benchmarks.Robotics.MultiRobotPrioritizedPlanning.runtime.problem import baseline_plan_paths
+except ModuleNotFoundError:
+    from runtime.problem import baseline_plan_paths
+
+
+def plan_paths(grid, starts, goals):
+    return baseline_plan_paths()
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/agent_files.txt b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/candidate_destination.txt b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/constraints.txt b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/constraints.txt
new file mode 100644
index 00000000..ea087e19
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.
+Return finite, collision-free paths.
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/eval_command.txt b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/eval_cwd.txt b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/initial_program.txt b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/readonly_files.txt b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..75978e1f
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/frontier_eval/readonly_files.txt
@@ -0,0 +1,4 @@
+baseline/solution.py
+runtime/problem.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/references/source_manifest.md b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/references/source_manifest.md
new file mode 100644
index 00000000..fe0801bc
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/references/source_manifest.md
@@ -0,0 +1,9 @@
+# Source Manifest
+
+- Upstream algorithm lineage: `motion-planners`
+- Upstream files:
+  - `motion_planners/search.py`
+- Frozen map provenance: locally frozen synthetic multi-robot occupancy grid with fixed start and goal assignments for robots `A`, `B`, and `C`.
+- Authenticity note: the search lineage is upstream-authentic, while the map and robot assignments are benchmark-local synthetic fixtures chosen to make priority order materially affect total path cost.
+- License lineage: `motion-planners` is released under the MIT License.
+- Provenance class: fixed synthetic grid with official algorithm lineage.
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/runtime/problem.py b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/runtime/problem.py
new file mode 100644
index 00000000..dd1e1391
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/runtime/problem.py
@@ -0,0 +1,234 @@
+from __future__ import annotations
+
+from collections import deque
+from heapq import heappop, heappush
+from typing import Any
+
+
+GRID = (
+    "##########",
+    "#....#..##",
+    "#..#..#.##",
+    "#..B##.bC#",
+    "#...c#...#",
+    "#......#A#",
+    "#.a..#...#",
+    "##########",
+)
+BASELINE_ORDER = (0, 1, 2)
+
+
+def _parse_grid():
+    start_map: dict[str, tuple[int, int]] = {}
+    goal_map: dict[str, tuple[int, int]] = {}
+    rows = []
+    for y, row in enumerate(GRID):
+        new_row = []
+        for x, cell in enumerate(row):
+            if cell in "ABC":
+                start_map[cell] = (x, y)
+                new_row.append(".")
+            elif cell in "abc":
+                goal_map[cell.upper()] = (x, y)
+                new_row.append(".")
+            else:
+                new_row.append(cell)
+        rows.append("".join(new_row))
+    robot_ids = tuple(sorted(start_map))
+    starts = tuple(start_map[robot_id] for robot_id in robot_ids)
+    goals = tuple(goal_map[robot_id] for robot_id in robot_ids)
+    return tuple(rows), robot_ids, starts, goals
+
+
+FREE_GRID, ROBOT_IDS, STARTS, GOALS = _parse_grid()
+
+
+def load_instance() -> dict[str, Any]:
+    return {"grid": FREE_GRID, "robot_ids": ROBOT_IDS, "starts": STARTS, "goals": GOALS}
+
+
+def _to_cell(value: Any) -> tuple[int, int]:
+    if not isinstance(value, (tuple, list)) or len(value) != 2:
+        raise ValueError("cell must be a length-2 sequence")
+    return int(round(float(value[0]))), int(round(float(value[1])))
+
+
+def _extract_paths(value: Any) -> list[list[tuple[int, int]]]:
+    if isinstance(value, dict):
+        if "paths" not in value:
+            raise ValueError("missing paths")
+        value = value["paths"]
+    paths = []
+    for raw_path in value:
+        path = [_to_cell(cell) for cell in raw_path]
+        if not path:
+            raise ValueError("robot path is empty")
+        paths.append(path)
+    if len(paths) != len(STARTS):
+        raise ValueError("incorrect number of robot paths")
+    return paths
+
+
+def is_free(cell: tuple[int, int]) -> bool:
+    x, y = cell
+    return 0 <= y < len(FREE_GRID) and 0 <= x < len(FREE_GRID[0]) and FREE_GRID[y][x] != "#"
+
+
+def neighbors(cell: tuple[int, int], allow_wait: bool = False) -> list[tuple[int, int]]:
+    x, y = cell
+    result = []
+    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
+    if allow_wait:
+        candidates.append((x, y))
+    for candidate in candidates:
+        if is_free(candidate):
+            result.append(candidate)
+    return result
+
+
+def _retrace(parent: dict[tuple[tuple[int, int], int], tuple[tuple[int, int], int] | None], node: tuple[tuple[int, int], int]) -> list[tuple[int, int]]:
+    path = []
+    current = node
+    while current is not None:
+        path.append(current[0])
+        current = parent[current]
+    return path[::-1]
+
+
+def breadth_first_shortest_path(start: tuple[int, int], goal: tuple[int, int]) -> list[tuple[int, int]] | None:
+    queue = deque([start])
+    parent = {start: None}
+    while queue:
+        current = queue.popleft()
+        if current == goal:
+            out = []
+            node = current
+            while node is not None:
+                out.append(node)
+                node = parent[node]
+            return out[::-1]
+        for nxt in neighbors(current, allow_wait=False):
+            if nxt not in parent:
+                parent[nxt] = current
+                queue.append(nxt)
+    return None
+
+
+def space_time_astar(
+    start: tuple[int, int],
+    goal: tuple[int, int],
+    reserved_vertices: set[tuple[tuple[int, int], int]],
+    reserved_edges: set[tuple[tuple[tuple[int, int], tuple[int, int]], int]],
+    max_time: int = 40,
+) -> list[tuple[int, int]] | None:
+    def heuristic(cell: tuple[int, int]) -> int:
+        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])
+
+    frontier = [(heuristic(start), 0, start)]
+    parent = {(start, 0): None}
+    best_time = {(start, 0): 0}
+    while frontier:
+        _, current_time, current = heappop(frontier)
+        if best_time[(current, current_time)] != current_time:
+            continue
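+        if current == goal and all((goal, t) not in reserved_vertices for t in range(current_time + 1, max_time + 1)):  # goal must stay unreserved after arrival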
+            return _retrace(parent, (current, current_time))
+        if current_time >= max_time:
+            continue
+        for nxt in neighbors(current, allow_wait=True):
+            next_time = current_time + 1
+            if (nxt, next_time) in reserved_vertices:
+                continue
+            if ((current, nxt), next_time) in reserved_edges:
+                continue
+            if ((nxt, current), next_time) in reserved_edges:  # would swap cells head-on with a higher-priority robot
+                continue
+            state = (nxt, next_time)
+            if state in best_time and next_time >= best_time[state]:
+                continue
+            best_time[state] = next_time
+            parent[state] = (current, current_time)
+            heappush(frontier, (next_time + heuristic(nxt), next_time, nxt))
+    return None
+
+
+def reserve_path(
+    path: list[tuple[int, int]],
+    reserved_vertices: set[tuple[tuple[int, int], int]],
+    reserved_edges: set[tuple[tuple[tuple[int, int], tuple[int, int]], int]],
+    horizon: int = 40,
+) -> None:
+    for t, cell in enumerate(path):
+        reserved_vertices.add((cell, t))
+        if t > 0:
+            reserved_edges.add(((path[t - 1], cell), t))
+        if t == len(path) - 1:  # the robot parks on its final cell until the horizon
+            for future in range(t + 1, horizon + 1):
+                reserved_vertices.add((cell, future))
+
+
+def prioritized_plan(order: tuple[int, ...]) -> list[list[tuple[int, int]]] | None:
+    reserved_vertices: set[tuple[tuple[int, int], int]] = set()
+    reserved_edges: set[tuple[tuple[tuple[int, int], tuple[int, int]], int]] = set()
+    paths: list[list[tuple[int, int]] | None] = [None] * len(STARTS)
+    for robot_idx in order:
+        path = space_time_astar(STARTS[robot_idx], GOALS[robot_idx], reserved_vertices, reserved_edges)
+        if path is None:
+            return None
+        paths[robot_idx] = path
+        reserve_path(path, reserved_vertices, reserved_edges)
+    return [path for path in paths if path is not None]
+
+
+def baseline_plan_paths() -> list[list[tuple[int, int]]]:
+    result = prioritized_plan(tuple(BASELINE_ORDER))
+    if result is None:
+        raise RuntimeError("baseline prioritized planner failed")
+    return result
+
+
+def validate_paths(value: Any) -> list[list[tuple[int, int]]]:
+    paths = _extract_paths(value)
+    for idx, path in enumerate(paths):
+        if path[0] != STARTS[idx]:
+            raise ValueError(f"robot {idx} path does not start at the correct cell")
+        if path[-1] != GOALS[idx]:
+            raise ValueError(f"robot {idx} path does not end at the correct cell")
+        for cell in path:
+            if not is_free(cell):
+                raise ValueError("robot path enters an obstacle or leaves the grid")
+        for previous, current in zip(path, path[1:]):
+            dx = abs(previous[0] - current[0])
+            dy = abs(previous[1] - current[1])
+            if dx + dy not in {0, 1}:
+                raise ValueError("robot path contains a non-adjacent move")
+
+    horizon = max(len(path) for path in paths)
+    previous_positions = [path[0] for path in paths]
+    for t in range(horizon):
+        positions = [path[t] if t < len(path) else path[-1] for path in paths]
+        if len(set(positions)) != len(positions):
+            raise ValueError("vertex collision detected")
+        if t > 0:
+            for i in range(len(paths)):
+                for j in range(i + 1, len(paths)):
+                    if previous_positions[i] == positions[j] and previous_positions[j] == positions[i]:
+                        raise ValueError("edge-swap collision detected")
+        previous_positions = positions
+    return paths
+
+
+def total_cost(value: Any) -> int:
+    return sum(len(path) - 1 for path in validate_paths(value))
+
+
+def makespan(value: Any) -> int:
+    return max(len(path) - 1 for path in validate_paths(value))
+
+
+LOWER_BOUND_TOTAL_COST = 0
+for start, goal in zip(STARTS, GOALS):
+    shortest = breadth_first_shortest_path(start, goal)
+    if shortest is None:
+        raise RuntimeError("a robot has no individual shortest path")
+    LOWER_BOUND_TOTAL_COST += len(shortest) - 1
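The collision rules enforced by `validate_paths` (no two robots on the same cell at the same timestep, no two robots swapping cells in a single step, and finished robots parked on their final cell) can be sketched as a standalone checker. This is a minimal illustration rather than the benchmark's validator: `first_collision` and `position` are hypothetical names, and the function reports the first conflict in per-pair order instead of raising `ValueError`.

```python
from __future__ import annotations

from itertools import combinations

Cell = tuple[int, int]


def first_collision(paths: list[list[Cell]]) -> tuple[str, int, int, int] | None:
    """Return (kind, timestep, robot_i, robot_j) for the first conflict, or None."""
    horizon = max(len(path) for path in paths)

    def position(path: list[Cell], t: int) -> Cell:
        # A robot that has finished keeps occupying its final cell.
        return path[t] if t < len(path) else path[-1]

    for t in range(horizon):
        for i, j in combinations(range(len(paths)), 2):
            here_i, here_j = position(paths[i], t), position(paths[j], t)
            if here_i == here_j:
                return ("vertex", t, i, j)
            if t > 0:
                prev_i = position(paths[i], t - 1)
                prev_j = position(paths[j], t - 1)
                # Two robots exchanging cells in one step collide on the edge.
                if prev_i == here_j and prev_j == here_i:
                    return ("edge-swap", t, i, j)
    return None
```

A checker of this shape is handy when debugging a custom planner, because it says which pair of robots conflicts and at which timestep before the frozen validator rejects the plan.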
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/scripts/init.py b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/scripts/init.py
new file mode 100644
index 00000000..155b2c34
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/scripts/init.py
@@ -0,0 +1,47 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.Robotics.MultiRobotPrioritizedPlanning.baseline.solution import plan_paths as _baseline_plan_paths
+except ModuleNotFoundError:
+    from baseline.solution import plan_paths as _baseline_plan_paths
+
+
+# EVOLVE-BLOCK-START
+def plan_paths(grid, starts, goals):
+    return _baseline_plan_paths(grid, starts, goals)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.Robotics.MultiRobotPrioritizedPlanning.runtime.problem import GOALS, FREE_GRID, STARTS, total_cost, validate_paths
+    except ModuleNotFoundError:
+        from runtime.problem import GOALS, FREE_GRID, STARTS, total_cost, validate_paths
+
+    print(total_cost(validate_paths(plan_paths(FREE_GRID, STARTS, GOALS))))
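Because the instance has only three robots, one simple direction for the evolve block is to try every priority order and keep the cheapest valid plan. The sketch below assumes a planner callable that maps an order to a list of paths (or `None` on failure) and a cost callable that raises `ValueError` on invalid plans, which is how `prioritized_plan` and `total_cost` in `runtime/problem.py` behave. The names `best_priority_order`, `plan_fn`, and `cost_fn` are hypothetical.

```python
from __future__ import annotations

from itertools import permutations
from typing import Callable, Sequence

RobotPath = list[tuple[int, int]]


def best_priority_order(
    num_robots: int,
    plan_fn: Callable[[tuple[int, ...]], Sequence[RobotPath] | None],
    cost_fn: Callable[[Sequence[RobotPath]], float],
) -> tuple[tuple[int, ...], Sequence[RobotPath]] | None:
    """Try every priority order and keep the cheapest plan that validates."""
    best = None
    best_cost = float("inf")
    for order in permutations(range(num_robots)):
        paths = plan_fn(order)
        if paths is None:  # some robot found no path under this order
            continue
        try:
            cost = cost_fn(paths)
        except ValueError:  # the plan failed validation, so skip this order
            continue
        if cost < best_cost:  # strict comparison keeps the earliest order on ties
            best, best_cost = (order, paths), cost
    return best
```

With three robots this evaluates only six orders, so exhaustive search is cheap here. The enumeration grows factorially, so a larger instance would need a heuristic ordering instead.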
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/verification/evaluator.py b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/verification/evaluator.py
new file mode 100644
index 00000000..9f7748e7
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/verification/evaluator.py
@@ -0,0 +1,90 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.Robotics.MultiRobotPrioritizedPlanning.baseline.solution import plan_paths as baseline_plan_paths_fn
+    from benchmarks.Robotics.MultiRobotPrioritizedPlanning.runtime.problem import GOALS, FREE_GRID, LOWER_BOUND_TOTAL_COST, STARTS, makespan, total_cost, validate_paths
+except ModuleNotFoundError:
+    from baseline.solution import plan_paths as baseline_plan_paths_fn
+    from runtime.problem import GOALS, FREE_GRID, LOWER_BOUND_TOTAL_COST, STARTS, makespan, total_cost, validate_paths
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_total_cost": 0.0,
+        "baseline_total_cost": 0.0,
+        "candidate_makespan": 0.0,
+        "lower_bound_total_cost": float(LOWER_BOUND_TOTAL_COST),
+    }
+    artifacts: dict[str, str] = {}
+    try:
+        # Load the candidate inside the try block so import-time crashes are reported as invalid.
+        namespace = runpy.run_path(str(Path(program_path).expanduser().resolve()), run_name="candidate_program")
+        plan_paths_fn = namespace.get("plan_paths")
+        if not callable(plan_paths_fn):
+            raise TypeError("candidate must define plan_paths(grid, starts, goals)")
+        baseline_paths = validate_paths(baseline_plan_paths_fn(FREE_GRID, STARTS, GOALS))
+        candidate_paths = validate_paths(plan_paths_fn(FREE_GRID, STARTS, GOALS))
+        baseline_total_cost = float(total_cost(baseline_paths))
+        candidate_total_cost = float(total_cost(candidate_paths))
+        candidate_makespan = float(makespan(candidate_paths))
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+    if not math.isfinite(candidate_total_cost) or candidate_total_cost <= 0:
+        artifacts["error_message"] = "candidate total cost is invalid"
+        return metrics, artifacts
+    metrics["valid"] = 1.0
+    metrics["candidate_total_cost"] = candidate_total_cost
+    metrics["baseline_total_cost"] = baseline_total_cost
+    metrics["candidate_makespan"] = candidate_makespan
+    metrics["combined_score"] = -candidate_total_cost
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/Robotics/MultiRobotPrioritizedPlanning/verification/requirements.txt b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/verification/requirements.txt
new file mode 100644
index 00000000..8b137891
--- /dev/null
+++ b/benchmarks/Robotics/MultiRobotPrioritizedPlanning/verification/requirements.txt
@@ -0,0 +1 @@
+
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/README.md b/benchmarks/Robotics/NarrowPassagePlanning/README.md
new file mode 100644
index 00000000..64ceba1d
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/README.md
@@ -0,0 +1,55 @@
+# Narrow-Passage Planning
+
+Plan a collision-free path through a frozen narrow-passage occupancy grid and keep path cost close to optimal.
+
+## Why This Benchmark Matters
+
+Narrow passages are a classic planning failure mode: a planner that looks reasonable in open space can still fail badly at doorways, single-cell corridors, and other bottlenecks.
+
+The problem remains plain graph search, but the topology forces every feasible path through a single thin corridor, so many locally plausible heuristics waste search effort or propose shortcuts that cut through obstacles.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `plan_path(grid, start, goal)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/Robotics/NarrowPassagePlanning/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/Robotics/NarrowPassagePlanning/verification/evaluator.py \
+  benchmarks/Robotics/NarrowPassagePlanning/scripts/init.py \
+  --metrics-out /tmp/NarrowPassagePlanning_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=Robotics/NarrowPassagePlanning \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/README_zh-CN.md b/benchmarks/Robotics/NarrowPassagePlanning/README_zh-CN.md
new file mode 100644
index 00000000..250e4e8c
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/README_zh-CN.md
@@ -0,0 +1,55 @@
+# 狭窄通道路径规划
+
+在冻结的狭窄通道占据栅格上规划一条无碰撞路径，并尽量接近最优路径代价。
+
+## 这个 Benchmark 在测什么
+
+狭窄通道是规划算法的经典失效模式。一个在开阔空间里看起来很正常的规划器，到了门洞、单格走廊或其他瓶颈位置时，可能会明显失效。
+
+它本质上还是图搜索，但拓扑结构会把可行路径强行压进一条很薄的通道里，所以很多局部上看着合理的启发式会浪费搜索，甚至试图走非法捷径。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`plan_path(grid, start, goal)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/Robotics/NarrowPassagePlanning/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/Robotics/NarrowPassagePlanning/verification/evaluator.py \
+  benchmarks/Robotics/NarrowPassagePlanning/scripts/init.py \
+  --metrics-out /tmp/NarrowPassagePlanning_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=Robotics/NarrowPassagePlanning \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/Task.md b/benchmarks/Robotics/NarrowPassagePlanning/Task.md
new file mode 100644
index 00000000..4d2a5e1a
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/Task.md
@@ -0,0 +1,50 @@
+# Narrow-Passage Planning Task
+
+## Problem
+
+Plan a collision-free path through a frozen narrow-passage occupancy grid and keep path cost close to optimal.
+
+Narrow passages are a classic planning failure mode: a planner that looks reasonable in open space can still fail badly at doorways, single-cell corridors, and other bottlenecks.
+
+The problem remains plain graph search, but the topology forces every feasible path through a single thin corridor, so many locally plausible heuristics waste search effort or propose shortcuts that cut through obstacles.
+
+## What Is Frozen
+
+- The narrow-passage occupancy grid, start cell, goal cell, and validator in `runtime/problem.py`.
+- The movement rule: each step must stay in free space and either move to a 4-connected neighbor (no diagonals) or remain in place.
+- The baseline path and the shortest-path reference cost reported for context.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def plan_path(grid, start, goal):
+    ...
+```
+
+Return the path as a sequence of `(x, y)` cells that begins at `start` and ends at `goal`. A dict with key `path` holding such a sequence is also accepted.
+
+## Evaluation
+
+1. Load the frozen grid, start, and goal from `runtime/problem.py`.
+2. Validate the returned path against the start/end cells, adjacency rule, and obstacle mask.
+3. Compute candidate path cost as the number of cells in the path minus one, so a repeated (waiting) cell still costs one step.
+4. Report candidate cost together with baseline and shortest-path reference costs.
+
+## Metrics
+
+- `combined_score`: `-candidate_cost`
+- `valid`: `1.0` only if the path passes validation (correct start and goal, adjacent moves, free cells only) and its cost is finite
+- `candidate_cost`
+- `baseline_cost`
+- `reference_cost`
+
+## Invalid Submissions
+
+- `plan_path(...)` is missing or crashes
+- The returned value cannot be parsed into a path
+- The path has the wrong start or goal, contains a non-adjacent move, or enters an obstacle
+- Any reported metric becomes non-finite
+
+<!-- AI_GENERATED -->
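A submission that satisfies the contract above can be as small as a breadth-first search, which returns a minimum-step path on an unweighted 4-connected grid. The sketch below is one possible candidate, not the reference solution. It assumes the grid is a sequence of strings with `#` marking obstacles, as in `runtime/problem.py`, and that raising `ValueError` is an acceptable way to signal an unreachable goal.

```python
from __future__ import annotations

from collections import deque


def plan_path(grid, start, goal):
    """Breadth-first search over 4-connected free cells; returns a list of (x, y)."""
    start, goal = tuple(start), tuple(goal)
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        x, y = current
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= ny < len(grid) and 0 <= nx < len(grid[ny])
            if inside and grid[ny][nx] != "#" and nxt not in parent:
                parent[nxt] = current
                queue.append(nxt)
    raise ValueError("goal is unreachable from start")
```

Because every step costs one, the first time BFS dequeues the goal it has found a path with the fewest steps, so no heuristic is needed to reach the reference cost on a grid of this size.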
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/Task_zh-CN.md b/benchmarks/Robotics/NarrowPassagePlanning/Task_zh-CN.md
new file mode 100644
index 00000000..93c2ff3c
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/Task_zh-CN.md
@@ -0,0 +1,50 @@
+# 狭窄通道路径规划
+
+## 任务概览
+
+在冻结的狭窄通道占据栅格上规划一条无碰撞路径，并尽量接近最优路径代价。
+
+狭窄通道是规划算法的经典失效模式。一个在开阔空间里看起来很正常的规划器，到了门洞、单格走廊或其他瓶颈位置时，可能会明显失效。
+
+它本质上还是图搜索，但拓扑结构会把可行路径强行压进一条很薄的通道里，所以很多局部上看着合理的启发式会浪费搜索，甚至试图走非法捷径。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中冻结的狭窄通道占据栅格、起点、终点和校验器。
+- 固定的移动规则：每一步都必须在空闲区域内，并且只能在相邻格点之间移动。
+- 用于对照的 baseline 路径和最短路参考代价。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def plan_path(grid, start, goal):
+    ...
+```
+
+返回一条由 `(x, y)` 单元组成的路径序列；也接受带 `path` 字段的字典。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 载入冻结的栅格、起点和终点。
+2. 检查返回路径的起终点、相邻移动规则和障碍掩码。
+3. 按路径长度减一的方式计算候选路径代价。
+4. 输出候选代价，并同时给出 baseline 与最短路参考代价。
+
+## 指标
+
+- `combined_score`：`-candidate_cost`
+- `valid`：只有路径有限且无碰撞时才为 `1.0`
+- `candidate_cost`
+- `baseline_cost`
+- `reference_cost`
+
+## 判为无效的情况
+
+- 缺少 `plan_path(...)`，或函数在评测中报错
+- 返回值无法解析为路径
+- 路径起终点错误、包含非相邻移动，或进入障碍物
+- 任意报告指标出现非有限值
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/baseline/solution.py b/benchmarks/Robotics/NarrowPassagePlanning/baseline/solution.py
new file mode 100644
index 00000000..1c73498e
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/baseline/solution.py
@@ -0,0 +1,10 @@
+from __future__ import annotations
+
+try:
+    from benchmarks.Robotics.NarrowPassagePlanning.runtime.problem import baseline_plan
+except ModuleNotFoundError:
+    from runtime.problem import baseline_plan
+
+
+def plan_path(grid, start, goal):
+    return baseline_plan()
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/agent_files.txt b/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/candidate_destination.txt b/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/constraints.txt b/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/constraints.txt
new file mode 100644
index 00000000..ea087e19
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.
+Return finite, collision-free paths.
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/eval_command.txt b/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/eval_cwd.txt b/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/initial_program.txt b/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/readonly_files.txt b/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..75978e1f
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/frontier_eval/readonly_files.txt
@@ -0,0 +1,4 @@
+baseline/solution.py
+runtime/problem.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/references/source_manifest.md b/benchmarks/Robotics/NarrowPassagePlanning/references/source_manifest.md
new file mode 100644
index 00000000..9acd900d
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/references/source_manifest.md
@@ -0,0 +1,10 @@
+# Source Manifest
+
+- Upstream algorithm lineage: `motion-planners`
+- Upstream files:
+  - `motion_planners/rrt.py`
+  - `motion_planners/search.py`
+- Frozen map provenance: locally frozen synthetic narrow-passage occupancy grid with a fixed start and goal.
+- Authenticity note: the planner lineage is upstream-authentic, while the map is a benchmark-local synthetic grid deliberately chosen to stress passage-finding behavior.
+- License lineage: `motion-planners` is released under the MIT License.
+- Provenance class: fixed synthetic grid with official algorithm lineage.
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/runtime/problem.py b/benchmarks/Robotics/NarrowPassagePlanning/runtime/problem.py
new file mode 100644
index 00000000..79d96da1
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/runtime/problem.py
@@ -0,0 +1,146 @@
+from __future__ import annotations
+
+from heapq import heappop, heappush
+import random
+from typing import Any
+
+
+GRID = (
+    "############################",
+    "#S..........#............G.#",
+    "#...........#..............#",
+    "#...........#..............#",
+    "#...........#..............#",
+    "#...........#..............#",
+    "#..........................#",
+    "#...........#..............#",
+    "#...........#..............#",
+    "#...........#..............#",
+    "#...........#..............#",
+    "############################",
+)
+BASELINE_KIND = "rrt"
+BASELINE_SEED = 3
+BASELINE_ITERATIONS = 5000
+
+
+def _parse_grid() -> tuple[tuple[str, ...], tuple[int, int], tuple[int, int]]:
+    start = None
+    goal = None
+    rows = []
+    for y, row in enumerate(GRID):
+        new_row = []
+        for x, cell in enumerate(row):
+            if cell == "S":
+                start = (x, y)
+                new_row.append(".")
+            elif cell == "G":
+                goal = (x, y)
+                new_row.append(".")
+            else:
+                new_row.append(cell)
+        rows.append("".join(new_row))
+    if start is None or goal is None:
+        raise ValueError("grid must contain both S and G")
+    return tuple(rows), start, goal
+
+
+FREE_GRID, START, GOAL = _parse_grid()
+
+
+def load_instance() -> dict[str, Any]:
+    return {"grid": FREE_GRID, "start": START, "goal": GOAL}
+
+
+def _to_cell(value: Any) -> tuple[int, int]:
+    if not isinstance(value, (tuple, list)) or len(value) != 2:
+        raise ValueError("cell must be a length-2 sequence")
+    return int(round(float(value[0]))), int(round(float(value[1])))
+
+
+def _extract_path(value: Any) -> list[tuple[int, int]]:
+    if isinstance(value, dict):
+        if "path" not in value:
+            raise ValueError("missing path")
+        value = value["path"]
+    path = [_to_cell(cell) for cell in value]
+    if not path:
+        raise ValueError("path is empty")
+    return path
+
+
+def is_free(cell: tuple[int, int]) -> bool:
+    x, y = cell
+    return 0 <= y < len(FREE_GRID) and 0 <= x < len(FREE_GRID[0]) and FREE_GRID[y][x] != "#"
+
+
+def neighbors(cell: tuple[int, int]) -> list[tuple[int, int]]:
+    x, y = cell
+    result = []
+    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
+        candidate = (nx, ny)
+        if is_free(candidate):
+            result.append(candidate)
+    return result
+
+
+def _retrace(parent: dict[tuple[int, int], tuple[int, int] | None], node: tuple[int, int]) -> list[tuple[int, int]]:
+    path = []
+    current = node
+    while current is not None:
+        path.append(current)
+        current = parent[current]
+    return path[::-1]
+
+
+def rrt_path(grid: tuple[str, ...], start: tuple[int, int], goal: tuple[int, int], seed: int, iterations: int, goal_probability: float = 0.2) -> list[tuple[int, int]] | None:
+    rng = random.Random(seed)
+    free_cells = [(x, y) for y, row in enumerate(grid) for x, cell in enumerate(row) if cell != "#"]
+    parent = {start: None}
+    nodes = [start]
+    for _ in range(iterations):
+        target = goal if rng.random() < goal_probability else rng.choice(free_cells)
+        nearest = min(nodes, key=lambda cell: abs(cell[0] - target[0]) + abs(cell[1] - target[1]))
+        candidates = neighbors(nearest)
+        rng.shuffle(candidates)  # randomize tie-breaking; the sort below is stable
+        candidates.sort(key=lambda cell: abs(cell[0] - target[0]) + abs(cell[1] - target[1]))
+        for nxt in candidates:
+            if nxt in parent:
+                continue
+            parent[nxt] = nearest
+            nodes.append(nxt)
+            if nxt == goal:
+                return _retrace(parent, nxt)
+            break
+    return None
+
+
+def baseline_plan() -> list[tuple[int, int]]:
+    path = rrt_path(FREE_GRID, START, GOAL, BASELINE_SEED, BASELINE_ITERATIONS)
+    if path is None:
+        raise RuntimeError("baseline planner failed to find a path")
+    return path
+
+
+def validate_path(path_value: Any) -> list[tuple[int, int]]:
+    path = _extract_path(path_value)
+    if path[0] != START:
+        raise ValueError("path does not start at START")
+    if path[-1] != GOAL:
+        raise ValueError("path does not end at GOAL")
+    for cell in path:
+        if not is_free(cell):
+            raise ValueError("path enters an obstacle or leaves the grid")
+    for previous, current in zip(path, path[1:]):
+        dx = abs(previous[0] - current[0])
+        dy = abs(previous[1] - current[1])
+        if dx + dy not in {0, 1}:
+            raise ValueError("path contains a non-adjacent move")
+    return path
+
+
+def path_cost(path_value: Any) -> int:
+    return len(validate_path(path_value)) - 1
+
+
+REFERENCE_COST = 34
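The hard-coded `REFERENCE_COST = 34` can be cross-checked by hand. The wall at column 12 has a single gap at row 6, both sides of the wall are open rectangles, and so the shortest path costs the Manhattan distance from the start to the gap plus the Manhattan distance from the gap to the goal. The sketch below rebuilds the frozen layout programmatically. `cost_through_wall` and `manhattan` are hypothetical helper names, and the shortcut only holds for this kind of single-wall layout.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def cost_through_wall(grid, start, goal, wall_x):
    """Cheapest start-to-goal cost when every path must cross column wall_x."""
    # Interior rows only: the top and bottom border rows are solid wall.
    gaps = [(wall_x, y) for y in range(1, len(grid) - 1) if grid[y][wall_x] != "#"]
    return min(manhattan(start, gap) + manhattan(gap, goal) for gap in gaps)


# Grid rebuilt from the frozen layout: 28 columns, wall at x = 12, gap at y = 6.
GRID = tuple(
    "#" * 28 if y in (0, 11) else "#" + "." * 11 + ("." if y == 6 else "#") + "." * 14 + "#"
    for y in range(12)
)

print(cost_through_wall(GRID, (1, 1), (25, 1), wall_x=12))  # 16 + 18 = 34
```

The start `(1, 1)` is 16 steps from the gap `(12, 6)` and the gap is 18 steps from the goal `(25, 1)`, which matches the constant.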
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/scripts/init.py b/benchmarks/Robotics/NarrowPassagePlanning/scripts/init.py
new file mode 100644
index 00000000..51edc79a
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/scripts/init.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.Robotics.NarrowPassagePlanning.baseline.solution import plan_path as _baseline_plan_path
+except ModuleNotFoundError:
+    from baseline.solution import plan_path as _baseline_plan_path
+
+
+# EVOLVE-BLOCK-START
+def plan_path(grid, start, goal):
+    return _baseline_plan_path(grid, start, goal)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.Robotics.NarrowPassagePlanning.runtime.problem import GOAL, FREE_GRID, START, path_cost
+    except ModuleNotFoundError:
+        from runtime.problem import GOAL, FREE_GRID, START, path_cost
+    print(path_cost(plan_path(FREE_GRID, START, GOAL)))
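Reviewer note on `scripts/init.py` above: the EVOLVE block only delegates to the baseline planner. A minimal sketch of a deterministic replacement is a breadth-first search over 4-neighbor moves, which matches the adjacency rule enforced by `validate_path`. The grid representation used here (a collection of free `(x, y)` cells) is an assumption for illustration; the actual `FREE_GRID` format is not shown in this diff, and `bfs_plan_path` is a hypothetical name.

```python
from collections import deque


def bfs_plan_path(free_cells, start, goal):
    """Return a shortest 4-neighbor path over free (x, y) cells, or None if unreachable."""
    free = set(free_cells)  # assumption: the grid is given as a collection of free cells
    if start not in free or goal not in free:
        return None
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in free and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None


# Hypothetical 3x3 grid with a wall at (1, 0) and (1, 1); the only gap is (1, 2).
demo_free = {(x, y) for x in range(3) for y in range(3)} - {(1, 0), (1, 1)}
demo_path = bfs_plan_path(demo_free, (0, 0), (2, 0))
print(len(demo_path) - 1)  # number of moves, analogous to path_cost
```

Because BFS expands cells in order of distance from the start, the first time it reaches the goal it has found a path with the fewest moves under this cost model.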
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/verification/evaluator.py b/benchmarks/Robotics/NarrowPassagePlanning/verification/evaluator.py
new file mode 100644
index 00000000..b568a510
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/verification/evaluator.py
@@ -0,0 +1,85 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.Robotics.NarrowPassagePlanning.baseline.solution import plan_path as baseline_plan_path
+    from benchmarks.Robotics.NarrowPassagePlanning.runtime.problem import GOAL, FREE_GRID, REFERENCE_COST, START, path_cost
+except ModuleNotFoundError:
+    from baseline.solution import plan_path as baseline_plan_path
+    from runtime.problem import GOAL, FREE_GRID, REFERENCE_COST, START, path_cost
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_cost": 0.0,
+        "baseline_cost": 0.0,
+        "reference_cost": float(REFERENCE_COST),
+    }
+    artifacts: dict[str, str] = {}
+    namespace = runpy.run_path(str(Path(program_path).expanduser().resolve()), run_name="candidate_program")
+    plan_path_fn = namespace.get("plan_path")
+    if not callable(plan_path_fn):
+        artifacts["error_message"] = "candidate must define plan_path(grid, start, goal)"
+        return metrics, artifacts
+    try:
+        baseline_cost = float(path_cost(baseline_plan_path(FREE_GRID, START, GOAL)))
+        candidate_cost = float(path_cost(plan_path_fn(FREE_GRID, START, GOAL)))
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+    if not math.isfinite(candidate_cost) or candidate_cost <= 0:
+        artifacts["error_message"] = "candidate cost is invalid"
+        return metrics, artifacts
+    metrics["valid"] = 1.0
+    metrics["candidate_cost"] = candidate_cost
+    metrics["baseline_cost"] = baseline_cost
+    metrics["combined_score"] = -candidate_cost
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/Robotics/NarrowPassagePlanning/verification/requirements.txt b/benchmarks/Robotics/NarrowPassagePlanning/verification/requirements.txt
new file mode 100644
index 00000000..8b137891
--- /dev/null
+++ b/benchmarks/Robotics/NarrowPassagePlanning/verification/requirements.txt
@@ -0,0 +1 @@
+
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/README.md b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/README.md
new file mode 100644
index 00000000..24e86cdc
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/README.md
@@ -0,0 +1,55 @@
+# Bridge Topology Optimization
+
+Update densities inside a frozen bridge-like pyMOTO topology-optimization loop and minimize final compliance.
+
+## Why This Benchmark Matters
+
+This benchmark models a bridge-like layout problem with a prescribed solid deck. Part of the structure is fixed up front, so the remaining material must form an efficient load path under a hard budget.
+
+You do not draw the final structure in one shot. You design the inner update rule of a fixed PDE-constrained optimizer, and every step must remain feasible.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `update_density(density, sensitivity, state)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/StructuralOptimization/BridgeTopologyOptimization/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/StructuralOptimization/BridgeTopologyOptimization/verification/evaluator.py \
+  benchmarks/StructuralOptimization/BridgeTopologyOptimization/scripts/init.py \
+  --metrics-out /tmp/BridgeTopologyOptimization_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=StructuralOptimization/BridgeTopologyOptimization \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/README_zh-CN.md b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/README_zh-CN.md
new file mode 100644
index 00000000..ec1fd806
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/README_zh-CN.md
@@ -0,0 +1,55 @@
+# 桥梁拓扑优化
+
+在冻结的桥梁风格 pyMOTO 拓扑优化循环里更新密度场，并最小化最终柔顺度。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 对应的是带有预设实心桥面的桥梁结构布局问题。部分结构一开始就是固定的，所以剩余材料必须在严格预算下形成有效的传力路径。
+
+你不是一次性画出最终结构，而是在一个冻结的 PDE 约束优化器内部设计更新规则，并且每一步都必须保持可行。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`update_density(density, sensitivity, state)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/StructuralOptimization/BridgeTopologyOptimization/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/StructuralOptimization/BridgeTopologyOptimization/verification/evaluator.py \
+  benchmarks/StructuralOptimization/BridgeTopologyOptimization/scripts/init.py \
+  --metrics-out /tmp/BridgeTopologyOptimization_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=StructuralOptimization/BridgeTopologyOptimization \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/Task.md b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/Task.md
new file mode 100644
index 00000000..dc574ce4
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/Task.md
@@ -0,0 +1,53 @@
+# Bridge Topology Optimization Task
+
+## Problem
+
+Update densities inside a frozen bridge-like pyMOTO topology-optimization loop and minimize final compliance.
+
+This benchmark models a bridge-like layout problem with a prescribed solid deck. Part of the structure is fixed up front, so the remaining material must form an efficient load path under a hard budget.
+
+You do not draw the final structure in one shot. You design the inner update rule of a fixed PDE-constrained optimizer, and every step must remain feasible.
+
+## What Is Frozen
+
+- The pyMOTO finite-element model, geometry, loads, passive masks, and SIMP settings in `runtime/problem.py`.
+- The material budget, minimum density, move limit, and 30-step optimization horizon.
+- The compliance objective and the feasibility validator for each intermediate density update.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def update_density(density, sensitivity, state):
+    ...
+```
+
+`density` is the current density vector, `sensitivity` is the current compliance sensitivity, and `state` includes keys such as `iteration`, `domain_shape`, `volume_fraction`, `target_density_sum`, `minimum_density`, `move_limit`, `current_compliance`, `history`, `passive_solid_mask`, and `passive_void_mask`.
+
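+Return the next feasible density vector, or a dict with key `density`. If you want a projection helper, you may import `project_density(raw_density, previous_density, state)` from `runtime.problem`; it clips the raw vector to the per-step bounds and shifts it so the sum matches `target_density_sum`.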
+
+## Evaluation
+
+1. Build the frozen pyMOTO model from `runtime/problem.py`.
+2. Run the fixed 30-iteration optimization loop with your `update_density(...)` callback.
+3. Validate every intermediate density update against bounds, move limits, masks, and volume conservation.
+4. Report final candidate compliance and compare it with the optimality-criteria (OC) baseline for context.
+
+## Metrics
+
+- `combined_score`: `-candidate_compliance`
+- `valid`: `1.0` only if every update is finite and feasible
+- `candidate_compliance`
+- `baseline_compliance`
+- `final_volume_fraction`
+- `volume_fraction_error`
+
+## Invalid Submissions
+
+- `update_density(...)` is missing or crashes
+- Any proposed density contains non-finite values
+- Any update violates bounds, move limits, or passive masks by more than `1e-6`, or misses the target density sum by more than `1e-4`
+- The pyMOTO solve fails during evaluation
+
+<!-- AI_GENERATED -->
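Reviewer note on the submission contract in `Task.md` above: one way to keep every step feasible is to take an unconstrained step and then project it back onto the box and sum constraints. The sketch below is a standalone pure-Python illustration of that idea, loosely mirroring the bisection in `runtime/problem.py`. The names `project_to_sum` and `sketch_update` and the step size `0.05` are hypothetical, and passive masks are deliberately left out for brevity; a real submission would more likely call `project_density`, which handles them.

```python
def project_to_sum(raw, lower, upper, target_sum, iterations=80):
    """Shift raw by a scalar lam, clip to [lower, upper], and bisect lam so the sum hits target_sum."""
    if sum(lower) > target_sum + 1e-9 or sum(upper) < target_sum - 1e-9:
        raise ValueError("target sum is infeasible under the given bounds")

    def clipped(lam):
        return [min(max(r - lam, lo), hi) for r, lo, hi in zip(raw, lower, upper)]

    lam_low = min(r - hi for r, hi in zip(raw, upper))   # every entry sits at its upper bound
    lam_high = max(r - lo for r, lo in zip(raw, lower))  # every entry sits at its lower bound
    for _ in range(iterations):
        lam = 0.5 * (lam_low + lam_high)
        if sum(clipped(lam)) > target_sum:
            lam_low = lam
        else:
            lam_high = lam
    return clipped(lam_high)


def sketch_update(density, sensitivity, state, step=0.05):
    """Hypothetical update rule: normalized descent step, then projection (passive masks omitted)."""
    move = state["move_limit"]
    rho_min = state["minimum_density"]
    lower = [max(rho_min, d - move) for d in density]
    upper = [min(1.0, d + move) for d in density]
    scale = max(abs(s) for s in sensitivity) or 1.0  # avoid dividing by zero
    raw = [d - step * s / scale for d, s in zip(density, sensitivity)]
    return project_to_sum(raw, lower, upper, state["target_density_sum"])


# Toy four-element example; the numbers are made up for illustration.
demo_state = {"move_limit": 0.2, "minimum_density": 0.001, "target_density_sum": 1.8}
demo_next = sketch_update([0.45, 0.45, 0.45, 0.45], [-4.0, -1.0, -2.0, -0.5], demo_state)
print(round(sum(demo_next), 6))
```

Elements with the most negative compliance sensitivity gain material, while the scalar shift removes the same total amount elsewhere so the density sum stays on target.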
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/Task_zh-CN.md b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/Task_zh-CN.md
new file mode 100644
index 00000000..eec83ed3
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/Task_zh-CN.md
@@ -0,0 +1,53 @@
+# 桥式结构拓扑优化
+
+## 任务概览
+
+在冻结的桥式 pyMOTO 拓扑优化循环中更新密度场，并尽量降低最终柔顺度。
+
+这个 benchmark 对应的是一个带预设实心桥面的桥式布局问题。结构的一部分一开始就被固定了，所以剩余材料必须在严格预算下尽量形成高效受力路径。
+
+你不是一次性画出最终结构，而是在一个固定的 PDE 约束优化器里设计内部更新规则，并且每一步都必须保持可行。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中的 pyMOTO 有限元模型、几何、载荷、被动区域和 SIMP 设置。
+- 材料体积分数预算、最小密度、单步 move limit，以及固定的 30 次迭代。
+- 柔顺度目标和每一步密度更新的可行性校验逻辑。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def update_density(density, sensitivity, state):
+    ...
+```
+
+`density` 是当前密度向量，`sensitivity` 是当前柔顺度灵敏度，`state` 中包含 `iteration`、`domain_shape`、`volume_fraction`、`target_density_sum`、`minimum_density`、`move_limit`、`current_compliance`、`history`、`passive_solid_mask`、`passive_void_mask` 等字段。
+
+返回下一步可行的密度向量；也接受带 `density` 字段的字典。如果需要投影辅助函数，可以从 `runtime.problem` 导入 `project_density`。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 构建冻结的 pyMOTO 模型。
+2. 在固定的 30 次迭代优化循环里调用你的 `update_density(...)`。
+3. 对每一步候选密度执行边界、move limit、被动区域和体积守恒校验。
+4. 输出最终候选柔顺度，并同时给出 OC 风格基线作参考。
+
+## 指标
+
+- `combined_score`：`-candidate_compliance`
+- `valid`：只有每一步更新都有限且可行时才为 `1.0`
+- `candidate_compliance`
+- `baseline_compliance`
+- `final_volume_fraction`
+- `volume_fraction_error`
+
+## 判为无效的情况
+
+- 缺少 `update_density(...)`，或函数在评测中报错
+- 任意一步候选密度包含非有限值
+- 任意一步更新违反边界、move limit、被动区域或目标体积约束
+- 评测过程中 pyMOTO 求解失败
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/baseline/solution.py b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/baseline/solution.py
new file mode 100644
index 00000000..e4547302
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/baseline/solution.py
@@ -0,0 +1,10 @@
+from __future__ import annotations
+
+try:
+    from benchmarks.StructuralOptimization.BridgeTopologyOptimization.runtime.problem import oc_update
+except ModuleNotFoundError:
+    from runtime.problem import oc_update
+
+
+def update_density(density, sensitivity, state):
+    return oc_update(density, sensitivity, state)
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/agent_files.txt b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/candidate_destination.txt b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/constraints.txt b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/constraints.txt
new file mode 100644
index 00000000..c1220208
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.
+Keep every density update finite and feasible.
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/eval_command.txt b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/eval_cwd.txt b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/initial_program.txt b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/readonly_files.txt b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..75978e1f
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/frontier_eval/readonly_files.txt
@@ -0,0 +1,4 @@
+baseline/solution.py
+runtime/problem.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/references/source_manifest.md b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/references/source_manifest.md
new file mode 100644
index 00000000..12247d59
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/references/source_manifest.md
@@ -0,0 +1,10 @@
+# Source Manifest
+
+- Upstream solver/formulation: `pyMOTO`
+- Upstream files:
+  - `examples/topology_optimization/ex_compliance.py`
+  - `examples/topology_optimization/ex_compliance_69line.py`
+- Geometry provenance: a frozen bridge-like case derived from the standard bridge-structure topology-optimization literature, including the "symmetric half of a bridge structure" discussion in Couri et al. (2024), *One-shot procedures for topology optimization: a comparative study*, with a passive-solid deck row added so the distributed load has an explicit load-bearing support region.
+- Frozen benchmark status: this repository vendors a traceable local instance; it is not claimed to be an official upstream data file.
+- License lineage: pyMOTO is released under the MIT License.
+- Provenance class: traceable literature-derived local instance.
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/runtime/problem.py b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/runtime/problem.py
new file mode 100644
index 00000000..a7c4fe23
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/runtime/problem.py
@@ -0,0 +1,252 @@
+from __future__ import annotations
+
+import math
+import warnings
+from typing import Any
+
+import numpy as np
+import pymoto as pym
+from scipy.sparse import SparseEfficiencyWarning
+
+
+warnings.filterwarnings("ignore", category=SparseEfficiencyWarning)
+
+PROBLEM = {
+    "geometry": "bridge_half",
+    "nx": 48,
+    "ny": 16,
+    "volume_fraction": 0.45,
+    "minimum_density": 0.001,
+    "filter_radius": 1.5,
+    "penalty_power": 3.0,
+    "move_limit": 0.2,
+    "max_iterations": 30,
+    "load_scale": 1.0,
+    "passive_solid_top_rows": 1
+}
+SAMPLE_INSTANCE = {
+    "title": "Bridge Topology Optimization",
+    "geometry": PROBLEM["geometry"],
+    "domain_shape": [PROBLEM["nx"], PROBLEM["ny"]],
+    "volume_fraction": PROBLEM["volume_fraction"],
+    "filter_radius": PROBLEM["filter_radius"],
+    "penalty_power": PROBLEM["penalty_power"],
+    "max_iterations": PROBLEM["max_iterations"],
+}
+
+
+def load_instance() -> dict[str, Any]:
+    return dict(SAMPLE_INSTANCE)
+
+
+def _passive_masks(domain: pym.VoxelDomain) -> tuple[np.ndarray, np.ndarray]:
+    solid = np.zeros(domain.nel, dtype=bool)
+    void = np.zeros(domain.nel, dtype=bool)
+    top_rows = int(PROBLEM.get("passive_solid_top_rows", 0))
+    for offset in range(top_rows):
+        y = PROBLEM["ny"] - 1 - offset
+        solid[domain.elements[:, y, 0].reshape(-1)] = True
+    return solid, void
+
+
+def _initial_density(domain: pym.VoxelDomain, solid_mask: np.ndarray, void_mask: np.ndarray) -> np.ndarray:
+    target_sum = PROBLEM["volume_fraction"] * domain.nel
+    fixed_sum = float(np.sum(solid_mask)) + PROBLEM["minimum_density"] * float(np.sum(void_mask))
+    free_mask = ~(solid_mask | void_mask)
+    free_count = int(np.sum(free_mask))
+    if free_count == 0:
+        raise ValueError("no free design variables remain")
+    free_density = (target_sum - fixed_sum) / free_count
+    if not (PROBLEM["minimum_density"] <= free_density <= 1.0):
+        raise ValueError("target volume is infeasible for the chosen passive masks")
+    density = np.full(domain.nel, free_density, dtype=float)
+    density[solid_mask] = 1.0
+    density[void_mask] = PROBLEM["minimum_density"]
+    return density
+
+
+def _fixed_dofs(domain: pym.VoxelDomain) -> np.ndarray:
+    geometry = PROBLEM["geometry"]
+    if geometry == "cantilever":
+        left_nodes = domain.nodes[0, :].flatten()
+        return domain.get_dofnumber(left_nodes, [0, 1], 2).flatten()
+    if geometry in {"mbb_half", "bridge_half"}:
+        left_nodes = domain.nodes[0, :].flatten()
+        left_x = domain.get_dofnumber(left_nodes, 0, 2).flatten()
+        right_bottom = int(domain.nodes[PROBLEM["nx"], 0, 0])
+        return np.concatenate([left_x, np.array([2 * right_bottom + 1], dtype=int)])
+    raise ValueError(f"unsupported geometry: {geometry}")
+
+
+def _force_vector(domain: pym.VoxelDomain) -> np.ndarray:
+    f = np.zeros(domain.nnodes * 2, dtype=float)
+    geometry = PROBLEM["geometry"]
+    load = float(PROBLEM["load_scale"])
+    if geometry == "cantilever":
+        force_node = int(domain.nodes[PROBLEM["nx"], PROBLEM["ny"] // 2, 0])
+        f[2 * force_node + 1] = load
+        return f
+    if geometry == "mbb_half":
+        force_node = int(domain.nodes[0, PROBLEM["ny"], 0])
+        f[2 * force_node + 1] = -load
+        return f
+    if geometry == "bridge_half":
+        deck_nodes = domain.nodes[:, PROBLEM["ny"], 0].flatten()
+        f[2 * deck_nodes + 1] = -load / len(deck_nodes)
+        return f
+    raise ValueError(f"unsupported geometry: {geometry}")
+
+
+def _build_context() -> dict[str, Any]:
+    domain = pym.VoxelDomain(PROBLEM["nx"], PROBLEM["ny"])
+    fixed_dofs = _fixed_dofs(domain)
+    force = _force_vector(domain)
+    passive_solid_mask, passive_void_mask = _passive_masks(domain)
+    x0 = _initial_density(domain, passive_solid_mask, passive_void_mask)
+    signal = pym.Signal("x", state=x0.copy())
+    with pym.Network() as network:
+        filtered = pym.DensityFilter(domain=domain, radius=PROBLEM["filter_radius"])(signal)
+        penalized = pym.MathExpression(
+            expression=f"{PROBLEM['minimum_density']} + {1.0 - PROBLEM['minimum_density']}*inp0^{PROBLEM['penalty_power']}"
+        )(filtered)
+        stiffness = pym.AssembleStiffness(domain=domain, bc=fixed_dofs)(penalized)
+        displacement = pym.LinSolve(symmetric=True, positive_definite=True)(stiffness, force)
+        compliance = pym.EinSum(expression="i,i->")(displacement, force)
+    network.response()
+    return {
+        "domain": domain,
+        "fixed_dofs": fixed_dofs,
+        "force": force,
+        "signal": signal,
+        "network": network,
+        "compliance_signal": compliance,
+        "passive_solid_mask": passive_solid_mask,
+        "passive_void_mask": passive_void_mask,
+    }
+
+
+def _extract_density(value: Any, expected_size: int) -> np.ndarray:
+    if isinstance(value, dict):
+        if "density" not in value:
+            raise ValueError("missing density key")
+        value = value["density"]
+    density = np.asarray(value, dtype=float).reshape(-1)
+    if density.size != expected_size:
+        raise ValueError(f"density must have length {expected_size}, got {density.size}")
+    if not np.all(np.isfinite(density)):
+        raise ValueError("density contains non-finite values")
+    return density
+
+
+def _target_density_sum(state: dict[str, Any]) -> float:
+    return float(state["target_density_sum"])
+
+
+def density_bounds(previous_density: np.ndarray, state: dict[str, Any]) -> tuple[np.ndarray, np.ndarray]:
+    lower = np.maximum(float(state["minimum_density"]), previous_density - float(state["move_limit"]))
+    upper = np.minimum(1.0, previous_density + float(state["move_limit"]))
+    solid_mask = np.asarray(state["passive_solid_mask"], dtype=bool)
+    void_mask = np.asarray(state["passive_void_mask"], dtype=bool)
+    if solid_mask.any():
+        lower = lower.copy()
+        upper = upper.copy()
+        lower[solid_mask] = 1.0
+        upper[solid_mask] = 1.0
+    if void_mask.any():
+        lower = lower.copy()
+        upper = upper.copy()
+        lower[void_mask] = float(state["minimum_density"])
+        upper[void_mask] = float(state["minimum_density"])
+    return lower, upper
+
+
+def _project_sum_with_bounds(raw: np.ndarray, lower: np.ndarray, upper: np.ndarray, target_sum: float) -> np.ndarray:
+    if float(np.sum(lower)) - 1e-9 > target_sum or float(np.sum(upper)) + 1e-9 < target_sum:
+        raise ValueError("target density sum is infeasible under current bounds")
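+    lam_low = float(np.min(raw - upper))  # at this shift every entry clips to its upper bound
+    lam_high = float(np.max(raw - lower))  # at this shift every entry clips to its lower bound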
+    for _ in range(80):
+        lam = 0.5 * (lam_low + lam_high)
+        candidate = np.clip(raw - lam, lower, upper)
+        if float(np.sum(candidate)) > target_sum:
+            lam_low = lam
+        else:
+            lam_high = lam
+    return np.clip(raw - lam_high, lower, upper)
+
+
+def project_density(raw_density: Any, previous_density: np.ndarray, state: dict[str, Any]) -> np.ndarray:
+    raw = _extract_density(raw_density, previous_density.size)
+    lower, upper = density_bounds(previous_density, state)
+    return _project_sum_with_bounds(raw, lower, upper, _target_density_sum(state))
+
+
+def validate_density(candidate_density: np.ndarray, previous_density: np.ndarray, state: dict[str, Any]) -> None:
+    lower, upper = density_bounds(previous_density, state)
+    tol = 1e-6
+    if np.any(candidate_density < lower - tol) or np.any(candidate_density > upper + tol):
+        raise ValueError("density violates bounds, move limit, or passive masks")
+    volume_error = abs(float(np.sum(candidate_density)) - _target_density_sum(state))
+    if volume_error > 1e-4:
+        raise ValueError("density violates target volume")
+
+
+def oc_update(density: np.ndarray, sensitivity: np.ndarray, state: dict[str, Any]) -> np.ndarray:
+    lower, upper = density_bounds(density, state)
+    sens = np.asarray(sensitivity, dtype=float).reshape(-1)
+    if sens.shape != density.shape:
+        raise ValueError("sensitivity shape mismatch")
+    sens = np.minimum(sens, -1e-12)  # clamp to strictly negative so -sens / lam stays positive
+    l1, l2 = 1e-9, 1e9
+    for _ in range(80):
+        lam = 0.5 * (l1 + l2)
+        candidate = np.clip(density * np.sqrt(np.maximum(1e-12, -sens / lam)), lower, upper)
+        if float(np.sum(candidate)) > _target_density_sum(state):
+            l1 = lam
+        else:
+            l2 = lam
+    return np.clip(density * np.sqrt(np.maximum(1e-12, -sens / l2)), lower, upper)
+
+
+def run_optimization(update_density, max_iterations: int | None = None) -> dict[str, Any]:
+    context = _build_context()
+    signal = context["signal"]
+    network = context["network"]
+    compliance_signal = context["compliance_signal"]
+
+    history: list[float] = [float(compliance_signal.state)]
+    iterations = int(PROBLEM["max_iterations"] if max_iterations is None else max_iterations)
+    for iteration in range(iterations):
+        network.reset()
+        compliance_signal.sensitivity = 1.0
+        network.sensitivity()
+        density = np.asarray(signal.state, dtype=float).reshape(-1).copy()
+        sensitivity = np.asarray(signal.sensitivity, dtype=float).reshape(-1).copy()
+        state = {
+            "iteration": iteration,
+            "domain_shape": (PROBLEM["nx"], PROBLEM["ny"]),
+            "volume_fraction": PROBLEM["volume_fraction"],
+            "target_density_sum": PROBLEM["volume_fraction"] * context["domain"].nel,
+            "minimum_density": PROBLEM["minimum_density"],
+            "move_limit": PROBLEM["move_limit"],
+            "current_compliance": float(compliance_signal.state),
+            "history": tuple(history),
+            "passive_solid_mask": context["passive_solid_mask"].copy(),
+            "passive_void_mask": context["passive_void_mask"].copy(),
+        }
+        candidate = update_density(density.copy(), sensitivity.copy(), state)
+        density_next = _extract_density(candidate, density.size)
+        validate_density(density_next, density, state)
+        signal.state = density_next
+        network.response()
+        history.append(float(compliance_signal.state))
+
+    final_density = np.asarray(signal.state, dtype=float).reshape(-1)
+    return {
+        "valid": True,
+        "compliance": float(compliance_signal.state),
+        "history": history,
+        "iterations": iterations,
+        "final_volume_fraction": float(np.mean(final_density)),
+        "volume_fraction_error": abs(float(np.mean(final_density)) - PROBLEM["volume_fraction"]),
+    }
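Reviewer note on `oc_update` in `runtime/problem.py` above: the baseline can be read as the following optimality-criteria style update, written out here from the code rather than from any upstream documentation. The symbols $\rho_i$ (current density), $g_i$ (clamped sensitivity), $m$ (move limit), and $V$ (target density sum) are notation introduced for this note.

```latex
\begin{aligned}
g_i &= \min\!\left(\frac{\partial c}{\partial \rho_i},\, -10^{-12}\right), \\
\ell_i &= \max(\rho_{\min},\, \rho_i - m), \qquad u_i = \min(1,\, \rho_i + m), \\
\rho_i^{\text{new}}(\lambda) &= \operatorname{clip}\!\left(\rho_i \sqrt{\max\!\left(10^{-12},\, -g_i / \lambda\right)},\; \ell_i,\; u_i\right), \\
\text{choose } \lambda \in [10^{-9}, 10^{9}] &\text{ by 80 bisection steps so that } \sum_i \rho_i^{\text{new}}(\lambda) \approx V .
\end{aligned}
```

On passive elements the bounds collapse to $\ell_i = u_i = 1$ (solid) or $\ell_i = u_i = \rho_{\min}$ (void), so the clip pins them regardless of $\lambda$.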
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/scripts/init.py b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/scripts/init.py
new file mode 100644
index 00000000..13e0f8f9
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/scripts/init.py
@@ -0,0 +1,48 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.StructuralOptimization.BridgeTopologyOptimization.baseline.solution import update_density as _baseline_update_density
+except ModuleNotFoundError:
+    from baseline.solution import update_density as _baseline_update_density
+
+
+# EVOLVE-BLOCK-START
+def update_density(density, sensitivity, state):
+    return _baseline_update_density(density, sensitivity, state)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.StructuralOptimization.BridgeTopologyOptimization.runtime.problem import run_optimization
+    except ModuleNotFoundError:
+        from runtime.problem import run_optimization
+
+    result = run_optimization(update_density)
+    print(result["compliance"])
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/verification/evaluator.py b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/verification/evaluator.py
new file mode 100644
index 00000000..71c474b3
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/verification/evaluator.py
@@ -0,0 +1,98 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.StructuralOptimization.BridgeTopologyOptimization.baseline.solution import update_density as baseline_update_density
+    from benchmarks.StructuralOptimization.BridgeTopologyOptimization.runtime.problem import run_optimization
+except ModuleNotFoundError:
+    from baseline.solution import update_density as baseline_update_density
+    from runtime.problem import run_optimization
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_compliance": 0.0,
+        "baseline_compliance": 0.0,
+        "final_volume_fraction": 0.0,
+        "volume_fraction_error": 0.0,
+    }
+    artifacts: dict[str, str] = {}
+
+    program = Path(program_path).expanduser().resolve()
+    # Load the candidate inside the guard so that import-time failures are
+    # reported through artifacts instead of aborting the evaluator.
+    try:
+        namespace = runpy.run_path(str(program), run_name="candidate_program")
+        update_density = namespace.get("update_density")
+        if not callable(update_density):
+            raise TypeError("candidate must define update_density(density, sensitivity, state)")
+        baseline = run_optimization(baseline_update_density)
+        candidate = run_optimization(update_density)
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+
+    baseline_compliance = float(baseline["compliance"])
+    candidate_compliance = float(candidate["compliance"])
+    if not math.isfinite(baseline_compliance) or baseline_compliance <= 0:
+        artifacts["error_message"] = "internal baseline produced an invalid compliance value"
+        return metrics, artifacts
+    if not math.isfinite(candidate_compliance) or candidate_compliance <= 0:
+        artifacts["error_message"] = "candidate produced an invalid compliance value"
+        return metrics, artifacts
+
+    metrics["valid"] = 1.0
+    metrics["candidate_compliance"] = candidate_compliance
+    metrics["baseline_compliance"] = baseline_compliance
+    metrics["final_volume_fraction"] = float(candidate["final_volume_fraction"])
+    metrics["volume_fraction_error"] = float(candidate["volume_fraction_error"])
+    metrics["combined_score"] = -candidate_compliance
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/StructuralOptimization/BridgeTopologyOptimization/verification/requirements.txt b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/verification/requirements.txt
new file mode 100644
index 00000000..61b3c0e4
--- /dev/null
+++ b/benchmarks/StructuralOptimization/BridgeTopologyOptimization/verification/requirements.txt
@@ -0,0 +1,3 @@
+numpy
+scipy
+pymoto
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/README.md b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/README.md
new file mode 100644
index 00000000..fe3eb960
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/README.md
@@ -0,0 +1,55 @@
+# Cantilever Compliance Topology Optimization
+
+Update densities inside a frozen cantilever pyMOTO topology-optimization loop and minimize final compliance.
+
+## Why This Benchmark Matters
+
+This benchmark stands in for lightweight bracket and support design. With a fixed material budget, the objective is to make the cantilever as stiff as possible.
+
+From a CS point of view, this is optimizer design inside a frozen FEM/SIMP loop rather than one-shot prediction.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `update_density(density, sensitivity, state)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/verification/evaluator.py \
+  benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/scripts/init.py \
+  --metrics-out /tmp/CantileverComplianceTopologyOptimization_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=StructuralOptimization/CantileverComplianceTopologyOptimization \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/README_zh-CN.md b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/README_zh-CN.md
new file mode 100644
index 00000000..64f2872e
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/README_zh-CN.md
@@ -0,0 +1,55 @@
+# 悬臂梁柔顺度拓扑优化
+
+在冻结的悬臂 pyMOTO 拓扑优化循环里更新密度场，并最小化最终柔顺度。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 对应的是轻量化支架和悬臂支撑结构设计问题。在固定材料预算下，目标是让悬臂结构尽可能刚。
+
+从 CS 角度看，这不是一次性预测结果，而是在一个冻结的 FEM/SIMP 优化循环内部设计更新策略。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`update_density(density, sensitivity, state)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/verification/evaluator.py \
+  benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/scripts/init.py \
+  --metrics-out /tmp/CantileverComplianceTopologyOptimization_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=StructuralOptimization/CantileverComplianceTopologyOptimization \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/Task.md b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/Task.md
new file mode 100644
index 00000000..ad616ea6
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/Task.md
@@ -0,0 +1,53 @@
+# Cantilever Compliance Topology Optimization Task
+
+## Problem
+
+Update densities inside a frozen cantilever pyMOTO topology-optimization loop and minimize final compliance.
+
+This benchmark stands in for lightweight bracket and support design. With a fixed material budget, the objective is to make the cantilever as stiff as possible.
+
+From a CS point of view, this is optimizer design inside a frozen FEM/SIMP loop rather than one-shot prediction.
+
+## What Is Frozen
+
+- The pyMOTO finite-element model, geometry, loads, passive masks, and SIMP settings in `runtime/problem.py`.
+- The material budget (volume fraction `0.45`), minimum density (`0.001`), move limit (`0.2` per step), and 30-step optimization horizon.
+- The compliance objective and the feasibility validator for each intermediate density update.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def update_density(density, sensitivity, state):
+    ...
+```
+
+`density` is the current density vector, `sensitivity` is the current compliance sensitivity, and `state` includes keys such as `iteration`, `domain_shape`, `volume_fraction`, `target_density_sum`, `minimum_density`, `move_limit`, `current_compliance`, `history`, `passive_solid_mask`, and `passive_void_mask`.
+
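+Return the next feasible density vector, or a dict with key `density`. If you want a projection helper, you may import `project_density(raw_density, previous_density, state)` from `runtime.problem`; it shifts and clips a raw proposal so that it satisfies the per-step bounds and `target_density_sum`.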
+
+## Evaluation
+
+1. Build the frozen pyMOTO model from `runtime/problem.py`.
+2. Run the fixed 30-iteration optimization loop with your `update_density(...)` callback.
+3. Validate every intermediate density update against bounds, move limits, masks, and volume conservation (per-element tolerance `1e-6`; density-sum tolerance `1e-4`).
+4. Report final candidate compliance and compare it with the OC-style baseline for context.
+
+## Metrics
+
+- `combined_score`: `-candidate_compliance`
+- `valid`: `1.0` only if every update is finite and feasible and both the candidate and baseline runs end with a finite, positive compliance
+- `candidate_compliance`
+- `baseline_compliance`
+- `final_volume_fraction`
+- `volume_fraction_error`
+
+## Invalid Submissions
+
+- `update_density(...)` is missing or crashes
+- Any proposed density contains non-finite values
+- Any update violates bounds, move limits, passive masks, or the target density sum
+- The pyMOTO solve fails during evaluation
+
+<!-- AI_GENERATED -->
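The feasibility rules listed in the task contract above can be sketched as a stand-alone check. This is a minimal illustration in plain Python (lists instead of NumPy arrays) so that it runs without pyMOTO; the function name `is_feasible_update` and its `(ok, reason)` return shape are assumptions made for this sketch. The tolerances mirror the ones used by `validate_density` in `runtime/problem.py`, which remains the authoritative validator.

```python
import math


def is_feasible_update(candidate, previous, state, bound_tol=1e-6, sum_tol=1e-4):
    """Hypothetical stand-in for the frozen validator; returns (ok, reason)."""
    if len(candidate) != len(previous):
        return False, "wrong length"
    if not all(math.isfinite(x) for x in candidate):
        return False, "non-finite values"

    rho_min = float(state["minimum_density"])
    move = float(state["move_limit"])
    for i, (new, old) in enumerate(zip(candidate, previous)):
        # Box bounds combine the global density range with the move limit.
        lo = max(rho_min, old - move)
        hi = min(1.0, old + move)
        # Passive elements are pinned to a single admissible value.
        if state["passive_solid_mask"][i]:
            lo = hi = 1.0
        if state["passive_void_mask"][i]:
            lo = hi = rho_min
        if new < lo - bound_tol or new > hi + bound_tol:
            return False, f"element {i} violates bounds, move limit, or masks"

    # Volume conservation is checked on the density sum, not the mean.
    if abs(sum(candidate) - float(state["target_density_sum"])) > sum_tol:
        return False, "density sum does not match target_density_sum"
    return True, "ok"
```

Running a check like this on your own proposal before returning it from `update_density` can help locate which rule an update breaks; the evaluator itself only reports the exception raised by the frozen validator.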
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/Task_zh-CN.md b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/Task_zh-CN.md
new file mode 100644
index 00000000..9c5a978f
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/Task_zh-CN.md
@@ -0,0 +1,53 @@
+# 悬臂梁柔顺度拓扑优化
+
+## 任务概览
+
+在冻结的悬臂梁 pyMOTO 拓扑优化循环中更新密度场，并尽量降低最终柔顺度。
+
+这个 benchmark 对应的是轻量化支架和悬臂结构设计。在材料预算固定的前提下，目标就是让悬臂尽可能“更硬”。
+
+从计算角度看，这更像是在冻结的 FEM/SIMP 循环里设计优化器，而不是一次性预测答案。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中的 pyMOTO 有限元模型、几何、载荷、被动区域和 SIMP 设置。
+- 材料体积分数预算、最小密度、单步 move limit，以及固定的 30 次迭代。
+- 柔顺度目标和每一步密度更新的可行性校验逻辑。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def update_density(density, sensitivity, state):
+    ...
+```
+
+`density` 是当前密度向量，`sensitivity` 是当前柔顺度灵敏度，`state` 中包含 `iteration`、`domain_shape`、`volume_fraction`、`target_density_sum`、`minimum_density`、`move_limit`、`current_compliance`、`history`、`passive_solid_mask`、`passive_void_mask` 等字段。
+
+返回下一步可行的密度向量；也接受带 `density` 字段的字典。如果需要投影辅助函数，可以从 `runtime.problem` 导入 `project_density`。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 构建冻结的 pyMOTO 模型。
+2. 在固定的 30 次迭代优化循环里调用你的 `update_density(...)`。
+3. 对每一步候选密度执行边界、move limit、被动区域和体积守恒校验。
+4. 输出最终候选柔顺度，并同时给出 OC 风格基线作参考。
+
+## 指标
+
+- `combined_score`：`-candidate_compliance`
+- `valid`：只有每一步更新都有限且可行时才为 `1.0`
+- `candidate_compliance`
+- `baseline_compliance`
+- `final_volume_fraction`
+- `volume_fraction_error`
+
+## 判为无效的情况
+
+- 缺少 `update_density(...)`，或函数在评测中报错
+- 任意一步候选密度包含非有限值
+- 任意一步更新违反边界、move limit、被动区域或目标体积约束
+- 评测过程中 pyMOTO 求解失败
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/baseline/solution.py b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/baseline/solution.py
new file mode 100644
index 00000000..6c3b3896
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/baseline/solution.py
@@ -0,0 +1,10 @@
+from __future__ import annotations
+
+try:
+    from benchmarks.StructuralOptimization.CantileverComplianceTopologyOptimization.runtime.problem import oc_update
+except ModuleNotFoundError:
+    from runtime.problem import oc_update
+
+
+def update_density(density, sensitivity, state):
+    return oc_update(density, sensitivity, state)
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/agent_files.txt b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/candidate_destination.txt b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/constraints.txt b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/constraints.txt
new file mode 100644
index 00000000..c1220208
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.
+Keep every density update finite and feasible.
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/eval_command.txt b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/eval_cwd.txt b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/initial_program.txt b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/readonly_files.txt b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..75978e1f
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/frontier_eval/readonly_files.txt
@@ -0,0 +1,4 @@
+baseline/solution.py
+runtime/problem.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/references/source_manifest.md b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/references/source_manifest.md
new file mode 100644
index 00000000..a0b09f47
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/references/source_manifest.md
@@ -0,0 +1,10 @@
+# Source Manifest
+
+- Upstream solver/formulation: `pyMOTO`
+- Upstream files:
+  - `examples/topology_optimization/ex_compliance.py`
+  - `examples/topology_optimization/ex_compliance_69line.py`
+- Geometry provenance: clamped-left cantilever with a point load at the free side, directly aligned with the official pyMOTO compliance examples.
+- Frozen benchmark status: this repository vendors only a reduced-size local instance and fixed solver settings; there is no external data file.
+- License lineage: pyMOTO is released under the MIT License.
+- Provenance class: official-example-derived frozen instance.
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/runtime/problem.py b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/runtime/problem.py
new file mode 100644
index 00000000..2347495a
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/runtime/problem.py
@@ -0,0 +1,251 @@
+from __future__ import annotations
+
+import math
+import warnings
+from typing import Any
+
+import numpy as np
+import pymoto as pym
+from scipy.sparse import SparseEfficiencyWarning
+
+
+warnings.filterwarnings("ignore", category=SparseEfficiencyWarning)
+
+PROBLEM = {
+    "geometry": "cantilever",
+    "nx": 36,
+    "ny": 12,
+    "volume_fraction": 0.45,
+    "minimum_density": 0.001,
+    "filter_radius": 1.5,
+    "penalty_power": 3.0,
+    "move_limit": 0.2,
+    "max_iterations": 30,
+    "load_scale": 1.0
+}
+SAMPLE_INSTANCE = {
+    "title": "Cantilever Compliance Topology Optimization",
+    "geometry": PROBLEM["geometry"],
+    "domain_shape": [PROBLEM["nx"], PROBLEM["ny"]],
+    "volume_fraction": PROBLEM["volume_fraction"],
+    "filter_radius": PROBLEM["filter_radius"],
+    "penalty_power": PROBLEM["penalty_power"],
+    "max_iterations": PROBLEM["max_iterations"],
+}
+
+
+def load_instance() -> dict[str, Any]:
+    return dict(SAMPLE_INSTANCE)
+
+
+def _passive_masks(domain: pym.VoxelDomain) -> tuple[np.ndarray, np.ndarray]:
+    solid = np.zeros(domain.nel, dtype=bool)
+    void = np.zeros(domain.nel, dtype=bool)
+    top_rows = int(PROBLEM.get("passive_solid_top_rows", 0))
+    for offset in range(top_rows):
+        y = PROBLEM["ny"] - 1 - offset
+        solid[domain.elements[:, y, 0].reshape(-1)] = True
+    return solid, void
+
+
+def _initial_density(domain: pym.VoxelDomain, solid_mask: np.ndarray, void_mask: np.ndarray) -> np.ndarray:
+    target_sum = PROBLEM["volume_fraction"] * domain.nel
+    fixed_sum = float(np.sum(solid_mask)) + PROBLEM["minimum_density"] * float(np.sum(void_mask))
+    free_mask = ~(solid_mask | void_mask)
+    free_count = int(np.sum(free_mask))
+    if free_count == 0:
+        raise ValueError("no free design variables remain")
+    free_density = (target_sum - fixed_sum) / free_count
+    if not (PROBLEM["minimum_density"] <= free_density <= 1.0):
+        raise ValueError("target volume is infeasible for the chosen passive masks")
+    density = np.full(domain.nel, free_density, dtype=float)
+    density[solid_mask] = 1.0
+    density[void_mask] = PROBLEM["minimum_density"]
+    return density
+
+
+def _fixed_dofs(domain: pym.VoxelDomain) -> np.ndarray:
+    geometry = PROBLEM["geometry"]
+    if geometry == "cantilever":
+        left_nodes = domain.nodes[0, :].flatten()
+        return domain.get_dofnumber(left_nodes, [0, 1], 2).flatten()
+    if geometry in {"mbb_half", "bridge_half"}:
+        left_nodes = domain.nodes[0, :].flatten()
+        left_x = domain.get_dofnumber(left_nodes, 0, 2).flatten()
+        right_bottom = int(domain.nodes[PROBLEM["nx"], 0, 0])
+        return np.concatenate([left_x, np.array([2 * right_bottom + 1], dtype=int)])
+    raise ValueError(f"unsupported geometry: {geometry}")
+
+
+def _force_vector(domain: pym.VoxelDomain) -> np.ndarray:
+    f = np.zeros(domain.nnodes * 2, dtype=float)
+    geometry = PROBLEM["geometry"]
+    load = float(PROBLEM["load_scale"])
+    if geometry == "cantilever":
+        force_node = int(domain.nodes[PROBLEM["nx"], PROBLEM["ny"] // 2, 0])
+        f[2 * force_node + 1] = load
+        return f
+    if geometry == "mbb_half":
+        force_node = int(domain.nodes[0, PROBLEM["ny"], 0])
+        f[2 * force_node + 1] = -load
+        return f
+    if geometry == "bridge_half":
+        deck_nodes = domain.nodes[:, PROBLEM["ny"], 0].flatten()
+        f[2 * deck_nodes + 1] = -load / len(deck_nodes)
+        return f
+    raise ValueError(f"unsupported geometry: {geometry}")
+
+
+def _build_context() -> dict[str, Any]:
+    domain = pym.VoxelDomain(PROBLEM["nx"], PROBLEM["ny"])
+    fixed_dofs = _fixed_dofs(domain)
+    force = _force_vector(domain)
+    passive_solid_mask, passive_void_mask = _passive_masks(domain)
+    x0 = _initial_density(domain, passive_solid_mask, passive_void_mask)
+    signal = pym.Signal("x", state=x0.copy())
+    with pym.Network() as network:
+        filtered = pym.DensityFilter(domain=domain, radius=PROBLEM["filter_radius"])(signal)
+        penalized = pym.MathExpression(
+            expression=f"{PROBLEM['minimum_density']} + {1.0 - PROBLEM['minimum_density']}*inp0^{PROBLEM['penalty_power']}"
+        )(filtered)
+        stiffness = pym.AssembleStiffness(domain=domain, bc=fixed_dofs)(penalized)
+        displacement = pym.LinSolve(symmetric=True, positive_definite=True)(stiffness, force)
+        compliance = pym.EinSum(expression="i,i->")(displacement, force)
+    network.response()
+    return {
+        "domain": domain,
+        "fixed_dofs": fixed_dofs,
+        "force": force,
+        "signal": signal,
+        "network": network,
+        "compliance_signal": compliance,
+        "passive_solid_mask": passive_solid_mask,
+        "passive_void_mask": passive_void_mask,
+    }
+
+
+def _extract_density(value: Any, expected_size: int) -> np.ndarray:
+    if isinstance(value, dict):
+        if "density" not in value:
+            raise ValueError("missing density key")
+        value = value["density"]
+    density = np.asarray(value, dtype=float).reshape(-1)
+    if density.size != expected_size:
+        raise ValueError(f"density must have length {expected_size}, got {density.size}")
+    if not np.all(np.isfinite(density)):
+        raise ValueError("density contains non-finite values")
+    return density
+
+
+def _target_density_sum(state: dict[str, Any]) -> float:
+    return float(state["target_density_sum"])
+
+
+def density_bounds(previous_density: np.ndarray, state: dict[str, Any]) -> tuple[np.ndarray, np.ndarray]:
+    lower = np.maximum(float(state["minimum_density"]), previous_density - float(state["move_limit"]))
+    upper = np.minimum(1.0, previous_density + float(state["move_limit"]))
+    solid_mask = np.asarray(state["passive_solid_mask"], dtype=bool)
+    void_mask = np.asarray(state["passive_void_mask"], dtype=bool)
+    if solid_mask.any():
+        lower = lower.copy()
+        upper = upper.copy()
+        lower[solid_mask] = 1.0
+        upper[solid_mask] = 1.0
+    if void_mask.any():
+        lower = lower.copy()
+        upper = upper.copy()
+        lower[void_mask] = float(state["minimum_density"])
+        upper[void_mask] = float(state["minimum_density"])
+    return lower, upper
+
+
+def _project_sum_with_bounds(raw: np.ndarray, lower: np.ndarray, upper: np.ndarray, target_sum: float) -> np.ndarray:
+    if float(np.sum(lower)) - 1e-9 > target_sum or float(np.sum(upper)) + 1e-9 < target_sum:
+        raise ValueError("target density sum is infeasible under current bounds")
+    lam_low = float(np.min(raw - upper))
+    lam_high = float(np.max(raw - lower))
+    for _ in range(80):
+        lam = 0.5 * (lam_low + lam_high)
+        candidate = np.clip(raw - lam, lower, upper)
+        if float(np.sum(candidate)) > target_sum:
+            lam_low = lam
+        else:
+            lam_high = lam
+    return np.clip(raw - lam_high, lower, upper)
+
+
+def project_density(raw_density: Any, previous_density: np.ndarray, state: dict[str, Any]) -> np.ndarray:
+    raw = _extract_density(raw_density, previous_density.size)
+    lower, upper = density_bounds(previous_density, state)
+    return _project_sum_with_bounds(raw, lower, upper, _target_density_sum(state))
+
+
+def validate_density(candidate_density: np.ndarray, previous_density: np.ndarray, state: dict[str, Any]) -> None:
+    lower, upper = density_bounds(previous_density, state)
+    tol = 1e-6
+    if np.any(candidate_density < lower - tol) or np.any(candidate_density > upper + tol):
+        raise ValueError("density violates bounds, move limit, or passive masks")
+    volume_error = abs(float(np.sum(candidate_density)) - _target_density_sum(state))
+    if volume_error > 1e-4:
+        raise ValueError("density violates target volume")
+
+
+def oc_update(density: np.ndarray, sensitivity: np.ndarray, state: dict[str, Any]) -> np.ndarray:
+    lower, upper = density_bounds(density, state)
+    sens = np.asarray(sensitivity, dtype=float).reshape(-1)
+    if sens.shape != density.shape:
+        raise ValueError("sensitivity shape mismatch")
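+    sens = np.minimum(sens, -1e-12)  # force strictly negative sensitivities so -sens / lam stays positive
+    l1, l2 = 1e-9, 1e9  # bisection bracket for the volume Lagrange multiplier
+    for _ in range(80):
+        lam = 0.5 * (l1 + l2)
+        candidate = np.clip(density * np.sqrt(np.maximum(1e-12, -sens / lam)), lower, upper)
+        if float(np.sum(candidate)) > _target_density_sum(state):
+            l1 = lam  # too much material: raise the multiplier
+        else:
+            l2 = lam  # at or under budget: lower the multiplier
+    return np.clip(density * np.sqrt(np.maximum(1e-12, -sens / l2)), lower, upper)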
+
+
+def run_optimization(update_density, max_iterations: int | None = None) -> dict[str, Any]:
+    context = _build_context()
+    signal = context["signal"]
+    network = context["network"]
+    compliance_signal = context["compliance_signal"]
+
+    history: list[float] = [float(compliance_signal.state)]
+    iterations = int(PROBLEM["max_iterations"] if max_iterations is None else max_iterations)
+    for iteration in range(iterations):
+        network.reset()
+        compliance_signal.sensitivity = 1.0
+        network.sensitivity()
+        density = np.asarray(signal.state, dtype=float).reshape(-1).copy()
+        sensitivity = np.asarray(signal.sensitivity, dtype=float).reshape(-1).copy()
+        state = {
+            "iteration": iteration,
+            "domain_shape": (PROBLEM["nx"], PROBLEM["ny"]),
+            "volume_fraction": PROBLEM["volume_fraction"],
+            "target_density_sum": PROBLEM["volume_fraction"] * context["domain"].nel,
+            "minimum_density": PROBLEM["minimum_density"],
+            "move_limit": PROBLEM["move_limit"],
+            "current_compliance": float(compliance_signal.state),
+            "history": tuple(history),
+            "passive_solid_mask": context["passive_solid_mask"].copy(),
+            "passive_void_mask": context["passive_void_mask"].copy(),
+        }
+        candidate = update_density(density.copy(), sensitivity.copy(), state)
+        density_next = _extract_density(candidate, density.size)
+        validate_density(density_next, density, state)
+        signal.state = density_next
+        network.response()
+        history.append(float(compliance_signal.state))
+
+    final_density = np.asarray(signal.state, dtype=float).reshape(-1)
+    return {
+        "valid": True,
+        "compliance": float(compliance_signal.state),
+        "history": history,
+        "iterations": iterations,
+        "final_volume_fraction": float(np.mean(final_density)),
+        "volume_fraction_error": abs(float(np.mean(final_density)) - PROBLEM["volume_fraction"]),
+    }
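The projection used by `project_density` in the file above shifts the raw proposal by a scalar, clips it to the per-element bounds, and bisects on the shift until the clipped sum meets the target. The following is a minimal pure-Python sketch of that idea, written with plain lists so it runs without NumPy or pyMOTO. The name `project_sum_with_bounds_sketch` and the example numbers in the usage note are illustrative assumptions, not part of the benchmark.

```python
def project_sum_with_bounds_sketch(raw, lower, upper, target_sum, iterations=80):
    """Clip (raw - lam) to [lower, upper], bisecting lam so the sum meets target_sum."""
    if sum(lower) - 1e-9 > target_sum or sum(upper) + 1e-9 < target_sum:
        raise ValueError("target density sum is infeasible under current bounds")

    def clipped(lam):
        return [min(max(r - lam, lo), hi) for r, lo, hi in zip(raw, lower, upper)]

    # At lam_low every element sits at its upper bound (largest possible sum);
    # at lam_high every element sits at its lower bound (smallest possible sum).
    lam_low = min(r - hi for r, hi in zip(raw, upper))
    lam_high = max(r - lo for r, lo in zip(raw, lower))

    for _ in range(iterations):
        lam = 0.5 * (lam_low + lam_high)
        if sum(clipped(lam)) > target_sum:
            lam_low = lam   # still too much material: shift further down
        else:
            lam_high = lam  # at or under budget: tighten from above
    return clipped(lam_high)
```

As a usage example under assumed numbers, `project_sum_with_bounds_sketch([0.9, 0.5, 0.1, 0.7], [0.25] * 4, [0.65] * 4, 1.8)` returns a vector whose entries stay inside `[0.25, 0.65]` and sum to the target within floating-point error. Because the clipped sum is non-increasing in the shift, bisection is enough and no sorting of breakpoints is needed.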
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/scripts/init.py b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/scripts/init.py
new file mode 100644
index 00000000..4a555a52
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/scripts/init.py
@@ -0,0 +1,48 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.StructuralOptimization.CantileverComplianceTopologyOptimization.baseline.solution import update_density as _baseline_update_density
+except ModuleNotFoundError:
+    from baseline.solution import update_density as _baseline_update_density
+
+
+# EVOLVE-BLOCK-START
+def update_density(density, sensitivity, state):
+    return _baseline_update_density(density, sensitivity, state)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.StructuralOptimization.CantileverComplianceTopologyOptimization.runtime.problem import run_optimization
+    except ModuleNotFoundError:
+        from runtime.problem import run_optimization
+
+    result = run_optimization(update_density)
+    print(result["compliance"])
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/verification/evaluator.py b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/verification/evaluator.py
new file mode 100644
index 00000000..6114ebce
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/verification/evaluator.py
@@ -0,0 +1,98 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.StructuralOptimization.CantileverComplianceTopologyOptimization.baseline.solution import update_density as baseline_update_density
+    from benchmarks.StructuralOptimization.CantileverComplianceTopologyOptimization.runtime.problem import run_optimization
+except ModuleNotFoundError:
+    from baseline.solution import update_density as baseline_update_density
+    from runtime.problem import run_optimization
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_compliance": 0.0,
+        "baseline_compliance": 0.0,
+        "final_volume_fraction": 0.0,
+        "volume_fraction_error": 0.0,
+    }
+    artifacts: dict[str, str] = {}
+
+    program = Path(program_path).expanduser().resolve()
+    # Load the candidate inside the guard so that import-time failures are
+    # reported through artifacts instead of aborting the evaluator.
+    try:
+        namespace = runpy.run_path(str(program), run_name="candidate_program")
+        update_density = namespace.get("update_density")
+        if not callable(update_density):
+            raise TypeError("candidate must define update_density(density, sensitivity, state)")
+        baseline = run_optimization(baseline_update_density)
+        candidate = run_optimization(update_density)
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+
+    baseline_compliance = float(baseline["compliance"])
+    candidate_compliance = float(candidate["compliance"])
+    if not math.isfinite(baseline_compliance) or baseline_compliance <= 0:
+        artifacts["error_message"] = "internal baseline produced an invalid compliance value"
+        return metrics, artifacts
+    if not math.isfinite(candidate_compliance) or candidate_compliance <= 0:
+        artifacts["error_message"] = "candidate produced an invalid compliance value"
+        return metrics, artifacts
+
+    metrics["valid"] = 1.0
+    metrics["candidate_compliance"] = candidate_compliance
+    metrics["baseline_compliance"] = baseline_compliance
+    metrics["final_volume_fraction"] = float(candidate["final_volume_fraction"])
+    metrics["volume_fraction_error"] = float(candidate["volume_fraction_error"])
+    metrics["combined_score"] = -candidate_compliance
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/verification/requirements.txt b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/verification/requirements.txt
new file mode 100644
index 00000000..61b3c0e4
--- /dev/null
+++ b/benchmarks/StructuralOptimization/CantileverComplianceTopologyOptimization/verification/requirements.txt
@@ -0,0 +1,3 @@
+numpy
+scipy
+pymoto
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/README.md b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/README.md
new file mode 100644
index 00000000..2027c9b8
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/README.md
@@ -0,0 +1,55 @@
+# MBB Beam Topology Optimization
+
+Update densities inside a frozen half-MBB pyMOTO topology-optimization loop and minimize final compliance.
+
+## Why This Benchmark Matters
+
+The half-MBB beam is a classic stiffness-per-material benchmark. Local density tweaks can help or hurt global load paths, so the update rule has to reason beyond a single element neighborhood.
+
+The task is optimizer design under repeated constrained calls: you control the update rule, while the physics loop and feasibility checks stay fixed.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `update_density(density, sensitivity, state)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/verification/evaluator.py \
+  benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/scripts/init.py \
+  --metrics-out /tmp/MBBBeamTopologyOptimization_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=StructuralOptimization/MBBBeamTopologyOptimization \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/README_zh-CN.md b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/README_zh-CN.md
new file mode 100644
index 00000000..25071697
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/README_zh-CN.md
@@ -0,0 +1,55 @@
+# MBB 梁拓扑优化
+
+在冻结的半 MBB pyMOTO 拓扑优化循环里更新密度场，并最小化最终柔顺度。
+
+## 这个 Benchmark 在测什么
+
+半 MBB 梁是经典的“单位材料刚度最大化”基准。局部密度的小变化，可能帮助也可能破坏整体传力路径，所以更新规则必须具备超出单元局部邻域的判断能力。
+
+这个任务依旧属于“冻结物理循环里的优化器设计”：你控制的是更新规则，而物理求解和可行性检查保持不变。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`update_density(density, sensitivity, state)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/verification/evaluator.py \
+  benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/scripts/init.py \
+  --metrics-out /tmp/MBBBeamTopologyOptimization_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=StructuralOptimization/MBBBeamTopologyOptimization \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/Task.md b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/Task.md
new file mode 100644
index 00000000..9bfb396e
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/Task.md
@@ -0,0 +1,53 @@
+# MBB Beam Topology Optimization Task
+
+## Problem
+
+Update densities inside a frozen half-MBB pyMOTO topology-optimization loop and minimize final compliance.
+
+The half-MBB beam is a classic stiffness-per-material benchmark. Local density tweaks can help or hurt global load paths, so the update rule has to reason beyond a single element neighborhood.
+
+The task is optimizer design under repeated constrained calls: you control the update rule, while the physics loop and feasibility checks stay fixed.
+
+## What Is Frozen
+
+- The pyMOTO finite-element model, geometry, loads, passive masks, and SIMP settings in `runtime/problem.py`.
+- The material budget, minimum density, move limit, and 30-step optimization horizon.
+- The compliance objective and the feasibility validator for each intermediate density update.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def update_density(density, sensitivity, state):
+    ...
+```
+
+`density` is the current density vector, `sensitivity` is the current compliance sensitivity, and `state` includes keys such as `iteration`, `domain_shape`, `volume_fraction`, `target_density_sum`, `minimum_density`, `move_limit`, `current_compliance`, `history`, `passive_solid_mask`, and `passive_void_mask`.
+
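+Return the next feasible density vector, or a dict with key `density`. If you want a projection helper, you may import `project_density(raw_density, previous_density, state)` from `runtime.problem`; it shifts and clips the raw vector so that it satisfies the per-step bounds and matches `target_density_sum`.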
+
+## Evaluation
+
+1. Build the frozen pyMOTO model from `runtime/problem.py`.
+2. Run the fixed 30-iteration optimization loop with your `update_density(...)` callback.
+3. Validate every intermediate density update against bounds, move limits, and passive masks (tolerance `1e-6`), and against the target density sum (tolerance `1e-4`).
+4. Report final candidate compliance and compare it with the OC-style baseline for context.
+
+## Metrics
+
+- `combined_score`: `-candidate_compliance`
+- `valid`: `1.0` only if every update is finite and feasible
+- `candidate_compliance`
+- `baseline_compliance`
+- `final_volume_fraction`
+- `volume_fraction_error`
+
+## Invalid Submissions
+
+- `update_density(...)` is missing or crashes
+- Any proposed density has the wrong length or contains non-finite values
+- Any update violates bounds, move limits, passive masks, or the target density sum
+- The pyMOTO solve fails during evaluation
+
+<!-- AI_GENERATED -->
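The submission contract in `Task.md` can be met without importing anything from the runtime. The following is a minimal sketch, not the benchmark's baseline: a projected-gradient step written in plain Python (the runtime coerces the return value with `np.asarray`, so a list is accepted). The helper names `step_bounds` and `project_to_sum` and the `STEP` constant are hypothetical; only the `state` keys are taken from the contract.

```python
from __future__ import annotations

STEP = 0.05  # hypothetical step size, not a value from the benchmark


def step_bounds(previous, state):
    """Per-element [lower, upper] from minimum density, move limit, and passive masks."""
    rho_min = float(state["minimum_density"])
    move = float(state["move_limit"])
    lower, upper = [], []
    masks = zip(previous, state["passive_solid_mask"], state["passive_void_mask"])
    for rho, solid, void in masks:
        lo = max(rho_min, float(rho) - move)
        hi = min(1.0, float(rho) + move)
        if solid:  # passive solid elements are pinned to full density
            lo = hi = 1.0
        if void:  # passive void elements are pinned to the minimum density
            lo = hi = rho_min
        lower.append(lo)
        upper.append(hi)
    return lower, upper


def project_to_sum(raw, lower, upper, target_sum, iterations=80):
    """Shift raw by one scalar, clip to the bounds, and bisect on the shift."""

    def clipped(shift):
        return [min(hi, max(lo, r - shift)) for r, lo, hi in zip(raw, lower, upper)]

    shift_low = min(r - hi for r, hi in zip(raw, upper))   # everything at upper
    shift_high = max(r - lo for r, lo in zip(raw, lower))  # everything at lower
    for _ in range(iterations):
        mid = 0.5 * (shift_low + shift_high)
        if sum(clipped(mid)) > target_sum:
            shift_low = mid
        else:
            shift_high = mid
    return clipped(shift_high)


def update_density(density, sensitivity, state):
    """Move against the (normalized) compliance sensitivity, then project."""
    scale = max(abs(float(s)) for s in sensitivity) or 1.0
    raw = [float(rho) - STEP * float(s) / scale for rho, s in zip(density, sensitivity)]
    lower, upper = step_bounds(density, state)
    return project_to_sum(raw, lower, upper, float(state["target_density_sum"]))
```

Clipping after a uniform shift keeps every element inside its move-limit box while the bisection drives the sum to the target, which are the two conditions the validator checks. No claim is made here that this update beats the OC-style baseline.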
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/Task_zh-CN.md b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/Task_zh-CN.md
new file mode 100644
index 00000000..70268479
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/Task_zh-CN.md
@@ -0,0 +1,53 @@
+# MBB 梁拓扑优化
+
+## 任务概览
+
+在冻结的半 MBB 梁 pyMOTO 拓扑优化循环中更新密度场，并尽量降低最终柔顺度。
+
+半 MBB 梁是经典的“单位材料刚度”基准。局部密度的小改动可能改善，也可能破坏全局受力路径，所以更新规则不能只盯着局部。
+
+这道题依然是在重复约束调用下做优化器设计：你只控制更新规则，物理求解循环和可行性检查都保持不变。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中的 pyMOTO 有限元模型、几何、载荷、被动区域和 SIMP 设置。
+- 材料体积分数预算、最小密度、单步 move limit，以及固定的 30 次迭代。
+- 柔顺度目标和每一步密度更新的可行性校验逻辑。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def update_density(density, sensitivity, state):
+    ...
+```
+
+`density` 是当前密度向量，`sensitivity` 是当前柔顺度灵敏度，`state` 中包含 `iteration`、`domain_shape`、`volume_fraction`、`target_density_sum`、`minimum_density`、`move_limit`、`current_compliance`、`history`、`passive_solid_mask`、`passive_void_mask` 等字段。
+
+返回下一步可行的密度向量；也接受带 `density` 字段的字典。如果需要投影辅助函数，可以从 `runtime.problem` 导入 `project_density`。
+
+## 评测流程
+
+1. 从 `runtime/problem.py` 构建冻结的 pyMOTO 模型。
+2. 在固定的 30 次迭代优化循环里调用你的 `update_density(...)`。
+3. 对每一步候选密度执行边界、move limit、被动区域和体积守恒校验。
+4. 输出最终候选柔顺度，并同时给出 OC 风格基线作参考。
+
+## 指标
+
+- `combined_score`：`-candidate_compliance`
+- `valid`：只有每一步更新都有限且可行时才为 `1.0`
+- `candidate_compliance`
+- `baseline_compliance`
+- `final_volume_fraction`
+- `volume_fraction_error`
+
+## 判为无效的情况
+
+- 缺少 `update_density(...)`，或函数在评测中报错
+- 任意一步候选密度包含非有限值
+- 任意一步更新违反边界、move limit、被动区域或目标体积约束
+- 评测过程中 pyMOTO 求解失败
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/baseline/solution.py b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/baseline/solution.py
new file mode 100644
index 00000000..81a18f2f
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/baseline/solution.py
@@ -0,0 +1,10 @@
+from __future__ import annotations
+
+try:
+    from benchmarks.StructuralOptimization.MBBBeamTopologyOptimization.runtime.problem import oc_update
+except ModuleNotFoundError:
+    from runtime.problem import oc_update
+
+
+def update_density(density, sensitivity, state):
+    return oc_update(density, sensitivity, state)
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/agent_files.txt b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/agent_files.txt
new file mode 100644
index 00000000..1d2eb069
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/candidate_destination.txt b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/candidate_destination.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/constraints.txt b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/constraints.txt
new file mode 100644
index 00000000..c1220208
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.
+Keep every density update finite and feasible.
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/eval_command.txt b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/eval_command.txt
new file mode 100644
index 00000000..fcba5e60
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/eval_cwd.txt b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/eval_cwd.txt
new file mode 100644
index 00000000..9c558e35
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/initial_program.txt b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/initial_program.txt
new file mode 100644
index 00000000..b9411b3d
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/readonly_files.txt b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/readonly_files.txt
new file mode 100644
index 00000000..75978e1f
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/frontier_eval/readonly_files.txt
@@ -0,0 +1,4 @@
+baseline/solution.py
+runtime/problem.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/references/source_manifest.md b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/references/source_manifest.md
new file mode 100644
index 00000000..3b50d317
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/references/source_manifest.md
@@ -0,0 +1,10 @@
+# Source Manifest
+
+- Upstream solver/formulation: `pyMOTO`
+- Upstream files:
+  - `examples/topology_optimization/ex_compliance.py`
+  - `examples/topology_optimization/ex_self_weight.py` (`bc == 2` names the MBB-beam support style)
+- Geometry provenance: the standard half-MBB beam benchmark lineage used in density-based topology optimization, including Sigmund (2001), "A 99 line topology optimization code written in Matlab".
+- Frozen benchmark status: this repository vendors a reduced-size local half-MBB instance with fixed symmetry/support conditions and a fixed point load.
+- License lineage: pyMOTO is released under the MIT License.
+- Provenance class: literature-derived canonical geometry, locally frozen.
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/runtime/problem.py b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/runtime/problem.py
new file mode 100644
index 00000000..c52882d9
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/runtime/problem.py
@@ -0,0 +1,251 @@
+from __future__ import annotations
+
+import math
+import warnings
+from typing import Any
+
+import numpy as np
+import pymoto as pym
+from scipy.sparse import SparseEfficiencyWarning
+
+
+warnings.filterwarnings("ignore", category=SparseEfficiencyWarning)
+
+PROBLEM = {
+    "geometry": "mbb_half",
+    "nx": 48,
+    "ny": 16,
+    "volume_fraction": 0.5,
+    "minimum_density": 0.001,
+    "filter_radius": 1.5,
+    "penalty_power": 3.0,
+    "move_limit": 0.2,
+    "max_iterations": 30,
+    "load_scale": 1.0,
+}
+SAMPLE_INSTANCE = {
+    "title": "MBB Beam Topology Optimization",
+    "geometry": PROBLEM["geometry"],
+    "domain_shape": [PROBLEM["nx"], PROBLEM["ny"]],
+    "volume_fraction": PROBLEM["volume_fraction"],
+    "filter_radius": PROBLEM["filter_radius"],
+    "penalty_power": PROBLEM["penalty_power"],
+    "max_iterations": PROBLEM["max_iterations"],
+}
+
+
+def load_instance() -> dict[str, Any]:
+    return dict(SAMPLE_INSTANCE)
+
+
+def _passive_masks(domain: pym.VoxelDomain) -> tuple[np.ndarray, np.ndarray]:
+    solid = np.zeros(domain.nel, dtype=bool)
+    void = np.zeros(domain.nel, dtype=bool)
+    top_rows = int(PROBLEM.get("passive_solid_top_rows", 0))
+    for offset in range(top_rows):
+        y = PROBLEM["ny"] - 1 - offset
+        solid[domain.elements[:, y, 0].reshape(-1)] = True
+    return solid, void
+
+
+def _initial_density(domain: pym.VoxelDomain, solid_mask: np.ndarray, void_mask: np.ndarray) -> np.ndarray:
+    target_sum = PROBLEM["volume_fraction"] * domain.nel
+    fixed_sum = float(np.sum(solid_mask)) + PROBLEM["minimum_density"] * float(np.sum(void_mask))
+    free_mask = ~(solid_mask | void_mask)
+    free_count = int(np.sum(free_mask))
+    if free_count == 0:
+        raise ValueError("no free design variables remain")
+    free_density = (target_sum - fixed_sum) / free_count
+    if not (PROBLEM["minimum_density"] <= free_density <= 1.0):
+        raise ValueError("target volume is infeasible for the chosen passive masks")
+    density = np.full(domain.nel, free_density, dtype=float)
+    density[solid_mask] = 1.0
+    density[void_mask] = PROBLEM["minimum_density"]
+    return density
+
+
+def _fixed_dofs(domain: pym.VoxelDomain) -> np.ndarray:
+    geometry = PROBLEM["geometry"]
+    if geometry == "cantilever":
+        left_nodes = domain.nodes[0, :].flatten()
+        return domain.get_dofnumber(left_nodes, [0, 1], 2).flatten()
+    if geometry in {"mbb_half", "bridge_half"}:
+        left_nodes = domain.nodes[0, :].flatten()
+        left_x = domain.get_dofnumber(left_nodes, 0, 2).flatten()
+        right_bottom = int(domain.nodes[PROBLEM["nx"], 0, 0])
+        return np.concatenate([left_x, np.array([2 * right_bottom + 1], dtype=int)])
+    raise ValueError(f"unsupported geometry: {geometry}")
+
+
+def _force_vector(domain: pym.VoxelDomain) -> np.ndarray:
+    f = np.zeros(domain.nnodes * 2, dtype=float)
+    geometry = PROBLEM["geometry"]
+    load = float(PROBLEM["load_scale"])
+    if geometry == "cantilever":
+        force_node = int(domain.nodes[PROBLEM["nx"], PROBLEM["ny"] // 2, 0])
+        f[2 * force_node + 1] = load
+        return f
+    if geometry == "mbb_half":
+        force_node = int(domain.nodes[0, PROBLEM["ny"], 0])
+        f[2 * force_node + 1] = -load
+        return f
+    if geometry == "bridge_half":
+        deck_nodes = domain.nodes[:, PROBLEM["ny"], 0].flatten()
+        f[2 * deck_nodes + 1] = -load / len(deck_nodes)
+        return f
+    raise ValueError(f"unsupported geometry: {geometry}")
+
+
+def _build_context() -> dict[str, Any]:
+    domain = pym.VoxelDomain(PROBLEM["nx"], PROBLEM["ny"])
+    fixed_dofs = _fixed_dofs(domain)
+    force = _force_vector(domain)
+    passive_solid_mask, passive_void_mask = _passive_masks(domain)
+    x0 = _initial_density(domain, passive_solid_mask, passive_void_mask)
+    signal = pym.Signal("x", state=x0.copy())
+    with pym.Network() as network:
+        filtered = pym.DensityFilter(domain=domain, radius=PROBLEM["filter_radius"])(signal)
+        penalized = pym.MathExpression(
+            expression=f"{PROBLEM['minimum_density']} + {1.0 - PROBLEM['minimum_density']}*inp0^{PROBLEM['penalty_power']}"
+        )(filtered)
+        stiffness = pym.AssembleStiffness(domain=domain, bc=fixed_dofs)(penalized)
+        displacement = pym.LinSolve(symmetric=True, positive_definite=True)(stiffness, force)
+        compliance = pym.EinSum(expression="i,i->")(displacement, force)
+    network.response()
+    return {
+        "domain": domain,
+        "fixed_dofs": fixed_dofs,
+        "force": force,
+        "signal": signal,
+        "network": network,
+        "compliance_signal": compliance,
+        "passive_solid_mask": passive_solid_mask,
+        "passive_void_mask": passive_void_mask,
+    }
+
+
+def _extract_density(value: Any, expected_size: int) -> np.ndarray:
+    if isinstance(value, dict):
+        if "density" not in value:
+            raise ValueError("missing density key")
+        value = value["density"]
+    density = np.asarray(value, dtype=float).reshape(-1)
+    if density.size != expected_size:
+        raise ValueError(f"density must have length {expected_size}, got {density.size}")
+    if not np.all(np.isfinite(density)):
+        raise ValueError("density contains non-finite values")
+    return density
+
+
+def _target_density_sum(state: dict[str, Any]) -> float:
+    return float(state["target_density_sum"])
+
+
+def density_bounds(previous_density: np.ndarray, state: dict[str, Any]) -> tuple[np.ndarray, np.ndarray]:
+    lower = np.maximum(float(state["minimum_density"]), previous_density - float(state["move_limit"]))
+    upper = np.minimum(1.0, previous_density + float(state["move_limit"]))
+    solid_mask = np.asarray(state["passive_solid_mask"], dtype=bool)
+    void_mask = np.asarray(state["passive_void_mask"], dtype=bool)
+    if solid_mask.any():
+        lower = lower.copy()
+        upper = upper.copy()
+        lower[solid_mask] = 1.0
+        upper[solid_mask] = 1.0
+    if void_mask.any():
+        lower = lower.copy()
+        upper = upper.copy()
+        lower[void_mask] = float(state["minimum_density"])
+        upper[void_mask] = float(state["minimum_density"])
+    return lower, upper
+
+
+def _project_sum_with_bounds(raw: np.ndarray, lower: np.ndarray, upper: np.ndarray, target_sum: float) -> np.ndarray:
+    if float(np.sum(lower)) - 1e-9 > target_sum or float(np.sum(upper)) + 1e-9 < target_sum:
+        raise ValueError("target density sum is infeasible under current bounds")
+    lam_low = float(np.min(raw - upper))
+    lam_high = float(np.max(raw - lower))
+    for _ in range(80):
+        lam = 0.5 * (lam_low + lam_high)
+        candidate = np.clip(raw - lam, lower, upper)
+        if float(np.sum(candidate)) > target_sum:
+            lam_low = lam
+        else:
+            lam_high = lam
+    return np.clip(raw - lam_high, lower, upper)
+
+
+def project_density(raw_density: Any, previous_density: np.ndarray, state: dict[str, Any]) -> np.ndarray:
+    raw = _extract_density(raw_density, previous_density.size)
+    lower, upper = density_bounds(previous_density, state)
+    return _project_sum_with_bounds(raw, lower, upper, _target_density_sum(state))
+
+
+def validate_density(candidate_density: np.ndarray, previous_density: np.ndarray, state: dict[str, Any]) -> None:
+    lower, upper = density_bounds(previous_density, state)
+    tol = 1e-6
+    if np.any(candidate_density < lower - tol) or np.any(candidate_density > upper + tol):
+        raise ValueError("density violates bounds, move limit, or passive masks")
+    volume_error = abs(float(np.sum(candidate_density)) - _target_density_sum(state))
+    if volume_error > 1e-4:
+        raise ValueError("density violates target volume")
+
+
+def oc_update(density: np.ndarray, sensitivity: np.ndarray, state: dict[str, Any]) -> np.ndarray:
+    lower, upper = density_bounds(density, state)
+    sens = np.asarray(sensitivity, dtype=float).reshape(-1)
+    if sens.shape != density.shape:
+        raise ValueError("sensitivity shape mismatch")
+    sens = np.minimum(sens, -1e-12)
+    l1, l2 = 1e-9, 1e9
+    for _ in range(80):
+        lam = 0.5 * (l1 + l2)
+        candidate = np.clip(density * np.sqrt(np.maximum(1e-12, -sens / lam)), lower, upper)
+        if float(np.sum(candidate)) > _target_density_sum(state):
+            l1 = lam
+        else:
+            l2 = lam
+    return np.clip(density * np.sqrt(np.maximum(1e-12, -sens / l2)), lower, upper)
+
+
+def run_optimization(update_density, max_iterations: int | None = None) -> dict[str, Any]:
+    context = _build_context()
+    signal = context["signal"]
+    network = context["network"]
+    compliance_signal = context["compliance_signal"]
+
+    history: list[float] = [float(compliance_signal.state)]
+    iterations = int(PROBLEM["max_iterations"] if max_iterations is None else max_iterations)
+    for iteration in range(iterations):
+        network.reset()
+        compliance_signal.sensitivity = 1.0
+        network.sensitivity()
+        density = np.asarray(signal.state, dtype=float).reshape(-1).copy()
+        sensitivity = np.asarray(signal.sensitivity, dtype=float).reshape(-1).copy()
+        state = {
+            "iteration": iteration,
+            "domain_shape": (PROBLEM["nx"], PROBLEM["ny"]),
+            "volume_fraction": PROBLEM["volume_fraction"],
+            "target_density_sum": PROBLEM["volume_fraction"] * context["domain"].nel,
+            "minimum_density": PROBLEM["minimum_density"],
+            "move_limit": PROBLEM["move_limit"],
+            "current_compliance": float(compliance_signal.state),
+            "history": tuple(history),
+            "passive_solid_mask": context["passive_solid_mask"].copy(),
+            "passive_void_mask": context["passive_void_mask"].copy(),
+        }
+        candidate = update_density(density.copy(), sensitivity.copy(), state)
+        density_next = _extract_density(candidate, density.size)
+        validate_density(density_next, density, state)
+        signal.state = density_next
+        network.response()
+        history.append(float(compliance_signal.state))
+
+    final_density = np.asarray(signal.state, dtype=float).reshape(-1)
+    return {
+        "valid": True,
+        "compliance": float(compliance_signal.state),
+        "history": history,
+        "iterations": iterations,
+        "final_volume_fraction": float(np.mean(final_density)),
+        "volume_fraction_error": abs(float(np.mean(final_density)) - PROBLEM["volume_fraction"]),
+    }
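The `MathExpression` in `_build_context` applies a SIMP-style interpolation to the filtered density. As a rough illustration, assuming `^` denotes exponentiation in that expression, the mapping can be reproduced in plain Python with the values from `PROBLEM`; the function name `simp_stiffness` is hypothetical.

```python
def simp_stiffness(filtered_density, minimum_density=0.001, penalty_power=3.0):
    """Relative stiffness: rho_min + (1 - rho_min) * x ** p."""
    return minimum_density + (1.0 - minimum_density) * filtered_density ** penalty_power


if __name__ == "__main__":
    # Intermediate densities are penalized: half density gives about 12.6% stiffness.
    for rho in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"{rho:.2f} -> {simp_stiffness(rho):.6f}")
```

With a penalty power of 3, an element at density 0.5 contributes roughly one eighth of the full stiffness while still consuming half a unit of the material budget, which is why update rules tend to push densities toward the bounds.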
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/scripts/init.py b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/scripts/init.py
new file mode 100644
index 00000000..943114cf
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/scripts/init.py
@@ -0,0 +1,48 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.StructuralOptimization.MBBBeamTopologyOptimization.baseline.solution import update_density as _baseline_update_density
+except ModuleNotFoundError:
+    from baseline.solution import update_density as _baseline_update_density
+
+
+# EVOLVE-BLOCK-START
+def update_density(density, sensitivity, state):
+    return _baseline_update_density(density, sensitivity, state)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.StructuralOptimization.MBBBeamTopologyOptimization.runtime.problem import run_optimization
+    except ModuleNotFoundError:
+        from runtime.problem import run_optimization
+
+    result = run_optimization(update_density)
+    print(result["compliance"])
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/verification/evaluator.py b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/verification/evaluator.py
new file mode 100644
index 00000000..7686131a
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/verification/evaluator.py
@@ -0,0 +1,98 @@
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.StructuralOptimization.MBBBeamTopologyOptimization.baseline.solution import update_density as baseline_update_density
+    from benchmarks.StructuralOptimization.MBBBeamTopologyOptimization.runtime.problem import run_optimization
+except ModuleNotFoundError:
+    from baseline.solution import update_density as baseline_update_density
+    from runtime.problem import run_optimization
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_compliance": 0.0,
+        "baseline_compliance": 0.0,
+        "final_volume_fraction": 0.0,
+        "volume_fraction_error": 0.0,
+    }
+    artifacts: dict[str, str] = {}
+
+    program = Path(program_path).expanduser().resolve()
+    try:
+        # Loading the candidate executes its top-level code, so guard that step too.
+        namespace = runpy.run_path(str(program), run_name="candidate_program")
+        update_density = namespace.get("update_density")
+        if not callable(update_density):
+            artifacts["error_message"] = "candidate must define update_density(density, sensitivity, state)"
+            return metrics, artifacts
+        baseline = run_optimization(baseline_update_density)
+        candidate = run_optimization(update_density)
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+
+    baseline_compliance = float(baseline["compliance"])
+    candidate_compliance = float(candidate["compliance"])
+    if not math.isfinite(baseline_compliance) or baseline_compliance <= 0:
+        artifacts["error_message"] = "internal baseline produced an invalid compliance value"
+        return metrics, artifacts
+    if not math.isfinite(candidate_compliance) or candidate_compliance <= 0:
+        artifacts["error_message"] = "candidate produced an invalid compliance value"
+        return metrics, artifacts
+
+    metrics["valid"] = 1.0
+    metrics["candidate_compliance"] = candidate_compliance
+    metrics["baseline_compliance"] = baseline_compliance
+    metrics["final_volume_fraction"] = float(candidate["final_volume_fraction"])
+    metrics["volume_fraction_error"] = float(candidate["volume_fraction_error"])
+    metrics["combined_score"] = -candidate_compliance
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
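The evaluator reports `combined_score` as `-candidate_compliance`, so the baseline only appears as context. A small sketch of how the metrics dict might be read afterwards is shown here; the helper name `relative_compliance_reduction` and the sample numbers are hypothetical, while the keys match those written by `evaluate`.

```python
def relative_compliance_reduction(metrics):
    """Fraction by which the candidate lowers compliance relative to the baseline.

    Returns None for invalid runs, where the evaluator leaves the sentinel score.
    """
    if metrics.get("valid", 0.0) != 1.0:
        return None
    baseline = float(metrics["baseline_compliance"])
    candidate = float(metrics["candidate_compliance"])
    return (baseline - candidate) / baseline


if __name__ == "__main__":
    # Made-up values for illustration only.
    sample = {"valid": 1.0, "candidate_compliance": 90.0, "baseline_compliance": 100.0}
    print(relative_compliance_reduction(sample))
```

A positive value means the candidate ended with lower compliance than the OC-style baseline; a negative value means it did worse.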
diff --git a/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/verification/requirements.txt b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/verification/requirements.txt
new file mode 100644
index 00000000..61b3c0e4
--- /dev/null
+++ b/benchmarks/StructuralOptimization/MBBBeamTopologyOptimization/verification/requirements.txt
@@ -0,0 +1,3 @@
+numpy
+scipy
+pymoto
diff --git a/docs/benchmark_ideas/frontier_benchmark_cards.md b/docs/benchmark_ideas/frontier_benchmark_cards.md
new file mode 100644
index 00000000..b73fdfc5
--- /dev/null
+++ b/docs/benchmark_ideas/frontier_benchmark_cards.md
@@ -0,0 +1,601 @@
+# Frontier Benchmark Cards
+
+Generated on 2026-03-12.
+Drafted with the `frontier-benchmark-contributor` skill.
+
+Selection policy:
+
+1. Prefer authentic and traceable data sources or canonical benchmark instances.
+2. Prefer offline reproducibility and stable evaluation.
+3. Prefer `task=unified` where practical.
+4. Downgrade or defer items with weak provenance, unclear redistribution terms, or heavyweight runtime stacks.
+
+## P1 Priority
+
+### 1. EOQ+MOQ Annual Cost Minimization
+- Domain: Inventory optimization
+- Upstream: Stockpyl EOQ
+- Canonical source: Stockpyl official EOQ docs and Snyder/Shen textbook formulas; no external dataset
+- License / redistribution: Stockpyl MIT; no data redistribution issue
+- Baseline: `economic_order_quantity` plus outer feasible-set enumeration for MOQ
+- Agent interface: `solve(params) -> Q`
+- Metrics: `annual_total_cost`, `valid`
+- Recommended integration: `unified`
+- Main risk: MOQ must be enforced as an outer constraint because the closed-form Stockpyl EOQ solver does not handle it natively
+
+### 2. EOQ All-Units Discount Optimization
+- Domain: Inventory optimization
+- Upstream: Stockpyl EOQ with all-units discounts
+- Canonical source: Stockpyl official EOQ docs; no external dataset
+- License / redistribution: Stockpyl MIT; no data redistribution issue
+- Baseline: `economic_order_quantity_with_all_units_discounts`
+- Agent interface: `solve(params) -> Q`
+- Metrics: `annual_total_cost`, `chosen_region`, `valid`
+- Recommended integration: `unified`
+- Main risk: Feasible regions become subtle when all-units discounts are combined with MOQ
+
+### 3. EOQ Incremental Discount Optimization
+- Domain: Inventory optimization
+- Upstream: Stockpyl EOQ with incremental discounts
+- Canonical source: Stockpyl official EOQ docs; no external dataset
+- License / redistribution: Stockpyl MIT; no data redistribution issue
+- Baseline: `economic_order_quantity_with_incremental_discounts`
+- Agent interface: `solve(params) -> Q`
+- Metrics: `annual_total_cost`, `chosen_region`, `valid`
+- Recommended integration: `unified`
+- Main risk: Incremental discount cost accounting is easy to implement incorrectly
+
+### 4. Poisson-Demand (r,Q) Exact Optimization
+- Domain: Inventory optimization
+- Upstream: Stockpyl `rq`
+- Canonical source: Stockpyl single-echelon inventory tutorial and `r_q_poisson_exact`
+- License / redistribution: Stockpyl MIT; no data redistribution issue
+- Baseline: `r_q_poisson_exact`
+- Agent interface: `solve(params) -> (r, Q)`
+- Metrics: `cost`, `valid`
+- Recommended integration: `unified`
+- Main risk: Service-level constraints may need an outer audit rather than a native function argument
+
+### 5. Normal-Demand (r,Q) with 95% Service-Level Constraint
+- Domain: Inventory optimization
+- Upstream: Stockpyl `rq` normal-demand approximations
+- Canonical source: Stockpyl official `rq` docs and tutorial
+- License / redistribution: Stockpyl MIT; no data redistribution issue
+- Baseline: `r_q_eil_approximation`, `r_q_eoqss_approximation`, or `r_q_loss_function_approximation`
+- Agent interface: `solve(params) -> (r, Q)`
+- Metrics: `cost`, `service_level_feasible`, `valid`
+- Recommended integration: `unified`
+- Main risk: Approximation quality can drift around the service-level boundary
+
+### 6. FT10 Dispatching Rule Optimization
+- Domain: Job shop scheduling
+- Upstream: JobShopLib / Fisher-Thompson FT10
+- Canonical source: JobShopLib bundled `ft10`; original instance from Fisher and Thompson (1963)
+- License / redistribution: JobShopLib MIT; original instance should preferably be loaded from the library or re-encoded carefully
+- Baseline: CP-SAT or built-in dispatching rules
+- Agent interface: `solve(instance) -> schedule`
+- Metrics: `makespan`, `valid`
+- Recommended integration: `unified`
+- Main risk: Original instance licensing is less explicit than the code license
+
+### 7. LA16 Dispatching Rule Optimization
+- Domain: Job shop scheduling
+- Upstream: JobShopLib / Lawrence LA16
+- Canonical source: JobShopLib bundled `la16`; original instance from Lawrence (1984)
+- License / redistribution: JobShopLib MIT; original instance should preferably be loaded from the library or re-encoded carefully
+- Baseline: CP-SAT or built-in dispatching rules
+- Agent interface: `solve(instance) -> schedule`
+- Metrics: `makespan`, `valid`
+- Recommended integration: `unified`
+- Main risk: Same provenance caveat as FT10
+
+### 8. FT10 Neighborhood Move Selection
+- Domain: Job shop scheduling
+- Upstream: JobShopLib plus SA/CP-SAT
+- Canonical source: Canonical FT10 benchmark instance
+- License / redistribution: JobShopLib MIT; instance provenance should be tied to JobShopLib or another canonical loader
+- Baseline: Simulated annealing default neighborhood or CP-SAT optimal reference 930
+- Agent interface: `choose_moves(state) -> move`
+- Metrics: `makespan`, `improvement_over_baseline`, `valid`
+- Recommended integration: `unified`
+- Main risk: The agent surface must stay narrow so the solver itself is not editable
+
+### 9. LA16 Neighborhood Move Selection
+- Domain: Job shop scheduling
+- Upstream: JobShopLib plus SA/CP-SAT
+- Canonical source: Canonical LA16 benchmark instance
+- License / redistribution: JobShopLib MIT; instance provenance should be tied to JobShopLib or another canonical loader
+- Baseline: Simulated annealing default neighborhood or CP-SAT reference 945
+- Agent interface: `choose_moves(state) -> move`
+- Metrics: `makespan`, `improvement_over_baseline`, `valid`
+- Recommended integration: `unified`
+- Main risk: Reproducibility depends on fixed random seeds and a pinned neighborhood set
+
+### 10. DuckDB TPC-H Materialization / Index Selection
+- Domain: Database optimization
+- Upstream: DuckDB plus official TPC-H extension
+- Canonical source: DuckDB official `tpch` extension with local `CALL dbgen(sf=...)`
+- License / redistribution: DuckDB MIT; TPC-H data generated locally to avoid redistributing raw benchmark tables
+- Baseline: No additional materialization or a simple rule-based config
+- Agent interface: `solve(workload) -> config`
+- Metrics: `total_runtime`, `correctness`, `valid`
+- Recommended integration: `unified`
+- Main risk: Native DuckDB indexes may have limited impact, so the benchmark may work better around materialization or layout
+
+### 11. DuckDB TPC-H Query Rewriting
+- Domain: Database optimization
+- Upstream: DuckDB plus official TPC-H queries
+- Canonical source: DuckDB official TPC-H query set and local generated data
+- License / redistribution: DuckDB MIT; data generated locally
+- Baseline: Original SQL
+- Agent interface: `rewrite(sql) -> sql`
+- Metrics: `runtime_ratio`, `result_match`, `valid`
+- Recommended integration: `unified`
+- Main risk: Semantic equivalence checks must be strict and deterministic
+
+### 12. DuckDB TPC-H Pre-Aggregation Selection
+- Domain: Database optimization
+- Upstream: DuckDB plus TPC-H
+- Canonical source: DuckDB official TPC-H extension with local generated data
+- License / redistribution: DuckDB MIT; data generated locally
+- Baseline: No pre-aggregation
+- Agent interface: `solve(workload) -> ddl_plan`
+- Metrics: `total_runtime`, `storage_overhead`, `valid`
+- Recommended integration: `unified`
+- Main risk: The task can drift into schema design rather than benchmark-guided optimization if not scoped tightly
+
+### 13. 2D Grid Obstacle-Avoiding Path Planning
+- Domain: Motion planning
+- Upstream: `caelan/motion-planners`
+- Canonical source: Official repo plus fixed-seed synthetic grid maps
+- License / redistribution: MIT; synthetic maps have no redistribution issue
+- Baseline: A* or RRT
+- Agent interface: `plan_path(map, start, goal) -> path`
+- Metrics: `path_length`, `success_rate`, `runtime`
+- Recommended integration: `unified`
+- Main risk: Map authenticity is weaker than a canonical real-world benchmark
+
+### 14. 2D Narrow-Passage Planning
+- Domain: Motion planning
+- Upstream: `caelan/motion-planners`
+- Canonical source: Official repo plus fixed-seed synthetic narrow-passage maps
+- License / redistribution: MIT; synthetic maps have no redistribution issue
+- Baseline: BiRRT or RRT*
+- Agent interface: `plan_path(map, start, goal) -> path`
+- Metrics: `success_rate`, `path_cost`, `runtime`
+- Recommended integration: `unified`
+- Main risk: The benchmark is synthetic rather than tied to a public robotics map dataset
+
+### 15. Multi-Robot Prioritized Planning
+- Domain: Motion planning
+- Upstream: `caelan/motion-planners`
+- Canonical source: Official repo plus fixed-seed multi-agent grid instances
+- License / redistribution: MIT; synthetic instances have no redistribution issue
+- Baseline: Prioritized planning or independent A*
+- Agent interface: `solve(map, agents) -> paths`
+- Metrics: `total_path_length`, `collisions`, `runtime`
+- Recommended integration: `unified`
+- Main risk: Collision checking and tie-breaking must remain deterministic
+
+## P2 Priority
+
+### 16. Cantilever Compliance Topology Optimization
+- Domain: Structural optimization
+- Upstream: pyMOTO
+- Canonical source: pyMOTO official examples and standard FEM setup
+- License / redistribution: MIT; example data can be redistributed with the repo
+- Baseline: Official SIMP plus OC/MMA
+- Agent interface: `solve(load_case) -> density`
+- Metrics: `compliance`, `volume_fraction`, `valid`
+- Recommended integration: `unified`
+- Main risk: Mesh size must be kept small for fast evaluation
+
+### 17. MBB Beam Topology Optimization
+- Domain: Structural optimization
+- Upstream: pyMOTO
+- Canonical source: pyMOTO official examples
+- License / redistribution: MIT
+- Baseline: Official SIMP plus OC/MMA
+- Agent interface: `solve(load_case) -> density`
+- Metrics: `compliance`, `volume_fraction`, `valid`
+- Recommended integration: `unified`
+- Main risk: Filter and projection choices can dominate the result
+
+### 18. Bridge Topology Optimization
+- Domain: Structural optimization
+- Upstream: pyMOTO
+- Canonical source: pyMOTO official examples or standard textbook bridge setup
+- License / redistribution: MIT
+- Baseline: Official SIMP plus OC/MMA
+- Agent interface: `solve(load_case) -> density`
+- Metrics: `compliance`, `volume_fraction`, `checkerboard_penalty`
+- Recommended integration: `unified`
+- Main risk: The canonical geometry must be frozen in the task spec
+
+### 19. Fuel-Minimizing Ship Weather Routing
+- Domain: Maritime optimization
+- Upstream: 52North WeatherRoutingTool
+- Canonical source: Official WeatherRoutingTool algorithms and, if available, official demo weather fields; otherwise clearly labeled synthetic grids
+- License / redistribution: MIT; synthetic weather grids avoid redistribution issues
+- Baseline: Default shortest-route or official routing heuristic
+- Agent interface: `solve(instance) -> waypoints`
+- Metrics: `fuel_cost`, `travel_time`, `constraint_violations`
+- Recommended integration: `unified`
+- Main risk: If official demo weather fields are not usable, the benchmark becomes synthetic
+
+### 20. Dynamic-Current Minimum-Time Routing
+- Domain: Maritime optimization
+- Upstream: HALEM
+- Canonical source: HALEM official repo and test/example current fields, or clearly labeled synthetic current field
+- License / redistribution: MIT
+- Baseline: `HALEM_time`
+- Agent interface: `solve(instance) -> route`
+- Metrics: `travel_time`, `valid`
+- Recommended integration: `unified`
+- Main risk: The official example fields may be too heavy to vendor or reproduce; this must be confirmed first
+
+### 21. Depth-Constrained Cost-Minimizing Routing
+- Domain: Maritime optimization
+- Upstream: HALEM
+- Canonical source: HALEM official repo and example current/depth fields
+- License / redistribution: MIT
+- Baseline: `HALEM_cost` or `HALEM_co2`
+- Agent interface: `solve(instance) -> route`
+- Metrics: `route_cost`, `grounding_violations`, `valid`
+- Recommended integration: `unified`
+- Main risk: Any transformation into local arrays must be documented carefully
+
+### 22. Intraday Operation with Storage
+- Domain: Power systems
+- Upstream: PyPSA
+- Canonical source: PyPSA official example networks
+- License / redistribution: MIT; official example networks are preferred
+- Baseline: PyPSA default linear optimization
+- Agent interface: `solve(network) -> dispatch`
+- Metrics: `opex`, `curtailment`, `valid`
+- Recommended integration: `unified`
+- Main risk: Linopy/HiGHS runtime stack is heavier than pure Python tasks
+
+### 23. Transmission Expansion Planning
+- Domain: Power systems
+- Upstream: PyPSA
+- Canonical source: PyPSA official example networks
+- License / redistribution: MIT
+- Baseline: PyPSA default capacity expansion model
+- Agent interface: `solve(network) -> expansion_plan`
+- Metrics: `total_system_cost`, `unmet_demand`, `valid`
+- Recommended integration: `unified`
+- Main risk: Time horizon must be reduced to keep runtime bounded
+
+### 24. Renewable Siting plus Line Expansion
+- Domain: Power systems
+- Upstream: PyPSA
+- Canonical source: PyPSA official example networks or a clearly documented toy network derived from them
+- License / redistribution: MIT
+- Baseline: PyPSA default optimization
+- Agent interface: `solve(network) -> plan`
+- Metrics: `capex_opex_total`, `renewable_share`, `valid`
+- Recommended integration: `unified`
+- Main risk: A derived toy case requires explicit provenance and transformation notes
+
+### 25. Single-Day Multi-Energy Scheduling
+- Domain: Integrated energy systems
+- Upstream: MESMO
+- Canonical source: MESMO official example cases and official repo data
+- License / redistribution: MIT; official examples preferred
+- Baseline: MESMO or CVXPY default model
+- Agent interface: `solve(instance) -> schedule`
+- Metrics: `total_cost`, `feasibility`, `valid`
+- Recommended integration: `unified`
+- Main risk: CVXPY and solver dependencies are relatively heavy
+
+### 26. Joint Investment plus Operations for Heat Pump and Storage
+- Domain: Integrated energy systems
+- Upstream: MESMO
+- Canonical source: MESMO official example case or a clearly documented reduced derivative
+- License / redistribution: MIT
+- Baseline: MESMO default optimization
+- Agent interface: `solve(instance) -> design_and_ops`
+- Metrics: `lifecycle_cost`, `capacity_feasible`, `valid`
+- Recommended integration: `unified`
+- Main risk: Investment variables increase runtime significantly
+
+### 27. SnAr Reaction Condition Optimization
+- Domain: Chemical process optimization
+- Upstream: Summit
+- Canonical source: Summit official `SnarBenchmark` and associated benchmark paper
+- License / redistribution: Summit MIT; benchmark is a simulator, so no external data redistribution issue
+- Baseline: SOBO or Nelder-Mead
+- Agent interface: `solve(experiment) -> conditions`
+- Metrics: `best_objective`, `budget_efficiency`, `valid`
+- Recommended integration: `unified`
+- Main risk: Budget and random seed must be fixed tightly
+
+### 28. SnAr Multi-Objective Optimization
+- Domain: Chemical process optimization
+- Upstream: Summit
+- Canonical source: Summit official benchmark suite
+- License / redistribution: MIT
+- Baseline: Scalarized official strategy such as SOBO
+- Agent interface: `solve(experiment) -> trial_sequence`
+- Metrics: `best_scalarized_score` or `pareto_hypervolume`, `valid`
+- Recommended integration: `unified`
+- Main risk: Multi-objective scoring must be frozen before implementation
+
+### 29. EV Price-Arbitrage Charging
+- Domain: EV charging optimization
+- Upstream: EV2Gym
+- Canonical source: EV2Gym official example configs and open-source data references
+- License / redistribution: MIT; chosen config and any bundled data must be checked case by case
+- Baseline: `ChargeAsFastAsPossible` or official MPC/oracle heuristic
+- Agent interface: `solve(env) -> actions`
+- Metrics: `energy_cost`, `user_satisfaction`, `valid`
+- Recommended integration: `unified`
+- Main risk: The specific config data may not be redistributable or reconstructible locally; this must be confirmed first
+
+### 30. EV Scheduling with Overload and Voltage Constraints
+- Domain: EV charging optimization
+- Upstream: EV2Gym
+- Canonical source: EV2Gym official configs
+- License / redistribution: MIT; config provenance must be pinned
+- Baseline: Official MPC or heuristic
+- Agent interface: `solve(env) -> actions`
+- Metrics: `cost`, `overload_penalty`, `voltage_penalty`
+- Recommended integration: `unified`
+- Main risk: The grid proxy must be made stable enough for repeated evaluation
+
+### 31. Small Reservoir Network Cost Minimization
+- Domain: Water resources optimization
+- Upstream: CALVIN
+- Canonical source: CALVIN official site, official GitHub, and official example data lineage
+- License / redistribution: CALVIN code MIT; full data redistribution needs separate confirmation
+- Baseline: Pyomo MILP
+- Agent interface: `solve(instance) -> policy`
+- Metrics: `total_cost`, `shortage_penalty`, `valid`
+- Recommended integration: `unified`
+- Main risk: Official data is large and externally hosted, so a smaller benchmark likely needs a carefully documented derivative
+
+### 32. Drought-Scenario Water Allocation
+- Domain: Water resources optimization
+- Upstream: CALVIN
+- Canonical source: CALVIN official model and published problem framing; likely a reduced synthetic or derived network
+- License / redistribution: CALVIN code MIT; derived data must be labeled clearly
+- Baseline: Pyomo MILP
+- Agent interface: `solve(instance) -> allocation`
+- Metrics: `total_cost`, `shortage`, `reservoir_violations`
+- Recommended integration: `unified`
+- Main risk: It must not be misrepresented as official CALVIN data if the network is reduced or synthetic
+
+### 33. Markowitz Minimum Variance with Return Floor
+- Domain: Portfolio optimization
+- Upstream: PyPortfolioOpt
+- Canonical source: PyPortfolioOpt repo test returns or another fixed local returns matrix; upstream market-data provenance is relatively weak
+- License / redistribution: Code MIT; market data licensing requires extra verification
+- Baseline: `EfficientFrontier`
+- Agent interface: `solve(returns) -> weights`
+- Metrics: `annual_return`, `volatility`, `sharpe`, `max_drawdown`
+- Recommended integration: `unified`
+- Main risk: Data authenticity is weaker than canonical academic benchmarks unless a stronger source is chosen
+
+### 34. Maximum Sharpe Static Portfolio
+- Domain: Portfolio optimization
+- Upstream: PyPortfolioOpt
+- Canonical source: Same as above; stronger if replaced by a licensed market snapshot or clearly labeled synthetic returns
+- License / redistribution: Code MIT; any real market data requires separate license review
+- Baseline: `max_sharpe()`
+- Agent interface: `solve(returns) -> weights`
+- Metrics: `sharpe`, `turnover_vs_baseline`, `valid`
+- Recommended integration: `unified`
+- Main risk: Data provenance is weak if ad hoc historical prices are used
+
+### 35. Rebalancing with Transaction Cost and Leverage Limits
+- Domain: Portfolio optimization
+- Upstream: PyPortfolioOpt
+- Canonical source: Fixed returns matrix from official tests or synthetic data unless a licensed market dataset is selected
+- License / redistribution: Code MIT; real price data needs explicit license verification
+- Baseline: `EfficientFrontier` plus transaction-cost objective
+- Agent interface: `solve(returns, w0) -> weights`
+- Metrics: `net_sharpe`, `turnover`, `max_leverage_violation`
+- Recommended integration: `unified`
+- Main risk: This task is highly exposed to data provenance weakness
+
+### 36. Small Fab Dispatch Rule Composition
+- Domain: Manufacturing scheduling
+- Upstream: SimRLFab
+- Canonical source: SimRLFab official default semiconductor config and related paper
+- License / redistribution: MIT; bundled configs appear open
+- Baseline: FIFO, SPT, or LPT
+- Agent interface: `schedule_fab(state) -> priority`
+- Metrics: `throughput`, `avg_flow_time`, `valid`
+- Recommended integration: `bespoke wrapper`
+- Main risk: The simulator stack is heavier than typical single-file tasks
+
+### 37. Urgent-Order Multi-Objective Dispatching
+- Domain: Manufacturing scheduling
+- Upstream: SimRLFab
+- Canonical source: Official SimRLFab config plus explicit local reward shaping
+- License / redistribution: MIT
+- Baseline: FIFO, SPT, or LPT
+- Agent interface: `schedule_fab(state) -> priority`
+- Metrics: `weighted_flow_time_wip_tardiness`, `valid`
+- Recommended integration: `bespoke wrapper`
+- Main risk: Reward design can easily make the benchmark unstable
+
+### 38. 3D Urban Drone Obstacle-Avoiding Path Planning
+- Domain: Drone path planning
+- Upstream: `martin0004/drone_path_planning`
+- Canonical source: Official repo plus San Francisco `colliders.csv` lineage from the Udacity FCND project
+- License / redistribution: Repo contains `LICENSE.txt`; map-data lineage should be documented explicitly
+- Baseline: Repo RRT implementation
+- Agent interface: `plan_3d_path(terrain, start, goal) -> path`
+- Metrics: `path_length`, `collisions`, `runtime`
+- Recommended integration: `unified`
+- Main risk: Provenance is a lineage chain rather than a formal canonical benchmark
+
+### 39. Energy-Penalized 3D Path Planning
+- Domain: Drone path planning
+- Upstream: `martin0004/drone_path_planning`
+- Canonical source: Same lineage as above or a clearly labeled synthetic terrain
+- License / redistribution: Same caveat as above
+- Baseline: 3D A* or RRT
+- Agent interface: `plan_3d_path(terrain, start, goal) -> path`
+- Metrics: `path_length_plus_height_penalty`, `valid`
+- Recommended integration: `unified`
+- Main risk: If changed to synthetic terrain, the benchmark authenticity drops further
+
+### 40. UAV Formation Convergence Control
+- Domain: Multi-agent control
+- Upstream: CoFlyers
+- Canonical source: CoFlyers official Vasarhelyi example and repo configs
+- License / redistribution: GPL-3.0; MATLAB/Simulink dependencies are significant
+- Baseline: Official Vasarhelyi algorithm
+- Agent interface: `solve(state) -> desired_velocities`
+- Metrics: `convergence_time`, `collisions`, `control_energy`
+- Recommended integration: `bespoke wrapper`
+- Main risk: GPL and MATLAB dependencies make integration expensive
+
+## P3 Priority
+
+### 41. HALE Aircraft Design Optimization
+- Domain: Aerospace design
+- Upstream: DawnDesignTool
+- Canonical source: Official repo `design_opt.py` and related paper; assumptions come from the model rather than a public benchmark dataset
+- License / redistribution: MIT
+- Baseline: Official `design_opt.py` flow
+- Agent interface: `solve_design(spec) -> design`
+- Metrics: `weight` or `cruise_power`, `constraint_feasible`
+- Recommended integration: `unified`
+- Main risk: Official model provenance is fine, but not the same as a canonical public benchmark instance
+
+### 42. Small Imaging-System Optical Optimization
+- Domain: Optical design
+- Upstream: Optiland
+- Canonical source: Optiland official examples; avoid hidden dependency on external materials databases
+- License / redistribution: MIT; any external glass/material database needs separate terms review
+- Baseline: Official optimizer and merit function
+- Agent interface: `solve(system) -> lens_params`
+- Metrics: `mtf_loss`, `aberration_score`, `valid`
+- Recommended integration: `unified`
+- Main risk: Material-database provenance can become muddy quickly
+
+### 43. Small-Molecule Force-Field Parameter Fitting
+- Domain: Molecular simulation
+- Upstream: OpenFF Toolkit / BespokeFit
+- Canonical source: OpenFF official examples and BespokeFit paper; actual QM reference set still needs a canonical selection
+- License / redistribution: MIT
+- Baseline: OpenFF recommended parameterization flow
+- Agent interface: `fit_ff_params(dataset) -> params`
+- Metrics: `energy_rmse`, `force_rmse`, `valid`
+- Recommended integration: `bespoke wrapper`
+- Main risk: The hard provenance question lies in the QM reference set, not the toolkit code
+
+### 44. MD Parameter Performance Tuning
+- Domain: Molecular simulation
+- Upstream: OpenFF Toolkit plus OpenMM ecosystem
+- Canonical source: OpenFF official example molecules; not a canonical performance benchmark
+- License / redistribution: OpenFF MIT; additional dependency terms vary
+- Baseline: Default integrator and cutoff settings
+- Agent interface: `tune_md(system) -> md_config`
+- Metrics: `throughput`, `energy_stability`, `valid`
+- Recommended integration: `bespoke wrapper`
+- Main risk: Results are highly hardware- and version-sensitive
+
+### 45. Multiple Sequence Alignment Quality-Time Tradeoff
+- Domain: Bioinformatics
+- Upstream: Sequoya
+- Canonical source: Sequoya official repo and paper; a serious benchmark should instead pin a canonical subset such as BAliBASE
+- License / redistribution: Code MIT; benchmark-dataset terms must be checked separately
+- Baseline: Sequoya default parameters
+- Agent interface: `align(seqs) -> alignment`
+- Metrics: `alignment_score`, `runtime`, `valid`
+- Recommended integration: `bespoke wrapper`
+- Main risk: Benchmark value depends more on dataset provenance than on the code package
+
+### 46. Gap-Penalty-Sensitive MSA Optimization
+- Domain: Bioinformatics
+- Upstream: Sequoya
+- Canonical source: Same as above; should wait for a verified benchmark dataset choice
+- License / redistribution: Code MIT; dataset terms unresolved
+- Baseline: Default NSGA-II or M2Align-style configuration
+- Agent interface: `align(seqs) -> alignment`
+- Metrics: `sp_score`, `tc_score`, `gap_score`, `runtime`
+- Recommended integration: `bespoke wrapper`
+- Main risk: Dataset provenance and scoring setup would dominate the implementation effort
+
+### 47. Additive-Manufacturing Differentiable Simulation Optimization
+- Domain: Manufacturing simulation
+- Upstream: `differentiable-simulation-am`
+- Canonical source: Official repo notebooks and bundled `data/`
+- License / redistribution: MIT
+- Baseline: Default gradient-descent setup from the paper and repo
+- Agent interface: `solve(params0) -> params`
+- Metrics: `best_loss`, `sim_calls`, `valid`
+- Recommended integration: `unified`
+- Main risk: The repo is notebook-heavy and needs stabilization into a clean evaluator
+
+### 48. Offline Driving Path and Behavior Planning
+- Domain: Autonomous driving
+- Upstream: CARLA
+- Canonical source: CARLA official simulator and official maps/assets exported offline
+- License / redistribution: Code MIT; assets CC-BY; Unreal-related terms add complexity
+- Baseline: IDM plus rule-based lane changes
+- Agent interface: `solve(state) -> controls`
+- Metrics: `collisions`, `avg_speed`, `brake_events`, `fuel_proxy`
+- Recommended integration: `bespoke wrapper`
+- Main risk: Asset and engine licensing complexity makes this a poor first-batch candidate
+
+### 49. Data-Center Scheduling plus Cooling Control
+- Domain: Data-center optimization
+- Upstream: Hewlett Packard `dc-rl`
+- Canonical source: Official repo environments and examples
+- License / redistribution: Mixed MIT and CC BY-NC 4.0 terms
+- Baseline: Fixed MARL or heuristic policy
+- Agent interface: `solve(state) -> actions`
+- Metrics: `energy`, `peak_power`, `temp_violations`
+- Recommended integration: `bespoke wrapper`
+- Main risk: Mixed licensing and a heavy environment stack
+
+### 50. Optical-System Lightweighting
+- Domain: Optical design
+- Upstream: Optiland
+- Canonical source: Optiland official examples; any external glass/material database must be cited separately
+- License / redistribution: MIT for code; external material-data terms may differ
+- Baseline: Official optimizer with thickness and quality constraints
+- Agent interface: `solve(system) -> lens_params`
+- Metrics: `total_thickness`, `image_quality`, `valid`
+- Recommended integration: `unified`
+- Main risk: The code is relatively clean, but materials provenance still needs careful documentation
+
+## Source Index
+
+- Stockpyl GitHub: https://github.com/LarrySnyder/stockpyl
+- Stockpyl EOQ docs: https://stockpyl.readthedocs.io/en/latest/api/seio/eoq.html
+- Stockpyl RQ docs: https://stockpyl.readthedocs.io/en/latest/api/seio/rq.html
+- Stockpyl single-echelon tutorial: https://stockpyl.readthedocs.io/en/latest/tutorial/tutorial_seio.html
+- JobShopLib: https://github.com/Pabloo22/job_shop_lib
+- Job Shop Scheduling Benchmark Environments: https://github.com/ai-for-decision-making-tue/Job_Shop_Scheduling_Benchmark_Environments_and_Instances
+- DuckDB TPC-H extension: https://duckdb.org/docs/stable/core_extensions/tpch
+- DuckDB benchmark docs: https://duckdb.org/docs/1.3/guides/performance/benchmarks.html
+- motion-planners: https://github.com/caelan/motion-planners
+- pyMOTO: https://github.com/aatmdelissen/pyMOTO
+- WeatherRoutingTool: https://github.com/52North/WeatherRoutingTool
+- HALEM: https://github.com/TUDelft-CITG/halem
+- PyPSA: https://github.com/PyPSA/PyPSA
+- MESMO: https://github.com/mesmo-dev/mesmo
+- Summit: https://github.com/sustainable-processes/summit
+- EV2Gym: https://github.com/StavrosOrf/EV2Gym
+- CALVIN official site: https://calvin.ucdavis.edu/
+- CALVIN GitHub: https://github.com/ucd-cws/calvin
+- PyPortfolioOpt: https://github.com/PyPortfolio/PyPortfolioOpt
+- SimRLFab: https://github.com/AndreasKuhnle/SimRLFab
+- drone_path_planning: https://github.com/martin0004/drone_path_planning
+- CoFlyers: https://github.com/micros-uav/CoFlyers
+- DawnDesignTool: https://github.com/peterdsharpe/DawnDesignTool
+- Optiland: https://github.com/HarrisonKramer/optiland
+- OpenFF Toolkit: https://github.com/openforcefield/openff-toolkit
+- OpenFF BespokeFit: https://github.com/openforcefield/openff-bespokefit
+- Sequoya GitHub: https://github.com/benhid/Sequoya
+- Sequoya paper: https://academic.oup.com/bioinformatics/article-abstract/36/12/3892/5823295
+- differentiable-simulation-am: https://github.com/mojtabamozaffar/differentiable-simulation-am
+- CARLA: https://github.com/carla-simulator/carla
+- dc-rl: https://github.com/HewlettPackard/dc-rl
diff --git a/docs/benchmark_ideas/frontier_benchmark_cards.yaml b/docs/benchmark_ideas/frontier_benchmark_cards.yaml
new file mode 100644
index 00000000..a6e98663
--- /dev/null
+++ b/docs/benchmark_ideas/frontier_benchmark_cards.yaml
@@ -0,0 +1,672 @@
+generated_on: "2026-03-12"
+skill: "frontier-benchmark-contributor"
+status: "draft"
+selection_policy:
+  - "Prefer authentic and traceable data sources or canonical benchmark instances."
+  - "Prefer offline reproducibility and stable evaluation."
+  - "Prefer task=unified where practical."
+  - "Downgrade or defer items with weak provenance, unclear redistribution terms, or heavyweight runtime stacks."
+cards:
+  - rank: 1
+    priority: "P1"
+    title: "EOQ+MOQ Annual Cost Minimization"
+    domain: "Inventory optimization"
+    upstream: "Stockpyl EOQ"
+    source: "Stockpyl official EOQ docs and Snyder/Shen textbook formulas; no external dataset."
+    license: "Stockpyl MIT; no data redistribution issue."
+    baseline: "economic_order_quantity plus outer feasible-set enumeration for MOQ."
+    agent_interface: "solve(params) -> Q"
+    metrics: ["annual_total_cost", "valid"]
+    integration: "unified"
+    risk: "MOQ must be enforced as an outer constraint because the closed-form Stockpyl EOQ solver does not handle it natively."
+  - rank: 2
+    priority: "P1"
+    title: "EOQ All-Units Discount Optimization"
+    domain: "Inventory optimization"
+    upstream: "Stockpyl EOQ with all-units discounts"
+    source: "Stockpyl official EOQ docs; no external dataset."
+    license: "Stockpyl MIT; no data redistribution issue."
+    baseline: "economic_order_quantity_with_all_units_discounts."
+    agent_interface: "solve(params) -> Q"
+    metrics: ["annual_total_cost", "chosen_region", "valid"]
+    integration: "unified"
+    risk: "Feasible regions become subtle when all-units discounts are combined with MOQ."
+  - rank: 3
+    priority: "P1"
+    title: "EOQ Incremental Discount Optimization"
+    domain: "Inventory optimization"
+    upstream: "Stockpyl EOQ with incremental discounts"
+    source: "Stockpyl official EOQ docs; no external dataset."
+    license: "Stockpyl MIT; no data redistribution issue."
+    baseline: "economic_order_quantity_with_incremental_discounts."
+    agent_interface: "solve(params) -> Q"
+    metrics: ["annual_total_cost", "chosen_region", "valid"]
+    integration: "unified"
+    risk: "Incremental discount cost accounting is easy to implement incorrectly."
+  - rank: 4
+    priority: "P1"
+    title: "Poisson-Demand (r,Q) Exact Optimization"
+    domain: "Inventory optimization"
+    upstream: "Stockpyl rq"
+    source: "Stockpyl single-echelon inventory tutorial and r_q_poisson_exact."
+    license: "Stockpyl MIT; no data redistribution issue."
+    baseline: "r_q_poisson_exact."
+    agent_interface: "solve(params) -> (r, Q)"
+    metrics: ["cost", "valid"]
+    integration: "unified"
+    risk: "Service-level constraints may need an outer audit rather than a native function argument."
+  - rank: 5
+    priority: "P1"
+    title: "Normal-Demand (r,Q) with 95% Service-Level Constraint"
+    domain: "Inventory optimization"
+    upstream: "Stockpyl rq normal-demand approximations"
+    source: "Stockpyl official rq docs and tutorial."
+    license: "Stockpyl MIT; no data redistribution issue."
+    baseline: "r_q_eil_approximation, r_q_eoqss_approximation, or r_q_loss_function_approximation."
+    agent_interface: "solve(params) -> (r, Q)"
+    metrics: ["cost", "service_level_feasible", "valid"]
+    integration: "unified"
+    risk: "Approximation quality can drift around the service-level boundary."
+  - rank: 6
+    priority: "P1"
+    title: "FT10 Dispatching Rule Optimization"
+    domain: "Job shop scheduling"
+    upstream: "JobShopLib / Fisher-Thompson FT10"
+    source: "JobShopLib bundled ft10; original instance from Fisher and Thompson (1963)."
+    license: "JobShopLib MIT; original instance should preferably be loaded from the library or re-encoded carefully."
+    baseline: "CP-SAT or built-in dispatching rules."
+    agent_interface: "solve(instance) -> schedule"
+    metrics: ["makespan", "valid"]
+    integration: "unified"
+    risk: "Original instance licensing is less explicit than the code license."
+  - rank: 7
+    priority: "P1"
+    title: "LA16 Dispatching Rule Optimization"
+    domain: "Job shop scheduling"
+    upstream: "JobShopLib / Lawrence LA16"
+    source: "JobShopLib bundled la16; original instance from Lawrence (1984)."
+    license: "JobShopLib MIT; original instance should preferably be loaded from the library or re-encoded carefully."
+    baseline: "CP-SAT or built-in dispatching rules."
+    agent_interface: "solve(instance) -> schedule"
+    metrics: ["makespan", "valid"]
+    integration: "unified"
+    risk: "Same provenance caveat as FT10."
+  - rank: 8
+    priority: "P1"
+    title: "FT10 Neighborhood Move Selection"
+    domain: "Job shop scheduling"
+    upstream: "JobShopLib plus SA/CP-SAT"
+    source: "Canonical FT10 benchmark instance."
+    license: "JobShopLib MIT; instance provenance should be tied to JobShopLib or another canonical loader."
+    baseline: "Simulated annealing default neighborhood or CP-SAT optimal reference 930."
+    agent_interface: "choose_moves(state) -> move"
+    metrics: ["makespan", "improvement_over_baseline", "valid"]
+    integration: "unified"
+    risk: "The agent surface must stay narrow so the solver itself is not editable."
+  - rank: 9
+    priority: "P1"
+    title: "LA16 Neighborhood Move Selection"
+    domain: "Job shop scheduling"
+    upstream: "JobShopLib plus SA/CP-SAT"
+    source: "Canonical LA16 benchmark instance."
+    license: "JobShopLib MIT; instance provenance should be tied to JobShopLib or another canonical loader."
+    baseline: "Simulated annealing default neighborhood or CP-SAT optimal reference 945."
+    agent_interface: "choose_moves(state) -> move"
+    metrics: ["makespan", "improvement_over_baseline", "valid"]
+    integration: "unified"
+    risk: "Reproducibility depends on fixed random seeds and a pinned neighborhood set."
+  - rank: 10
+    priority: "P1"
+    title: "DuckDB TPC-H Materialization / Index Selection"
+    domain: "Database optimization"
+    upstream: "DuckDB plus official TPC-H extension"
+    source: "DuckDB official tpch extension with local CALL dbgen(sf=...)."
+    license: "DuckDB MIT; TPC-H data generated locally to avoid redistributing raw benchmark tables."
+    baseline: "No additional materialization or a simple rule-based config."
+    agent_interface: "solve(workload) -> config"
+    metrics: ["total_runtime", "correctness", "valid"]
+    integration: "unified"
+    risk: "Native DuckDB indexes may have limited impact, so the benchmark may work better around materialization or layout."
+  - rank: 11
+    priority: "P1"
+    title: "DuckDB TPC-H Query Rewriting"
+    domain: "Database optimization"
+    upstream: "DuckDB plus official TPC-H queries"
+    source: "DuckDB official TPC-H query set and local generated data."
+    license: "DuckDB MIT; data generated locally."
+    baseline: "Original SQL."
+    agent_interface: "rewrite(sql) -> sql"
+    metrics: ["runtime_ratio", "result_match", "valid"]
+    integration: "unified"
+    risk: "Semantic equivalence checks must be strict and deterministic."
+  - rank: 12
+    priority: "P1"
+    title: "DuckDB TPC-H Pre-Aggregation Selection"
+    domain: "Database optimization"
+    upstream: "DuckDB plus TPC-H"
+    source: "DuckDB official TPC-H extension with local generated data."
+    license: "DuckDB MIT; data generated locally."
+    baseline: "No pre-aggregation."
+    agent_interface: "solve(workload) -> ddl_plan"
+    metrics: ["total_runtime", "storage_overhead", "valid"]
+    integration: "unified"
+    risk: "The task can drift into schema design rather than benchmark-guided optimization if not scoped tightly."
+  - rank: 13
+    priority: "P1"
+    title: "2D Grid Obstacle-Avoiding Path Planning"
+    domain: "Motion planning"
+    upstream: "caelan/motion-planners"
+    source: "Official repo plus fixed-seed synthetic grid maps."
+    license: "MIT; synthetic maps have no redistribution issue."
+    baseline: "A* or RRT."
+    agent_interface: "plan_path(map, start, goal) -> path"
+    metrics: ["path_length", "success_rate", "runtime"]
+    integration: "unified"
+    risk: "Map authenticity is weaker than a canonical real-world benchmark."
+  - rank: 14
+    priority: "P1"
+    title: "2D Narrow-Passage Planning"
+    domain: "Motion planning"
+    upstream: "caelan/motion-planners"
+    source: "Official repo plus fixed-seed synthetic narrow-passage maps."
+    license: "MIT; synthetic maps have no redistribution issue."
+    baseline: "BiRRT or RRT*."
+    agent_interface: "plan_path(map, start, goal) -> path"
+    metrics: ["success_rate", "path_cost", "runtime"]
+    integration: "unified"
+    risk: "The benchmark is synthetic rather than tied to a public robotics map dataset."
+  - rank: 15
+    priority: "P1"
+    title: "Multi-Robot Prioritized Planning"
+    domain: "Motion planning"
+    upstream: "caelan/motion-planners"
+    source: "Official repo plus fixed-seed multi-agent grid instances."
+    license: "MIT; synthetic instances have no redistribution issue."
+    baseline: "Prioritized planning or independent A*."
+    agent_interface: "solve(map, agents) -> paths"
+    metrics: ["total_path_length", "collisions", "runtime"]
+    integration: "unified"
+    risk: "Collision checking and tie-breaking must remain deterministic."
+  - rank: 16
+    priority: "P2"
+    title: "Cantilever Compliance Topology Optimization"
+    domain: "Structural optimization"
+    upstream: "pyMOTO"
+    source: "pyMOTO official examples and standard FEM setup."
+    license: "MIT; example data can be redistributed with the repo."
+    baseline: "Official SIMP plus OC/MMA."
+    agent_interface: "solve(load_case) -> density"
+    metrics: ["compliance", "volume_fraction", "valid"]
+    integration: "unified"
+    risk: "Mesh size must be kept small for fast evaluation."
+  - rank: 17
+    priority: "P2"
+    title: "MBB Beam Topology Optimization"
+    domain: "Structural optimization"
+    upstream: "pyMOTO"
+    source: "pyMOTO official examples."
+    license: "MIT."
+    baseline: "Official SIMP plus OC/MMA."
+    agent_interface: "solve(load_case) -> density"
+    metrics: ["compliance", "volume_fraction", "valid"]
+    integration: "unified"
+    risk: "Filter and projection choices can dominate the result."
+  - rank: 18
+    priority: "P2"
+    title: "Bridge Topology Optimization"
+    domain: "Structural optimization"
+    upstream: "pyMOTO"
+    source: "pyMOTO official examples or standard textbook bridge setup."
+    license: "MIT."
+    baseline: "Official SIMP plus OC/MMA."
+    agent_interface: "solve(load_case) -> density"
+    metrics: ["compliance", "volume_fraction", "checkerboard_penalty"]
+    integration: "unified"
+    risk: "The canonical geometry must be frozen in the task spec."
+  - rank: 19
+    priority: "P2"
+    title: "Fuel-Minimizing Ship Weather Routing"
+    domain: "Maritime optimization"
+    upstream: "52North WeatherRoutingTool"
+    source: "Official WeatherRoutingTool algorithms and, if available, official demo weather fields; otherwise clearly labeled synthetic grids."
+    license: "MIT; synthetic weather grids avoid redistribution issues."
+    baseline: "Default shortest-route or official routing heuristic."
+    agent_interface: "solve(instance) -> waypoints"
+    metrics: ["fuel_cost", "travel_time", "constraint_violations"]
+    integration: "unified"
+    risk: "If official demo weather fields are not usable, the benchmark becomes synthetic."
+  - rank: 20
+    priority: "P2"
+    title: "Dynamic-Current Minimum-Time Routing"
+    domain: "Maritime optimization"
+    upstream: "HALEM"
+    source: "HALEM official repo and test/example current fields, or a clearly labeled synthetic current field."
+    license: "MIT."
+    baseline: "HALEM_time."
+    agent_interface: "solve(instance) -> route"
+    metrics: ["travel_time", "valid"]
+    integration: "unified"
+    risk: "Need to confirm that official example fields are light enough to vendor or reproduce."
+  - rank: 21
+    priority: "P2"
+    title: "Depth-Constrained Cost-Minimizing Routing"
+    domain: "Maritime optimization"
+    upstream: "HALEM"
+    source: "HALEM official repo and example current/depth fields."
+    license: "MIT."
+    baseline: "HALEM_cost or HALEM_co2."
+    agent_interface: "solve(instance) -> route"
+    metrics: ["route_cost", "grounding_violations", "valid"]
+    integration: "unified"
+    risk: "Any transformation into local arrays must be documented carefully."
+  - rank: 22
+    priority: "P2"
+    title: "Intraday Operation with Storage"
+    domain: "Power systems"
+    upstream: "PyPSA"
+    source: "PyPSA official example networks."
+    license: "MIT; official example networks are preferred."
+    baseline: "PyPSA default linear optimization."
+    agent_interface: "solve(network) -> dispatch"
+    metrics: ["opex", "curtailment", "valid"]
+    integration: "unified"
+    risk: "Linopy/HiGHS runtime stack is heavier than pure Python tasks."
+  - rank: 23
+    priority: "P2"
+    title: "Transmission Expansion Planning"
+    domain: "Power systems"
+    upstream: "PyPSA"
+    source: "PyPSA official example networks."
+    license: "MIT."
+    baseline: "PyPSA default capacity expansion model."
+    agent_interface: "solve(network) -> expansion_plan"
+    metrics: ["total_system_cost", "unmet_demand", "valid"]
+    integration: "unified"
+    risk: "Time horizon must be reduced to keep runtime bounded."
+  - rank: 24
+    priority: "P2"
+    title: "Renewable Siting plus Line Expansion"
+    domain: "Power systems"
+    upstream: "PyPSA"
+    source: "PyPSA official example networks or a clearly documented toy network derived from them."
+    license: "MIT."
+    baseline: "PyPSA default optimization."
+    agent_interface: "solve(network) -> plan"
+    metrics: ["capex_opex_total", "renewable_share", "valid"]
+    integration: "unified"
+    risk: "A derived toy case requires explicit provenance and transformation notes."
+  - rank: 25
+    priority: "P2"
+    title: "Single-Day Multi-Energy Scheduling"
+    domain: "Integrated energy systems"
+    upstream: "MESMO"
+    source: "MESMO official example cases and official repo data."
+    license: "MIT; official examples preferred."
+    baseline: "MESMO or CVXPY default model."
+    agent_interface: "solve(instance) -> schedule"
+    metrics: ["total_cost", "feasibility", "valid"]
+    integration: "unified"
+    risk: "CVXPY and solver dependencies are relatively heavy."
+  - rank: 26
+    priority: "P2"
+    title: "Joint Investment plus Operations for Heat Pump and Storage"
+    domain: "Integrated energy systems"
+    upstream: "MESMO"
+    source: "MESMO official example case or a clearly documented reduced derivative."
+    license: "MIT."
+    baseline: "MESMO default optimization."
+    agent_interface: "solve(instance) -> design_and_ops"
+    metrics: ["lifecycle_cost", "capacity_feasible", "valid"]
+    integration: "unified"
+    risk: "Investment variables increase runtime significantly."
+  - rank: 27
+    priority: "P2"
+    title: "SnAr Reaction Condition Optimization"
+    domain: "Chemical process optimization"
+    upstream: "Summit"
+    source: "Summit official SnarBenchmark and associated benchmark paper."
+    license: "Summit MIT; benchmark is a simulator, so no external data redistribution issue."
+    baseline: "SOBO or Nelder-Mead."
+    agent_interface: "solve(experiment) -> conditions"
+    metrics: ["best_objective", "budget_efficiency", "valid"]
+    integration: "unified"
+    risk: "Budget and random seed must be fixed tightly."
+  - rank: 28
+    priority: "P2"
+    title: "SnAr Multi-Objective Optimization"
+    domain: "Chemical process optimization"
+    upstream: "Summit"
+    source: "Summit official benchmark suite."
+    license: "MIT."
+    baseline: "Scalarized official strategy such as SOBO."
+    agent_interface: "solve(experiment) -> trial_sequence"
+    metrics: ["best_scalarized_score", "pareto_hypervolume", "valid"]
+    integration: "unified"
+    risk: "Multi-objective scoring must be frozen before implementation."
+  - rank: 29
+    priority: "P2"
+    title: "EV Price-Arbitrage Charging"
+    domain: "EV charging optimization"
+    upstream: "EV2Gym"
+    source: "EV2Gym official example configs and open-source data references."
+    license: "MIT; chosen config and any bundled data must be checked case by case."
+    baseline: "ChargeAsFastAsPossible or official MPC/oracle heuristic."
+    agent_interface: "solve(env) -> actions"
+    metrics: ["energy_cost", "user_satisfaction", "valid"]
+    integration: "unified"
+    risk: "Need to confirm the specific config data can be redistributed or reconstructed locally."
+  - rank: 30
+    priority: "P2"
+    title: "EV Scheduling with Overload and Voltage Constraints"
+    domain: "EV charging optimization"
+    upstream: "EV2Gym"
+    source: "EV2Gym official configs."
+    license: "MIT; config provenance must be pinned."
+    baseline: "Official MPC or heuristic."
+    agent_interface: "solve(env) -> actions"
+    metrics: ["cost", "overload_penalty", "voltage_penalty"]
+    integration: "unified"
+    risk: "The grid proxy must be made stable enough for repeated evaluation."
+  - rank: 31
+    priority: "P2"
+    title: "Small Reservoir Network Cost Minimization"
+    domain: "Water resources optimization"
+    upstream: "CALVIN"
+    source: "CALVIN official site, official GitHub, and official example data lineage."
+    license: "CALVIN code MIT; full data redistribution needs separate confirmation."
+    baseline: "Pyomo MILP."
+    agent_interface: "solve(instance) -> policy"
+    metrics: ["total_cost", "shortage_penalty", "valid"]
+    integration: "unified"
+    risk: "Official data is large and externally hosted, so a smaller benchmark likely needs a carefully documented derivative."
+  - rank: 32
+    priority: "P2"
+    title: "Drought-Scenario Water Allocation"
+    domain: "Water resources optimization"
+    upstream: "CALVIN"
+    source: "CALVIN official model and published problem framing; likely a reduced synthetic or derived network."
+    license: "CALVIN code MIT; derived data must be labeled clearly."
+    baseline: "Pyomo MILP."
+    agent_interface: "solve(instance) -> allocation"
+    metrics: ["total_cost", "shortage", "reservoir_violations"]
+    integration: "unified"
+    risk: "It must not be misrepresented as official CALVIN data if the network is reduced or synthetic."
+  - rank: 33
+    priority: "P2"
+    title: "Markowitz Minimum Variance with Return Floor"
+    domain: "Portfolio optimization"
+    upstream: "PyPortfolioOpt"
+    source: "PyPortfolioOpt repo test returns or another fixed local returns matrix; upstream market-data provenance is relatively weak."
+    license: "Code MIT; market data licensing requires extra verification."
+    baseline: "EfficientFrontier."
+    agent_interface: "solve(returns) -> weights"
+    metrics: ["annual_return", "volatility", "sharpe", "max_drawdown"]
+    integration: "unified"
+    risk: "Data authenticity is weaker than canonical academic benchmarks unless a stronger source is chosen."
+  - rank: 34
+    priority: "P2"
+    title: "Maximum Sharpe Static Portfolio"
+    domain: "Portfolio optimization"
+    upstream: "PyPortfolioOpt"
+    source: "Same as above; stronger if replaced by a licensed market snapshot or clearly labeled synthetic returns."
+    license: "Code MIT; any real market data requires separate license review."
+    baseline: "max_sharpe()."
+    agent_interface: "solve(returns) -> weights"
+    metrics: ["sharpe", "turnover_vs_baseline", "valid"]
+    integration: "unified"
+    risk: "Weak data provenance if using ad hoc historical prices."
+  - rank: 35
+    priority: "P2"
+    title: "Rebalancing with Transaction Cost and Leverage Limits"
+    domain: "Portfolio optimization"
+    upstream: "PyPortfolioOpt"
+    source: "Fixed returns matrix from official tests or synthetic data unless a licensed market dataset is selected."
+    license: "Code MIT; real price data needs explicit license verification."
+    baseline: "EfficientFrontier plus transaction-cost objective."
+    agent_interface: "solve(returns, w0) -> weights"
+    metrics: ["net_sharpe", "turnover", "max_leverage_violation"]
+    integration: "unified"
+    risk: "This task is highly exposed to data provenance weakness."
+  - rank: 36
+    priority: "P2"
+    title: "Small Fab Dispatch Rule Composition"
+    domain: "Manufacturing scheduling"
+    upstream: "SimRLFab"
+    source: "SimRLFab official default semiconductor config and related paper."
+    license: "MIT; bundled configs appear open."
+    baseline: "FIFO, SPT, or LPT."
+    agent_interface: "schedule_fab(state) -> priority"
+    metrics: ["throughput", "avg_flow_time", "valid"]
+    integration: "bespoke wrapper"
+    risk: "The simulator stack is heavier than typical single-file tasks."
+  - rank: 37
+    priority: "P2"
+    title: "Urgent-Order Multi-Objective Dispatching"
+    domain: "Manufacturing scheduling"
+    upstream: "SimRLFab"
+    source: "Official SimRLFab config plus explicit local reward shaping."
+    license: "MIT."
+    baseline: "FIFO, SPT, or LPT."
+    agent_interface: "schedule_fab(state) -> priority"
+    metrics: ["weighted_flow_time_wip_tardiness", "valid"]
+    integration: "bespoke wrapper"
+    risk: "Reward design can easily make the benchmark unstable."
+  - rank: 38
+    priority: "P2"
+    title: "3D Urban Drone Obstacle-Avoiding Path Planning"
+    domain: "Drone path planning"
+    upstream: "martin0004/drone_path_planning"
+    source: "Official repo plus San Francisco colliders.csv lineage from the Udacity FCND project."
+    license: "Repo contains LICENSE.txt; map-data lineage should be documented explicitly."
+    baseline: "Repo RRT implementation."
+    agent_interface: "plan_3d_path(terrain, start, goal) -> path"
+    metrics: ["path_length", "collisions", "runtime"]
+    integration: "unified"
+    risk: "Provenance is a lineage chain rather than a formal canonical benchmark."
+  - rank: 39
+    priority: "P2"
+    title: "Energy-Penalized 3D Path Planning"
+    domain: "Drone path planning"
+    upstream: "martin0004/drone_path_planning"
+    source: "Same lineage as above or a clearly labeled synthetic terrain."
+    license: "Same caveat as above."
+    baseline: "3D A* or RRT."
+    agent_interface: "plan_3d_path(terrain, start, goal) -> path"
+    metrics: ["path_length_plus_height_penalty", "valid"]
+    integration: "unified"
+    risk: "If changed to synthetic terrain, the benchmark authenticity drops further."
+  - rank: 40
+    priority: "P2"
+    title: "UAV Formation Convergence Control"
+    domain: "Multi-agent control"
+    upstream: "CoFlyers"
+    source: "CoFlyers official Vasarhelyi example and repo configs."
+    license: "GPL-3.0; MATLAB/Simulink dependencies are significant."
+    baseline: "Official Vasarhelyi algorithm."
+    agent_interface: "solve(state) -> desired_velocities"
+    metrics: ["convergence_time", "collisions", "control_energy"]
+    integration: "bespoke wrapper"
+    risk: "GPL and MATLAB dependencies make integration expensive."
+  - rank: 41
+    priority: "P3"
+    title: "HALE Aircraft Design Optimization"
+    domain: "Aerospace design"
+    upstream: "DawnDesignTool"
+    source: "Official repo design_opt.py and related paper; assumptions come from the model rather than a public benchmark dataset."
+    license: "MIT."
+    baseline: "Official design_opt.py flow."
+    agent_interface: "solve_design(spec) -> design"
+    metrics: ["weight", "cruise_power", "constraint_feasible"]
+    integration: "unified"
+    risk: "Official model provenance is fine, but not the same as a canonical public benchmark instance."
+  - rank: 42
+    priority: "P3"
+    title: "Small Imaging-System Optical Optimization"
+    domain: "Optical design"
+    upstream: "Optiland"
+    source: "Optiland official examples; avoid hidden dependency on external materials databases."
+    license: "MIT; any external glass/material database needs separate terms review."
+    baseline: "Official optimizer and merit function."
+    agent_interface: "solve(system) -> lens_params"
+    metrics: ["mtf_loss", "aberration_score", "valid"]
+    integration: "unified"
+    risk: "Material-database provenance can become muddy quickly."
+  - rank: 43
+    priority: "P3"
+    title: "Small-Molecule Force-Field Parameter Fitting"
+    domain: "Molecular simulation"
+    upstream: "OpenFF Toolkit / BespokeFit"
+    source: "OpenFF official examples and BespokeFit paper; actual QM reference set still needs a canonical selection."
+    license: "MIT."
+    baseline: "OpenFF recommended parameterization flow."
+    agent_interface: "fit_ff_params(dataset) -> params"
+    metrics: ["energy_rmse", "force_rmse", "valid"]
+    integration: "bespoke wrapper"
+    risk: "The hard provenance question lies in the QM reference set, not the toolkit code."
+  - rank: 44
+    priority: "P3"
+    title: "MD Parameter Performance Tuning"
+    domain: "Molecular simulation"
+    upstream: "OpenFF Toolkit plus OpenMM ecosystem"
+    source: "OpenFF official example molecules; not a canonical performance benchmark."
+    license: "OpenFF MIT; additional dependency terms vary."
+    baseline: "Default integrator and cutoff settings."
+    agent_interface: "tune_md(system) -> md_config"
+    metrics: ["throughput", "energy_stability", "valid"]
+    integration: "bespoke wrapper"
+    risk: "Results are highly hardware- and version-sensitive."
+  - rank: 45
+    priority: "P3"
+    title: "Multiple Sequence Alignment Quality-Time Tradeoff"
+    domain: "Bioinformatics"
+    upstream: "Sequoya"
+    source: "Sequoya official repo and paper; a serious benchmark should instead pin a canonical subset such as BAliBASE."
+    license: "Code MIT; benchmark-dataset terms must be checked separately."
+    baseline: "Sequoya default parameters."
+    agent_interface: "align(seqs) -> alignment"
+    metrics: ["alignment_score", "runtime", "valid"]
+    integration: "bespoke wrapper"
+    risk: "Benchmark value depends more on dataset provenance than on the code package."
+  - rank: 46
+    priority: "P3"
+    title: "Gap-Penalty-Sensitive MSA Optimization"
+    domain: "Bioinformatics"
+    upstream: "Sequoya"
+    source: "Same as above; should wait for a verified benchmark dataset choice."
+    license: "Code MIT; dataset terms unresolved."
+    baseline: "Default NSGA-II or M2Align-style configuration."
+    agent_interface: "align(seqs) -> alignment"
+    metrics: ["sp_score", "tc_score", "gap_score", "runtime"]
+    integration: "bespoke wrapper"
+    risk: "Dataset provenance and scoring setup would dominate the implementation effort."
+  - rank: 47
+    priority: "P3"
+    title: "Additive-Manufacturing Differentiable Simulation Optimization"
+    domain: "Manufacturing simulation"
+    upstream: "differentiable-simulation-am"
+    source: "Official repo notebooks and bundled data."
+    license: "MIT."
+    baseline: "Default gradient-descent setup from the paper and repo."
+    agent_interface: "solve(params0) -> params"
+    metrics: ["best_loss", "sim_calls", "valid"]
+    integration: "unified"
+    risk: "The repo is notebook-heavy and needs stabilization into a clean evaluator."
+  - rank: 48
+    priority: "P3"
+    title: "Offline Driving Path and Behavior Planning"
+    domain: "Autonomous driving"
+    upstream: "CARLA"
+    source: "CARLA official simulator and official maps/assets exported offline."
+    license: "Code MIT; assets CC-BY; Unreal-related terms add complexity."
+    baseline: "IDM plus rule-based lane changes."
+    agent_interface: "solve(state) -> controls"
+    metrics: ["collisions", "avg_speed", "brake_events", "fuel_proxy"]
+    integration: "bespoke wrapper"
+    risk: "Asset and engine licensing complexity makes this a poor first-batch candidate."
+  - rank: 49
+    priority: "P3"
+    title: "Data-Center Scheduling plus Cooling Control"
+    domain: "Data-center optimization"
+    upstream: "Hewlett Packard dc-rl"
+    source: "Official repo environments and examples."
+    license: "Mixed MIT and CC BY-NC 4.0 terms."
+    baseline: "Fixed MARL or heuristic policy."
+    agent_interface: "solve(state) -> actions"
+    metrics: ["energy", "peak_power", "temp_violations"]
+    integration: "bespoke wrapper"
+    risk: "Mixed licensing and a heavy environment stack."
+  - rank: 50
+    priority: "P3"
+    title: "Optical-System Lightweighting"
+    domain: "Optical design"
+    upstream: "Optiland"
+    source: "Optiland official examples; any external glass/material database must be cited separately."
+    license: "MIT for code; external material-data terms may differ."
+    baseline: "Official optimizer with thickness and quality constraints."
+    agent_interface: "solve(system) -> lens_params"
+    metrics: ["total_thickness", "image_quality", "valid"]
+    integration: "unified"
+    risk: "The code is relatively clean, but materials provenance still needs careful documentation."
+sources:
+  - name: "Stockpyl GitHub"
+    url: "https://github.com/LarrySnyder/stockpyl"
+  - name: "Stockpyl EOQ docs"
+    url: "https://stockpyl.readthedocs.io/en/latest/api/seio/eoq.html"
+  - name: "Stockpyl RQ docs"
+    url: "https://stockpyl.readthedocs.io/en/latest/api/seio/rq.html"
+  - name: "Stockpyl single-echelon tutorial"
+    url: "https://stockpyl.readthedocs.io/en/latest/tutorial/tutorial_seio.html"
+  - name: "JobShopLib"
+    url: "https://github.com/Pabloo22/job_shop_lib"
+  - name: "Job Shop Scheduling Benchmark Environments"
+    url: "https://github.com/ai-for-decision-making-tue/Job_Shop_Scheduling_Benchmark_Environments_and_Instances"
+  - name: "DuckDB TPC-H extension"
+    url: "https://duckdb.org/docs/stable/core_extensions/tpch"
+  - name: "DuckDB benchmark docs"
+    url: "https://duckdb.org/docs/1.3/guides/performance/benchmarks.html"
+  - name: "motion-planners"
+    url: "https://github.com/caelan/motion-planners"
+  - name: "pyMOTO"
+    url: "https://github.com/aatmdelissen/pyMOTO"
+  - name: "WeatherRoutingTool"
+    url: "https://github.com/52North/WeatherRoutingTool"
+  - name: "HALEM"
+    url: "https://github.com/TUDelft-CITG/halem"
+  - name: "PyPSA"
+    url: "https://github.com/PyPSA/PyPSA"
+  - name: "MESMO"
+    url: "https://github.com/mesmo-dev/mesmo"
+  - name: "Summit"
+    url: "https://github.com/sustainable-processes/summit"
+  - name: "EV2Gym"
+    url: "https://github.com/StavrosOrf/EV2Gym"
+  - name: "CALVIN official site"
+    url: "https://calvin.ucdavis.edu/"
+  - name: "CALVIN GitHub"
+    url: "https://github.com/ucd-cws/calvin"
+  - name: "PyPortfolioOpt"
+    url: "https://github.com/PyPortfolio/PyPortfolioOpt"
+  - name: "SimRLFab"
+    url: "https://github.com/AndreasKuhnle/SimRLFab"
+  - name: "drone_path_planning"
+    url: "https://github.com/martin0004/drone_path_planning"
+  - name: "CoFlyers"
+    url: "https://github.com/micros-uav/CoFlyers"
+  - name: "DawnDesignTool"
+    url: "https://github.com/peterdsharpe/DawnDesignTool"
+  - name: "Optiland"
+    url: "https://github.com/HarrisonKramer/optiland"
+  - name: "OpenFF Toolkit"
+    url: "https://github.com/openforcefield/openff-toolkit"
+  - name: "OpenFF BespokeFit"
+    url: "https://github.com/openforcefield/openff-bespokefit"
+  - name: "Sequoya GitHub"
+    url: "https://github.com/benhid/Sequoya"
+  - name: "Sequoya paper"
+    url: "https://academic.oup.com/bioinformatics/article-abstract/36/12/3892/5823295"
+  - name: "differentiable-simulation-am"
+    url: "https://github.com/mojtabamozaffar/differentiable-simulation-am"
+  - name: "CARLA"
+    url: "https://github.com/carla-simulator/carla"
+  - name: "dc-rl"
+    url: "https://github.com/HewlettPackard/dc-rl"
diff --git a/docs/benchmark_ideas/frontier_first20_quality_audit.md b/docs/benchmark_ideas/frontier_first20_quality_audit.md
new file mode 100644
index 00000000..e9ffa6a4
--- /dev/null
+++ b/docs/benchmark_ideas/frontier_first20_quality_audit.md
@@ -0,0 +1,110 @@
+# First 20 Benchmarks Quality Audit
+
+Updated on 2026-03-13.
+
+## Scope
+
+This audit covers the first 20 benchmark ideas that were implemented under `benchmarks/`.
+
+The quality gate used here follows the current `frontier-benchmark-contributor` skill:
+
+1. Direct evaluator must return a finite score with `valid=1`.
+2. Agent-visible files must not leak hidden reference answers.
+3. Data provenance must be explicit and traceable.
+4. `eval_single.sh` with 10 evolution steps is the second-line sanity check, but only when a working LLM credential is available in the current environment.
+
+## Summary
+
+- Direct evaluator pass rate: `20 / 20`
+- Benchmarks with `valid=1.0`: `20 / 20`
+- Reference-answer leakage fixed: `4 / 4`
+- Inventory provenance manifests added: `5 / 5`
+- Baseline/init scoring bias fixed: `DuckDBPreAggregationSelection`
+- `eval_single.sh` 10-step sweep: blocked by current OpenRouter credential failure (`401 User not found`)
+
+## Fixed Issues
+
+### 1. Hidden reference solution leakage
+
+The following tasks originally exposed complete reference solutions (full routes and paths) through agent-visible `runtime/problem.py` files. The fix keeps only scalar reference metrics in the public runtime.
+
+- `OperationsResearch/FuelMinimizingShipWeatherRouting`
+- `OperationsResearch/DynamicCurrentMinimumTimeRouting`
+- `Robotics/GridPathPlanningWithObstacles`
+- `Robotics/NarrowPassagePlanning`
+
+### 2. Missing provenance manifests
+
+The following tasks originally lacked `references/source_manifest.md`. Each now has a provenance manifest, and source notes were added to the task docs.
+
+- `OperationsResearch/EOQWithMinimumOrderQuantity`
+- `OperationsResearch/EOQWithAllUnitsDiscounts`
+- `OperationsResearch/EOQWithIncrementalDiscounts`
+- `OperationsResearch/PoissonRQServiceLevel`
+- `OperationsResearch/NormalRQServiceLevel95`
+
+### 3. Baseline-equivalence scoring bias
+
+`ComputerSystems/DuckDBPreAggregationSelection` originally penalized the init candidate even when it matched the baseline design. The runtime helper now special-cases the no-preaggregation path so baseline-equivalent candidates score exactly `1.0`.
+
+## Direct Evaluator Results
+
+| Benchmark | Combined Score | Valid |
+| --- | ---: | ---: |
+| `OperationsResearch/EOQWithMinimumOrderQuantity` | `1.000000` | `1.0` |
+| `OperationsResearch/EOQWithAllUnitsDiscounts` | `1.000000` | `1.0` |
+| `OperationsResearch/EOQWithIncrementalDiscounts` | `1.000000` | `1.0` |
+| `OperationsResearch/PoissonRQServiceLevel` | `1.000000` | `1.0` |
+| `OperationsResearch/NormalRQServiceLevel95` | `1.000000` | `1.0` |
+| `OperationsResearch/FT10DispatchingRuleOptimization` | `0.865922` | `1.0` |
+| `OperationsResearch/LA16DispatchingRuleOptimization` | `0.747036` | `1.0` |
+| `OperationsResearch/FT10NeighborhoodMoveSelection` | `0.914454` | `1.0` |
+| `OperationsResearch/LA16NeighborhoodMoveSelection` | `0.859873` | `1.0` |
+| `ComputerSystems/DuckDBIndexSelection` | `0.998004` | `1.0` |
+| `ComputerSystems/DuckDBQueryRewrite` | `1.000472` | `1.0` |
+| `ComputerSystems/DuckDBPreAggregationSelection` | `1.000000` | `1.0` |
+| `Robotics/GridPathPlanningWithObstacles` | `0.875000` | `1.0` |
+| `Robotics/NarrowPassagePlanning` | `0.944444` | `1.0` |
+| `Robotics/MultiRobotPrioritizedPlanning` | `1.000000` | `1.0` |
+| `StructuralOptimization/CantileverComplianceTopologyOptimization` | `1.000000` | `1.0` |
+| `StructuralOptimization/MBBBeamTopologyOptimization` | `1.000000` | `1.0` |
+| `StructuralOptimization/BridgeTopologyOptimization` | `1.000000` | `1.0` |
+| `OperationsResearch/FuelMinimizingShipWeatherRouting` | `0.402275` | `1.0` |
+| `OperationsResearch/DynamicCurrentMinimumTimeRouting` | `0.619596` | `1.0` |
+
+## `eval_single.sh` Status
+
+Attempted command shape:
+
+```bash
+PYTHON_BIN=/mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python \
+./eval_single.sh \
+  task=unified \
+  task.benchmark=OperationsResearch/EOQWithMinimumOrderQuantity \
+  task.runtime.use_conda_run=false \
+  task.runtime.python_path=/mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
+```
+
+Observed blocker:
+
+- The run starts correctly.
+- `frontier_eval` loads `.env`.
+- The configured provider is `https://openrouter.ai/api/v1`.
+- Candidate generation fails immediately with `401 User not found`.
+
+Because of this provider-side authentication failure, the 10-step improvement check cannot yet serve as a benchmark quality signal in the current environment.
+
+## Current Conclusion
+
+The first 20 benchmarks now pass the structural quality gate:
+
+- They produce valid direct-evaluator scores.
+- The identified answer-leakage issues have been removed from the four affected tasks.
+- Provenance manifests now exist for the five inventory tasks that previously lacked them.
+- Known scoring asymmetry in DuckDB pre-aggregation is fixed.
+
+The remaining open gate is operational rather than benchmark-specific:
+
+- restore a working LLM credential for `frontier_eval`
+- then run `eval_single.sh` across the 20 tasks
+- then record which tasks show reliable improvement signal within 10 steps
diff --git a/docs/benchmark_ideas/frontier_first20_status.md b/docs/benchmark_ideas/frontier_first20_status.md
new file mode 100644
index 00000000..337add05
--- /dev/null
+++ b/docs/benchmark_ideas/frontier_first20_status.md
@@ -0,0 +1,49 @@
+# First 20 Ideas Implementation Status
+
+Updated on 2026-03-13.
+
+See also: `docs/benchmark_ideas/frontier_first20_quality_audit.md` for the consolidated integrity and scoring audit across these tasks.
+
+## Implemented and Smoke-Tested
+
+These benchmarks have been created and have passed direct evaluator runs. `EOQWithMinimumOrderQuantity`, `CantileverComplianceTopologyOptimization`, `MultiRobotPrioritizedPlanning`, `DuckDBQueryRewrite`, and `FuelMinimizingShipWeatherRouting` also passed `frontier_eval task=unified` smoke runs when forced to use the repository `.venv` Python.
+
+Under `benchmarks/OperationsResearch/`:
+
+1. `EOQWithMinimumOrderQuantity`
+2. `EOQWithAllUnitsDiscounts`
+3. `EOQWithIncrementalDiscounts`
+4. `PoissonRQServiceLevel`
+5. `NormalRQServiceLevel95`
+6. `FT10DispatchingRuleOptimization`
+7. `LA16DispatchingRuleOptimization`
+8. `FT10NeighborhoodMoveSelection`
+9. `LA16NeighborhoodMoveSelection`
+19. `FuelMinimizingShipWeatherRouting`
+20. `DynamicCurrentMinimumTimeRouting`
+
+Under `benchmarks/StructuralOptimization/`:
+
+16. `CantileverComplianceTopologyOptimization`
+17. `MBBBeamTopologyOptimization`
+18. `BridgeTopologyOptimization`
+
+Under `benchmarks/Robotics/`:
+
+13. `GridPathPlanningWithObstacles`
+14. `NarrowPassagePlanning`
+15. `MultiRobotPrioritizedPlanning`
+
+Under `benchmarks/ComputerSystems/`:
+
+10. `DuckDBIndexSelection`
+11. `DuckDBQueryRewrite`
+12. `DuckDBPreAggregationSelection`
+
+Notes:
+
+- JSSP benchmark instance data has been cloned from `job_shop_lib` into `/tmp/job_shop_lib`, and the canonical JSON payload is available locally.
+- `pymoto` is installed in the repository `.venv`.
+- `duckdb` is installed in the repository `.venv`.
+- The DuckDB benchmarks intentionally use a benchmark-local deterministic SQL-generated workload with DuckDB/TPC-H schema lineage. They do not claim to redistribute official `dbgen` output.
+- The maritime routing benchmarks intentionally use benchmark-local synthetic coastal fields with official WeatherRoutingTool / HALEM algorithm lineage. They do not claim to redistribute official hydrographic or weather rasters.
diff --git a/frontier_eval/conf/batch/frontier_first20_shinkaevolve5.yaml b/frontier_eval/conf/batch/frontier_first20_shinkaevolve5.yaml
new file mode 100644
index 00000000..5b7be77e
--- /dev/null
+++ b/frontier_eval/conf/batch/frontier_first20_shinkaevolve5.yaml
@@ -0,0 +1,38 @@
+version: 1
+
+tasks:
+  - inventory_minimum_order_quantity
+  - inventory_all_units_discounts
+  - inventory_incremental_discounts
+  - inventory_poisson_service_level
+  - inventory_normal_service_level_95
+  - fisher_thompson_instance_10_dispatch_rule
+  - lawrence_instance_16_dispatch_rule
+  - fisher_thompson_instance_10_neighborhood_moves
+  - lawrence_instance_16_neighborhood_moves
+  - analytical_database_index_selection
+  - analytical_database_query_rewrite
+  - analytical_database_pre_aggregation_selection
+  - grid_obstacle_path_planning
+  - narrow_passage_path_planning
+  - multi_robot_priority_planning
+  - cantilever_topology_optimization
+  - half_beam_topology_optimization
+  - bridge_topology_optimization
+  - ship_weather_routing
+  - dynamic_current_routing
+
+algorithms:
+  - name: shinkaevolve
+    overrides:
+      - algorithm.max_generations=5
+
+common_overrides:
+  - llm.temperature=0.7
+  - llm.timeout=60
+
+run:
+  name: frontier_first20_shinkaevolve5
+  base_dir: runs/batch
+  max_parallel: 4
+  fail_fast: false
diff --git a/frontier_eval/conf/task/analytical_database_index_selection.yaml b/frontier_eval/conf/task/analytical_database_index_selection.yaml
new file mode 100644
index 00000000..7d8775a1
--- /dev/null
+++ b/frontier_eval/conf/task/analytical_database_index_selection.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: ComputerSystems/DuckDBIndexSelection
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/analytical_database_pre_aggregation_selection.yaml b/frontier_eval/conf/task/analytical_database_pre_aggregation_selection.yaml
new file mode 100644
index 00000000..d21dc947
--- /dev/null
+++ b/frontier_eval/conf/task/analytical_database_pre_aggregation_selection.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: ComputerSystems/DuckDBPreAggregationSelection
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/analytical_database_query_rewrite.yaml b/frontier_eval/conf/task/analytical_database_query_rewrite.yaml
new file mode 100644
index 00000000..d6de56fb
--- /dev/null
+++ b/frontier_eval/conf/task/analytical_database_query_rewrite.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: ComputerSystems/DuckDBQueryRewrite
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/bridge_topology_optimization.yaml b/frontier_eval/conf/task/bridge_topology_optimization.yaml
new file mode 100644
index 00000000..d05e6049
--- /dev/null
+++ b/frontier_eval/conf/task/bridge_topology_optimization.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: StructuralOptimization/BridgeTopologyOptimization
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/cantilever_topology_optimization.yaml b/frontier_eval/conf/task/cantilever_topology_optimization.yaml
new file mode 100644
index 00000000..c24c837f
--- /dev/null
+++ b/frontier_eval/conf/task/cantilever_topology_optimization.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: StructuralOptimization/CantileverComplianceTopologyOptimization
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/clifford_t_synthesis.yaml b/frontier_eval/conf/task/clifford_t_synthesis.yaml
new file mode 100644
index 00000000..6afb9521
--- /dev/null
+++ b/frontier_eval/conf/task/clifford_t_synthesis.yaml
@@ -0,0 +1,6 @@
+name: unified
+benchmark: QuantumComputing/task_02_clifford_t_synthesis
+metadata_dir: frontier_eval
+
+runtime:
+  conda_env: mqt
diff --git a/frontier_eval/conf/task/cross_target_qaoa.yaml b/frontier_eval/conf/task/cross_target_qaoa.yaml
new file mode 100644
index 00000000..47182219
--- /dev/null
+++ b/frontier_eval/conf/task/cross_target_qaoa.yaml
@@ -0,0 +1,6 @@
+name: unified
+benchmark: QuantumComputing/task_03_cross_target_qaoa
+metadata_dir: frontier_eval
+
+runtime:
+  conda_env: mqt
diff --git a/frontier_eval/conf/task/dynamic_current_routing.yaml b/frontier_eval/conf/task/dynamic_current_routing.yaml
new file mode 100644
index 00000000..244a5dce
--- /dev/null
+++ b/frontier_eval/conf/task/dynamic_current_routing.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: OperationsResearch/DynamicCurrentMinimumTimeRouting
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/fisher_thompson_instance_10_dispatch_rule.yaml b/frontier_eval/conf/task/fisher_thompson_instance_10_dispatch_rule.yaml
new file mode 100644
index 00000000..6b2ebf39
--- /dev/null
+++ b/frontier_eval/conf/task/fisher_thompson_instance_10_dispatch_rule.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: OperationsResearch/FT10DispatchingRuleOptimization
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/fisher_thompson_instance_10_neighborhood_moves.yaml b/frontier_eval/conf/task/fisher_thompson_instance_10_neighborhood_moves.yaml
new file mode 100644
index 00000000..e7685248
--- /dev/null
+++ b/frontier_eval/conf/task/fisher_thompson_instance_10_neighborhood_moves.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: OperationsResearch/FT10NeighborhoodMoveSelection
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/grid_obstacle_path_planning.yaml b/frontier_eval/conf/task/grid_obstacle_path_planning.yaml
new file mode 100644
index 00000000..87c7a5bf
--- /dev/null
+++ b/frontier_eval/conf/task/grid_obstacle_path_planning.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: Robotics/GridPathPlanningWithObstacles
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/half_beam_topology_optimization.yaml b/frontier_eval/conf/task/half_beam_topology_optimization.yaml
new file mode 100644
index 00000000..5487bb9b
--- /dev/null
+++ b/frontier_eval/conf/task/half_beam_topology_optimization.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: StructuralOptimization/MBBBeamTopologyOptimization
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/inventory_all_units_discounts.yaml b/frontier_eval/conf/task/inventory_all_units_discounts.yaml
new file mode 100644
index 00000000..a4241267
--- /dev/null
+++ b/frontier_eval/conf/task/inventory_all_units_discounts.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: OperationsResearch/EOQWithAllUnitsDiscounts
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/inventory_incremental_discounts.yaml b/frontier_eval/conf/task/inventory_incremental_discounts.yaml
new file mode 100644
index 00000000..e95dd56c
--- /dev/null
+++ b/frontier_eval/conf/task/inventory_incremental_discounts.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: OperationsResearch/EOQWithIncrementalDiscounts
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/inventory_minimum_order_quantity.yaml b/frontier_eval/conf/task/inventory_minimum_order_quantity.yaml
new file mode 100644
index 00000000..a46ce2ac
--- /dev/null
+++ b/frontier_eval/conf/task/inventory_minimum_order_quantity.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: OperationsResearch/EOQWithMinimumOrderQuantity
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/inventory_normal_service_level_95.yaml b/frontier_eval/conf/task/inventory_normal_service_level_95.yaml
new file mode 100644
index 00000000..b9475207
--- /dev/null
+++ b/frontier_eval/conf/task/inventory_normal_service_level_95.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: OperationsResearch/NormalRQServiceLevel95
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/inventory_poisson_service_level.yaml b/frontier_eval/conf/task/inventory_poisson_service_level.yaml
new file mode 100644
index 00000000..3964ce49
--- /dev/null
+++ b/frontier_eval/conf/task/inventory_poisson_service_level.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: OperationsResearch/PoissonRQServiceLevel
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/lawrence_instance_16_dispatch_rule.yaml b/frontier_eval/conf/task/lawrence_instance_16_dispatch_rule.yaml
new file mode 100644
index 00000000..e8b9af75
--- /dev/null
+++ b/frontier_eval/conf/task/lawrence_instance_16_dispatch_rule.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: OperationsResearch/LA16DispatchingRuleOptimization
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/lawrence_instance_16_neighborhood_moves.yaml b/frontier_eval/conf/task/lawrence_instance_16_neighborhood_moves.yaml
new file mode 100644
index 00000000..43c474df
--- /dev/null
+++ b/frontier_eval/conf/task/lawrence_instance_16_neighborhood_moves.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: OperationsResearch/LA16NeighborhoodMoveSelection
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/multi_robot_priority_planning.yaml b/frontier_eval/conf/task/multi_robot_priority_planning.yaml
new file mode 100644
index 00000000..287b14c5
--- /dev/null
+++ b/frontier_eval/conf/task/multi_robot_priority_planning.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: Robotics/MultiRobotPrioritizedPlanning
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/narrow_passage_path_planning.yaml b/frontier_eval/conf/task/narrow_passage_path_planning.yaml
new file mode 100644
index 00000000..a1df2106
--- /dev/null
+++ b/frontier_eval/conf/task/narrow_passage_path_planning.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: Robotics/NarrowPassagePlanning
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/conf/task/routing_qftentangled.yaml b/frontier_eval/conf/task/routing_qftentangled.yaml
new file mode 100644
index 00000000..94b94a4e
--- /dev/null
+++ b/frontier_eval/conf/task/routing_qftentangled.yaml
@@ -0,0 +1,6 @@
+name: unified
+benchmark: QuantumComputing/task_01_routing_qftentangled
+metadata_dir: frontier_eval
+
+runtime:
+  conda_env: mqt
diff --git a/frontier_eval/conf/task/ship_weather_routing.yaml b/frontier_eval/conf/task/ship_weather_routing.yaml
new file mode 100644
index 00000000..38fd3fe7
--- /dev/null
+++ b/frontier_eval/conf/task/ship_weather_routing.yaml
@@ -0,0 +1,8 @@
+defaults:
+  - unified
+
+benchmark: OperationsResearch/FuelMinimizingShipWeatherRouting
+
+runtime:
+  use_conda_run: false
+  python_path: /mnt/shared-storage-user/p1-shared/luotianwei/Frontier-Engineering/.venv/bin/python
diff --git a/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/README.md b/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/README.md
new file mode 100644
index 00000000..054315c5
--- /dev/null
+++ b/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/README.md
@@ -0,0 +1,39 @@
+# FT10 Neighborhood Move Selection
+
+This teaching scaffold is derived from `benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection`.
+It explains the local-search version of the classic FT10 job-shop benchmark for readers who know CS but do not know job-shop scheduling yet.
+
+## Directory Structure
+
+```text
+teaching_ft10_neighborhood_move_selection/
+├── README.md
+├── README_zh-CN.md
+├── Task.md
+├── Task_zh-CN.md
+├── baseline/
+│   └── init.py
+└── verification/
+    ├── reference.py
+    └── evaluate.py
+```
+
+- `baseline/init.py`: a simple adjacent-swap ranking policy that only uses cheap local cues.
+- `verification/reference.py`: a stronger reference that replays a CP-SAT-derived machine order and can optionally rerun OR-Tools when `TEACHING_FT10_ENABLE_ORTOOLS=1`.
+- `verification/evaluate.py`: runs baseline and reference, then normalizes the score against the known optimum `930`.
+
+## What This Benchmark Teaches
+
+The benchmark asks you to rank adjacent machine-order swaps inside a frozen local-search loop.
+You do not build the schedule from scratch; instead, you guide which swap the search should try first.
+
+That makes the task a good example of optimization inside an existing solver shell:
+the scheduling model, feasibility checks, and search loop are fixed, and only the move-ranking policy changes.
+
+## Source of Truth
+
+The frozen FT10 instance and all runtime helpers live in:
+
+`benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/runtime/problem.py`
+
+<!-- AI_GENERATED -->
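The README above describes ranking adjacent swaps inside a frozen local-search loop. The toy sketch below illustrates that pattern only. It does not use the real `run_local_search` from `runtime/problem.py`, and the inversion-count objective and the names `toy_cost`, `first_improvement_search`, and `prefer_big_gaps` are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List, Tuple

Move = Tuple[int, int]  # positions (i, i + 1) of an adjacent swap


def toy_cost(order: List[int]) -> int:
    """Hypothetical objective: number of inversions (0 when sorted)."""
    return sum(
        1
        for i in range(len(order))
        for j in range(i + 1, len(order))
        if order[i] > order[j]
    )


def first_improvement_search(
    order: List[int],
    score_move: Callable[[Move, List[int]], float],
    max_iterations: int,
) -> List[int]:
    """Rank adjacent swaps by score and apply the first one that lowers the cost."""
    current = list(order)
    for _ in range(max_iterations):
        moves = [(i, i + 1) for i in range(len(current) - 1)]
        # Larger scores are tried first.
        moves.sort(key=lambda m: score_move(m, current), reverse=True)
        improved = False
        for i, j in moves:
            candidate = list(current)
            candidate[i], candidate[j] = candidate[j], candidate[i]
            if toy_cost(candidate) < toy_cost(current):
                current = candidate
                improved = True
                break
        if not improved:
            break  # no improving swap left in this neighborhood
    return current


def prefer_big_gaps(move: Move, order: List[int]) -> float:
    """Toy policy: try the swap with the largest out-of-order gap first."""
    i, j = move
    return float(order[i] - order[j])


result = first_improvement_search([3, 1, 2, 0], prefer_big_gaps, max_iterations=50)
```

In this toy setting any ranking reaches the same sorted result given enough iterations. The ranking starts to matter when the iteration budget is tight or when the objective has local optima, and FT10 has both.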
diff --git a/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/README_zh-CN.md b/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/README_zh-CN.md
new file mode 100644
index 00000000..4fe0e270
--- /dev/null
+++ b/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/README_zh-CN.md
@@ -0,0 +1,39 @@
+# FT10 邻域移动选择
+
+这个教学版任务基于 `benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection`。
+它面向有 CS 背景、但还不熟悉作业车间调度的读者，用来解释一个经典的局部搜索优化问题。
+
+## 目录结构
+
+```text
+teaching_ft10_neighborhood_move_selection/
+├── README.md
+├── README_zh-CN.md
+├── Task.md
+├── Task_zh-CN.md
+├── baseline/
+│   └── init.py
+└── verification/
+    ├── reference.py
+    └── evaluate.py
+```
+
+- `baseline/init.py`：一个只使用廉价局部特征的简单相邻交换排序策略。
+- `verification/reference.py`：更强的参考实现，默认回放一个由 CP-SAT 导出的机器顺序；如果你设置 `TEACHING_FT10_ENABLE_ORTOOLS=1`，它也能尝试外部 OR-Tools 求解器。
+- `verification/evaluate.py`：运行 baseline 和 reference，并按已知最优值 `930` 做归一化计分。
+
+## 这个基准在教什么
+
+这个任务要求你在一个冻结的局部搜索循环里，对机器顺序里的相邻交换动作排序。
+你不是从头构造排程，而是决定“下一步先尝试哪个 swap”。
+
+所以它很适合用来理解“在固定求解器壳层里做优化”这类问题：
+模型、可行性检查和搜索循环都固定，变化的只有动作排序策略。
+
+## 事实来源
+
+冻结的 FT10 实例和所有运行时辅助函数都在：
+
+`benchmarks/OperationsResearch/FT10NeighborhoodMoveSelection/runtime/problem.py`
+
+<!-- AI_GENERATED -->
diff --git a/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/Task.md b/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/Task.md
new file mode 100644
index 00000000..14871c29
--- /dev/null
+++ b/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/Task.md
@@ -0,0 +1,146 @@
+# FT10 Neighborhood Move Selection Task
+
+## Problem
+
+You are given the canonical FT10 job-shop instance and a frozen local-search shell.
+Your job is to rank adjacent machine-order swaps so that the search trajectory reaches a smaller makespan.
+
+This is not a full schedule-construction problem.
+The evaluator already knows how to build the incumbent schedule, generate the neighborhood, apply swaps, and stop the search.
+Your code only supplies a move-scoring policy.
+
+## Background
+
+Job-shop scheduling is a manufacturing optimization problem:
+each job has to visit a fixed sequence of machines, and each machine can process only one operation at a time.
+The objective is usually to minimize the final completion time, also called the makespan.
+
+The FT10 instance is a classic 10-job, 10-machine benchmark.
+It is hard because local machine decisions interact globally: a swap that looks harmless on one machine can delay a downstream bottleneck and increase the final makespan.
+
+The teaching idea here is to separate the solver shell from the policy.
+You can think of this as learning a ranking function for a combinatorial local search engine.
+
+## What Is Frozen
+
+- The FT10 instance `ft10`.
+- The incumbent schedule used to start local search.
+- The adjacent-swap neighborhood.
+- The acceptance rule: only improving moves are applied.
+- The known optimal makespan `930` for this instance.
+
+## Input and Output
+
+Your candidate file should define:
+
+```python
+def score_move(move, state):
+    ...
+
+MAX_ITERATIONS = 50
+```
+
+`score_move(move, state)` receives:
+
+- `move`: a dictionary describing one adjacent swap
+- `state`: a dictionary describing the current local-search iteration
+
+The move dictionary contains:
+
+- `machine_id`: the machine whose sequence is being modified
+- `machine_position`: the index of the left element in the adjacent pair
+- `op_a` and `op_b`: the two neighboring operations being considered for swap
+- `delta_duration`: a cheap feature derived from the two operation durations
+- `current_makespan`: the current schedule makespan
+
+Each operation record inside `op_a` and `op_b` contains:
+
+- `job_id`
+- `op_index`
+- `duration`
+- `start`
+- `end`
+
+The state dictionary contains:
+
+- `iteration`
+- `current_makespan`
+
+Return any finite scalar score.
+Larger scores are tried first.
+If you provide `MAX_ITERATIONS`, it must be a positive integer.
+
+## Expected Result
+
+A good submission should produce a feasible schedule with a smaller makespan than the baseline.
+The best possible makespan for this instance is `930`.
+
+You should expect the evaluator to run a local search loop like this:
+
+1. Start from the baseline incumbent schedule.
+2. Generate all adjacent machine-order swaps.
+3. Rank them using `score_move(move, state)`.
+4. Apply the first improving move in that ranked order.
+5. Stop when there is no improving move or the iteration limit is reached.
+
+That means a good score function should do more than prefer obviously short operations.
+It should prefer swaps that are likely to unlock a better downstream machine order and reduce the critical path.
+
+## How To Start Implementing
+
+If you have a CS background, a practical implementation recipe is:
+
+1. Treat `start` and `end` as a proxy for slack.
+   Swaps involving operations that already sit near the schedule tail usually matter more.
+2. Estimate each job's remaining work indirectly through `op_index`.
+   Delaying an operation from a still-long job often hurts more than delaying the end of a short job.
+3. Look for bottlenecks at the machine level.
+   A swap near the end of a busy machine sequence can change the global makespan even if the two durations are similar.
+4. Think in terms of the critical path.
+   The makespan is determined by one or more tight precedence chains; the best swaps are usually the ones that shorten or reroute those chains.
+
+In code, that usually means combining several weak signals into one score rather than relying on only `delta_duration`.
+
+## Scoring
+
+This is a minimization task, so lower makespans are better.
+We report a normalized score on a 0 to 100 scale:
+
+```text
+normalized_score = 100 * clip((baseline_makespan - candidate_makespan) / (baseline_makespan - 930), 0, 1)
+```
+
+Interpretation:
+
+- `0` means the candidate is no better than the baseline
+- `100` means the candidate reaches the known optimum `930`
+- invalid or infeasible submissions receive `0`
+
+We also report:
+
+- `candidate_makespan`
+- `baseline_makespan`
+- `reference_makespan`
+- `theoretical_optimum_makespan`
+- `gap_to_optimum`
+
+## Why This Is Hard
+
+The obvious greedy idea is to always move shorter operations earlier.
+That helps sometimes, but it ignores the fact that the schedule is constrained by machine conflicts and job precedence.
+
+The real difficulty is that a swap can change the critical path in a non-local way.
+You are trying to improve a global objective using only a local neighborhood signal.
+That is the central lesson of this benchmark.
+
+## Failure Cases
+
+The submission is invalid if:
+
+- `score_move` is missing
+- the returned score is non-finite
+- `MAX_ITERATIONS` is invalid
+- the induced schedule is incomplete or infeasible
+- the evaluator cannot import or run the candidate file
+
+<!-- AI_GENERATED -->
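The scoring formula in the `Task.md` above can be checked by hand with a small sketch. The function name `normalized_score` and the baseline makespan of `1000` used in the examples are assumptions for illustration only. The real baseline value comes from running the baseline policy, and the authoritative implementation is the one in `verification/evaluate.py`.

```python
def normalized_score(
    candidate_makespan: float,
    baseline_makespan: float,
    optimum: float = 930.0,
) -> float:
    """Map a makespan onto the 0 to 100 scale described in Task.md."""
    span = baseline_makespan - optimum
    if span <= 0:
        # A baseline at or below the optimum leaves no room to improve.
        return 0.0
    raw = (baseline_makespan - candidate_makespan) / span
    # clip(raw, 0, 1), then scale to 0..100.
    return 100.0 * max(0.0, min(1.0, raw))


# Hypothetical baseline of 1000: the span to the optimum is 70.
no_gain = normalized_score(1000, 1000)
halfway = normalized_score(965, 1000)
optimal = normalized_score(930, 1000)
worse = normalized_score(1100, 1000)
```

With this assumed baseline, closing half of the 70-unit gap yields a score of 50, and a candidate that is worse than the baseline is clipped to 0 rather than going negative.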
diff --git a/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/Task_zh-CN.md b/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/Task_zh-CN.md
new file mode 100644
index 00000000..172b73f3
--- /dev/null
+++ b/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/Task_zh-CN.md
@@ -0,0 +1,145 @@
+# FT10 邻域移动选择任务
+
+## 问题
+
+你会拿到经典 FT10 作业车间实例，以及一个冻结好的局部搜索壳层。
+你的任务是给相邻交换动作排序，让搜索过程更快找到更小的 makespan。
+
+这不是“从头构造完整排程”的题。
+评测器已经知道如何生成初始排程、构造邻域、执行 swap，并在合适的时候停止搜索。
+你只需要提供一个动作评分策略。
+
+## 背景
+
+作业车间调度是制造业里非常经典的优化问题：
+每个 job 必须按固定顺序经过若干机器，而每台机器同一时刻只能处理一个操作。
+目标通常是最小化最后完成时间，也就是 makespan。
+
+FT10 是一个经典的 10 job、10 machine 基准。
+它之所以难，是因为局部看起来合理的机器顺序变化，可能会在全局上引入更长的关键路径，导致最终 makespan 变差。
+
+这个教学任务的核心思想，是把“求解器壳层”和“策略”拆开。
+你可以把它理解为：在一个组合优化 local search 引擎里学习一个排序函数。
+
+## 哪些部分是冻结的
+
+- FT10 实例 `ft10`
+- 作为初始解的 incumbent 排程
+- 相邻交换邻域
+- 接受规则：只有改进的动作才会被应用
+- 已知理论最优值 `930`
+
+## 输入与输出
+
+你的候选文件需要定义：
+
+```python
+def score_move(move, state):
+    ...
+
+MAX_ITERATIONS = 50
+```
+
+`score_move(move, state)` 的输入包含：
+
+- `move`：描述一次相邻交换的字典
+- `state`：描述当前局部搜索状态的字典
+
+`move` 字典包含：
+
+- `machine_id`：被修改的机器编号
+- `machine_position`：相邻 pair 左边元素的位置
+- `op_a` 和 `op_b`：这两个相邻操作
+- `delta_duration`：由两个操作时长派生出的快速特征
+- `current_makespan`：当前排程的 makespan
+
+每个 `op_a` / `op_b` 里的操作记录都包含：
+
+- `job_id`
+- `op_index`
+- `duration`
+- `start`
+- `end`
+
+`state` 字典包含：
+
+- `iteration`
+- `current_makespan`
+
+你只需要返回任意有限标量分数。
+分数越大，动作越先被尝试。
+如果你提供 `MAX_ITERATIONS`，它必须是正整数。
+
+## 预期结果
+
+一个好的提交应该返回可行排程，并让 makespan 比 baseline 更小。
+这个实例的已知最优 makespan 是 `930`。
+
+评测器大致会按下面的方式运行局部搜索：
+
+1. 从 baseline incumbent 排程开始
+2. 生成所有相邻机器顺序交换动作
+3. 使用 `score_move(move, state)` 排序
+4. 按排序结果应用第一个能改进的动作
+5. 当没有改进动作，或者达到迭代上限时停止
+
+所以，一个好的评分函数不只是“优先短操作”。
+它还应该更偏向那些可能打开更好下游排程、缩短关键路径的 swap。
+
+## 如何开始实现
+
+如果你有 CS 背景，可以按下面这个很实用的思路入手：
+
+1. 先把 `start` 和 `end` 当成 slack 的近似信号。
+   越靠近排程尾部的操作，交换后越可能影响最终 makespan。
+2. 通过操作在 job 里的位置，间接估计剩余工作量。
+   如果一个 job 后面还有很多工序，过早把它卡住，代价通常更大。
+3. 关注机器瓶颈。
+   一台机器尾部附近的 swap，即使两个操作时长接近，也可能改变全局最慢链路。
+4. 用“关键路径”来理解分数设计。
+   makespan 由一条或几条没有松弛的紧约束链决定，真正有价值的 swap 往往是在缩短或重排这些链。
+
+写代码时，通常不要只盯着 `delta_duration`，而是把几个弱信号组合成一个总分。
+
+## 计分方式
+
+这是一个最小化任务，所以 makespan 越小越好。
+我们使用 0 到 100 的归一化分数：
+
+```text
+normalized_score = 100 * clip((baseline_makespan - candidate_makespan) / (baseline_makespan - 930), 0, 1)
+```
+
+含义是：
+
+- `0` 表示候选解不比 baseline 更好
+- `100` 表示候选解达到已知最优值 `930`
+- 无效或不可行提交得分为 `0`
+
+我们还会报告：
+
+- `candidate_makespan`
+- `baseline_makespan`
+- `reference_makespan`
+- `theoretical_optimum_makespan`
+- `gap_to_optimum`
+
+## 为什么难
+
+最直觉的贪心想法是把短操作尽量往前排。
+这有时有用，但它忽略了两个事实：机器冲突会约束顺序，job 前后依赖也会约束顺序。
+
+真正难的是，一个局部 swap 会以非局部的方式改变关键路径。
+你是在用局部邻域信号去改善一个全局目标，这正是这个 benchmark 想教你的地方。
+
+## 判为无效的情况
+
+如果出现以下任一情况，提交无效：
+
+- 缺少 `score_move`
+- 返回的分数不是有限值
+- `MAX_ITERATIONS` 不合法
+- 诱导出的排程不完整或不可行
+- 评测器无法导入或运行候选文件
+
+<!-- AI_GENERATED -->
diff --git a/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/baseline/init.py b/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/baseline/init.py
new file mode 100644
index 00000000..895e0ea8
--- /dev/null
+++ b/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/baseline/init.py
@@ -0,0 +1,42 @@
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _ensure_repo_root() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            root = str(parent)
+            if root not in sys.path:
+                sys.path.insert(0, root)
+            return
+
+
+_ensure_repo_root()
+
+try:
+    from benchmarks.OperationsResearch.FT10NeighborhoodMoveSelection.runtime.problem import load_instance, run_local_search
+except ModuleNotFoundError:
+    from runtime.problem import load_instance, run_local_search
+
+
+MAX_ITERATIONS = 50
+
+
+def score_move(move, state):
+    return (
+        float(move["delta_duration"]),
+        -float(move["machine_position"]),
+        -float(move["machine_id"]),
+    )
+
+
+def solve(instance):
+    return run_local_search(instance, score_move, MAX_ITERATIONS)
+
+
+if __name__ == "__main__":
+    result = solve(load_instance())
+    print(result["makespan"])
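The baseline above ranks moves mainly by `delta_duration`. As a contrast, here is a minimal sketch of a scalar policy that combines several weak signals, in the spirit of the "How To Start Implementing" section of `Task.md`. It reads only the fields documented there. The weights, the direction of each signal, and the sample dictionaries are untuned assumptions, and nothing here has been run against the real evaluator.

```python
def score_move(move, state):
    """Hypothetical policy: blend tail closeness, job position, and duration."""
    op_a, op_b = move["op_a"], move["op_b"]
    # Fall back to 1.0 to avoid dividing by zero on a degenerate schedule.
    makespan = float(state["current_makespan"]) or 1.0

    # Signal 1: pairs that finish near the schedule tail get a value near 1.
    tail_closeness = max(float(op_a["end"]), float(op_b["end"])) / makespan
    # Signal 2: positive when op_b sits earlier in its job than op_a does,
    # a rough proxy for op_b's job having more work remaining.
    remaining_work_hint = float(op_a["op_index"]) - float(op_b["op_index"])
    # Signal 3: the cheap duration feature the baseline already uses.
    duration_signal = float(move["delta_duration"])

    # Assumed weights; they would need tuning against the evaluator.
    return 2.0 * tail_closeness + 0.1 * remaining_work_hint + 0.01 * duration_signal


def _sample_move(end_a, end_b):
    """Build a made-up move dictionary with the documented field names."""
    return {
        "machine_id": 0,
        "machine_position": 0,
        "op_a": {"job_id": 0, "op_index": 5, "duration": 50, "start": end_a - 50, "end": end_a},
        "op_b": {"job_id": 1, "op_index": 5, "duration": 50, "start": end_b - 50, "end": end_b},
        "delta_duration": 0.0,
        "current_makespan": 1000,
    }


state = {"iteration": 0, "current_makespan": 1000}
tail_score = score_move(_sample_move(900, 950), state)
early_score = score_move(_sample_move(100, 150), state)
```

On the made-up inputs, the swap near the schedule tail is ranked ahead of the early one, which is the behavior the first hint in `Task.md` asks for.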
diff --git a/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/verification/evaluate.py b/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/verification/evaluate.py
new file mode 100644
index 00000000..06fc517b
--- /dev/null
+++ b/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/verification/evaluate.py
@@ -0,0 +1,170 @@
+from __future__ import annotations
+
+import argparse
+import importlib.util
+import json
+import sys
+import time
+from pathlib import Path
+from types import ModuleType
+from typing import Any
+
+
+THEORETICAL_OPTIMUM_MAKESPAN = 930
+
+
+def _ensure_repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            root = parent
+            root_s = str(root)
+            if root_s not in sys.path:
+                sys.path.insert(0, root_s)
+            return root
+    raise RuntimeError("could not locate repository root")
+
+
+REPO_ROOT = _ensure_repo_root()
+
+try:
+    from benchmarks.OperationsResearch.FT10NeighborhoodMoveSelection.runtime.problem import load_instance, run_local_search
+except ModuleNotFoundError:
+    from runtime.problem import load_instance, run_local_search
+
+
+def _load_module(path: Path, module_name: str) -> ModuleType:
+    path = path.resolve()
+    if not path.is_file():
+        raise FileNotFoundError(path)
+    if str(path.parent) not in sys.path:
+        sys.path.insert(0, str(path.parent))
+    spec = importlib.util.spec_from_file_location(module_name, str(path))
+    if spec is None or spec.loader is None:
+        raise RuntimeError(f"failed to load module from {path}")
+    module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(module)
+    return module
+
+
+def _score_from_makespan(makespan: float | None, baseline: float) -> float:
+    """Map a makespan to [0, 100]: 0 at the baseline, 100 at the known optimum."""
+    if makespan is None:
+        return 0.0
+    span = baseline - THEORETICAL_OPTIMUM_MAKESPAN
+    # A baseline at or below the optimum (or NaN) leaves no span to normalize over.
+    if not span > 0:
+        return 0.0
+    score = 100.0 * (baseline - float(makespan)) / span
+    return max(0.0, min(100.0, score))
+
+
+def _evaluate_score_module(module: ModuleType, instance: Any) -> dict[str, Any]:
+    if hasattr(module, "solve"):
+        result = module.solve(instance)
+    else:
+        score_move = getattr(module, "score_move", None)
+        if score_move is None:
+            raise AttributeError("candidate module must define score_move or solve")
+        max_iterations = int(getattr(module, "MAX_ITERATIONS", 50))
+        result = run_local_search(instance, score_move, max_iterations)
+    if not isinstance(result, dict):
+        raise TypeError("candidate solver must return a dict")
+    if "makespan" not in result:
+        raise KeyError("result missing makespan")
+    return result
+
+
+def evaluate(candidate_path: str | None = None) -> dict[str, Any]:
+    instance = load_instance()
+
+    baseline_module = _load_module(
+        REPO_ROOT / "frontier_eval" / "tasks" / "teaching_ft10_neighborhood_move_selection" / "baseline" / "init.py",
+        "teaching_ft10_baseline",
+    )
+    reference_module = _load_module(
+        REPO_ROOT / "frontier_eval" / "tasks" / "teaching_ft10_neighborhood_move_selection" / "verification" / "reference.py",
+        "teaching_ft10_reference",
+    )
+
+    baseline_start = time.perf_counter()
+    baseline_result = _evaluate_score_module(baseline_module, instance)
+    baseline_runtime = time.perf_counter() - baseline_start
+
+    reference_start = time.perf_counter()
+    reference_result = _evaluate_score_module(reference_module, instance)
+    reference_runtime = time.perf_counter() - reference_start
+
+    baseline_makespan = float(baseline_result["makespan"])
+    reference_makespan = float(reference_result["makespan"])
+    reference_score = _score_from_makespan(reference_makespan, baseline_makespan)
+
+    metrics: dict[str, Any] = {
+        "theoretical_optimum_makespan": float(THEORETICAL_OPTIMUM_MAKESPAN),
+        "theoretical_upper_bound_score": 100.0,
+        "baseline_makespan": baseline_makespan,
+        "baseline_runtime_s": baseline_runtime,
+        "baseline_score": _score_from_makespan(baseline_makespan, baseline_makespan),
+        "baseline_valid": float(bool(baseline_result.get("valid", True))),
+        "reference_makespan": reference_makespan,
+        "reference_runtime_s": reference_runtime,
+        "reference_score": reference_score,
+        "reference_solver": reference_result.get("solver", "unknown"),
+        "reference_valid": float(bool(reference_result.get("valid", True))),
+        "valid": 1.0,
+    }
+
+    if candidate_path:
+        try:
+            candidate_module = _load_module(Path(candidate_path), "teaching_ft10_candidate")
+            candidate_start = time.perf_counter()
+            candidate_result = _evaluate_score_module(candidate_module, instance)
+            candidate_runtime = time.perf_counter() - candidate_start
+            candidate_makespan = float(candidate_result["makespan"])
+            metrics.update(
+                {
+                    "candidate_makespan": candidate_makespan,
+                    "candidate_runtime_s": candidate_runtime,
+                    "candidate_score": _score_from_makespan(candidate_makespan, baseline_makespan),
+                    "candidate_valid": float(bool(candidate_result.get("valid", True))),
+                    "gap_to_optimum": candidate_makespan - THEORETICAL_OPTIMUM_MAKESPAN,
+                    "combined_score": _score_from_makespan(candidate_makespan, baseline_makespan),
+                }
+            )
+        except Exception as exc:
+            metrics.update(
+                {
+                    "candidate_makespan": float("inf"),
+                    "candidate_runtime_s": 0.0,
+                    "candidate_score": 0.0,
+                    "candidate_valid": 0.0,
+                    "gap_to_optimum": float("inf"),
+                    "combined_score": 0.0,
+                    "valid": 0.0,
+                    "candidate_error": str(exc),
+                }
+            )
+    else:
+        metrics.update(
+            {
+                "combined_score": reference_score,
+                "gap_to_optimum": reference_makespan - THEORETICAL_OPTIMUM_MAKESPAN,
+            }
+        )
+
+    return metrics
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("candidate_path", nargs="?", default=None)
+    parser.add_argument("--candidate", dest="candidate_flag", default=None)
+    args = parser.parse_args(argv)
+    candidate_path = args.candidate_flag or args.candidate_path
+    metrics = evaluate(candidate_path)
+    print(json.dumps(metrics, indent=2, sort_keys=True))
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
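The normalization used by `_score_from_makespan` in the evaluator above can be checked by hand. The following is a minimal standalone sketch of the same arithmetic; the baseline value of 1030 is a hypothetical number chosen only to make the span a round 100, and is not the makespan the real baseline produces.

```python
THEORETICAL_OPTIMUM_MAKESPAN = 930  # same constant as in evaluate.py


def score_from_makespan(makespan: float, baseline: float) -> float:
    """Map a makespan to [0, 100]: 0 at the baseline, 100 at the optimum."""
    span = baseline - THEORETICAL_OPTIMUM_MAKESPAN
    if not span > 0:
        return 0.0
    raw = 100.0 * (baseline - makespan) / span
    # Clip so that results worse than the baseline still score 0.
    return max(0.0, min(100.0, raw))


# Hypothetical baseline makespan, used only for illustration.
baseline = 1030.0
for makespan in (1030.0, 980.0, 938.0, 930.0, 1100.0):
    print(makespan, score_from_makespan(makespan, baseline))
```

With this assumed baseline, a makespan of 938 (the value quoted for the embedded reference sequence) would score 92, and anything worse than the baseline is clipped to 0.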
diff --git a/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/verification/reference.py b/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/verification/reference.py
new file mode 100644
index 00000000..53773b3b
--- /dev/null
+++ b/frontier_eval/tasks/teaching_ft10_neighborhood_move_selection/verification/reference.py
@@ -0,0 +1,120 @@
+from __future__ import annotations
+
+import os
+import sys
+from pathlib import Path
+
+
+def _ensure_repo_root() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            root = str(parent)
+            if root not in sys.path:
+                sys.path.insert(0, root)
+            return
+
+
+_ensure_repo_root()
+
+try:
+    from benchmarks.OperationsResearch.FT10NeighborhoodMoveSelection.runtime.problem import (
+        build_schedule_from_machine_sequences,
+        load_instance,
+    )
+except ModuleNotFoundError:
+    from runtime.problem import (
+        build_schedule_from_machine_sequences,
+        load_instance,
+    )
+
+
+try:  # pragma: no cover - optional dependency
+    from job_shop_lib.benchmarking import load_benchmark_instance
+    from job_shop_lib.constraint_programming import ORToolsSolver
+except Exception as exc:  # pragma: no cover - environment dependent
+    JOB_SHOP_LIB_IMPORT_ERROR = exc
+    ORToolsSolver = None
+    load_benchmark_instance = None
+else:  # pragma: no cover - environment dependent
+    JOB_SHOP_LIB_IMPORT_ERROR = None
+
+
+EXTERNAL_TIME_LIMIT_S = float(os.environ.get("TEACHING_FT10_REFERENCE_TIME_LIMIT", "20.0"))
+ENABLE_EXTERNAL_SOLVER = os.environ.get("TEACHING_FT10_ENABLE_ORTOOLS", "").strip().lower() in {
+    "1",
+    "true",
+    "yes",
+    "on",
+}
+
+# This machine order was extracted from a 20-second OR-Tools CP-SAT run on ft10.
+# It gives makespan 938 when replayed through the local runtime helpers.
+EMBEDDED_REFERENCE_MACHINE_SEQUENCES = [
+    [(1, 0), (4, 1), (6, 1), (8, 0), (7, 1), (3, 2), (9, 1), (0, 0), (2, 1), (5, 6)],
+    [(6, 0), (3, 0), (4, 2), (9, 0), (8, 1), (1, 5), (5, 1), (2, 0), (7, 2), (0, 1)],
+    [(4, 0), (7, 0), (1, 1), (3, 1), (6, 3), (5, 0), (9, 2), (8, 4), (0, 2), (2, 3)],
+    [(6, 2), (4, 4), (1, 4), (8, 2), (5, 3), (2, 2), (0, 3), (9, 7), (3, 7), (7, 9)],
+    [(1, 2), (4, 5), (3, 3), (7, 4), (0, 4), (6, 9), (8, 8), (9, 8), (5, 8), (2, 9)],
+    [(4, 3), (6, 5), (5, 2), (8, 3), (1, 7), (7, 3), (9, 6), (0, 5), (2, 5), (3, 9)],
+    [(6, 4), (1, 6), (3, 4), (9, 3), (4, 9), (8, 6), (7, 5), (0, 6), (2, 7), (5, 7)],
+    [(4, 7), (6, 8), (3, 6), (1, 8), (8, 7), (2, 6), (0, 7), (7, 8), (9, 9), (5, 9)],
+    [(4, 6), (6, 7), (3, 5), (9, 4), (5, 4), (2, 4), (7, 6), (8, 9), (0, 8), (1, 9)],
+    [(1, 3), (6, 6), (4, 8), (8, 5), (9, 5), (5, 5), (7, 7), (3, 8), (2, 8), (0, 9)],
+]
+
+
+def _solve_from_machine_sequences(instance):
+    result = build_schedule_from_machine_sequences(instance, EMBEDDED_REFERENCE_MACHINE_SEQUENCES)
+    if result["valid"]:
+        result["solver"] = "embedded_cp_sat_machine_order"
+    return result
+
+
+def _solve_with_ortools(instance):
+    if ORToolsSolver is None or load_benchmark_instance is None:
+        raise RuntimeError(f"external solver unavailable: {JOB_SHOP_LIB_IMPORT_ERROR}")
+    solver = ORToolsSolver(
+        max_time_in_seconds=EXTERNAL_TIME_LIMIT_S,
+        log_search_progress=False,
+    )
+    schedule = solver(load_benchmark_instance("ft10"))
+    machine_sequences = []
+    for machine_ops in schedule.schedule:
+        sequence = []
+        for scheduled_op in machine_ops:
+            op = scheduled_op.operation
+            sequence.append((int(op.job_id), int(op.position_in_job)))
+        machine_sequences.append(sequence)
+    result = build_schedule_from_machine_sequences(instance, machine_sequences)
+    if result["valid"]:
+        result["solver"] = "ortools_cp_sat"
+        result["ortools_status"] = str(schedule.metadata.get("status", "unknown"))
+        result["ortools_time_limit_s"] = EXTERNAL_TIME_LIMIT_S
+    return result
+
+
+def solve(instance):
+    best = _solve_from_machine_sequences(instance)
+    if not ENABLE_EXTERNAL_SOLVER:
+        return best
+    if ORToolsSolver is None or load_benchmark_instance is None:
+        return best
+
+    try:  # pragma: no cover - environment dependent
+        candidate = _solve_with_ortools(instance)
+    except Exception as exc:
+        best["external_solver_error"] = str(exc)
+        return best
+
+    if candidate["valid"] and candidate["makespan"] <= best["makespan"]:
+        return candidate
+    return best
+
+
+def best_run_makespan(instance):
+    return solve(instance)["makespan"]
+
+
+if __name__ == "__main__":
+    print(best_run_makespan(load_instance()))
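The reference above stores one operation order per machine and replays it through `build_schedule_from_machine_sequences`, whose implementation is not part of this diff. As a rough illustration of what such a replay involves, here is a sketch under stated assumptions: the function name `replay_machine_sequences`, the `jobs` layout (a list of `(machine, duration)` pairs per job), and the returned keys are all hypothetical and may differ from the real runtime helper.

```python
def replay_machine_sequences(jobs, machine_sequences):
    """Turn per-machine operation orders into a semi-active schedule.

    jobs[j] is a list of (machine, duration) pairs in job order.
    machine_sequences[m] lists the (job, position) pairs run on machine m.
    """
    invalid = {"valid": False, "makespan": None}
    expected = {(j, p) for j, ops in enumerate(jobs) for p in range(len(ops))}
    listed = [tuple(op) for seq in machine_sequences for op in seq]
    if len(listed) != len(expected) or set(listed) != expected:
        return invalid  # an operation is missing, duplicated, or unknown

    job_next = [0] * len(jobs)    # next unscheduled position of each job
    job_ready = [0] * len(jobs)   # finish time of each job's last scheduled op
    machine_ready = [0] * len(machine_sequences)
    cursor = [0] * len(machine_sequences)
    remaining = len(listed)

    while remaining:
        progressed = False
        for m, seq in enumerate(machine_sequences):
            if cursor[m] == len(seq):
                continue
            job, pos = seq[cursor[m]]
            if job_next[job] != pos:
                continue  # the job's previous operation has not run yet
            machine, duration = jobs[job][pos]
            if machine != m:
                return invalid  # operation listed on the wrong machine
            start = max(job_ready[job], machine_ready[m])
            job_ready[job] = machine_ready[m] = start + duration
            job_next[job] += 1
            cursor[m] += 1
            remaining -= 1
            progressed = True
        if not progressed:
            return invalid  # the machine orders wait on each other in a cycle
    return {"valid": True, "makespan": max(machine_ready, default=0)}


# Toy 2-job, 2-machine instance (not ft10).
jobs = [[(0, 3), (1, 2)], [(1, 2), (0, 4)]]
print(replay_machine_sequences(jobs, [[(0, 0), (1, 1)], [(1, 0), (0, 1)]]))
print(replay_machine_sequences(jobs, [[(1, 1), (0, 0)], [(0, 1), (1, 0)]]))
```

The first order replays to a makespan of 7. The second is rejected because each machine starts with an operation whose job predecessor is queued behind the other machine's first operation.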
diff --git a/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/README.md b/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/README.md
new file mode 100644
index 00000000..bcd766a9
--- /dev/null
+++ b/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/README.md
@@ -0,0 +1,35 @@
+# Fuel-Minimizing Ship Weather Routing
+
+This teaching scaffold is derived from `benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting`.
+It explains a grid-routing problem with wind and current fields for readers who know shortest paths but are new to maritime routing.
+
+## Directory Structure
+
+```text
+teaching_fuel_minimizing_ship_weather_routing/
+├── README.md
+├── README_zh-CN.md
+├── Task.md
+├── Task_zh-CN.md
+├── baseline/
+│   └── init.py
+└── verification/
+    ├── reference.py
+    └── evaluate.py
+```
+
+## What This Benchmark Teaches
+
+The benchmark asks you to route a ship across a frozen coastal grid while minimizing fuel usage.
+Movement is blocked by land cells, and each edge cost depends on the deterministic wind and current fields.
+
+This is a nice teaching example of weighted shortest-path planning:
+the graph is fixed, but the edge weights come from the physics model.
+
+## Source of Truth
+
+The frozen instance and runtime helpers live in:
+
+`benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/runtime/problem.py`
+
+<!-- AI_GENERATED -->
diff --git a/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/README_zh-CN.md b/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/README_zh-CN.md
new file mode 100644
index 00000000..17a2f283
--- /dev/null
+++ b/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/README_zh-CN.md
@@ -0,0 +1,35 @@
+# 燃油最小化船舶气象航线规划
+
+这个教学版任务基于 `benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting`。
+它面向懂最短路、但还没有接触过船舶航线优化的读者。
+
+## 目录结构
+
+```text
+teaching_fuel_minimizing_ship_weather_routing/
+├── README.md
+├── README_zh-CN.md
+├── Task.md
+├── Task_zh-CN.md
+├── baseline/
+│   └── init.py
+└── verification/
+    ├── reference.py
+    └── evaluate.py
+```
+
+## 这个基准在教什么
+
+这个任务要求你在冻结的沿海栅格上规划一条船舶航线，并尽量降低燃油消耗。
+路径必须避开陆地，而且每条边的代价不仅取决于步长，还会受确定性的风场和流场影响。
+
+它非常适合讲解加权最短路：
+图结构是固定的，但边权来自物理模型。
+
+## 事实来源
+
+冻结实例和运行时辅助函数都在：
+
+`benchmarks/OperationsResearch/FuelMinimizingShipWeatherRouting/runtime/problem.py`
+
+<!-- AI_GENERATED -->
diff --git a/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/Task.md b/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/Task.md
new file mode 100644
index 00000000..b8b7d550
--- /dev/null
+++ b/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/Task.md
@@ -0,0 +1,115 @@
+# Fuel-Minimizing Ship Weather Routing Task
+
+## Problem
+
+You must route a ship from a fixed start cell to a fixed goal cell on a frozen coastal grid.
+The route must avoid land and minimize fuel consumption under deterministic wind and current fields.
+
+This is a weighted shortest-path problem on a grid graph.
+The graph is fixed, but the edge weights are induced by the physics model.
+
+## Background
+
+Imagine a coastal navigation problem where a ship moves one grid cell at a time.
+Some cells are land and cannot be entered.
+The interesting part is that moving east, west, north, or south does not cost the same amount everywhere.
+
+The fuel cost of a move depends on the local wind and current.
+Following favorable current can reduce travel time and fuel, while fighting headwind or adverse current increases both.
+That means the shortest geometric path is often not the cheapest route.
+
+From a CS point of view, this is still a shortest-path problem.
+The difference is that the edge weights are not uniform and are not purely geometric.
+That is what makes the benchmark a good example of physics-aware routing.
+
+## What Is Frozen
+
+- The grid map.
+- The start cell.
+- The goal cell.
+- The deterministic wind field.
+- The deterministic current field.
+- The fuel and time model used to score each move.
+- The route validator.
+
+## Input and Output
+
+Your candidate file should define:
+
+```python
+def solve(instance):
+    ...
+```
+
+`instance` is a dictionary with at least:
+
+- `grid`
+- `start`
+- `goal`
+- `current_field`
+- `wind_field`
+- `objective`
+
+The function must return either:
+
+- a list of `(x, y)` cells, or
+- a dictionary with a `path` key
+
+The path must:
+
+- start at `instance["start"]`
+- end at `instance["goal"]`
+- move only between adjacent grid cells
+- stay on water cells
+
+## Expected Result
+
+A good solution should return a feasible route with low fuel usage.
+The evaluator also reports travel time and hop count, but fuel is the objective.
+
+The reference solver in this scaffold runs exact Dijkstra search on the frozen grid with the fuel model as edge weights. Every leg has a positive fuel cost, so the result is the minimum fuel cost for this instance.
+
+## Scoring
+
+This is a minimization task.
+We normalize the score against the baseline and the exact reference fuel cost:
+
+```text
+normalized_score = 100 * clip((baseline_fuel - candidate_fuel) / (baseline_fuel - reference_fuel), 0, 1)
+```
+
+Interpretation:
+
+- `0` means the candidate is no better than the baseline
+- `100` means the candidate matches the exact reference fuel cost
+- invalid submissions receive `0`
+
+The evaluator also reports, among other metrics:
+
+- `candidate_fuel`
+- `baseline_fuel`
+- `reference_fuel`
+- `candidate_time_h`
+- `baseline_time_h`
+- `candidate_hops`
+- `theoretical_optimum_fuel`
+
+## Why This Is Hard
+
+The simplest solution is to search for the fewest-hop path and stop there.
+That route is feasible whenever any route exists, but it ignores wind and current.
+
+The harder and more realistic version is weighted shortest path.
+You need to reason about the actual cost of each move, not just the number of moves.
+That is the central algorithmic lesson here.
+
+## Failure Cases
+
+The submission is invalid if:
+
+- `solve` is missing
+- the returned value cannot be parsed into a path
+- the path leaves the map, enters land, or uses non-adjacent steps
+- the evaluator cannot import or run the candidate file
+
+<!-- AI_GENERATED -->
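The path requirements listed in the task contract above (start, goal, adjacency, water only) can be sketched as a small checker. This is an illustration, not the frozen `validate_path` from the runtime module: the name `check_path` is hypothetical, the grid is assumed to be a list of strings with `#` marking land (the convention the baseline in this change uses), and adjacency is assumed to mean exactly one 4-neighbour step.

```python
def check_path(grid, start, goal, path):
    """Return a list of problems; an empty list means the path passed."""
    if not path:
        return ["path is empty"]
    cells = [tuple(cell) for cell in path]
    problems = []
    if cells[0] != tuple(start):
        problems.append("path does not begin at the start cell")
    if cells[-1] != tuple(goal):
        problems.append("path does not end at the goal cell")
    for i, (x, y) in enumerate(cells):
        if not (0 <= y < len(grid) and 0 <= x < len(grid[0])):
            problems.append(f"step {i} leaves the map")
        elif grid[y][x] == "#":
            problems.append(f"step {i} enters land")
    for i in range(1, len(cells)):
        (x0, y0), (x1, y1) = cells[i - 1], cells[i]
        # Assumption: a legal move changes exactly one coordinate by one.
        if abs(x1 - x0) + abs(y1 - y0) != 1:
            problems.append(f"step {i} is not adjacent to step {i - 1}")
    return problems


# Toy 3x3 map, not the frozen instance.
grid = ["..#",
        "...",
        "#.."]
good = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]
bad = [(0, 0), (2, 0), (2, 2)]
print(check_path(grid, (0, 0), (2, 2), good))
print(check_path(grid, (0, 0), (2, 2), bad))
```

Collecting every problem instead of stopping at the first one makes a failed submission easier to debug; the real validator may well raise on the first violation instead.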
diff --git a/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/Task_zh-CN.md b/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/Task_zh-CN.md
new file mode 100644
index 00000000..9e6fccc4
--- /dev/null
+++ b/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/Task_zh-CN.md
@@ -0,0 +1,114 @@
+# 燃油最小化船舶气象航线规划任务
+
+## 问题
+
+你需要把船从固定起点航行到固定终点，地图是冻结的沿海栅格。
+路径必须避开陆地，并且在确定性的风场和流场作用下尽量降低燃油消耗。
+
+这是一道加权最短路问题。
+图结构是固定的，但边权来自物理模型。
+
+## 背景
+
+可以把它想成一个沿海航行问题：船每次在栅格里走一步。
+有些格点是陆地，不能进入。
+真正有意思的是：在不同位置往东、西、南、北走，代价并不一样。
+
+单步代价取决于当前位置的风场和流场。
+如果顺着有利海流走，航时和燃油都会下降；如果顶着逆风或逆流走，代价就会上升。
+所以几何上最短的路径通常不是最省油的路径。
+
+从计算机角度看，这仍然是最短路，只不过边权不再是统一的，也不是纯几何距离。
+这就是这个 benchmark 想教你的地方：要做“考虑物理场的最短路”。
+
+## 哪些部分是冻结的
+
+- 栅格地图
+- 起点
+- 终点
+- 确定性的风场
+- 确定性的流场
+- 用来给每一步打分的燃油和航时模型
+- 路径校验器
+
+## 输入与输出
+
+你的候选文件需要定义：
+
+```python
+def solve(instance):
+    ...
+```
+
+`instance` 字典至少包含：
+
+- `grid`
+- `start`
+- `goal`
+- `current_field`
+- `wind_field`
+- `objective`
+
+函数必须返回以下两种形式之一：
+
+- `(x, y)` 坐标组成的路径列表，或者
+- 带 `path` 键的字典
+
+路径必须：
+
+- 从 `instance["start"]` 开始
+- 到达 `instance["goal"]`
+- 每一步都只能走到相邻格点
+- 始终停留在水域上
+
+## 预期结果
+
+好的解法应该返回一条可行航线，并尽量降低燃油消耗。
+评测器也会报告航时和步数，但优化目标是燃油。
+
+这个教学 scaffold 里的 reference solver 会在冻结栅格上用燃油代价做精确 Dijkstra 搜索，因此它可以报告这道题在当前实例上的最佳燃油值。
+
+## 计分方式
+
+这是一个最小化任务。
+我们用 baseline 和精确 reference 的燃油值做归一化：
+
+```text
+normalized_score = 100 * clip((baseline_fuel - candidate_fuel) / (baseline_fuel - reference_fuel), 0, 1)
+```
+
+含义是：
+
+- `0` 表示候选解不比 baseline 更好
+- `100` 表示候选解达到了 exact reference 的燃油最优值
+- 无效提交得分为 `0`
+
+我们还会报告：
+
+- `candidate_fuel`
+- `baseline_fuel`
+- `reference_fuel`
+- `candidate_time_h`
+- `baseline_time_h`
+- `candidate_hops`
+- `theoretical_optimum_fuel`
+
+## 为什么难
+
+最简单的办法是只找最少步数的路径。
+这通常是可行的，但它忽略了风场和流场。
+
+更现实的版本是加权最短路。
+你需要真正根据每一步的物理代价来选路，而不是只数步数。
+这正是这个 benchmark 的核心训练点。
+
+## 判为无效的情况
+
+如果出现以下任一情况，提交无效：
+
+- 缺少 `solve`
+- 返回值无法解析为路径
+- 路径越界、进入陆地，或者使用了非相邻步长
+- 评测器无法导入或运行候选文件
+
+<!-- AI_GENERATED -->
diff --git a/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/baseline/init.py b/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/baseline/init.py
new file mode 100644
index 00000000..94bb0cae
--- /dev/null
+++ b/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/baseline/init.py
@@ -0,0 +1,67 @@
+from __future__ import annotations
+
+from collections import deque
+import sys
+from pathlib import Path
+
+
+def _ensure_repo_root() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            root = str(parent)
+            if root not in sys.path:
+                sys.path.insert(0, root)
+            return
+
+
+_ensure_repo_root()
+
+try:
+    from benchmarks.OperationsResearch.FuelMinimizingShipWeatherRouting.runtime.problem import load_instance, route_metrics
+except ModuleNotFoundError:
+    from runtime.problem import load_instance, route_metrics
+
+
+def _is_free(grid, cell):
+    x, y = cell
+    return 0 <= y < len(grid) and 0 <= x < len(grid[0]) and grid[y][x] != "#"
+
+
+def _neighbors(grid, cell):
+    x, y = cell
+    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
+    return [nxt for nxt in candidates if _is_free(grid, nxt)]
+
+
+def _retrace(parent, node):
+    path = []
+    current = node
+    while current is not None:
+        path.append(current)
+        current = parent[current]
+    return path[::-1]
+
+
+def solve(instance):
+    grid = instance["grid"]
+    start = instance["start"]
+    goal = instance["goal"]
+
+    queue = deque([start])
+    parent = {start: None}
+    while queue:
+        current = queue.popleft()
+        if current == goal:
+            return _retrace(parent, current)
+        for nxt in _neighbors(grid, current):
+            if nxt not in parent:
+                parent[nxt] = current
+                queue.append(nxt)
+    raise RuntimeError("baseline route not found")
+
+
+if __name__ == "__main__":
+    instance = load_instance()
+    path = solve(instance)
+    print(route_metrics(path)["fuel"])
diff --git a/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/verification/evaluate.py b/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/verification/evaluate.py
new file mode 100644
index 00000000..526300a7
--- /dev/null
+++ b/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/verification/evaluate.py
@@ -0,0 +1,158 @@
+from __future__ import annotations
+
+import argparse
+import importlib.util
+import json
+import sys
+from pathlib import Path
+from types import ModuleType
+from typing import Any
+
+
+def _ensure_repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            root = parent
+            root_s = str(root)
+            if root_s not in sys.path:
+                sys.path.insert(0, root_s)
+            return root
+    raise RuntimeError("could not locate repository root")
+
+
+REPO_ROOT = _ensure_repo_root()
+
+try:
+    from benchmarks.OperationsResearch.FuelMinimizingShipWeatherRouting.runtime.problem import (
+        REFERENCE_FUEL,
+        REFERENCE_TIME_H,
+        BASELINE_FUEL,
+        BASELINE_TIME_H,
+        load_instance,
+        route_metrics,
+        validate_path,
+    )
+except ModuleNotFoundError:
+    from runtime.problem import REFERENCE_FUEL, REFERENCE_TIME_H, BASELINE_FUEL, BASELINE_TIME_H, load_instance, route_metrics, validate_path
+
+
+def _load_module(path: Path, module_name: str) -> ModuleType:
+    path = path.resolve()
+    if not path.is_file():
+        raise FileNotFoundError(path)
+    if str(path.parent) not in sys.path:
+        sys.path.insert(0, str(path.parent))
+    spec = importlib.util.spec_from_file_location(module_name, str(path))
+    if spec is None or spec.loader is None:
+        raise RuntimeError(f"failed to load module from {path}")
+    module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(module)
+    return module
+
+
+def _score_from_fuel(fuel: float, baseline: float, optimum: float) -> float:
+    """Map fuel to [0, 100]: 0 at the baseline, 100 at the reference optimum."""
+    span = baseline - optimum
+    # No positive span (baseline already optimal, or NaN) means nothing to normalize.
+    if not span > 0:
+        return 0.0
+    score = 100.0 * (baseline - float(fuel)) / span
+    return max(0.0, min(100.0, score))
+
+
+def _evaluate_solve_module(module: ModuleType, instance: Any) -> dict[str, Any]:
+    if not hasattr(module, "solve"):
+        raise AttributeError("candidate module must define solve")
+    result = module.solve(instance)
+    path = validate_path(result)
+    metrics = route_metrics(path)
+    return {"path": path, **metrics}
+
+
+def evaluate(candidate_path: str | None = None) -> dict[str, Any]:
+    instance = load_instance()
+
+    baseline_module = _load_module(
+        REPO_ROOT / "frontier_eval" / "tasks" / "teaching_fuel_minimizing_ship_weather_routing" / "baseline" / "init.py",
+        "teaching_fuel_baseline",
+    )
+    reference_module = _load_module(
+        REPO_ROOT / "frontier_eval" / "tasks" / "teaching_fuel_minimizing_ship_weather_routing" / "verification" / "reference.py",
+        "teaching_fuel_reference",
+    )
+
+    baseline_result = _evaluate_solve_module(baseline_module, instance)
+    reference_result = _evaluate_solve_module(reference_module, instance)
+    baseline_fuel = float(baseline_result["fuel"])
+    reference_fuel = float(reference_result["fuel"])
+    optimum_fuel = reference_fuel
+    reference_score = _score_from_fuel(reference_fuel, baseline_fuel, optimum_fuel)
+
+    metrics: dict[str, Any] = {
+        "baseline_fuel": baseline_fuel,
+        "baseline_time_h": float(baseline_result["time_h"]),
+        "baseline_hops": float(baseline_result["hops"]),
+        "reference_fuel": reference_fuel,
+        "reference_time_h": float(reference_result["time_h"]),
+        "reference_hops": float(reference_result["hops"]),
+        "theoretical_optimum_fuel": float(optimum_fuel),
+        "theoretical_optimum_time_h": float(REFERENCE_TIME_H),
+        "theoretical_upper_bound_score": 100.0,
+        "baseline_score": _score_from_fuel(baseline_fuel, baseline_fuel, optimum_fuel),
+        "reference_score": reference_score,
+        "valid": 1.0,
+    }
+
+    if candidate_path:
+        try:
+            candidate_module = _load_module(Path(candidate_path), "teaching_fuel_candidate")
+            candidate_result = _evaluate_solve_module(candidate_module, instance)
+            candidate_fuel = float(candidate_result["fuel"])
+            metrics.update(
+                {
+                    "candidate_fuel": candidate_fuel,
+                    "candidate_time_h": float(candidate_result["time_h"]),
+                    "candidate_hops": float(candidate_result["hops"]),
+                    "candidate_score": _score_from_fuel(candidate_fuel, baseline_fuel, optimum_fuel),
+                    "gap_to_optimum": candidate_fuel - optimum_fuel,
+                    "combined_score": _score_from_fuel(candidate_fuel, baseline_fuel, optimum_fuel),
+                }
+            )
+        except Exception as exc:
+            metrics.update(
+                {
+                    "candidate_fuel": float("inf"),
+                    "candidate_time_h": float("inf"),
+                    "candidate_hops": float("inf"),
+                    "candidate_score": 0.0,
+                    "gap_to_optimum": float("inf"),
+                    "combined_score": 0.0,
+                    "valid": 0.0,
+                    "candidate_error": str(exc),
+                }
+            )
+    else:
+        metrics.update(
+            {
+                "combined_score": reference_score,
+                "gap_to_optimum": reference_fuel - optimum_fuel,
+            }
+        )
+
+    return metrics
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("candidate_path", nargs="?", default=None)
+    parser.add_argument("--candidate", dest="candidate_flag", default=None)
+    args = parser.parse_args(argv)
+    candidate_path = args.candidate_flag or args.candidate_path
+    metrics = evaluate(candidate_path)
+    print(json.dumps(metrics, indent=2, sort_keys=True))
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/verification/reference.py b/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/verification/reference.py
new file mode 100644
index 00000000..b80eb6b9
--- /dev/null
+++ b/frontier_eval/tasks/teaching_fuel_minimizing_ship_weather_routing/verification/reference.py
@@ -0,0 +1,95 @@
+from __future__ import annotations
+
+import heapq
+import sys
+from pathlib import Path
+
+
+def _ensure_repo_root() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            root = str(parent)
+            if root not in sys.path:
+                sys.path.insert(0, root)
+            return
+
+
+_ensure_repo_root()
+
+try:
+    from benchmarks.OperationsResearch.FuelMinimizingShipWeatherRouting.runtime.problem import (
+        current_at,
+        load_instance,
+        route_metrics,
+        wind_at,
+    )
+except ModuleNotFoundError:
+    from runtime.problem import current_at, load_instance, route_metrics, wind_at
+
+
+def _is_free(grid, cell):
+    x, y = cell
+    return 0 <= y < len(grid) and 0 <= x < len(grid[0]) and grid[y][x] != "#"
+
+
+def _neighbors(grid, cell):
+    x, y = cell
+    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
+    return [nxt for nxt in candidates if _is_free(grid, nxt)]
+
+
+def _retrace(parent, node):
+    path = []
+    current = node
+    while current is not None:
+        path.append(current)
+        current = parent[current]
+    return path[::-1]
+
+
+def _leg_metrics(prev, curr):
+    dx = curr[0] - prev[0]
+    dy = curr[1] - prev[1]
+    current_u, current_v = current_at(prev)
+    wind_u, wind_v = wind_at(prev)
+    current_along = current_u * dx + current_v * dy
+    wind_along = wind_u * dx + wind_v * dy
+    headwind = max(0.0, -wind_along)
+    crosswind = abs(-dy * wind_u + dx * wind_v)
+    speed = max(0.35, 1.0 + 0.65 * current_along - 0.45 * headwind)
+    leg_time_h = 1.0 / speed
+    fuel_rate = 1.05 + 0.55 * headwind + 0.20 * crosswind + 0.25 * max(0.0, -current_along)
+    leg_fuel = leg_time_h * fuel_rate
+    return leg_fuel, leg_time_h
+
+
+def solve(instance):
+    grid = instance["grid"]
+    start = instance["start"]
+    goal = instance["goal"]
+
+    frontier = [(0.0, 0.0, start)]
+    parent = {start: None}
+    best_cost = {start: 0.0}
+
+    while frontier:
+        cost, _, current = heapq.heappop(frontier)
+        if cost != best_cost.get(current, float("inf")):
+            continue
+        if current == goal:
+            return _retrace(parent, current)
+        for nxt in _neighbors(grid, current):
+            leg_fuel, _ = _leg_metrics(current, nxt)
+            new_cost = cost + leg_fuel
+            if new_cost < best_cost.get(nxt, float("inf")):
+                best_cost[nxt] = new_cost
+                parent[nxt] = current
+                heapq.heappush(frontier, (new_cost, len(best_cost), nxt))
+    raise RuntimeError("no feasible route found")
+
+
+if __name__ == "__main__":
+    instance = load_instance()
+    path = solve(instance)
+    print(route_metrics(path)["fuel"])
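The contrast between the breadth-first baseline and the Dijkstra reference above is easiest to see on a tiny map. The sketch below is self-contained and does not use the frozen physics model: `ENTER_COST` is a made-up stand-in for a headwind cell, and all function names are illustrative.

```python
import heapq
from collections import deque

GRID = ["...",
        "..."]
# Hypothetical costs: entering (1, 0) is expensive, every other cell costs 1.0.
ENTER_COST = {(1, 0): 5.0}


def neighbors(cell):
    x, y = cell
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= ny < len(GRID) and 0 <= nx < len(GRID[0]) and GRID[ny][nx] != "#":
            yield (nx, ny)


def move_cost(dst):
    return ENTER_COST.get(dst, 1.0)


def retrace(parent, node):
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def fewest_hops(start, goal):
    """Breadth-first search: minimizes the number of moves, ignores cost."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return retrace(parent, current)
        for nxt in neighbors(current):
            if nxt not in parent:
                parent[nxt] = current
                queue.append(nxt)
    raise RuntimeError("no route")


def cheapest(start, goal):
    """Dijkstra: minimizes the summed move cost (all costs are positive)."""
    best = {start: 0.0}
    parent = {start: None}
    heap = [(0.0, start)]
    while heap:
        cost, current = heapq.heappop(heap)
        if cost > best[current]:
            continue  # stale heap entry
        if current == goal:
            return retrace(parent, current)
        for nxt in neighbors(current):
            new_cost = cost + move_cost(nxt)
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = current
                heapq.heappush(heap, (new_cost, nxt))
    raise RuntimeError("no route")


def path_cost(path):
    return sum(move_cost(cell) for cell in path[1:])


for solver in (fewest_hops, cheapest):
    route = solver((0, 0), (2, 0))
    print(solver.__name__, len(route) - 1, "hops, cost", path_cost(route))
```

On this map the two-hop route costs 6.0, while the four-hop detour through the bottom row costs 4.0, which is the same effect the fuel benchmark rewards.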
diff --git a/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/README.md b/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/README.md
new file mode 100644
index 00000000..44edc226
--- /dev/null
+++ b/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/README.md
@@ -0,0 +1,35 @@
+# Multi-Robot Prioritized Planning
+
+This teaching scaffold is derived from `benchmarks/Robotics/MultiRobotPrioritizedPlanning`.
+It explains small-scale multi-agent path finding for readers who know algorithms and graphs but are new to robotics path coordination.
+
+## Directory Structure
+
+```text
+teaching_multi_robot_prioritized_planning/
+├── README.md
+├── README_zh-CN.md
+├── Task.md
+├── Task_zh-CN.md
+├── baseline/
+│   └── init.py
+└── verification/
+    ├── reference.py
+    └── evaluate.py
+```
+
+## What This Benchmark Teaches
+
+The benchmark asks you to plan collision-free paths for three robots on a frozen grid.
+Each robot can move one cell per step or wait in place, and the hard part is avoiding vertex conflicts and edge-swap conflicts between robots while keeping the total path cost low.
+
+This is a clean teaching example of prioritized planning:
+choose an ordering of robots, plan one robot at a time in space-time, and reserve its path so later robots avoid it.
+
+## Source of Truth
+
+The frozen instance and runtime helpers live in:
+
+`benchmarks/Robotics/MultiRobotPrioritizedPlanning/runtime/problem.py`
+
+<!-- AI_GENERATED -->
diff --git a/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/README_zh-CN.md b/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/README_zh-CN.md
new file mode 100644
index 00000000..c2e72121
--- /dev/null
+++ b/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/README_zh-CN.md
@@ -0,0 +1,35 @@
+# 多机器人优先级路径规划
+
+这个教学版任务基于 `benchmarks/Robotics/MultiRobotPrioritizedPlanning`。
+它面向懂算法和图搜索、但还没有接触过机器人协同规划的读者。
+
+## 目录结构
+
+```text
+teaching_multi_robot_prioritized_planning/
+├── README.md
+├── README_zh-CN.md
+├── Task.md
+├── Task_zh-CN.md
+├── baseline/
+│   └── init.py
+└── verification/
+    ├── reference.py
+    └── evaluate.py
+```
+
+## 这个基准在教什么
+
+这个任务要求你在一张冻结栅格图上，为 3 台机器人同时规划无碰撞路径。
+每台机器人每一步可以走到相邻格点，也可以原地等待，真正的难点是避免机器人之间的顶点冲突和边交换冲突，同时尽量降低总路径代价。
+
+它很适合讲解 prioritized planning：
+先决定机器人顺序，再用 space-time 搜索逐个规划，后面的机器人必须避开前面机器人已经占用的时间和空间。
+
+## 事实来源
+
+冻结实例和运行时辅助函数都在：
+
+`benchmarks/Robotics/MultiRobotPrioritizedPlanning/runtime/problem.py`
+
+<!-- AI_GENERATED -->
diff --git a/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/Task.md b/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/Task.md
new file mode 100644
index 00000000..b19df09a
--- /dev/null
+++ b/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/Task.md
@@ -0,0 +1,116 @@
+# Multi-Robot Prioritized Planning Task
+
+## Problem
+
+You must plan collision-free paths for three robots on a frozen occupancy grid.
+The evaluator checks whether the paths are individually valid, whether the robots collide with each other, and how expensive the joint plan is.
+
+This is a prioritized planning problem.
+The robots are planned one at a time in a chosen priority order.
+When a robot is planned, the already planned robots become moving obstacles in space-time.
+
+## Background
+
+Single-robot shortest path planning is not enough once multiple robots share the same aisles.
+Even if each robot has a good path by itself, two robots can still collide at the same cell at the same time, or swap cells across the same edge in opposite directions.
+
+That is why multi-robot planning is often solved by decomposition:
+pick an ordering of robots, plan the first robot, reserve its path, then plan the next robot against those reservations.
+
+This benchmark keeps the instance tiny on purpose.
+There are only three robots, which makes the teaching point clear:
+the difficult part is not path search itself, but the interaction between path search and coordination.
+
+## What Is Frozen
+
+- The occupancy grid.
+- The three start-goal pairs.
+- The collision rules.
+- The fact that each robot may wait in place or move to an adjacent free cell.
+- The reference instance has exactly three robots, which makes exact search over this tiny frozen case feasible.
+
+## Input and Output
+
+Your candidate file should define:
+
+```python
+def plan_paths(grid, starts, goals):
+    ...
+```
+
+Inputs:
+
+- `grid`: a tuple of strings, where `#` is an obstacle and `.` is free space
+- `starts`: a tuple of `(x, y)` start cells, one per robot
+- `goals`: a tuple of `(x, y)` goal cells, one per robot
+
+Output:
+
+- a list of paths, one path per robot
+
+A dictionary with a `paths` field is also accepted by the evaluator.
+
+Each path is a sequence of grid cells.
+The first cell must equal the robot start.
+The last cell must equal the robot goal.
+Each step must either stay in place or move to one of the four neighboring cells.
+
+The evaluator will reject paths that leave the grid, enter obstacles, or violate collision rules.
+
+## Expected Result
+
+A good solution should produce a feasible set of paths and keep the total path length small.
+The evaluator measures:
+
+- `candidate_total_cost = sum(len(path) - 1 for path in paths)`
+- `candidate_makespan = max(len(path) - 1 for path in paths)`
+
+The reference implementation in this scaffold performs exact joint-state search on the tiny frozen 3-robot instance, so it can report the best total cost achievable on this teaching instance.
+
+## Scoring
+
+This is a minimization task.
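+The optimal total cost found by the exact reference solver (`optimal_total_cost` in the formula, reported as `theoretical_optimum_total_cost`) anchors the top of the scale: a candidate that matches it earns the maximum score of 100.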
+
+```text
+normalized_score = 100 * clip((baseline_total_cost - candidate_total_cost) / (baseline_total_cost - optimal_total_cost), 0, 1)
+```
+
+Interpretation:
+
+- `0` means the candidate is no better than the baseline
+- `100` means the candidate matches the best total cost found by the exact reference search
+- invalid submissions receive `0`
+
+We also report:
+
+- `baseline_total_cost`
+- `reference_total_cost`
+- `lower_bound_total_cost`
+- `candidate_makespan`
+- `theoretical_optimum_total_cost`
+
+## Why This Is Hard
+
+The obvious idea is to plan each robot independently with shortest paths.
+That fails as soon as two robots want to use the same corridor at the same time.
+
+The next step is prioritized planning, but the priority order matters.
+The first robot gets the most freedom, while later robots inherit all the reservations.
+Choosing a bad order can make the instance look infeasible or force extra detours.
+
+That is the optimization lesson here:
+you are not only solving path planning, you are also choosing a coordination policy.
+
+## Failure Cases
+
+The submission is invalid if:
+
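+- `plan_paths` is missing (the evaluator falls back to `solve(instance)` only if the file defines it)
+- the returned value is neither a list of paths nor a dictionary with a `paths` field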
+- any path is malformed or non-adjacent
+- any path enters an obstacle
+- a vertex collision or edge-swap collision occurs
+- the evaluator cannot import or run the candidate file
+
+<!-- AI_GENERATED -->
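Reviewer note on the Scoring section of `Task.md`: the formula can be checked by hand with a small sketch. The helper name `normalized_score` and the costs 30, 24, 27, and 33 are made up for illustration and are not the frozen instance's real numbers. The degenerate branch (baseline already optimal) follows what `verification/evaluate.py` in this diff does, which the prose formula leaves unstated.

```python
def normalized_score(candidate: float, baseline: float, optimal: float) -> float:
    """Map a total cost onto 0..100, where a lower cost gives a higher score."""
    span = baseline - optimal
    if span <= 0:
        # Degenerate case: the baseline is already optimal, so there is no
        # room to improve and the ratio in the formula would divide by zero.
        return 100.0 if candidate <= optimal else 0.0
    raw = 100.0 * (baseline - candidate) / span
    # clip(..., 0, 1) in the text becomes a clamp to [0, 100] here.
    return max(0.0, min(100.0, raw))


# Hypothetical costs, for illustration only.
print(normalized_score(27, 30, 24))  # halfway between baseline and optimum
print(normalized_score(33, 30, 24))  # worse than the baseline, clipped
print(normalized_score(24, 30, 24))  # matches the optimum
```

Because the score is clipped, a candidate that is worse than the baseline and an invalid candidate both end up at 0, so the score alone does not distinguish them.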
diff --git a/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/Task_zh-CN.md b/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/Task_zh-CN.md
new file mode 100644
index 00000000..a0ca8cc2
--- /dev/null
+++ b/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/Task_zh-CN.md
@@ -0,0 +1,115 @@
+# 多机器人优先级路径规划任务
+
+## 问题
+
+你需要为 3 台机器人在一张冻结的占据栅格上规划无碰撞路径。
+评测器会检查每条路径是否有效、机器人之间是否发生冲突，以及整组路径的总代价有多大。
+
+这是一个 prioritized planning 问题。
+机器人不是同时一起搜，而是按某个优先级顺序逐个规划。
+在规划后面的机器人时，前面机器人已经占用的时间和空间都会变成约束。
+
+## 背景
+
+单机器人最短路并不足以解决多机器人协同问题。
+即使每台机器人单独看都是好路径，它们仍然可能在同一个时刻出现在同一个格点，或者在同一条边上相向交换位置。
+
+所以多机器人规划通常会做分解：
+先决定机器人顺序，再为第一个机器人规划路径并保留它的占用，再让下一个机器人在这些保留约束下规划。
+
+这个 benchmark 故意把实例做得很小。
+只有 3 台机器人，所以教学重点非常清楚：
+难点不只是 path search，而是 path search 和协同之间的互动。
+
+## 哪些部分是冻结的
+
+- 占据栅格
+- 3 组起点和终点
+- 冲突规则
+- 每台机器人每一步可以原地等待，或者移动到四邻接的空闲格点
+- 这个参考实例只有 3 台机器人，因此在这个很小的冻结实例上做精确搜索是可行的
+
+## 输入与输出
+
+你的候选文件需要定义：
+
+```python
+def plan_paths(grid, starts, goals):
+    ...
+```
+
+输入：
+
+- `grid`：字符串元组，`#` 表示障碍物，`.` 表示空闲空间
+- `starts`：起点元组，每台机器人一个 `(x, y)`
+- `goals`：终点元组，每台机器人一个 `(x, y)`
+
+输出：
+
+- 路径列表，每台机器人一条路径
+
+评测器也接受带 `paths` 字段的字典。
+
+每条路径都是网格坐标序列。
+第一个坐标必须是机器人起点。
+最后一个坐标必须是机器人终点。
+每一步要么原地不动，要么移动到四邻接格点。
+
+如果路径越界、进入障碍物，或者违反冲突规则，评测器都会判定失败。
+
+## 预期结果
+
+好的解法应该输出可行路径集合，并尽量降低总路径长度。
+评测器会计算：
+
+- `candidate_total_cost = sum(len(path) - 1 for path in paths)`
+- `candidate_makespan = max(len(path) - 1 for path in paths)`
+
+这个教学 scaffold 里的 reference 实现会对这个很小的冻结实例做精确搜索，因此它可以报告这道教学题里能达到的最佳总代价。
+
+## 计分方式
+
+这是一个最小化任务。
+我们使用 reference 搜索得到的最佳总代价作为理论上限。
+
+```text
+normalized_score = 100 * clip((baseline_total_cost - candidate_total_cost) / (baseline_total_cost - optimal_total_cost), 0, 1)
+```
+
+含义是：
+
+- `0` 表示候选解不比 baseline 更好
+- `100` 表示候选解达到 exact reference 搜索找到的最佳总代价
+- 无效提交得分为 `0`
+
+我们还会报告：
+
+- `baseline_total_cost`
+- `reference_total_cost`
+- `lower_bound_total_cost`
+- `candidate_makespan`
+- `theoretical_optimum_total_cost`
+
+## 为什么难
+
+最直觉的做法是把每台机器人单独规划成最短路。
+但一旦几台机器人共用同一条走廊，这样做就会冲突。
+
+下一步是 prioritized planning，但优先级顺序本身就是一个决策。
+先规划的机器人自由度更高，后规划的机器人必须绕开前面的保留路径。
+坏的顺序可能让问题看起来不可行，或者逼出更长的绕行路径。
+
+所以这个题的优化点不仅是“怎么找路”，还是“怎么选协同策略”。
+
+## 判为无效的情况
+
+如果出现以下任一情况，提交无效：
+
+- 缺少 `plan_paths`
+- 返回值不是路径列表
+- 任意路径格式错误，或者包含非相邻移动
+- 任意路径进入障碍物
+- 出现顶点冲突或边交换冲突
+- 评测器无法导入或运行候选文件
+
+<!-- AI_GENERATED -->
diff --git a/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/baseline/init.py b/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/baseline/init.py
new file mode 100644
index 00000000..6c027fcb
--- /dev/null
+++ b/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/baseline/init.py
@@ -0,0 +1,151 @@
+from __future__ import annotations
+
+from collections import deque
+from pathlib import Path
+import sys
+
+
+def _repo_root() -> Path:
+    return Path(__file__).resolve().parents[4]
+
+
+def _ensure_repo_root() -> None:
+    root = str(_repo_root())
+    if root not in sys.path:
+        sys.path.insert(0, root)
+
+
+_ensure_repo_root()
+
+from benchmarks.Robotics.MultiRobotPrioritizedPlanning.runtime import problem as benchmark_problem
+
+
+def _is_free(grid: tuple[str, ...], cell: tuple[int, int]) -> bool:
+    x, y = cell
+    return 0 <= y < len(grid) and 0 <= x < len(grid[0]) and grid[y][x] != "#"
+
+
+def _neighbors(grid: tuple[str, ...], cell: tuple[int, int]) -> list[tuple[int, int]]:
+    x, y = cell
+    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1), (x, y)]
+    return [candidate for candidate in candidates if _is_free(grid, candidate)]
+
+
+def _shortest_path_length(grid: tuple[str, ...], start: tuple[int, int], goal: tuple[int, int]) -> int:
+    queue = deque([start])
+    distance = {start: 0}
+    while queue:
+        current = queue.popleft()
+        if current == goal:
+            return distance[current]
+        for nxt in _neighbors(grid, current):
+            if nxt not in distance:
+                distance[nxt] = distance[current] + 1
+                queue.append(nxt)
+    raise RuntimeError("no individual path exists")
+
+
+def _space_time_bfs(
+    grid: tuple[str, ...],
+    start: tuple[int, int],
+    goal: tuple[int, int],
+    reserved_vertices: set[tuple[tuple[int, int], int]],
+    reserved_edges: set[tuple[tuple[tuple[int, int], tuple[int, int]], int]],
+    horizon: int,
+) -> list[tuple[int, int]] | None:
+    start_state = (start, 0)
+    if (start, 0) in reserved_vertices:
+        return None
+
+    queue = deque([start_state])
+    parent: dict[tuple[tuple[int, int], int], tuple[tuple[int, int], int] | None] = {start_state: None}
+
+    while queue:
+        current_cell, current_time = queue.popleft()
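+        if current_cell == goal and all((goal, t) not in reserved_vertices for t in range(current_time + 1, horizon + 1)):  # goal must stay free while this robot is parked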
+            path: list[tuple[int, int]] = []
+            state: tuple[tuple[int, int], int] | None = (current_cell, current_time)
+            while state is not None:
+                path.append(state[0])
+                state = parent[state]
+            return path[::-1]
+        if current_time >= horizon:
+            continue
+
+        for nxt in _neighbors(grid, current_cell):
+            next_time = current_time + 1
+            next_state = (nxt, next_time)
+            if next_state in parent:
+                continue
+            if next_state in reserved_vertices:
+                continue
+            if ((current_cell, nxt), next_time) in reserved_edges:
+                continue
+            if ((nxt, current_cell), next_time) in reserved_edges:
+                continue
+            parent[next_state] = (current_cell, current_time)
+            queue.append(next_state)
+
+    return None
+
+
+def _reserve_path(
+    path: list[tuple[int, int]],
+    reserved_vertices: set[tuple[tuple[int, int], int]],
+    reserved_edges: set[tuple[tuple[tuple[int, int], tuple[int, int]], int]],
+    horizon: int,
+) -> None:
+    for t, cell in enumerate(path):
+        reserved_vertices.add((cell, t))
+        if t > 0:
+            reserved_edges.add(((path[t - 1], cell), t))
+    goal = path[-1]
+    for t in range(len(path), horizon + 1):
+        reserved_vertices.add((goal, t))
+
+
+def plan_paths(grid, starts, goals):
+    grid = tuple(str(row) for row in grid)
+    starts = tuple(tuple(cell) for cell in starts)
+    goals = tuple(tuple(cell) for cell in goals)
+
+    if len(starts) != len(goals):
+        raise ValueError("starts and goals must have the same length")
+
+    order = sorted(
+        range(len(starts)),
+        key=lambda idx: _shortest_path_length(grid, starts[idx], goals[idx]),
+        reverse=True,
+    )
+    horizon = max(40, len(grid) * len(grid[0]) * 2)
+
+    reserved_vertices: set[tuple[tuple[int, int], int]] = set()
+    reserved_edges: set[tuple[tuple[tuple[int, int], tuple[int, int]], int]] = set()
+    paths: list[list[tuple[int, int]] | None] = [None] * len(starts)
+
+    for robot_idx in order:
+        path = _space_time_bfs(
+            grid,
+            starts[robot_idx],
+            goals[robot_idx],
+            reserved_vertices,
+            reserved_edges,
+            horizon,
+        )
+        if path is None:
+            raise RuntimeError(f"failed to plan path for robot {robot_idx}")
+        paths[robot_idx] = path
+        _reserve_path(path, reserved_vertices, reserved_edges, horizon)
+
+    return [path for path in paths if path is not None]
+
+
+def solve(instance):
+    return plan_paths(instance["grid"], instance["starts"], instance["goals"])
+
+
+if __name__ == "__main__":
+    instance = benchmark_problem.load_instance()
+    paths = plan_paths(instance["grid"], instance["starts"], instance["goals"])
+    print(benchmark_problem.total_cost(paths))
+
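Reviewer note on `baseline/init.py`: the baseline commits to a single priority order (longest individual shortest path first). `Task.md` stresses that the order matters, and with three robots there are only 3! = 6 orders, so trying all of them is cheap. The following is a minimal sketch of that idea, not the benchmark's method. All names (`plan_one`, `plan_in_order`, `best_order`) are hypothetical, the 3x3 grid is a toy example rather than the frozen instance, and the sketch assumes a robot stays parked on its goal after arriving.

```python
from collections import deque
from itertools import permutations


def neighbors(grid, cell):
    """Yield free four-connected neighbors plus the wait-in-place move."""
    x, y = cell
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1), (x, y)):
        if 0 <= ny < len(grid) and 0 <= nx < len(grid[0]) and grid[ny][nx] != "#":
            yield (nx, ny)


def plan_one(grid, start, goal, occupied, moves, horizon):
    """Space-time BFS avoiding reserved (cell, t) pairs and edge swaps."""
    if (start, 0) in occupied:
        return None
    parent = {(start, 0): None}
    queue = deque([(start, 0)])
    while queue:
        cell, t = queue.popleft()
        # Assumption: a robot parks on its goal, so the goal must stay free later.
        if cell == goal and all((goal, k) not in occupied for k in range(t + 1, horizon + 1)):
            path, state = [], (cell, t)
            while state is not None:
                path.append(state[0])
                state = parent[state]
            return path[::-1]
        if t >= horizon:
            continue
        for nxt in neighbors(grid, cell):
            state = (nxt, t + 1)
            # (nxt, cell, t + 1) in moves means a reserved robot travels the
            # same edge in the opposite direction during this step.
            if state in parent or state in occupied or (nxt, cell, t + 1) in moves:
                continue
            parent[state] = (cell, t)
            queue.append(state)
    return None


def plan_in_order(grid, starts, goals, order, horizon):
    """Plan robots one at a time in the given order, reserving each path."""
    occupied, moves = set(), set()
    paths = [None] * len(starts)
    for idx in order:
        path = plan_one(grid, starts[idx], goals[idx], occupied, moves, horizon)
        if path is None:
            return None
        paths[idx] = path
        for t in range(horizon + 1):
            cell = path[min(t, len(path) - 1)]  # parked on the goal after arrival
            occupied.add((cell, t))
            if 0 < t < len(path):
                moves.add((path[t - 1], cell, t))
    return paths


def best_order(grid, starts, goals, horizon=40):
    """Try every priority order and keep the cheapest feasible plan."""
    best = None
    for order in permutations(range(len(starts))):
        paths = plan_in_order(grid, starts, goals, order, horizon)
        if paths is None:
            continue
        cost = sum(len(p) - 1 for p in paths)
        if best is None or cost < best[0]:
            best = (cost, order, paths)
    return best


if __name__ == "__main__":
    toy_grid = ("...", "...", "...")  # hypothetical toy grid
    print(best_order(toy_grid, ((0, 1), (1, 0)), ((2, 1), (1, 2))))
```

Even the best priority order is not guaranteed to match the joint optimum in general, which is why the scaffold keeps a separate exact reference solver.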
diff --git a/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/verification/evaluate.py b/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/verification/evaluate.py
new file mode 100644
index 00000000..c8a8609c
--- /dev/null
+++ b/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/verification/evaluate.py
@@ -0,0 +1,166 @@
+from __future__ import annotations
+
+import argparse
+import importlib.util
+import json
+import math
+from pathlib import Path
+import sys
+from time import perf_counter
+from types import ModuleType
+from typing import Any, Callable
+
+
+HERE = Path(__file__).resolve().parent
+TASK_ROOT = HERE.parent
+REPO_ROOT = HERE.parents[3]
+
+if str(REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(REPO_ROOT))
+
+def _load_module(path: Path, module_name: str) -> ModuleType:
+    path = path.resolve()
+    if not path.is_file():
+        raise FileNotFoundError(path)
+    spec = importlib.util.spec_from_file_location(module_name, str(path))
+    if spec is None or spec.loader is None:
+        raise RuntimeError(f"failed to load module from {path}")
+    module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(module)
+    return module
+
+
+def _benchmark_problem():
+    module_path = (
+        REPO_ROOT
+        / "benchmarks"
+        / "Robotics"
+        / "MultiRobotPrioritizedPlanning"
+        / "runtime"
+        / "problem.py"
+    )
+    return _load_module(module_path, "teaching_mrpp_benchmark_problem")
+
+
+def _load_solver(path: Path) -> Callable[..., Any]:
+    module = _load_module(path, f"teaching_mrpp_{path.stem}")
+    if hasattr(module, "plan_paths"):
+        return getattr(module, "plan_paths")
+    if hasattr(module, "solve"):
+        return getattr(module, "solve")
+    raise AttributeError(f"{path} must define plan_paths(grid, starts, goals) or solve(instance)")
+
+
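+def _evaluate_solver(label: str, solver: Callable[..., Any], instance: dict[str, Any]) -> dict[str, Any]:
+    benchmark_problem = _benchmark_problem()
+    started = perf_counter()
+    try:
+        # Dispatch on the entry point chosen by _load_solver; a blanket `except TypeError` would mask bugs raised inside the solver.
+        takes_instance = getattr(solver, "__name__", "") == "solve"
+        args = (instance,) if takes_instance else (instance["grid"], instance["starts"], instance["goals"])
+        result = solver(*args)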
+        total_cost = float(benchmark_problem.total_cost(result))
+        makespan = float(benchmark_problem.makespan(result))
+        valid = 1.0
+        error = ""
+    except Exception as exc:
+        total_cost = float("inf")
+        makespan = float("inf")
+        valid = 0.0
+        error = str(exc)
+    runtime_s = perf_counter() - started
+    return {
+        "label": label,
+        "valid": valid,
+        "total_cost": total_cost,
+        "makespan": makespan,
+        "runtime_s": float(runtime_s),
+        "error": error,
+    }
+
+
+def _normalized_score(candidate_cost: float, baseline_cost: float, optimum_cost: float) -> float:
+    if not all(math.isfinite(x) for x in (candidate_cost, baseline_cost, optimum_cost)):
+        return 0.0
+    span = baseline_cost - optimum_cost
+    if span <= 0:
+        return 100.0 if candidate_cost <= optimum_cost else 0.0
+    score = 100.0 * (baseline_cost - candidate_cost) / span
+    return float(max(0.0, min(100.0, score)))
+
+
+def evaluate(candidate_path: str | None = None) -> dict[str, Any]:
+    benchmark_problem = _benchmark_problem()
+    instance = benchmark_problem.load_instance()
+
+    baseline_solver = _load_solver(TASK_ROOT / "baseline" / "init.py")
+    reference_solver = _load_solver(HERE / "reference.py")
+
+    baseline = _evaluate_solver("baseline", baseline_solver, instance)
+    reference = _evaluate_solver("reference", reference_solver, instance)
+
+    if candidate_path is None:
+        candidate = baseline
+        candidate_label = "baseline"
+    else:
+        candidate_label = str(Path(candidate_path).expanduser().resolve())
+        try:
+            candidate_solver = _load_solver(Path(candidate_path))
+            candidate = _evaluate_solver("candidate", candidate_solver, instance)
+        except Exception as exc:
+            candidate = {
+                "label": "candidate",
+                "valid": 0.0,
+                "total_cost": float("inf"),
+                "makespan": float("inf"),
+                "runtime_s": 0.0,
+                "error": str(exc),
+            }
+
+    optimum_cost = reference["total_cost"] if reference["valid"] else float("inf")
+    candidate_score = _normalized_score(candidate["total_cost"], baseline["total_cost"], optimum_cost)
+    reference_score = _normalized_score(reference["total_cost"], baseline["total_cost"], optimum_cost)
+
+    result: dict[str, Any] = {
+        "candidate_label": candidate_label,
+        "candidate_valid": candidate["valid"],
+        "candidate_total_cost": candidate["total_cost"],
+        "candidate_makespan": candidate["makespan"],
+        "candidate_runtime_s": candidate["runtime_s"],
+        "candidate_score": candidate_score,
+        "baseline_valid": baseline["valid"],
+        "baseline_total_cost": baseline["total_cost"],
+        "baseline_makespan": baseline["makespan"],
+        "baseline_runtime_s": baseline["runtime_s"],
+        "baseline_score": _normalized_score(baseline["total_cost"], baseline["total_cost"], optimum_cost),
+        "reference_valid": reference["valid"],
+        "reference_total_cost": reference["total_cost"],
+        "reference_makespan": reference["makespan"],
+        "reference_runtime_s": reference["runtime_s"],
+        "reference_score": reference_score,
+        "lower_bound_total_cost": float(benchmark_problem.LOWER_BOUND_TOTAL_COST),
+        "theoretical_optimum_total_cost": optimum_cost,
+        "theoretical_upper_bound_score": 100.0,
+        "combined_score": candidate_score,
+    }
+    if candidate["error"]:
+        result["candidate_error"] = candidate["error"]
+    if baseline["error"]:
+        result["baseline_error"] = baseline["error"]
+    if reference["error"]:
+        result["reference_error"] = reference["error"]
+    return result
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(description="Evaluate the teaching multi-robot prioritized planning scaffold.")
+    parser.add_argument("candidate_path", nargs="?", default=None, help="optional candidate Python file path")
+    parser.add_argument("--candidate", dest="candidate_flag", default=None, help="optional candidate Python file path")
+    args = parser.parse_args(argv)
+    candidate_path = args.candidate_flag or args.candidate_path
+    print(json.dumps(evaluate(candidate_path), indent=2, sort_keys=True))
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
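Reviewer note on `verification/evaluate.py`: cost and validity checks are delegated to `runtime/problem.py`, which is not shown in this part of the diff. To make the two collision rules from `Task.md` concrete, here is a minimal sketch of a pairwise conflict check. The helpers `position_at` and `find_conflict` are hypothetical and are not the benchmark's validator. The sketch assumes a robot that has finished its path stays on its goal cell.

```python
from itertools import combinations


def position_at(path, t):
    """Cell occupied at time t; a finished robot is assumed to stay on its goal."""
    return path[min(t, len(path) - 1)]


def find_conflict(paths):
    """Return (kind, robot_i, robot_j, time) for the first conflict, or None."""
    horizon = max(len(p) for p in paths)
    for t in range(horizon):
        for i, j in combinations(range(len(paths)), 2):
            a_now, b_now = position_at(paths[i], t), position_at(paths[j], t)
            if a_now == b_now:
                # Vertex conflict: both robots occupy the same cell at time t.
                return ("vertex", i, j, t)
            if t > 0:
                a_prev = position_at(paths[i], t - 1)
                b_prev = position_at(paths[j], t - 1)
                if a_prev == b_now and b_prev == a_now:
                    # Edge-swap conflict: the robots traded cells in one step.
                    return ("edge-swap", i, j, t)
    return None


if __name__ == "__main__":
    swap = [[(0, 0), (1, 0)], [(1, 0), (0, 0)]]
    print(find_conflict(swap))
```

A path set passes only if `find_conflict` returns `None`. Bounds, obstacle, and adjacency checks for each individual path would still be needed on top of this.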
diff --git a/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/verification/reference.py b/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/verification/reference.py
new file mode 100644
index 00000000..da33cc64
--- /dev/null
+++ b/frontier_eval/tasks/teaching_multi_robot_prioritized_planning/verification/reference.py
@@ -0,0 +1,151 @@
+from __future__ import annotations
+
+from collections import deque
+from heapq import heappop, heappush
+from pathlib import Path
+import sys
+
+
+def _repo_root() -> Path:
+    return Path(__file__).resolve().parents[4]
+
+
+def _ensure_repo_root() -> None:
+    root = str(_repo_root())
+    if root not in sys.path:
+        sys.path.insert(0, root)
+
+
+_ensure_repo_root()
+
+from benchmarks.Robotics.MultiRobotPrioritizedPlanning.runtime import problem as benchmark_problem
+
+
+def _is_free(grid: tuple[str, ...], cell: tuple[int, int]) -> bool:
+    x, y = cell
+    return 0 <= y < len(grid) and 0 <= x < len(grid[0]) and grid[y][x] != "#"
+
+
+def _neighbors(grid: tuple[str, ...], cell: tuple[int, int]) -> list[tuple[int, int]]:
+    x, y = cell
+    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1), (x, y)]
+    return [candidate for candidate in candidates if _is_free(grid, candidate)]
+
+
+def _distance_map(grid: tuple[str, ...], goal: tuple[int, int]) -> dict[tuple[int, int], int]:
+    queue = deque([goal])
+    distance = {goal: 0}
+    while queue:
+        current = queue.popleft()
+        for nxt in _neighbors(grid, current):
+            if nxt not in distance:
+                distance[nxt] = distance[current] + 1
+                queue.append(nxt)
+    return distance
+
+
+def _heuristic(state: tuple[tuple[int, int], ...], goals: tuple[tuple[int, int], ...], distance_maps: list[dict[tuple[int, int], int]]) -> int:
+    total = 0
+    for idx, pos in enumerate(state):
+        if pos == goals[idx]:
+            continue
+        distance = distance_maps[idx].get(pos)
+        if distance is None:
+            return 10**9
+        total += distance
+    return total
+
+
+def _reconstruct_states(parent: dict[tuple[tuple[int, int], ...], tuple[tuple[int, int], ...] | None], goal_state: tuple[tuple[int, int], ...]) -> list[tuple[tuple[int, int], ...]]:
+    sequence = []
+    state: tuple[tuple[int, int], ...] | None = goal_state
+    while state is not None:
+        sequence.append(state)
+        state = parent[state]
+    return sequence[::-1]
+
+
+def _next_states(current: tuple[tuple[int, int], ...], goals: tuple[tuple[int, int], ...], grid: tuple[str, ...]) -> list[tuple[tuple[int, int], ...]]:
+    action_sets: list[list[tuple[int, int]]] = []
+    for idx, pos in enumerate(current):
+        if pos == goals[idx]:
+            action_sets.append([pos])
+        else:
+            action_sets.append(_neighbors(grid, pos))
+
+    candidates: list[tuple[tuple[int, int], ...]] = []
+
+    def recurse(robot_idx: int, prefix: list[tuple[int, int]]) -> None:
+        if robot_idx == len(current):
+            nxt = tuple(prefix)
+            if len(set(nxt)) != len(nxt):
+                return
+            for i in range(len(current)):
+                for j in range(i + 1, len(current)):
+                    if current[i] == nxt[j] and current[j] == nxt[i]:
+                        return
+            candidates.append(nxt)
+            return
+
+        for choice in action_sets[robot_idx]:
+            recurse(robot_idx + 1, prefix + [choice])
+
+    recurse(0, [])
+    return candidates
+
+
+def plan_paths(grid, starts, goals):
+    grid = tuple(str(row) for row in grid)
+    starts = tuple(tuple(cell) for cell in starts)
+    goals = tuple(tuple(cell) for cell in goals)
+
+    distance_maps = [_distance_map(grid, goal) for goal in goals]
+    start_state = starts
+    goal_state = goals
+
+    open_heap: list[tuple[int, int, tuple[tuple[int, int], ...]]] = []
+    heappush(open_heap, (_heuristic(start_state, goals, distance_maps), 0, start_state))
+    best_g = {start_state: 0}
+    parent: dict[tuple[tuple[int, int], ...], tuple[tuple[int, int], ...] | None] = {start_state: None}
+
+    while open_heap:
+        _, g, current = heappop(open_heap)
+        if g != best_g.get(current):
+            continue
+        if current == goal_state:
+            joint_states = _reconstruct_states(parent, current)
+            paths: list[list[tuple[int, int]]] = []
+            for robot_idx, goal in enumerate(goals):
+                path: list[tuple[int, int]] = []
+                for state in joint_states:
+                    path.append(state[robot_idx])
+                    if state[robot_idx] == goal:
+                        break
+                paths.append(path)
+            return paths
+
+        step_cost = sum(1 for pos, goal in zip(current, goals) if pos != goal)
+        for nxt in _next_states(current, goals, grid):
+            tentative_g = g + step_cost
+            if tentative_g >= best_g.get(nxt, 10**18):
+                continue
+            best_g[nxt] = tentative_g
+            parent[nxt] = current
+            f = tentative_g + _heuristic(nxt, goals, distance_maps)
+            heappush(open_heap, (f, tentative_g, nxt))
+
+    raise RuntimeError("exact joint-state search failed to find a solution")
+
+
+def solve(instance):
+    return plan_paths(instance["grid"], instance["starts"], instance["goals"])
+
+
+EXACT_OPTIMUM_AVAILABLE = True
+
+
+if __name__ == "__main__":
+    instance = benchmark_problem.load_instance()
+    paths = plan_paths(instance["grid"], instance["starts"], instance["goals"])
+    print(benchmark_problem.total_cost(paths))
+
diff --git a/scripts/bootstrap_duckdb_benchmarks.py b/scripts/bootstrap_duckdb_benchmarks.py
new file mode 100644
index 00000000..077480bd
--- /dev/null
+++ b/scripts/bootstrap_duckdb_benchmarks.py
@@ -0,0 +1,1357 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import textwrap
+from pathlib import Path
+
+
+TASKS = [
+    {
+        "slug": "DuckDBIndexSelection",
+        "title": "DuckDB Index Selection",
+        "short": "Choose a small set of DuckDB indexes for a frozen analytical lookup workload.",
+        "kind": "index",
+        "source_manifest": """\
+# Source Manifest
+
+- Upstream engine: `DuckDB`
+- Upstream lineage:
+  - DuckDB benchmark and TPC-H documentation
+  - DuckDB SQL and index support
+- Schema lineage: this benchmark uses a local frozen relational workload with `customer`, `orders`, and `lineitem` tables modeled after the TPC-H schema family.
+- Data provenance: rows are generated deterministically inside DuckDB from fixed SQL formulas and a fixed schema; this is a benchmark-local synthetic dataset, not official TPC-H `dbgen` output.
+- Authenticity note: the schema and workload lineage are traceable to official DuckDB/TPC-H benchmarking materials, but the data itself is a local frozen synthetic asset used because online extension-based generation was not reliable in this environment.
+- License lineage: DuckDB is released under the MIT License.
+""",
+    },
+    {
+        "slug": "DuckDBQueryRewrite",
+        "title": "DuckDB Query Rewrite",
+        "short": "Rewrite a frozen DuckDB analytical SQL query to preserve results while reducing total runtime.",
+        "kind": "rewrite",
+        "source_manifest": """\
+# Source Manifest
+
+- Upstream engine: `DuckDB`
+- Upstream lineage:
+  - DuckDB benchmark and TPC-H documentation
+  - DuckDB SQL optimizer and query execution model
+- Schema lineage: this benchmark uses a local frozen relational workload with `customer`, `orders`, and `lineitem` tables modeled after the TPC-H schema family.
+- Data provenance: rows are generated deterministically inside DuckDB from fixed SQL formulas and a fixed schema; this is a benchmark-local synthetic dataset, not official TPC-H `dbgen` output.
+- Authenticity note: the workload shape is traceable to official DuckDB/TPC-H analytical reporting patterns, while the exact query instance is a benchmark-local frozen SQL task chosen to expose meaningful rewrite opportunities.
+- License lineage: DuckDB is released under the MIT License.
+""",
+    },
+    {
+        "slug": "DuckDBPreAggregationSelection",
+        "title": "DuckDB Pre-Aggregation Selection",
+        "short": "Choose a small set of pre-aggregation tables for a frozen DuckDB reporting workload.",
+        "kind": "preaggregation",
+        "source_manifest": """\
+# Source Manifest
+
+- Upstream engine: `DuckDB`
+- Upstream lineage:
+  - DuckDB benchmark and TPC-H documentation
+  - DuckDB SQL execution on analytical reporting queries
+- Schema lineage: this benchmark uses a local frozen relational workload with `customer`, `orders`, and `lineitem` tables modeled after the TPC-H schema family.
+- Data provenance: rows are generated deterministically inside DuckDB from fixed SQL formulas and a fixed schema; this is a benchmark-local synthetic dataset, not official TPC-H `dbgen` output.
+- Authenticity note: the reporting queries and schema family are traceable to official analytical benchmark patterns, while the candidate pre-aggregations are benchmark-local frozen physical-design options.
+- License lineage: DuckDB is released under the MIT License.
+""",
+    },
+]
+
+
+HELPER_TEXT = """\
+from __future__ import annotations
+
+import math
+import time
+from typing import Any
+
+import duckdb
+
+
+CUSTOMER_COUNT = 20_000
+ORDER_COUNT = 120_000
+LINEITEM_COUNT = 600_000
+
+SEGMENTS = ("BUILDING", "AUTOMOBILE", "HOUSEHOLD", "FURNITURE", "MACHINERY")
+SHIPMODES = ("AIR", "MAIL", "RAIL", "TRUCK", "SHIP")
+
+CUSTOMER_KEYS = tuple(1 + ((i * 97) % CUSTOMER_COUNT) for i in range(1, 301))
+ORDER_KEYS = tuple(1 + ((i * 193) % ORDER_COUNT) for i in range(1, 301))
+
+
+INDEX_CANDIDATES = {
+    "idx_orders_cust": "CREATE INDEX idx_orders_cust ON orders(o_custkey)",
+    "idx_orders_date": "CREATE INDEX idx_orders_date ON orders(o_orderdate)",
+    "idx_lineitem_order": "CREATE INDEX idx_lineitem_order ON lineitem(l_orderkey)",
+    "idx_customer_segment": "CREATE INDEX idx_customer_segment ON customer(c_mktsegment)",
+    "idx_orders_priority": "CREATE INDEX idx_orders_priority ON orders(o_orderpriority)",
+}
+
+INDEX_WORKLOAD_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "candidate_indexes": tuple(sorted(INDEX_CANDIDATES)),
+    "workload_notes": (
+        "Repeated selective customer lookups on orders",
+        "Repeated selective order lookups on lineitem",
+        "Repeated priority-filtered joins from customer to orders",
+    ),
+    "repetitions": 4,
+}
+
+
+PREAGGREGATION_CANDIDATES = {
+    "agg_quarter_segment_revenue": (
+        "CREATE TABLE agg_quarter_segment_revenue AS "
+        "SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket, "
+        "       c.c_mktsegment AS segment, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2"
+    ),
+    "agg_month_shipmode_revenue": (
+        "CREATE TABLE agg_month_shipmode_revenue AS "
+        "SELECT date_trunc('month', l.l_shipdate) AS month_bucket, "
+        "       l.l_shipmode AS shipmode, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM lineitem l "
+        "GROUP BY 1, 2"
+    ),
+    "agg_customer_year_revenue": (
+        "CREATE TABLE agg_customer_year_revenue AS "
+        "SELECT year(o.o_orderdate) AS revenue_year, "
+        "       c.c_custkey, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2"
+    ),
+    "agg_unused_priority_only": (
+        "CREATE TABLE agg_unused_priority_only AS "
+        "SELECT o.o_orderpriority, count(*) AS order_count "
+        "FROM orders o "
+        "GROUP BY 1"
+    ),
+}
+
+PREAGGREGATION_WORKLOAD_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "candidate_preaggregations": tuple(sorted(PREAGGREGATION_CANDIDATES)),
+    "workload_notes": (
+        "Quarter revenue by customer segment",
+        "Monthly revenue by ship mode",
+        "Top customers by yearly revenue",
+    ),
+    "repetitions": 4,
+}
+
+
+ORIGINAL_QUERY_SQL = '''
+WITH revenue AS (
+  SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket,
+         c.c_mktsegment AS segment,
+         sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue
+  FROM customer c
+  JOIN orders o ON o.o_custkey = c.c_custkey
+  JOIN lineitem l ON l.l_orderkey = o.o_orderkey
+  WHERE c.c_mktsegment IN ('BUILDING', 'AUTOMOBILE', 'HOUSEHOLD')
+  GROUP BY 1, 2
+),
+order_counts AS (
+  SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket,
+         c.c_mktsegment AS segment,
+         count(DISTINCT o.o_orderkey) AS order_count
+  FROM customer c
+  JOIN orders o ON o.o_custkey = c.c_custkey
+  JOIN lineitem l ON l.l_orderkey = o.o_orderkey
+  WHERE c.c_mktsegment IN ('BUILDING', 'AUTOMOBILE', 'HOUSEHOLD')
+  GROUP BY 1, 2
+)
+SELECT r.quarter_bucket, r.segment, r.revenue, o.order_count
+FROM revenue r
+JOIN order_counts o USING (quarter_bucket, segment)
+ORDER BY quarter_bucket, segment
+'''.strip()
+
+QUERY_REWRITE_MANIFEST = {
+    "schema_lineage": "TPC-H-inspired customer/orders/lineitem local workload",
+    "query_goal": "Fuse repeated scans of the same join into one grouped aggregation while preserving results and ordering.",
+    "result_order_required": True,
+    "repetitions": 4,
+}
+
+
+def build_connection() -> duckdb.DuckDBPyConnection:
+    con = duckdb.connect(database=":memory:")
+    con.execute("PRAGMA threads=1")
+    con.execute(
+        f\"\"\"
+        CREATE TABLE customer AS
+        SELECT i AS c_custkey,
+               'Customer #' || i AS c_name,
+               CASE i % 5
+                 WHEN 0 THEN 'BUILDING'
+                 WHEN 1 THEN 'AUTOMOBILE'
+                 WHEN 2 THEN 'HOUSEHOLD'
+                 WHEN 3 THEN 'FURNITURE'
+                 ELSE 'MACHINERY'
+               END AS c_mktsegment,
+               i % 25 AS c_nationkey
+        FROM range(1, {CUSTOMER_COUNT + 1}) t(i)
+        \"\"\"
+    )
+    con.execute(
+        f\"\"\"
+        CREATE TABLE orders AS
+        SELECT i AS o_orderkey,
+               1 + ((i * 17) % {CUSTOMER_COUNT}) AS o_custkey,
+               DATE '1995-01-01' + (((i * 13) % 1460) * INTERVAL 1 DAY) AS o_orderdate,
+               100 + (((i * 37) % 100000) / 10.0) AS o_totalprice,
+               CASE i % 5
+                 WHEN 0 THEN '1-URGENT'
+                 WHEN 1 THEN '2-HIGH'
+                 WHEN 2 THEN '3-MEDIUM'
+                 WHEN 3 THEN '4-NOT SPECIFIED'
+                 ELSE '5-LOW'
+               END AS o_orderpriority
+        FROM range(1, {ORDER_COUNT + 1}) t(i)
+        \"\"\"
+    )
+    con.execute(
+        f\"\"\"
+        CREATE TABLE lineitem AS
+        SELECT i AS l_lineitemkey,
+               1 + ((i * 7) % {ORDER_COUNT}) AS l_orderkey,
+               1 + ((i * 11) % 50000) AS l_partkey,
+               1 + ((i * 13) % 10000) AS l_suppkey,
+               1 + ((i * 5) % 50) AS l_quantity,
+               10 + (((i * 19) % 100000) / 20.0) AS l_extendedprice,
+               (((i * 3) % 10) / 100.0) AS l_discount,
+               DATE '1995-01-01' + (((i * 29) % 1460) * INTERVAL 1 DAY) AS l_shipdate,
+               CASE i % 5
+                 WHEN 0 THEN 'AIR'
+                 WHEN 1 THEN 'MAIL'
+                 WHEN 2 THEN 'RAIL'
+                 WHEN 3 THEN 'TRUCK'
+                 ELSE 'SHIP'
+               END AS l_shipmode
+        FROM range(1, {LINEITEM_COUNT + 1}) t(i)
+        \"\"\"
+    )
+    return con
+
+
+def normalize_name_list(value: Any, key: str) -> list[str]:
+    if isinstance(value, dict):
+        if key not in value:
+            raise ValueError(f"missing {key}")
+        value = value[key]
+    if not isinstance(value, (list, tuple)):
+        raise ValueError(f"{key} must be a list or tuple")
+    out: list[str] = []
+    seen = set()
+    # De-duplicate while preserving first-seen order.
+    for item in value:
+        name = str(item)
+        if name not in seen:
+            out.append(name)
+            seen.add(name)
+    return out
+
+
+def compare_results(lhs: list[tuple[Any, ...]], rhs: list[tuple[Any, ...]], tol: float = 1e-6) -> bool:
+    if len(lhs) != len(rhs):
+        return False
+    for left_row, right_row in zip(lhs, rhs):
+        if len(left_row) != len(right_row):
+            return False
+        for left_value, right_value in zip(left_row, right_row):
+            # Floats are compared with an absolute tolerance; non-finite values never match.
+            if isinstance(left_value, float) or isinstance(right_value, float):
+                if not math.isfinite(float(left_value)) or not math.isfinite(float(right_value)):
+                    return False
+                if abs(float(left_value) - float(right_value)) > tol:
+                    return False
+            else:
+                if left_value != right_value:
+                    return False
+    return True
+
+
+def _report_quarter_segment(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT quarter_bucket, segment, revenue "
+            "FROM agg_quarter_segment_revenue "
+            "ORDER BY quarter_bucket, segment"
+        ).fetchall()
+    return con.execute(
+        "SELECT date_trunc('quarter', o.o_orderdate) AS quarter_bucket, "
+        "       c.c_mktsegment AS segment, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2 "
+        "ORDER BY quarter_bucket, segment"
+    ).fetchall()
+
+
+def _report_month_shipmode(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT month_bucket, shipmode, revenue "
+            "FROM agg_month_shipmode_revenue "
+            "WHERE month_bucket >= DATE '1997-01-01' "
+            "ORDER BY month_bucket, shipmode"
+        ).fetchall()
+    return con.execute(
+        "SELECT date_trunc('month', l.l_shipdate) AS month_bucket, "
+        "       l.l_shipmode AS shipmode, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM lineitem l "
+        "WHERE l.l_shipdate >= DATE '1997-01-01' "
+        "GROUP BY 1, 2 "
+        "ORDER BY month_bucket, shipmode"
+    ).fetchall()
+
+
+def _report_customer_year(con: duckdb.DuckDBPyConnection, use_aggregate: bool) -> list[tuple[Any, ...]]:
+    if use_aggregate:
+        return con.execute(
+            "SELECT revenue_year, c_custkey, revenue "
+            "FROM agg_customer_year_revenue "
+            "WHERE revenue_year = 1998 "
+            "ORDER BY revenue DESC, c_custkey "
+            "LIMIT 100"
+        ).fetchall()
+    return con.execute(
+        "SELECT year(o.o_orderdate) AS revenue_year, "
+        "       c.c_custkey, "
+        "       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue "
+        "FROM customer c "
+        "JOIN orders o ON o.o_custkey = c.c_custkey "
+        "JOIN lineitem l ON l.l_orderkey = o.o_orderkey "
+        "GROUP BY 1, 2 "
+        "HAVING year(o.o_orderdate) = 1998 "
+        "ORDER BY revenue DESC, c.c_custkey "
+        "LIMIT 100"
+    ).fetchall()
+
+
+def run_index_workload(con: duckdb.DuckDBPyConnection) -> float:
+    start_time = time.perf_counter()
+    for customer_key in CUSTOMER_KEYS:
+        con.execute(
+            "SELECT sum(o_totalprice) "
+            "FROM orders "
+            "WHERE o_custkey = ? AND o_orderdate >= DATE '1997-01-01'",
+            [customer_key],
+        ).fetchone()
+    for order_key in ORDER_KEYS:
+        con.execute(
+            "SELECT sum(l_extendedprice * (1 - l_discount)) "
+            "FROM lineitem "
+            "WHERE l_orderkey = ?",
+            [order_key],
+        ).fetchone()
+    for customer_key in CUSTOMER_KEYS[:120]:
+        con.execute(
+            "SELECT count(*) "
+            "FROM customer c "
+            "JOIN orders o ON c.c_custkey = o.o_custkey "
+            "WHERE c.c_custkey = ? AND o.o_orderpriority = '1-URGENT'",
+            [customer_key],
+        ).fetchone()
+    return time.perf_counter() - start_time
+
+
+def measure_index_design(selected_indexes: list[str]) -> dict[str, float | int]:
+    unknown = [name for name in selected_indexes if name not in INDEX_CANDIDATES]
+    if unknown:
+        raise ValueError(f"unknown index names: {unknown}")
+    con = build_connection()
+    start_setup = time.perf_counter()
+    for name in selected_indexes:
+        con.execute(INDEX_CANDIDATES[name])
+    setup_runtime = time.perf_counter() - start_setup
+    # Untimed warm-up pass; only the repetitions below are measured.
+    run_index_workload(con)
+    workload_runtime = 0.0
+    for _ in range(int(INDEX_WORKLOAD_MANIFEST["repetitions"])):
+        workload_runtime += run_index_workload(con)
+    return {
+        "setup_runtime_s": float(setup_runtime),
+        "workload_runtime_s": float(workload_runtime),
+        "total_runtime_s": float(setup_runtime + workload_runtime),
+        "selected_index_count": len(selected_indexes),
+    }
+
+
+def measure_query_rewrite(sql: str) -> dict[str, Any]:
+    sql = str(sql).strip()
+    if not sql:
+        raise ValueError("query must not be empty")
+    baseline_con = build_connection()
+    candidate_con = build_connection()
+    baseline_rows = baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()
+    candidate_rows = candidate_con.execute(sql).fetchall()
+    if not compare_results(candidate_rows, baseline_rows):
+        raise ValueError("candidate query result does not match the baseline result")
+
+    # Extra untimed warm-up run before the timed baseline repetitions.
+    baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()
+    baseline_start = time.perf_counter()
+    for _ in range(int(QUERY_REWRITE_MANIFEST["repetitions"])):
+        baseline_con.execute(ORIGINAL_QUERY_SQL).fetchall()
+    baseline_runtime = time.perf_counter() - baseline_start
+
+    # Extra untimed warm-up run before the timed candidate repetitions.
+    candidate_con.execute(sql).fetchall()
+    candidate_start = time.perf_counter()
+    for _ in range(int(QUERY_REWRITE_MANIFEST["repetitions"])):
+        candidate_rows = candidate_con.execute(sql).fetchall()
+    candidate_runtime = time.perf_counter() - candidate_start
+
+    return {
+        "baseline_runtime_s": float(baseline_runtime),
+        "candidate_runtime_s": float(candidate_runtime),
+        "row_count": len(candidate_rows),
+    }
+
+
+def _run_preaggregation_reports(con: duckdb.DuckDBPyConnection, selected: set[str]) -> tuple[float, tuple[list[tuple[Any, ...]], ...]]:
+    start_time = time.perf_counter()
+    result_a = _report_quarter_segment(con, "agg_quarter_segment_revenue" in selected)
+    result_b = _report_month_shipmode(con, "agg_month_shipmode_revenue" in selected)
+    result_c = _report_customer_year(con, "agg_customer_year_revenue" in selected)
+    runtime = time.perf_counter() - start_time
+    return runtime, (result_a, result_b, result_c)
+
+
+def measure_preaggregation_design(selected_preaggregations: list[str]) -> dict[str, float | int]:
+    unknown = [name for name in selected_preaggregations if name not in PREAGGREGATION_CANDIDATES]
+    if unknown:
+        raise ValueError(f"unknown pre-aggregation names: {unknown}")
+    # With nothing selected the candidate is the baseline, so one measurement serves as both.
+    if not selected_preaggregations:
+        con = build_connection()
+        _run_preaggregation_reports(con, set())
+        repeated_runtime = 0.0
+        for _ in range(int(PREAGGREGATION_WORKLOAD_MANIFEST["repetitions"])):
+            extra_runtime, _ = _run_preaggregation_reports(con, set())
+            repeated_runtime += extra_runtime
+        return {
+            "setup_runtime_s": 0.0,
+            "candidate_workload_runtime_s": float(repeated_runtime),
+            "candidate_total_runtime_s": float(repeated_runtime),
+            "baseline_total_runtime_s": float(repeated_runtime),
+            "selected_preaggregation_count": 0,
+        }
+    baseline_con = build_connection()
+    candidate_con = build_connection()
+    start_setup = time.perf_counter()
+    for name in selected_preaggregations:
+        candidate_con.execute(PREAGGREGATION_CANDIDATES[name])
+    setup_runtime = time.perf_counter() - start_setup
+
+    _, baseline_results = _run_preaggregation_reports(baseline_con, set())
+    _, candidate_results = _run_preaggregation_reports(candidate_con, set(selected_preaggregations))
+    if any(not compare_results(left, right) for left, right in zip(candidate_results, baseline_results)):
+        raise ValueError("candidate pre-aggregation selection changed the query results")
+
+    # Untimed warm-up passes; only the repetitions below are measured.
+    _run_preaggregation_reports(baseline_con, set())
+    _run_preaggregation_reports(candidate_con, set(selected_preaggregations))
+
+    repeated_baseline_runtime = 0.0
+    for _ in range(int(PREAGGREGATION_WORKLOAD_MANIFEST["repetitions"])):
+        extra_runtime, _ = _run_preaggregation_reports(baseline_con, set())
+        repeated_baseline_runtime += extra_runtime
+
+    repeated_candidate_runtime = 0.0
+    for _ in range(int(PREAGGREGATION_WORKLOAD_MANIFEST["repetitions"])):
+        extra_runtime, _ = _run_preaggregation_reports(candidate_con, set(selected_preaggregations))
+        repeated_candidate_runtime += extra_runtime
+
+    candidate_total_runtime = setup_runtime + repeated_candidate_runtime
+    baseline_total_runtime = repeated_baseline_runtime
+    return {
+        "setup_runtime_s": float(setup_runtime),
+        "candidate_workload_runtime_s": float(repeated_candidate_runtime),
+        "candidate_total_runtime_s": float(candidate_total_runtime),
+        "baseline_total_runtime_s": float(baseline_total_runtime),
+        "selected_preaggregation_count": len(selected_preaggregations),
+    }
+"""
+
+
+README_TEMPLATE = """\
+# __TITLE__
+
+__SHORT__
+
+## Provenance
+
+- Provenance class: `traceable local workload with DuckDB/TPC-H schema lineage`
+- Engine lineage: `DuckDB`
+- Data asset: benchmark-local deterministic SQL-generated tables
+- Full provenance note: see `references/source_manifest.md`
+
+## File Layout
+
+- `Task.md`: task contract and scoring rules.
+- `Task_zh-CN.md`: Chinese translation.
+- `README_zh-CN.md`: Chinese overview.
+- `scripts/init.py`: initial candidate file exposed to agents.
+- `baseline/solution.py`: reference implementation.
+- `runtime/problem.py`: task-local interface to the frozen workload.
+- `verification/evaluator.py`: evaluator entry.
+- `references/source_manifest.md`: provenance and authenticity notes.
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/ComputerSystems/__SLUG__/verification/evaluator.py \\
+  benchmarks/ComputerSystems/__SLUG__/scripts/init.py \\
+  --metrics-out /tmp/__SLUG___metrics.json
+```
+"""
+
+
+README_ZH_TEMPLATE = """\
+# __TITLE__
+
+__SHORT__
+
+## 说明
+
+- 数据来源类型：`traceable local workload with DuckDB/TPC-H schema lineage`
+- 执行引擎：`DuckDB`
+- 数据资产：本 benchmark 内部固定的 deterministic SQL 生成表
+- 完整来源说明见 `references/source_manifest.md`
+"""
+
+
+TASK_INDEX = """\
+# __TITLE__ Task
+
+## Objective
+
+__SHORT__
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def select_indexes(workload_manifest):
+    ...
+```
+
+Return a list of candidate index names from the whitelist in `workload_manifest["candidate_indexes"]`.
+A dict with key `indexes` is also accepted.
+
+## Evaluation
+
+The evaluator will:
+
+1. Build the frozen DuckDB workload.
+2. Create the selected indexes.
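+3. Run the fixed lookup workload once as an untimed warm-up, then four more times under the timer.
+4. Record the candidate total runtime (index creation time plus the four timed workload runs) and log the no-index baseline for context.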
+
+## Metrics
+
+- `combined_score`: `-candidate_total_runtime_s`
+- `valid`: `1.0` only if every selected index name is valid and execution succeeds
+- `candidate_total_runtime_s`
+- `baseline_total_runtime_s`
+- `candidate_setup_runtime_s`
+- `candidate_workload_runtime_s`
+"""
+
+
+TASK_REWRITE = """\
+# __TITLE__ Task
+
+## Objective
+
+__SHORT__
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def rewrite_query(sql, workload_manifest):
+    ...
+```
+
+Return a rewritten SQL string. A dict with key `sql` is also accepted.
+
+## Evaluation
+
+The evaluator will:
+
+1. Build the frozen DuckDB workload.
+2. Execute the original SQL to get the reference result.
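+3. Execute your rewritten SQL and verify that it returns the same rows in the same order as the reference; float values are compared with an absolute tolerance of `1e-6`.
+4. After one untimed warm-up run, time the candidate query over four repeated runs and log the runtime of the unmodified baseline query for context.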
+
+## Metrics
+
+- `combined_score`: `-candidate_runtime_s`
+- `valid`: `1.0` only if the rewritten query preserves results
+- `candidate_runtime_s`
+- `baseline_runtime_s`
+- `row_count`
+"""
+
+
+TASK_PREAGG = """\
+# __TITLE__ Task
+
+## Objective
+
+__SHORT__
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def select_preaggregations(workload_manifest):
+    ...
+```
+
+Return a list of candidate pre-aggregation names from the whitelist in `workload_manifest["candidate_preaggregations"]`.
+A dict with key `preaggregations` is also accepted.
+
+## Evaluation
+
+The evaluator will:
+
+1. Build the frozen DuckDB workload.
+2. Create the selected pre-aggregation tables.
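+3. Run the fixed reporting workload and verify that each report still matches the result computed without pre-aggregations (floats within an absolute tolerance of `1e-6`).
+4. Measure candidate total runtime as setup cost plus four timed repetitions of the reports, and log the baseline for context.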
+
+## Metrics
+
+- `combined_score`: `-candidate_total_runtime_s`
+- `valid`: `1.0` only if all selected names are valid and results stay unchanged
+- `candidate_total_runtime_s`
+- `baseline_total_runtime_s`
+- `candidate_setup_runtime_s`
+- `candidate_workload_runtime_s`
+"""
+
+
+TASK_INDEX_ZH = """\
+# __TITLE__ 任务
+
+## 目标
+
+__SHORT__
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def select_indexes(workload_manifest):
+    ...
+```
+
+返回值必须是 whitelist 中的索引名列表。也接受包含 `indexes` 字段的字典。
+
+## 评测方式
+
+评测器会：
+
+1. 构建固定的 DuckDB workload。
+2. 创建所选索引。
+3. 先不计时地预热执行一次查询 workload，再计时重复执行四次。
+4. 记录候选总运行时间（索引创建时间加四次计时执行的时间），并把无索引 baseline 作为诊断信息一并输出。
+
+## 指标
+
+- `combined_score`：`-candidate_total_runtime_s`
+- `valid`：只有索引名合法且执行成功时才为 `1.0`
+- `candidate_total_runtime_s`
+- `baseline_total_runtime_s`
+- `candidate_setup_runtime_s`
+- `candidate_workload_runtime_s`
+"""
+
+
+TASK_REWRITE_ZH = """\
+# __TITLE__ 任务
+
+## 目标
+
+__SHORT__
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def rewrite_query(sql, workload_manifest):
+    ...
+```
+
+返回值必须是重写后的 SQL 字符串。也接受包含 `sql` 字段的字典。
+
+## 评测方式
+
+评测器会：
+
+1. 构建固定的 DuckDB workload。
+2. 执行原始 SQL，得到参考结果。
+3. 执行候选重写 SQL，并检查其返回的行及行顺序与参考结果一致；浮点值按 `1e-6` 的绝对误差容限比较。
+4. 先不计时地预热执行一次，再对候选查询计时执行四次，并把未修改的 baseline 查询的运行时间作为诊断信息输出。
+
+## 指标
+
+- `combined_score`：`-candidate_runtime_s`
+- `valid`：只有重写 SQL 保持结果不变时才为 `1.0`
+- `candidate_runtime_s`
+- `baseline_runtime_s`
+- `row_count`
+"""
+
+
+TASK_PREAGG_ZH = """\
+# __TITLE__ 任务
+
+## 目标
+
+__SHORT__
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def select_preaggregations(workload_manifest):
+    ...
+```
+
+返回值必须是 whitelist 中的预聚合表名列表。也接受包含 `preaggregations` 字段的字典。
+
+## 评测方式
+
+评测器会：
+
+1. 构建固定的 DuckDB workload。
+2. 创建所选预聚合表。
+3. 运行固定 reporting workload，并检查每份报表是否与不使用预聚合表时的结果一致（浮点值按 `1e-6` 的绝对误差容限比较）。
+4. 记录候选总运行时间（预聚合表创建时间加四次计时执行报表的时间），并把 baseline 总运行时间作为诊断信息输出。
+
+## 指标
+
+- `combined_score`：`-candidate_total_runtime_s`
+- `valid`：只有名称合法且结果保持不变时才为 `1.0`
+- `candidate_total_runtime_s`
+- `baseline_total_runtime_s`
+- `candidate_setup_runtime_s`
+- `candidate_workload_runtime_s`
+"""
+
+
+INIT_INDEX = """\
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.ComputerSystems.__SLUG__.baseline.solution import select_indexes as _baseline_select_indexes
+    from benchmarks.ComputerSystems.__SLUG__.runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+except ModuleNotFoundError:
+    from baseline.solution import select_indexes as _baseline_select_indexes
+    from runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+
+
+# EVOLVE-BLOCK-START
+def select_indexes(workload_manifest):
+    return _baseline_select_indexes(workload_manifest)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    print(evaluate_selection(select_indexes(WORKLOAD_MANIFEST)))
+"""
+
+
+INIT_REWRITE = """\
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.ComputerSystems.__SLUG__.baseline.solution import rewrite_query as _baseline_rewrite_query
+    from benchmarks.ComputerSystems.__SLUG__.runtime.problem import ORIGINAL_QUERY_SQL, WORKLOAD_MANIFEST, evaluate_query
+except ModuleNotFoundError:
+    from baseline.solution import rewrite_query as _baseline_rewrite_query
+    from runtime.problem import ORIGINAL_QUERY_SQL, WORKLOAD_MANIFEST, evaluate_query
+
+
+# EVOLVE-BLOCK-START
+def rewrite_query(sql, workload_manifest):
+    return _baseline_rewrite_query(sql, workload_manifest)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    print(evaluate_query(rewrite_query(ORIGINAL_QUERY_SQL, WORKLOAD_MANIFEST)))
+"""
+
+
+INIT_PREAGG = """\
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.ComputerSystems.__SLUG__.baseline.solution import select_preaggregations as _baseline_select_preaggregations
+    from benchmarks.ComputerSystems.__SLUG__.runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+except ModuleNotFoundError:
+    from baseline.solution import select_preaggregations as _baseline_select_preaggregations
+    from runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+
+
+# EVOLVE-BLOCK-START
+def select_preaggregations(workload_manifest):
+    return _baseline_select_preaggregations(workload_manifest)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    print(evaluate_selection(select_preaggregations(WORKLOAD_MANIFEST)))
+"""
+
+
+BASELINE_INDEX = """\
+from __future__ import annotations
+
+
+def select_indexes(workload_manifest):
+    return []
+"""
+
+
+BASELINE_REWRITE = """\
+from __future__ import annotations
+
+
+def rewrite_query(sql, workload_manifest):
+    return sql
+"""
+
+
+BASELINE_PREAGG = """\
+from __future__ import annotations
+
+
+def select_preaggregations(workload_manifest):
+    return []
+"""
+
+
+RUNTIME_INDEX = """\
+from __future__ import annotations
+
+from .duckdb_local_workload import INDEX_WORKLOAD_MANIFEST, measure_index_design, normalize_name_list
+
+
+WORKLOAD_MANIFEST = dict(INDEX_WORKLOAD_MANIFEST)
+
+
+def load_instance():
+    return dict(WORKLOAD_MANIFEST)
+
+
+def evaluate_selection(selection):
+    return measure_index_design(normalize_name_list(selection, "indexes"))
+"""
+
+
+RUNTIME_REWRITE = """\
+from __future__ import annotations
+
+from .duckdb_local_workload import ORIGINAL_QUERY_SQL, QUERY_REWRITE_MANIFEST, measure_query_rewrite
+
+
+WORKLOAD_MANIFEST = dict(QUERY_REWRITE_MANIFEST)
+
+
+def load_instance():
+    return {"sql": ORIGINAL_QUERY_SQL, "manifest": dict(WORKLOAD_MANIFEST)}
+
+
+def evaluate_query(value):
+    if isinstance(value, dict):
+        if "sql" not in value:
+            raise ValueError("missing sql")
+        value = value["sql"]
+    return measure_query_rewrite(str(value))
+"""
+
+
+RUNTIME_PREAGG = """\
+from __future__ import annotations
+
+from .duckdb_local_workload import PREAGGREGATION_WORKLOAD_MANIFEST, measure_preaggregation_design, normalize_name_list
+
+
+WORKLOAD_MANIFEST = dict(PREAGGREGATION_WORKLOAD_MANIFEST)
+
+
+def load_instance():
+    return dict(WORKLOAD_MANIFEST)
+
+
+def evaluate_selection(selection):
+    return measure_preaggregation_design(normalize_name_list(selection, "preaggregations"))
+"""
+
+
+EVALUATOR_INDEX = """\
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.ComputerSystems.__SLUG__.baseline.solution import select_indexes as baseline_select_indexes
+    from benchmarks.ComputerSystems.__SLUG__.runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+except ModuleNotFoundError:
+    from baseline.solution import select_indexes as baseline_select_indexes
+    from runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+
+
+def evaluate(program_path: str):
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_total_runtime_s": 0.0,
+        "baseline_total_runtime_s": 0.0,
+        "candidate_setup_runtime_s": 0.0,
+        "candidate_workload_runtime_s": 0.0,
+    }
+    artifacts = {}
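+    try:
+        namespace = runpy.run_path(str(Path(program_path).expanduser().resolve()), run_name="candidate_program")
+    except Exception:
+        # A candidate that fails to load is reported as invalid instead of crashing the evaluator.
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts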
+    select_indexes = namespace.get("select_indexes")
+    if not callable(select_indexes):
+        artifacts["error_message"] = "candidate must define select_indexes(workload_manifest)"
+        return metrics, artifacts
+    try:
+        baseline = evaluate_selection(baseline_select_indexes(WORKLOAD_MANIFEST))
+        candidate = evaluate_selection(select_indexes(WORKLOAD_MANIFEST))
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+    candidate_total = float(candidate["total_runtime_s"])
+    baseline_total = float(baseline["total_runtime_s"])
+    if not math.isfinite(candidate_total) or candidate_total <= 0:
+        artifacts["error_message"] = "candidate runtime is invalid"
+        return metrics, artifacts
+    metrics["valid"] = 1.0
+    metrics["candidate_total_runtime_s"] = candidate_total
+    metrics["baseline_total_runtime_s"] = baseline_total
+    metrics["candidate_setup_runtime_s"] = float(candidate["setup_runtime_s"])
+    metrics["candidate_workload_runtime_s"] = float(candidate["workload_runtime_s"])
+    metrics["combined_score"] = -candidate_total
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
+"""
+
+
+EVALUATOR_REWRITE = """\
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.ComputerSystems.__SLUG__.baseline.solution import rewrite_query as baseline_rewrite_query
+    from benchmarks.ComputerSystems.__SLUG__.runtime.problem import ORIGINAL_QUERY_SQL, WORKLOAD_MANIFEST, evaluate_query
+except ModuleNotFoundError:
+    from baseline.solution import rewrite_query as baseline_rewrite_query
+    from runtime.problem import ORIGINAL_QUERY_SQL, WORKLOAD_MANIFEST, evaluate_query
+
+
+def evaluate(program_path: str):
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_runtime_s": 0.0,
+        "baseline_runtime_s": 0.0,
+        "row_count": 0.0,
+    }
+    artifacts = {}
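+    try:
+        namespace = runpy.run_path(str(Path(program_path).expanduser().resolve()), run_name="candidate_program")
+    except Exception:
+        # A candidate that fails to load is reported as invalid instead of crashing the evaluator.
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts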
+    rewrite_query = namespace.get("rewrite_query")
+    if not callable(rewrite_query):
+        artifacts["error_message"] = "candidate must define rewrite_query(sql, workload_manifest)"
+        return metrics, artifacts
+    try:
+        baseline = evaluate_query(baseline_rewrite_query(ORIGINAL_QUERY_SQL, WORKLOAD_MANIFEST))
+        candidate = evaluate_query(rewrite_query(ORIGINAL_QUERY_SQL, WORKLOAD_MANIFEST))
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+    candidate_runtime = float(candidate["candidate_runtime_s"])
+    baseline_runtime = float(baseline["candidate_runtime_s"])
+    if not math.isfinite(candidate_runtime) or candidate_runtime <= 0:
+        artifacts["error_message"] = "candidate runtime is invalid"
+        return metrics, artifacts
+    metrics["valid"] = 1.0
+    metrics["candidate_runtime_s"] = candidate_runtime
+    metrics["baseline_runtime_s"] = baseline_runtime
+    metrics["row_count"] = float(candidate["row_count"])
+    metrics["combined_score"] = -candidate_runtime
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
+"""
+
+
+EVALUATOR_PREAGG = """\
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.ComputerSystems.__SLUG__.baseline.solution import select_preaggregations as baseline_select_preaggregations
+    from benchmarks.ComputerSystems.__SLUG__.runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+except ModuleNotFoundError:
+    from baseline.solution import select_preaggregations as baseline_select_preaggregations
+    from runtime.problem import WORKLOAD_MANIFEST, evaluate_selection
+
+
+def evaluate(program_path: str):
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_total_runtime_s": 0.0,
+        "baseline_total_runtime_s": 0.0,
+        "candidate_setup_runtime_s": 0.0,
+        "candidate_workload_runtime_s": 0.0,
+    }
+    artifacts = {}
+    namespace = runpy.run_path(str(Path(program_path).expanduser().resolve()), run_name="candidate_program")
+    select_preaggregations = namespace.get("select_preaggregations")
+    if not callable(select_preaggregations):
+        artifacts["error_message"] = "candidate must define select_preaggregations(workload_manifest)"
+        return metrics, artifacts
+    try:
+        baseline = evaluate_selection(baseline_select_preaggregations(WORKLOAD_MANIFEST))
+        candidate = evaluate_selection(select_preaggregations(WORKLOAD_MANIFEST))
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+    candidate_total = float(candidate["candidate_total_runtime_s"])
+    baseline_total = float(baseline["candidate_total_runtime_s"])
+    if not math.isfinite(candidate_total) or candidate_total <= 0:
+        artifacts["error_message"] = "candidate runtime is invalid"
+        return metrics, artifacts
+    metrics["valid"] = 1.0
+    metrics["candidate_total_runtime_s"] = candidate_total
+    metrics["baseline_total_runtime_s"] = baseline_total
+    metrics["candidate_setup_runtime_s"] = float(candidate["setup_runtime_s"])
+    metrics["candidate_workload_runtime_s"] = float(candidate["candidate_workload_runtime_s"])
+    metrics["combined_score"] = -candidate_total
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
+"""
+
+
+def render(template: str, **values: str) -> str:
+    out = template
+    for key, value in values.items():
+        out = out.replace(f"__{key}__", value)
+    return out
+
+
+def write(path: Path, content: str) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(textwrap.dedent(content).rstrip() + "\n", encoding="utf-8")
+
+
+def frontier_eval_files() -> dict[str, str]:
+    return {
+        "frontier_eval/agent_files.txt": "Task.md\nTask_zh-CN.md\nREADME.md\nbaseline/solution.py\nruntime/problem.py\n",
+        "frontier_eval/candidate_destination.txt": "scripts/init.py\n",
+        "frontier_eval/constraints.txt": (
+            "Edit only `scripts/init.py`.\n"
+            "Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.\n"
+            "Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.\n"
+            "Keep outputs valid and finite.\n"
+        ),
+        "frontier_eval/eval_command.txt": "{python} verification/evaluator.py {candidate} --metrics-out metrics.json\n",
+        "frontier_eval/eval_cwd.txt": ".\n",
+        "frontier_eval/initial_program.txt": "scripts/init.py\n",
+        "frontier_eval/readonly_files.txt": (
+            "baseline/solution.py\n"
+            "runtime/problem.py\n"
+            "runtime/duckdb_local_workload.py\n"
+            "verification/evaluator.py\n"
+            "references/source_manifest.md\n"
+        ),
+    }
+
+
+def main() -> None:
+    repo_root = Path(__file__).resolve().parents[1]
+    domain_root = repo_root / "benchmarks" / "ComputerSystems"
+
+    for task in TASKS:
+        root = domain_root / task["slug"]
+        values = {
+            "TITLE": task["title"],
+            "SHORT": task["short"],
+            "SLUG": task["slug"],
+        }
+        write(root / "README.md", render(README_TEMPLATE, **values))
+        write(root / "README_zh-CN.md", render(README_ZH_TEMPLATE, **values))
+        write(root / "references" / "source_manifest.md", task["source_manifest"])
+        write(root / "verification" / "requirements.txt", "duckdb\n")
+
+        if task["kind"] == "index":
+            write(root / "Task.md", render(TASK_INDEX, **values))
+            write(root / "Task_zh-CN.md", render(TASK_INDEX_ZH, **values))
+            write(root / "scripts" / "init.py", render(INIT_INDEX, **values))
+            write(root / "baseline" / "solution.py", BASELINE_INDEX)
+            write(root / "runtime" / "problem.py", RUNTIME_INDEX)
+            write(root / "runtime" / "duckdb_local_workload.py", HELPER_TEXT)
+            write(root / "verification" / "evaluator.py", render(EVALUATOR_INDEX, **values))
+        elif task["kind"] == "rewrite":
+            write(root / "Task.md", render(TASK_REWRITE, **values))
+            write(root / "Task_zh-CN.md", render(TASK_REWRITE_ZH, **values))
+            write(root / "scripts" / "init.py", render(INIT_REWRITE, **values))
+            write(root / "baseline" / "solution.py", BASELINE_REWRITE)
+            write(root / "runtime" / "problem.py", RUNTIME_REWRITE)
+            write(root / "runtime" / "duckdb_local_workload.py", HELPER_TEXT)
+            write(root / "verification" / "evaluator.py", render(EVALUATOR_REWRITE, **values))
+        else:
+            write(root / "Task.md", render(TASK_PREAGG, **values))
+            write(root / "Task_zh-CN.md", render(TASK_PREAGG_ZH, **values))
+            write(root / "scripts" / "init.py", render(INIT_PREAGG, **values))
+            write(root / "baseline" / "solution.py", BASELINE_PREAGG)
+            write(root / "runtime" / "problem.py", RUNTIME_PREAGG)
+            write(root / "runtime" / "duckdb_local_workload.py", HELPER_TEXT)
+            write(root / "verification" / "evaluator.py", render(EVALUATOR_PREAGG, **values))
+
+        for relative, content in frontier_eval_files().items():
+            write(root / relative, content)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/bootstrap_first_inventory_benchmarks.py b/scripts/bootstrap_first_inventory_benchmarks.py
new file mode 100644
index 00000000..14ccdc7a
--- /dev/null
+++ b/scripts/bootstrap_first_inventory_benchmarks.py
@@ -0,0 +1,835 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import json
+import textwrap
+from pathlib import Path
+
+
+TASKS = [
+    {
+        "slug": "EOQWithMinimumOrderQuantity",
+        "title": "EOQ with Minimum Order Quantity",
+        "short": "Optimize annual cost for deterministic EOQ instances with a hard minimum order quantity.",
+        "kind": "eoq_moq",
+        "domain": "OperationsResearch",
+        "cases": [
+            {"fixed_cost": 8.0, "holding_cost_rate": 0.225, "demand_rate": 1300.0, "minimum_order_quantity": 80.0},
+            {"fixed_cost": 14.0, "holding_cost_rate": 0.18, "demand_rate": 1800.0, "minimum_order_quantity": 140.0},
+            {"fixed_cost": 11.0, "holding_cost_rate": 0.25, "demand_rate": 950.0, "minimum_order_quantity": 100.0},
+            {"fixed_cost": 6.0, "holding_cost_rate": 0.16, "demand_rate": 2200.0, "minimum_order_quantity": 120.0},
+        ],
+    },
+    {
+        "slug": "EOQWithAllUnitsDiscounts",
+        "title": "EOQ with All-Units Discounts",
+        "short": "Choose an order quantity under piecewise all-units discount pricing.",
+        "kind": "eoq_all_units",
+        "domain": "OperationsResearch",
+        "cases": [
+            {"fixed_cost": 8.0, "holding_cost_rate": 0.225, "demand_rate": 1300.0, "breakpoints": [0.0, 350.0, 700.0], "unit_costs": [0.50, 0.47, 0.44]},
+            {"fixed_cost": 10.0, "holding_cost_rate": 0.18, "demand_rate": 2200.0, "breakpoints": [0.0, 300.0, 900.0], "unit_costs": [0.82, 0.79, 0.73]},
+            {"fixed_cost": 12.0, "holding_cost_rate": 0.20, "demand_rate": 1700.0, "breakpoints": [0.0, 500.0, 1000.0], "unit_costs": [1.10, 1.03, 0.98]},
+            {"fixed_cost": 6.0, "holding_cost_rate": 0.16, "demand_rate": 2400.0, "breakpoints": [0.0, 250.0, 600.0], "unit_costs": [0.42, 0.39, 0.36]},
+        ],
+    },
+    {
+        "slug": "EOQWithIncrementalDiscounts",
+        "title": "EOQ with Incremental Discounts",
+        "short": "Choose an order quantity under incremental quantity discounts.",
+        "kind": "eoq_incremental",
+        "domain": "OperationsResearch",
+        "cases": [
+            {"fixed_cost": 150.0, "holding_cost_rate": 0.25, "demand_rate": 2400.0, "breakpoints": [0.0, 300.0, 600.0], "unit_costs": [100.0, 90.0, 80.0]},
+            {"fixed_cost": 60.0, "holding_cost_rate": 0.18, "demand_rate": 3000.0, "breakpoints": [0.0, 200.0, 400.0], "unit_costs": [15.0, 14.0, 12.5]},
+            {"fixed_cost": 90.0, "holding_cost_rate": 0.22, "demand_rate": 1600.0, "breakpoints": [0.0, 250.0, 550.0], "unit_costs": [24.0, 22.5, 21.0]},
+            {"fixed_cost": 45.0, "holding_cost_rate": 0.15, "demand_rate": 4200.0, "breakpoints": [0.0, 500.0, 1200.0], "unit_costs": [9.0, 8.7, 8.2]},
+        ],
+    },
+    {
+        "slug": "PoissonRQServiceLevel",
+        "title": "Poisson (r,Q) with Service-Level Constraint",
+        "short": "Select reorder point and lot size for Poisson-demand (r,Q) instances with a hard cycle-service-level target.",
+        "kind": "rq_poisson",
+        "domain": "OperationsResearch",
+        "cases": [
+            {"holding_cost": 0.18, "stockout_cost": 0.70, "fixed_cost": 4.0, "demand_mean": 1300.0, "lead_time": 0.05, "target_csl": 0.95},
+            {"holding_cost": 0.25, "stockout_cost": 0.95, "fixed_cost": 6.0, "demand_mean": 900.0, "lead_time": 0.10, "target_csl": 0.95},
+            {"holding_cost": 0.14, "stockout_cost": 0.80, "fixed_cost": 5.0, "demand_mean": 1500.0, "lead_time": 0.04, "target_csl": 0.97},
+            {"holding_cost": 0.22, "stockout_cost": 1.10, "fixed_cost": 7.0, "demand_mean": 700.0, "lead_time": 0.12, "target_csl": 0.95},
+        ],
+    },
+    {
+        "slug": "NormalRQServiceLevel95",
+        "title": "Normal (r,Q) with 95% Service-Level Constraint",
+        "short": "Select reorder point and lot size for Normal-demand (r,Q) instances with a hard cycle-service-level target.",
+        "kind": "rq_normal",
+        "domain": "OperationsResearch",
+        "cases": [
+            {"holding_cost": 0.18, "stockout_cost": 0.70, "fixed_cost": 4.0, "demand_mean": 1300.0, "demand_sd": 120.0, "lead_time": 0.05, "target_csl": 0.95},
+            {"holding_cost": 0.20, "stockout_cost": 0.85, "fixed_cost": 5.5, "demand_mean": 950.0, "demand_sd": 90.0, "lead_time": 0.08, "target_csl": 0.95},
+            {"holding_cost": 0.16, "stockout_cost": 0.92, "fixed_cost": 6.0, "demand_mean": 1500.0, "demand_sd": 170.0, "lead_time": 0.04, "target_csl": 0.97},
+            {"holding_cost": 0.24, "stockout_cost": 1.25, "fixed_cost": 7.0, "demand_mean": 720.0, "demand_sd": 75.0, "lead_time": 0.12, "target_csl": 0.95},
+        ],
+    },
+]
+
+
+GENERIC_README = """\
+# {title}
+
+{short}
+
+## Provenance
+
+- {provenance_summary}
+- Data asset: benchmark-local frozen parameter tables defined in `runtime/problem.py`.
+- Full provenance note: see `references/source_manifest.md`.
+
+## File Layout
+
+- `Task.md`: task contract and scoring rules.
+- `Task_zh-CN.md`: Chinese translation of the contract.
+- `scripts/init.py`: initial candidate file exposed to agents.
+- `baseline/solution.py`: reference implementation.
+- `runtime/problem.py`: frozen cases, baseline solver, and scoring helpers.
+- `verification/evaluator.py`: local evaluator entry point.
+- `verification/requirements.txt`: minimal dependencies for this benchmark.
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/OperationsResearch/{slug}/verification/evaluator.py \
+  benchmarks/OperationsResearch/{slug}/scripts/init.py \
+  --metrics-out /tmp/{slug}_metrics.json
+```
+
+Run through `frontier_eval` with:
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/{slug} \
+  algorithm.iterations=0
+```
+"""
+
+
+GENERIC_TASK = """\
+# {title} Task
+
+## Objective
+
+{short}
+
+{task_source_en}
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def solve(instance):
+    ...
+```
+
+The return value must be:
+
+- For EOQ tasks: a dict with `order_quantity`, or a raw numeric quantity.
+- For `(r,Q)` tasks: a dict with `reorder_point` and `order_quantity`, or a 2-tuple `(r, Q)`.
+
+## Evaluation
+
+The evaluator will:
+
+1. Load the frozen case set from `runtime/problem.py`.
+2. Run the reference baseline for each case.
+3. Run your `solve(instance)` implementation for each case.
+4. Convert the returned quantity or `(r, Q)` pair into a cost and feasibility result.
+5. Compute the average candidate cost and report its negation (`-avg_cost`) as the optimization score.
+
+## Metrics
+
+- `combined_score`: `-avg_cost`
+- `valid`: `1.0` only if every case is feasible and every output is finite
+- `avg_cost`: average candidate cost
+- `avg_cost_ratio`: average `baseline_cost / candidate_cost` for diagnostics only
+
+## Failure Cases
+
+The submission is marked invalid and receives a very low score if:
+
+- `solve()` is missing
+- the returned output cannot be parsed
+- any case violates feasibility constraints
+- any metric becomes non-finite
+"""
+
+
+GENERIC_TASK_ZH = """\
+# {title} 任务
+
+## 目标
+
+{short}
+
+{task_source_zh}
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def solve(instance):
+    ...
+```
+
+返回值要求：
+
+- EOQ 类任务：返回包含 `order_quantity` 字段的字典，或者直接返回数值型订货批量。
+- `(r,Q)` 类任务：返回包含 `reorder_point` 和 `order_quantity` 的字典，或者直接返回二元组 `(r, Q)`。
+
+## 评测方式
+
+评测器会：
+
+1. 读取 `runtime/problem.py` 中的固定样例。
+2. 运行 baseline。
+3. 运行选手的 `solve(instance)`。
+4. 计算成本和可行性。
+5. 计算平均候选成本，并将其相反数（`-avg_cost`）作为优化分数。
+
+## 指标
+
+- `combined_score`：`-avg_cost`
+- `valid`：所有 case 都可行且数值有限时为 `1.0`
+- `avg_cost`：平均候选成本
+- `avg_cost_ratio`：仅用于诊断的平均 `baseline_cost / candidate_cost`
+"""
+
+
+GENERIC_INIT = """\
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            parent_s = str(parent)
+            if parent_s not in sys.path:
+                sys.path.insert(0, parent_s)
+            return
+    benchmark_root = here.parents[1]
+    benchmark_root_s = str(benchmark_root)
+    if benchmark_root_s not in sys.path:
+        sys.path.insert(0, benchmark_root_s)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.{slug}.baseline.solution import solve as _baseline_solve
+except ModuleNotFoundError:
+    from baseline.solution import solve as _baseline_solve
+
+
+# EVOLVE-BLOCK-START
+def solve(instance):
+    return _baseline_solve(instance)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.OperationsResearch.{slug}.runtime.problem import SAMPLE_INSTANCE
+    except ModuleNotFoundError:
+        from runtime.problem import SAMPLE_INSTANCE
+    print(solve(SAMPLE_INSTANCE))
+"""
+
+
+GENERIC_BASELINE = """\
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            parent_s = str(parent)
+            if parent_s not in sys.path:
+                sys.path.insert(0, parent_s)
+            return
+    benchmark_root = here.parents[1]
+    benchmark_root_s = str(benchmark_root)
+    if benchmark_root_s not in sys.path:
+        sys.path.insert(0, benchmark_root_s)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.{slug}.runtime.problem import solve_baseline as solve
+except ModuleNotFoundError:
+    from runtime.problem import solve_baseline as solve
+"""
+
+
+GENERIC_EVALUATOR = """\
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    repo_root = _repo_root()
+    benchmark_root = _benchmark_root()
+    for p in (repo_root, benchmark_root):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.{slug}.runtime.problem import CASES, evaluate_solution
+    from benchmarks.OperationsResearch.{slug}.baseline.solution import solve as baseline_solve
+except ModuleNotFoundError:
+    from runtime.problem import CASES, evaluate_solution
+    from baseline.solution import solve as baseline_solve
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {{
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "avg_cost": 0.0,
+        "avg_cost_ratio": 0.0,
+        "num_cases": 0.0,
+    }}
+    artifacts: dict[str, str] = {{}}
+
+    program = Path(program_path).expanduser().resolve()
+    namespace = runpy.run_path(str(program), run_name="candidate_program")
+    solve = namespace.get("solve")
+    if not callable(solve):
+        artifacts["error_message"] = "candidate file must define solve(instance)"
+        return metrics, artifacts
+
+    total_cost = 0.0
+    total_ratio = 0.0
+    for idx, case in enumerate(CASES):
+        baseline_solution = baseline_solve(case)
+        baseline_eval = evaluate_solution(case, baseline_solution)
+        if not baseline_eval["valid"]:
+            artifacts["error_message"] = f"internal baseline invalid on case {{idx}}"
+            return metrics, artifacts
+
+        try:
+            candidate_solution = solve(case)
+            candidate_eval = evaluate_solution(case, candidate_solution)
+        except Exception:
+            artifacts["error_message"] = f"candidate exception on case {{idx}}\\n{{traceback.format_exc()}}"
+            return metrics, artifacts
+
+        if not candidate_eval["valid"] or not math.isfinite(candidate_eval["cost"]):
+            artifacts["error_message"] = f"candidate infeasible or non-finite cost on case {{idx}}"
+            return metrics, artifacts
+
+        ratio = baseline_eval["cost"] / candidate_eval["cost"]
+        total_cost += candidate_eval["cost"]
+        total_ratio += ratio
+
+    n = float(len(CASES))
+    metrics["valid"] = 1.0
+    metrics["num_cases"] = n
+    metrics["avg_cost"] = total_cost / n
+    metrics["avg_cost_ratio"] = total_ratio / n
+    metrics["combined_score"] = -metrics["avg_cost"]
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+
+    metrics, artifacts = evaluate(args.program)
+    metrics_path = Path(args.metrics_out)
+    metrics_path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
+"""
+
+
+GENERIC_REQUIREMENTS = """\
+stockpyl @ git+https://github.com/LarrySnyder/stockpyl.git
+numpy
+scipy
+"""
+
+
+GENERIC_CONSTRAINTS = """\
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, or `verification/`.
+Return a finite and feasible solution for every frozen case.
+"""
+
+
+def render_problem(task: dict) -> str:
+    cases_json = json.dumps(task["cases"], indent=4)
+    kind = task["kind"]
+    header = """\
+from __future__ import annotations
+
+import math
+from typing import Any
+
+from scipy.stats import norm, poisson
+from stockpyl.eoq import (
+    economic_order_quantity,
+    economic_order_quantity_with_all_units_discounts,
+    economic_order_quantity_with_incremental_discounts,
+)
+from stockpyl.rq import (
+    r_q_cost,
+    r_q_cost_poisson,
+    r_q_eil_approximation,
+    r_q_eoqss_approximation,
+    r_q_loss_function_approximation,
+    r_q_poisson_exact,
+)
+
+CASES = {cases_json}
+SAMPLE_INSTANCE = CASES[0]
+
+
+def _to_float(value: Any) -> float:
+    value = float(value)
+    if not math.isfinite(value):
+        raise ValueError("non-finite numeric value")
+    return value
+
+
+def _extract_order_quantity(solution: Any) -> float:
+    if isinstance(solution, dict):
+        if "order_quantity" not in solution:
+            raise ValueError("missing order_quantity")
+        return _to_float(solution["order_quantity"])
+    return _to_float(solution)
+
+
+def _extract_rq(solution: Any) -> tuple[int, int]:
+    if isinstance(solution, dict):
+        if "reorder_point" not in solution or "order_quantity" not in solution:
+            raise ValueError("missing reorder_point/order_quantity")
+        r = int(round(_to_float(solution["reorder_point"])))
+        q = int(round(_to_float(solution["order_quantity"])))
+        return r, q
+    if isinstance(solution, (tuple, list)) and len(solution) == 2:
+        r = int(round(_to_float(solution[0])))
+        q = int(round(_to_float(solution[1])))
+        return r, q
+    raise ValueError("solution must be a dict or length-2 tuple/list")
+"""
+
+    if kind == "eoq_moq":
+        body = """\
+
+def solve_baseline(instance: dict[str, float]) -> dict[str, float]:
+    q_star, _ = economic_order_quantity(
+        instance["fixed_cost"],
+        instance["holding_cost_rate"],
+        instance["demand_rate"],
+    )
+    q = max(q_star, instance["minimum_order_quantity"])
+    return {"order_quantity": float(q)}
+
+
+def evaluate_solution(instance: dict[str, float], solution: Any) -> dict[str, float | bool]:
+    try:
+        q = _extract_order_quantity(solution)
+    except Exception:
+        return {"valid": False, "cost": float("inf")}
+    if q < instance["minimum_order_quantity"] or q <= 0:
+        return {"valid": False, "cost": float("inf")}
+    _, cost = economic_order_quantity(
+        instance["fixed_cost"],
+        instance["holding_cost_rate"],
+        instance["demand_rate"],
+        order_quantity=q,
+    )
+    return {"valid": True, "cost": float(cost), "order_quantity": float(q)}
+"""
+    elif kind == "eoq_all_units":
+        body = """\
+
+def _region(instance: dict[str, float], q: float) -> int:
+    region = 0
+    for idx, bp in enumerate(instance["breakpoints"]):
+        if q >= bp:
+            region = idx
+    return region
+
+
+def _cost(instance: dict[str, float], q: float) -> float:
+    region = _region(instance, q)
+    unit_cost = instance["unit_costs"][region]
+    return (
+        unit_cost * instance["demand_rate"]
+        + instance["fixed_cost"] * instance["demand_rate"] / q
+        + instance["holding_cost_rate"] * unit_cost * q / 2.0
+    )
+
+
+def solve_baseline(instance: dict[str, float]) -> dict[str, float]:
+    q, region, cost = economic_order_quantity_with_all_units_discounts(
+        instance["fixed_cost"],
+        instance["holding_cost_rate"],
+        instance["demand_rate"],
+        list(instance["breakpoints"]),
+        list(instance["unit_costs"]),
+    )
+    return {"order_quantity": float(q), "region": int(region), "cost": float(cost)}
+
+
+def evaluate_solution(instance: dict[str, float], solution: Any) -> dict[str, float | bool]:
+    try:
+        q = _extract_order_quantity(solution)
+    except Exception:
+        return {"valid": False, "cost": float("inf")}
+    if q <= 0:
+        return {"valid": False, "cost": float("inf")}
+    return {"valid": True, "cost": float(_cost(instance, q)), "order_quantity": float(q)}
+"""
+    elif kind == "eoq_incremental":
+        body = """\
+
+def _c_bar(instance: dict[str, float], region: int) -> float:
+    if region == 0:
+        return 0.0
+    breakpoints = instance["breakpoints"]
+    unit_costs = instance["unit_costs"]
+    return sum(unit_costs[i] * (breakpoints[i + 1] - breakpoints[i]) for i in range(region)) - unit_costs[region] * breakpoints[region]
+
+
+def _region(instance: dict[str, float], q: float) -> int:
+    region = 0
+    for idx, bp in enumerate(instance["breakpoints"]):
+        if q >= bp:
+            region = idx
+    return region
+
+
+def _cost(instance: dict[str, float], q: float) -> float:
+    region = _region(instance, q)
+    unit_cost = instance["unit_costs"][region]
+    c_bar = _c_bar(instance, region)
+    return (
+        unit_cost * instance["demand_rate"]
+        + instance["holding_cost_rate"] * c_bar / 2.0
+        + (instance["fixed_cost"] + c_bar) * instance["demand_rate"] / q
+        + instance["holding_cost_rate"] * unit_cost * q / 2.0
+    )
+
+
+def solve_baseline(instance: dict[str, float]) -> dict[str, float]:
+    q, region, cost = economic_order_quantity_with_incremental_discounts(
+        instance["fixed_cost"],
+        instance["holding_cost_rate"],
+        instance["demand_rate"],
+        list(instance["breakpoints"]),
+        list(instance["unit_costs"]),
+    )
+    return {"order_quantity": float(q), "region": int(region), "cost": float(cost)}
+
+
+def evaluate_solution(instance: dict[str, float], solution: Any) -> dict[str, float | bool]:
+    try:
+        q = _extract_order_quantity(solution)
+    except Exception:
+        return {"valid": False, "cost": float("inf")}
+    if q <= 0:
+        return {"valid": False, "cost": float("inf")}
+    return {"valid": True, "cost": float(_cost(instance, q)), "order_quantity": float(q)}
+"""
+    elif kind == "rq_poisson":
+        body = """\
+
+def _service_level(instance: dict[str, float], r: int) -> float:
+    mean_lt = instance["demand_mean"] * instance["lead_time"]
+    return float(poisson.cdf(r, mean_lt))
+
+
+def solve_baseline(instance: dict[str, float]) -> dict[str, float]:
+    r, q, _ = r_q_poisson_exact(
+        instance["holding_cost"],
+        instance["stockout_cost"],
+        instance["fixed_cost"],
+        instance["demand_mean"],
+        instance["lead_time"],
+    )
+    r = int(round(r))
+    q = max(1, int(round(q)))
+    while _service_level(instance, r) < instance["target_csl"]:
+        r += 1
+    return {"reorder_point": r, "order_quantity": q}
+
+
+def evaluate_solution(instance: dict[str, float], solution: Any) -> dict[str, float | bool]:
+    try:
+        r, q = _extract_rq(solution)
+    except Exception:
+        return {"valid": False, "cost": float("inf")}
+    if q <= 0:
+        return {"valid": False, "cost": float("inf")}
+    csl = _service_level(instance, r)
+    if csl < instance["target_csl"]:
+        return {"valid": False, "cost": float("inf")}
+    cost = r_q_cost_poisson(
+        r,
+        q,
+        instance["holding_cost"],
+        instance["stockout_cost"],
+        instance["fixed_cost"],
+        instance["demand_mean"],
+        instance["lead_time"],
+    )
+    return {"valid": True, "cost": float(cost), "reorder_point": int(r), "order_quantity": int(q), "service_level": float(csl)}
+"""
+    elif kind == "rq_normal":
+        body = """\
+
+def _service_level(instance: dict[str, float], r: int) -> float:
+    mean_lt = instance["demand_mean"] * instance["lead_time"]
+    sd_lt = instance["demand_sd"] * math.sqrt(instance["lead_time"])
+    z = (r - mean_lt) / sd_lt
+    return float(norm.cdf(z))
+
+
+def _candidate_pairs(instance: dict[str, float]) -> list[tuple[int, int]]:
+    pairs: list[tuple[int, int]] = []
+    for fn in (r_q_eil_approximation, r_q_eoqss_approximation, r_q_loss_function_approximation):
+        result = fn(
+            instance["holding_cost"],
+            instance["stockout_cost"],
+            instance["fixed_cost"],
+            instance["demand_mean"],
+            instance["demand_sd"],
+            instance["lead_time"],
+        )
+        if len(result) >= 2:
+            r = int(round(float(result[0])))
+            q = max(1, int(round(float(result[1]))))
+            pairs.append((r, q))
+    return pairs
+
+
+def solve_baseline(instance: dict[str, float]) -> dict[str, float]:
+    best = None
+    for r, q in _candidate_pairs(instance):
+        while _service_level(instance, r) < instance["target_csl"]:
+            r += 1
+        cost = r_q_cost(
+            r,
+            q,
+            instance["holding_cost"],
+            instance["stockout_cost"],
+            instance["fixed_cost"],
+            instance["demand_mean"],
+            instance["demand_sd"],
+            instance["lead_time"],
+        )
+        candidate = (float(cost), int(r), int(q))
+        if best is None or candidate < best:
+            best = candidate
+    if best is None:
+        raise RuntimeError("no feasible baseline candidate")
+    _, r, q = best
+    return {"reorder_point": r, "order_quantity": q}
+
+
+def evaluate_solution(instance: dict[str, float], solution: Any) -> dict[str, float | bool]:
+    try:
+        r, q = _extract_rq(solution)
+    except Exception:
+        return {"valid": False, "cost": float("inf")}
+    if q <= 0:
+        return {"valid": False, "cost": float("inf")}
+    csl = _service_level(instance, r)
+    if csl < instance["target_csl"]:
+        return {"valid": False, "cost": float("inf")}
+    cost = r_q_cost(
+        r,
+        q,
+        instance["holding_cost"],
+        instance["stockout_cost"],
+        instance["fixed_cost"],
+        instance["demand_mean"],
+        instance["demand_sd"],
+        instance["lead_time"],
+    )
+    return {"valid": True, "cost": float(cost), "reorder_point": int(r), "order_quantity": int(q), "service_level": float(csl)}
+"""
+    else:
+        raise ValueError(f"unknown task kind: {kind}")
+
+    return textwrap.dedent(header.format(cases_json=cases_json) + body)
+
+
+def write_text(path: Path, content: str) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(textwrap.dedent(content).rstrip() + "\n", encoding="utf-8")
+
+
+def provenance_summary(kind: str) -> str:
+    if kind == "eoq_moq":
+        return "Upstream lineage: `Stockpyl` EOQ routines and the classic deterministic EOQ model family."
+    if kind == "eoq_all_units":
+        return "Upstream lineage: `Stockpyl` EOQ with all-units discount routines and the standard all-units discount EOQ model family."
+    if kind == "eoq_incremental":
+        return "Upstream lineage: `Stockpyl` EOQ with incremental discount routines and the standard incremental discount EOQ model family."
+    if kind == "rq_poisson":
+        return "Upstream lineage: `Stockpyl` single-echelon `(r,Q)` routines for Poisson demand."
+    if kind == "rq_normal":
+        return "Upstream lineage: `Stockpyl` single-echelon `(r,Q)` routines for Normal demand."
+    raise ValueError(f"unknown task kind: {kind}")
+
+
+def task_source_en(kind: str) -> str:
+    if kind == "eoq_moq":
+        return "Canonical source lineage comes from `Stockpyl` EOQ routines and standard deterministic EOQ formulas. The benchmark uses frozen benchmark-local cases defined in `runtime/problem.py`."
+    if kind == "eoq_all_units":
+        return "Canonical source lineage comes from `Stockpyl` all-units discount EOQ routines. The benchmark uses frozen benchmark-local cases defined in `runtime/problem.py`."
+    if kind == "eoq_incremental":
+        return "Canonical source lineage comes from `Stockpyl` incremental discount EOQ routines. The benchmark uses frozen benchmark-local cases defined in `runtime/problem.py`."
+    if kind == "rq_poisson":
+        return "Canonical source lineage comes from `Stockpyl` Poisson-demand `(r,Q)` routines. The benchmark uses frozen benchmark-local cases defined in `runtime/problem.py`."
+    if kind == "rq_normal":
+        return "Canonical source lineage comes from `Stockpyl` Normal-demand `(r,Q)` routines. The benchmark uses frozen benchmark-local cases defined in `runtime/problem.py`."
+    raise ValueError(f"unknown task kind: {kind}")
+
+
+def task_source_zh(kind: str) -> str:
+    if kind == "eoq_moq":
+        return "规范来源来自 `Stockpyl` 的 EOQ 公式实现与经典确定性 EOQ 模型。固定评测样例定义在 `runtime/problem.py` 中，属于 benchmark 内部冻结参数表。"
+    if kind == "eoq_all_units":
+        return "规范来源来自 `Stockpyl` 的 all-units discount EOQ 实现。固定评测样例定义在 `runtime/problem.py` 中，属于 benchmark 内部冻结参数表。"
+    if kind == "eoq_incremental":
+        return "规范来源来自 `Stockpyl` 的 incremental discount EOQ 实现。固定评测样例定义在 `runtime/problem.py` 中，属于 benchmark 内部冻结参数表。"
+    if kind == "rq_poisson":
+        return "规范来源来自 `Stockpyl` 的 Poisson-demand `(r,Q)` 实现。固定评测样例定义在 `runtime/problem.py` 中，属于 benchmark 内部冻结参数表。"
+    if kind == "rq_normal":
+        return "规范来源来自 `Stockpyl` 的 Normal-demand `(r,Q)` 实现。固定评测样例定义在 `runtime/problem.py` 中，属于 benchmark 内部冻结参数表。"
+    raise ValueError(f"unknown task kind: {kind}")
+
+
+def source_manifest_text(kind: str) -> str:
+    if kind == "eoq_moq":
+        upstream = "- `stockpyl.eoq.economic_order_quantity`\n- deterministic EOQ formulas as documented in standard inventory theory references used by Stockpyl"
+    elif kind == "eoq_all_units":
+        upstream = "- `stockpyl.eoq.economic_order_quantity_with_all_units_discounts`\n- all-units discount EOQ formulas as documented in standard inventory theory references used by Stockpyl"
+    elif kind == "eoq_incremental":
+        upstream = "- `stockpyl.eoq.economic_order_quantity_with_incremental_discounts`\n- incremental discount EOQ formulas as documented in standard inventory theory references used by Stockpyl"
+    elif kind == "rq_poisson":
+        upstream = "- `stockpyl.rq.r_q_poisson_exact`\n- single-echelon `(r,Q)` formulas for Poisson demand used in Stockpyl"
+    elif kind == "rq_normal":
+        upstream = "- `stockpyl.rq.r_q_eil_approximation`\n- `stockpyl.rq.r_q_eoqss_approximation`\n- `stockpyl.rq.r_q_loss_function_approximation`\n- single-echelon `(r,Q)` formulas for Normal demand used in Stockpyl"
+    else:
+        raise ValueError(f"unknown task kind: {kind}")
+    return textwrap.dedent(
+        f"""\
+        # Source Manifest
+
+        - Upstream library: `Stockpyl`
+        - Upstream lineage:
+          {textwrap.indent(upstream, ' ' * 10).lstrip()}
+        - Data provenance: this benchmark does not use an external dataset. It uses benchmark-local frozen numeric instances defined in `runtime/problem.py`.
+        - Transformation path: no preprocessing pipeline; the parameter tables are authored directly in the benchmark runtime.
+        - License lineage: Stockpyl is released under the MIT License.
+        """
+    )
+
+
+def bootstrap_task(repo_root: Path, task: dict) -> None:
+    task_dir = repo_root / "benchmarks" / task["domain"] / task["slug"]
+    task_values = dict(task)
+    task_values["provenance_summary"] = provenance_summary(task["kind"])
+    task_values["task_source_en"] = task_source_en(task["kind"])
+    task_values["task_source_zh"] = task_source_zh(task["kind"])
+
+    write_text(task_dir / "README.md", GENERIC_README.format(**task_values))
+    write_text(task_dir / "Task.md", GENERIC_TASK.format(**task_values))
+    write_text(task_dir / "Task_zh-CN.md", GENERIC_TASK_ZH.format(**task_values))
+    write_text(task_dir / "references" / "source_manifest.md", source_manifest_text(task["kind"]))
+    write_text(task_dir / "scripts" / "init.py", GENERIC_INIT.format(**task_values))
+    write_text(task_dir / "baseline" / "solution.py", GENERIC_BASELINE.format(**task_values))
+    write_text(task_dir / "runtime" / "problem.py", render_problem(task))
+    write_text(task_dir / "verification" / "evaluator.py", GENERIC_EVALUATOR.format(**task_values))
+    write_text(task_dir / "verification" / "requirements.txt", GENERIC_REQUIREMENTS)
+
+    write_text(task_dir / "frontier_eval" / "initial_program.txt", "scripts/init.py\n")
+    write_text(task_dir / "frontier_eval" / "candidate_destination.txt", "scripts/init.py\n")
+    write_text(task_dir / "frontier_eval" / "eval_command.txt", "{python} verification/evaluator.py {candidate} --metrics-out metrics.json\n")
+    write_text(task_dir / "frontier_eval" / "eval_cwd.txt", ".\n")
+    write_text(task_dir / "frontier_eval" / "agent_files.txt", "Task.md\nTask_zh-CN.md\nREADME.md\nbaseline/solution.py\nruntime/problem.py\n")
+    write_text(task_dir / "frontier_eval" / "readonly_files.txt", "baseline/solution.py\nruntime/problem.py\nverification/evaluator.py\nreferences/source_manifest.md\n")
+    write_text(task_dir / "frontier_eval" / "constraints.txt", GENERIC_CONSTRAINTS)
+
+
+def main() -> None:
+    repo_root = Path(__file__).resolve().parents[1]
+    for task in TASKS:
+        bootstrap_task(repo_root, task)
+        print(f"bootstrapped {task['domain']}/{task['slug']}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/bootstrap_jssp_benchmarks.py b/scripts/bootstrap_jssp_benchmarks.py
new file mode 100644
index 00000000..adc164c4
--- /dev/null
+++ b/scripts/bootstrap_jssp_benchmarks.py
@@ -0,0 +1,847 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import json
+import textwrap
+from pathlib import Path
+
+
+TASKS = [
+    {
+        "slug": "FT10DispatchingRuleOptimization",
+        "title": "FT10 Dispatching Rule Optimization",
+        "short": "Optimize a greedy dispatching rule on the canonical FT10 Fisher-Thompson 10x10 job shop instance.",
+        "instance_name": "ft10",
+        "optimum": 930,
+        "task_kind": "dispatch",
+    },
+    {
+        "slug": "LA16DispatchingRuleOptimization",
+        "title": "LA16 Dispatching Rule Optimization",
+        "short": "Optimize a greedy dispatching rule on the canonical LA16 Lawrence 10x10 job shop instance.",
+        "instance_name": "la16",
+        "optimum": 945,
+        "task_kind": "dispatch",
+    },
+    {
+        "slug": "FT10NeighborhoodMoveSelection",
+        "title": "FT10 Neighborhood Move Selection",
+        "short": "Guide an adjacent-swap local search on the canonical FT10 Fisher-Thompson 10x10 job shop instance.",
+        "instance_name": "ft10",
+        "optimum": 930,
+        "task_kind": "move",
+    },
+    {
+        "slug": "LA16NeighborhoodMoveSelection",
+        "title": "LA16 Neighborhood Move Selection",
+        "short": "Guide an adjacent-swap local search on the canonical LA16 Lawrence 10x10 job shop instance.",
+        "instance_name": "la16",
+        "optimum": 945,
+        "task_kind": "move",
+    },
+]
+
+
+SOURCE_MANIFEST = """\
+# Source Manifest
+
+- Canonical instance: `{instance_name}`
+- Upstream package: `job_shop_lib`
+- Upstream file: `job_shop_lib/benchmarking/benchmark_instances.json`
+- Canonical optimum recorded in upstream metadata: `{optimum}`
+- Original academic provenance:
+  - `ft10`: Fisher and Thompson, *Industrial Scheduling*, 1963.
+  - `la16`: Lawrence benchmark set, 1984.
+
+This benchmark vendors only the specific frozen instance JSON required for evaluation.
+"""
+
+
+README_TEMPLATE = """\
+# {title}
+
+{short}
+
+## Provenance
+
+The frozen instance is copied from the canonical benchmark set distributed in `job_shop_lib/benchmarking/benchmark_instances.json`.
+The instance id is `{instance_name}`, and the published optimum used as the scoring reference is `{optimum}`.
+
+## File Layout
+
+- `Task.md`: task contract and scoring rules.
+- `Task_zh-CN.md`: Chinese version.
+- `scripts/init.py`: initial candidate file exposed to agents.
+- `baseline/solution.py`: baseline heuristic.
+- `runtime/problem.py`: frozen instance, scheduling runtime, baseline, and evaluator helpers.
+- `runtime/instance.json`: vendored canonical benchmark instance.
+- `verification/evaluator.py`: evaluator entry.
+- `references/source_manifest.md`: instance provenance.
+
+## Quick Run
+
+```bash
+python benchmarks/OperationsResearch/{slug}/verification/evaluator.py \
+  benchmarks/OperationsResearch/{slug}/scripts/init.py \
+  --metrics-out /tmp/{slug}_metrics.json
+```
+
+Run with `frontier_eval`:
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=OperationsResearch/{slug} \
+  task.runtime.use_conda_run=false \
+  task.runtime.python_path=/path/to/python \
+  algorithm.iterations=0
+```
+"""
+
+
+TASK_TEMPLATE = """\
+# {title} Task
+
+## Objective
+
+{short}
+
+The benchmark uses one frozen canonical instance: `{instance_name}`.
+The known optimum for this instance is `{optimum}`.
+
+## Submission Contract
+
+Submit one Python file.
+
+For dispatch-rule tasks, define:
+
+```python
+def score_operation(operation, state):
+    ...
+```
+
+For neighborhood-move tasks, define:
+
+```python
+def score_move(move, state):
+    ...
+```
+
+You may optionally define:
+
+```python
+MAX_ITERATIONS = 50
+```
+
+## Evaluation
+
+Dispatch-rule tasks:
+
+1. Start from an empty schedule.
+2. Repeatedly gather the next unscheduled operation from each job.
+3. Among operations with the earliest feasible start time, choose the one with highest `score_operation`.
+4. Build a complete feasible schedule and compute makespan.
+
+Neighborhood-move tasks:
+
+1. Start from the baseline SPT dispatch schedule.
+2. Repeatedly generate adjacent machine-order swap moves.
+3. Rank moves by `score_move`.
+4. Apply the first improving move in ranked order.
+5. Stop when no improving move exists or `MAX_ITERATIONS` is reached.
+
+## Metrics
+
+- `combined_score`: `-candidate_makespan`
+- `valid`: `1.0` only when a complete feasible schedule is produced
+- `candidate_makespan`
+- `baseline_makespan`
+- `relative_gap_to_optimum`
+
+## Failure Cases
+
+The submission is marked invalid and receives a very low score if:
+
+- the required scoring function is missing
+- the return value is non-finite
+- the induced schedule is infeasible
+- the candidate crashes during evaluation
+"""
+
+
+TASK_ZH_TEMPLATE = """\
+# {title} 任务
+
+## 目标
+
+{short}
+
+评测使用单个固定的 canonical 实例：`{instance_name}`。
+该实例的已知最优 makespan 为 `{optimum}`。
+
+## 提交接口
+
+提交一个 Python 文件。
+
+如果是 dispatch-rule 任务，需要定义：
+
+```python
+def score_operation(operation, state):
+    ...
+```
+
+如果是邻域搜索任务，需要定义：
+
+```python
+def score_move(move, state):
+    ...
+```
+
+你也可以额外定义：
+
+```python
+MAX_ITERATIONS = 50
+```
+
+## 评测方式
+
+Dispatch-rule 任务：
+
+1. 从空排程开始。
+2. 每次收集每个 job 的下一道未排工序。
+3. 在“最早可开工时间最小”的工序集合中，选择 `score_operation` 最高者。
+4. 构造完整可行排程并计算 makespan。
+
+邻域搜索任务：
+
+1. 从 baseline 的 SPT dispatch 排程开始。
+2. 生成机器序列上的相邻交换 move。
+3. 用 `score_move` 对 move 排序。
+4. 按排序顺序找到第一个真正改进 makespan 的 move 并应用。
+5. 当没有改进 move 或达到 `MAX_ITERATIONS` 时停止。
+
+## 指标
+
+- `combined_score`：`-candidate_makespan`
+- `valid`：只有生成完整可行排程时才为 `1.0`
+- `candidate_makespan`
+- `baseline_makespan`
+- `relative_gap_to_optimum`
+
+## 失败情况
+
+如果出现以下情况，提交会被判为无效，并得到一个很低的分数：
+
+- 缺少要求的评分函数
+- 返回值不是有限标量
+- 诱导出的排程不可行
+- 候选程序在评测时崩溃
+"""
+
+
+INIT_DISPATCH = """\
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.{slug}.baseline.solution import score_operation as _baseline_score_operation
+except ModuleNotFoundError:
+    from baseline.solution import score_operation as _baseline_score_operation
+
+
+# EVOLVE-BLOCK-START
+def score_operation(operation, state):
+    return _baseline_score_operation(operation, state)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.OperationsResearch.{slug}.runtime.problem import load_instance, schedule_with_dispatch
+    except ModuleNotFoundError:
+        from runtime.problem import load_instance, schedule_with_dispatch
+    instance = load_instance()
+    result = schedule_with_dispatch(instance, score_operation)
+    print(result["makespan"])
+"""
+
+
+INIT_MOVE = """\
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.{slug}.baseline.solution import MAX_ITERATIONS as _baseline_MAX_ITERATIONS, score_move as _baseline_score_move
+except ModuleNotFoundError:
+    from baseline.solution import MAX_ITERATIONS as _baseline_MAX_ITERATIONS, score_move as _baseline_score_move
+
+
+# EVOLVE-BLOCK-START
+MAX_ITERATIONS = _baseline_MAX_ITERATIONS
+
+
+def score_move(move, state):
+    return _baseline_score_move(move, state)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.OperationsResearch.{slug}.runtime.problem import load_instance, run_local_search
+    except ModuleNotFoundError:
+        from runtime.problem import load_instance, run_local_search
+    instance = load_instance()
+    result = run_local_search(instance, score_move, MAX_ITERATIONS)
+    print(result["makespan"])
+"""
+
+
+BASELINE_DISPATCH = """\
+from __future__ import annotations
+
+
+def score_operation(operation, state):
+    return (
+        -float(operation["duration"]),
+        -float(operation["remaining_job_work"]),
+        -float(operation["job_id"]),
+    )
+"""
+
+
+BASELINE_MOVE = """\
+from __future__ import annotations
+
+MAX_ITERATIONS = 50
+
+
+def score_move(move, state):
+    return (
+        float(move["delta_duration"]),
+        -float(move["machine_position"]),
+        -float(move["machine_id"]),
+    )
+"""
+
+
+EVALUATOR_TEMPLATE = """\
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.{slug}.runtime.problem import (
+        KNOWN_OPTIMUM,
+        baseline_dispatch_score,
+        baseline_move_score,
+        load_instance,
+        relative_gap,
+        run_local_search,
+        schedule_with_dispatch,
+    )
+except ModuleNotFoundError:
+    from runtime.problem import (
+        KNOWN_OPTIMUM,
+        baseline_dispatch_score,
+        baseline_move_score,
+        load_instance,
+        relative_gap,
+        run_local_search,
+        schedule_with_dispatch,
+    )
+
+
+TASK_KIND = "{task_kind}"
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {{
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_makespan": 0.0,
+        "baseline_makespan": 0.0,
+        "relative_gap_to_optimum": 0.0,
+    }}
+    artifacts: dict[str, str] = {{}}
+
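+    program = Path(program_path).expanduser().resolve()
+
+    try:
+        # Load the candidate inside the try block so that a crash at import time
+        # is reported as an invalid submission instead of aborting the evaluator.
+        namespace = runpy.run_path(str(program), run_name="candidate_program")
+        instance = load_instance()
+        if TASK_KIND == "dispatch":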
+            score_fn = namespace.get("score_operation")
+            if not callable(score_fn):
+                raise RuntimeError("candidate must define score_operation(operation, state)")
+            baseline = schedule_with_dispatch(instance, baseline_dispatch_score)
+            candidate = schedule_with_dispatch(instance, score_fn)
+        else:
+            score_fn = namespace.get("score_move")
+            if not callable(score_fn):
+                raise RuntimeError("candidate must define score_move(move, state)")
+            max_iterations = int(namespace.get("MAX_ITERATIONS", 50))
+            baseline = run_local_search(instance, baseline_move_score, max_iterations=50)
+            candidate = run_local_search(instance, score_fn, max_iterations=max_iterations)
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+
+    if not baseline["valid"]:
+        artifacts["error_message"] = "internal baseline produced an invalid schedule"
+        return metrics, artifacts
+    if not candidate["valid"]:
+        artifacts["error_message"] = "candidate produced an invalid schedule"
+        return metrics, artifacts
+
+    makespan = float(candidate["makespan"])
+    baseline_makespan = float(baseline["makespan"])
+    if not math.isfinite(makespan) or makespan <= 0:
+        artifacts["error_message"] = "candidate makespan is invalid"
+        return metrics, artifacts
+
+    metrics["valid"] = 1.0
+    metrics["candidate_makespan"] = makespan
+    metrics["baseline_makespan"] = baseline_makespan
+    metrics["relative_gap_to_optimum"] = relative_gap(makespan, KNOWN_OPTIMUM)
+    metrics["combined_score"] = -makespan
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
+"""
+
+
+RUNTIME_TEMPLATE = """\
+from __future__ import annotations
+
+import copy
+import json
+import math
+from pathlib import Path
+from typing import Any
+
+
+INSTANCE_PATH = Path(__file__).resolve().with_name("instance.json")
+KNOWN_OPTIMUM = {optimum}
+
+
+def load_instance() -> dict[str, Any]:
+    return json.loads(INSTANCE_PATH.read_text(encoding="utf-8"))
+
+
+def relative_gap(value: float, optimum: float) -> float:
+    return float((value - optimum) / optimum)
+
+
+def baseline_dispatch_score(operation: dict[str, Any], state: dict[str, Any]):
+    return (
+        -float(operation["duration"]),
+        -float(operation["remaining_job_work"]),
+        -float(operation["job_id"]),
+    )
+
+
+def baseline_move_score(move: dict[str, Any], state: dict[str, Any]):
+    return (
+        float(move["delta_duration"]),
+        -float(move["machine_position"]),
+        -float(move["machine_id"]),
+    )
+
+
+def _build_operation_tables(instance: dict[str, Any]) -> tuple[list[list[int]], list[list[int]], dict[tuple[int, int], tuple[int, int]]]:
+    durations = instance["duration_matrix"]
+    machines = instance["machines_matrix"]
+    op_map: dict[tuple[int, int], tuple[int, int]] = {{}}
+    for j, row in enumerate(machines):
+        for k, machine in enumerate(row):
+            op_map[(j, k)] = (machine, durations[j][k])
+    return durations, machines, op_map
+
+
+def schedule_with_dispatch(instance: dict[str, Any], score_operation) -> dict[str, Any]:
+    durations, machines, _ = _build_operation_tables(instance)
+    num_jobs = len(durations)
+    num_machines = len(durations[0])
+    job_next = [0] * num_jobs
+    job_ready = [0] * num_jobs
+    machine_ready = [0] * num_machines
+    scheduled_ops: list[dict[str, Any]] = []
+
+    total_ops = num_jobs * num_machines
+    while len(scheduled_ops) < total_ops:
+        candidates: list[dict[str, Any]] = []
+        for job_id in range(num_jobs):
+            op_index = job_next[job_id]
+            if op_index >= num_machines:
+                continue
+            machine_id = machines[job_id][op_index]
+            duration = durations[job_id][op_index]
+            earliest_start = max(job_ready[job_id], machine_ready[machine_id])
+            remaining_job_work = sum(durations[job_id][op_index:])
+            remaining_job_ops = num_machines - op_index
+            candidates.append(
+                {{
+                    "job_id": job_id,
+                    "op_index": op_index,
+                    "machine_id": machine_id,
+                    "duration": duration,
+                    "earliest_start": earliest_start,
+                    "remaining_job_work": remaining_job_work,
+                    "remaining_job_ops": remaining_job_ops,
+                }}
+            )
+        min_start = min(op["earliest_start"] for op in candidates)
+        ready = [op for op in candidates if op["earliest_start"] == min_start]
+        state = {{
+            "step": len(scheduled_ops),
+            "job_ready_times": tuple(job_ready),
+            "machine_ready_times": tuple(machine_ready),
+            "current_makespan": max(max(job_ready), max(machine_ready)),
+        }}
+        scored: list[tuple[Any, dict[str, Any]]] = []
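+        for op in ready:
+            score = score_operation(op, state)
+            # Enforce the task contract: every score component must be finite.
+            parts = score if isinstance(score, (tuple, list)) else (score,)
+            if not all(math.isfinite(float(part)) for part in parts):
+                raise ValueError("score_operation must return finite values")
+            scored.append((score, op))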
+        scored.sort(
+            key=lambda item: (
+                item[0],
+                -item[1]["duration"],
+                -item[1]["remaining_job_work"],
+                -item[1]["job_id"],
+            ),
+            reverse=True,
+        )
+        chosen = scored[0][1]
+        start = chosen["earliest_start"]
+        end = start + chosen["duration"]
+        scheduled = dict(chosen)
+        scheduled["start"] = start
+        scheduled["end"] = end
+        scheduled_ops.append(scheduled)
+        job_ready[chosen["job_id"]] = end
+        machine_ready[chosen["machine_id"]] = end
+        job_next[chosen["job_id"]] += 1
+
+    return {{
+        "valid": True,
+        "schedule": scheduled_ops,
+        "makespan": max(op["end"] for op in scheduled_ops),
+        "machine_sequences": machine_sequences_from_schedule(instance, scheduled_ops),
+    }}
+
+
+def machine_sequences_from_schedule(instance: dict[str, Any], schedule: list[dict[str, Any]]) -> list[list[tuple[int, int]]]:
+    num_machines = len(instance["machines_matrix"][0])
+    sequences: list[list[tuple[int, int, int, int]]] = [[] for _ in range(num_machines)]
+    for op in schedule:
+        sequences[op["machine_id"]].append((op["start"], op["job_id"], op["op_index"], op["end"]))
+    out: list[list[tuple[int, int]]] = []
+    for machine_ops in sequences:
+        machine_ops.sort()
+        out.append([(job_id, op_index) for _, job_id, op_index, _ in machine_ops])
+    return out
+
+
+def build_schedule_from_machine_sequences(instance: dict[str, Any], machine_sequences: list[list[tuple[int, int]]]) -> dict[str, Any]:
+    durations, machines, op_map = _build_operation_tables(instance)
+    num_jobs = len(durations)
+    num_machines = len(durations[0])
+    machine_pred: dict[tuple[int, int], tuple[int, int] | None] = {{}}
+    for seq in machine_sequences:
+        for idx, op in enumerate(seq):
+            machine_pred[op] = seq[idx - 1] if idx > 0 else None
+
+    scheduled: dict[tuple[int, int], dict[str, Any]] = {{}}
+    total_ops = num_jobs * num_machines
+    while len(scheduled) < total_ops:
+        progress = False
+        for job_id in range(num_jobs):
+            for op_index in range(num_machines):
+                op = (job_id, op_index)
+                if op in scheduled:
+                    continue
+                job_prev = (job_id, op_index - 1) if op_index > 0 else None
+                mach_prev = machine_pred.get(op)
+                if job_prev is not None and job_prev not in scheduled:
+                    continue
+                if mach_prev is not None and mach_prev not in scheduled:
+                    continue
+                machine_id, duration = op_map[op]
+                start = 0
+                if job_prev is not None:
+                    start = max(start, scheduled[job_prev]["end"])
+                if mach_prev is not None:
+                    start = max(start, scheduled[mach_prev]["end"])
+                scheduled[op] = {{
+                    "job_id": job_id,
+                    "op_index": op_index,
+                    "machine_id": machine_id,
+                    "duration": duration,
+                    "start": start,
+                    "end": start + duration,
+                }}
+                progress = True
+        if not progress:
+            return {{"valid": False, "schedule": [], "makespan": float("inf"), "machine_sequences": machine_sequences}}
+
+    schedule = list(scheduled.values())
+    schedule.sort(key=lambda item: (item["start"], item["machine_id"], item["job_id"], item["op_index"]))
+    return {{
+        "valid": True,
+        "schedule": schedule,
+        "makespan": max(op["end"] for op in schedule),
+        "machine_sequences": machine_sequences,
+    }}
+
+
+def initial_machine_sequences(instance: dict[str, Any]) -> list[list[tuple[int, int]]]:
+    baseline = schedule_with_dispatch(instance, baseline_dispatch_score)
+    return baseline["machine_sequences"]
+
+
+def generate_adjacent_moves(instance: dict[str, Any], current: dict[str, Any]) -> list[dict[str, Any]]:
+    durations, machines, _ = _build_operation_tables(instance)
+    schedule_by_op = {{
+        (op["job_id"], op["op_index"]): op
+        for op in current["schedule"]
+    }}
+    moves: list[dict[str, Any]] = []
+    for machine_id, seq in enumerate(current["machine_sequences"]):
+        for pos in range(len(seq) - 1):
+            a = seq[pos]
+            b = seq[pos + 1]
+            a_sched = schedule_by_op[a]
+            b_sched = schedule_by_op[b]
+            moves.append(
+                {{
+                    "machine_id": machine_id,
+                    "machine_position": pos,
+                    "op_a": {{
+                        "job_id": a[0],
+                        "op_index": a[1],
+                        "duration": durations[a[0]][a[1]],
+                        "start": a_sched["start"],
+                        "end": a_sched["end"],
+                    }},
+                    "op_b": {{
+                        "job_id": b[0],
+                        "op_index": b[1],
+                        "duration": durations[b[0]][b[1]],
+                        "start": b_sched["start"],
+                        "end": b_sched["end"],
+                    }},
+                    "delta_duration": durations[a[0]][a[1]] - durations[b[0]][b[1]],
+                    "current_makespan": current["makespan"],
+                }}
+            )
+    return moves
+
+
+def apply_adjacent_swap(machine_sequences: list[list[tuple[int, int]]], machine_id: int, position: int) -> list[list[tuple[int, int]]]:
+    new_sequences = copy.deepcopy(machine_sequences)
+    new_sequences[machine_id][position], new_sequences[machine_id][position + 1] = (
+        new_sequences[machine_id][position + 1],
+        new_sequences[machine_id][position],
+    )
+    return new_sequences
+
+
+def run_local_search(instance: dict[str, Any], score_move, max_iterations: int = 50) -> dict[str, Any]:
+    current = schedule_with_dispatch(instance, baseline_dispatch_score)
+    if not current["valid"]:
+        return current
+
+    for iteration in range(max_iterations):
+        moves = generate_adjacent_moves(instance, current)
+        state = {{
+            "iteration": iteration,
+            "current_makespan": current["makespan"],
+        }}
+        scored = []
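+        for move in moves:
+            score = score_move(move, state)
+            # Enforce the task contract: every score component must be finite.
+            parts = score if isinstance(score, (tuple, list)) else (score,)
+            if not all(math.isfinite(float(part)) for part in parts):
+                raise ValueError("score_move must return finite values")
+            scored.append((score, move))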
+        scored.sort(
+            key=lambda item: (
+                item[0],
+                item[1]["delta_duration"],
+                -item[1]["machine_position"],
+            ),
+            reverse=True,
+        )
+        improved = False
+        for _, move in scored:
+            new_sequences = apply_adjacent_swap(current["machine_sequences"], move["machine_id"], move["machine_position"])
+            candidate = build_schedule_from_machine_sequences(instance, new_sequences)
+            if candidate["valid"] and candidate["makespan"] < current["makespan"]:
+                current = candidate
+                improved = True
+                break
+        if not improved:
+            break
+
+    return current
+"""
+
+
+CONSTRAINTS = """\
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files in `runtime/`, `verification/`, `references/`, or `baseline/`.
+For dispatch tasks, define `score_operation(operation, state)`.
+For neighborhood tasks, define `score_move(move, state)`.
+Return only finite scores: a scalar or, like the baseline, a tuple of finite scalars.
+"""
+
+
+def write_text(path: Path, content: str) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(textwrap.dedent(content).rstrip() + "\n", encoding="utf-8")
+
+
+def source_json_path() -> Path:
+    candidates = [
+        Path("/tmp/benchmark_instances.json"),
+        Path("/tmp/job_shop_lib/job_shop_lib/benchmarking/benchmark_instances.json"),
+    ]
+    for candidate in candidates:
+        if candidate.is_file():
+            return candidate
+    raise FileNotFoundError("Could not locate benchmark_instances.json from job_shop_lib; tried: " + ", ".join(str(path) for path in candidates))
+
+
+def load_instances() -> dict[str, dict]:
+    return json.loads(source_json_path().read_text(encoding="utf-8"))
+
+
+def bootstrap_task(repo_root: Path, task: dict, instance_payload: dict) -> None:
+    task_dir = repo_root / "benchmarks" / "OperationsResearch" / task["slug"]
+    write_text(task_dir / "README.md", README_TEMPLATE.format(**task))
+    write_text(task_dir / "Task.md", TASK_TEMPLATE.format(**task))
+    write_text(task_dir / "Task_zh-CN.md", TASK_ZH_TEMPLATE.format(**task))
+    write_text(task_dir / "references" / "source_manifest.md", SOURCE_MANIFEST.format(**task))
+    write_text(task_dir / "runtime" / "problem.py", RUNTIME_TEMPLATE.format(**task))
+    write_text(task_dir / "runtime" / "instance.json", json.dumps(instance_payload, indent=2))
+    if task["task_kind"] == "dispatch":
+        write_text(task_dir / "scripts" / "init.py", INIT_DISPATCH.format(**task))
+        write_text(task_dir / "baseline" / "solution.py", BASELINE_DISPATCH)
+    else:
+        write_text(task_dir / "scripts" / "init.py", INIT_MOVE.format(**task))
+        write_text(task_dir / "baseline" / "solution.py", BASELINE_MOVE)
+    write_text(task_dir / "verification" / "evaluator.py", EVALUATOR_TEMPLATE.format(**task))
+    write_text(task_dir / "verification" / "requirements.txt", "ortools\n")
+
+    write_text(task_dir / "frontier_eval" / "initial_program.txt", "scripts/init.py\n")
+    write_text(task_dir / "frontier_eval" / "candidate_destination.txt", "scripts/init.py\n")
+    write_text(task_dir / "frontier_eval" / "eval_command.txt", "{python} verification/evaluator.py {candidate} --metrics-out metrics.json\n")
+    write_text(task_dir / "frontier_eval" / "eval_cwd.txt", ".\n")
+    write_text(task_dir / "frontier_eval" / "agent_files.txt", "Task.md\nTask_zh-CN.md\nREADME.md\nbaseline/solution.py\nruntime/problem.py\nreferences/source_manifest.md\n")
+    write_text(task_dir / "frontier_eval" / "readonly_files.txt", "runtime/problem.py\nruntime/instance.json\nverification/evaluator.py\nbaseline/solution.py\nreferences/source_manifest.md\n")
+    write_text(task_dir / "frontier_eval" / "constraints.txt", CONSTRAINTS)
+
+
+def main() -> None:
+    repo_root = Path(__file__).resolve().parents[1]
+    instances = load_instances()
+    for task in TASKS:
+        instance_name = task["instance_name"]
+        bootstrap_task(repo_root, task, instances[instance_name])
+        print(f"bootstrapped OperationsResearch/{task['slug']}")
+
+
+if __name__ == "__main__":
+    main()
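Reviewer note: the constraints written to `frontier_eval/constraints.txt` require neighborhood tasks to define `score_move(move, state)` and to return only finite scalars. The sketch below shows one way such a function could look. It assumes, based only on the sort key used in `run_local_search`, that each move dict carries `delta_duration` and `machine_position`. The weighting and the `rank_moves` helper are hypothetical and only illustrate that higher-scoring moves are tried first.

```python
import math


def score_move(move, state):
    # Hypothetical heuristic: favor a larger delta_duration, with a small
    # bias toward swaps earlier in the machine sequence.
    score = float(move["delta_duration"]) - 0.01 * float(move["machine_position"])
    # The constraints ask for finite scalars, so fall back to a neutral value.
    return score if math.isfinite(score) else 0.0


def rank_moves(moves, state):
    # Mirrors the ordering in run_local_search: sort descending by
    # (score, delta_duration, -machine_position).
    scored = [(score_move(move, state), move) for move in moves]
    scored.sort(
        key=lambda item: (item[0], item[1]["delta_duration"], -item[1]["machine_position"]),
        reverse=True,
    )
    return [move for _, move in scored]


if __name__ == "__main__":
    demo_moves = [
        {"machine_id": 0, "machine_position": 2, "delta_duration": 3},
        {"machine_id": 1, "machine_position": 0, "delta_duration": 5},
    ]
    demo_state = {"iteration": 0, "current_makespan": 100}
    for move in rank_moves(demo_moves, demo_state):
        print(move)
```

Because the local search accepts the first move that strictly reduces the makespan, the score only changes the order in which candidate swaps are evaluated, not which swaps are feasible.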
diff --git a/scripts/bootstrap_maritime_routing_benchmarks.py b/scripts/bootstrap_maritime_routing_benchmarks.py
new file mode 100644
index 00000000..3e9dffb5
--- /dev/null
+++ b/scripts/bootstrap_maritime_routing_benchmarks.py
@@ -0,0 +1,1002 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import textwrap
+from pathlib import Path
+
+
+TASKS = [
+    {
+        "slug": "FuelMinimizingShipWeatherRouting",
+        "title": "Fuel-Minimizing Ship Weather Routing",
+        "short": "Route a ship across a frozen coastal grid while minimizing total fuel consumption under synthetic wind and current fields.",
+        "source_manifest": """\
+# Source Manifest
+
+- Upstream lineage:
+  - 52North `WeatherRoutingTool` repository and README
+  - Fuel-aware ship routing under weather-dependent operating conditions
+- License lineage: upstream code lineage is MIT.
+- Data provenance: this benchmark does not redistribute upstream weather rasters. Instead it uses a benchmark-local synthetic coastal grid and deterministic wind/current fields generated directly in `runtime/problem.py`.
+- Authenticity note: the problem structure follows the upstream `WeatherRoutingTool` lineage, while the environment data is a frozen synthetic stand-in chosen for offline reproducibility.
+- Transformation path: no external preprocessing pipeline exists. The map, land mask, current field, and wind field are generated from fixed formulas and constants inside the benchmark runtime.
+""",
+        "readme_zh": "在固定海岸网格上，为船舶规划一条从起点到终点的航线，在合成风场与流场下最小化总燃油消耗。",
+        "task_md": """\
+# __TITLE__ Task
+
+## Objective
+
+__SHORT__
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def solve(instance):
+    ...
+```
+
+Return either a list of `(x, y)` grid cells or a dict whose `path` key holds such a list.
+
+The path must:
+
+1. Start at `instance["start"]`
+2. End at `instance["goal"]`
+3. Move exactly one cell horizontally or vertically per step (no diagonal moves and no waiting in place)
+4. Stay on navigable water cells
+
+## Fixed World Model
+
+- The map, start/goal pair, synthetic wind field, and synthetic current field are fixed in `runtime/problem.py`.
+- The upstream lineage is weather-aware ship routing from `WeatherRoutingTool`, but the actual grid data here is benchmark-local synthetic data with a fixed generator.
+
+## Evaluation
+
+The evaluator will:
+
+1. Load the frozen routing instance
+2. Validate your path mechanically
+3. Compute total fuel use and travel time along the path
+4. Report the shortest-hop baseline and Dijkstra reference metrics for context; the score itself depends only on candidate fuel
+
+## Metrics
+
+- `combined_score`: `-candidate_fuel` (stays at `-1e18` when the route is invalid)
+- `valid`: `1.0` if the route is feasible, otherwise `0.0`
+- `candidate_fuel`
+- `baseline_fuel`
+- `reference_fuel`
+- `candidate_time_h`
+- `baseline_time_h`
+""",
+        "task_zh": """\
+# __TITLE__ 任务
+
+## 目标
+
+__SHORT__
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def solve(instance):
+    ...
+```
+
+返回值可以是路径列表，也可以是包含 `path` 键的字典。
+
+路径必须：
+
+1. 从 `instance["start"]` 出发
+2. 以 `instance["goal"]` 结束
+3. 只能在相邻网格之间移动
+4. 不能进入陆地或不可航行水域
+
+## 固定世界模型
+
+- 地图、起终点、合成风场与合成流场都固定在 `runtime/problem.py` 中。
+- 上游算法谱系来自 `WeatherRoutingTool`，但这里的环境数据是 benchmark 内部固定生成的 synthetic asset。
+
+## 评测方式
+
+评测器会：
+
+1. 载入固定实例
+2. 机械检查路径可行性
+3. 计算该路径的总燃油和总航时
+4. 记录最短步数 baseline 与 Dijkstra 参考值作为诊断信息，同时直接以候选燃油目标打分
+
+## 指标
+
+- `combined_score`：`-candidate_fuel`
+- `valid`：只有路径可行时才为 `1.0`
+- `candidate_fuel`
+- `baseline_fuel`
+- `reference_fuel`
+- `candidate_time_h`
+- `baseline_time_h`
+""",
+        "runtime": """\
+from __future__ import annotations
+
+from collections import deque
+import math
+from typing import Any
+
+
+WIDTH = 20
+HEIGHT = 10
+START = (1, 4)
+GOAL = (18, 4)
+
+
+def is_land(cell: tuple[int, int]) -> bool:
+    x, y = cell
+    return 8 <= x <= 12 and 2 <= y <= 6
+
+
+def is_water(cell: tuple[int, int]) -> bool:
+    x, y = cell
+    return 0 <= x < WIDTH and 0 <= y < HEIGHT and not is_land(cell)
+
+
+def _render_grid() -> tuple[str, ...]:
+    rows = []
+    for y in range(HEIGHT):
+        chars = []
+        for x in range(WIDTH):
+            cell = (x, y)
+            if cell == START:
+                chars.append("S")
+            elif cell == GOAL:
+                chars.append("G")
+            elif is_land(cell):
+                chars.append("#")
+            else:
+                chars.append(".")
+        rows.append("".join(chars))
+    return tuple(rows)
+
+
+GRID = _render_grid()
+
+
+def current_at(cell: tuple[int, int]) -> tuple[float, float]:
+    x, y = cell
+    east = 0.04 * math.sin(0.45 * x)
+    north = 0.02 * math.cos(0.35 * x)
+    if y <= 2:
+        return (-0.32 + east, north)
+    if y >= 6:
+        return (0.26 + east, -north)
+    return (0.04 + east, 0.01 * math.sin(0.25 * x))
+
+
+def wind_at(cell: tuple[int, int]) -> tuple[float, float]:
+    x, y = cell
+    side = 0.04 * math.sin(0.3 * x)
+    if y <= 2:
+        return (-0.60, side)
+    if y >= 6:
+        return (0.22, -side)
+    return (-0.08, 0.02 * math.cos(0.2 * x))
+
+
+def _field_to_rows(field_fn) -> tuple[tuple[tuple[float, float], ...], ...]:
+    rows = []
+    for y in range(HEIGHT):
+        row = []
+        for x in range(WIDTH):
+            row.append(tuple(round(v, 4) for v in field_fn((x, y))))
+        rows.append(tuple(row))
+    return tuple(rows)
+
+
+CURRENT_FIELD = _field_to_rows(current_at)
+WIND_FIELD = _field_to_rows(wind_at)
+
+
+def load_instance() -> dict[str, Any]:
+    return {
+        "grid": GRID,
+        "start": START,
+        "goal": GOAL,
+        "current_field": CURRENT_FIELD,
+        "wind_field": WIND_FIELD,
+        "objective": "fuel",
+    }
+
+
+def _to_cell(value: Any) -> tuple[int, int]:
+    if not isinstance(value, (tuple, list)) or len(value) != 2:
+        raise ValueError("cell must be a length-2 sequence")
+    return int(round(float(value[0]))), int(round(float(value[1])))
+
+
+def extract_path(value: Any) -> list[tuple[int, int]]:
+    if isinstance(value, dict):
+        if "path" not in value:
+            raise ValueError("missing path")
+        value = value["path"]
+    path = [_to_cell(cell) for cell in value]
+    if not path:
+        raise ValueError("path is empty")
+    return path
+
+
+def neighbors(cell: tuple[int, int], directions=((0, -1), (1, 0), (0, 1), (-1, 0))) -> list[tuple[int, int]]:
+    x, y = cell
+    result = []
+    for dx, dy in directions:
+        nxt = (x + dx, y + dy)
+        if is_water(nxt):
+            result.append(nxt)
+    return result
+
+
+def validate_path(value: Any) -> list[tuple[int, int]]:
+    path = extract_path(value)
+    if path[0] != START:
+        raise ValueError("path must start at START")
+    if path[-1] != GOAL:
+        raise ValueError("path must end at GOAL")
+    for cell in path:
+        if not is_water(cell):
+            raise ValueError("path enters land or leaves the map")
+    for prev, curr in zip(path, path[1:]):
+        dx = abs(curr[0] - prev[0])
+        dy = abs(curr[1] - prev[1])
+        if dx + dy != 1:
+            raise ValueError("path contains a non-adjacent move")
+    return path
+
+
+def _leg_metrics(prev: tuple[int, int], curr: tuple[int, int]) -> tuple[float, float]:
+    dx = curr[0] - prev[0]
+    dy = curr[1] - prev[1]
+    current_u, current_v = current_at(prev)
+    wind_u, wind_v = wind_at(prev)
+    current_along = current_u * dx + current_v * dy
+    wind_along = wind_u * dx + wind_v * dy
+    headwind = max(0.0, -wind_along)
+    crosswind = abs(-dy * wind_u + dx * wind_v)
+    speed = max(0.35, 1.0 + 0.65 * current_along - 0.45 * headwind)
+    leg_time_h = 1.0 / speed
+    fuel_rate = 1.05 + 0.55 * headwind + 0.20 * crosswind + 0.25 * max(0.0, -current_along)
+    leg_fuel = leg_time_h * fuel_rate
+    return leg_fuel, leg_time_h
+
+
+def route_metrics(value: Any) -> dict[str, float]:
+    path = validate_path(value)
+    total_fuel = 0.0
+    total_time_h = 0.0
+    for prev, curr in zip(path, path[1:]):
+        leg_fuel, leg_time_h = _leg_metrics(prev, curr)
+        total_fuel += leg_fuel
+        total_time_h += leg_time_h
+    return {
+        "fuel": float(total_fuel),
+        "time_h": float(total_time_h),
+        "hops": float(len(path) - 1),
+    }
+
+
+def _retrace(parent, node):
+    path = []
+    current = node
+    while current is not None:
+        path.append(current)
+        current = parent[current]
+    return path[::-1]
+
+
+def baseline_path() -> list[tuple[int, int]]:
+    queue = deque([START])
+    parent = {START: None}
+    while queue:
+        current = queue.popleft()
+        if current == GOAL:
+            return _retrace(parent, current)
+        for nxt in neighbors(current):
+            if nxt not in parent:
+                parent[nxt] = current
+                queue.append(nxt)
+    raise RuntimeError("baseline path not found")
+
+
+BASELINE_PATH = baseline_path()
+BASELINE_FUEL = route_metrics(BASELINE_PATH)["fuel"]
+BASELINE_TIME_H = route_metrics(BASELINE_PATH)["time_h"]
+REFERENCE_FUEL = 21.839377308460037
+REFERENCE_TIME_H = 20.501439186435814
+""",
+    },
+    {
+        "slug": "DynamicCurrentMinimumTimeRouting",
+        "title": "Dynamic-Current Minimum-Time Routing",
+        "short": "Route a ship across a frozen coastal grid while minimizing travel time under deterministic current and depth fields.",
+        "source_manifest": """\
+# Source Manifest
+
+- Upstream lineage:
+  - TU Delft CITG `HALEM` repository and README
+  - Time-optimal ship routing with dynamic currents, variable velocity, and minimum-water-depth constraints
+- License lineage: upstream code lineage is MIT.
+- Data provenance: this benchmark does not vendor upstream hydrographic files. It uses a benchmark-local synthetic coastal grid, synthetic current field, and synthetic depth raster generated directly in `runtime/problem.py`.
+- Authenticity note: the routing objective and minimum-depth constraint follow the upstream `HALEM` lineage, while the environmental data is a frozen synthetic stand-in for offline reproducibility.
+- Transformation path: no external preprocessing pipeline exists. All fields are generated from fixed formulas and constants inside the benchmark runtime.
+""",
+        "readme_zh": "在固定海岸网格上，为船舶规划一条最短航时路线。环境包含 deterministic 流场与 depth raster，并强制最小水深约束。",
+        "task_md": """\
+# __TITLE__ Task
+
+## Objective
+
+__SHORT__
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def solve(instance):
+    ...
+```
+
+Return either a list of `(x, y)` grid cells or a dict whose `path` key holds such a list.
+
+The path must:
+
+1. Start at `instance["start"]`
+2. End at `instance["goal"]`
+3. Move exactly one cell horizontally or vertically per step (no diagonal moves and no waiting in place)
+4. Stay on water cells with depth at least `instance["min_depth"]`
+
+## Fixed World Model
+
+- The map, synthetic current field, and synthetic depth raster are fixed in `runtime/problem.py`.
+- The upstream lineage is dynamic-current minimum-time routing from `HALEM`, but the actual environmental data here is benchmark-local synthetic data with a fixed generator.
+
+## Evaluation
+
+The evaluator will:
+
+1. Load the frozen routing instance
+2. Validate your path against the land mask and minimum-depth rule
+3. Compute travel time along the route
+4. Report the shortest-hop baseline and Dijkstra reference metrics for context; the score itself depends only on candidate travel time
+
+## Metrics
+
+- `combined_score`: `-candidate_time_h` (stays at `-1e18` when the route is invalid)
+- `valid`: `1.0` if the route is feasible, otherwise `0.0`
+- `candidate_time_h`
+- `baseline_time_h`
+- `reference_time_h`
+- `candidate_hops`
+- `baseline_hops`
+""",
+        "task_zh": """\
+# __TITLE__ 任务
+
+## 目标
+
+__SHORT__
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def solve(instance):
+    ...
+```
+
+返回值可以是路径列表，也可以是包含 `path` 键的字典。
+
+路径必须：
+
+1. 从 `instance["start"]` 出发
+2. 以 `instance["goal"]` 结束
+3. 只能在相邻网格之间移动
+4. 只能经过水深不小于 `instance["min_depth"]` 的可航行网格
+
+## 固定世界模型
+
+- 地图、合成流场和合成 depth raster 都固定在 `runtime/problem.py` 中。
+- 上游算法谱系来自 `HALEM`，但这里的环境数据是 benchmark 内部固定生成的 synthetic asset。
+
+## 评测方式
+
+评测器会：
+
+1. 载入固定实例
+2. 按陆地与最小水深约束检查路径可行性
+3. 计算总航时
+4. 记录最短步数 baseline 与 Dijkstra 参考值作为诊断信息，同时直接以候选航时目标打分
+
+## 指标
+
+- `combined_score`：`-candidate_time_h`
+- `valid`：只有路径可行时才为 `1.0`
+- `candidate_time_h`
+- `baseline_time_h`
+- `reference_time_h`
+- `candidate_hops`
+- `baseline_hops`
+""",
+        "runtime": """\
+from __future__ import annotations
+
+from collections import deque
+import math
+from typing import Any
+
+
+WIDTH = 20
+HEIGHT = 10
+START = (1, 4)
+GOAL = (18, 4)
+MIN_DEPTH = 2.5
+
+
+def is_land(cell: tuple[int, int]) -> bool:
+    x, y = cell
+    return 8 <= x <= 12 and 2 <= y <= 6
+
+
+def depth_at(cell: tuple[int, int]) -> float:
+    x, y = cell
+    if is_land(cell):
+        return 0.0
+    depth = 3.8
+    if y == 1 and 7 <= x <= 13:
+        depth = 2.7
+    if y == 6 and 2 <= x <= 5:
+        depth = 2.2
+    if y == 7 and 3 <= x <= 6:
+        depth = 2.4
+    return depth
+
+
+def is_navigable(cell: tuple[int, int]) -> bool:
+    x, y = cell
+    return 0 <= x < WIDTH and 0 <= y < HEIGHT and not is_land(cell) and depth_at(cell) >= MIN_DEPTH
+
+
+def _render_grid() -> tuple[str, ...]:
+    rows = []
+    for y in range(HEIGHT):
+        chars = []
+        for x in range(WIDTH):
+            cell = (x, y)
+            if cell == START:
+                chars.append("S")
+            elif cell == GOAL:
+                chars.append("G")
+            elif is_land(cell):
+                chars.append("#")
+            elif depth_at(cell) < MIN_DEPTH:
+                chars.append("~")
+            else:
+                chars.append(".")
+        rows.append("".join(chars))
+    return tuple(rows)
+
+
+GRID = _render_grid()
+
+
+def current_at(cell: tuple[int, int]) -> tuple[float, float]:
+    x, y = cell
+    ripple = 0.03 * math.sin(0.4 * x)
+    if y <= 2:
+        return (-0.36 + ripple, 0.01 * math.cos(0.3 * x))
+    if y >= 7:
+        return (0.44 + ripple, -0.01 * math.cos(0.3 * x))
+    return (-0.05 + ripple, 0.02 * math.sin(0.2 * x))
+
+
+def _field_to_rows(field_fn) -> tuple[tuple[Any, ...], ...]:
+    rows = []
+    for y in range(HEIGHT):
+        row = []
+        for x in range(WIDTH):
+            value = field_fn((x, y))
+            if isinstance(value, tuple):
+                row.append(tuple(round(v, 4) for v in value))
+            else:
+                row.append(round(float(value), 4))
+        rows.append(tuple(row))
+    return tuple(rows)
+
+
+CURRENT_FIELD = _field_to_rows(current_at)
+DEPTH_FIELD = _field_to_rows(depth_at)
+
+
+def load_instance() -> dict[str, Any]:
+    return {
+        "grid": GRID,
+        "start": START,
+        "goal": GOAL,
+        "current_field": CURRENT_FIELD,
+        "depth_field": DEPTH_FIELD,
+        "min_depth": MIN_DEPTH,
+        "objective": "time",
+    }
+
+
+def _to_cell(value: Any) -> tuple[int, int]:
+    if not isinstance(value, (tuple, list)) or len(value) != 2:
+        raise ValueError("cell must be a length-2 sequence")
+    return int(round(float(value[0]))), int(round(float(value[1])))
+
+
+def extract_path(value: Any) -> list[tuple[int, int]]:
+    if isinstance(value, dict):
+        if "path" not in value:
+            raise ValueError("missing path")
+        value = value["path"]
+    path = [_to_cell(cell) for cell in value]
+    if not path:
+        raise ValueError("path is empty")
+    return path
+
+
+def neighbors(cell: tuple[int, int], directions=((0, -1), (1, 0), (0, 1), (-1, 0))) -> list[tuple[int, int]]:
+    x, y = cell
+    result = []
+    for dx, dy in directions:
+        nxt = (x + dx, y + dy)
+        if is_navigable(nxt):
+            result.append(nxt)
+    return result
+
+
+def validate_path(value: Any) -> list[tuple[int, int]]:
+    path = extract_path(value)
+    if path[0] != START:
+        raise ValueError("path must start at START")
+    if path[-1] != GOAL:
+        raise ValueError("path must end at GOAL")
+    for cell in path:
+        if not is_navigable(cell):
+            raise ValueError("path enters land, leaves the map, or violates minimum depth")
+    for prev, curr in zip(path, path[1:]):
+        dx = abs(curr[0] - prev[0])
+        dy = abs(curr[1] - prev[1])
+        if dx + dy != 1:
+            raise ValueError("path contains a non-adjacent move")
+    return path
+
+
+def _leg_time(prev: tuple[int, int], curr: tuple[int, int]) -> float:
+    dx = curr[0] - prev[0]
+    dy = curr[1] - prev[1]
+    current_u, current_v = current_at(prev)
+    current_along = current_u * dx + current_v * dy
+    depth = depth_at(curr)
+    shallow_penalty = max(0.0, 3.0 - depth) * 0.22
+    speed = max(0.25, 1.0 + 0.9 * current_along - shallow_penalty)
+    return 1.0 / speed
+
+
+def route_metrics(value: Any) -> dict[str, float]:
+    path = validate_path(value)
+    total_time_h = 0.0
+    for prev, curr in zip(path, path[1:]):
+        total_time_h += _leg_time(prev, curr)
+    return {
+        "time_h": float(total_time_h),
+        "hops": float(len(path) - 1),
+    }
+
+
+def _retrace(parent, node):
+    path = []
+    current = node
+    while current is not None:
+        path.append(current)
+        current = parent[current]
+    return path[::-1]
+
+
+def baseline_path() -> list[tuple[int, int]]:
+    queue = deque([START])
+    parent = {START: None}
+    while queue:
+        current = queue.popleft()
+        if current == GOAL:
+            return _retrace(parent, current)
+        for nxt in neighbors(current):
+            if nxt not in parent:
+                parent[nxt] = current
+                queue.append(nxt)
+    raise RuntimeError("baseline path not found")
+
+
+BASELINE_PATH = baseline_path()
+BASELINE_TIME_H = route_metrics(BASELINE_PATH)["time_h"]
+BASELINE_HOPS = route_metrics(BASELINE_PATH)["hops"]
+REFERENCE_TIME_H = 20.012194145529936
+REFERENCE_HOPS = 23.0
+""",
+    },
+]
+
+
+README_TEMPLATE = """\
+# __TITLE__
+
+__SHORT__
+
+## Provenance
+
+- Provenance class: `benchmark-local synthetic environment with traceable upstream routing lineage`
+- Upstream lineage: see `references/source_manifest.md`
+- Data asset: fixed synthetic coastal grid and deterministic environmental fields embedded in `runtime/problem.py`
+- Redistribution status: no upstream environmental rasters are vendored
+
+## File Layout
+
+- `Task.md`: task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation
+- `README_zh-CN.md`: Chinese overview
+- `scripts/init.py`: initial candidate file exposed to agents
+- `baseline/solution.py`: reference baseline
+- `runtime/problem.py`: frozen instance generator, validation logic, and reference costs
+- `verification/evaluator.py`: evaluator entry point
+- `references/source_manifest.md`: provenance notes
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/OperationsResearch/__SLUG__/verification/evaluator.py \\
+  benchmarks/OperationsResearch/__SLUG__/scripts/init.py \\
+  --metrics-out /tmp/__SLUG___metrics.json
+```
+"""
+
+
+README_ZH_TEMPLATE = """\
+# __TITLE__
+
+__README_ZH__
+
+## 说明
+
+- 数据来源类型：`benchmark-local synthetic environment with traceable upstream routing lineage`
+- 完整来源说明见 `references/source_manifest.md`
+- 所有固定地图、场和约束都内嵌在 `runtime/problem.py`
+"""
+
+
+BASELINE_TEMPLATE = """\
+from __future__ import annotations
+
+try:
+    from benchmarks.OperationsResearch.__SLUG__.runtime.problem import baseline_path
+except ModuleNotFoundError:
+    from runtime.problem import baseline_path
+
+
+def solve(instance):
+    return baseline_path()
+"""
+
+
+INIT_TEMPLATE = """\
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.__SLUG__.baseline.solution import solve as _baseline_solve
+    from benchmarks.OperationsResearch.__SLUG__.runtime.problem import load_instance, route_metrics
+except ModuleNotFoundError:
+    from baseline.solution import solve as _baseline_solve
+    from runtime.problem import load_instance, route_metrics
+
+
+# EVOLVE-BLOCK-START
+def solve(instance):
+    return _baseline_solve(instance)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    result = solve(load_instance())
+    print(route_metrics(result))
+"""
+
+
+EVALUATOR_FUEL = """\
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.__SLUG__.baseline.solution import solve as baseline_solve
+    from benchmarks.OperationsResearch.__SLUG__.runtime.problem import BASELINE_FUEL, BASELINE_TIME_H, REFERENCE_FUEL, load_instance, route_metrics
+except ModuleNotFoundError:
+    from baseline.solution import solve as baseline_solve
+    from runtime.problem import BASELINE_FUEL, BASELINE_TIME_H, REFERENCE_FUEL, load_instance, route_metrics
+
+
+def evaluate(program_path: str):
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_fuel": 0.0,
+        "baseline_fuel": float(BASELINE_FUEL),
+        "reference_fuel": float(REFERENCE_FUEL),
+        "candidate_time_h": 0.0,
+        "baseline_time_h": float(BASELINE_TIME_H),
+    }
+    artifacts = {}
+    namespace = runpy.run_path(str(Path(program_path).expanduser().resolve()), run_name="candidate_program")
+    solve_fn = namespace.get("solve")
+    if not callable(solve_fn):
+        artifacts["error_message"] = "candidate must define solve(instance)"
+        return metrics, artifacts
+
+    instance = load_instance()
+    try:
+        baseline_metrics = route_metrics(baseline_solve(instance))
+        candidate_metrics = route_metrics(solve_fn(instance))
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+
+    candidate_fuel = float(candidate_metrics["fuel"])
+    candidate_time_h = float(candidate_metrics["time_h"])
+    if not math.isfinite(candidate_fuel) or candidate_fuel <= 0:
+        artifacts["error_message"] = "candidate fuel is invalid"
+        return metrics, artifacts
+
+    metrics["valid"] = 1.0
+    metrics["candidate_fuel"] = candidate_fuel
+    metrics["candidate_time_h"] = candidate_time_h
+    metrics["baseline_fuel"] = float(baseline_metrics["fuel"])
+    metrics["baseline_time_h"] = float(baseline_metrics["time_h"])
+    metrics["combined_score"] = -candidate_fuel
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
+"""
+
+
+EVALUATOR_TIME = """\
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.OperationsResearch.__SLUG__.baseline.solution import solve as baseline_solve
+    from benchmarks.OperationsResearch.__SLUG__.runtime.problem import BASELINE_HOPS, BASELINE_TIME_H, REFERENCE_TIME_H, load_instance, route_metrics
+except ModuleNotFoundError:
+    from baseline.solution import solve as baseline_solve
+    from runtime.problem import BASELINE_HOPS, BASELINE_TIME_H, REFERENCE_TIME_H, load_instance, route_metrics
+
+
+def evaluate(program_path: str):
+    metrics = {
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_time_h": 0.0,
+        "baseline_time_h": float(BASELINE_TIME_H),
+        "reference_time_h": float(REFERENCE_TIME_H),
+        "candidate_hops": 0.0,
+        "baseline_hops": float(BASELINE_HOPS),
+    }
+    artifacts = {}
+    namespace = runpy.run_path(str(Path(program_path).expanduser().resolve()), run_name="candidate_program")
+    solve_fn = namespace.get("solve")
+    if not callable(solve_fn):
+        artifacts["error_message"] = "candidate must define solve(instance)"
+        return metrics, artifacts
+
+    instance = load_instance()
+    try:
+        baseline_metrics = route_metrics(baseline_solve(instance))
+        candidate_metrics = route_metrics(solve_fn(instance))
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+
+    candidate_time_h = float(candidate_metrics["time_h"])
+    if not math.isfinite(candidate_time_h) or candidate_time_h <= 0:
+        artifacts["error_message"] = "candidate time is invalid"
+        return metrics, artifacts
+
+    metrics["valid"] = 1.0
+    metrics["candidate_time_h"] = candidate_time_h
+    metrics["candidate_hops"] = float(candidate_metrics["hops"])
+    metrics["baseline_time_h"] = float(baseline_metrics["time_h"])
+    metrics["baseline_hops"] = float(baseline_metrics["hops"])
+    metrics["combined_score"] = -candidate_time_h
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
+"""
+
+
+def render(template: str, **values: str) -> str:
+    result = textwrap.dedent(template).rstrip() + "\n"
+    for key, value in values.items():
+        result = result.replace(f"__{key}__", value)
+    return result
+
+
+def write(path: Path, content: str) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(textwrap.dedent(content).rstrip() + "\n", encoding="utf-8")
+
+
+def frontier_eval_files() -> dict[str, str]:
+    return {
+        "frontier_eval/agent_files.txt": "Task.md\nTask_zh-CN.md\nREADME.md\nbaseline/solution.py\nruntime/problem.py\n",
+        "frontier_eval/candidate_destination.txt": "scripts/init.py\n",
+        "frontier_eval/constraints.txt": (
+            "Edit only `scripts/init.py`.\n"
+            "Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.\n"
+            "Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.\n"
+            "Keep outputs valid and finite.\n"
+        ),
+        "frontier_eval/eval_command.txt": "{python} verification/evaluator.py {candidate} --metrics-out metrics.json\n",
+        "frontier_eval/eval_cwd.txt": ".\n",
+        "frontier_eval/initial_program.txt": "scripts/init.py\n",
+        "frontier_eval/readonly_files.txt": (
+            "baseline/solution.py\n"
+            "runtime/problem.py\n"
+            "verification/evaluator.py\n"
+            "references/source_manifest.md\n"
+        ),
+    }
+
+
+def main() -> None:
+    repo_root = Path(__file__).resolve().parents[1]
+    domain_root = repo_root / "benchmarks" / "OperationsResearch"
+    for task in TASKS:
+        root = domain_root / task["slug"]
+        values = {
+            "TITLE": task["title"],
+            "SHORT": task["short"],
+            "SLUG": task["slug"],
+            "README_ZH": task["readme_zh"],
+        }
+        write(root / "README.md", render(README_TEMPLATE, **values))
+        write(root / "README_zh-CN.md", render(README_ZH_TEMPLATE, **values))
+        write(root / "Task.md", render(task["task_md"], **values))
+        write(root / "Task_zh-CN.md", render(task["task_zh"], **values))
+        write(root / "references" / "source_manifest.md", task["source_manifest"])
+        write(root / "scripts" / "init.py", render(INIT_TEMPLATE, **values))
+        write(root / "baseline" / "solution.py", render(BASELINE_TEMPLATE, **values))
+        write(root / "runtime" / "problem.py", task["runtime"])
+        evaluator_template = EVALUATOR_FUEL if task["slug"] == "FuelMinimizingShipWeatherRouting" else EVALUATOR_TIME
+        write(root / "verification" / "evaluator.py", render(evaluator_template, **values))
+        write(root / "verification" / "requirements.txt", "")
+        for relative, content in frontier_eval_files().items():
+            write(root / relative, content)
+
+
+if __name__ == "__main__":
+    main()
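
Note on the `render` helper in the script above: it substitutes `__KEY__` tokens with `str.replace` instead of calling `str.format`, so templates can contain literal braces without doubling them. The sketch below repeats that helper on a made-up template to show the effect; `TEMPLATE` and the `Demo Task` value are illustrative assumptions, not part of the PR.

```python
import textwrap


def render(template: str, **values: str) -> str:
    # Same logic as the helper in the diff: dedent, normalize the trailing
    # newline, then replace each __KEY__ token with its value.
    result = textwrap.dedent(template).rstrip() + "\n"
    for key, value in values.items():
        result = result.replace(f"__{key}__", value)
    return result


# Hypothetical template: the dict literal and f-string braces stay untouched.
TEMPLATE = """\
    # __TITLE__
    METRICS = {"valid": 0.0}
    print(f"{METRICS}")
"""

rendered = render(TEMPLATE, TITLE="Demo Task")
print(rendered)
```

The pyMOTO bootstrap script that follows takes the other route (`str.format`), which is why its runtime and evaluator templates write `{{` and `}}` around every dict literal.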
diff --git a/scripts/bootstrap_pymoto_benchmarks.py b/scripts/bootstrap_pymoto_benchmarks.py
new file mode 100644
index 00000000..c8a4f4d0
--- /dev/null
+++ b/scripts/bootstrap_pymoto_benchmarks.py
@@ -0,0 +1,762 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import json
+import textwrap
+from pathlib import Path
+
+
+TASKS = [
+    {
+        "domain": "StructuralOptimization",
+        "slug": "CantileverComplianceTopologyOptimization",
+        "title": "Cantilever Compliance Topology Optimization",
+        "short": "Minimize compliance on a frozen cantilever beam using pyMOTO's SIMP formulation and a fixed material budget.",
+        "geometry": "cantilever",
+        "provenance_class": "official-example-derived",
+        "problem": {
+            "geometry": "cantilever",
+            "nx": 36,
+            "ny": 12,
+            "volume_fraction": 0.45,
+            "minimum_density": 0.001,
+            "filter_radius": 1.5,
+            "penalty_power": 3.0,
+            "move_limit": 0.2,
+            "max_iterations": 30,
+            "load_scale": 1.0,
+        },
+        "source_manifest": """\
+# Source Manifest
+
+- Upstream solver/formulation: `pyMOTO`
+- Upstream files:
+  - `examples/topology_optimization/ex_compliance.py`
+  - `examples/topology_optimization/ex_compliance_69line.py`
+- Geometry provenance: clamped-left cantilever with a point load at the free side, directly aligned with the official pyMOTO compliance examples.
+- Frozen benchmark status: this repository vendors only a reduced-size local instance and fixed solver settings; there is no external data file.
+- License lineage: pyMOTO is released under the MIT License.
+- Provenance class: official-example-derived frozen instance.
+""",
+    },
+    {
+        "domain": "StructuralOptimization",
+        "slug": "MBBBeamTopologyOptimization",
+        "title": "MBB Beam Topology Optimization",
+        "short": "Minimize compliance on a frozen half-MBB beam using pyMOTO's SIMP formulation and a fixed material budget.",
+        "geometry": "mbb_half",
+        "provenance_class": "literature-derived canonical geometry",
+        "problem": {
+            "geometry": "mbb_half",
+            "nx": 48,
+            "ny": 16,
+            "volume_fraction": 0.50,
+            "minimum_density": 0.001,
+            "filter_radius": 1.5,
+            "penalty_power": 3.0,
+            "move_limit": 0.2,
+            "max_iterations": 30,
+            "load_scale": 1.0,
+        },
+        "source_manifest": """\
+# Source Manifest
+
+- Upstream solver/formulation: `pyMOTO`
+- Upstream files:
+  - `examples/topology_optimization/ex_compliance.py`
+  - `examples/topology_optimization/ex_self_weight.py` (`bc == 2` names the MBB-beam support style)
+- Geometry provenance: the standard half-MBB beam benchmark lineage used in density-based topology optimization, including Sigmund (2001), "A 99 line topology optimization code written in Matlab".
+- Frozen benchmark status: this repository vendors a reduced-size local half-MBB instance with fixed symmetry/support conditions and a fixed point load.
+- License lineage: pyMOTO is released under the MIT License.
+- Provenance class: literature-derived canonical geometry, locally frozen.
+""",
+    },
+    {
+        "domain": "StructuralOptimization",
+        "slug": "BridgeTopologyOptimization",
+        "title": "Bridge Topology Optimization",
+        "short": "Minimize compliance on a frozen bridge-like topology optimization case with a passive-solid deck and distributed load.",
+        "geometry": "bridge_half",
+        "provenance_class": "traceable literature-derived local instance",
+        "problem": {
+            "geometry": "bridge_half",
+            "nx": 48,
+            "ny": 16,
+            "volume_fraction": 0.45,
+            "minimum_density": 0.001,
+            "filter_radius": 1.5,
+            "penalty_power": 3.0,
+            "move_limit": 0.2,
+            "max_iterations": 30,
+            "load_scale": 1.0,
+            "passive_solid_top_rows": 1,
+        },
+        "source_manifest": """\
+# Source Manifest
+
+- Upstream solver/formulation: `pyMOTO`
+- Upstream files:
+  - `examples/topology_optimization/ex_compliance.py`
+  - `examples/topology_optimization/ex_compliance_69line.py`
+- Geometry provenance: a frozen bridge-like case derived from the standard bridge-structure topology-optimization literature, including the "symmetric half of a bridge structure" discussion in Couri et al. (2024), *One-shot procedures for topology optimization: a comparative study*, with a passive-solid deck row added so the distributed load has an explicit load-bearing support region.
+- Frozen benchmark status: this repository vendors a traceable local instance; it is not claimed to be an official upstream data file.
+- License lineage: pyMOTO is released under the MIT License.
+- Provenance class: traceable literature-derived local instance.
+""",
+    },
+]
+
+
+README_TEMPLATE = """\
+# {title}
+
+{short}
+
+## Provenance
+
+- Provenance class: `{provenance_class}`
+- Frozen geometry: `{geometry}`
+- Solver lineage: `pyMOTO` compliance + SIMP density optimization
+- Full provenance note: see `references/source_manifest.md`
+
+## File Layout
+
+- `Task.md`: task contract and scoring rules.
+- `Task_zh-CN.md`: Chinese translation.
+- `scripts/init.py`: initial candidate file exposed to agents.
+- `baseline/solution.py`: OC-style baseline update rule.
+- `runtime/problem.py`: frozen physics, constraints, and optimization loop.
+- `verification/evaluator.py`: evaluator entry.
+- `references/source_manifest.md`: source and provenance notes.
+
+## Quick Run
+
+From repository root:
+
+```bash
+python \\
+  benchmarks/{domain}/{slug}/verification/evaluator.py \\
+  benchmarks/{domain}/{slug}/scripts/init.py \\
+  --metrics-out /tmp/{slug}_metrics.json
+```
+
+Run with `frontier_eval`:
+
+```bash
+python -m frontier_eval \\
+  task=unified \\
+  task.benchmark={domain}/{slug} \\
+  task.runtime.use_conda_run=false \\
+  task.runtime.python_path=/path/to/python \\
+  algorithm.iterations=0
+```
+"""
+
+
+TASK_TEMPLATE = """\
+# {title} Task
+
+## Objective
+
+{short}
+
+The benchmark freezes one pyMOTO-based structural optimization case in `runtime/problem.py`.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def update_density(density, sensitivity, state):
+    ...
+```
+
+Inputs:
+
+- `density`: current density vector as a NumPy array of shape `(nel,)`
+- `sensitivity`: current compliance sensitivity with respect to the design vector, as a NumPy array with the same shape as `density`
+- `state`: a dict containing:
+  - `iteration`
+  - `domain_shape`
+  - `volume_fraction`
+  - `target_density_sum`
+  - `minimum_density`
+  - `move_limit`
+  - `current_compliance`
+  - `history`
+  - `passive_solid_mask`
+  - `passive_void_mask`
+
+The function must return the next feasible density vector. A dict with key `density` is also accepted.
+
+You may import `project_density` from `runtime.problem` if you want a helper that projects a raw proposal back onto the feasible set.
+
+## Evaluation
+
+The evaluator will:
+
+1. Build the frozen pyMOTO finite-element model.
+2. Run 30 fixed optimization iterations.
+3. Compare the baseline OC update rule against your `update_density(...)`.
+4. Reject non-finite or infeasible density updates.
+5. Report the negated final candidate compliance as the optimization score, so lower compliance ranks higher.
+
+## Metrics
+
+- `combined_score`: `-candidate_compliance`
+- `valid`: `1.0` only if every density update is finite and feasible
+- `candidate_compliance`
+- `baseline_compliance`
+- `final_volume_fraction`
+- `volume_fraction_error`
+
+## Failure Cases
+
+The submission is marked invalid (`valid = 0.0`) and keeps the sentinel `combined_score` of `-1e18` if:
+
+- `update_density()` is missing
+- any proposed density is non-finite
+- any density violates bounds, move limits, passive masks, or volume budget
+- the pyMOTO solve fails
+"""
+
+
+TASK_ZH_TEMPLATE = """\
+# {title} 任务
+
+## 目标
+
+{short}
+
+评测在 `runtime/problem.py` 中冻结了一个基于 pyMOTO 的结构拓扑优化实例。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def update_density(density, sensitivity, state):
+    ...
+```
+
+输入参数：
+
+- `density`：当前密度向量，NumPy 数组，形状为 `(nel,)`
+- `sensitivity`：当前目标函数相对于设计变量的灵敏度
+- `state`：字典，包含：
+  - `iteration`
+  - `domain_shape`
+  - `volume_fraction`
+  - `target_density_sum`
+  - `minimum_density`
+  - `move_limit`
+  - `current_compliance`
+  - `history`
+  - `passive_solid_mask`
+  - `passive_void_mask`
+
+返回值必须是下一步的可行密度向量。也接受包含 `density` 字段的字典。
+
+如果你只想先产生一个原始提案，可以从 `runtime.problem` 导入 `project_density`，把原始提案投影回可行域。
+
+## 评测方式
+
+评测器会：
+
+1. 构建固定的 pyMOTO 有限元模型。
+2. 运行固定 30 次优化迭代。
+3. 对比 baseline 的 OC 更新规则与你的 `update_density(...)`。
+4. 拒绝任何非有限或不可行的密度更新。
+5. 直接以最终 candidate compliance 作为优化分数。
+
+## 指标
+
+- `combined_score`：`-candidate_compliance`
+- `valid`：所有密度更新都有限且可行时为 `1.0`
+- `candidate_compliance`
+- `baseline_compliance`
+- `final_volume_fraction`
+- `volume_fraction_error`
+"""
+
+
+INIT_TEMPLATE = """\
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+
+
+def _is_repo_root(path: Path) -> bool:
+    return (path / "benchmarks").is_dir() and (path / "frontier_eval").is_dir()
+
+
+def _ensure_import_path() -> None:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if _is_repo_root(parent):
+            ps = str(parent)
+            if ps not in sys.path:
+                sys.path.insert(0, ps)
+            return
+    benchmark_root = here.parents[1]
+    ps = str(benchmark_root)
+    if ps not in sys.path:
+        sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.{domain}.{slug}.baseline.solution import update_density as _baseline_update_density
+except ModuleNotFoundError:
+    from baseline.solution import update_density as _baseline_update_density
+
+
+# EVOLVE-BLOCK-START
+def update_density(density, sensitivity, state):
+    return _baseline_update_density(density, sensitivity, state)
+# EVOLVE-BLOCK-END
+
+
+if __name__ == "__main__":
+    try:
+        from benchmarks.{domain}.{slug}.runtime.problem import run_optimization
+    except ModuleNotFoundError:
+        from runtime.problem import run_optimization
+
+    result = run_optimization(update_density)
+    print(result["compliance"])
+"""
+
+
+BASELINE_TEMPLATE = """\
+from __future__ import annotations
+
+try:
+    from benchmarks.{domain}.{slug}.runtime.problem import oc_update
+except ModuleNotFoundError:
+    from runtime.problem import oc_update
+
+
+def update_density(density, sensitivity, state):
+    return oc_update(density, sensitivity, state)
+"""
+
+
+RUNTIME_TEMPLATE = """\
+from __future__ import annotations
+
+import warnings
+from typing import Any
+
+import numpy as np
+import pymoto as pym
+from scipy.sparse import SparseEfficiencyWarning
+
+
+warnings.filterwarnings("ignore", category=SparseEfficiencyWarning)
+
+PROBLEM = {problem_json}
+SAMPLE_INSTANCE = {{
+    "title": "{title}",
+    "geometry": PROBLEM["geometry"],
+    "domain_shape": [PROBLEM["nx"], PROBLEM["ny"]],
+    "volume_fraction": PROBLEM["volume_fraction"],
+    "filter_radius": PROBLEM["filter_radius"],
+    "penalty_power": PROBLEM["penalty_power"],
+    "max_iterations": PROBLEM["max_iterations"],
+}}
+
+
+def load_instance() -> dict[str, Any]:
+    return dict(SAMPLE_INSTANCE)
+
+
+def _passive_masks(domain: pym.VoxelDomain) -> tuple[np.ndarray, np.ndarray]:
+    solid = np.zeros(domain.nel, dtype=bool)
+    void = np.zeros(domain.nel, dtype=bool)
+    top_rows = int(PROBLEM.get("passive_solid_top_rows", 0))
+    for offset in range(top_rows):
+        y = PROBLEM["ny"] - 1 - offset
+        solid[domain.elements[:, y, 0].reshape(-1)] = True
+    return solid, void
+
+
+def _initial_density(domain: pym.VoxelDomain, solid_mask: np.ndarray, void_mask: np.ndarray) -> np.ndarray:
+    target_sum = PROBLEM["volume_fraction"] * domain.nel
+    fixed_sum = float(np.sum(solid_mask)) + PROBLEM["minimum_density"] * float(np.sum(void_mask))
+    free_mask = ~(solid_mask | void_mask)
+    free_count = int(np.sum(free_mask))
+    if free_count == 0:
+        raise ValueError("no free design variables remain")
+    free_density = (target_sum - fixed_sum) / free_count
+    if not (PROBLEM["minimum_density"] <= free_density <= 1.0):
+        raise ValueError("target volume is infeasible for the chosen passive masks")
+    density = np.full(domain.nel, free_density, dtype=float)
+    density[solid_mask] = 1.0
+    density[void_mask] = PROBLEM["minimum_density"]
+    return density
+
+
+def _fixed_dofs(domain: pym.VoxelDomain) -> np.ndarray:
+    geometry = PROBLEM["geometry"]
+    if geometry == "cantilever":
+        left_nodes = domain.nodes[0, :].flatten()
+        return domain.get_dofnumber(left_nodes, [0, 1], 2).flatten()
+    if geometry in {{"mbb_half", "bridge_half"}}:
+        left_nodes = domain.nodes[0, :].flatten()
+        left_x = domain.get_dofnumber(left_nodes, 0, 2).flatten()
+        right_bottom = int(domain.nodes[PROBLEM["nx"], 0, 0])
+        return np.concatenate([left_x, np.array([2 * right_bottom + 1], dtype=int)])
+    raise ValueError(f"unsupported geometry: {{geometry}}")
+
+
+def _force_vector(domain: pym.VoxelDomain) -> np.ndarray:
+    f = np.zeros(domain.nnodes * 2, dtype=float)
+    geometry = PROBLEM["geometry"]
+    load = float(PROBLEM["load_scale"])
+    if geometry == "cantilever":
+        force_node = int(domain.nodes[PROBLEM["nx"], PROBLEM["ny"] // 2, 0])
+        f[2 * force_node + 1] = load
+        return f
+    if geometry == "mbb_half":
+        force_node = int(domain.nodes[0, PROBLEM["ny"], 0])
+        f[2 * force_node + 1] = -load
+        return f
+    if geometry == "bridge_half":
+        deck_nodes = domain.nodes[:, PROBLEM["ny"], 0].flatten()
+        f[2 * deck_nodes + 1] = -load / len(deck_nodes)
+        return f
+    raise ValueError(f"unsupported geometry: {{geometry}}")
+
+
+def _build_context() -> dict[str, Any]:
+    domain = pym.VoxelDomain(PROBLEM["nx"], PROBLEM["ny"])
+    fixed_dofs = _fixed_dofs(domain)
+    force = _force_vector(domain)
+    passive_solid_mask, passive_void_mask = _passive_masks(domain)
+    x0 = _initial_density(domain, passive_solid_mask, passive_void_mask)
+    signal = pym.Signal("x", state=x0.copy())
+    with pym.Network() as network:
+        filtered = pym.DensityFilter(domain=domain, radius=PROBLEM["filter_radius"])(signal)
+        penalized = pym.MathExpression(
+            expression=f"{{PROBLEM['minimum_density']}} + {{1.0 - PROBLEM['minimum_density']}}*inp0^{{PROBLEM['penalty_power']}}"
+        )(filtered)
+        stiffness = pym.AssembleStiffness(domain=domain, bc=fixed_dofs)(penalized)
+        displacement = pym.LinSolve(symmetric=True, positive_definite=True)(stiffness, force)
+        compliance = pym.EinSum(expression="i,i->")(displacement, force)
+    network.response()
+    return {{
+        "domain": domain,
+        "fixed_dofs": fixed_dofs,
+        "force": force,
+        "signal": signal,
+        "network": network,
+        "compliance_signal": compliance,
+        "passive_solid_mask": passive_solid_mask,
+        "passive_void_mask": passive_void_mask,
+    }}
+
+
+def _extract_density(value: Any, expected_size: int) -> np.ndarray:
+    if isinstance(value, dict):
+        if "density" not in value:
+            raise ValueError("missing density key")
+        value = value["density"]
+    density = np.asarray(value, dtype=float).reshape(-1)
+    if density.size != expected_size:
+        raise ValueError(f"density must have length {{expected_size}}, got {{density.size}}")
+    if not np.all(np.isfinite(density)):
+        raise ValueError("density contains non-finite values")
+    return density
+
+
+def _target_density_sum(state: dict[str, Any]) -> float:
+    return float(state["target_density_sum"])
+
+
+def density_bounds(previous_density: np.ndarray, state: dict[str, Any]) -> tuple[np.ndarray, np.ndarray]:
+    lower = np.maximum(float(state["minimum_density"]), previous_density - float(state["move_limit"]))
+    upper = np.minimum(1.0, previous_density + float(state["move_limit"]))
+    solid_mask = np.asarray(state["passive_solid_mask"], dtype=bool)
+    void_mask = np.asarray(state["passive_void_mask"], dtype=bool)
+    if solid_mask.any():
+        lower = lower.copy()
+        upper = upper.copy()
+        lower[solid_mask] = 1.0
+        upper[solid_mask] = 1.0
+    if void_mask.any():
+        lower = lower.copy()
+        upper = upper.copy()
+        lower[void_mask] = float(state["minimum_density"])
+        upper[void_mask] = float(state["minimum_density"])
+    return lower, upper
+
+
+def _project_sum_with_bounds(raw: np.ndarray, lower: np.ndarray, upper: np.ndarray, target_sum: float) -> np.ndarray:
+    if float(np.sum(lower)) - 1e-9 > target_sum or float(np.sum(upper)) + 1e-9 < target_sum:
+        raise ValueError("target density sum is infeasible under current bounds")
+    lam_low = float(np.min(raw - upper))
+    lam_high = float(np.max(raw - lower))
+    for _ in range(80):
+        lam = 0.5 * (lam_low + lam_high)
+        candidate = np.clip(raw - lam, lower, upper)
+        if float(np.sum(candidate)) > target_sum:
+            lam_low = lam
+        else:
+            lam_high = lam
+    return np.clip(raw - lam_high, lower, upper)
+
+
+def project_density(raw_density: Any, previous_density: np.ndarray, state: dict[str, Any]) -> np.ndarray:
+    raw = _extract_density(raw_density, previous_density.size)
+    lower, upper = density_bounds(previous_density, state)
+    return _project_sum_with_bounds(raw, lower, upper, _target_density_sum(state))
+
+
+def validate_density(candidate_density: np.ndarray, previous_density: np.ndarray, state: dict[str, Any]) -> None:
+    lower, upper = density_bounds(previous_density, state)
+    tol = 1e-6
+    if np.any(candidate_density < lower - tol) or np.any(candidate_density > upper + tol):
+        raise ValueError("density violates bounds, move limit, or passive masks")
+    volume_error = abs(float(np.sum(candidate_density)) - _target_density_sum(state))
+    if volume_error > 1e-4:
+        raise ValueError("density violates target volume")
+
+
+def oc_update(density: np.ndarray, sensitivity: np.ndarray, state: dict[str, Any]) -> np.ndarray:
+    lower, upper = density_bounds(density, state)
+    sens = np.asarray(sensitivity, dtype=float).reshape(-1)
+    if sens.shape != density.shape:
+        raise ValueError("sensitivity shape mismatch")
+    sens = np.minimum(sens, -1e-12)
+    l1, l2 = 1e-9, 1e9
+    for _ in range(80):
+        lam = 0.5 * (l1 + l2)
+        candidate = np.clip(density * np.sqrt(np.maximum(1e-12, -sens / lam)), lower, upper)
+        if float(np.sum(candidate)) > _target_density_sum(state):
+            l1 = lam
+        else:
+            l2 = lam
+    return np.clip(density * np.sqrt(np.maximum(1e-12, -sens / l2)), lower, upper)
+
+
+def run_optimization(update_density, max_iterations: int | None = None) -> dict[str, Any]:
+    context = _build_context()
+    signal = context["signal"]
+    network = context["network"]
+    compliance_signal = context["compliance_signal"]
+
+    history: list[float] = [float(compliance_signal.state)]
+    iterations = int(PROBLEM["max_iterations"] if max_iterations is None else max_iterations)
+    for iteration in range(iterations):
+        network.reset()
+        compliance_signal.sensitivity = 1.0
+        network.sensitivity()
+        density = np.asarray(signal.state, dtype=float).reshape(-1).copy()
+        sensitivity = np.asarray(signal.sensitivity, dtype=float).reshape(-1).copy()
+        state = {{
+            "iteration": iteration,
+            "domain_shape": (PROBLEM["nx"], PROBLEM["ny"]),
+            "volume_fraction": PROBLEM["volume_fraction"],
+            "target_density_sum": PROBLEM["volume_fraction"] * context["domain"].nel,
+            "minimum_density": PROBLEM["minimum_density"],
+            "move_limit": PROBLEM["move_limit"],
+            "current_compliance": float(compliance_signal.state),
+            "history": tuple(history),
+            "passive_solid_mask": context["passive_solid_mask"].copy(),
+            "passive_void_mask": context["passive_void_mask"].copy(),
+        }}
+        candidate = update_density(density.copy(), sensitivity.copy(), state)
+        density_next = _extract_density(candidate, density.size)
+        validate_density(density_next, density, state)
+        signal.state = density_next
+        network.response()
+        history.append(float(compliance_signal.state))
+
+    final_density = np.asarray(signal.state, dtype=float).reshape(-1)
+    return {{
+        "valid": True,
+        "compliance": float(compliance_signal.state),
+        "history": history,
+        "iterations": iterations,
+        "final_volume_fraction": float(np.mean(final_density)),
+        "volume_fraction_error": abs(float(np.mean(final_density)) - PROBLEM["volume_fraction"]),
+    }}
+"""
+
+
+EVALUATOR_TEMPLATE = """\
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import runpy
+import traceback
+from pathlib import Path
+
+
+def _repo_root() -> Path:
+    here = Path(__file__).resolve()
+    for parent in [here.parent, *here.parents]:
+        if (parent / "benchmarks").is_dir() and (parent / "frontier_eval").is_dir():
+            return parent
+    return Path.cwd().resolve()
+
+
+def _benchmark_root() -> Path:
+    return Path(__file__).resolve().parents[1]
+
+
+def _ensure_import_path() -> None:
+    import sys
+
+    for p in (_repo_root(), _benchmark_root()):
+        ps = str(p)
+        if ps not in sys.path:
+            sys.path.insert(0, ps)
+
+
+_ensure_import_path()
+
+try:
+    from benchmarks.{domain}.{slug}.baseline.solution import update_density as baseline_update_density
+    from benchmarks.{domain}.{slug}.runtime.problem import run_optimization
+except ModuleNotFoundError:
+    from baseline.solution import update_density as baseline_update_density
+    from runtime.problem import run_optimization
+
+
+def evaluate(program_path: str) -> tuple[dict[str, float], dict[str, str]]:
+    metrics = {{
+        "combined_score": -1e18,
+        "valid": 0.0,
+        "candidate_compliance": 0.0,
+        "baseline_compliance": 0.0,
+        "final_volume_fraction": 0.0,
+        "volume_fraction_error": 0.0,
+    }}
+    artifacts: dict[str, str] = {{}}
+
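+    program = Path(program_path).expanduser().resolve()
+    try:
+        # Loading the candidate can itself raise; report that as an invalid run
+        # instead of crashing before metrics are written.
+        namespace = runpy.run_path(str(program), run_name="candidate_program")
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+    update_density = namespace.get("update_density")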
+    if not callable(update_density):
+        artifacts["error_message"] = "candidate must define update_density(density, sensitivity, state)"
+        return metrics, artifacts
+
+    try:
+        baseline = run_optimization(baseline_update_density)
+        candidate = run_optimization(update_density)
+    except Exception:
+        artifacts["error_message"] = traceback.format_exc()
+        return metrics, artifacts
+
+    baseline_compliance = float(baseline["compliance"])
+    candidate_compliance = float(candidate["compliance"])
+    if not math.isfinite(baseline_compliance) or baseline_compliance <= 0:
+        artifacts["error_message"] = "internal baseline produced an invalid compliance value"
+        return metrics, artifacts
+    if not math.isfinite(candidate_compliance) or candidate_compliance <= 0:
+        artifacts["error_message"] = "candidate produced an invalid compliance value"
+        return metrics, artifacts
+
+    metrics["valid"] = 1.0
+    metrics["candidate_compliance"] = candidate_compliance
+    metrics["baseline_compliance"] = baseline_compliance
+    metrics["final_volume_fraction"] = float(candidate["final_volume_fraction"])
+    metrics["volume_fraction_error"] = float(candidate["volume_fraction_error"])
+    metrics["combined_score"] = -candidate_compliance
+    return metrics, artifacts
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("program")
+    parser.add_argument("--metrics-out", default="metrics.json")
+    args = parser.parse_args()
+    metrics, artifacts = evaluate(args.program)
+    Path(args.metrics_out).write_text(json.dumps(metrics, indent=2), encoding="utf-8")
+    if artifacts:
+        Path("artifacts.json").write_text(json.dumps(artifacts, indent=2), encoding="utf-8")
+    print(json.dumps(metrics, indent=2))
+
+
+if __name__ == "__main__":
+    main()
+"""
+
+
+def write(path: Path, content: str) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(textwrap.dedent(content).rstrip() + "\n", encoding="utf-8")
+
+
+def benchmark_files(domain: str, slug: str) -> dict[str, str]:
+    return {
+        "frontier_eval/agent_files.txt": "Task.md\nTask_zh-CN.md\nREADME.md\nbaseline/solution.py\nruntime/problem.py\n",
+        "frontier_eval/candidate_destination.txt": "scripts/init.py\n",
+        "frontier_eval/constraints.txt": (
+            "Edit only `scripts/init.py`.\n"
+            "Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.\n"
+            "Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.\n"
+            "Keep every density update finite and feasible.\n"
+        ),
+        "frontier_eval/eval_command.txt": "{python} verification/evaluator.py {candidate} --metrics-out metrics.json\n",
+        "frontier_eval/eval_cwd.txt": ".\n",
+        "frontier_eval/initial_program.txt": "scripts/init.py\n",
+        "frontier_eval/readonly_files.txt": (
+            "baseline/solution.py\n"
+            "runtime/problem.py\n"
+            "verification/evaluator.py\n"
+            "references/source_manifest.md\n"
+        ),
+    }
+
+
+def main() -> None:
+    repo_root = Path(__file__).resolve().parents[1]
+    for task in TASKS:
+        domain = task["domain"]
+        slug = task["slug"]
+        root = repo_root / "benchmarks" / domain / slug
+        root.mkdir(parents=True, exist_ok=True)
+
+        problem_json = json.dumps(task["problem"], indent=4)
+        write(
+            root / "README.md",
+            README_TEMPLATE.format(
+                title=task["title"],
+                short=task["short"],
+                provenance_class=task["provenance_class"],
+                geometry=task["problem"]["geometry"],
+                domain=domain,
+                slug=slug,
+            ),
+        )
+        write(root / "Task.md", TASK_TEMPLATE.format(title=task["title"], short=task["short"]))
+        write(root / "Task_zh-CN.md", TASK_ZH_TEMPLATE.format(title=task["title"], short=task["short"]))
+        write(root / "references" / "source_manifest.md", task["source_manifest"])
+        write(root / "scripts" / "init.py", INIT_TEMPLATE.format(domain=domain, slug=slug))
+        write(root / "baseline" / "solution.py", BASELINE_TEMPLATE.format(domain=domain, slug=slug))
+        write(
+            root / "runtime" / "problem.py",
+            RUNTIME_TEMPLATE.format(problem_json=problem_json, title=task["title"]),
+        )
+        write(
+            root / "verification" / "evaluator.py",
+            EVALUATOR_TEMPLATE.format(domain=domain, slug=slug),
+        )
+        write(root / "verification" / "requirements.txt", "numpy\nscipy\npymoto\n")
+        for relative, content in benchmark_files(domain, slug).items():
+            write(root / relative, content)
+
+
+if __name__ == "__main__":
+    main()
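
Note on `_project_sum_with_bounds` and `project_density` in the runtime template above: both rely on bisecting a scalar shift so that the clipped vector meets the volume budget. The sketch below restates that idea in pure Python so it runs without NumPy or pyMOTO; the function name `project_to_budget`, the step size `0.1`, and the sample numbers are assumptions made for illustration only.

```python
def project_to_budget(raw, lower, upper, target_sum, iterations=80):
    """Shift `raw` by one scalar and clip to [lower, upper] so entries sum to `target_sum`."""
    if sum(lower) - 1e-9 > target_sum or sum(upper) + 1e-9 < target_sum:
        raise ValueError("target sum is infeasible under the given bounds")

    def clipped(shift):
        return [min(max(r - shift, lo), hi) for r, lo, hi in zip(raw, lower, upper)]

    # At shift_low every entry is at its upper bound; at shift_high, at its lower bound.
    shift_low = min(r - hi for r, hi in zip(raw, upper))
    shift_high = max(r - lo for r, lo in zip(raw, lower))
    for _ in range(iterations):
        mid = 0.5 * (shift_low + shift_high)
        if sum(clipped(mid)) > target_sum:
            shift_low = mid   # still too much material: shift further down
        else:
            shift_high = mid  # at or below budget: tighten from above
    return clipped(shift_high)


# Illustrative proposal: step against a (negative) compliance sensitivity,
# then project back onto the move-limit box and the fixed density sum.
previous = [0.5, 0.5, 0.5, 0.5]
sensitivity = [-4.0, -1.0, -0.5, -0.1]
move_limit, minimum_density = 0.2, 0.001
raw = [x - 0.1 * s for x, s in zip(previous, sensitivity)]
lower = [max(minimum_density, x - move_limit) for x in previous]
upper = [min(1.0, x + move_limit) for x in previous]
projected = project_to_budget(raw, lower, upper, target_sum=sum(previous))
print(projected)
```

For these sample inputs the first entry saturates at its move-limit upper bound of `0.7` and the remaining entries land near `0.48`, `0.43`, and `0.39`, keeping the total at `2.0`. A candidate `update_density` would typically build its raw proposal however it likes and then hand it to the real `project_density` helper rather than reimplementing this loop.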
diff --git a/scripts/rerun_frontier_first20_shinka5.sh b/scripts/rerun_frontier_first20_shinka5.sh
new file mode 100755
index 00000000..52625c3c
--- /dev/null
+++ b/scripts/rerun_frontier_first20_shinka5.sh
@@ -0,0 +1,35 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+cd "$ROOT"
+
+PY="${PYTHON_BIN:-./.venv/bin/python}"
+MATRIX="${MATRIX_PATH:-frontier_eval/conf/batch/frontier_first20_shinkaevolve5.yaml}"
+BATCH_ROOT="${1:-runs/batch/frontier_first20_shinkaevolve5__20260313_162247__46e57490}"
+
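+mapfile -t TASKS < <(
+  "$PY" - "$MATRIX" <<'PY'
+import sys
+
+from omegaconf import OmegaConf
+
+# Load the same matrix file that is passed to frontier_eval.batch below,
+# so a MATRIX_PATH override applies to the task list as well.
+cfg = OmegaConf.load(sys.argv[1])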
+for task in cfg["tasks"]:
+    print(task)
+PY
+)
+
+cmd=("$PY" -m frontier_eval.batch --matrix "$MATRIX" --in-place --batch-root "$BATCH_ROOT")
+for task in "${TASKS[@]}"; do
+  cmd+=(--tasks "$task")
+done
+
+printf 'Running:'
+for arg in "${cmd[@]}"; do
+  printf ' %q' "$arg"
+done
+printf '\n'
+
+"${cmd[@]}"
+"$PY" scripts/summarize_batch_run.py "$BATCH_ROOT"
+
+echo "BATCH_ROOT=$BATCH_ROOT"
diff --git a/skills/frontier-benchmark-contributor/SKILL.md b/skills/frontier-benchmark-contributor/SKILL.md
new file mode 100644
index 00000000..8b62caa7
--- /dev/null
+++ b/skills/frontier-benchmark-contributor/SKILL.md
@@ -0,0 +1,216 @@
+---
+name: frontier-benchmark-contributor
+description: Create and refine Frontier-Engineering benchmark contributions from research/engineering problem ideas, third-party libraries, or known benchmark instances. Use when Codex needs to choose feasible topics, verify authentic and traceable data sources, turn a candidate task into a reproducible benchmark, design the editable agent interface, add baseline and evaluation files, and integrate the result into `benchmarks/` and `frontier_eval/`. Especially useful for benchmark triage, single-file task wrapping, unified-task metadata setup, and repo-specific benchmark contribution work.
+---
+
+# Frontier Benchmark Contributor
+
+## Overview
+
+Turn a loose task idea into a Frontier-Engineering benchmark that is small, reproducible, and evaluable by the existing repo tooling.
+Prefer benchmark shapes that expose a narrow candidate interface, have a stable baseline, and run repeatedly without large datasets, GUI dependencies, or long training loops.
+Treat data provenance as a hard requirement: every dataset, benchmark instance, parameter table, and historical time series must have a real source, a clear transformation path, and wording that does not over-claim authenticity.
+Treat benchmark quality checks as a hard requirement too: a contributed task is not "done" just because the baseline runs once. It should be able to produce a valid numeric score, and when possible it should show a real improvement signal in a short `frontier_eval` run such as a 10-step `eval_single.sh` check.
+
+## Triage The Candidate Pool
+
+Start by shrinking the candidate pool before writing code.
+Reject candidates with unclear data provenance before evaluating implementation effort.
+
+Score each candidate against six questions:
+
+1. Can the evaluator run offline with bounded runtime and memory?
+2. Can the candidate interface be reduced to one file or one narrow function entrypoint?
+3. Does the task rely on authentic, traceable data or benchmark instances from a primary source?
+4. Is there a trustworthy baseline from the referenced library, dataset, or benchmark instance?
+5. Can validity be checked mechanically with deterministic or low-variance metrics?
+6. Can the task be explained clearly in `Task.md` and `Task_zh-CN.md` without external hidden context?
+
+Prefer tasks that answer "yes" to all six.
+Reject or heavily simplify tasks that require long RL training, online data fetches, GPU-only pipelines, large opaque datasets, simulator stacks that are hard to pin down, or data copied from secondary blog posts and mirrors without a canonical source.
+
+When the user provides a long list, shortlist at most three candidates before implementation.
+Prefer "one excellent benchmark landed end-to-end" over a broad but shallow plan.
+
+For the current candidate family in this repo, load [candidate-triage.md](./references/candidate-triage.md) when you need a practical ranking of likely-easy versus likely-heavy tasks.
+
+## Enforce Data Provenance
+
+Treat source authenticity as part of benchmark validity, not just documentation quality.
+
+For any real dataset, public benchmark instance, or historical series:
+
+1. Cite the primary paper, official repository, or official dataset page.
+2. Record the exact instance name, release, version, or snapshot date.
+3. State whether the repo contains raw data, a processed derivative, or a synthetic stand-in.
+4. Explain every nontrivial transformation from source data to benchmark-ready files.
+5. Confirm the license or redistribution status before vendoring files into the repo.
+
+Do not present fabricated, hand-copied, or weakly sourced data as if it were canonical benchmark data.
+If you use synthetic data for practicality, label it explicitly as synthetic, fix the random seed or generator procedure, and describe why the synthetic construction is an acceptable proxy.
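+
+A minimal sketch of a fixed-seed generator, assuming a synthetic demand series; the function name, value range, and series length are hypothetical and not part of this repo:
+
+```python
+import random
+
+
+def make_synthetic_demand(seed: int = 0, periods: int = 12) -> list[int]:
+    """Return a reproducible synthetic demand series (hypothetical example)."""
+    # A local generator keeps the result independent of global random state.
+    rng = random.Random(seed)
+    # Illustrative range only; a real benchmark should justify its distribution.
+    return [rng.randint(50, 150) for _ in range(periods)]
+```
+
+Calling `make_synthetic_demand(seed=0)` twice in the same environment returns the same list, so the seed and this procedure can be recorded in the provenance notes.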
+
+Prefer:
+
+1. Official benchmark instances such as FT10 and LA16 from recognized benchmark collections.
+2. Official library examples or reference datasets bundled with the upstream project.
+3. Public market, operations, or engineering datasets with stable identifiers and clear licensing.
+
+Avoid:
+
+1. Blog reposts of benchmark tables.
+2. Unverifiable CSV files copied from forums or random GitHub forks.
+3. "Historical data" without ticker universe, time range, adjustment rules, and source declaration.
+4. Derived benchmark files whose preprocessing steps cannot be reconstructed.
+
+## Choose The Integration Path
+
+Prefer the existing `unified` task path unless there is a concrete reason not to.
+
+Use `task=unified` when the benchmark can be described by benchmark-local metadata and a shell evaluation command:
+
+1. Candidate file lives inside the benchmark folder.
+2. Evaluation can be launched from `eval_command.txt`.
+3. Metrics can be written to `metrics.json`.
+4. Artifact collection can be handled by `artifact_files.txt`.
+5. No custom Python-side sandbox orchestration is needed.
+
+Use a bespoke `frontier_eval/tasks/<task>` wrapper only when at least one of these is true:
+
+1. The evaluator must mutate or back up multiple files before running.
+2. The benchmark needs nontrivial Python-side setup or teardown.
+3. The framework must inspect outputs programmatically beyond normal `metrics.json` parsing.
+4. The task needs compatibility logic that does not fit cleanly in benchmark-local scripts.
+
+Inspect `frontier_eval/README.md`, `frontier_eval/README_zh-CN.md`, and [repo-checklist.md](./references/repo-checklist.md) when deciding between the two.
+
+## Design The Benchmark Shape
+
+Lock down the task shape before creating files.
+
+Define all of the following explicitly:
+
+1. Immutable world model: simulator, optimizer, instance data, or reference formulas that the agent must not change.
+2. Editable surface: the single file, function, or policy the agent is allowed to modify.
+3. Baseline: the reference implementation to compare against.
+4. Data provenance: canonical source, version or snapshot date, license status, and preprocessing path.
+5. Metrics: `combined_score`, `valid`, and any secondary metrics worth logging.
+6. Constraints: runtime limit, memory limit, reproducibility expectations, forbidden side effects.
+7. Expected artifacts: logs, schedules, plots, or summary tables worth collecting.
+
+Prefer one of these agent surfaces:
+
+1. `solve(instance) -> solution`
+2. `plan_path(map, start, goal) -> path`
+3. `schedule_fab(state) -> action`
+4. `dispatch_rule(job_state) -> priority`
+5. `custom_kernel(data) -> output`
+
+Avoid wide interfaces with many mutable files unless the benchmark genuinely needs them.
+
+## Build The Files
+
+Create the benchmark under `benchmarks/<Domain>/<Task>/`.
+Mirror existing repo conventions instead of inventing a new layout.
+
+For most new tasks, include:
+
+1. `Task.md`
+2. `Task_zh-CN.md`
+3. `README.md`
+4. `baseline/`
+5. `verification/`
+6. `frontier_eval/`
+
+Add `README_zh-CN.md`, `references/`, `runtime/`, or `data/` only when they materially help.
+If the task uses external or processed data, add explicit source notes in the README and keep any preprocessing script or provenance note close to the data files.
+
+If using `unified`, create benchmark metadata files under `frontier_eval/` and keep the evaluator benchmark-local.
+If using a bespoke wrapper, also add:
+
+1. `frontier_eval/tasks/<task>/task.py`
+2. `frontier_eval/tasks/<task>/evaluator/python.py`
+3. `frontier_eval/tasks/<task>/__init__.py`
+4. `frontier_eval/conf/task/<task>.yaml`
+
+Use [repo-checklist.md](./references/repo-checklist.md) for the concrete file checklist and representative examples already present in this repository.
+
+## Write The Task Prompt
+
+Make the task prompt operational, not promotional.
+
+State:
+
+1. What the agent may edit.
+2. What inputs are fixed.
+3. Where the fixed data or benchmark instances came from.
+4. What outputs are expected.
+5. How validity is checked.
+6. How score is computed.
+7. Which baseline or reference implementation is available.
+
+Use concrete paths and function signatures.
+Mention exact filenames the user should inspect or modify.
+If the task comes from a classic benchmark instance such as FT10 or LA16, include the instance name and the known optimal or baseline score.
+If the task uses market data, experimental data, or processed benchmark files, name the provider, time interval, preprocessing rules, and whether the data is redistributed or regenerated locally.
+
+## Validate Before Calling The Task Done
+
+Run the simplest direct benchmark command first, then the Frontier wrapper.
+
+Check at least:
+
+1. Baseline execution succeeds from a clean state.
+2. The direct benchmark command writes a finite numeric score and `valid=1` for the baseline or init candidate.
+3. The benchmark writes stable metrics: the baseline score is reproducible across at least two runs when practical.
+4. The benchmark is using the intended data snapshot or instance set, not an accidental local variant.
+5. Invalid candidate behavior maps to `valid=0`.
+6. The candidate cannot silently modify protected files.
+7. The task can run through `frontier_eval` with zero or minimal iterations.
+8. When the task is intended for evolutionary optimization, run a short optimization sanity check such as a 10-step `eval_single.sh` run and check whether there is at least some valid improvement signal over the initial candidate.
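+
+Checks 2 and 5 can be sketched as a small helper, assuming the evaluator has already written `metrics.json` with the `combined_score` and `valid` keys; the helper name `is_scorable` is hypothetical and not part of `frontier_eval`:
+
+```python
+import json
+import math
+from pathlib import Path
+
+
+def is_scorable(metrics_path: str) -> bool:
+    """Return True if the metrics file has a finite score and a truthy `valid`."""
+    data = json.loads(Path(metrics_path).read_text(encoding="utf-8"))
+    score = data.get("combined_score")
+    valid = data.get("valid")
+    # Reject missing, non-numeric, or boolean scores before the finiteness test.
+    if isinstance(score, bool) or not isinstance(score, (int, float)):
+        return False
+    # NaN and infinity are not finite, so they count as unscorable.
+    return math.isfinite(score) and valid in (1, True)
+```
+
+A baseline run should make this return `True`, and an intentionally broken candidate should make it return `False`.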
+
+For `unified`, prefer a smoke command of the form:
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=<Domain>/<Task> \
+  algorithm.iterations=0
+```
+
+If the benchmark needs a specific Python or conda environment, document it in the benchmark README and task metadata.
+
+For short optimization validation in this repository, prefer the repo helper script:
+
+```bash
+./eval_single.sh task=unified task.benchmark=<Domain>/<Task>
+```
+
+If the benchmark needs a `.venv` interpreter or other non-default runtime overrides, pass them explicitly:
+
+```bash
+PYTHON_BIN=/path/to/Frontier-Engineering/.venv/bin/python \
+./eval_single.sh \
+  task=unified \
+  task.benchmark=<Domain>/<Task> \
+  task.runtime.use_conda_run=false \
+  task.runtime.python_path=/path/to/Frontier-Engineering/.venv/bin/python
+```
+
+Do not overstate quality if any of these holds:
+
+1. The benchmark cannot produce `valid=1` with a finite baseline score.
+2. A 10-step `eval_single.sh` run produces only invalid candidates.
+3. A 10-step run shows no meaningful improvement signal and you have not explicitly called that out as a remaining risk.
+
+## Report With Decision Quality
+
+When finishing the contribution, summarize:
+
+1. Why this task was selected over alternatives.
+2. Which repo path and integration mode were used.
+3. What the authentic data source or canonical benchmark origin is.
+4. What the editable agent surface is.
+5. What baseline and evaluation metrics exist.
+6. What remaining risks or environment gaps still exist.
+
+If you reject a candidate, say why in concrete engineering terms such as weak source authenticity, redistribution ambiguity, dependency weight, nondeterminism, hidden data requirements, or excessive runtime.
diff --git a/skills/frontier-benchmark-contributor/agents/openai.yaml b/skills/frontier-benchmark-contributor/agents/openai.yaml
new file mode 100644
index 00000000..c9243792
--- /dev/null
+++ b/skills/frontier-benchmark-contributor/agents/openai.yaml
@@ -0,0 +1,4 @@
+interface:
+  display_name: "Frontier Benchmark Contributor"
+  short_description: "Design and add Frontier-Engineering benchmark tasks"
+  default_prompt: "Use $frontier-benchmark-contributor to turn candidate domains into implementable Frontier-Engineering benchmark tasks."
diff --git a/skills/frontier-benchmark-contributor/references/candidate-triage.md b/skills/frontier-benchmark-contributor/references/candidate-triage.md
new file mode 100644
index 00000000..93446579
--- /dev/null
+++ b/skills/frontier-benchmark-contributor/references/candidate-triage.md
@@ -0,0 +1,65 @@
+# Candidate Triage For The Current Frontier-Engineering Expansion
+
+Use this file when the user asks which ideas from the current candidate pool should be implemented first.
+
+## Tier A: Fastest To Land
+
+These are the best first contributions because they are small, deterministic, and easy to score.
+
+1. Stockpyl single-node EOQ extensions with MOQ and supplier discount variants.
+2. Stockpyl stochastic demand with service-level constrained `(s, Q)` or `(R, Q)` policies.
+3. Job shop scheduling with FT10 or LA16 using an existing solver as baseline and a heuristic surface for the agent.
+4. PyPortfolioOpt static Markowitz portfolio optimization with fixed historical return matrices.
+5. Simplified ship weather routing on a pre-generated 2D grid with offline wind/current fields.
+6. Simplified water network MILP with a small hand-written network and a CBC or HiGHS baseline.
+
+Why these are attractive:
+
+1. They have narrow interfaces.
+2. They admit clear baseline calls.
+3. They can use synthetic or fixed local data.
+4. They can expose interpretable metrics such as cost, makespan, Sharpe ratio, or route cost.
+
+## Tier B: Viable After Moderate Simplification
+
+These can work, but they need extra care around dependencies or benchmark scope.
+
+1. PyPSA intraday operation with storage on a very small network.
+2. EV2Gym simplified charging plus grid proxy.
+3. pyMOTO topology optimization with fixed meshes and limited iteration budgets.
+4. Motion planning from `caelan/motion-planners` for 2D or small 3D maps.
+5. MESMO multi-energy scheduling only if the environment can be pinned and the instance is tiny.
+
+Main risk:
+
+The library stack or runtime may be heavier than the eventual benchmark warrants.
+
+## Tier C: Defer Or Simplify Aggressively
+
+Avoid these for the first batch unless the user explicitly wants a heavy benchmark and accepts extra engineering work.
+
+1. CARLA-derived driving behavior planning.
+2. SimRLFab RL scheduling with training loops.
+3. OpenFF force-field fitting or MD performance tuning.
+4. Sequoya multiple sequence alignment on real datasets.
+5. DawnDesignTool aircraft multidisciplinary design.
+6. Optiland optical system design.
+7. Additive manufacturing differentiable simulation.
+8. Data-center multi-agent RL (MARL) cooling and scheduling.
+
+Typical reasons to defer:
+
+1. Dependency stacks are large or fragile.
+2. Runtime is hard to bound.
+3. Reproducibility is harder to guarantee.
+4. Benchmark setup can dominate the actual task definition.
+
+## Practical Selection Rule
+
+When the user wants to contribute a batch:
+
+1. Pick one Tier A task that is optimization-heavy but easy to score.
+2. Pick one Tier A or B task from a different domain for diversity.
+3. Delay Tier C until the repo has enough lighter tasks landed successfully.
+
+If uncertain between two tasks, choose the one that can be implemented with `task=unified`.
diff --git a/skills/frontier-benchmark-contributor/references/repo-checklist.md b/skills/frontier-benchmark-contributor/references/repo-checklist.md
new file mode 100644
index 00000000..71298494
--- /dev/null
+++ b/skills/frontier-benchmark-contributor/references/repo-checklist.md
@@ -0,0 +1,158 @@
+# Frontier-Engineering Repo Checklist
+
+Use this file when you need exact repo placement and validation details.
+
+## Prefer Unified When Possible
+
+The repo already supports benchmark-local metadata through `frontier_eval/tasks/unified/`.
+Prefer that path for new tasks that can be evaluated by a shell command plus `metrics.json`.
+
+Useful files to inspect:
+
+1. `frontier_eval/README.md`
+2. `frontier_eval/README_zh-CN.md`
+3. `frontier_eval/tasks/unified/spec.py`
+4. `benchmarks/CommunicationEngineering/PMDSimulation/frontier_eval/`
+
+Minimum unified metadata under `benchmarks/<Domain>/<Task>/frontier_eval/`:
+
+1. `initial_program.txt`
+2. `eval_command.txt`
+
+Commonly useful optional metadata:
+
+1. `candidate_destination.txt`
+2. `eval_cwd.txt`
+3. `agent_files.txt`
+4. `artifact_files.txt`
+5. `readonly_files.txt`
+6. `constraints.txt`
+
+Evaluation convention:
+
+1. Make the command in `eval_command.txt` write `metrics.json`.
+2. Include numeric `combined_score`.
+3. Include numeric or boolean `valid`.
+4. Use nonzero exit codes for fatal evaluation failure.
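+
+A minimal sketch of an evaluator tail that follows this convention; `write_metrics` and `main` are hypothetical names, and the placeholder score stands in for real candidate execution:
+
+```python
+import json
+import sys
+
+
+def write_metrics(path: str, combined_score: float, valid: bool) -> dict:
+    """Write the two required keys to a metrics file and return the payload."""
+    payload = {"combined_score": float(combined_score), "valid": 1 if valid else 0}
+    with open(path, "w", encoding="utf-8") as fh:
+        json.dump(payload, fh)
+    return payload
+
+
+def main(metrics_path: str = "metrics.json") -> int:
+    """Return 0 on success and 1 on a fatal evaluation failure."""
+    try:
+        score = 0.0  # placeholder: a real evaluator would run the candidate here
+        write_metrics(metrics_path, score, valid=True)
+    except Exception as exc:  # fatal failure: report it and signal nonzero exit
+        print(f"fatal evaluation failure: {exc}", file=sys.stderr)
+        return 1
+    return 0
+```
+
+A script would end with `sys.exit(main())` so that the shell command in `eval_command.txt` sees the nonzero exit code.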
+
+## Use A Bespoke Wrapper Only When Needed
+
+Create a dedicated `frontier_eval/tasks/<task>/` package only when the benchmark needs Python-driven sandbox logic.
+
+Representative files:
+
+1. `frontier_eval/tasks/quadruped_gait/task.py`
+2. `frontier_eval/tasks/mla/task.py`
+3. `frontier_eval/tasks/mla/evaluator/python.py`
+4. `frontier_eval/conf/task/quadruped_gait.yaml`
+
+Minimum bespoke wrapper additions:
+
+1. `frontier_eval/tasks/<task>/task.py`
+2. `frontier_eval/tasks/<task>/__init__.py`
+3. `frontier_eval/tasks/<task>/evaluator/python.py`
+4. `frontier_eval/tasks/<task>/evaluator/__init__.py`
+5. `frontier_eval/conf/task/<task>.yaml`
+
+## Benchmark Folder Shape
+
+Most benchmark folders should include:
+
+1. `Task.md`
+2. `Task_zh-CN.md`
+3. `README.md`
+4. `baseline/`
+5. `verification/`
+6. `frontier_eval/`
+
+Common optional additions:
+
+1. `README_zh-CN.md`
+2. `references/`
+3. `runtime/`
+4. `data/`
+5. `scripts/`
+
+## Data Provenance Checklist
+
+Before committing a benchmark with any nontrivial data payload or instance file, record:
+
+1. The primary source URL, paper, or upstream repository.
+2. The exact dataset version, release tag, instance name, or snapshot date.
+3. Whether files in this repo are raw, filtered, transformed, or synthetic.
+4. The preprocessing script or manual transformation steps.
+5. The redistribution or license status.
+
+Recommended placement:
+
+1. Put a short provenance summary in `README.md`.
+2. Mention the canonical source in `Task.md` and `Task_zh-CN.md`.
+3. Keep preprocessing code in `scripts/` or `references/` when practical.
+4. Avoid committing opaque processed files without a reconstruction path.
+
+For classic benchmark instances such as FT10 or LA16, cite the benchmark family and pin the exact instance names instead of relying on informal copies.
+
+Representative examples:
+
+1. `benchmarks/Robotics/QuadrupedGaitOptimization/`
+2. `benchmarks/CommunicationEngineering/PMDSimulation/`
+3. `benchmarks/KernelEngineering/FlashAttention/`
+
+## Candidate Interface Checklist
+
+Before implementation, pin down:
+
+1. Which file is the editable candidate file.
+2. Where that candidate file is copied inside the sandbox.
+3. Which function or class entrypoint the evaluator expects.
+4. Which files are read-only.
+5. Which outputs are used to compute score.
+
+Prefer one editable file and one well-named entrypoint.
+
+## Validation Checklist
+
+Run these checks before considering the task contributed:
+
+1. Run the baseline directly inside the benchmark folder.
+2. Confirm the direct evaluator returns a finite score and `valid=1` for the baseline or init candidate.
+3. Run the benchmark through `frontier_eval`.
+4. Confirm the benchmark is reading the intended source files, instance set, or local snapshot.
+5. Confirm the baseline score is reproducible across at least two runs when practical.
+6. Confirm that an intentionally broken candidate produces a degraded score, `valid=0`, or an evaluation failure.
+7. If the task is meant to be optimizable, run a 10-step sanity check with `eval_single.sh` and see whether the best valid score improves over the initial candidate.
+8. Confirm logs or artifacts are preserved where the user can inspect them.
+
+Useful smoke command:
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=<Domain>/<Task> \
+  algorithm.iterations=0
+```
+
+If the task does not fit `unified`, run the corresponding dedicated task config instead.
+
+Useful short optimization sanity check:
+
+```bash
+./eval_single.sh task=unified task.benchmark=<Domain>/<Task>
+```
+
+If the benchmark needs a specific Python executable or runtime overrides, pass them through the same script:
+
+```bash
+PYTHON_BIN=/path/to/Frontier-Engineering/.venv/bin/python \
+./eval_single.sh \
+  task=unified \
+  task.benchmark=<Domain>/<Task> \
+  task.runtime.use_conda_run=false \
+  task.runtime.python_path=/path/to/Frontier-Engineering/.venv/bin/python
+```
+
+Interpretation rule:
+
+1. `valid=1` with a finite score is the minimum bar for “can be scored”.
+2. A visible best-score improvement within 10 steps is the preferred bar for “quality-checked and optimization-ready”.
+3. If the 10-step run fails to improve, document that explicitly as a remaining risk instead of calling the benchmark fully validated.