EinsiaLab · llltttwww · Mar 16, 2026 · Mar 16, 2026 · Mar 18, 2026
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/README.md b/benchmarks/ComputerSystems/DuckDBIndexSelection/README.md
@@ -0,0 +1,55 @@
+# DuckDB Index Selection
+
+Choose a subset of whitelisted DuckDB indexes for a frozen analytical lookup workload so that total runtime is minimized.
+
+## Why This Benchmark Matters
+
+This benchmark captures physical-design tuning on top of a stable DuckDB workload. Extra indexes can speed repeated queries, but they also add build and maintenance cost, so the goal is to choose the right subset rather than simply choosing more.
+
+Viewed computationally, this is a subset-selection problem over a fixed candidate set, with real execution time rather than a proxy metric deciding the score.
+
+## What You Edit
+
+- Target file: `scripts/init.py`
+- Entry point: `select_indexes(workload_manifest)`
+
+## Source of Truth
+
+- `Task.md`: full task contract and scoring rules
+- `Task_zh-CN.md`: Chinese translation of the task contract
+- `runtime/problem.py`: frozen instance, validator, and metrics helpers
+- `baseline/solution.py`: reference baseline
+- `verification/evaluator.py`: local evaluator entry point
+- `references/source_manifest.md`: provenance and lineage notes
+
+## Environment
+
+From repository root:
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/ComputerSystems/DuckDBIndexSelection/verification/requirements.txt
+```
+
+## Quick Run
+
+From repository root:
+
+```bash
+python benchmarks/ComputerSystems/DuckDBIndexSelection/verification/evaluator.py \
+  benchmarks/ComputerSystems/DuckDBIndexSelection/scripts/init.py \
+  --metrics-out /tmp/DuckDBIndexSelection_metrics.json
+```
+
+## Optional: Run with `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=ComputerSystems/DuckDBIndexSelection \
+  algorithm.iterations=0
+```
+
+If you need a non-default interpreter, also add `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`.
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/README_zh-CN.md b/benchmarks/ComputerSystems/DuckDBIndexSelection/README_zh-CN.md
@@ -0,0 +1,55 @@
+# DuckDB 索引选择
+
+在冻结的 DuckDB 分析 workload 上，从白名单中挑选一组索引，尽量降低总执行时间。
+
+## 这个 Benchmark 在测什么
+
+这个 benchmark 对应的是稳定 DuckDB workload 上的物理设计调优。额外索引确实可能加速重复查询，但也会带来构建和维护成本，所以关键不是“建得越多越好”，而是选对那一小部分。
+
+从计算角度看，这是一道固定候选集合上的子集选择题，而且分数来自真实执行时间，不是某个替代指标。
+
+## 你真正会改的文件
+
+- 目标文件：`scripts/init.py`
+- 入口函数：`select_indexes(workload_manifest)`
+
+## 先看哪里
+
+- `Task_zh-CN.md`：中文任务契约与评分规则
+- `Task.md`：英文任务说明
+- `runtime/problem.py`：冻结实例、校验逻辑和指标辅助函数
+- `baseline/solution.py`：基线实现
+- `verification/evaluator.py`：本地评测入口
+- `references/source_manifest.md`：来源与谱系说明
+
+## 环境准备
+
+从仓库根目录运行：
+
+```bash
+pip install -r frontier_eval/requirements.txt
+pip install -r benchmarks/ComputerSystems/DuckDBIndexSelection/verification/requirements.txt
+```
+
+## 快速运行
+
+从仓库根目录运行：
+
+```bash
+python benchmarks/ComputerSystems/DuckDBIndexSelection/verification/evaluator.py \
+  benchmarks/ComputerSystems/DuckDBIndexSelection/scripts/init.py \
+  --metrics-out /tmp/DuckDBIndexSelection_metrics.json
+```
+
+## 可选：使用 `frontier_eval`
+
+```bash
+python -m frontier_eval \
+  task=unified \
+  task.benchmark=ComputerSystems/DuckDBIndexSelection \
+  algorithm.iterations=0
+```
+
+如果需要指定解释器，可以额外添加 `task.runtime.use_conda_run=false task.runtime.python_path=/path/to/python`。
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/Task.md b/benchmarks/ComputerSystems/DuckDBIndexSelection/Task.md
@@ -0,0 +1,51 @@
+# DuckDB Index Selection Task
+
+## Problem
+
+Choose a subset of whitelisted DuckDB indexes for a frozen analytical lookup workload so that total runtime is minimized.
+
+This benchmark captures physical-design tuning on top of a stable DuckDB workload. Extra indexes can speed repeated queries, but they also add build and maintenance cost, so the goal is to choose the right subset rather than simply choosing more.
+
+Viewed computationally, this is a subset-selection problem over a fixed candidate set, with real execution time rather than a proxy metric deciding the score.
+
+## What Is Frozen
+
+- The schema, local data generator, and workload manifest in `runtime/problem.py`.
+- The whitelist of legal index names in `workload_manifest["candidate_indexes"]`.
+- The timing protocol: index build plus four repeated workload executions.
+
+## Submission Contract
+
+Submit one Python file that defines:
+
+```python
+def select_indexes(workload_manifest):
+    ...
+```
+
+Return a list of whitelisted index names. A dict with key `indexes` whose value is such a list is also accepted.
+
+## Evaluation
+
+1. Build the frozen DuckDB database and load the manifest.
+2. Create the indexes you selected from the whitelist.
+3. Run the fixed lookup workload four times.
+4. Measure total candidate runtime and report no-index baseline numbers for context.
+
+## Metrics
+
+- `combined_score`: `-candidate_total_runtime_s`
+- `valid`: `1.0` only if every selected index name is legal and execution succeeds
+- `candidate_total_runtime_s`
+- `baseline_total_runtime_s`
+- `candidate_setup_runtime_s`
+- `candidate_workload_runtime_s`
+
+## Invalid Submissions
+
+- `select_indexes(...)` is missing or crashes
+- The return value cannot be parsed into a list of names
+- Any selected name is outside the whitelist
+- Index creation or workload execution fails
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/Task_zh-CN.md b/benchmarks/ComputerSystems/DuckDBIndexSelection/Task_zh-CN.md
@@ -0,0 +1,51 @@
+# DuckDB 索引选择
+
+## 任务概览
+
+在冻结的 DuckDB 分析 workload 上，从白名单中挑选一组索引，尽量降低总执行时间。
+
+这个 benchmark 对应的是稳定 DuckDB workload 上的物理设计调优。额外索引确实可能加速重复查询，但也会带来构建和维护成本，所以关键不是“建得越多越好”，而是选对那一小部分。
+
+从计算角度看，这是一道固定候选集合上的子集选择题，而且分数来自真实执行时间，不是某个替代指标。
+
+## 哪些部分是冻结的
+
+- `runtime/problem.py` 中冻结的 schema、本地数据生成逻辑和 workload manifest。
+- `workload_manifest["candidate_indexes"]` 中给出的合法索引白名单。
+- 固定的计时协议：先建索引，再重复执行 workload 四轮。
+
+## 提交接口
+
+提交一个 Python 文件，定义：
+
+```python
+def select_indexes(workload_manifest):
+    ...
+```
+
+返回白名单中的索引名列表；也接受带 `indexes` 字段的字典。
+
+## 评测流程
+
+1. 构建冻结的 DuckDB 数据库并加载 manifest。
+2. 创建你从白名单中选出的索引。
+3. 固定重复执行查询 workload 四次。
+4. 统计候选总耗时，并同时给出无索引 baseline 作为参考。
+
+## 指标
+
+- `combined_score`：`-candidate_total_runtime_s`
+- `valid`：只有索引名合法且执行成功时才为 `1.0`
+- `candidate_total_runtime_s`
+- `baseline_total_runtime_s`
+- `candidate_setup_runtime_s`
+- `candidate_workload_runtime_s`
+
+## 判为无效的情况
+
+- 缺少 `select_indexes(...)`，或函数在评测中报错
+- 返回值无法解析为索引名列表
+- 任意一个索引名不在白名单中
+- 建索引或执行 workload 时失败
+
+<!-- AI_GENERATED -->
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/baseline/solution.py b/benchmarks/ComputerSystems/DuckDBIndexSelection/baseline/solution.py
@@ -0,0 +1,5 @@
+from __future__ import annotations
+
+
+def select_indexes(workload_manifest):
+    return []
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/agent_files.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/agent_files.txt
@@ -0,0 +1,6 @@
+Task.md
+Task_zh-CN.md
+README.md
+baseline/solution.py
+runtime/problem.py
+references/source_manifest.md
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/candidate_destination.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/candidate_destination.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/constraints.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/constraints.txt
@@ -0,0 +1,4 @@
+Edit only `scripts/init.py`.
+Modify only code between `# EVOLVE-BLOCK-START` and `# EVOLVE-BLOCK-END` in that file.
+Do not modify files under `baseline/`, `runtime/`, `references/`, or `verification/`.
+Keep outputs valid and finite.
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/eval_command.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/eval_command.txt
@@ -0,0 +1 @@
+{python} verification/evaluator.py {candidate} --metrics-out metrics.json
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/eval_cwd.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/eval_cwd.txt
@@ -0,0 +1 @@
+.
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/initial_program.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/initial_program.txt
@@ -0,0 +1 @@
+scripts/init.py
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/readonly_files.txt b/benchmarks/ComputerSystems/DuckDBIndexSelection/frontier_eval/readonly_files.txt
@@ -0,0 +1,5 @@
+baseline/solution.py
+runtime/problem.py
+runtime/duckdb_local_workload.py
+verification/evaluator.py
+references/source_manifest.md
diff --git a/benchmarks/ComputerSystems/DuckDBIndexSelection/references/source_manifest.md b/benchmarks/ComputerSystems/DuckDBIndexSelection/references/source_manifest.md
@@ -0,0 +1,10 @@
+# Source Manifest
+
+- Upstream engine: `DuckDB`
+- Upstream lineage:
+  - DuckDB benchmark and TPC-H documentation
+  - DuckDB SQL and index support
+- Schema lineage: this benchmark uses a local frozen relational workload with `customer`, `orders`, and `lineitem` tables modeled after the TPC-H schema family.
+- Data provenance: rows are generated deterministically inside DuckDB from fixed SQL formulas and a fixed schema; this is a benchmark-local synthetic dataset, not official TPC-H `dbgen` output.
+- Authenticity note: the schema and workload lineage are traceable to official DuckDB/TPC-H benchmarking materials, but the data itself is a local frozen synthetic asset used because online extension-based generation was not reliable in this environment.
+- License lineage: DuckDB is released under the MIT License.
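The submission contract in `Task.md` (return a list of whitelisted names, or a dict with key `indexes`) can be illustrated with a minimal sketch. It assumes `workload_manifest["candidate_indexes"]` is a flat list of index-name strings, which the diff does not confirm. `normalize_selection` is a hypothetical helper written for illustration; it is not the validator in `runtime/problem.py`, and the "keep the first candidate" policy is only a placeholder.

```python
from __future__ import annotations


def normalize_selection(result, whitelist):
    """Turn a select_indexes() return value into a checked list of names.

    Hypothetical helper that mirrors the rules listed in Task.md; the real
    validator lives in runtime/problem.py and may behave differently.
    """
    # Task.md accepts either a list of names or a dict with key "indexes".
    if isinstance(result, dict):
        if "indexes" not in result:
            raise ValueError("dict return value must contain key 'indexes'")
        result = result["indexes"]
    if not isinstance(result, (list, tuple)):
        raise ValueError("return value cannot be parsed into a list of names")
    if not all(isinstance(name, str) for name in result):
        raise ValueError("every index name must be a string")

    # Any name outside the whitelist makes the submission invalid.
    allowed = set(whitelist)
    illegal = [name for name in result if name not in allowed]
    if illegal:
        raise ValueError(f"names outside the whitelist: {illegal}")
    return list(result)


def select_indexes(workload_manifest):
    # Assumption: candidate_indexes is a flat list of index-name strings.
    candidates = list(workload_manifest["candidate_indexes"])
    # Placeholder policy: keep at most the first candidate.
    return candidates[:1]
```

Returning `[]`, as `baseline/solution.py` does, passes the same checks and corresponds to the no-index configuration.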
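The metric names in `Task.md` suggest how the score relates to the timed phases. The sketch below assumes that `candidate_total_runtime_s` is the sum of the setup (index build) time and the time for the four workload runs; the diff states the timing protocol but not the exact formula, so treat this as an illustration. `summarize_metrics` is a hypothetical name, not a function in `runtime/problem.py`, and it only covers a valid run.

```python
def summarize_metrics(setup_s, workload_s, baseline_total_s):
    """Assemble the metrics listed in Task.md for a valid run.

    Assumption: total runtime = index build time + time for the four
    repeated workload executions.
    """
    total_s = setup_s + workload_s
    return {
        # Higher is better, so the score is the negated total runtime.
        "combined_score": -total_s,
        "valid": 1.0,
        "candidate_total_runtime_s": total_s,
        "baseline_total_runtime_s": baseline_total_s,
        "candidate_setup_runtime_s": setup_s,
        "candidate_workload_runtime_s": workload_s,
    }


# Made-up timings: the index set pays off only if the workload time it saves
# exceeds the build time it adds.
metrics = summarize_metrics(setup_s=0.5, workload_s=1.5, baseline_total_s=3.0)
beats_baseline = metrics["candidate_total_runtime_s"] < metrics["baseline_total_runtime_s"]
```

Under this reading, an index that shaves little off the four workload runs can still lower the score, because its build time is charged in full.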