CombiBench is the first benchmark focused on combinatorial problems, written in the formal language Lean 4. It is a manually curated benchmark of 100 combinatorics problems spanning a range of difficulties and knowledge levels, intended to serve as a standard for evaluating the combinatorial reasoning capabilities of automated theorem proving systems and to advance the field. For problems that require first providing a solution and then proving its correctness, we follow the style of PutnamBench.
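As a hypothetical illustration of this fill-in-the-blank style (the name and statement below are invented for exposition and are not an actual CombiBench problem), the answer is declared as a `sorry` placeholder that a system must supply, and a theorem then asserts the placeholder is the correct value:

```lean
import Mathlib

-- Hypothetical determination-style problem in the PutnamBench style:
-- a prover must fill in the answer and then prove it correct.
abbrev handshake_solution : ℕ := sorry

/-- Five people each shake hands with every other person exactly once.
The number of handshakes is `handshake_solution` (expected answer: 10). -/
theorem handshake_count : Nat.choose 5 2 = handshake_solution := by
  sorry
```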
We host a leaderboard and welcome evaluation results accompanied by a preprint or publication. Please reach out privately at [email protected] with any requests for additions to the leaderboard.
We collected all combinatorics problems from the official IMO since 2000, except for one problem that relies on a figure. From Brualdi's book, we randomly sampled 3 problems from each of 14 chapters, so that the resulting 42 problems are evenly distributed across those chapters. We also randomly selected 10 middle-school-level combinatorics problems from the mathematics problem collection website Hackmath, and randomly collected 12 problems from other mathematics competitions.
| Source | Count |
|---|---|
| Hackmath | 10 |
| Brualdi's book | 42 |
| IMO | 36 |
| APMO | 2 |
| Balticway | 1 |
| EGMO | 1 |
| IMO-Shortlist | 4 |
| IZHO | 2 |
| BXMO | 1 |
| USAMO | 1 |
Note: the complete proofs of Problem 3 and Problem 5 from IMO 2024 have already been formalized in mathlib4 (Archive/Imo2024Q3 and Archive/Imo2024Q5). We therefore reuse the statements of these two problems directly, along with the definitions used in those statements. We are very grateful to Joseph Myers, the author of these two formalizations, and we also appreciate his suggestions on the formalization of our problems.
This project requires Python >= 3.10 and Lean >= 4.15.0.
Install the Python dependencies:

```bash
pip install -e .
```

or, with uv:

```bash
uv venv .venv --prompt combibench
source .venv/bin/activate
uv sync
```
Follow https://github.com/project-numina/kimina-lean-server to configure a Lean server and obtain a custom URL and API key.
We support API providers such as OpenAI, Anthropic, TogetherAI, and Google GenerativeAI.
Refer to evaluation/config/template.json5 to configure the dataset, Lean server, LLM server, generation parameters, prompt, and parallelism settings.
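As a rough illustration only, a config covering those categories might look like the sketch below. The key names here are guesses for exposition, not the actual schema; consult template.json5 for the real fields:

```json5
{
  // Illustrative sketch only -- see evaluation/config/template.json5 for the real keys.
  dataset: "combibench",
  lean_server: {
    url: "http://localhost:8000",  // custom URL from your kimina-lean-server deployment
    api_key: "<API_KEY>",
  },
  llm: {
    provider: "openai",            // e.g. openai | anthropic | togetherai | google
    model: "<model name>",
  },
  generation: { temperature: 0.7, max_tokens: 4096 },
  prompt: "<prompt template>",
  parallel: { num_workers: 8 },
}
```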
To run one-stage Fine-Eval:

```bash
python evaluation/online_one_stage.py --config evaluation/config/template.json5
```

To run two-stage Fine-Eval:

```bash
python evaluation/online_two_stage.py --config evaluation/config/template.json5
```
Note that both evaluation methods support theorem-proving tasks as well as fill-in-the-blank tasks.
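Concretely, a fill-in-the-blank task leaves both the answer and the proof as `sorry` placeholders, as in the earlier sketch, while a theorem-proving task ships a fully specified statement where only the proof must be produced. A minimal hypothetical example of the latter (again invented for exposition, not taken from the benchmark):

```lean
import Mathlib

-- Hypothetical theorem-proving task: the statement is complete,
-- and a system only needs to supply the proof.
theorem bool_vector_count (n : ℕ) : Fintype.card (Fin n → Bool) = 2 ^ n := by
  sorry
```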
Contributions are welcome! If you notice any mistakes, please open an issue on the repository and we will address it.
This project is licensed under the MIT License. See the LICENSE file for full details.