At ICSE 2025's LLM4Code workshop.
Lifting APPS to Lean with LLM-generated theorem statements.
datasets.load_dataset("quinn-dougherty/fvapps", split="train")

If you'd like to cite FVAPPS, please use
@misc{dougherty2025provingcodinginterviewbenchmark,
title={Proving the Coding Interview: A Benchmark for Formally Verified Code Generation},
author={Quinn Dougherty and Ronak Mehta},
year={2025},
eprint={2502.05714},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2502.05714},
}

Install elan and rye, do your $PATH munging, and source .venv/bin/activate.
First, install mathlib with lake update. This should pull the precompiled mathlib cache too, but if it doesn't and you find yourself waiting for mathlib to compile, run lake exe cache get in the Lake project.
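If you do need to fetch the cache manually, that would look something like this (the Lake project lives at the path used in the command sequence below):
$ cd artefacts/baselines/solve-fvapps
$ lake exe cache get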
--model {llama,prover-rl} require a local GPU; the other models require API keys.
$ rye sync
$ cd artefacts/baselines/solve-fvapps
$ lake update
$ cd ./../../../
$ rye run baselines --help
usage: FVApps Baselines [-h] [--model {sonnet,o1-mini,gemini,prover-rl,llama,testhf}] [--split {train,test}]
[--start_idx START_IDX] [--end_idx END_IDX]
options:
-h, --help show this help message and exit
--model {sonnet,o1-mini,gemini,prover-rl,llama,testhf}
model name (default: sonnet)
--split {train,test} train or test (default: train)
--start_idx START_IDX
index to start pulling from apps
--end_idx END_IDX index to end pulling from apps (inclusive)
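As an illustrative run (the index range is arbitrary), the default Sonnet baseline on the first ten training problems would be:
$ rye run baselines --model sonnet --split train --start_idx 0 --end_idx 9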
Install elan and install/update to a nightly toolchain. Install rye.
rye sync
. .venv/bin/activate
Sourcing the .venv makes sure we're not relying on a pytest executable installed globally on the machine.
Create a .env file with the following:
ANTHROPIC_API_KEY="YOUR_KEY_HERE"
On the Linux server you'll need to install parallel, and maybe screen.
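On a Debian or Ubuntu server, for example (assuming an apt-based distro), that's:
$ sudo apt-get install parallel screen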
First, we preprocess the APPS solutions.
$ rye run preprocess --help
usage: preprocess [-h] [--split SPLIT] [--start_idx START_IDX] [--end_idx END_IDX]
options:
-h, --help show this help message and exit
--split SPLIT Train or test split. Default: train.
--start_idx START_IDX
Start index for the dataset.
--end_idx END_IDX End index for the dataset.
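For example, to preprocess a small slice of the training split (illustrative index range):
$ rye run preprocess --split train --start_idx 0 --end_idx 9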
The preprocessed solutions will populate in artefacts/apps/train/{i}.
Then, two agents generate property tests and sorry'd-out Lean theorem statements, respectively.
$ rye run fvapps --help
usage: FV-APPS full generation run [-h] [--split SPLIT] [--start_idx START_IDX] [--end_idx END_IDX]
[--skip_lean] [--skip_python]
options:
-h, --help show this help message and exit
--split SPLIT train or test (default: train)
--start_idx START_IDX
index to start pulling from apps
--end_idx END_IDX index to end pulling from apps
--skip_lean skips lean when present
--skip_python skips python when present
rye run fvapps --skip_lean requires rye run preprocess to have been run first, and rye run fvapps --skip_python requires both the preprocessing step and the fvapps Python step to have been run first. (A FileNotFoundError will guide you toward this understanding regardless.)
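Concretely, a generation pass over a slice could be run in two stages, Python property tests first and then Lean theorem statements (index range illustrative):
$ rye run fvapps --split train --start_idx 0 --end_idx 9 --skip_lean
$ rye run fvapps --split train --start_idx 0 --end_idx 9 --skip_python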
$ rye run qa_autoformalize
$ rye run qa_plausible
The last step is to trim the artefacts down to their Hugging Face form.
rye run postprocess
