diff --git a/docs/leaderboard.html b/docs/leaderboard.html
index 3f79b46..3a74b1e 100644
--- a/docs/leaderboard.html
+++ b/docs/leaderboard.html
@@ -105,10 +105,11 @@
Legend
💚 Fully open-sourced method
💙 Partially open-sourced method
FS: Few-shot prompting with 1-2 examples
- COPRA: Symbolic proof search agent
+ COPRA: Symbolic proof search agent
[arXiv]
Compiled: Code is syntactically valid and type-checks
Proved: Proofs accepted by Lean's kernel
Pass@k-sec: Success rate within 600-second time budget (k=600)
+ # Proofs Generated: # Proofs generated while Spec and Impl Certification
@@ -116,68 +117,85 @@ Legend
🏆 End-to-End Performance Rankings
- Complete pipeline success: specification certification + implementation certification (Pass@600-sec)
+ Complete pipeline success: specification certification + implementation certification (Pass@600-sec)
+ Sorted based on first "End-to-End Code Generation" column, then "# Proofs Generated"
| # |
- Model |
+ Model |
Approach |
- End-to-End |
- Problems Solved |
+ End-to-End Code Generation |
Note |
+ # Proofs Generated |
| 1 |
- 💙 GPT-4o mini |
- Few-Shot |
- 0.621% |
- 1/161 |
+ 💙 Claude-3.7 |
+ COPRA-enhanced |
+ 1/161 |
Problem 53 |
+ 2/161 (spec) + 14/161 (impl) = 16/282 |
| 1 |
- 💙 Claude-3.7 |
+ 💚 DeepSeek-R1 |
Few-Shot |
- 0.621% |
- 1/161 |
+ 1/161 |
Problem 53 |
+ 1/161 (spec) + 9/161 (impl) = 10/282 |
| 1 |
- 💙 DeepSeek-R1 |
- Few-Shot |
- 0.621% |
- 1/161 |
+ 💚 GPT OSS 20b |
+ COPRA-enhanced |
+ 1/161 |
Problem 53 |
+ 2/161 (spec) + 8/161 (impl) = 10/282 |
| 1 |
💙 GPT-4o |
COPRA-enhanced |
- 0.621% |
- 1/161 |
+ 1/161 |
+ Problem 53 |
+ 3/161 (spec) + 6/161 (impl) = 9/282 |
+
+
+ | 1 |
+ 💙 GPT-4o mini |
+ Few-Shot |
+ 1/161 |
Problem 53 |
+ 2/161 (spec) + 3 / 161 (impl) = 5/282 |
| 1 |
💙 Claude-3.7 |
- COPRA-enhanced |
- 0.621% |
- 1/161 |
+ Few-Shot |
+ 1/161 |
Problem 53 |
+ 1/161 (spec) + 3/161 (impl) = 4/282 |
| 6 |
💙 GPT-4o |
Few-Shot |
- 0% |
0/161 |
- |
+ 1/161 (spec) + 1/161 (impl) = 2/282 |
+
+
+ | 6 |
+ 💙 GPT-5 mini (For Code Generation) + Kimina Prover (For proofs) |
+ Few-Shot |
+ 0/161 |
+ - |
+ 0/161 (spec) + 1/161 (impl) = 1/282 |
@@ -188,7 +206,9 @@
Legend
📊 Detailed Performance Breakdown
- Pass@600-sec rates for specification certification and implementation certification
+ Pass@600-sec rates for specification certification and implementation certification
+ (Compiled: syntactically valid and type-checks; Proved: proofs accepted by Lean's kernel)
+ Sorted based on average compilation rates across both stages
@@ -211,15 +231,6 @@ Legend
| Few-Shot Baseline |
-
- | 💙 GPT-4o |
- Few-Shot |
- 84.472% |
- 0.621% |
- 68.323% |
- 0.621% |
- 0% |
-
| 💙 GPT-4o mini |
Few-Shot |
@@ -239,7 +250,16 @@ Legend
0.621% |
- | 💙 DeepSeek-R1 |
+ 💙 GPT-4o |
+ Few-Shot |
+ 84.472% |
+ 0.621% |
+ 68.323% |
+ 0.621% |
+ 0% |
+
+
+ | 💚 DeepSeek-R1 |
Few-Shot |
71.42% |
0.621% |
@@ -250,15 +270,6 @@ Legend
| COPRA Baseline |
-
- | 💙 GPT-4o |
- COPRA-enhanced |
- 76.398% |
- 1.863% |
- 68.323% |
- 3.727% |
- 0.621% |
-
| 💙 Claude-3.7 |
COPRA-enhanced |
@@ -268,6 +279,24 @@ Legend
8.696% |
0.621% |
+
+ | 💚 GPT OSS 20b |
+ COPRA-enhanced |
+ 78.261% |
+ 1.242% |
+ 65.839% |
+ 4.969% |
+ 0.621% |
+
+
+ | 💙 GPT-4o |
+ COPRA-enhanced |
+ 76.398% |
+ 1.863% |
+ 68.323% |
+ 3.727% |
+ 0.621% |
+