diff --git a/docs/leaderboard.html b/docs/leaderboard.html index 3f79b46..3a74b1e 100644 --- a/docs/leaderboard.html +++ b/docs/leaderboard.html @@ -105,10 +105,11 @@
Legend
💚 Fully open-sourced method
💙 Partially open-sourced method
FS: Few-shot prompting with 1-2 examples
-
COPRA: Symbolic proof search agent
+
COPRA: Symbolic proof search agent [arXiv]
Compiled: Code is syntactically valid and type-checks
Proved: Proofs accepted by Lean's kernel
Pass@k-sec: Success rate within 600-second time budget (k=600)
+
# Proofs Generated: # Proofs generated while Spec and Impl Certification
@@ -116,68 +117,85 @@
Legend
🏆 End-to-End Performance Rankings
- Complete pipeline success: specification certification + implementation certification (Pass@600-sec) + Complete pipeline success: specification certification + implementation certification (Pass@600-sec)
+ Sorted based on first "End-to-End Code Generation" column, then "# Proofs Generated"
- + - - + + - - - - + + + + - + - - + + - - - - + + + + - - + + + + + + + + + + - - - + + + - + + + + + + + + +
#ModelModel ApproachEnd-to-EndProblems SolvedEnd-to-End Code Generation Note# Proofs Generated
1💙 GPT-4o miniFew-Shot0.621%1/161💙 Claude-3.7COPRA-enhanced1/161 Problem 532/161 (spec) + 14/161 (impl) = 16/282
1💙 Claude-3.7💚 DeepSeek-R1 Few-Shot0.621%1/1611/161 Problem 531/161 (spec) + 9/161 (impl) = 10/282
1💙 DeepSeek-R1Few-Shot0.621%1/161💚 GPT OSS 20bCOPRA-enhanced1/161 Problem 532/161 (spec) + 8/161 (impl) = 10/282
1 💙 GPT-4o COPRA-enhanced0.621%1/1611/161Problem 533/161 (spec) + 6/161 (impl) = 9/282
1💙 GPT-4o miniFew-Shot1/161 Problem 532/161 (spec) + 3 / 161 (impl) = 5/282
1 💙 Claude-3.7COPRA-enhanced0.621%1/161Few-Shot1/161 Problem 531/161 (spec) + 3/161 (impl) = 4/282
6 💙 GPT-4o Few-Shot0% 0/161 -1/161 (spec) + 1/161 (impl) = 2/282
6💙 GPT-5 mini (For Code Generation) + Kimina Prover (For proofs)Few-Shot0/161-0/161 (spec) + 1/161 (impl) = 1/282
@@ -188,7 +206,9 @@
Legend
📊 Detailed Performance Breakdown
- Pass@600-sec rates for specification certification and implementation certification + Pass@600-sec rates for specification certification and implementation certification
+ (Compiled: syntactically valid and type-checks; Proved: proofs accepted by Lean's kernel)
+ Sorted based on average compilation rates across both stages
@@ -211,15 +231,6 @@
Legend
- - - - - - - - - @@ -239,7 +250,16 @@
Legend
- + + + + + + + + + + @@ -250,15 +270,6 @@
Legend
- - - - - - - - - @@ -268,6 +279,24 @@
Legend
+ + + + + + + + + + + + + + + + + +
Few-Shot Baseline
💙 GPT-4oFew-Shot84.472%0.621%68.323%0.621%0%
💙 GPT-4o mini Few-Shot 0.621%
💙 DeepSeek-R1💙 GPT-4oFew-Shot84.472%0.621%68.323%0.621%0%
💚 DeepSeek-R1 Few-Shot 71.42% 0.621%
COPRA Baseline
💙 GPT-4oCOPRA-enhanced76.398%1.863%68.323%3.727%0.621%
💙 Claude-3.7 COPRA-enhanced 8.696% 0.621%
💚 GPT OSS 20bCOPRA-enhanced78.261%1.242%65.839%4.969%0.621%
💙 GPT-4oCOPRA-enhanced76.398%1.863%68.323%3.727%0.621%