Benchmarks #4
Comments
I opened a PR to update the readme, where I did some loose benchmarking. the API is a bit expensive for me right now, so maybe more thorough benchmarking is something I do in the future, but you can read up on my testing there. I'll paste a snippet of it below:

===

Beam Search

so beam search is pretty straightforward - it keeps track of the most promising solution paths as it goes. works really well when you've got problems with clear right answers, like math stuff or certain types of puzzles.

interesting thing i found while testing: when i threw 50 puzzles from the ARC-AGI benchmark at it, it only scored 24%. like, it wasn't completely lost, but... not great. here's how i tested it:
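(the test snippet itself isn't included in this paste, and what follows is not it. it's only a toy sketch of the "keep the most promising paths" idea, for readers who haven't seen beam search before. the problem, the function names and the scoring are all made up for illustration and have nothing to do with how MCP Reasoner or the ARC test run is actually implemented.)

```python
from typing import Callable, Iterable, Tuple

State = Tuple[int, ...]


def beam_search(
    start: State,
    expand: Callable[[State], Iterable[State]],
    score: Callable[[State], float],
    beam_width: int,
    steps: int,
) -> State:
    """Keep only the `beam_width` best partial paths after every step."""
    beam = [start]
    for _ in range(steps):
        # Grow every path currently in the beam by one step.
        candidates = [child for state in beam for child in expand(state)]
        if not candidates:
            break
        # Rank all candidates and drop everything outside the beam.
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return beam[0]


# Hypothetical toy problem: pick 4 numbers from {1, 2, 3} whose sum is 10.
TARGET = 10


def expand(state: State) -> Iterable[State]:
    return [state + (n,) for n in (1, 2, 3)]


def score(state: State) -> float:
    # Higher is better: partial sums closer to the target rank first.
    return -abs(TARGET - sum(state))


best = beam_search((), expand, score, beam_width=2, steps=4)
print(best, sum(best))
```

the point of the sketch is the pruning step: anything that doesn't look good right now gets thrown away, which is cheap but greedy.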
Monte Carlo Tree Search

now THIS is where it gets interesting. MCTS absolutely crushed it compared to beam search - we're talking 48% on a different set of 50 ARC puzzles. yeah yeah, maybe they were easier puzzles (this isn't an official benchmark or anything), but doubling the performance? that's not just luck.

the cool thing about MCTS is how it explores different possibilities. instead of just following what seems best right away, it tries out different paths to see what might work better in the long run. claude spent way more time understanding the examples before diving in, which probably helped a lot.

Why This Matters

adding structured reasoning to claude makes it way better... no der, right? but what's really interesting is how different methods work for different types of problems.

why'd i test on puzzles instead of coding problems? honestly, claude's already proven itself on stuff like polyglot and codeforces. i wanted to see how it handled more abstract reasoning - the kind of stuff that's harder to measure.
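for anyone wondering what "tries out different paths to see what might work better in the long run" looks like mechanically, here's a rough sketch of textbook MCTS (selection with UCB1, expansion, random rollout, backpropagation) on a made-up toy problem. everything in it is an assumption for illustration: the `Node` class, the reward, the constant `c=1.4` and the toy task are hypothetical and not taken from the MCP Reasoner code.

```python
import math
import random

ACTIONS = (1, 2, 3)   # toy task: pick DEPTH numbers that sum to TARGET
DEPTH = 4
TARGET = 10


class Node:
    def __init__(self, state=(), parent=None):
        self.state = state        # numbers chosen so far
        self.parent = parent
        self.children = {}        # action -> Node
        self.visits = 0
        self.total_reward = 0.0

    def is_terminal(self):
        return len(self.state) == DEPTH

    def fully_expanded(self):
        return len(self.children) == len(ACTIONS)


def reward(state):
    return 1.0 if sum(state) == TARGET else 0.0


def ucb1(parent, child, c=1.4):
    # Average reward so far plus a bonus for rarely visited children.
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore


def mcts(iterations, rng):
    root = Node()
    for _ in range(iterations):
        node = root
        # 1. Selection: walk down while every child has been tried.
        while not node.is_terminal() and node.fully_expanded():
            parent = node
            node = max(parent.children.values(), key=lambda ch: ucb1(parent, ch))
        # 2. Expansion: add one untried child.
        if not node.is_terminal():
            action = rng.choice([a for a in ACTIONS if a not in node.children])
            child = Node(node.state + (action,), parent=node)
            node.children[action] = child
            node = child
        # 3. Rollout: finish the path with random choices.
        state = node.state
        while len(state) < DEPTH:
            state += (rng.choice(ACTIONS),)
        r = reward(state)
        # 4. Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += r
            node = node.parent
    # Read out the most visited path.
    path, node = [], root
    while node.children:
        action, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        path.append(action)
    return tuple(path)


path = mcts(iterations=2000, rng=random.Random(0))
print(path, sum(path))
```

the difference from beam search is the exploration bonus in `ucb1`: a branch that looked bad early still gets revisited now and then, so it isn't thrown away for good after one weak estimate.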
3.8B and 7B models are able to beat o1 on math with MCTS, so it's not unrealistic that it crushes beam search.
I agree but for testing purposes we just wanted to see
It would be great to see some benchmarks for MCP Reasoner
I'm interested to compare this to the Sequential Thinking MCP Server by @Skirano