Motia chess bench and Arena (WIP) #52
Open
Bot-Rakshit wants to merge 41 commits into
- Add methodology page with evaluation metrics, flow diagram, and prompt transparency
- Implement Variant B (unguided) prompt template
  - AI receives only FEN, no legal moves
- Add variant selector (guided/unguided) to game creation flow
- Update landing page with methodology link and button
- Add BenchmarkVariant type to game schema
- Hero section with benchmark description and key metrics
- Methodology prominent with View Methodology button
- Leaderboard section below the fold
- Quick action cards for Live, Methodology, About
- Sticky header with auth and GitHub link
- Full-width scrollable layout
- Keep original PageGrid two-column layout (leaderboard left, controls right)
- Keep original components (ChessArenaLogo, TopBar, CreateGameButton)
- Update copy to be benchmark-focused
- Add methodology link and button
- Change CTA back to 'Create Game' (not 'Run Benchmark')
- Add 3-button grid: Live, Rankings, Methodology
@Bot-Rakshit is attempting to deploy a commit to the motia Team on Vercel. A member of the Team first needs to authorize it.
- Add GameHistory and GameHistoryFilter types with full game data
- Add createdAt timestamp to Game schema
- Create chessGameHistory stream for persisting completed games
- Add PGN generation service for exporting games
- Add createdAt timestamp when creating games
- Archive completed games to history in game-ended step
- Add GET /chess/history endpoint with filtering (provider, model, variant, status, date)
- Add GET /chess/history/:gameId for full game details with moves/messages
- Add GET /chess/history/export for CSV/JSON export
- Add useGameHistory and useGameHistoryDetail hooks
- Create game history page with filtering and export options
- Create game replay page with:
  - Move-by-move navigation with autoplay
  - AI reasoning panel showing model thoughts
  - Move list sidebar and scoreboard
  - PGN download
- Add routes for /history and /history/:gameId
- Add History link to landing page navigation
- Remove unused GameSchema import from game-history.ts
- Add promotion detection to PGN generation (extract from FEN diff)
- Rename export endpoint file (10b) to load before detail endpoint (11)
- Add proper CSV escaping for all string fields to prevent injection
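The CSV escaping mentioned in the last item could look roughly like the sketch below. The helper names `escapeCsvField` and `toCsvRow` are hypothetical and not taken from the PR, and the single-quote prefix for formula-like values is one common mitigation, assumed here rather than confirmed by the source.

```typescript
// Hypothetical helpers; names and the exact mitigation are assumptions.
// Values starting with these characters can be read as formulas by spreadsheets.
const FORMULA_PREFIX = /^[=+\-@\t\r]/;

function escapeCsvField(value: string): string {
  // Prefix formula-like values with a single quote so they are treated as text.
  const safe = FORMULA_PREFIX.test(value) ? `'${value}` : value;
  // Quote the field if it contains a comma, a double quote, or a line break,
  // doubling any embedded double quotes.
  if (/[",\r\n]/.test(safe)) {
    return `"${safe.replace(/"/g, '""')}"`;
  }
  return safe;
}

function toCsvRow(fields: string[]): string {
  return fields.map(escapeCsvField).join(",");
}
```

For example, `toCsvRow(["openai", "=SUM(A1)"])` would emit the second field as text rather than as a formula.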
- Add gpt-5.2-high, gpt-3.5-turbo-instruct
- Remove gpt-4.1-nano, gpt-4o-mini
- Add gemini-3.0-pro-preview, gemini-2.5-pro
- Remove deprecated gemini-2.0 models
- Add claude-opus-4.5, remove claude-opus-4.1
- Add TestPosition, ModelBenchmarkResult, LegalMoveBenchmarkRun types
- Add LegalMoveBenchmarkSummary for aggregated scores
- Create streams for storing benchmark runs and summaries
Backend:
- Add benchmark-prompt.ts for LLM calls with moves[] schema
- Add run-legal-move-benchmark.ts with position generator
- Generate 20 random positions (5-25 legal moves, 8-60 half-moves)

API Endpoints:
- POST /benchmark/legal-moves/run - Run benchmark for a model
- GET /benchmark/legal-moves/runs - List all benchmark runs
- GET /benchmark/legal-moves/runs/:runId - Get detailed run with all positions/results
- GET /benchmark/legal-moves/leaderboard - Model rankings by score

Scoring: accuracy% - (illegal_moves * 5) penalty
- Add LichessPuzzle, PuzzleSet, PuzzleBenchmarkRun types
- Add PuzzleBenchmarkSummary for aggregated scores
- Create streams for puzzle sets and benchmark results
Services:
- fetch-lichess-puzzles.ts: Fetch puzzles from Lichess API (mateIn1, oneMove)
- run-puzzle-benchmark.ts: Run puzzles against LLMs

API Endpoints:
- POST /benchmark/puzzles/fetch - Fetch & cache 50 puzzles per theme
- POST /benchmark/puzzles/run - Run benchmark for a model
- GET /benchmark/puzzles/leaderboard - Model rankings
- GET /benchmark/puzzles/sets - List cached puzzle sets

All LLMs are tested on the same cached puzzles for fair comparison.
- Remove unguided/FEN-only mode option from game creation
- All games now use guided mode (legal moves provided)
- Simplifies evaluation: legal move gen benchmark tests rule knowledge instead
Fixed env var mismatch - benchmark services were using GOOGLE_API_KEY but .env and gemini.ts use GEMINI_API_KEY
Lichess puzzles require playing one additional move after initialPly to reach the actual puzzle position where the solver makes their move.
Motia doesn't support querySchema - must use queryParams array format
- Add retry with increasing backoff (5s, 10s, 15s) on 429 responses
- Increase delay between requests from 500ms to 1500ms
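A minimal sketch of retrying on 429 with the 5s, 10s, 15s schedule from this commit. Note that these delays grow linearly, which the sketch mirrors. The names `retryDelayMs` and `retryOn429` are hypothetical, and the request is passed in as a callback so the sketch does not assume which HTTP client the PR uses.

```typescript
// Delay schedule taken from the commit message; everything else is assumed.
const RETRY_DELAYS_MS = [5000, 10000, 15000];

// Delay before retry number `attempt` (0-based), or undefined once retries run out.
function retryDelayMs(attempt: number): number | undefined {
  return RETRY_DELAYS_MS[attempt];
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `request` until it returns a non-429 status or the schedule is exhausted.
// The last response is returned either way, so the caller decides how to handle it.
async function retryOn429<T extends { status: number }>(
  request: () => Promise<T>,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    const response = await request();
    const delay = retryDelayMs(attempt);
    if (response.status !== 429 || delay === undefined) {
      return response;
    }
    await wait(delay);
  }
}
```

Passing a custom `wait` function makes the loop easy to exercise in tests without real delays.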
- Single request instead of 50 individual requests
- Falls back to mix endpoint if theme-specific fails
- Add position set stream to store generated positions
- Add /benchmark/positions/generate endpoint to create positions
- Run benchmark now uses stored positions (same for all models)
- Auto-generates if no position set exists
Old: accuracy - (illegal * 5) -> too harsh, scores went to 0

New: F1 score (harmonic mean of precision and recall)
- Recall: what % of legal moves were found
- Precision: what % of answers were correct
- F1: balanced score between both
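The F1 scoring described above can be sketched as follows. The function name `scoreMoves` and the choice to deduplicate repeated answers are assumptions made for illustration, not details confirmed by the PR.

```typescript
interface MoveScore {
  precision: number; // share of answered moves that are legal
  recall: number;    // share of legal moves that were answered
  f1: number;        // harmonic mean of precision and recall
}

// Hypothetical scorer: compares the model's answers with the true legal moves.
function scoreMoves(legalMoves: string[], answered: string[]): MoveScore {
  const legal = new Set(legalMoves);
  // Assumption: repeated answers count once.
  const unique = Array.from(new Set(answered));
  const correct = unique.filter((move) => legal.has(move)).length;

  const precision = unique.length === 0 ? 0 : correct / unique.length;
  const recall = legal.size === 0 ? 0 : correct / legal.size;
  const f1 =
    precision + recall === 0
      ? 0
      : (2 * precision * recall) / (precision + recall);

  return { precision, recall, f1 };
}
```

With four legal moves and three answers of which two are legal, precision is 2/3, recall is 1/2, and F1 is 4/7 (about 0.57). Unlike the old penalty formula, the score stays within 0 to 1 however many illegal moves are returned.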
Backend:
- Add random-ai-selection service with weighted probabilities
  - Cheaper models (haiku, flash, mini) have 10x weight
  - Mid-tier models have 5x weight
  - Expensive models (opus, gpt-5) have 1-2x weight
- New endpoint POST /chess/play-vs-ai

Frontend:
- New /play-ai page with color selection
- Shows opponent model after matching
- Updated landing page with Play vs AI button
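Weighted selection of an opponent could be implemented along these lines. The `WeightedModel` shape, the `pickWeighted` name, and the placeholder model names are hypothetical. Only the 10x / 5x / 1x weights come from the commit message.

```typescript
interface WeightedModel {
  name: string;
  weight: number;
}

// Picks one model with probability proportional to its weight.
// `random` is injectable so the selection can be tested deterministically.
function pickWeighted(
  models: WeightedModel[],
  random: () => number = Math.random,
): WeightedModel {
  if (models.length === 0) {
    throw new Error("pickWeighted requires at least one model");
  }
  const total = models.reduce((sum, model) => sum + model.weight, 0);
  let threshold = random() * total;
  for (const model of models) {
    threshold -= model.weight;
    if (threshold < 0) {
      return model;
    }
  }
  // Guards against floating-point edge cases.
  return models[models.length - 1];
}

// Placeholder names; the weights follow the tiers listed in the commit.
const pool: WeightedModel[] = [
  { name: "cheap-model", weight: 10 },
  { name: "mid-model", weight: 5 },
  { name: "expensive-model", weight: 1 },
];
```

With this pool the cheap model is chosen in 10 of 16 cases on average, which keeps API costs down while still occasionally matching players against the expensive model.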
Models were returning long algebraic notation (Ng1f3) instead of standard algebraic notation (Nf3). Added explicit CORRECT/WRONG examples to guide models.
POST /benchmark/legal-moves/run-all
- Creates 20 positions if needed
- Runs ALL models in parallel (fire and forget)
- Returns immediately with list of queued models
Running 22 models simultaneously hits API rate limits. Now runs 3 models at a time with 5s delay between batches.
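A rough sketch of running models three at a time with a pause between batches. The helper names `chunk` and `runInBatches` are hypothetical. The batch size of 3 and the 5s delay are the values from this commit.

```typescript
// Splits a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Runs each batch in parallel, waiting `delayMs` between batches.
// Note: Promise.all rejects as soon as one task fails; a real runner
// would likely catch per-model errors inside `run`.
async function runInBatches<T>(
  items: T[],
  run: (item: T) => Promise<void>,
  batchSize = 3,
  delayMs = 5000,
): Promise<void> {
  const batches = chunk(items, batchSize);
  for (let i = 0; i < batches.length; i++) {
    await Promise.all(batches[i].map(run));
    if (i < batches.length - 1) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

For 22 models this yields eight batches: seven of three models and a final batch of one.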
…ider in parallel

New flow:
- Round 1: Run first model from each provider (4 models) on all 20 positions
- Round 2: Run second model from each provider on all 20 positions
- etc.

Clear logging:
- Position 1/20 (25 legal moves)
  - openai/gpt-5: 20/25 correct, 3 illegal, score=75%
  - gemini/flash: 18/25 correct, 5 illegal, score=65%
  ...

This avoids rate limits by only hitting each provider once per position.
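The round structure above (one model per provider per round) can be sketched as below. The `buildRounds` name and the `ModelsByProvider` shape are assumptions for illustration. The PR's actual data structures are not shown in the commit message.

```typescript
// Maps a provider name to its ordered list of models (hypothetical shape).
type ModelsByProvider = Record<string, string[]>;

// Round i contains the i-th model of every provider that has one,
// so each provider is hit by at most one model at a time.
function buildRounds(modelsByProvider: ModelsByProvider): string[][] {
  const providers = Object.keys(modelsByProvider);
  const maxModels = Math.max(
    0,
    ...providers.map((provider) => modelsByProvider[provider].length),
  );

  const rounds: string[][] = [];
  for (let i = 0; i < maxModels; i++) {
    const round: string[] = [];
    for (const provider of providers) {
      const model = modelsByProvider[provider][i];
      if (model !== undefined) {
        round.push(`${provider}/${model}`);
      }
    }
    rounds.push(round);
  }
  return rounds;
}
```

Providers with fewer models simply drop out of later rounds, so later rounds may be smaller than the first.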
- Types for Stockfish game results and benchmark runs
- StockfishEngine class to communicate via UCI protocol
- playGameAgainstStockfish service plays full games
- Calculates ACPL, blunders, mistakes, inaccuracies
- POST /benchmark/stockfish/run - runs 2 games (as white and black)
- GET /benchmark/stockfish/leaderboard - rankings by ACPL

Uses same AI prompt as regular games for fair comparison.
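For readers unfamiliar with ACPL (average centipawn loss), here is a minimal sketch of how it and the move classifications might be computed from engine evaluations. The `MoveEval` shape, the `summarizeGame` name, and the 50 / 100 / 300 centipawn thresholds are assumptions. The PR does not state which thresholds it uses.

```typescript
// Engine evaluations in centipawns, both from the mover's point of view.
interface MoveEval {
  bestCp: number;   // evaluation after the engine's best move
  playedCp: number; // evaluation after the move actually played
}

interface GameSummary {
  acpl: number;
  inaccuracies: number;
  mistakes: number;
  blunders: number;
}

// Assumed thresholds (centipawns lost); not taken from the PR.
const INACCURACY_CP = 50;
const MISTAKE_CP = 100;
const BLUNDER_CP = 300;

function summarizeGame(evals: MoveEval[]): GameSummary {
  let totalLoss = 0;
  let inaccuracies = 0;
  let mistakes = 0;
  let blunders = 0;

  for (const e of evals) {
    // A move cannot be better than the engine's best, so clamp at zero.
    const loss = Math.max(0, e.bestCp - e.playedCp);
    totalLoss += loss;
    if (loss >= BLUNDER_CP) blunders++;
    else if (loss >= MISTAKE_CP) mistakes++;
    else if (loss >= INACCURACY_CP) inaccuracies++;
  }

  const acpl = evals.length === 0 ? 0 : totalLoss / evals.length;
  return { acpl, inaccuracies, mistakes, blunders };
}
```

A lower ACPL means the model's moves stayed closer to the engine's preferred moves, which is why the leaderboard can rank by it.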
- Log timing for each API call
- Check and report missing API keys
- Identify common errors: 401, 404, 429, timeout
- Show elapsed time on success/failure
AbortSignal.timeout wasn't triggering. Now using manual timeout wrapper.

Also added more granular logging to see where it hangs:
1. Starting API call
2. API key present, creating provider model
3. Provider model created, calling generateObject
4. SUCCESS/FAILED
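A manual timeout wrapper of the kind mentioned above might look like this. The `withTimeout` name and signature are hypothetical and not taken from the PR.

```typescript
// Rejects if `promise` has not settled within `ms` milliseconds.
// Note: this only stops waiting; it does not cancel the underlying request.
function withTimeout<T>(
  promise: Promise<T>,
  ms: number,
  label = "operation",
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new Error(`${label} timed out after ${ms}ms`));
    }, ms);

    promise.then(
      (value) => {
        clearTimeout(timer); // avoid keeping the process alive
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

A caller would wrap the model call, for example `await withTimeout(callModel(), 30000, "model call")`, where `callModel` stands in for whatever function performs the request.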
Match the working AI vs AI game pattern which uses streamObject
- fix typo and add real-time stream hooks to bench page
- add flow diagram and benchmark comparison section
- extract shared utils and fix concurrency race condition
- align model names in random-ai-selection
- remove unused ScoreGauge, improve type safety

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Note: This is a draft PR for checkpoint 1. I will ping the relevant maintainers when it is ready for review!
Changes