Motia chess bench and Arena.(WIP) by Bot-Rakshit · Pull Request #52 · MotiaDev/chessarena-ai

Bot-Rakshit · 2025-12-05T17:37:07Z

Note: This is a draft PR for checkpoint 1. I will ping the relevant maintainers when it is ready for review.

Changes

 - Add `/methodology` page with full benchmark documentation
   - Evaluation pipeline diagram
   - Key metrics (Legal Move Rate, Centipawn Loss, Blunder Rate, Completion Rate)
   - Benchmark variant comparison (Guided vs Unguided)
   - Prompt transparency with exact template
   - Comparison table with other LLM benchmarks

 - Implement Variant B (Unguided) mode
   - New prompt template (AI receives only FEN, no legal moves)
   - Variant selector in game creation UI
   - Backend support for variant in game schema

 - Update landing page
   - Add methodology link and button
   - Updated copy for benchmark focus

- Add methodology page with evaluation metrics, flow diagram, and prompt transparency
- Implement Variant B (unguided) prompt template
  - AI receives only FEN, no legal moves
- Add variant selector (guided/unguided) to game creation flow
- Update landing page with methodology link and button
- Add BenchmarkVariant type to game schema

- Hero section with benchmark description and key metrics
- Methodology prominent with View Methodology button
- Leaderboard section below the fold
- Quick action cards for Live, Methodology, About
- Sticky header with auth and GitHub link
- Full-width scrollable layout

- Keep original PageGrid two-column layout (leaderboard left, controls right)
- Keep original components (ChessArenaLogo, TopBar, CreateGameButton)
- Update copy to be benchmark-focused
- Add methodology link and button
- Change CTA back to 'Create Game' (not 'Run Benchmark')
- Add 3-button grid: Live, Rankings, Methodology

vercel · 2025-12-05T17:37:13Z

@Bot-Rakshit is attempting to deploy a commit to the motia Team on Vercel.

A member of the Team first needs to authorize it.

- Add GameHistory and GameHistoryFilter types with full game data
- Add createdAt timestamp to Game schema
- Create chessGameHistory stream for persisting completed games

- Add PGN generation service for exporting games
- Add createdAt timestamp when creating games
- Archive completed games to history in game-ended step
- Add GET /chess/history endpoint with filtering (provider, model, variant, status, date)
- Add GET /chess/history/:gameId for full game details with moves/messages
- Add GET /chess/history/export for CSV/JSON export

- Add useGameHistory and useGameHistoryDetail hooks
- Create game history page with filtering and export options
- Create game replay page with:
  - Move-by-move navigation with autoplay
  - AI reasoning panel showing model thoughts
  - Move list sidebar and scoreboard
  - PGN download
- Add routes for /history and /history/:gameId
- Add History link to landing page navigation

- Remove unused GameSchema import from game-history.ts
- Add promotion detection to PGN generation (extract from FEN diff)
- Rename export endpoint file (10b) to load before detail endpoint (11)
- Add proper CSV escaping for all string fields to prevent injection
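
The commit message does not show the escaping code. As a minimal sketch of one common approach, assuming the concern is spreadsheet formula injection on top of ordinary CSV quoting, the helpers below neutralise formula-like prefixes and quote fields that contain delimiters. `escapeCsvField` and `toCsvRow` are hypothetical names, not identifiers from this PR.

```typescript
// Leading characters that spreadsheet applications may treat as a formula start.
// This prefix list is an assumption, not taken from the PR.
const FORMULA_PREFIXES = ["=", "+", "-", "@", "\t", "\r"];

function escapeCsvField(value: string): string {
  // Prefix formula-like values with a single quote so they are read as text.
  const isFormulaLike = FORMULA_PREFIXES.some((prefix) => value.startsWith(prefix));
  const guarded = isFormulaLike ? `'${value}` : value;

  // Quote fields containing a comma, double quote, or line break,
  // doubling any embedded double quotes.
  if (/[",\r\n]/.test(guarded)) {
    return `"${guarded.replace(/"/g, '""')}"`;
  }
  return guarded;
}

function toCsvRow(fields: string[]): string {
  return fields.map(escapeCsvField).join(",");
}
```

Because the guard applies to every string starting with `-` or `+`, it should only be used on free-text fields; numeric columns would need to bypass it.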

- Add gpt-5.2-high, gpt-3.5-turbo-instruct
- Remove gpt-4.1-nano, gpt-4o-mini
- Add gemini-3.0-pro-preview, gemini-2.5-pro
- Remove deprecated gemini-2.0 models
- Add claude-opus-4.5, remove claude-opus-4.1

- Add TestPosition, ModelBenchmarkResult, LegalMoveBenchmarkRun types
- Add LegalMoveBenchmarkSummary for aggregated scores
- Create streams for storing benchmark runs and summaries

Backend:
- Add benchmark-prompt.ts for LLM calls with moves[] schema
- Add run-legal-move-benchmark.ts with position generator
- Generate 20 random positions (5-25 legal moves, 8-60 half-moves)

API Endpoints:
- POST /benchmark/legal-moves/run - Run benchmark for a model
- GET /benchmark/legal-moves/runs - List all benchmark runs
- GET /benchmark/legal-moves/runs/:runId - Get detailed run with all positions/results
- GET /benchmark/legal-moves/leaderboard - Model rankings by score

Scoring: accuracy% - (illegal_moves * 5) penalty

- Add LichessPuzzle, PuzzleSet, PuzzleBenchmarkRun types
- Add PuzzleBenchmarkSummary for aggregated scores
- Create streams for puzzle sets and benchmark results

Services:
- fetch-lichess-puzzles.ts: Fetch puzzles from Lichess API (mateIn1, oneMove)
- run-puzzle-benchmark.ts: Run puzzles against LLMs

API Endpoints:
- POST /benchmark/puzzles/fetch - Fetch & cache 50 puzzles per theme
- POST /benchmark/puzzles/run - Run benchmark for a model
- GET /benchmark/puzzles/leaderboard - Model rankings
- GET /benchmark/puzzles/sets - List cached puzzle sets

All LLMs tested on same cached puzzles for fair comparison

- Remove unguided/FEN-only mode option from game creation
- All games now use guided mode (legal moves provided)
- Simplifies evaluation: legal move gen benchmark tests rule knowledge instead

Fixed env var mismatch - benchmark services were using GOOGLE_API_KEY but .env and gemini.ts use GEMINI_API_KEY

Lichess puzzles require playing one additional move after initialPly to reach the actual puzzle position where the solver makes their move.

Motia doesn't support querySchema - must use queryParams array format

- Add retry with exponential backoff (5s, 10s, 15s) on 429 responses
- Increase delay between requests from 500ms to 1500ms
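
A retry-on-429 loop of this shape can be sketched as follows. This is an illustration, not the PR's code: `withRateLimitRetry` and its injectable `sleep` parameter are hypothetical. The sketch uses the literal delay schedule from the commit message (5s, 10s, 15s), which grows linearly even though the message calls it exponential.

```typescript
// Delay schedule copied from the commit message; one entry per retry.
const RETRY_DELAYS_MS = [5000, 10000, 15000];

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Repeats `request` while it answers 429, waiting the next delay between attempts.
// After the schedule is exhausted, the last response is returned to the caller.
async function withRateLimitRetry<T extends { status: number }>(
  request: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = defaultSleep,
  delays: number[] = RETRY_DELAYS_MS,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    const response = await request();
    if (response.status !== 429 || attempt >= delays.length) {
      return response;
    }
    await sleep(delays[attempt]);
  }
}
```

Passing `sleep` as a parameter keeps the helper testable without real waits; in production the default timer-based sleep would be used.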

- Single request instead of 50 individual requests
- Falls back to mix endpoint if theme-specific fails

- Add position set stream to store generated positions
- Add /benchmark/positions/generate endpoint to create positions
- Run benchmark now uses stored positions (same for all models)
- Auto-generates if no position set exists

- Old: accuracy - (illegal * 5) -> too harsh, scores went to 0
- New: F1 score (harmonic mean of precision and recall)
  - Recall: what % of legal moves found
  - Precision: what % of answers were correct
  - F1: balanced score between both
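
The F1 scoring described above can be sketched as below. `scoreMoveList` and `MoveListScore` are hypothetical names, and counting each distinct answer only once is an assumption; the PR does not say how repeated moves are treated.

```typescript
interface MoveListScore {
  precision: number; // share of the model's answers that are legal
  recall: number;    // share of the legal moves the model found
  f1: number;        // harmonic mean of precision and recall
}

function scoreMoveList(answers: string[], legalMoves: string[]): MoveListScore {
  const legal = new Set(legalMoves);
  // Assumption: repeated answers count once, so duplicates cannot inflate the score.
  const distinct = answers.filter((move, i) => answers.indexOf(move) === i);
  const correct = distinct.filter((move) => legal.has(move)).length;

  const precision = distinct.length === 0 ? 0 : correct / distinct.length;
  const recall = legal.size === 0 ? 0 : correct / legal.size;
  // Define F1 as 0 when both components are 0 to avoid dividing by zero.
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```

For example, if a model returns four moves of which three are legal, in a position with five legal moves, precision is 0.75, recall is 0.6, and F1 is about 0.67. An illegal answer lowers precision proportionally instead of subtracting a fixed 5 points.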

Backend:
- Add random-ai-selection service with weighted probabilities
  - Cheaper models (haiku, flash, mini) have 10x weight
  - Mid-tier models have 5x weight
  - Expensive models (opus, gpt-5) have 1-2x weight
- New endpoint POST /chess/play-vs-ai

Frontend:
- New /play-ai page with color selection
- Shows opponent model after matching
- Updated landing page with Play vs AI button
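
Weighted selection of this kind is usually done by drawing a point on the cumulative weight line. The sketch below shows that technique under stated assumptions: `pickWeighted` is a hypothetical helper, and the random source is injectable only so the behaviour can be checked deterministically.

```typescript
// Picks one item with probability proportional to its weight.
function pickWeighted<T extends { weight: number }>(
  items: T[],
  random: () => number = Math.random,
): T {
  const total = items.reduce((sum, item) => sum + item.weight, 0);
  if (items.length === 0 || total <= 0) {
    throw new Error("no selectable items");
  }
  // Draw a point in [0, total) and walk the items until it falls inside one.
  let threshold = random() * total;
  for (const item of items) {
    threshold -= item.weight;
    if (threshold < 0) {
      return item;
    }
  }
  // Only reachable through floating-point drift; fall back to the last item.
  return items[items.length - 1];
}

// Illustrative tiers mirroring the weights in the commit message.
const exampleTiers = [
  { name: "cheap", weight: 10 },
  { name: "mid", weight: 5 },
  { name: "expensive", weight: 1 },
];
```

With these example weights, a cheap-tier entry is chosen 10 times out of 16 on average, which keeps most casual games on the less expensive models.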

Models were returning long algebraic notation (Ng1f3) instead of standard algebraic notation (Nf3). Added explicit CORRECT/WRONG examples to guide models.

POST /benchmark/legal-moves/run-all
- Creates 20 positions if needed
- Runs ALL models in parallel (fire and forget)
- Returns immediately with list of queued models

Running 22 models simultaneously hits API rate limits. Now runs 3 models at a time with 5s delay between batches.
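
A batching loop of the kind described can be sketched as follows. `runInBatches` is a hypothetical helper, not the PR's implementation; the defaults (3 items per batch, 5000 ms pause) come from the commit message, and `sleep` is injectable so the pause can be skipped in tests.

```typescript
const pause = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `worker` over `items`, at most `batchSize` at a time,
// pausing `delayMs` between consecutive batches.
async function runInBatches<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  batchSize: number = 3,
  delayMs: number = 5000,
  sleep: (ms: number) => Promise<void> = pause,
): Promise<R[]> {
  const results: R[] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // Promise.all rejects on the first failure; a worker that should not
    // abort the whole run needs to catch its own errors.
    results.push(...(await Promise.all(batch.map((item) => worker(item)))));
    if (start + batchSize < items.length) {
      await sleep(delayMs);
    }
  }
  return results;
}
```

With 22 models and the defaults, this produces eight batches (seven of 3 and one of 1) with seven pauses between them.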

…ider in parallel

New flow:
- Round 1: Run first model from each provider (4 models) on all 20 positions
- Round 2: Run second model from each provider on all 20 positions
- etc.

Clear logging:
- Position 1/20 (25 legal moves)
  - openai/gpt-5: 20/25 correct, 3 illegal, score=75%
  - gemini/flash: 18/25 correct, 5 illegal, score=65%
  - ...

This avoids rate limits by only hitting each provider once per position.
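
One way to build such rounds is to group models by provider and take the n-th model of each group for round n. The sketch below illustrates that grouping only; `buildProviderRounds` and `ModelRef` are hypothetical names, and the model identifiers in the usage note are placeholders.

```typescript
interface ModelRef {
  provider: string;
  model: string;
}

// Round n contains the n-th model of every provider that has one,
// so no provider appears twice in the same round.
function buildProviderRounds(models: ModelRef[]): ModelRef[][] {
  const byProvider = new Map<string, ModelRef[]>();
  for (const ref of models) {
    const list = byProvider.get(ref.provider) || [];
    list.push(ref);
    byProvider.set(ref.provider, list);
  }

  const rounds: ModelRef[][] = [];
  byProvider.forEach((list) => {
    list.forEach((ref, index) => {
      if (!rounds[index]) {
        rounds[index] = [];
      }
      rounds[index].push(ref);
    });
  });
  return rounds;
}
```

Given two models for one provider, one for a second, and three for a third, the result is three rounds of sizes 3, 2, and 1. Each round can then be run in parallel across providers while models from the same provider run one after another.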

- Types for Stockfish game results and benchmark runs
- StockfishEngine class to communicate via UCI protocol
- playGameAgainstStockfish service plays full games
- Calculates ACPL, blunders, mistakes, inaccuracies
- POST /benchmark/stockfish/run - runs 2 games (as white and black)
- GET /benchmark/stockfish/leaderboard - rankings by ACPL

Uses same AI prompt as regular games for fair comparison.
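
Average centipawn loss (ACPL) is conventionally the mean, over a player's moves, of how much the engine evaluation dropped relative to the best available move. The sketch below shows that calculation; the names are hypothetical, and the 300/100/50 centipawn thresholds for blunders, mistakes, and inaccuracies are assumptions chosen for illustration, since the PR does not state the cut-offs it uses.

```typescript
// Evaluations in centipawns from the moving side's point of view.
interface MoveEval {
  bestCp: number;   // evaluation after the engine's best move
  playedCp: number; // evaluation after the move actually played
}

interface GameAnalysis {
  acpl: number;
  blunders: number;
  mistakes: number;
  inaccuracies: number;
}

// Assumed thresholds, not taken from the PR.
const LOSS_THRESHOLDS = { blunder: 300, mistake: 100, inaccuracy: 50 };

function analyseMoves(moves: MoveEval[]): GameAnalysis {
  const analysis: GameAnalysis = { acpl: 0, blunders: 0, mistakes: 0, inaccuracies: 0 };
  if (moves.length === 0) {
    return analysis;
  }
  let totalLoss = 0;
  for (const move of moves) {
    // A played move cannot be credited as better than the engine's best.
    const loss = Math.max(0, move.bestCp - move.playedCp);
    totalLoss += loss;
    if (loss >= LOSS_THRESHOLDS.blunder) analysis.blunders++;
    else if (loss >= LOSS_THRESHOLDS.mistake) analysis.mistakes++;
    else if (loss >= LOSS_THRESHOLDS.inaccuracy) analysis.inaccuracies++;
  }
  analysis.acpl = totalLoss / moves.length;
  return analysis;
}
```

Lower ACPL is better, so a leaderboard ranked by ACPL would sort ascending.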

- Log timing for each API call
- Check and report missing API keys
- Identify common errors: 401, 404, 429, timeout
- Show elapsed time on success/failure

AbortSignal.timeout wasn't triggering. Now using manual timeout wrapper.

Also added more granular logging to see where it hangs:
1. Starting API call
2. API key present, creating provider model
3. Provider model created, calling generateObject
4. SUCCESS/FAILED
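
A manual timeout wrapper typically races the real call against a timer. The following is a minimal sketch of that pattern, not the wrapper from this PR; `withTimeout` and its `label` parameter are hypothetical.

```typescript
// Rejects if `promise` has not settled within `ms` milliseconds.
// Note: this only stops waiting; it does not cancel the underlying request.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string = "operation"): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new Error(`${label} timed out after ${ms}ms`));
    }, ms);

    promise.then(
      (value) => {
        clearTimeout(timer); // stop the timer so it cannot keep the process alive
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

Because the wrapped call keeps running after the timeout fires, a caller that needs true cancellation would still have to pass an abort signal to the underlying request where the client supports one.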

Match the working AI vs AI game pattern which uses streamObject

- fix typo and add real-time stream hooks to bench page
- add flow diagram and benchmark comparison section
- extract shared utils and fix concurrency race condition
- align model names in random-ai-selection
- remove unused ScoreGauge, improve type safety

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Bot-Rakshit added 5 commits December 5, 2025 19:42

fix: improve methodology page UI and scrolling

31e3757

fix: improve landing page button layout with grid

272ca65

Bot-Rakshit added 24 commits December 10, 2025 20:08

feat: add game history types and stream

8b7e93d

chore: add MOTIA_DOCS.md to gitignore

dca34e8

feat: update AI models

0d19ded

feat: add legal move benchmark types and streams

52a6eeb

feat: add puzzle benchmark types and streams

8e134c1

refactor: remove variant toggle, use guided mode only

be43e6d

fix: use GEMINI_API_KEY instead of GOOGLE_API_KEY in benchmarks

7b0acc1

fix: correct puzzle parsing by playing setup move after initialPly

b733338

fix: use queryParams array instead of querySchema for Motia API routes

9f7a461

fix: add rate limit retry logic and increase delay for Lichess API

5b3efb9

refactor: use Lichess batch API for faster puzzle fetching

0a39348

feat: use shared position set for fair benchmark comparison

55a7198

fix: correct position set stream config format

b0f1abf

fix: improve legal move benchmark prompt with clearer SAN examples

ba3b22c

feat: add run-all-benchmarks endpoint for parallel execution

f50fdf3

fix: correct useAuth import and use apiClient in play-ai-page

3b66e71

fix: run benchmarks in batches of 3 to avoid rate limits

fd3728e

Bot-Rakshit added 10 commits December 15, 2025 04:22

fix: reduce benchmark prompt timeout from 2min to 1min

f065539

feat: add detailed logging to benchmark prompts

ad4b551

fix: use streamObject instead of generateObject for benchmark

4c3e642

fix: remove fire and forget from the api

7b9bd92

add thinking efforts in models

5e49c8d

add batch api

108b86f

rechart graphs

35bf6ba

Bot-Rakshit changed the title ~~Feature/checkpoint 1 benchmark methodology( DRAFT:WIP)~~ Motia chess bench and Arena. Dec 19, 2025

Bot-Rakshit changed the title ~~Motia chess bench and Arena.~~ Motia chess bench and Arena.(WIP) Dec 19, 2025

Bot-Rakshit and others added 2 commits December 27, 2025 18:15

revert landing assets

37abcb7

Motia chess bench and Arena.(WIP) #52

Bot-Rakshit wants to merge 41 commits into MotiaDev:main from Bot-Rakshit:feature/checkpoint-1-benchmark-methodology
