diff --git a/.cursor/plan.md b/.cursor/plan.md
index f1b7fed..01b2060 100644
--- a/.cursor/plan.md
+++ b/.cursor/plan.md
@@ -1,55 +1,116 @@
-## AI Essay → Review → Revision Pipeline
-
-### Goal
-
-Implement a **Bun-friendly TypeScript CLI** (`bun run index.ts`) that:
-
-- Prompts the user for an essay topic.
-- Uses **model A (OpenRouter via Vercel AI SDK)** to generate an essay.
-- Uses **model B** to review that essay and produce feedback.
-- Calls **model A again** with the feedback to produce a revised essay.
-- Saves all three artifacts as **markdown files** on disk in a consistent location, with the `runs/` directory ignored by git.
-
-### High-level Design
-
-- **Runtime & entrypoint**: Keep using Bun with `index.ts` as the main CLI entrypoint.
-- **AI client setup**:
-- Add `ai` and the **OpenRouter provider for the Vercel AI SDK** as dependencies (no separate `openai` package needed since we are using OpenRouter directly).
-- Configure a small `aiClient.ts` module (or keep logic inline in `index.ts` if very small) that wires the AI SDK to OpenRouter using an `OPENROUTER_API_KEY` env var.
-- Hard-code two model IDs (e.g. one for essay generation, one for review) with clear `const` names so you can easily change them later.
-- **Pipeline orchestration**:
-- Implement a `runEssayPipeline()` function that:
-- Reads the prompt from stdin (simple interactive question).
-- Calls the **essay model** with a system prompt + user prompt to generate the initial essay.
-- Calls the **review model** with system instructions plus the essay content to generate feedback.
-- Calls the **essay model** again with the original prompt and the feedback to produce a revised essay.
-- Keep everything **strongly typed** with small TypeScript interfaces for the pipeline results.
-- **Markdown file output**:
-- Decide on a simple folder and naming scheme (e.g. `runs/<timestamp>-essay.md`, `runs/<timestamp>-review.md`, `runs/<timestamp>-revision.md`).
-- Use Bun / Node fs APIs in a small utility to write each step as a separate markdown file.
-- Include basic front-matter or headings (e.g. `# Original Essay`, `# Review Feedback`, `# Revised Essay`) for easy inspection in an editor.
-- Ensure `runs/` is added to `.gitignore` so generated artifacts don’t clutter git history.
-
-### Implementation Steps
-
-- **setup-deps**: Add `ai` and the OpenRouter provider for the AI SDK to `package.json` and document the required `OPENROUTER_API_KEY` env var in `README.md`.
-- **ai-client**: Create a small AI client configuration that:
-- Instantiates the AI SDK with the OpenRouter provider.
-- Exposes typed helpers like `generateEssay(prompt)`, `reviewEssay(essay)`, and `reviseEssay(prompt, essay, feedback)`.
-- **pipeline-logic**: Implement `runEssayPipeline()` in `index.ts` that:
-- Interactively asks for a prompt via stdin.
-- Runs the three AI steps in sequence (no streaming needed) with clear logging to the console.
-- Returns a typed result object containing the three text outputs.
-- **file-output**: Add a small utility function to:
-- Create a `runs/` directory if it doesn’t exist.
-- Write three markdown files with timestamped names and simple headings.
-- Confirm that `runs/` is listed in `.gitignore`.
-- **polish-types**: Ensure all public functions are type-safe (typed params and return types where helpful) and that the code compiles under the existing `tsconfig`.
-
-### Todos
-
-- **setup-deps**: Add and configure Vercel AI SDK (`ai`) and the OpenRouter provider, and document `OPENROUTER_API_KEY`.
-- **ai-client**: Implement the AI client helper(s) for essay generation, review, and revision using hard-coded OpenRouter model IDs.
-- **pipeline-logic**: Implement the CLI flow in `index.ts` to run the generation → review → revision pipeline.
-- **file-output**: Implement markdown file-writing utilities (create `runs/` directory, timestamped filenames, headings) and ensure `runs/` is in `.gitignore`.
-- **polish-types**: Run TypeScript checks and tighten any loose types if needed.
+# Writing Quality Arena
+
+## Models Configuration
+
+Use the provided `modelsToRun` array in `constants.ts`:
+
+```ts
+export type RunnableModel = {
+  name: string;
+  llm: LanguageModel;
+  reasoning: boolean;
+};
+
+export const modelsToRun: RunnableModel[] = [
+  {
+    name: "claude-4.5-opus-reasoning",
+    llm: openrouter("anthropic/claude-opus-4.5"),
+    reasoning: true,
+  },
+  // ... 11 models total
+];
+
+export const PARALLEL_LIMIT = 5; // Configurable concurrency
+```
+
+## Execution Flow (4 Phases)
+
+### Phase 1: Essay Generation
+
+Each model writes an essay on the topic. **N calls**.
+
+### Phase 2: All-to-All Review
+
+Every model reviews every OTHER model's essay (self-review is skipped). **N × (N-1) calls**.
+
+### Phase 3: Per-Reviewer Revisions
+
+Each model creates a separate revised essay for EACH piece of feedback received. **N × (N-1) revisions**.
+
+### Phase 4: Scoring
+
+Every model scores EVERY essay (N originals + N×(N-1) revisions). Use `generateObject` with Zod schema:
+
+```ts
+const ScoreSchema = z.object({
+  score: z.number().min(1).max(10),
+  justification: z.string(),
+});
+```
+
+**N × (N + N×(N-1)) = N × N² = N³ calls**.
+
+## API Call Summary (N=11 models)
+
+| Phase | Formula | Calls |
+|-------|---------|-------|
+| Essays | N | 11 |
+| Feedback | N×(N-1) | 110 |
+| Revisions | N×(N-1) | 110 |
+| Scores | N³ | 1331 |
+| **Total** | | **1562** |
+
+## Rankings
+
+**Essay Ranking**: All essays (original + revised) ranked by average score across all judges.
+
+**Reviewer Ranking**: For each reviewer, calculate avg improvement = mean(revision_score - original_score) over all revisions that used their feedback.
+
+## File Structure
+
+```
+results/{timestamp}/
+├── essays/{model-name}.md
+├── feedback/{reviewer}-on-{author}.md
+├── revisions/{author}-revised-by-{reviewer}.md
+├── results.json
+└── summary.md
+```
+
+## File Changes
+
+| File | Change |
+|------|--------|
+| `constants.ts` | Add `RunnableModel` type, `modelsToRun` array, `PARALLEL_LIMIT` |
+| `types.ts` | Already has appropriate types; verify alignment |
+| `aiClient.ts` | Update functions to accept `RunnableModel`, add `scoreEssay()` using `generateObject` |
+| `index.ts` | Rewrite with 4-phase arena orchestration, parallel execution via `p-limit`, `confirmRun()` |
+| `fileUtils.ts` | Rewrite for arena folder structure (`results/` dir, essays/, feedback/, revisions/, results.json, summary.md) |
+
+## CLI Confirmation
+
+Display call counts and prompt before running:
+
+```ts
+async function confirmRun(): Promise<boolean> {
+  const n = modelsToRun.length;
+  const essays = n;
+  const feedback = n * (n - 1);
+  const revisions = n * (n - 1);
+  const scores = n * n * n;
+  const total = essays + feedback + revisions + scores;
+  // ... display and prompt Y/n
+}
+```
diff --git a/.gitignore b/.gitignore
index 1c7683b..0cb6022 100644
--- a/.gitignore
+++ b/.gitignore
@@ -35,3 +35,4 @@ report.[0-9]_.[0-9]_.[0-9]_.[0-9]_.json
 
 # Generated essay runs
 runs/
+results/
diff --git a/aiClient.ts b/aiClient.ts
index 96ef8fc..bd4d0ec 100644
--- a/aiClient.ts
+++ b/aiClient.ts
@@ -1,63 +1,111 @@
-import { createOpenRouter } from "@openrouter/ai-sdk-provider";
 import { generateText } from "ai";
+import type { RunnableModel } from "./constants";
 
-// Model IDs - easily changeable constants
-const ESSAY_MODEL = "anthropic/claude-opus-4.5";
-const REVIEW_MODEL = "moonshotai/kimi-k2-thinking";
+/**
+ * Extracts cost from OpenRouter provider metadata.
+ */
+function extractCost(
+  providerMetadata: Record<string, unknown> | undefined
+): number {
+  if (!providerMetadata) return 0;
+  const openrouterMeta = providerMetadata.openrouter as any;
+
+  if (openrouterMeta?.usage?.cost) {
+    return openrouterMeta.usage.cost;
+  }
 
-// Initialize the OpenRouter provider
-if (!process.env.OPENROUTER_API_KEY) {
-  throw new Error(
-    "OPENROUTER_API_KEY environment variable is required. Please set it before running the script."
-  );
+  if (openrouterMeta?.usage?.costDetails?.upstreamInferenceCost) {
+    return openrouterMeta.usage.costDetails.upstreamInferenceCost;
+  }
+
+  return 0;
 }
 
-const openrouter = createOpenRouter({
-  apiKey: process.env.OPENROUTER_API_KEY,
-});
+export interface TokenUsage {
+  inputTokens: number;
+  outputTokens: number;
+  totalTokens: number;
+  /** Cost in USD from OpenRouter */
+  cost: number;
+}
 
 export interface EssayResult {
   text: string;
+  usage: TokenUsage;
 }
 
 export interface ReviewResult {
   text: string;
+  usage: TokenUsage;
 }
 
 export interface RevisionResult {
   text: string;
+  usage: TokenUsage;
+}
+
+export interface ScoreResult {
+  score: number;
+  justification: string;
+  usage: TokenUsage;
+}
+
+export interface CompareResult {
+  winner: "A" | "B" | "tie";
+  reasoning: string;
+  usage: TokenUsage;
 }
 
 /**
  * Generates an essay based on the given topic prompt.
  */
-export async function generateEssay(topic: string): Promise<EssayResult> {
+export async function generateEssay(
+  model: RunnableModel,
+  topic: string
+): Promise<EssayResult> {
   const result = await generateText({
-    model: openrouter(ESSAY_MODEL),
+    model: model.llm,
     system: `You are an expert essay writer. Write a well-structured, thoughtful essay on the given topic.
-The essay should be clear, engaging, and demonstrate strong writing skills.`,
+The essay should be clear, engaging, and demonstrate strong writing skills.
+Write approximately 800-1200 words.`,
     prompt: `Write an essay on the following topic:\n\n${topic}`,
   });
 
   return {
     text: result.text,
+    usage: {
+      inputTokens: result.usage?.inputTokens ?? 0,
+      outputTokens: result.usage?.outputTokens ?? 0,
+      totalTokens: result.usage?.totalTokens ?? 0,
+      cost: extractCost(result.providerMetadata),
+    },
   };
 }
 
 /**
  * Reviews an essay and provides constructive feedback.
  */
-export async function reviewEssay(essay: string): Promise<ReviewResult> {
+export async function reviewEssay(
+  model: RunnableModel,
+  essay: string,
+  topic: string
+): Promise<ReviewResult> {
   const result = await generateText({
-    model: openrouter(REVIEW_MODEL),
+    model: model.llm,
     system: `You are an expert writing tutor and editor. Review the essay provided and give constructive, specific feedback on areas such as structure, clarity, argumentation, style, and areas for improvement.
-Be thorough but encouraging.`,
-    prompt: `Please review the following essay and provide detailed feedback:\n\n${essay}`,
+Be thorough but encouraging. Focus on actionable improvements.`,
+    prompt: `Topic: ${topic}\n\nPlease review the following essay and provide detailed feedback:\n\n${essay}`,
   });
 
   return {
     text: result.text,
+    usage: {
+      inputTokens: result.usage?.inputTokens ?? 0,
+      outputTokens: result.usage?.outputTokens ?? 0,
+      totalTokens: result.usage?.totalTokens ?? 0,
+      cost: extractCost(result.providerMetadata),
+    },
   };
 }
 
@@ -65,18 +113,134 @@ Be thorough but encouraging.`,
  * Revises an essay based on the original topic, original essay, and review feedback.
  */
 export async function reviseEssay(
+  model: RunnableModel,
   topic: string,
   originalEssay: string,
   feedback: string
 ): Promise<RevisionResult> {
   const result = await generateText({
-    model: openrouter(ESSAY_MODEL),
+    model: model.llm,
     system: `You are an expert essay writer. Revise the provided essay based on the feedback given,
-while maintaining the core message and improving the areas identified.`,
+while maintaining the core message and improving the areas identified.
+Produce a complete revised essay, not just suggestions.`,
     prompt: `Original topic: ${topic}\n\nOriginal essay:\n${originalEssay}\n\nReview feedback:\n${feedback}\n\nPlease revise the essay based on the feedback above.`,
   });
 
   return {
     text: result.text,
+    usage: {
+      inputTokens: result.usage?.inputTokens ?? 0,
+      outputTokens: result.usage?.outputTokens ?? 0,
+      totalTokens: result.usage?.totalTokens ?? 0,
+      cost: extractCost(result.providerMetadata),
+    },
+  };
+}
+
+/**
+ * Scores an essay on a scale of 1-10 with justification.
+ */
+export async function scoreEssay(
+  model: RunnableModel,
+  essay: string,
+  topic: string
+): Promise<ScoreResult> {
+  const result = await generateText({
+    model: model.llm,
+    system: `You are an expert essay judge. Score the essay on a scale of 1-10 based on:
+- Clarity and coherence of argument
+- Quality of writing (style, grammar, flow)
+- Depth of insight and originality
+- Relevance to the topic
+- Overall effectiveness
+
+Be fair and consistent in your scoring. A score of 5 is average, 7-8 is good, 9-10 is exceptional.
+
+IMPORTANT: Start your response with EXACTLY "Score: X/10" on the first line (where X is your score), then provide your detailed justification below.`,
+    prompt: `Topic: ${topic}\n\nPlease score the following essay:\n\n${essay}`,
+  });
+
+  // Parse score from the text - look for "Score: X/10" or similar patterns
+  const scoreMatch = result.text.match(/Score:\s*(\d+(?:\.\d+)?)\s*\/\s*10/i);
+  const score = scoreMatch?.[1] ? parseFloat(scoreMatch[1]) : 5; // Default to 5 if parsing fails
+
+  // Everything after the score line is the justification
+  const justification = result.text
+    .replace(/^Score:\s*\d+(?:\.\d+)?\s*\/\s*10\s*/i, "")
+    .trim();
+
+  return {
+    score: Math.min(10, Math.max(1, score)), // Clamp between 1-10
+    justification,
+    usage: {
+      inputTokens: result.usage?.inputTokens ?? 0,
+      outputTokens: result.usage?.outputTokens ?? 0,
+      totalTokens: result.usage?.totalTokens ?? 0,
+      cost: extractCost(result.providerMetadata),
+    },
+  };
+}
+
+/**
+ * Compares two essays head-to-head and picks a winner.
+ */
+export async function compareEssays(
+  judge: RunnableModel,
+  essayA: { author: string; text: string },
+  essayB: { author: string; text: string },
+  topic: string
+): Promise<CompareResult> {
+  const result = await generateText({
+    model: judge.llm,
+    system: `You are an expert essay judge conducting a head-to-head comparison.
You will be shown two essays on the same topic, labeled Essay A and Essay B. + +Compare them based on: +- Clarity and coherence of argument +- Quality of writing (style, grammar, flow) +- Depth of insight and originality +- Relevance to the topic +- Overall effectiveness + +You MUST pick a winner. Only declare a tie if the essays are genuinely indistinguishable in quality. + +IMPORTANT: Start your response with EXACTLY one of these on the first line: +- "Winner: A" (if Essay A is better) +- "Winner: B" (if Essay B is better) +- "Winner: Tie" (only if truly equal) + +Then provide your detailed reasoning below, explaining why you chose that winner.`, + prompt: `Topic: ${topic} + +Essay A: +${essayA.text} + +Essay B: +${essayB.text} + +Compare these essays and pick a winner.`, + }); + + // Parse winner from the text + const winnerMatch = result.text.match(/Winner:\s*(A|B|Tie)/i); + let winner: "A" | "B" | "tie" = "tie"; + if (winnerMatch) { + const parsed = winnerMatch[1]!.toUpperCase(); + if (parsed === "A") winner = "A"; + else if (parsed === "B") winner = "B"; + else winner = "tie"; + } + + // Everything after the winner line is the reasoning + const reasoning = result.text.replace(/^Winner:\s*(A|B|Tie)\s*/i, "").trim(); + + return { + winner, + reasoning, + usage: { + inputTokens: result.usage?.inputTokens ?? 0, + outputTokens: result.usage?.outputTokens ?? 0, + totalTokens: result.usage?.totalTokens ?? 0, + cost: extractCost(result.providerMetadata), + }, }; } diff --git a/bun.lock b/bun.lock index adc4602..325db0a 100644 --- a/bun.lock +++ b/bun.lock @@ -4,6 +4,12 @@ "workspaces": { "": { "name": "auto-draftify", + "dependencies": { + "@openrouter/ai-sdk-provider": "^1.0.0", + "ai": "^5.0.0", + "p-limit": "^6.1.0", + "zod": "^3.24.0", + }, "devDependencies": { "@types/bun": "latest", }, @@ -13,14 +19,42 @@ }, }, "packages": { + "@ai-sdk/gateway": ["@ai-sdk/gateway@2.0.17", "", { "dependencies": { "@ai-sdk/provider": "2.0.0", "@ai-sdk/provider-utils": "3.0.18", "@vercel/oidc": "3.0.5" }, "peerDependencies": { "zod": "^3.25.76 || ^4.1.8" } }, "sha512-oVAG6q72KsjKlrYdLhWjRO7rcqAR8CjokAbYuyVZoCO4Uh2PH/VzZoxZav71w2ipwlXhHCNaInGYWNs889MMDA=="], + + "@ai-sdk/provider": ["@ai-sdk/provider@2.0.0", "", { "dependencies": { "json-schema": "^0.4.0" } }, "sha512-6o7Y2SeO9vFKB8lArHXehNuusnpddKPk7xqL7T2/b+OvXMRIXUO1rR4wcv1hAFUAT9avGZshty3Wlua/XA7TvA=="], + + "@ai-sdk/provider-utils": ["@ai-sdk/provider-utils@3.0.18", "", { "dependencies": { "@ai-sdk/provider": "2.0.0", "@standard-schema/spec": "^1.0.0", "eventsource-parser": "^3.0.6" }, "peerDependencies": { "zod": "^3.25.76 || ^4.1.8" } }, "sha512-ypv1xXMsgGcNKUP+hglKqtdDuMg68nWHucPPAhIENrbFAI+xCHiqPVN8Zllxyv1TNZwGWUghPxJXU+Mqps0YRQ=="], + + "@openrouter/ai-sdk-provider": ["@openrouter/ai-sdk-provider@1.2.8", "", { "dependencies": { "@openrouter/sdk": "^0.1.8" }, "peerDependencies": { "ai": "^5.0.0", "zod": "^3.24.1 || ^v4" } }, "sha512-pQT8AzZBKg9f4bkt4doF486ZlhK0XjKkevrLkiqYgfh1Jplovieu28nK4Y+xy3sF18/mxjqh9/2y6jh01qzLrA=="], + + "@openrouter/sdk": ["@openrouter/sdk@0.1.27", "", { "dependencies": { "zod": "^3.25.0 || ^4.0.0" } }, "sha512-RH//L10bSmc81q25zAZudiI4kNkLgxF2E+WU42vghp3N6TEvZ6F0jK7uT3tOxkEn91gzmMw9YVmDENy7SJsajQ=="], + + "@opentelemetry/api": ["@opentelemetry/api@1.9.0", "", {}, "sha512-3giAOQvZiH5F9bMlMiv8+GSPMeqg0dbaeo58/0SlA9sxSqZhnUtxzX9/2FzyhS9sWQf5S0GJE0AKBrFqjpeYcg=="], + + "@standard-schema/spec": ["@standard-schema/spec@1.0.0", "", {}, 
"sha512-m2bOd0f2RT9k8QJx1JN85cZYyH1RqFBdlwtkSlf4tBDYLCiiZnv1fIIwacK6cqwXavOydf0NPToMQgpKq+dVlA=="], + "@types/bun": ["@types/bun@1.3.3", "", { "dependencies": { "bun-types": "1.3.3" } }, "sha512-ogrKbJ2X5N0kWLLFKeytG0eHDleBYtngtlbu9cyBKFtNL3cnpDZkNdQj8flVf6WTZUX5ulI9AY1oa7ljhSrp+g=="], "@types/node": ["@types/node@24.10.1", "", { "dependencies": { "undici-types": "~7.16.0" } }, "sha512-GNWcUTRBgIRJD5zj+Tq0fKOJ5XZajIiBroOF0yvj2bSU1WvNdYS/dn9UxwsujGW4JX06dnHyjV2y9rRaybH0iQ=="], + "@vercel/oidc": ["@vercel/oidc@3.0.5", "", {}, "sha512-fnYhv671l+eTTp48gB4zEsTW/YtRgRPnkI2nT7x6qw5rkI1Lq2hTmQIpHPgyThI0znLK+vX2n9XxKdXZ7BUbbw=="], + + "ai": ["ai@5.0.104", "", { "dependencies": { "@ai-sdk/gateway": "2.0.17", "@ai-sdk/provider": "2.0.0", "@ai-sdk/provider-utils": "3.0.18", "@opentelemetry/api": "1.9.0" }, "peerDependencies": { "zod": "^3.25.76 || ^4.1.8" } }, "sha512-MZOkL9++nY5PfkpWKBR3Rv+Oygxpb9S16ctv8h91GvrSif7UnNEdPMVZe3bUyMd2djxf0AtBk/csBixP0WwWZQ=="], + "bun-types": ["bun-types@1.3.3", "", { "dependencies": { "@types/node": "*" } }, "sha512-z3Xwlg7j2l9JY27x5Qn3Wlyos8YAp0kKRlrePAOjgjMGS5IG6E7Jnlx736vH9UVI4wUICwwhC9anYL++XeOgTQ=="], + "eventsource-parser": ["eventsource-parser@3.0.6", "", {}, "sha512-Vo1ab+QXPzZ4tCa8SwIHJFaSzy4R6SHf7BY79rFBDf0idraZWAkYrDjDj8uWaSm3S2TK+hJ7/t1CEmZ7jXw+pg=="], + + "json-schema": ["json-schema@0.4.0", "", {}, "sha512-es94M3nTIfsEPisRafak+HDLfHXnKBhV3vU5eqPcS3flIWqcxJWgXHXiey3YrpaNsanY5ei1VoYEbOzijuq9BA=="], + + "p-limit": ["p-limit@6.2.0", "", { "dependencies": { "yocto-queue": "^1.1.1" } }, "sha512-kuUqqHNUqoIWp/c467RI4X6mmyuojY5jGutNU0wVTmEOOfcuwLqyMVoAi9MKi2Ak+5i9+nhmrK4ufZE8069kHA=="], + "typescript": ["typescript@5.9.3", "", { "bin": { "tsc": "bin/tsc", "tsserver": "bin/tsserver" } }, "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw=="], "undici-types": ["undici-types@7.16.0", "", {}, "sha512-Zz+aZWSj8LE6zoxD+xrjh4VfkIG8Ya6LvYkZqtUQGJPZjYl53ypCaUwWqo7eI0x66KBGeRo+mlBEkMSeSZ38Nw=="], + + "yocto-queue": ["yocto-queue@1.2.2", "", {}, "sha512-4LCcse/U2MHZ63HAJVE+v71o7yOdIe4cZ70Wpf8D/IyjDKYQLV5GD46B+hSTjJsvV5PztjvHoU580EftxjDZFQ=="], + + "zod": ["zod@3.25.76", "", {}, "sha512-gzUt/qt81nXsFGKIFcC3YnfEAx5NkunCfnDlvuBSSFS02bcXu4Lmea0AFIUwbLWxWPx3d9p8S5QoaujKcNQxcQ=="], } } diff --git a/constants.ts b/constants.ts new file mode 100644 index 0000000..e5768d0 --- /dev/null +++ b/constants.ts @@ -0,0 +1,131 @@ +import { createOpenRouter } from "@openrouter/ai-sdk-provider"; +import type { LanguageModel } from "ai"; + +// Initialize the OpenRouter provider +if (!process.env.OPENROUTER_API_KEY) { + throw new Error( + "OPENROUTER_API_KEY environment variable is required. Please set it before running the script." 
+  );
+}
+
+const openrouter = createOpenRouter({
+  apiKey: process.env.OPENROUTER_API_KEY,
+});
+
+// Parallelism configuration
+export const PARALLEL_LIMIT = 30;
+
+// Essay topics
+export const TOPICS = [
+  // "The role of failure in personal growth",
+  // "Why boredom is underrated",
+  "The ethics of artificial intelligence",
+  "How social media reshapes human connection",
+  // "The value of slow living in a fast world",
+  // "Why we should embrace uncertainty",
+  // "The hidden costs of convenience",
+  // "What makes a good explanation",
+  // "The relationship between creativity and constraint",
+  // "Why some ideas spread and others don't",
+  "the negative impacts on society from artificial intelligence",
+] as const;
+
+// Model definition
+export interface RunnableModel {
+  name: string;
+  llm: LanguageModel;
+  reasoning: boolean;
+}
+
+// Include "usage" so we can log cost
+const defaultProviderOptions = {
+  usage: {
+    include: true,
+  },
+};
+
+export const modelsToRun: RunnableModel[] = [
+  // Anthropic
+  {
+    name: "claude-4.5-opus-reasoning",
+    llm: openrouter("anthropic/claude-opus-4.5", defaultProviderOptions),
+    reasoning: true,
+  },
+  // {
+  //   name: "claude-4.5-opus-non-reasoning",
+  //   llm: openrouter("anthropic/claude-opus-4.5", defaultProviderOptions),
+  //   reasoning: false,
+  // },
+
+  // OpenAI
+  // {
+  //   name: "gpt-4o",
+  //   llm: openrouter("openai/gpt-4o", defaultProviderOptions),
+  //   reasoning: false,
+  // },
+  {
+    name: "gpt-5.1",
+    llm: openrouter("openai/gpt-5.1", defaultProviderOptions),
+    reasoning: true,
+  },
+  // {
+  //   name: "gpt-5.1-chat",
+  //   llm: openrouter("openai/gpt-5.1-chat", defaultProviderOptions),
+  //   reasoning: false,
+  // },
+  // {
+  //   name: "gpt-5-mini",
+  //   llm: openrouter("openai/gpt-5-mini", defaultProviderOptions),
+  //   reasoning: true,
+  // },
+
+  // Google
+  {
+    name: "gemini-3-pro-preview",
+    llm: openrouter("google/gemini-3-pro-preview", defaultProviderOptions),
+    reasoning: true,
+  },
+  // {
+  //   name: "gemini-2.5-pro",
+  //   llm: openrouter("google/gemini-2.5-pro", defaultProviderOptions),
+  //   reasoning: true,
+  // },
+
+  // Grok
+  // {
+  //   name: "grok-4.1-fast",
+  //   llm: openrouter("x-ai/grok-4.1-fast", defaultProviderOptions),
+  //   reasoning: true,
+  // },
+
+  // Open Weight
+  // {
+  //   name: "kimi-k2",
+  //   llm: openrouter("moonshotai/kimi-k2", defaultProviderOptions),
+  //   reasoning: false,
+  // },
+  {
+    name: "kimi-k2-thinking",
+    llm: openrouter("moonshotai/kimi-k2-thinking", defaultProviderOptions),
+    reasoning: true,
+  },
+];
+
+// Cheap models for dry-run testing
+export const dryRunModels: RunnableModel[] = [
+  {
+    name: "claude-4.5-haiku",
+    llm: openrouter("anthropic/claude-haiku-4.5", defaultProviderOptions),
+    reasoning: false,
+  },
+  {
+    name: "gemini-2.5-flash",
+    llm: openrouter("google/gemini-2.5-flash", defaultProviderOptions),
+    reasoning: true,
+  },
+  {
+    name: "gpt-5-mini",
+    llm: openrouter("openai/gpt-5-mini", defaultProviderOptions),
+    reasoning: true,
+  },
+];
diff --git a/fileUtils.ts b/fileUtils.ts
index b11c2f9..ed96fed 100644
--- a/fileUtils.ts
+++ b/fileUtils.ts
@@ -1,26 +1,111 @@
 import { mkdir, writeFile } from "fs/promises";
 import { join } from "path";
 
-const RUNS_DIR = "runs";
+const RESULTS_DIR = "results";
 
-export interface PipelineOutput {
-  essay: string;
-  review: string;
-  revision: string;
+export type TestType = "scoring-test" | "1v1";
+
+export interface TopicResults {
+  topic: string;
+  essays: Record<string, string>;
+  feedback: Record<string, Record<string, string>>;
+  revisions: Record<string, Record<string, string>>;
+  scores: {
+    original: Record<
+      string,
+      Record<string, { score: number; justification: string }>
+    >;
+    revised: Record<
+      string,
+      Record<string, Record<string, { score: number; justification: string }>>
+    >;
+  };
+  rankings: {
+    essays: Array<{
+      type: "original" | "revised";
+      author: string;
+      reviewer?: string;
+      avgScore: number;
+    }>;
+    reviewers: Array<{
+      reviewer: string;
+      avgImprovement: number;
+    }>;
+  };
 }
 
-/**
- * Ensures the runs directory exists, creating it if necessary.
- */
-async function ensureRunsDirectory(): Promise<void> {
-  try {
-    await mkdir(RUNS_DIR, { recursive: true });
-  } catch (error) {
-    // Directory might already exist, which is fine
-    if ((error as NodeJS.ErrnoException).code !== "EEXIST") {
-      throw error;
-    }
-  }
+export interface ArenaResults {
+  timestamp: string;
+  models: string[];
+  topics: TopicResults[];
+  aggregateRankings: {
+    essays: Array<{
+      author: string;
+      avgScore: number;
+      avgImprovement: number;
+    }>;
+    reviewers: Array<{
+      reviewer: string;
+      avgImprovement: number;
+    }>;
+  };
+}
+
+// 1v1 specific types
+export interface ComparisonResult {
+  judge: string;
+  essayA: { author: string; reviewer?: string };
+  essayB: { author: string; reviewer?: string };
+  winner: "A" | "B" | "tie";
+  reasoning: string;
+}
+
+export interface OneVsOneTopicResults {
+  topic: string;
+  essays: Record<string, string>;
+  feedback: Record<string, Record<string, string>>;
+  revisions: Record<string, Record<string, string>>;
+  comparisons: ComparisonResult[];
+  rankings: {
+    essays: Array<{
+      author: string;
+      reviewer?: string;
+      wins: number;
+      losses: number;
+      ties: number;
+      winRate: number;
+    }>;
+  };
+}
+
+export interface OneVsOneResults {
+  timestamp: string;
+  models: string[];
+  topics: OneVsOneTopicResults[];
+  aggregateRankings: {
+    essays: Array<{
+      author: string;
+      wins: number;
+      losses: number;
+      ties: number;
+      winRate: number;
+    }>;
+    reviewers: Array<{
+      reviewer: string;
+      wins: number;
+      losses: number;
+      ties: number;
+      winRate: number;
+    }>;
+    pairings: Array<{
+      author: string;
+      reviewer: string;
+      wins: number;
+      losses: number;
+      ties: number;
+      winRate: number;
+    }>;
+  };
 }
 
 /**
@@ -32,35 +117,299 @@ function getTimestamp(): string {
 }
 
 /**
- * Writes the pipeline outputs to markdown files in the runs directory.
+ * Sanitizes a name for use in filenames.
  */
-export async function writePipelineOutputs(
-  outputs: PipelineOutput
-): Promise<void> {
-  await ensureRunsDirectory();
-  const timestamp = getTimestamp();
+function sanitizeName(name: string): string {
+  return name.replace(/[^a-zA-Z0-9-_]/g, "-").toLowerCase();
+}
 
-  const files = [
-    {
-      path: join(RUNS_DIR, `${timestamp}-essay.md`),
-      content: `# Original Essay\n\n${outputs.essay}`,
-    },
-    {
-      path: join(RUNS_DIR, `${timestamp}-review.md`),
-      content: `# Review Feedback\n\n${outputs.review}`,
-    },
-    {
-      path: join(RUNS_DIR, `${timestamp}-revision.md`),
-      content: `# Revised Essay\n\n${outputs.revision}`,
-    },
+/**
+ * Creates the arena results directory structure for a topic.
+ */
+export async function createTopicDirectories(baseDir: string, topic: string) {
+  const topicSlug = sanitizeName(topic).slice(0, 50);
+  const topicDir = join(baseDir, topicSlug);
+  const dirs = [
+    topicDir,
+    join(topicDir, "essays"),
+    join(topicDir, "feedback"),
+    join(topicDir, "revisions"),
   ];
 
-  for (const file of files) {
-    await writeFile(file.path, file.content, "utf-8");
+  for (const dir of dirs) {
+    await mkdir(dir, { recursive: true });
+  }
+
+  return topicDir;
+}
+
+/**
+ * Writes an essay to the essays directory.
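+ *
+ * A usage sketch (hypothetical values; `topicDir` comes from
+ * `createTopicDirectories` above, and the topic is one of the entries
+ * in `TOPICS`):
+ *
+ * ```ts
+ * const dir = await createTopicDirectories(baseDir, "The ethics of artificial intelligence");
+ * await writeEssay(dir, "gpt-5.1", essayText);
+ * // => {baseDir}/the-ethics-of-artificial-intelligence/essays/gpt-5-1.md
+ * ```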
+ */ +export async function writeEssay( + topicDir: string, + modelName: string, + essay: string +) { + const filename = `${sanitizeName(modelName)}.md`; + const path = join(topicDir, "essays", filename); + await writeFile(path, `# Essay by ${modelName}\n\n${essay}`, "utf-8"); + return path; +} + +/** + * Writes feedback to the feedback directory. + */ +export async function writeFeedback( + topicDir: string, + reviewer: string, + author: string, + feedback: string +) { + const filename = `${sanitizeName(reviewer)}-on-${sanitizeName(author)}.md`; + const path = join(topicDir, "feedback", filename); + await writeFile( + path, + `# Feedback by ${reviewer} on ${author}'s Essay\n\n${feedback}`, + "utf-8" + ); + return path; +} + +/** + * Writes a revision to the revisions directory. + */ +export async function writeRevision( + topicDir: string, + author: string, + reviewer: string, + revision: string +) { + const filename = `${sanitizeName(author)}-revised-by-${sanitizeName( + reviewer + )}.md`; + const path = join(topicDir, "revisions", filename); + await writeFile( + path, + `# ${author}'s Essay Revised Based on ${reviewer}'s Feedback\n\n${revision}`, + "utf-8" + ); + return path; +} + +/** + * Writes the complete results JSON file. + */ +export async function writeResultsJson(baseDir: string, results: ArenaResults) { + const path = join(baseDir, "results.json"); + await writeFile(path, JSON.stringify(results, null, 2), "utf-8"); + return path; +} + +/** + * Generates and writes the summary markdown file. + */ +export async function writeSummary(baseDir: string, results: ArenaResults) { + const path = join(baseDir, "summary.md"); + + let content = `# Writing Quality Arena Results\n\n`; + content += `**Date:** ${results.timestamp}\n\n`; + content += `**Models:** ${results.models.length}\n\n`; + content += `**Topics:** ${results.topics.length}\n\n`; + + // Aggregate Model Rankings (as writers) + content += `## Aggregate Model Rankings (as Writers)\n\n`; + content += `| Rank | Model | Avg Score | Avg Improvement |\n`; + content += `|------|-------|-----------|----------------|\n`; + + results.aggregateRankings.essays.forEach((entry, index) => { + const sign = entry.avgImprovement >= 0 ? "+" : ""; + content += `| ${index + 1} | ${entry.author} | ${entry.avgScore.toFixed( + 2 + )} | ${sign}${entry.avgImprovement.toFixed(2)} |\n`; + }); + + // Aggregate Reviewer Rankings + content += `\n## Aggregate Reviewer Rankings (by Improvement Impact)\n\n`; + content += `| Rank | Reviewer | Avg Improvement |\n`; + content += `|------|----------|----------------|\n`; + + results.aggregateRankings.reviewers.forEach((entry, index) => { + const sign = entry.avgImprovement >= 0 ? "+" : ""; + content += `| ${index + 1} | ${ + entry.reviewer + } | ${sign}${entry.avgImprovement.toFixed(2)} |\n`; + }); + + // Per-topic summaries + content += `\n## Per-Topic Results\n\n`; + + for (const topic of results.topics) { + content += `### ${topic.topic}\n\n`; + + // Top 3 essays for this topic + content += `**Top 3 Essays:**\n`; + topic.rankings.essays.slice(0, 3).forEach((entry, index) => { + const reviewer = entry.reviewer ? ` (← ${entry.reviewer})` : ""; + content += `${index + 1}. ${entry.author}${reviewer} [${ + entry.type + }] - ${entry.avgScore.toFixed(2)}\n`; + }); + + // Top 3 reviewers for this topic + content += `\n**Top 3 Reviewers:**\n`; + topic.rankings.reviewers.slice(0, 3).forEach((entry, index) => { + const sign = entry.avgImprovement >= 0 ? "+" : ""; + content += `${index + 1}. 
${ + entry.reviewer + } - ${sign}${entry.avgImprovement.toFixed(2)}\n`; + }); + + content += `\n`; } - console.log(`\n✓ Files written:`); - files.forEach((file) => { - console.log(` - ${file.path}`); + await writeFile(path, content, "utf-8"); + return path; +} + +/** + * Creates a new arena run and returns the base directory and timestamp. + */ +export async function initArenaRun(testType: TestType) { + const timestamp = getTimestamp(); + const baseDir = join(RESULTS_DIR, testType, timestamp); + await mkdir(baseDir, { recursive: true }); + return { baseDir, timestamp }; +} + +/** + * Writes a comparison result to the comparisons directory. + */ +export async function writeComparison( + topicDir: string, + judge: string, + essayA: { author: string; reviewer?: string }, + essayB: { author: string; reviewer?: string }, + winner: "A" | "B" | "tie", + reasoning: string +) { + const comparisonsDir = join(topicDir, "comparisons"); + await mkdir(comparisonsDir, { recursive: true }); + + const essayALabel = essayA.reviewer + ? `${sanitizeName(essayA.author)}-revised-by-${sanitizeName( + essayA.reviewer + )}` + : sanitizeName(essayA.author); + const essayBLabel = essayB.reviewer + ? `${sanitizeName(essayB.author)}-revised-by-${sanitizeName( + essayB.reviewer + )}` + : sanitizeName(essayB.author); + + const filename = `${sanitizeName(judge)}-${essayALabel}-vs-${essayBLabel}.md`; + const path = join(comparisonsDir, filename); + + const essayADisplay = essayA.reviewer + ? `${essayA.author} (revised by ${essayA.reviewer})` + : essayA.author; + const essayBDisplay = essayB.reviewer + ? `${essayB.author} (revised by ${essayB.reviewer})` + : essayB.author; + + const winnerDisplay = + winner === "A" ? essayADisplay : winner === "B" ? essayBDisplay : "Tie"; + + await writeFile( + path, + `# Comparison by ${judge}\n\n**Essay A:** ${essayADisplay}\n**Essay B:** ${essayBDisplay}\n\n**Winner:** ${winnerDisplay}\n\n## Reasoning\n\n${reasoning}`, + "utf-8" + ); + return path; +} + +/** + * Writes the 1v1 results JSON file. + */ +export async function writeOneVsOneResultsJson( + baseDir: string, + results: OneVsOneResults +) { + const path = join(baseDir, "results.json"); + await writeFile(path, JSON.stringify(results, null, 2), "utf-8"); + return path; +} + +/** + * Generates and writes the 1v1 summary markdown file. 
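+ *
+ * Win rates shown in these tables are computed upstream as
+ * `wins / (wins + losses + ties)`, so a tie counts against the rate
+ * rather than being excluded; e.g. 6 wins, 2 losses, 2 ties gives
+ * 6/10 = 60.0%.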
+ */ +export async function writeOneVsOneSummary( + baseDir: string, + results: OneVsOneResults +) { + const path = join(baseDir, "summary.md"); + + let content = `# 1v1 Arena Results\n\n`; + content += `**Date:** ${results.timestamp}\n\n`; + content += `**Models:** ${results.models.length}\n\n`; + content += `**Topics:** ${results.topics.length}\n\n`; + + // Aggregate Model Rankings (as Writers) + content += `## Aggregate Model Rankings (as Writers)\n\n`; + content += `| Rank | Model | Wins | Losses | Ties | Win Rate |\n`; + content += `|------|-------|------|--------|------|----------|\n`; + + results.aggregateRankings.essays.forEach((entry, index) => { + content += `| ${index + 1} | ${entry.author} | ${entry.wins} | ${ + entry.losses + } | ${entry.ties} | ${(entry.winRate * 100).toFixed(1)}% |\n`; + }); + + // Aggregate Reviewer Rankings + content += `\n## Aggregate Reviewer Rankings\n\n`; + content += `| Rank | Reviewer | Wins | Losses | Ties | Win Rate |\n`; + content += `|------|----------|------|--------|------|----------|\n`; + + results.aggregateRankings.reviewers.forEach((entry, index) => { + content += `| ${index + 1} | ${entry.reviewer} | ${entry.wins} | ${ + entry.losses + } | ${entry.ties} | ${(entry.winRate * 100).toFixed(1)}% |\n`; }); + + // Aggregate Pairing Rankings + content += `\n## Aggregate Pairing Rankings (Author + Reviewer)\n\n`; + content += `| Rank | Author | Reviewer | Wins | Losses | Ties | Win Rate |\n`; + content += `|------|--------|----------|------|--------|------|----------|\n`; + + results.aggregateRankings.pairings.forEach((entry, index) => { + content += `| ${index + 1} | ${entry.author} | ${entry.reviewer} | ${ + entry.wins + } | ${entry.losses} | ${entry.ties} | ${(entry.winRate * 100).toFixed( + 1 + )}% |\n`; + }); + + // Per-topic summaries + content += `\n## Per-Topic Results\n\n`; + + for (const topic of results.topics) { + content += `### ${topic.topic}\n\n`; + + content += `| Rank | Essay | Wins | Losses | Ties | Win Rate |\n`; + content += `|------|-------|------|--------|------|----------|\n`; + + topic.rankings.essays.forEach((entry, index) => { + const label = entry.reviewer + ? `${entry.author} (← ${entry.reviewer})` + : entry.author; + content += `| ${index + 1} | ${label} | ${entry.wins} | ${ + entry.losses + } | ${entry.ties} | ${(entry.winRate * 100).toFixed(1)}% |\n`; + }); + + content += `\n`; + } + + await writeFile(path, content, "utf-8"); + return path; } diff --git a/index.ts b/index.ts index c68e297..9a593af 100644 --- a/index.ts +++ b/index.ts @@ -1,66 +1,1447 @@ -import { generateEssay, reviewEssay, reviseEssay } from "./aiClient"; -import { writePipelineOutputs } from "./fileUtils"; +import pLimit from "p-limit"; +import { + generateEssay, + reviewEssay, + reviseEssay, + scoreEssay, + compareEssays, + type TokenUsage, +} from "./aiClient"; +import { + modelsToRun as allModels, + dryRunModels, + PARALLEL_LIMIT, + TOPICS, + type RunnableModel, +} from "./constants"; +import { + createTopicDirectories, + initArenaRun, + writeEssay, + writeFeedback, + writeResultsJson, + writeRevision, + writeSummary, + writeComparison, + writeOneVsOneResultsJson, + writeOneVsOneSummary, + type ArenaResults, + type TopicResults, + type TestType, + type OneVsOneResults, + type OneVsOneTopicResults, + type ComparisonResult, +} from "./fileUtils"; + +// Parse CLI flags +const isDryRun = process.argv.includes("--dry-run"); +const modelsToRun = isDryRun ? 
dryRunModels : allModels;
+
+// Parse --test argument
+function getTestTypeFromArgs(): TestType | null {
+  const testArg = process.argv.find((arg) => arg.startsWith("--test="));
+  if (!testArg) return null;
+  const value = testArg.split("=")[1];
+  if (value === "scoring-test" || value === "1v1") {
+    return value;
+  }
+  console.error(`Invalid test type: ${value}. Use "scoring-test" or "1v1".`);
+  process.exit(1);
+}
+
+const limit = pLimit(PARALLEL_LIMIT);
+
+/**
+ * Tracks token usage and costs per model per phase.
+ */
+interface UsageTracker {
+  essays: Record<string, TokenUsage[]>;
+  reviews: Record<string, TokenUsage[]>;
+  revisions: Record<string, TokenUsage[]>;
+  scores: Record<string, TokenUsage[]>;
+  comparisons: Record<string, TokenUsage[]>;
+}
+
+function createUsageTracker(): UsageTracker {
+  const tracker: UsageTracker = {
+    essays: {},
+    reviews: {},
+    revisions: {},
+    scores: {},
+    comparisons: {},
+  };
+  for (const model of modelsToRun) {
+    tracker.essays[model.name] = [];
+    tracker.reviews[model.name] = [];
+    tracker.revisions[model.name] = [];
+    tracker.scores[model.name] = [];
+    tracker.comparisons[model.name] = [];
+  }
+  return tracker;
+}
+
+let usageTracker = createUsageTracker();
 
 /**
- * Prompts the user for an essay topic via stdin.
+ * Interactive test type selection UI.
  */
-async function promptForTopic(): Promise<string> {
-  const prompt = "Enter your essay topic: ";
-  process.stdout.write(prompt);
+async function selectTestType(): Promise<TestType> {
+  console.log("\n🏟️ Writing Quality Arena\n");
+  console.log("Select test type:\n");
+  console.log("  1. scoring-test - Models score essays on a 1-10 scale");
+  console.log("  2. 1v1 - Head-to-head essay comparisons\n");
+
+  process.stdout.write("Enter choice (1 or 2): ");
 
   return new Promise((resolve) => {
     process.stdin.once("data", (data) => {
-      const topic = data.toString().trim();
-      if (!topic) {
-        console.error("Topic cannot be empty. Please try again.");
-        process.exit(1);
+      const input = data.toString().trim();
+      if (input === "1" || input === "scoring-test") {
+        resolve("scoring-test");
+      } else if (input === "2" || input === "1v1") {
+        resolve("1v1");
+      } else {
+        console.log("Invalid choice, defaulting to scoring-test");
+        resolve("scoring-test");
       }
-      resolve(topic);
+    });
+  });
+}
+
+// ============================================================================
+// SHARED PHASES (used by both test types)
+// ============================================================================
+
+/**
+ * Phase 1: Each model generates an essay on the topic.
+ */
+async function runPhase1Essays(
+  topic: string,
+  topicDir: string
+): Promise<Record<string, string>> {
+  const essays: Record<string, string> = {};
+
+  const tasks = modelsToRun.map((model) =>
+    limit(async () => {
+      console.log(`  Generating essay: ${model.name}...`);
+      const result = await generateEssay(model, topic);
+      essays[model.name] = result.text;
+      usageTracker.essays[model.name]!.push(result.usage);
+      await writeEssay(topicDir, model.name, result.text);
+      console.log(
+        `  ✓ ${model.name} (${
+          result.usage.totalTokens
+        } tokens, $${result.usage.cost.toFixed(4)})`
+      );
+      return result;
+    })
+  );
+
+  await Promise.all(tasks);
+  return essays;
+}
+
+/**
+ * Phase 2: Every model reviews every OTHER model's essay.
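+ *
+ * The returned map is keyed reviewer-first; a shape sketch using two of
+ * the configured model names:
+ *
+ * ```ts
+ * // gpt-5.1's review of kimi-k2-thinking's essay; the self-review
+ * // slot feedback["gpt-5.1"]["gpt-5.1"] is never populated.
+ * const note = feedback["gpt-5.1"]["kimi-k2-thinking"];
+ * ```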
+ */
+async function runPhase2Feedback(
+  topic: string,
+  essays: Record<string, string>,
+  topicDir: string
+): Promise<Record<string, Record<string, string>>> {
+  const feedback: Record<string, Record<string, string>> = {};
+
+  // Initialize nested objects
+  for (const reviewer of modelsToRun) {
+    feedback[reviewer.name] = {};
+  }
+
+  const tasks: Array<Promise<void>> = [];
+
+  for (const reviewer of modelsToRun) {
+    for (const author of modelsToRun) {
+      if (reviewer.name === author.name) continue;
+
+      tasks.push(
+        limit(async () => {
+          console.log(`  ${reviewer.name} reviewing ${author.name}...`);
+          const essayText = essays[author.name]!;
+          const result = await reviewEssay(reviewer, essayText, topic);
+          feedback[reviewer.name]![author.name] = result.text;
+          usageTracker.reviews[reviewer.name]!.push(result.usage);
+          await writeFeedback(
+            topicDir,
+            reviewer.name,
+            author.name,
+            result.text
+          );
+          console.log(
+            `  ✓ ${reviewer.name} → ${author.name} (${
+              result.usage.totalTokens
+            } tokens, $${result.usage.cost.toFixed(4)})`
+          );
+        })
+      );
+    }
+  }
+
+  await Promise.all(tasks);
+  return feedback;
+}
+
+/**
+ * Phase 3: Each author revises their essay for EACH piece of feedback received.
+ */
+async function runPhase3Revisions(
+  topic: string,
+  essays: Record<string, string>,
+  feedback: Record<string, Record<string, string>>,
+  topicDir: string
+): Promise<Record<string, Record<string, string>>> {
+  const revisions: Record<string, Record<string, string>> = {};
+
+  // Initialize nested objects
+  for (const author of modelsToRun) {
+    revisions[author.name] = {};
+  }
+
+  const tasks: Array<Promise<void>> = [];
+
+  for (const author of modelsToRun) {
+    for (const reviewer of modelsToRun) {
+      if (author.name === reviewer.name) continue;
+
+      tasks.push(
+        limit(async () => {
+          const reviewerFeedback = feedback[reviewer.name]![author.name]!;
+          const essayText = essays[author.name]!;
+          console.log(
+            `  ${author.name} revising based on ${reviewer.name}...`
+          );
+          const result = await reviseEssay(
+            author,
+            topic,
+            essayText,
+            reviewerFeedback
+          );
+          revisions[author.name]![reviewer.name] = result.text;
+          usageTracker.revisions[author.name]!.push(result.usage);
+          await writeRevision(
+            topicDir,
+            author.name,
+            reviewer.name,
+            result.text
+          );
+          console.log(
+            `  ✓ ${author.name} ← ${reviewer.name} (${
+              result.usage.totalTokens
+            } tokens, $${result.usage.cost.toFixed(4)})`
+          );
+        })
+      );
+    }
+  }
+
+  await Promise.all(tasks);
+  return revisions;
+}
+
+// ============================================================================
+// SCORING TEST SPECIFIC
+// ============================================================================
+
+/**
+ * Counts API calls for scoring test.
+ */
+function countScoringApiCalls() {
+  let essays = 0;
+  let feedback = 0;
+  let revisions = 0;
+  let scores = 0;
+
+  for (const _topic of TOPICS) {
+    for (const _model of modelsToRun) {
+      essays++;
+    }
+
+    for (const reviewer of modelsToRun) {
+      for (const author of modelsToRun) {
+        if (reviewer.name === author.name) continue;
+        feedback++;
+      }
+    }
+
+    for (const author of modelsToRun) {
+      for (const reviewer of modelsToRun) {
+        if (author.name === reviewer.name) continue;
+        revisions++;
+      }
+    }
+
+    for (const _judge of modelsToRun) {
+      for (const _author of modelsToRun) {
+        scores++;
+      }
+    }
+    for (const _judge of modelsToRun) {
+      for (const author of modelsToRun) {
+        for (const reviewer of modelsToRun) {
+          if (author.name === reviewer.name) continue;
+          scores++;
+        }
+      }
+    }
+  }
+
+  return {
+    essays,
+    feedback,
+    revisions,
+    scores,
+    total: essays + feedback + revisions + scores,
+  };
+}
+
+/**
+ * Prompts for scoring test confirmation.
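+ *
+ * The displayed counts reduce to closed forms per topic: N essays,
+ * N*(N-1) feedback calls, N*(N-1) revisions, and N*(N + N*(N-1)) = N^3
+ * scores. For the dry-run set (N = 3) that is 3 + 6 + 6 + 27 = 42 calls
+ * per topic.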
+ */
+async function confirmScoringRun(): Promise<boolean> {
+  const { essays, feedback, revisions, scores, total } = countScoringApiCalls();
+
+  console.log("\n🏟️ Writing Quality Arena - Scoring Test\n");
+  if (isDryRun) {
+    console.log("⚡ DRY RUN MODE (using cheap models)\n");
+  }
+  console.log(`Models: ${modelsToRun.length}`);
+  console.log(`Topics: ${TOPICS.length}`);
+  console.log(`\nAPI Call Breakdown (across all ${TOPICS.length} topics):`);
+  console.log(`  Phase 1 - Essays:    ${essays.toString().padStart(6)} calls`);
+  console.log(
+    `  Phase 2 - Feedback:  ${feedback.toString().padStart(6)} calls`
+  );
+  console.log(
+    `  Phase 3 - Revisions: ${revisions.toString().padStart(6)} calls`
+  );
+  console.log(`  Phase 4 - Scores:    ${scores.toString().padStart(6)} calls`);
+  console.log(`  ────────────────────────────`);
+  console.log(`  Total:               ${total.toString().padStart(6)} calls\n`);
+  console.log(`Parallelism: ${PARALLEL_LIMIT} concurrent requests\n`);
+
+  process.stdout.write("Proceed? (Y/n): ");
+
+  return new Promise((resolve) => {
+    process.stdin.once("data", (data) => {
+      const input = data.toString().trim().toLowerCase();
+      resolve(input === "" || input === "y" || input === "yes");
+    });
+  });
+}
+
+/**
+ * Phase 4 (Scoring): Every model scores every essay.
+ */
+async function runPhase4Scoring(
+  topic: string,
+  essays: Record<string, string>,
+  revisions: Record<string, Record<string, string>>
+): Promise<{
+  original: Record<
+    string,
+    Record<string, { score: number; justification: string }>
+  >;
+  revised: Record<
+    string,
+    Record<string, Record<string, { score: number; justification: string }>>
+  >;
+}> {
+  const originalScores: Record<
+    string,
+    Record<string, { score: number; justification: string }>
+  > = {};
+  const revisedScores: Record<
+    string,
+    Record<string, Record<string, { score: number; justification: string }>>
+  > = {};
+
+  for (const judge of modelsToRun) {
+    originalScores[judge.name] = {};
+    revisedScores[judge.name] = {};
+    for (const author of modelsToRun) {
+      revisedScores[judge.name]![author.name] = {};
+    }
+  }
+
+  const tasks: Array<Promise<void>> = [];
+
+  for (const judge of modelsToRun) {
+    for (const author of modelsToRun) {
+      tasks.push(
+        limit(async () => {
+          const essayText = essays[author.name]!;
+          console.log(`  ${judge.name} scoring ${author.name} (original)...`);
+          const result = await scoreEssay(judge, essayText, topic);
+          originalScores[judge.name]![author.name] = {
+            score: result.score,
+            justification: result.justification,
+          };
+          usageTracker.scores[judge.name]!.push(result.usage);
+          console.log(
+            `  ✓ ${judge.name} → ${author.name} (original): ${
+              result.score
+            } (${result.usage.totalTokens} tokens, $${result.usage.cost.toFixed(
+              4
+            )})`
+          );
+        })
+      );
+    }
+  }
+
+  for (const judge of modelsToRun) {
+    for (const author of modelsToRun) {
+      for (const reviewer of modelsToRun) {
+        if (author.name === reviewer.name) continue;
+
+        tasks.push(
+          limit(async () => {
+            const revision = revisions[author.name]![reviewer.name]!;
+            console.log(
+              `  ${judge.name} scoring ${author.name}←${reviewer.name} (revised)...`
+            );
+            const result = await scoreEssay(judge, revision, topic);
+            revisedScores[judge.name]![author.name]![reviewer.name] = {
+              score: result.score,
+              justification: result.justification,
+            };
+            usageTracker.scores[judge.name]!.push(result.usage);
+            console.log(
+              `  ✓ ${judge.name} → ${author.name}←${reviewer.name}: ${
+                result.score
+              } (${
+                result.usage.totalTokens
+              } tokens, $${result.usage.cost.toFixed(4)})`
+            );
+          })
+        );
+      }
+    }
+  }
+
+  await Promise.all(tasks);
+  return { original: originalScores, revised: revisedScores };
+}
+
+/**
+ * Calculate rankings from scores for a single topic.
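+ *
+ * A worked sketch of the improvement metric, assuming two judges who gave
+ * an original essay 6 and 7 and its revision 7 and 8:
+ *
+ * ```ts
+ * // improvement = mean revised score - mean original score,
+ * // credited to the reviewer whose feedback drove the revision
+ * const improvement = (7 + 8) / 2 - (6 + 7) / 2; // +1.0
+ * ```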
+ */
+function calculateScoringRankings(scores: {
+  original: Record<
+    string,
+    Record<string, { score: number; justification: string }>
+  >;
+  revised: Record<
+    string,
+    Record<string, Record<string, { score: number; justification: string }>>
+  >;
+}): TopicResults["rankings"] {
+  const essayScores: Array<{
+    type: "original" | "revised";
+    author: string;
+    reviewer?: string;
+    avgScore: number;
+  }> = [];
+
+  const judges = Object.keys(scores.original);
+  const firstJudge = judges[0]!;
+  const authors = Object.keys(scores.original[firstJudge]!);
+
+  for (const author of authors) {
+    const judgeScores = judges.map((j) => scores.original[j]![author]!.score);
+    const avgScore =
+      judgeScores.reduce((a, b) => a + b, 0) / judgeScores.length;
+    essayScores.push({ type: "original", author, avgScore });
+  }
+
+  for (const author of authors) {
+    for (const reviewer of authors) {
+      if (author === reviewer) continue;
+      const judgeScores = judges.map(
+        (j) => scores.revised[j]![author]![reviewer]!.score
+      );
+      const avgScore =
+        judgeScores.reduce((a, b) => a + b, 0) / judgeScores.length;
+      essayScores.push({ type: "revised", author, reviewer, avgScore });
+    }
+  }
+
+  essayScores.sort((a, b) => b.avgScore - a.avgScore);
+
+  const reviewerImpact: Record<string, number[]> = {};
+  for (const reviewer of authors) {
+    reviewerImpact[reviewer] = [];
+  }
+
+  for (const author of authors) {
+    const originalAvg =
+      judges.reduce((sum, j) => sum + scores.original[j]![author]!.score, 0) /
+      judges.length;
+
+    for (const reviewer of authors) {
+      if (author === reviewer) continue;
+      const revisedAvg =
+        judges.reduce(
+          (sum, j) => sum + scores.revised[j]![author]![reviewer]!.score,
+          0
+        ) / judges.length;
+      const improvement = revisedAvg - originalAvg;
+      reviewerImpact[reviewer]!.push(improvement);
+    }
+  }
+
+  const reviewerScores = Object.entries(reviewerImpact).map(
+    ([reviewer, improvements]) => ({
+      reviewer,
+      avgImprovement:
+        improvements.reduce((a, b) => a + b, 0) / improvements.length,
+    })
+  );
+
+  reviewerScores.sort((a, b) => b.avgImprovement - a.avgImprovement);
+
+  return {
+    essays: essayScores,
+    reviewers: reviewerScores,
+  };
+}
+
+/**
+ * Calculate aggregate rankings across all topics for scoring test.
+ */
+function calculateScoringAggregateRankings(
+  topics: TopicResults[]
+): ArenaResults["aggregateRankings"] {
+  const modelScores: Record<
+    string,
+    { scores: number[]; improvements: number[] }
+  > = {};
+  const reviewerImprovements: Record<string, number[]> = {};
+
+  for (const topic of topics) {
+    const originalByAuthor: Record<string, number> = {};
+    for (const entry of topic.rankings.essays) {
+      if (entry.type === "original") {
+        originalByAuthor[entry.author] = entry.avgScore;
+        if (!modelScores[entry.author]) {
+          modelScores[entry.author] = { scores: [], improvements: [] };
+        }
+        modelScores[entry.author]!.scores.push(entry.avgScore);
+      }
+    }
+
+    for (const entry of topic.rankings.essays) {
+      if (entry.type === "revised" && entry.reviewer) {
+        const original = originalByAuthor[entry.author]!;
+        const improvement = entry.avgScore - original;
+        modelScores[entry.author]!.improvements.push(improvement);
+
+        if (!reviewerImprovements[entry.reviewer]) {
+          reviewerImprovements[entry.reviewer] = [];
+        }
+        reviewerImprovements[entry.reviewer]!.push(improvement);
+      }
+    }
+  }
+
+  const essayRankings = Object.entries(modelScores).map(([author, data]) => ({
+    author,
+    avgScore: data.scores.reduce((a, b) => a + b, 0) / data.scores.length,
+    avgImprovement:
+      data.improvements.length > 0
+        ? data.improvements.reduce((a, b) => a + b, 0) /
+          data.improvements.length
+        : 0,
+  }));
+  essayRankings.sort((a, b) => b.avgScore - a.avgScore);
+
+  const reviewerRankings = Object.entries(reviewerImprovements).map(
+    ([reviewer, improvements]) => ({
+      reviewer,
+      avgImprovement:
+        improvements.reduce((a, b) => a + b, 0) / improvements.length,
+    })
+  );
+  reviewerRankings.sort((a, b) => b.avgImprovement - a.avgImprovement);
+
+  return {
+    essays: essayRankings,
+    reviewers: reviewerRankings,
+  };
+}
+
+/**
+ * Prints topic results for scoring test.
+ */
+function printScoringTopicResults(result: TopicResults) {
+  console.log(`\n  📊 Results for "${result.topic}":\n`);
+
+  console.log("  📝 Essay Rankings (by avg score):");
+  result.rankings.essays.slice(0, 5).forEach((entry, index) => {
+    const label = entry.reviewer
+      ? `${entry.author} ← ${entry.reviewer} (revised)`
+      : `${entry.author} (original)`;
+    console.log(`    ${index + 1}. ${label} - ${entry.avgScore.toFixed(2)}`);
+  });
+  if (result.rankings.essays.length > 5) {
+    console.log(`    ... and ${result.rankings.essays.length - 5} more`);
+  }
+
+  console.log("\n  🎯 Reviewer Rankings (by improvement impact):");
+  result.rankings.reviewers.forEach((entry, index) => {
+    const sign = entry.avgImprovement >= 0 ? "+" : "";
+    console.log(
+      `    ${index + 1}. ${
+        entry.reviewer
+      } - ${sign}${entry.avgImprovement.toFixed(2)}`
+    );
+  });
+}
+
+/**
+ * Run all phases for a single topic (scoring test).
+ */
+async function runScoringTopicArena(
+  topic: string,
+  topicIndex: number,
+  totalTopics: number,
+  baseDir: string
+): Promise<TopicResults> {
+  console.log(
+    `\n${"═".repeat(60)}\n📚 Topic ${
+      topicIndex + 1
+    }/${totalTopics}: "${topic}"\n${"═".repeat(60)}`
+  );
+
+  const topicDir = await createTopicDirectories(baseDir, topic);
+
+  console.log("\n  📝 Phase 1: Essay Generation");
+  const essays = await runPhase1Essays(topic, topicDir);
+  console.log(`  ✓ Phase 1 complete: ${modelsToRun.length} essays`);
+
+  console.log("\n  📋 Phase 2: Feedback Generation");
+  const feedback = await runPhase2Feedback(topic, essays, topicDir);
+  const feedbackCount = modelsToRun.length * (modelsToRun.length - 1);
+  console.log(`  ✓ Phase 2 complete: ${feedbackCount} feedback pieces`);
+
+  console.log("\n  ✏️ Phase 3: Revisions");
+  const revisions = await runPhase3Revisions(topic, essays, feedback, topicDir);
+  console.log(`  ✓ Phase 3 complete: ${feedbackCount} revisions`);
+
+  console.log("\n  ⭐ Phase 4: Scoring");
+  const scores = await runPhase4Scoring(topic, essays, revisions);
+  console.log(`  ✓ Phase 4 complete`);
+
+  const rankings = calculateScoringRankings(scores);
+
+  return {
+    topic,
+    essays,
+    feedback,
+    revisions,
+    scores,
+    rankings,
+  };
+}
+
+/**
+ * Formats duration in milliseconds to human-readable string.
+ */
+function formatDuration(ms: number): string {
+  if (ms < 1000) return `${ms}ms`;
+  const seconds = Math.floor(ms / 1000);
+  if (seconds < 60) return `${seconds}s`;
+  const minutes = Math.floor(seconds / 60);
+  const remainingSeconds = seconds % 60;
+  if (minutes < 60) return `${minutes}m ${remainingSeconds}s`;
+  const hours = Math.floor(minutes / 60);
+  const remainingMinutes = minutes % 60;
+  return `${hours}h ${remainingMinutes}m ${remainingSeconds}s`;
+}
+
+/**
+ * Main scoring test orchestration.
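+ *
+ * An invocation sketch (the flags are parsed at the top of this file; how
+ * they are wired into the entrypoint is assumed here):
+ *
+ * ```
+ * bun run index.ts --test=scoring-test            # full model set
+ * bun run index.ts --test=scoring-test --dry-run  # cheap dry-run models
+ * ```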
+ */
+async function runScoringTest(): Promise<void> {
+  usageTracker = createUsageTracker();
+
+  const confirmed = await confirmScoringRun();
+  if (!confirmed) {
+    console.log("\nAborted.");
+    process.exit(0);
+  }
+
+  const overallStart = Date.now();
+
+  const { baseDir, timestamp } = await initArenaRun("scoring-test");
+  console.log(`\nResults will be saved to: ${baseDir}`);
+
+  const topicResults: TopicResults[] = [];
+  const topicTimes: Array<{ topic: string; duration: number }> = [];
+
+  for (let i = 0; i < TOPICS.length; i++) {
+    const topic = TOPICS[i]!;
+    const topicStart = Date.now();
+    const result = await runScoringTopicArena(topic, i, TOPICS.length, baseDir);
+    const topicDuration = Date.now() - topicStart;
+    topicResults.push(result);
+    topicTimes.push({ topic, duration: topicDuration });
+    printScoringTopicResults(result);
+    console.log(`\n  ⏱️ Topic completed in ${formatDuration(topicDuration)}`);
+  }
+
+  console.log("\n\n📊 Calculating aggregate rankings...\n");
+
+  // Log all topic results before aggregate
+  console.log("═".repeat(60));
+  console.log("\n📋 INDIVIDUAL TOPIC RESULTS\n");
+  for (const result of topicResults) {
+    printScoringTopicResults(result);
+    console.log("");
+  }
+  const aggregateRankings = calculateScoringAggregateRankings(topicResults);
+
+  const results: ArenaResults = {
+    timestamp,
+    models: modelsToRun.map((m) => m.name),
+    topics: topicResults,
+    aggregateRankings,
+  };
+
+  await writeResultsJson(baseDir, results);
+  await writeSummary(baseDir, results);
+
+  console.log("═".repeat(60));
+  console.log("\n🏆 AGGREGATE RESULTS\n");
+
+  console.log("📝 Models (as Writers):\n");
+  aggregateRankings.essays.forEach((entry, index) => {
+    const sign = entry.avgImprovement >= 0 ? "+" : "";
+    console.log(
+      `  ${index + 1}. ${entry.author} - ${entry.avgScore.toFixed(
+        2
+      )} avg (${sign}${entry.avgImprovement.toFixed(2)} after feedback)`
+    );
+  });
+
+  console.log("\n🎯 Reviewers (by improvement impact):\n");
+  aggregateRankings.reviewers.forEach((entry, index) => {
+    const sign = entry.avgImprovement >= 0 ? "+" : "";
+    console.log(
+      `  ${index + 1}. ${
+        entry.reviewer
+      } - ${sign}${entry.avgImprovement.toFixed(2)}`
+    );
+  });
+
+  printUsageSummary("scoring-test");
+
+  const overallDuration = Date.now() - overallStart;
+  console.log("\n" + "═".repeat(60));
+  console.log("\n⏱️ RUNTIME SUMMARY\n");
+  topicTimes.forEach((t) => {
+    console.log(
+      `  ${t.topic.slice(0, 40).padEnd(42)} ${formatDuration(t.duration)}`
+    );
+  });
+  console.log(`  ${"─".repeat(50)}`);
+  console.log(`  ${"Total".padEnd(42)} ${formatDuration(overallDuration)}`);
+
+  console.log(`\n✨ Scoring test complete! Results saved to: ${baseDir}`);
+}
+
+// ============================================================================
+// 1V1 TEST SPECIFIC
+// ============================================================================
+
+/**
+ * Counts API calls for 1v1 test.
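+ *
+ * Per topic there are N originals plus N*(N-1) revisions, i.e. N^2 essays
+ * in total; every unordered pair C(N^2, 2) is judged by all N models. For
+ * the dry-run set (N = 3): 9 essays, 36 pairs, 36 * 3 = 108 comparisons.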
+ */
+function countOneVsOneApiCalls() {
+  let essays = 0;
+  let feedback = 0;
+  let revisions = 0;
+  let comparisons = 0;
+
+  const n = modelsToRun.length;
+
+  for (const _topic of TOPICS) {
+    // Phase 1: Essays
+    essays += n;
+
+    // Phase 2: Feedback (each model reviews every other)
+    feedback += n * (n - 1);
+
+    // Phase 3: Revisions (each author revises per reviewer)
+    revisions += n * (n - 1);
+
+    // Phase 4: Comparisons
+    // All essays compete pairwise: n originals + n*(n-1) revisions = n^2
+    // essays, so C(n^2, 2) unique pairs, each judged by all n models
+    // (this mirrors how runPhase4Comparisons builds its pair list).
+    const totalEssayCount = n + n * (n - 1); // = n^2
+    const essayPairs = (totalEssayCount * (totalEssayCount - 1)) / 2;
+    comparisons += essayPairs * n;
+  }
+
+  return {
+    essays,
+    feedback,
+    revisions,
+    comparisons,
+    total: essays + feedback + revisions + comparisons,
+  };
+}
+
+/**
+ * Prompts for 1v1 test confirmation.
+ */
+async function confirmOneVsOneRun(): Promise<boolean> {
+  const { essays, feedback, revisions, comparisons, total } =
+    countOneVsOneApiCalls();
+
+  console.log("\n🏟️ Writing Quality Arena - 1v1 Test\n");
+  if (isDryRun) {
+    console.log("⚡ DRY RUN MODE (using cheap models)\n");
+  }
+  console.log(`Models: ${modelsToRun.length}`);
+  console.log(`Topics: ${TOPICS.length}`);
+  console.log(`\nAPI Call Breakdown (across all ${TOPICS.length} topics):`);
+  console.log(
+    `  Phase 1 - Essays:      ${essays.toString().padStart(6)} calls`
+  );
+  console.log(
+    `  Phase 2 - Feedback:    ${feedback.toString().padStart(6)} calls`
+  );
+  console.log(
+    `  Phase 3 - Revisions:   ${revisions.toString().padStart(6)} calls`
+  );
+  console.log(
+    `  Phase 4 - Comparisons: ${comparisons.toString().padStart(6)} calls`
+  );
+  console.log(`  ────────────────────────────`);
+  console.log(
+    `  Total:                 ${total.toString().padStart(6)} calls\n`
+  );
+  console.log(`Parallelism: ${PARALLEL_LIMIT} concurrent requests\n`);
+
+  process.stdout.write("Proceed? (Y/n): ");
+
+  return new Promise((resolve) => {
+    process.stdin.once("data", (data) => {
+      const input = data.toString().trim().toLowerCase();
+      resolve(input === "" || input === "y" || input === "yes");
+    });
+  });
+}
 
 /**
- * Runs the complete essay pipeline: generation → review → revision.
+ * Phase 4 (1v1): Head-to-head comparisons of all essays.
  */
-async function runEssayPipeline(): Promise<void> {
-  console.log("🎓 Auto-Draftify: Essay Generation Pipeline\n");
+async function runPhase4Comparisons(
+  topic: string,
+  essays: Record<string, string>,
+  revisions: Record<string, Record<string, string>>,
+  topicDir: string
+): Promise<ComparisonResult[]> {
+  const comparisons: ComparisonResult[] = [];
+
+  // Build list of all essays (original + revised)
+  interface EssayEntry {
+    author: string;
+    reviewer?: string;
+    text: string;
+  }
+
+  const allEssays: EssayEntry[] = [];
+
+  // Add original essays
+  for (const author of Object.keys(essays)) {
+    allEssays.push({ author, text: essays[author]! });
 
 /**
- * Runs the complete essay pipeline: generation → review → revision.
+ * Phase 4 (1v1): Head-to-head comparisons of all essays.
  */
-async function runEssayPipeline(): Promise<void> {
-  console.log("🎓 Auto-Draftify: Essay Generation Pipeline\n");
+async function runPhase4Comparisons(
+  topic: string,
+  essays: Record<string, string>,
+  revisions: Record<string, Record<string, string>>,
+  topicDir: string
+): Promise<ComparisonResult[]> {
+  const comparisons: ComparisonResult[] = [];
+
+  // Build list of all essays (original + revised)
+  interface EssayEntry {
+    author: string;
+    reviewer?: string;
+    text: string;
+  }
+
+  const allEssays: EssayEntry[] = [];
+
+  // Add original essays
+  for (const author of Object.keys(essays)) {
+    allEssays.push({ author, text: essays[author]! });
+  }
+
+  // Add revised essays
+  for (const author of Object.keys(revisions)) {
+    for (const reviewer of Object.keys(revisions[author]!)) {
+      allEssays.push({
+        author,
+        reviewer,
+        text: revisions[author]![reviewer]!,
+      });
+    }
+  }
+
+  // Generate all unique pairs
+  const pairs: Array<[EssayEntry, EssayEntry]> = [];
+  for (let i = 0; i < allEssays.length; i++) {
+    for (let j = i + 1; j < allEssays.length; j++) {
+      pairs.push([allEssays[i]!, allEssays[j]!]);
+    }
+  }
+
+  const tasks: Array<Promise<void>> = [];
 
-  // Step 1: Get topic from user
-  const topic = await promptForTopic();
-  console.log(`\n📝 Topic: ${topic}\n`);
+  for (const judge of modelsToRun) {
+    for (const [essayA, essayB] of pairs) {
+      tasks.push(
+        limit(async () => {
+          const labelA = essayA.reviewer
+            ? `${essayA.author}←${essayA.reviewer}`
+            : essayA.author;
+          const labelB = essayB.reviewer
+            ? `${essayB.author}←${essayB.reviewer}`
+            : essayB.author;
 
-  // Step 2: Generate initial essay
-  console.log("Step 1/3: Generating initial essay...");
-  const essayResult = await generateEssay(topic);
-  console.log("✓ Essay generated\n");
+          console.log(`  ${judge.name} comparing ${labelA} vs ${labelB}...`);
+
+          const result = await compareEssays(
+            judge,
+            { author: essayA.author, text: essayA.text },
+            { author: essayB.author, text: essayB.text },
+            topic
+          );
+
+          const comparison: ComparisonResult = {
+            judge: judge.name,
+            essayA: { author: essayA.author, reviewer: essayA.reviewer },
+            essayB: { author: essayB.author, reviewer: essayB.reviewer },
+            winner: result.winner,
+            reasoning: result.reasoning,
+          };
+
+          comparisons.push(comparison);
+          usageTracker.comparisons[judge.name]!.push(result.usage);
+
+          await writeComparison(
+            topicDir,
+            judge.name,
+            { author: essayA.author, reviewer: essayA.reviewer },
+            { author: essayB.author, reviewer: essayB.reviewer },
+            result.winner,
+            result.reasoning
+          );
+
+          const winnerLabel =
+            result.winner === "A"
+              ? labelA
+              : result.winner === "B"
+              ? labelB
+              : "Tie";
+          console.log(
+            `  ✓ ${judge.name}: ${labelA} vs ${labelB} → ${winnerLabel} (${
+              result.usage.totalTokens
+            } tokens, $${result.usage.cost.toFixed(4)})`
+          );
+        })
+      );
+    }
+  }
+
+  await Promise.all(tasks);
+  return comparisons;
+}
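+
+// Rankings key essays as "author" for originals and "author:reviewer" for
+// revisions, so one model can appear once per piece of feedback it received.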
+
+/**
+ * Calculate rankings from comparisons for a single topic.
+ */
+function calculateOneVsOneRankings(
+  comparisons: ComparisonResult[]
+): OneVsOneTopicResults["rankings"] {
+  // Track wins/losses/ties per essay
+  const stats: Record<
+    string,
+    {
+      wins: number;
+      losses: number;
+      ties: number;
+      author: string;
+      reviewer?: string;
+    }
+  > = {};
+
+  function getKey(author: string, reviewer?: string) {
+    return reviewer ? `${author}:${reviewer}` : author;
+  }
+
+  for (const comp of comparisons) {
+    const keyA = getKey(comp.essayA.author, comp.essayA.reviewer);
+    const keyB = getKey(comp.essayB.author, comp.essayB.reviewer);
+
+    if (!stats[keyA]) {
+      stats[keyA] = {
+        wins: 0,
+        losses: 0,
+        ties: 0,
+        author: comp.essayA.author,
+        reviewer: comp.essayA.reviewer,
+      };
+    }
+    if (!stats[keyB]) {
+      stats[keyB] = {
+        wins: 0,
+        losses: 0,
+        ties: 0,
+        author: comp.essayB.author,
+        reviewer: comp.essayB.reviewer,
+      };
+    }
+
+    if (comp.winner === "A") {
+      stats[keyA]!.wins++;
+      stats[keyB]!.losses++;
+    } else if (comp.winner === "B") {
+      stats[keyB]!.wins++;
+      stats[keyA]!.losses++;
+    } else {
+      stats[keyA]!.ties++;
+      stats[keyB]!.ties++;
+    }
+  }
+
+  const essays = Object.values(stats).map((s) => ({
+    author: s.author,
+    reviewer: s.reviewer,
+    wins: s.wins,
+    losses: s.losses,
+    ties: s.ties,
+    winRate:
+      s.wins + s.losses + s.ties > 0
+        ? s.wins / (s.wins + s.losses + s.ties)
+        : 0,
+  }));
+
+  essays.sort((a, b) => b.winRate - a.winRate || b.wins - a.wins);
+
+  return { essays };
+}
+
+/**
+ * Calculate aggregate rankings across all topics for 1v1 test.
+ */
+function calculateOneVsOneAggregateRankings(
+  topics: OneVsOneTopicResults[]
+): OneVsOneResults["aggregateRankings"] {
+  // Aggregate by original author only (not per-revision)
+  const authorStats: Record<
+    string,
+    { wins: number; losses: number; ties: number }
+  > = {};
+
+  // Aggregate by reviewer (how well essays do after being revised by this reviewer)
+  const reviewerStats: Record<
+    string,
+    { wins: number; losses: number; ties: number }
+  > = {};
+
+  // Aggregate by author+reviewer pairing
+  const pairingStats: Record<
+    string,
+    {
+      author: string;
+      reviewer: string;
+      wins: number;
+      losses: number;
+      ties: number;
+    }
+  > = {};
+
+  for (const topic of topics) {
+    for (const entry of topic.rankings.essays) {
+      if (!entry.reviewer) {
+        // Original essay - count for author
+        if (!authorStats[entry.author]) {
+          authorStats[entry.author] = { wins: 0, losses: 0, ties: 0 };
+        }
+        authorStats[entry.author]!.wins += entry.wins;
+        authorStats[entry.author]!.losses += entry.losses;
+        authorStats[entry.author]!.ties += entry.ties;
+      } else {
+        // Revised essay - count for reviewer and pairing
+        if (!reviewerStats[entry.reviewer]) {
+          reviewerStats[entry.reviewer] = { wins: 0, losses: 0, ties: 0 };
+        }
+        reviewerStats[entry.reviewer]!.wins += entry.wins;
+        reviewerStats[entry.reviewer]!.losses += entry.losses;
+        reviewerStats[entry.reviewer]!.ties += entry.ties;
+
+        const pairingKey = `${entry.author}:${entry.reviewer}`;
+        if (!pairingStats[pairingKey]) {
+          pairingStats[pairingKey] = {
+            author: entry.author,
+            reviewer: entry.reviewer,
+            wins: 0,
+            losses: 0,
+            ties: 0,
+          };
+        }
+        pairingStats[pairingKey]!.wins += entry.wins;
+        pairingStats[pairingKey]!.losses += entry.losses;
+        pairingStats[pairingKey]!.ties += entry.ties;
+      }
+    }
+  }
 
-  // Step 3: Review the essay
-  console.log("Step 2/3: Reviewing essay...");
-  const reviewResult = await reviewEssay(essayResult.text);
-  console.log("✓ Review completed\n");
+  const calcWinRate = (s: { wins: number; losses: number; ties: number }) =>
+    s.wins + s.losses + s.ties > 0 ? s.wins / (s.wins + s.losses + s.ties) : 0;
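+  // e.g. a 7W/3L/1T record gives 7 / 11 ≈ 63.6%; ties count as games played,
+  // so they lower the win rate rather than being ignored.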
 
-  // Step 4: Revise the essay
-  console.log("Step 3/3: Revising essay based on feedback...");
-  const revisionResult = await reviseEssay(
+  const essays = Object.entries(authorStats).map(([author, s]) => ({
+    author,
+    wins: s.wins,
+    losses: s.losses,
+    ties: s.ties,
+    winRate: calcWinRate(s),
+  }));
+  essays.sort((a, b) => b.winRate - a.winRate || b.wins - a.wins);
+
+  const reviewers = Object.entries(reviewerStats).map(([reviewer, s]) => ({
+    reviewer,
+    wins: s.wins,
+    losses: s.losses,
+    ties: s.ties,
+    winRate: calcWinRate(s),
+  }));
+  reviewers.sort((a, b) => b.winRate - a.winRate || b.wins - a.wins);
+
+  const pairings = Object.values(pairingStats).map((s) => ({
+    author: s.author,
+    reviewer: s.reviewer,
+    wins: s.wins,
+    losses: s.losses,
+    ties: s.ties,
+    winRate: calcWinRate(s),
+  }));
+  pairings.sort((a, b) => b.winRate - a.winRate || b.wins - a.wins);
+
+  return { essays, reviewers, pairings };
+}
+
+/**
+ * Prints topic results for 1v1 test.
+ */
+function printOneVsOneTopicResults(result: OneVsOneTopicResults) {
+  console.log(`\n  📊 Results for "${result.topic}":\n`);
+
+  console.log("  📝 Essay Rankings (by win rate):");
+  result.rankings.essays.slice(0, 5).forEach((entry, index) => {
+    const label = entry.reviewer
+      ? `${entry.author} ← ${entry.reviewer} (revised)`
+      : `${entry.author} (original)`;
+    console.log(
+      `    ${index + 1}. ${label} - ${entry.wins}W/${entry.losses}L/${
+        entry.ties
+      }T (${(entry.winRate * 100).toFixed(1)}%)`
+    );
+  });
+  if (result.rankings.essays.length > 5) {
+    console.log(`    ... and ${result.rankings.essays.length - 5} more`);
+  }
+}
+
+/**
+ * Run all phases for a single topic (1v1 test).
+ */
+async function runOneVsOneTopicArena(
+  topic: string,
+  topicIndex: number,
+  totalTopics: number,
+  baseDir: string
+): Promise<OneVsOneTopicResults> {
+  console.log(
+    `\n${"═".repeat(60)}\n📚 Topic ${
+      topicIndex + 1
+    }/${totalTopics}: "${topic}"\n${"═".repeat(60)}`
+  );
+
+  const topicDir = await createTopicDirectories(baseDir, topic);
+
+  console.log("\n  📝 Phase 1: Essay Generation");
+  const essays = await runPhase1Essays(topic, topicDir);
+  console.log(`  ✓ Phase 1 complete: ${modelsToRun.length} essays`);
+
+  console.log("\n  📋 Phase 2: Feedback Generation");
+  const feedback = await runPhase2Feedback(topic, essays, topicDir);
+  const feedbackCount = modelsToRun.length * (modelsToRun.length - 1);
+  console.log(`  ✓ Phase 2 complete: ${feedbackCount} feedback pieces`);
+
+  console.log("\n  ✏️ Phase 3: Revisions");
+  const revisions = await runPhase3Revisions(topic, essays, feedback, topicDir);
+  console.log(`  ✓ Phase 3 complete: ${feedbackCount} revisions`);
+
+  console.log("\n  🥊 Phase 4: Head-to-Head Comparisons");
+  const comparisons = await runPhase4Comparisons(
     topic,
-    essayResult.text,
-    reviewResult.text
+    essays,
+    revisions,
+    topicDir
   );
-  console.log("✓ Revision completed\n");
+  console.log(`  ✓ Phase 4 complete: ${comparisons.length} comparisons`);
+
+  const rankings = calculateOneVsOneRankings(comparisons);
+
+  return {
+    topic,
+    essays,
+    feedback,
+    revisions,
+    comparisons,
+    rankings,
+  };
+}
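+
+// Topics run strictly one after another; only the calls inside each phase
+// are parallelized, so the effective concurrency cap stays PARALLEL_LIMIT
+// for the whole run.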
+
+/**
+ * Main 1v1 test orchestration.
+ */
+async function runOneVsOneTest(): Promise<void> {
+  usageTracker = createUsageTracker();
+
+  const confirmed = await confirmOneVsOneRun();
+  if (!confirmed) {
+    console.log("\nAborted.");
+    process.exit(0);
+  }
+
+  const overallStart = Date.now();
+
+  const { baseDir, timestamp } = await initArenaRun("1v1");
+  console.log(`\nResults will be saved to: ${baseDir}`);
+
+  const topicResults: OneVsOneTopicResults[] = [];
+  const topicTimes: Array<{ topic: string; duration: number }> = [];
+
+  for (let i = 0; i < TOPICS.length; i++) {
+    const topic = TOPICS[i]!;
+    const topicStart = Date.now();
+    const result = await runOneVsOneTopicArena(
+      topic,
+      i,
+      TOPICS.length,
+      baseDir
+    );
+    const topicDuration = Date.now() - topicStart;
+    topicResults.push(result);
+    topicTimes.push({ topic, duration: topicDuration });
+    printOneVsOneTopicResults(result);
+    console.log(`\n  ⏱️ Topic completed in ${formatDuration(topicDuration)}`);
+  }
+
+  console.log("\n\n📊 Calculating aggregate rankings...\n");
+
+  // Log all topic results before aggregate
+  console.log("═".repeat(60));
+  console.log("\n📋 INDIVIDUAL TOPIC RESULTS\n");
+  for (const result of topicResults) {
+    printOneVsOneTopicResults(result);
+    console.log("");
+  }
+
+  const aggregateRankings = calculateOneVsOneAggregateRankings(topicResults);
+
+  const results: OneVsOneResults = {
+    timestamp,
+    models: modelsToRun.map((m) => m.name),
+    topics: topicResults,
+    aggregateRankings,
+  };
 
-  // Step 5: Write outputs to files
-  await writePipelineOutputs({
-    essay: essayResult.text,
-    review: reviewResult.text,
-    revision: revisionResult.text,
+  await writeOneVsOneResultsJson(baseDir, results);
+  await writeOneVsOneSummary(baseDir, results);
+
+  console.log("═".repeat(60));
+  console.log("\n🏆 AGGREGATE RESULTS\n");
+
+  console.log("📝 Models (as Writers - Original Essays):\n");
+  aggregateRankings.essays.forEach((entry, index) => {
+    console.log(
+      `  ${index + 1}. ${entry.author} - ${entry.wins}W/${entry.losses}L/${
+        entry.ties
+      }T (${(entry.winRate * 100).toFixed(1)}% win rate)`
+    );
+  });
+
+  console.log("\n🎯 Reviewers (by revised essay performance):\n");
+  aggregateRankings.reviewers.forEach((entry, index) => {
+    console.log(
+      `  ${index + 1}. ${entry.reviewer} - ${entry.wins}W/${entry.losses}L/${
+        entry.ties
+      }T (${(entry.winRate * 100).toFixed(1)}% win rate)`
+    );
+  });
+
+  console.log("\n🤝 Pairings (Author + Reviewer):\n");
+  aggregateRankings.pairings.forEach((entry, index) => {
+    console.log(
+      `  ${index + 1}. ${entry.author} ← ${entry.reviewer} - ${entry.wins}W/${
+        entry.losses
+      }L/${entry.ties}T (${(entry.winRate * 100).toFixed(1)}% win rate)`
+    );
+  });
+
+  printUsageSummary("1v1");
+
+  const overallDuration = Date.now() - overallStart;
+  console.log("\n" + "═".repeat(60));
+  console.log("\n⏱️ RUNTIME SUMMARY\n");
+  topicTimes.forEach((t) => {
+    console.log(
+      `  ${t.topic.slice(0, 40).padEnd(42)} ${formatDuration(t.duration)}`
+    );
   });
+  console.log(`  ${"─".repeat(50)}`);
+  console.log(`  ${"Total".padEnd(42)} ${formatDuration(overallDuration)}`);
+
+  console.log(`\n✨ 1v1 test complete! Results saved to: ${baseDir}`);
+}
+
+// ============================================================================
+// SHARED UTILITIES
+// ============================================================================
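+
+// Assumed shape of TokenUsage (declared in types.ts): the helpers below only
+// read { totalTokens: number; cost: number } from each record.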
+
+/**
+ * Calculates average tokens and summed cost from an array of usage records.
+ */
+function calcAverage(usages: TokenUsage[]) {
+  if (usages.length === 0) return { tokens: 0, cost: 0 };
+  const totalTokens = usages.reduce((sum, u) => sum + u.totalTokens, 0);
+  const totalCost = usages.reduce((sum, u) => sum + u.cost, 0);
+  return {
+    tokens: Math.round(totalTokens / usages.length),
+    cost: totalCost,
+  };
+}
+
+/**
+ * Prints a summary of token usage and costs.
+ */
+function printUsageSummary(testType: TestType) {
+  console.log("\n" + "═".repeat(60));
+  console.log("\n💰 TOKEN USAGE & COST SUMMARY\n");
+
+  let totalEssayCost = 0;
+  let totalReviewCost = 0;
+  let totalRevisionCost = 0;
+  let totalScoreCost = 0;
+  let totalComparisonCost = 0;
+
+  const modelStats: Array<{
+    name: string;
+    essayAvgTokens: number;
+    essayCost: number;
+    reviewAvgTokens: number;
+    reviewCost: number;
+    revisionAvgTokens: number;
+    revisionCost: number;
+    scoreCost: number;
+    comparisonCost: number;
+    totalCost: number;
+  }> = [];
+
+  for (const model of modelsToRun) {
+    const essayStats = calcAverage(usageTracker.essays[model.name]!);
+    const reviewStats = calcAverage(usageTracker.reviews[model.name]!);
+    const revisionStats = calcAverage(usageTracker.revisions[model.name]!);
+    const scoreStats = calcAverage(usageTracker.scores[model.name]!);
+    const comparisonStats = calcAverage(usageTracker.comparisons[model.name]!);
+
+    totalEssayCost += essayStats.cost;
+    totalReviewCost += reviewStats.cost;
+    totalRevisionCost += revisionStats.cost;
+    totalScoreCost += scoreStats.cost;
+    totalComparisonCost += comparisonStats.cost;
+
+    modelStats.push({
+      name: model.name,
+      essayAvgTokens: essayStats.tokens,
+      essayCost: essayStats.cost,
+      reviewAvgTokens: reviewStats.tokens,
+      reviewCost: reviewStats.cost,
+      revisionAvgTokens: revisionStats.tokens,
+      revisionCost: revisionStats.cost,
+      scoreCost: scoreStats.cost,
+      comparisonCost: comparisonStats.cost,
+      totalCost:
+        essayStats.cost +
+        reviewStats.cost +
+        revisionStats.cost +
+        scoreStats.cost +
+        comparisonStats.cost,
+    });
+  }
+
+  const grandTotal =
+    totalEssayCost +
+    totalReviewCost +
+    totalRevisionCost +
+    totalScoreCost +
+    totalComparisonCost;
+
+  console.log("Phase Costs:");
+  console.log(`  Essays (First):     $${totalEssayCost.toFixed(4)}`);
+  console.log(`  Reviews:            $${totalReviewCost.toFixed(4)}`);
+  console.log(`  Revisions (Follow): $${totalRevisionCost.toFixed(4)}`);
+  if (testType === "scoring-test") {
+    console.log(`  Scoring:            $${totalScoreCost.toFixed(4)}`);
+  } else {
+    console.log(`  Comparisons:        $${totalComparisonCost.toFixed(4)}`);
+  }
+  console.log(`  ────────────────────────────`);
+  console.log(`  Total:              $${grandTotal.toFixed(4)}`);
+
+  console.log("\n\nPer-Model Token Averages & Costs:\n");
+  console.log(
+    "  Model".padEnd(32) +
+      "First Essay".padStart(14) +
+      "Reviews".padStart(14) +
+      "Follow-up".padStart(14) +
+      "Total Cost".padStart(12)
+  );
+  console.log("  " + "─".repeat(84));
+
+  for (const stat of modelStats.sort((a, b) => b.totalCost - a.totalCost)) {
+    const essayCol = `${stat.essayAvgTokens} tok`.padStart(14);
+    const reviewCol = `${stat.reviewAvgTokens} tok`.padStart(14);
+    const revisionCol = `${stat.revisionAvgTokens} tok`.padStart(14);
+    const costCol = `$${stat.totalCost.toFixed(4)}`.padStart(12);
+    console.log(
+      `  ${stat.name.padEnd(30)}${essayCol}${reviewCol}${revisionCol}${costCol}`
+    );
+  }
+
+  console.log("\n  " + "─".repeat(84));
+  console.log(`  ${"GRAND TOTAL".padEnd(72)}$${grandTotal.toFixed(4)}`);
+}
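+
+// Note: `modelStats.sort(...)` in the loop header above sorts the array in
+// place, so the per-model rows print in descending total cost.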
+
+// ============================================================================
+// MAIN ENTRY POINT
+// ============================================================================
+
+async function main() {
+  let testType = getTestTypeFromArgs();
+
+  if (!testType) {
+    testType = await selectTestType();
+  }
 
-  console.log("\n✨ Pipeline complete!");
+  if (testType === "scoring-test") {
+    await runScoringTest();
+  } else {
+    await runOneVsOneTest();
+  }
 }
 
-// Run the pipeline
-runEssayPipeline().catch((error) => {
-  console.error("Error running pipeline:", error);
+main().catch((error) => {
+  console.error("Error running arena:", error);
   process.exit(1);
 });
diff --git a/package.json b/package.json
index f21c41c..e0f5ced 100644
--- a/package.json
+++ b/package.json
@@ -5,7 +5,9 @@
   "private": true,
   "dependencies": {
     "ai": "^5.0.0",
-    "@openrouter/ai-sdk-provider": "^1.0.0"
+    "@openrouter/ai-sdk-provider": "^1.0.0",
+    "p-limit": "^6.1.0",
+    "zod": "^3.24.0"
   },
   "devDependencies": {
     "@types/bun": "latest"