diff --git a/.cursor/plan.md b/.cursor/plan.md
index f1b7fed..01b2060 100644
--- a/.cursor/plan.md
+++ b/.cursor/plan.md
@@ -1,55 +1,116 @@
-## AI Essay → Review → Revision Pipeline
-
-### Goal
-
-Implement a **Bun-friendly TypeScript CLI** (`bun run index.ts`) that:
-
-- Prompts the user for an essay topic.
-- Uses **model A (OpenRouter via Vercel AI SDK)** to generate an essay.
-- Uses **model B** to review that essay and produce feedback.
-- Calls **model A again** with the feedback to produce a revised essay.
-- Saves all three artifacts as **markdown files** on disk in a consistent location, with the `runs/` directory ignored by git.
-
-### High-level Design
-
-- **Runtime & entrypoint**: Keep using Bun with `index.ts` as the main CLI entrypoint.
-- **AI client setup**:
-- Add `ai` and the **OpenRouter provider for the Vercel AI SDK** as dependencies (no separate `openai` package needed since we are using OpenRouter directly).
-- Configure a small `aiClient.ts` module (or keep logic inline in `index.ts` if very small) that wires the AI SDK to OpenRouter using an `OPENROUTER_API_KEY` env var.
-- Hard-code two model IDs (e.g. one for essay generation, one for review) with clear `const` names so you can easily change them later.
-- **Pipeline orchestration**:
-- Implement a `runEssayPipeline()` function that:
-- Reads the prompt from stdin (simple interactive question).
-- Calls the **essay model** with a system prompt + user prompt to generate the initial essay.
-- Calls the **review model** with system instructions plus the essay content to generate feedback.
-- Calls the **essay model** again with the original prompt and the feedback to produce a revised essay.
-- Keep everything **strongly typed** with small TypeScript interfaces for the pipeline results.
-- **Markdown file output**:
-- Decide on a simple folder and naming scheme (e.g. `runs/<timestamp>-essay.md`, `runs/<timestamp>-review.md`, `runs/<timestamp>-revision.md`).
-- Use Bun / Node fs APIs in a small utility to write each step as a separate markdown file.
-- Include basic front-matter or headings (e.g. `# Original Essay`, `# Review Feedback`, `# Revised Essay`) for easy inspection in an editor.
-- Ensure `runs/` is added to `.gitignore` so generated artifacts don’t clutter git history.
-
-### Implementation Steps
-
-- **setup-deps**: Add `ai` and the OpenRouter provider for the AI SDK to `package.json` and document the required `OPENROUTER_API_KEY` env var in `README.md`.
-- **ai-client**: Create a small AI client configuration that:
-- Instantiates the AI SDK with the OpenRouter provider.
-- Exposes typed helpers like `generateEssay(prompt)`, `reviewEssay(essay)`, and `reviseEssay(prompt, essay, feedback)`.
-- **pipeline-logic**: Implement `runEssayPipeline()` in `index.ts` that:
-- Interactively asks for a prompt via stdin.
-- Runs the three AI steps in sequence (no streaming needed) with clear logging to the console.
-- Returns a typed result object containing the three text outputs.
-- **file-output**: Add a small utility function to:
-- Create a `runs/` directory if it doesn’t exist.
-- Write three markdown files with timestamped names and simple headings.
-- Confirm that `runs/` is listed in `.gitignore`.
-- **polish-types**: Ensure all public functions are type-safe (typed params and return types where helpful) and that the code compiles under the existing `tsconfig`.
-
-### Todos
-
-- **setup-deps**: Add and configure Vercel AI SDK (`ai`) and the OpenRouter provider, and document `OPENROUTER_API_KEY`.
-- **ai-client**: Implement the AI client helper(s) for essay generation, review, and revision using hard-coded OpenRouter model IDs.
-- **pipeline-logic**: Implement the CLI flow in `index.ts` to run the generation → review → revision pipeline.
-- **file-output**: Implement markdown file-writing utilities (create `runs/` directory, timestamped filenames, headings) and ensure `runs/` is in `.gitignore`.
-- **polish-types**: Run TypeScript checks and tighten any loose types if needed.
+# Writing Quality Arena
+
+## Models Configuration
+
+Use the provided `modelsToRun` array in `constants.ts`:
+
+```ts
+export type RunnableModel = {
+  name: string;
+  llm: LanguageModel;
+  reasoning: boolean;
+};
+
+export const modelsToRun: RunnableModel[] = [
+  {
+    name: "claude-4.5-opus-reasoning",
+    llm: openrouter("anthropic/claude-opus-4.5"),
+    reasoning: true,
+  },
+  // ... 11 models total
+];
+
+export const PARALLEL_LIMIT = 5; // Configurable concurrency
+```
+
+## Execution Flow (4 Phases)
+
+### Phase 1: Essay Generation
+
+Each model writes an essay on the topic. **N calls**.
+
+### Phase 2: All-to-All Review
+
+Every model reviews every OTHER model's essay (self-review is skipped). **N × (N-1) calls**.
+
+### Phase 3: Per-Reviewer Revisions
+
+Each model creates a separate revised essay for EACH piece of feedback received. **N × (N-1) revisions**.
+
+### Phase 4: Scoring
+
+Every model scores EVERY essay (N originals + N×(N-1) revisions). Use `generateObject` with Zod schema:
+
+```ts
+const ScoreSchema = z.object({
+  score: z.number().min(1).max(10),
+  justification: z.string(),
+});
+```
+
+**N × (N + N×(N-1)) = N × N² = N³ calls**.
+
+## API Call Summary (N=11 models)
+
+| Phase | Formula | Calls |
+|-------|---------|-------|
+| Essays | N | 11 |
+| Feedback | N×(N-1) | 110 |
+| Revisions | N×(N-1) | 110 |
+| Scores | N³ | 1331 |
+| **Total** | | **1562** |
+
+## Rankings
+
+**Essay Ranking**: All essays (original + revised) ranked by average score across all judges.
+
+**Reviewer Ranking**: For each reviewer, calculate avg improvement = mean(revision_score - original_score) over all revisions that used their feedback.
+
+## File Structure
+
+```
+results/{timestamp}/
+├── essays/{model-name}.md
+├── feedback/{reviewer}-on-{author}.md
+├── revisions/{author}-revised-by-{reviewer}.md
+├── results.json
+└── summary.md
+```
+
+## File Changes
+
+| File | Change |
+|------|--------|
+| `constants.ts` | Add `RunnableModel` type, `modelsToRun` array, `PARALLEL_LIMIT` |
+| `types.ts` | Already has appropriate types; verify alignment |
+| `aiClient.ts` | Update functions to accept `RunnableModel`, add `scoreEssay()` using `generateObject` |
+| `index.ts` | Rewrite with 4-phase arena orchestration, parallel execution via `p-limit`, `confirmRun()` |
+| `fileUtils.ts` | Rewrite for arena folder structure (`results/` dir, essays/, feedback/, revisions/, results.json, summary.md) |
+
+## CLI Confirmation
+
+Display call counts and prompt before running:
+
+```ts
+async function confirmRun(): Promise<boolean> {
+  const n = modelsToRun.length;
+  const essays = n;
+  const feedback = n * (n - 1);
+  const revisions = n * (n - 1);
+  const scores = n * n * n;
+  const total = essays + feedback + revisions + scores;
+  // ... display and prompt Y/n
+}
+```
diff --git a/.gitignore b/.gitignore
index 1c7683b..0cb6022 100644
--- a/.gitignore
+++ b/.gitignore
@@ -35,3 +35,4 @@ report.[0-9]_.[0-9]_.[0-9]_.[0-9]_.json
 
 # Generated essay runs
 runs/
+results/
diff --git a/aiClient.ts b/aiClient.ts
index 96ef8fc..bd4d0ec 100644
--- a/aiClient.ts
+++ b/aiClient.ts
@@ -1,63 +1,111 @@
-import { createOpenRouter } from "@openrouter/ai-sdk-provider";
 import { generateText } from "ai";
+import type { RunnableModel } from "./constants";
 
-// Model IDs - easily changeable constants
-const ESSAY_MODEL = "anthropic/claude-opus-4.5";
-const REVIEW_MODEL = "moonshotai/kimi-k2-thinking";
+/**
+ * Extracts cost from OpenRouter provider metadata.
+ */
+function extractCost(
+  providerMetadata: Record<string, unknown> | undefined
+): number {
+  if (!providerMetadata) return 0;
+  const openrouterMeta = providerMetadata.openrouter as any;
+
+  if (openrouterMeta?.usage?.cost) {
+    return openrouterMeta.usage.cost;
+  }
 
-// Initialize the OpenRouter provider
-if (!process.env.OPENROUTER_API_KEY) {
-  throw new Error(
-    "OPENROUTER_API_KEY environment variable is required. Please set it before running the script."
-  );
+  if (openrouterMeta?.usage?.costDetails?.upstreamInferenceCost) {
+    return openrouterMeta.usage.costDetails.upstreamInferenceCost;
+  }
+
+  return 0;
 }
 
-const openrouter = createOpenRouter({
-  apiKey: process.env.OPENROUTER_API_KEY,
-});
+export interface TokenUsage {
+  inputTokens: number;
+  outputTokens: number;
+  totalTokens: number;
+  /** Cost in USD from OpenRouter */
+  cost: number;
+}
 
 export interface EssayResult {
   text: string;
+  usage: TokenUsage;
 }
 
 export interface ReviewResult {
   text: string;
+  usage: TokenUsage;
 }
 
 export interface RevisionResult {
   text: string;
+  usage: TokenUsage;
+}
+
+export interface ScoreResult {
+  score: number;
+  justification: string;
+  usage: TokenUsage;
+}
+
+export interface CompareResult {
+  winner: "A" | "B" | "tie";
+  reasoning: string;
+  usage: TokenUsage;
 }
 
 /**
  * Generates an essay based on the given topic prompt.
  */
-export async function generateEssay(topic: string): Promise<EssayResult> {
+export async function generateEssay(
+  model: RunnableModel,
+  topic: string
+): Promise<EssayResult> {
   const result = await generateText({
-    model: openrouter(ESSAY_MODEL),
+    model: model.llm,
     system: `You are an expert essay writer. Write a well-structured, thoughtful essay on the given topic.
-The essay should be clear, engaging, and demonstrate strong writing skills.`,
+The essay should be clear, engaging, and demonstrate strong writing skills.
+Write approximately 800-1200 words.`,
     prompt: `Write an essay on the following topic:\n\n${topic}`,
   });
 
   return {
     text: result.text,
+    usage: {
+      inputTokens: result.usage?.inputTokens ?? 0,
+      outputTokens: result.usage?.outputTokens ?? 0,
+      totalTokens: result.usage?.totalTokens ?? 0,
+      cost: extractCost(result.providerMetadata),
+    },
   };
 }
 
 /**
  * Reviews an essay and provides constructive feedback.
  */
-export async function reviewEssay(essay: string): Promise<ReviewResult> {
+export async function reviewEssay(
+  model: RunnableModel,
+  essay: string,
+  topic: string
+): Promise<ReviewResult> {
   const result = await generateText({
-    model: openrouter(REVIEW_MODEL),
+    model: model.llm,
     system: `You are an expert writing tutor and editor. Review the essay provided and give constructive, specific feedback on areas such as structure, clarity, argumentation, style, and areas for improvement.
-Be thorough but encouraging.`,
-    prompt: `Please review the following essay and provide detailed feedback:\n\n${essay}`,
+Be thorough but encouraging. Focus on actionable improvements.`,
+    prompt: `Topic: ${topic}\n\nPlease review the following essay and provide detailed feedback:\n\n${essay}`,
   });
 
   return {
     text: result.text,
+    usage: {
+      inputTokens: result.usage?.inputTokens ?? 0,
+      outputTokens: result.usage?.outputTokens ?? 0,
+      totalTokens: result.usage?.totalTokens ?? 0,
+      cost: extractCost(result.providerMetadata),
+    },
   };
 }
 
@@ -65,18 +113,134 @@ Be thorough but encouraging.`,
  * Revises an essay based on the original topic, original essay, and review feedback.
  */
 export async function reviseEssay(
+  model: RunnableModel,
   topic: string,
   originalEssay: string,
   feedback: string
 ): Promise<RevisionResult> {
   const result = await generateText({
-    model: openrouter(ESSAY_MODEL),
+    model: model.llm,
     system: `You are an expert essay writer. Revise the provided essay based on the feedback given,
-while maintaining the core message and improving the areas identified.`,
+while maintaining the core message and improving the areas identified.
+Produce a complete revised essay, not just suggestions.`,
     prompt: `Original topic: ${topic}\n\nOriginal essay:\n${originalEssay}\n\nReview feedback:\n${feedback}\n\nPlease revise the essay based on the feedback above.`,
   });
 
   return {
     text: result.text,
+    usage: {
+      inputTokens: result.usage?.inputTokens ?? 0,
+      outputTokens: result.usage?.outputTokens ?? 0,
+      totalTokens: result.usage?.totalTokens ?? 0,
+      cost: extractCost(result.providerMetadata),
+    },
+  };
+}
+
+/**
+ * Scores an essay on a scale of 1-10 with justification.
+ */
+export async function scoreEssay(
+  model: RunnableModel,
+  essay: string,
+  topic: string
+): Promise<ScoreResult> {
+  const result = await generateText({
+    model: model.llm,
+    system: `You are an expert essay judge. Score the essay on a scale of 1-10 based on:
+- Clarity and coherence of argument
+- Quality of writing (style, grammar, flow)
+- Depth of insight and originality
+- Relevance to the topic
+- Overall effectiveness
+
+Be fair and consistent in your scoring. A score of 5 is average, 7-8 is good, 9-10 is exceptional.
+
+IMPORTANT: Start your response with EXACTLY "Score: X/10" on the first line (where X is your score), then provide your detailed justification below.`,
+    prompt: `Topic: ${topic}\n\nPlease score the following essay:\n\n${essay}`,
+  });
+
+  // Parse score from the text - look for "Score: X/10" or similar patterns
+  const scoreMatch = result.text.match(/Score:\s*(\d+(?:\.\d+)?)\s*\/\s*10/i);
+  const score = scoreMatch?.[1] ? parseFloat(scoreMatch[1]) : 5; // Default to 5 if parsing fails
+
+  // Everything after the score line is the justification
+  const justification = result.text
+    .replace(/^Score:\s*\d+(?:\.\d+)?\s*\/\s*10\s*/i, "")
+    .trim();
+
+  return {
+    score: Math.min(10, Math.max(1, score)), // Clamp between 1-10
+    justification,
+    usage: {
+      inputTokens: result.usage?.inputTokens ?? 0,
+      outputTokens: result.usage?.outputTokens ?? 0,
+      totalTokens: result.usage?.totalTokens ?? 0,
+      cost: extractCost(result.providerMetadata),
+    },
+  };
+}
+
+/**
+ * Compares two essays head-to-head and picks a winner.
+ */
+export async function compareEssays(
+  judge: RunnableModel,
+  essayA: { author: string; text: string },
+  essayB: { author: string; text: string },
+  topic: string
+): Promise<CompareResult> {
+  const result = await generateText({
+    model: judge.llm,
+    system: `You are an expert essay judge conducting a head-to-head comparison.
You will be shown two essays on the same topic, labeled Essay A and Essay B. + +Compare them based on: +- Clarity and coherence of argument +- Quality of writing (style, grammar, flow) +- Depth of insight and originality +- Relevance to the topic +- Overall effectiveness + +You MUST pick a winner. Only declare a tie if the essays are genuinely indistinguishable in quality. + +IMPORTANT: Start your response with EXACTLY one of these on the first line: +- "Winner: A" (if Essay A is better) +- "Winner: B" (if Essay B is better) +- "Winner: Tie" (only if truly equal) + +Then provide your detailed reasoning below, explaining why you chose that winner.`, + prompt: `Topic: ${topic} + +Essay A: +${essayA.text} + +Essay B: +${essayB.text} + +Compare these essays and pick a winner.`, + }); + + // Parse winner from the text + const winnerMatch = result.text.match(/Winner:\s*(A|B|Tie)/i); + let winner: "A" | "B" | "tie" = "tie"; + if (winnerMatch) { + const parsed = winnerMatch[1]!.toUpperCase(); + if (parsed === "A") winner = "A"; + else if (parsed === "B") winner = "B"; + else winner = "tie"; + } + + // Everything after the winner line is the reasoning + const reasoning = result.text.replace(/^Winner:\s*(A|B|Tie)\s*/i, "").trim(); + + return { + winner, + reasoning, + usage: { + inputTokens: result.usage?.inputTokens ?? 0, + outputTokens: result.usage?.outputTokens ?? 0, + totalTokens: result.usage?.totalTokens ?? 0, + cost: extractCost(result.providerMetadata), + }, }; } diff --git a/bun.lock b/bun.lock index adc4602..325db0a 100644 --- a/bun.lock +++ b/bun.lock @@ -4,6 +4,12 @@ "workspaces": { "": { "name": "auto-draftify", + "dependencies": { + "@openrouter/ai-sdk-provider": "^1.0.0", + "ai": "^5.0.0", + "p-limit": "^6.1.0", + "zod": "^3.24.0", + }, "devDependencies": { "@types/bun": "latest", }, @@ -13,14 +19,42 @@ }, }, "packages": { + "@ai-sdk/gateway": ["@ai-sdk/gateway@2.0.17", "", { "dependencies": { "@ai-sdk/provider": "2.0.0", "@ai-sdk/provider-utils": "3.0.18", "@vercel/oidc": "3.0.5" }, "peerDependencies": { "zod": "^3.25.76 || ^4.1.8" } }, "sha512-oVAG6q72KsjKlrYdLhWjRO7rcqAR8CjokAbYuyVZoCO4Uh2PH/VzZoxZav71w2ipwlXhHCNaInGYWNs889MMDA=="], + + "@ai-sdk/provider": ["@ai-sdk/provider@2.0.0", "", { "dependencies": { "json-schema": "^0.4.0" } }, "sha512-6o7Y2SeO9vFKB8lArHXehNuusnpddKPk7xqL7T2/b+OvXMRIXUO1rR4wcv1hAFUAT9avGZshty3Wlua/XA7TvA=="], + + "@ai-sdk/provider-utils": ["@ai-sdk/provider-utils@3.0.18", "", { "dependencies": { "@ai-sdk/provider": "2.0.0", "@standard-schema/spec": "^1.0.0", "eventsource-parser": "^3.0.6" }, "peerDependencies": { "zod": "^3.25.76 || ^4.1.8" } }, "sha512-ypv1xXMsgGcNKUP+hglKqtdDuMg68nWHucPPAhIENrbFAI+xCHiqPVN8Zllxyv1TNZwGWUghPxJXU+Mqps0YRQ=="], + + "@openrouter/ai-sdk-provider": ["@openrouter/ai-sdk-provider@1.2.8", "", { "dependencies": { "@openrouter/sdk": "^0.1.8" }, "peerDependencies": { "ai": "^5.0.0", "zod": "^3.24.1 || ^v4" } }, "sha512-pQT8AzZBKg9f4bkt4doF486ZlhK0XjKkevrLkiqYgfh1Jplovieu28nK4Y+xy3sF18/mxjqh9/2y6jh01qzLrA=="], + + "@openrouter/sdk": ["@openrouter/sdk@0.1.27", "", { "dependencies": { "zod": "^3.25.0 || ^4.0.0" } }, "sha512-RH//L10bSmc81q25zAZudiI4kNkLgxF2E+WU42vghp3N6TEvZ6F0jK7uT3tOxkEn91gzmMw9YVmDENy7SJsajQ=="], + + "@opentelemetry/api": ["@opentelemetry/api@1.9.0", "", {}, "sha512-3giAOQvZiH5F9bMlMiv8+GSPMeqg0dbaeo58/0SlA9sxSqZhnUtxzX9/2FzyhS9sWQf5S0GJE0AKBrFqjpeYcg=="], + + "@standard-schema/spec": ["@standard-schema/spec@1.0.0", "", {}, 
"sha512-m2bOd0f2RT9k8QJx1JN85cZYyH1RqFBdlwtkSlf4tBDYLCiiZnv1fIIwacK6cqwXavOydf0NPToMQgpKq+dVlA=="], + "@types/bun": ["@types/bun@1.3.3", "", { "dependencies": { "bun-types": "1.3.3" } }, "sha512-ogrKbJ2X5N0kWLLFKeytG0eHDleBYtngtlbu9cyBKFtNL3cnpDZkNdQj8flVf6WTZUX5ulI9AY1oa7ljhSrp+g=="], "@types/node": ["@types/node@24.10.1", "", { "dependencies": { "undici-types": "~7.16.0" } }, "sha512-GNWcUTRBgIRJD5zj+Tq0fKOJ5XZajIiBroOF0yvj2bSU1WvNdYS/dn9UxwsujGW4JX06dnHyjV2y9rRaybH0iQ=="], + "@vercel/oidc": ["@vercel/oidc@3.0.5", "", {}, "sha512-fnYhv671l+eTTp48gB4zEsTW/YtRgRPnkI2nT7x6qw5rkI1Lq2hTmQIpHPgyThI0znLK+vX2n9XxKdXZ7BUbbw=="], + + "ai": ["ai@5.0.104", "", { "dependencies": { "@ai-sdk/gateway": "2.0.17", "@ai-sdk/provider": "2.0.0", "@ai-sdk/provider-utils": "3.0.18", "@opentelemetry/api": "1.9.0" }, "peerDependencies": { "zod": "^3.25.76 || ^4.1.8" } }, "sha512-MZOkL9++nY5PfkpWKBR3Rv+Oygxpb9S16ctv8h91GvrSif7UnNEdPMVZe3bUyMd2djxf0AtBk/csBixP0WwWZQ=="], + "bun-types": ["bun-types@1.3.3", "", { "dependencies": { "@types/node": "*" } }, "sha512-z3Xwlg7j2l9JY27x5Qn3Wlyos8YAp0kKRlrePAOjgjMGS5IG6E7Jnlx736vH9UVI4wUICwwhC9anYL++XeOgTQ=="], + "eventsource-parser": ["eventsource-parser@3.0.6", "", {}, "sha512-Vo1ab+QXPzZ4tCa8SwIHJFaSzy4R6SHf7BY79rFBDf0idraZWAkYrDjDj8uWaSm3S2TK+hJ7/t1CEmZ7jXw+pg=="], + + "json-schema": ["json-schema@0.4.0", "", {}, "sha512-es94M3nTIfsEPisRafak+HDLfHXnKBhV3vU5eqPcS3flIWqcxJWgXHXiey3YrpaNsanY5ei1VoYEbOzijuq9BA=="], + + "p-limit": ["p-limit@6.2.0", "", { "dependencies": { "yocto-queue": "^1.1.1" } }, "sha512-kuUqqHNUqoIWp/c467RI4X6mmyuojY5jGutNU0wVTmEOOfcuwLqyMVoAi9MKi2Ak+5i9+nhmrK4ufZE8069kHA=="], + "typescript": ["typescript@5.9.3", "", { "bin": { "tsc": "bin/tsc", "tsserver": "bin/tsserver" } }, "sha512-jl1vZzPDinLr9eUt3J/t7V6FgNEw9QjvBPdysz9KfQDD41fQrC2Y4vKQdiaUpFT4bXlb1RHhLpp8wtm6M5TgSw=="], "undici-types": ["undici-types@7.16.0", "", {}, "sha512-Zz+aZWSj8LE6zoxD+xrjh4VfkIG8Ya6LvYkZqtUQGJPZjYl53ypCaUwWqo7eI0x66KBGeRo+mlBEkMSeSZ38Nw=="], + + "yocto-queue": ["yocto-queue@1.2.2", "", {}, "sha512-4LCcse/U2MHZ63HAJVE+v71o7yOdIe4cZ70Wpf8D/IyjDKYQLV5GD46B+hSTjJsvV5PztjvHoU580EftxjDZFQ=="], + + "zod": ["zod@3.25.76", "", {}, "sha512-gzUt/qt81nXsFGKIFcC3YnfEAx5NkunCfnDlvuBSSFS02bcXu4Lmea0AFIUwbLWxWPx3d9p8S5QoaujKcNQxcQ=="], } } diff --git a/constants.ts b/constants.ts new file mode 100644 index 0000000..e5768d0 --- /dev/null +++ b/constants.ts @@ -0,0 +1,131 @@ +import { createOpenRouter } from "@openrouter/ai-sdk-provider"; +import type { LanguageModel } from "ai"; + +// Initialize the OpenRouter provider +if (!process.env.OPENROUTER_API_KEY) { + throw new Error( + "OPENROUTER_API_KEY environment variable is required. Please set it before running the script." 
+  );
+}
+
+const openrouter = createOpenRouter({
+  apiKey: process.env.OPENROUTER_API_KEY,
+});
+
+// Parallelism configuration
+export const PARALLEL_LIMIT = 30;
+
+// Essay topics
+export const TOPICS = [
+  // "The role of failure in personal growth",
+  // "Why boredom is underrated",
+  "The ethics of artificial intelligence",
+  "How social media reshapes human connection",
+  // "The value of slow living in a fast world",
+  // "Why we should embrace uncertainty",
+  // "The hidden costs of convenience",
+  // "What makes a good explanation",
+  // "The relationship between creativity and constraint",
+  // "Why some ideas spread and others don't",
+  "the negative impacts on society from artificial intelligence",
+] as const;
+
+// Model definition
+export interface RunnableModel {
+  name: string;
+  llm: LanguageModel;
+  reasoning: boolean;
+}
+
+// Include "usage" so we can log cost
+const defaultProviderOptions = {
+  usage: {
+    include: true,
+  },
+};
+
+export const modelsToRun: RunnableModel[] = [
+  // Anthropic
+  {
+    name: "claude-4.5-opus-reasoning",
+    llm: openrouter("anthropic/claude-opus-4.5", defaultProviderOptions),
+    reasoning: true,
+  },
+  // {
+  //   name: "claude-4.5-opus-non-reasoning",
+  //   llm: openrouter("anthropic/claude-opus-4.5", defaultProviderOptions),
+  //   reasoning: false,
+  // },
+
+  // OpenAI
+  // {
+  //   name: "gpt-4o",
+  //   llm: openrouter("openai/gpt-4o", defaultProviderOptions),
+  //   reasoning: false,
+  // },
+  {
+    name: "gpt-5.1",
+    llm: openrouter("openai/gpt-5.1", defaultProviderOptions),
+    reasoning: true,
+  },
+  // {
+  //   name: "gpt-5.1-chat",
+  //   llm: openrouter("openai/gpt-5.1-chat", defaultProviderOptions),
+  //   reasoning: false,
+  // },
+  // {
+  //   name: "gpt-5-mini",
+  //   llm: openrouter("openai/gpt-5-mini", defaultProviderOptions),
+  //   reasoning: true,
+  // },
+
+  // Google
+  {
+    name: "gemini-3-pro-preview",
+    llm: openrouter("google/gemini-3-pro-preview", defaultProviderOptions),
+    reasoning: true,
+  },
+  // {
+  //   name: "gemini-2.5-pro",
+  //   llm: openrouter("google/gemini-2.5-pro", defaultProviderOptions),
+  //   reasoning: true,
+  // },
+
+  // Grok
+  // {
+  //   name: "grok-4.1-fast",
+  //   llm: openrouter("x-ai/grok-4.1-fast", defaultProviderOptions),
+  //   reasoning: true,
+  // },
+
+  // Open Weight
+  // {
+  //   name: "kimi-k2",
+  //   llm: openrouter("moonshotai/kimi-k2", defaultProviderOptions),
+  //   reasoning: false,
+  // },
+  {
+    name: "kimi-k2-thinking",
+    llm: openrouter("moonshotai/kimi-k2-thinking", defaultProviderOptions),
+    reasoning: true,
+  },
+];
+
+// Cheap models for dry-run testing
+export const dryRunModels: RunnableModel[] = [
+  {
+    name: "claude-4.5-haiku",
+    llm: openrouter("anthropic/claude-haiku-4.5", defaultProviderOptions),
+    reasoning: false,
+  },
+  {
+    name: "gemini-2.5-flash",
+    llm: openrouter("google/gemini-2.5-flash", defaultProviderOptions),
+    reasoning: true,
+  },
+  {
+    name: "gpt-5-mini",
+    llm: openrouter("openai/gpt-5-mini", defaultProviderOptions),
+    reasoning: true,
+  },
+];
diff --git a/fileUtils.ts b/fileUtils.ts
index b11c2f9..ed96fed 100644
--- a/fileUtils.ts
+++ b/fileUtils.ts
@@ -1,26 +1,111 @@
 import { mkdir, writeFile } from "fs/promises";
 import { join } from "path";
 
-const RUNS_DIR = "runs";
+const RESULTS_DIR = "results";
 
-export interface PipelineOutput {
-  essay: string;
-  review: string;
-  revision: string;
+export type TestType = "scoring-test" | "1v1";
+
+export interface TopicResults {
+  topic: string;
+  essays: Record<string, string>;
+  feedback: Record<string, Record<string, string>>;
+  revisions: Record<string, Record<string, string>>;
+  scores: {
+    original: Record<
+      string,
+      Record<string, { score: number; justification: string }>
+    >;
+    revised: Record<
+      string,
+      Record<string, Record<string, { score: number; justification: string }>>
+    >;
+  };
+  rankings: {
+    essays: Array<{
+      type: "original" | "revised";
+      author: string;
+      reviewer?: string;
+      avgScore: number;
+    }>;
+    reviewers: Array<{
+      reviewer: string;
+      avgImprovement: number;
+    }>;
+  };
 }
 
-/**
- * Ensures the runs directory exists, creating it if necessary.
- */
-async function ensureRunsDirectory(): Promise<void> {
-  try {
-    await mkdir(RUNS_DIR, { recursive: true });
-  } catch (error) {
-    // Directory might already exist, which is fine
-    if ((error as NodeJS.ErrnoException).code !== "EEXIST") {
-      throw error;
-    }
-  }
+export interface ArenaResults {
+  timestamp: string;
+  models: string[];
+  topics: TopicResults[];
+  aggregateRankings: {
+    essays: Array<{
+      author: string;
+      avgScore: number;
+      avgImprovement: number;
+    }>;
+    reviewers: Array<{
+      reviewer: string;
+      avgImprovement: number;
+    }>;
+  };
+}
+
+// 1v1 specific types
+export interface ComparisonResult {
+  judge: string;
+  essayA: { author: string; reviewer?: string };
+  essayB: { author: string; reviewer?: string };
+  winner: "A" | "B" | "tie";
+  reasoning: string;
+}
+
+export interface OneVsOneTopicResults {
+  topic: string;
+  essays: Record<string, string>;
+  feedback: Record<string, Record<string, string>>;
+  revisions: Record<string, Record<string, string>>;
+  comparisons: ComparisonResult[];
+  rankings: {
+    essays: Array<{
+      author: string;
+      reviewer?: string;
+      wins: number;
+      losses: number;
+      ties: number;
+      winRate: number;
+    }>;
+  };
+}
+
+export interface OneVsOneResults {
+  timestamp: string;
+  models: string[];
+  topics: OneVsOneTopicResults[];
+  aggregateRankings: {
+    essays: Array<{
+      author: string;
+      wins: number;
+      losses: number;
+      ties: number;
+      winRate: number;
+    }>;
+    reviewers: Array<{
+      reviewer: string;
+      wins: number;
+      losses: number;
+      ties: number;
+      winRate: number;
+    }>;
+    pairings: Array<{
+      author: string;
+      reviewer: string;
+      wins: number;
+      losses: number;
+      ties: number;
+      winRate: number;
+    }>;
+  };
 }
 
 /**
@@ -32,35 +117,299 @@ function getTimestamp(): string {
 }
 
 /**
- * Writes the pipeline outputs to markdown files in the runs directory.
+ * Sanitizes a name for use in filenames.
  */
-export async function writePipelineOutputs(
-  outputs: PipelineOutput
-): Promise<void> {
-  await ensureRunsDirectory();
-  const timestamp = getTimestamp();
+function sanitizeName(name: string): string {
+  return name.replace(/[^a-zA-Z0-9-_]/g, "-").toLowerCase();
+}
 
-  const files = [
-    {
-      path: join(RUNS_DIR, `${timestamp}-essay.md`),
-      content: `# Original Essay\n\n${outputs.essay}`,
-    },
-    {
-      path: join(RUNS_DIR, `${timestamp}-review.md`),
-      content: `# Review Feedback\n\n${outputs.review}`,
-    },
-    {
-      path: join(RUNS_DIR, `${timestamp}-revision.md`),
-      content: `# Revised Essay\n\n${outputs.revision}`,
-    },
+/**
+ * Creates the arena results directory structure for a topic.
+ */
+export async function createTopicDirectories(baseDir: string, topic: string) {
+  const topicSlug = sanitizeName(topic).slice(0, 50);
+  const topicDir = join(baseDir, topicSlug);
+  const dirs = [
+    topicDir,
+    join(topicDir, "essays"),
+    join(topicDir, "feedback"),
+    join(topicDir, "revisions"),
   ];
 
-  for (const file of files) {
-    await writeFile(file.path, file.content, "utf-8");
+  for (const dir of dirs) {
+    await mkdir(dir, { recursive: true });
+  }
+
+  return topicDir;
+}
+
+/**
+ * Writes an essay to the essays directory.
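+ *
+ * A usage sketch (hypothetical values; `topicDir` comes from
+ * `createTopicDirectories` above, and the topic is one of the entries
+ * in `TOPICS`):
+ *
+ * ```ts
+ * const dir = await createTopicDirectories(baseDir, "The ethics of artificial intelligence");
+ * await writeEssay(dir, "gpt-5.1", essayText);
+ * // => {baseDir}/the-ethics-of-artificial-intelligence/essays/gpt-5-1.md
+ * ```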
+ */ +export async function writeEssay( + topicDir: string, + modelName: string, + essay: string +) { + const filename = `${sanitizeName(modelName)}.md`; + const path = join(topicDir, "essays", filename); + await writeFile(path, `# Essay by ${modelName}\n\n${essay}`, "utf-8"); + return path; +} + +/** + * Writes feedback to the feedback directory. + */ +export async function writeFeedback( + topicDir: string, + reviewer: string, + author: string, + feedback: string +) { + const filename = `${sanitizeName(reviewer)}-on-${sanitizeName(author)}.md`; + const path = join(topicDir, "feedback", filename); + await writeFile( + path, + `# Feedback by ${reviewer} on ${author}'s Essay\n\n${feedback}`, + "utf-8" + ); + return path; +} + +/** + * Writes a revision to the revisions directory. + */ +export async function writeRevision( + topicDir: string, + author: string, + reviewer: string, + revision: string +) { + const filename = `${sanitizeName(author)}-revised-by-${sanitizeName( + reviewer + )}.md`; + const path = join(topicDir, "revisions", filename); + await writeFile( + path, + `# ${author}'s Essay Revised Based on ${reviewer}'s Feedback\n\n${revision}`, + "utf-8" + ); + return path; +} + +/** + * Writes the complete results JSON file. + */ +export async function writeResultsJson(baseDir: string, results: ArenaResults) { + const path = join(baseDir, "results.json"); + await writeFile(path, JSON.stringify(results, null, 2), "utf-8"); + return path; +} + +/** + * Generates and writes the summary markdown file. + */ +export async function writeSummary(baseDir: string, results: ArenaResults) { + const path = join(baseDir, "summary.md"); + + let content = `# Writing Quality Arena Results\n\n`; + content += `**Date:** ${results.timestamp}\n\n`; + content += `**Models:** ${results.models.length}\n\n`; + content += `**Topics:** ${results.topics.length}\n\n`; + + // Aggregate Model Rankings (as writers) + content += `## Aggregate Model Rankings (as Writers)\n\n`; + content += `| Rank | Model | Avg Score | Avg Improvement |\n`; + content += `|------|-------|-----------|----------------|\n`; + + results.aggregateRankings.essays.forEach((entry, index) => { + const sign = entry.avgImprovement >= 0 ? "+" : ""; + content += `| ${index + 1} | ${entry.author} | ${entry.avgScore.toFixed( + 2 + )} | ${sign}${entry.avgImprovement.toFixed(2)} |\n`; + }); + + // Aggregate Reviewer Rankings + content += `\n## Aggregate Reviewer Rankings (by Improvement Impact)\n\n`; + content += `| Rank | Reviewer | Avg Improvement |\n`; + content += `|------|----------|----------------|\n`; + + results.aggregateRankings.reviewers.forEach((entry, index) => { + const sign = entry.avgImprovement >= 0 ? "+" : ""; + content += `| ${index + 1} | ${ + entry.reviewer + } | ${sign}${entry.avgImprovement.toFixed(2)} |\n`; + }); + + // Per-topic summaries + content += `\n## Per-Topic Results\n\n`; + + for (const topic of results.topics) { + content += `### ${topic.topic}\n\n`; + + // Top 3 essays for this topic + content += `**Top 3 Essays:**\n`; + topic.rankings.essays.slice(0, 3).forEach((entry, index) => { + const reviewer = entry.reviewer ? ` (← ${entry.reviewer})` : ""; + content += `${index + 1}. ${entry.author}${reviewer} [${ + entry.type + }] - ${entry.avgScore.toFixed(2)}\n`; + }); + + // Top 3 reviewers for this topic + content += `\n**Top 3 Reviewers:**\n`; + topic.rankings.reviewers.slice(0, 3).forEach((entry, index) => { + const sign = entry.avgImprovement >= 0 ? "+" : ""; + content += `${index + 1}. 
${ + entry.reviewer + } - ${sign}${entry.avgImprovement.toFixed(2)}\n`; + }); + + content += `\n`; } - console.log(`\n✓ Files written:`); - files.forEach((file) => { - console.log(` - ${file.path}`); + await writeFile(path, content, "utf-8"); + return path; +} + +/** + * Creates a new arena run and returns the base directory and timestamp. + */ +export async function initArenaRun(testType: TestType) { + const timestamp = getTimestamp(); + const baseDir = join(RESULTS_DIR, testType, timestamp); + await mkdir(baseDir, { recursive: true }); + return { baseDir, timestamp }; +} + +/** + * Writes a comparison result to the comparisons directory. + */ +export async function writeComparison( + topicDir: string, + judge: string, + essayA: { author: string; reviewer?: string }, + essayB: { author: string; reviewer?: string }, + winner: "A" | "B" | "tie", + reasoning: string +) { + const comparisonsDir = join(topicDir, "comparisons"); + await mkdir(comparisonsDir, { recursive: true }); + + const essayALabel = essayA.reviewer + ? `${sanitizeName(essayA.author)}-revised-by-${sanitizeName( + essayA.reviewer + )}` + : sanitizeName(essayA.author); + const essayBLabel = essayB.reviewer + ? `${sanitizeName(essayB.author)}-revised-by-${sanitizeName( + essayB.reviewer + )}` + : sanitizeName(essayB.author); + + const filename = `${sanitizeName(judge)}-${essayALabel}-vs-${essayBLabel}.md`; + const path = join(comparisonsDir, filename); + + const essayADisplay = essayA.reviewer + ? `${essayA.author} (revised by ${essayA.reviewer})` + : essayA.author; + const essayBDisplay = essayB.reviewer + ? `${essayB.author} (revised by ${essayB.reviewer})` + : essayB.author; + + const winnerDisplay = + winner === "A" ? essayADisplay : winner === "B" ? essayBDisplay : "Tie"; + + await writeFile( + path, + `# Comparison by ${judge}\n\n**Essay A:** ${essayADisplay}\n**Essay B:** ${essayBDisplay}\n\n**Winner:** ${winnerDisplay}\n\n## Reasoning\n\n${reasoning}`, + "utf-8" + ); + return path; +} + +/** + * Writes the 1v1 results JSON file. + */ +export async function writeOneVsOneResultsJson( + baseDir: string, + results: OneVsOneResults +) { + const path = join(baseDir, "results.json"); + await writeFile(path, JSON.stringify(results, null, 2), "utf-8"); + return path; +} + +/** + * Generates and writes the 1v1 summary markdown file. 
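+ *
+ * Win rates shown in these tables are computed upstream as
+ * `wins / (wins + losses + ties)`, so a tie counts against the rate
+ * rather than being excluded; e.g. 6 wins, 2 losses, 2 ties gives
+ * 6/10 = 60.0%.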
+ */ +export async function writeOneVsOneSummary( + baseDir: string, + results: OneVsOneResults +) { + const path = join(baseDir, "summary.md"); + + let content = `# 1v1 Arena Results\n\n`; + content += `**Date:** ${results.timestamp}\n\n`; + content += `**Models:** ${results.models.length}\n\n`; + content += `**Topics:** ${results.topics.length}\n\n`; + + // Aggregate Model Rankings (as Writers) + content += `## Aggregate Model Rankings (as Writers)\n\n`; + content += `| Rank | Model | Wins | Losses | Ties | Win Rate |\n`; + content += `|------|-------|------|--------|------|----------|\n`; + + results.aggregateRankings.essays.forEach((entry, index) => { + content += `| ${index + 1} | ${entry.author} | ${entry.wins} | ${ + entry.losses + } | ${entry.ties} | ${(entry.winRate * 100).toFixed(1)}% |\n`; + }); + + // Aggregate Reviewer Rankings + content += `\n## Aggregate Reviewer Rankings\n\n`; + content += `| Rank | Reviewer | Wins | Losses | Ties | Win Rate |\n`; + content += `|------|----------|------|--------|------|----------|\n`; + + results.aggregateRankings.reviewers.forEach((entry, index) => { + content += `| ${index + 1} | ${entry.reviewer} | ${entry.wins} | ${ + entry.losses + } | ${entry.ties} | ${(entry.winRate * 100).toFixed(1)}% |\n`; }); + + // Aggregate Pairing Rankings + content += `\n## Aggregate Pairing Rankings (Author + Reviewer)\n\n`; + content += `| Rank | Author | Reviewer | Wins | Losses | Ties | Win Rate |\n`; + content += `|------|--------|----------|------|--------|------|----------|\n`; + + results.aggregateRankings.pairings.forEach((entry, index) => { + content += `| ${index + 1} | ${entry.author} | ${entry.reviewer} | ${ + entry.wins + } | ${entry.losses} | ${entry.ties} | ${(entry.winRate * 100).toFixed( + 1 + )}% |\n`; + }); + + // Per-topic summaries + content += `\n## Per-Topic Results\n\n`; + + for (const topic of results.topics) { + content += `### ${topic.topic}\n\n`; + + content += `| Rank | Essay | Wins | Losses | Ties | Win Rate |\n`; + content += `|------|-------|------|--------|------|----------|\n`; + + topic.rankings.essays.forEach((entry, index) => { + const label = entry.reviewer + ? `${entry.author} (← ${entry.reviewer})` + : entry.author; + content += `| ${index + 1} | ${label} | ${entry.wins} | ${ + entry.losses + } | ${entry.ties} | ${(entry.winRate * 100).toFixed(1)}% |\n`; + }); + + content += `\n`; + } + + await writeFile(path, content, "utf-8"); + return path; } diff --git a/index.ts b/index.ts index c68e297..9a593af 100644 --- a/index.ts +++ b/index.ts @@ -1,66 +1,1447 @@ -import { generateEssay, reviewEssay, reviseEssay } from "./aiClient"; -import { writePipelineOutputs } from "./fileUtils"; +import pLimit from "p-limit"; +import { + generateEssay, + reviewEssay, + reviseEssay, + scoreEssay, + compareEssays, + type TokenUsage, +} from "./aiClient"; +import { + modelsToRun as allModels, + dryRunModels, + PARALLEL_LIMIT, + TOPICS, + type RunnableModel, +} from "./constants"; +import { + createTopicDirectories, + initArenaRun, + writeEssay, + writeFeedback, + writeResultsJson, + writeRevision, + writeSummary, + writeComparison, + writeOneVsOneResultsJson, + writeOneVsOneSummary, + type ArenaResults, + type TopicResults, + type TestType, + type OneVsOneResults, + type OneVsOneTopicResults, + type ComparisonResult, +} from "./fileUtils"; + +// Parse CLI flags +const isDryRun = process.argv.includes("--dry-run"); +const modelsToRun = isDryRun ? 
dryRunModels : allModels;
+
+// Parse --test argument
+function getTestTypeFromArgs(): TestType | null {
+  const testArg = process.argv.find((arg) => arg.startsWith("--test="));
+  if (!testArg) return null;
+  const value = testArg.split("=")[1];
+  if (value === "scoring-test" || value === "1v1") {
+    return value;
+  }
+  console.error(`Invalid test type: ${value}. Use "scoring-test" or "1v1".`);
+  process.exit(1);
+}
+
+const limit = pLimit(PARALLEL_LIMIT);
+
+/**
+ * Tracks token usage and costs per model per phase.
+ */
+interface UsageTracker {
+  essays: Record<string, TokenUsage[]>;
+  reviews: Record<string, TokenUsage[]>;
+  revisions: Record<string, TokenUsage[]>;
+  scores: Record<string, TokenUsage[]>;
+  comparisons: Record<string, TokenUsage[]>;
+}
+
+function createUsageTracker(): UsageTracker {
+  const tracker: UsageTracker = {
+    essays: {},
+    reviews: {},
+    revisions: {},
+    scores: {},
+    comparisons: {},
+  };
+  for (const model of modelsToRun) {
+    tracker.essays[model.name] = [];
+    tracker.reviews[model.name] = [];
+    tracker.revisions[model.name] = [];
+    tracker.scores[model.name] = [];
+    tracker.comparisons[model.name] = [];
+  }
+  return tracker;
+}
+
+let usageTracker = createUsageTracker();
 
 /**
- * Prompts the user for an essay topic via stdin.
+ * Interactive test type selection UI.
  */
-async function promptForTopic(): Promise<string> {
-  const prompt = "Enter your essay topic: ";
-  process.stdout.write(prompt);
+async function selectTestType(): Promise<TestType> {
+  console.log("\n🏟️ Writing Quality Arena\n");
+  console.log("Select test type:\n");
+  console.log("  1. scoring-test - Models score essays on a 1-10 scale");
+  console.log("  2. 1v1 - Head-to-head essay comparisons\n");
+
+  process.stdout.write("Enter choice (1 or 2): ");
 
   return new Promise((resolve) => {
     process.stdin.once("data", (data) => {
-      const topic = data.toString().trim();
-      if (!topic) {
-        console.error("Topic cannot be empty. Please try again.");
-        process.exit(1);
+      const input = data.toString().trim();
+      if (input === "1" || input === "scoring-test") {
+        resolve("scoring-test");
+      } else if (input === "2" || input === "1v1") {
+        resolve("1v1");
+      } else {
+        console.log("Invalid choice, defaulting to scoring-test");
+        resolve("scoring-test");
       }
-      resolve(topic);
+    });
+  });
+}
+
+// ============================================================================
+// SHARED PHASES (used by both test types)
+// ============================================================================
+
+/**
+ * Phase 1: Each model generates an essay on the topic.
+ */
+async function runPhase1Essays(
+  topic: string,
+  topicDir: string
+): Promise<Record<string, string>> {
+  const essays: Record<string, string> = {};
+
+  const tasks = modelsToRun.map((model) =>
+    limit(async () => {
+      console.log(`  Generating essay: ${model.name}...`);
+      const result = await generateEssay(model, topic);
+      essays[model.name] = result.text;
+      usageTracker.essays[model.name]!.push(result.usage);
+      await writeEssay(topicDir, model.name, result.text);
+      console.log(
+        `  ✓ ${model.name} (${
+          result.usage.totalTokens
+        } tokens, $${result.usage.cost.toFixed(4)})`
+      );
+      return result;
+    })
+  );
+
+  await Promise.all(tasks);
+  return essays;
+}
+
+/**
+ * Phase 2: Every model reviews every OTHER model's essay.
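+ *
+ * The returned map is keyed reviewer-first; a shape sketch using two of
+ * the configured model names:
+ *
+ * ```ts
+ * // gpt-5.1's review of kimi-k2-thinking's essay; the self-review
+ * // slot feedback["gpt-5.1"]["gpt-5.1"] is never populated.
+ * const note = feedback["gpt-5.1"]["kimi-k2-thinking"];
+ * ```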
+ */
+async function runPhase2Feedback(
+  topic: string,
+  essays: Record<string, string>,
+  topicDir: string
+): Promise<Record<string, Record<string, string>>> {
+  const feedback: Record<string, Record<string, string>> = {};
+
+  // Initialize nested objects
+  for (const reviewer of modelsToRun) {
+    feedback[reviewer.name] = {};
+  }
+
+  const tasks: Array<Promise<void>> = [];
+
+  for (const reviewer of modelsToRun) {
+    for (const author of modelsToRun) {
+      if (reviewer.name === author.name) continue;
+
+      tasks.push(
+        limit(async () => {
+          console.log(`  ${reviewer.name} reviewing ${author.name}...`);
+          const essayText = essays[author.name]!;
+          const result = await reviewEssay(reviewer, essayText, topic);
+          feedback[reviewer.name]![author.name] = result.text;
+          usageTracker.reviews[reviewer.name]!.push(result.usage);
+          await writeFeedback(
+            topicDir,
+            reviewer.name,
+            author.name,
+            result.text
+          );
+          console.log(
+            `  ✓ ${reviewer.name} → ${author.name} (${
+              result.usage.totalTokens
+            } tokens, $${result.usage.cost.toFixed(4)})`
+          );
+        })
+      );
+    }
+  }
+
+  await Promise.all(tasks);
+  return feedback;
+}
+
+/**
+ * Phase 3: Each author revises their essay for EACH piece of feedback received.
+ */
+async function runPhase3Revisions(
+  topic: string,
+  essays: Record<string, string>,
+  feedback: Record<string, Record<string, string>>,
+  topicDir: string
+): Promise<Record<string, Record<string, string>>> {
+  const revisions: Record<string, Record<string, string>> = {};
+
+  // Initialize nested objects
+  for (const author of modelsToRun) {
+    revisions[author.name] = {};
+  }
+
+  const tasks: Array<Promise<void>> = [];
+
+  for (const author of modelsToRun) {
+    for (const reviewer of modelsToRun) {
+      if (author.name === reviewer.name) continue;
+
+      tasks.push(
+        limit(async () => {
+          const reviewerFeedback = feedback[reviewer.name]![author.name]!;
+          const essayText = essays[author.name]!;
+          console.log(
+            `  ${author.name} revising based on ${reviewer.name}...`
+          );
+          const result = await reviseEssay(
+            author,
+            topic,
+            essayText,
+            reviewerFeedback
+          );
+          revisions[author.name]![reviewer.name] = result.text;
+          usageTracker.revisions[author.name]!.push(result.usage);
+          await writeRevision(
+            topicDir,
+            author.name,
+            reviewer.name,
+            result.text
+          );
+          console.log(
+            `  ✓ ${author.name} ← ${reviewer.name} (${
+              result.usage.totalTokens
+            } tokens, $${result.usage.cost.toFixed(4)})`
+          );
+        })
+      );
+    }
+  }
+
+  await Promise.all(tasks);
+  return revisions;
+}
+
+// ============================================================================
+// SCORING TEST SPECIFIC
+// ============================================================================
+
+/**
+ * Counts API calls for scoring test.
+ */
+function countScoringApiCalls() {
+  let essays = 0;
+  let feedback = 0;
+  let revisions = 0;
+  let scores = 0;
+
+  for (const _topic of TOPICS) {
+    for (const _model of modelsToRun) {
+      essays++;
+    }
+
+    for (const reviewer of modelsToRun) {
+      for (const author of modelsToRun) {
+        if (reviewer.name === author.name) continue;
+        feedback++;
+      }
+    }
+
+    for (const author of modelsToRun) {
+      for (const reviewer of modelsToRun) {
+        if (author.name === reviewer.name) continue;
+        revisions++;
+      }
+    }
+
+    for (const _judge of modelsToRun) {
+      for (const _author of modelsToRun) {
+        scores++;
+      }
+    }
+    for (const _judge of modelsToRun) {
+      for (const author of modelsToRun) {
+        for (const reviewer of modelsToRun) {
+          if (author.name === reviewer.name) continue;
+          scores++;
+        }
+      }
+    }
+  }
+
+  return {
+    essays,
+    feedback,
+    revisions,
+    scores,
+    total: essays + feedback + revisions + scores,
+  };
+}
+
+/**
+ * Prompts for scoring test confirmation.
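+ *
+ * The displayed counts reduce to closed forms per topic: N essays,
+ * N*(N-1) feedback calls, N*(N-1) revisions, and N*(N + N*(N-1)) = N^3
+ * scores. For the dry-run set (N = 3) that is 3 + 6 + 6 + 27 = 42 calls
+ * per topic.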
+ */
+async function confirmScoringRun(): Promise<boolean> {
+  const { essays, feedback, revisions, scores, total } = countScoringApiCalls();
+
+  console.log("\n🏟️ Writing Quality Arena - Scoring Test\n");
+  if (isDryRun) {
+    console.log("⚡ DRY RUN MODE (using cheap models)\n");
+  }
+  console.log(`Models: ${modelsToRun.length}`);
+  console.log(`Topics: ${TOPICS.length}`);
+  console.log(`\nAPI Call Breakdown (across all ${TOPICS.length} topics):`);
+  console.log(`  Phase 1 - Essays:    ${essays.toString().padStart(6)} calls`);
+  console.log(
+    `  Phase 2 - Feedback:  ${feedback.toString().padStart(6)} calls`
+  );
+  console.log(
+    `  Phase 3 - Revisions: ${revisions.toString().padStart(6)} calls`
+  );
+  console.log(`  Phase 4 - Scores:    ${scores.toString().padStart(6)} calls`);
+  console.log(`  ────────────────────────────`);
+  console.log(`  Total:               ${total.toString().padStart(6)} calls\n`);
+  console.log(`Parallelism: ${PARALLEL_LIMIT} concurrent requests\n`);
+
+  process.stdout.write("Proceed? (Y/n): ");
+
+  return new Promise((resolve) => {
+    process.stdin.once("data", (data) => {
+      const input = data.toString().trim().toLowerCase();
+      resolve(input === "" || input === "y" || input === "yes");
+    });
+  });
+}
+
+/**
+ * Phase 4 (Scoring): Every model scores every essay.
+ */
+async function runPhase4Scoring(
+  topic: string,
+  essays: Record<string, string>,
+  revisions: Record<string, Record<string, string>>
+): Promise<{
+  original: Record<
+    string,
+    Record<string, { score: number; justification: string }>
+  >;
+  revised: Record<
+    string,
+    Record<string, Record<string, { score: number; justification: string }>>
+  >;
+}> {
+  const originalScores: Record<
+    string,
+    Record<string, { score: number; justification: string }>
+  > = {};
+  const revisedScores: Record<
+    string,
+    Record<string, Record<string, { score: number; justification: string }>>
+  > = {};
+
+  for (const judge of modelsToRun) {
+    originalScores[judge.name] = {};
+    revisedScores[judge.name] = {};
+    for (const author of modelsToRun) {
+      revisedScores[judge.name]![author.name] = {};
+    }
+  }
+
+  const tasks: Array<Promise<void>> = [];
+
+  for (const judge of modelsToRun) {
+    for (const author of modelsToRun) {
+      tasks.push(
+        limit(async () => {
+          const essayText = essays[author.name]!;
+          console.log(`  ${judge.name} scoring ${author.name} (original)...`);
+          const result = await scoreEssay(judge, essayText, topic);
+          originalScores[judge.name]![author.name] = {
+            score: result.score,
+            justification: result.justification,
+          };
+          usageTracker.scores[judge.name]!.push(result.usage);
+          console.log(
+            `  ✓ ${judge.name} → ${author.name} (original): ${
+              result.score
+            } (${result.usage.totalTokens} tokens, $${result.usage.cost.toFixed(
+              4
+            )})`
+          );
+        })
+      );
+    }
+  }
+
+  for (const judge of modelsToRun) {
+    for (const author of modelsToRun) {
+      for (const reviewer of modelsToRun) {
+        if (author.name === reviewer.name) continue;
+
+        tasks.push(
+          limit(async () => {
+            const revision = revisions[author.name]![reviewer.name]!;
+            console.log(
+              `  ${judge.name} scoring ${author.name}←${reviewer.name} (revised)...`
+            );
+            const result = await scoreEssay(judge, revision, topic);
+            revisedScores[judge.name]![author.name]![reviewer.name] = {
+              score: result.score,
+              justification: result.justification,
+            };
+            usageTracker.scores[judge.name]!.push(result.usage);
+            console.log(
+              `  ✓ ${judge.name} → ${author.name}←${reviewer.name}: ${
+                result.score
+              } (${
+                result.usage.totalTokens
+              } tokens, $${result.usage.cost.toFixed(4)})`
+            );
+          })
+        );
+      }
+    }
+  }
+
+  await Promise.all(tasks);
+  return { original: originalScores, revised: revisedScores };
+}
+
+/**
+ * Calculate rankings from scores for a single topic.
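+ *
+ * A worked sketch of the improvement metric, assuming two judges who gave
+ * an original essay 6 and 7 and its revision 7 and 8:
+ *
+ * ```ts
+ * // improvement = mean revised score - mean original score,
+ * // credited to the reviewer whose feedback drove the revision
+ * const improvement = (7 + 8) / 2 - (6 + 7) / 2; // +1.0
+ * ```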
+ */
+function calculateScoringRankings(scores: {
+  original: Record<
+    string,
+    Record<string, { score: number; justification: string }>
+  >;
+  revised: Record<
+    string,
+    Record<string, Record<string, { score: number; justification: string }>>
+  >;
+}): TopicResults["rankings"] {
+  const essayScores: Array<{
+    type: "original" | "revised";
+    author: string;
+    reviewer?: string;
+    avgScore: number;
+  }> = [];
+
+  const judges = Object.keys(scores.original);
+  const firstJudge = judges[0]!;
+  const authors = Object.keys(scores.original[firstJudge]!);
+
+  for (const author of authors) {
+    const judgeScores = judges.map((j) => scores.original[j]![author]!.score);
+    const avgScore =
+      judgeScores.reduce((a, b) => a + b, 0) / judgeScores.length;
+    essayScores.push({ type: "original", author, avgScore });
+  }
+
+  for (const author of authors) {
+    for (const reviewer of authors) {
+      if (author === reviewer) continue;
+      const judgeScores = judges.map(
+        (j) => scores.revised[j]![author]![reviewer]!.score
+      );
+      const avgScore =
+        judgeScores.reduce((a, b) => a + b, 0) / judgeScores.length;
+      essayScores.push({ type: "revised", author, reviewer, avgScore });
+    }
+  }
+
+  essayScores.sort((a, b) => b.avgScore - a.avgScore);
+
+  const reviewerImpact: Record<string, number[]> = {};
+  for (const reviewer of authors) {
+    reviewerImpact[reviewer] = [];
+  }
+
+  for (const author of authors) {
+    const originalAvg =
+      judges.reduce((sum, j) => sum + scores.original[j]![author]!.score, 0) /
+      judges.length;
+
+    for (const reviewer of authors) {
+      if (author === reviewer) continue;
+      const revisedAvg =
+        judges.reduce(
+          (sum, j) => sum + scores.revised[j]![author]![reviewer]!.score,
+          0
+        ) / judges.length;
+      const improvement = revisedAvg - originalAvg;
+      reviewerImpact[reviewer]!.push(improvement);
+    }
+  }
+
+  const reviewerScores = Object.entries(reviewerImpact).map(
+    ([reviewer, improvements]) => ({
+      reviewer,
+      avgImprovement:
+        improvements.reduce((a, b) => a + b, 0) / improvements.length,
+    })
+  );
+
+  reviewerScores.sort((a, b) => b.avgImprovement - a.avgImprovement);
+
+  return {
+    essays: essayScores,
+    reviewers: reviewerScores,
+  };
+}
+
+/**
+ * Calculate aggregate rankings across all topics for scoring test.
+ */
+function calculateScoringAggregateRankings(
+  topics: TopicResults[]
+): ArenaResults["aggregateRankings"] {
+  const modelScores: Record<
+    string,
+    { scores: number[]; improvements: number[] }
+  > = {};
+  const reviewerImprovements: Record<string, number[]> = {};
+
+  for (const topic of topics) {
+    const originalByAuthor: Record<string, number> = {};
+    for (const entry of topic.rankings.essays) {
+      if (entry.type === "original") {
+        originalByAuthor[entry.author] = entry.avgScore;
+        if (!modelScores[entry.author]) {
+          modelScores[entry.author] = { scores: [], improvements: [] };
+        }
+        modelScores[entry.author]!.scores.push(entry.avgScore);
+      }
+    }
+
+    for (const entry of topic.rankings.essays) {
+      if (entry.type === "revised" && entry.reviewer) {
+        const original = originalByAuthor[entry.author]!;
+        const improvement = entry.avgScore - original;
+        modelScores[entry.author]!.improvements.push(improvement);
+
+        if (!reviewerImprovements[entry.reviewer]) {
+          reviewerImprovements[entry.reviewer] = [];
+        }
+        reviewerImprovements[entry.reviewer]!.push(improvement);
+      }
+    }
+  }
+
+  const essayRankings = Object.entries(modelScores).map(([author, data]) => ({
+    author,
+    avgScore: data.scores.reduce((a, b) => a + b, 0) / data.scores.length,
+    avgImprovement:
+      data.improvements.length > 0
+        ? data.improvements.reduce((a, b) => a + b, 0) /
+          data.improvements.length
+        : 0,
+  }));
+  essayRankings.sort((a, b) => b.avgScore - a.avgScore);
+
+  const reviewerRankings = Object.entries(reviewerImprovements).map(
+    ([reviewer, improvements]) => ({
+      reviewer,
+      avgImprovement:
+        improvements.reduce((a, b) => a + b, 0) / improvements.length,
+    })
+  );
+  reviewerRankings.sort((a, b) => b.avgImprovement - a.avgImprovement);
+
+  return {
+    essays: essayRankings,
+    reviewers: reviewerRankings,
+  };
+}
+
+/**
+ * Prints topic results for scoring test.
+ */
+function printScoringTopicResults(result: TopicResults) {
+  console.log(`\n  📊 Results for "${result.topic}":\n`);
+
+  console.log("  📝 Essay Rankings (by avg score):");
+  result.rankings.essays.slice(0, 5).forEach((entry, index) => {
+    const label = entry.reviewer
+      ? `${entry.author} ← ${entry.reviewer} (revised)`
+      : `${entry.author} (original)`;
+    console.log(`    ${index + 1}. ${label} - ${entry.avgScore.toFixed(2)}`);
+  });
+  if (result.rankings.essays.length > 5) {
+    console.log(`    ... and ${result.rankings.essays.length - 5} more`);
+  }
+
+  console.log("\n  🎯 Reviewer Rankings (by improvement impact):");
+  result.rankings.reviewers.forEach((entry, index) => {
+    const sign = entry.avgImprovement >= 0 ? "+" : "";
+    console.log(
+      `    ${index + 1}. ${
+        entry.reviewer
+      } - ${sign}${entry.avgImprovement.toFixed(2)}`
+    );
+  });
+}
+
+/**
+ * Run all phases for a single topic (scoring test).
+ */
+async function runScoringTopicArena(
+  topic: string,
+  topicIndex: number,
+  totalTopics: number,
+  baseDir: string
+): Promise<TopicResults> {
+  console.log(
+    `\n${"═".repeat(60)}\n📚 Topic ${
+      topicIndex + 1
+    }/${totalTopics}: "${topic}"\n${"═".repeat(60)}`
+  );
+
+  const topicDir = await createTopicDirectories(baseDir, topic);
+
+  console.log("\n  📝 Phase 1: Essay Generation");
+  const essays = await runPhase1Essays(topic, topicDir);
+  console.log(`  ✓ Phase 1 complete: ${modelsToRun.length} essays`);
+
+  console.log("\n  📋 Phase 2: Feedback Generation");
+  const feedback = await runPhase2Feedback(topic, essays, topicDir);
+  const feedbackCount = modelsToRun.length * (modelsToRun.length - 1);
+  console.log(`  ✓ Phase 2 complete: ${feedbackCount} feedback pieces`);
+
+  console.log("\n  ✏️ Phase 3: Revisions");
+  const revisions = await runPhase3Revisions(topic, essays, feedback, topicDir);
+  console.log(`  ✓ Phase 3 complete: ${feedbackCount} revisions`);
+
+  console.log("\n  ⭐ Phase 4: Scoring");
+  const scores = await runPhase4Scoring(topic, essays, revisions);
+  console.log(`  ✓ Phase 4 complete`);
+
+  const rankings = calculateScoringRankings(scores);
+
+  return {
+    topic,
+    essays,
+    feedback,
+    revisions,
+    scores,
+    rankings,
+  };
+}
+
+/**
+ * Formats duration in milliseconds to human-readable string.
+ */
+function formatDuration(ms: number): string {
+  if (ms < 1000) return `${ms}ms`;
+  const seconds = Math.floor(ms / 1000);
+  if (seconds < 60) return `${seconds}s`;
+  const minutes = Math.floor(seconds / 60);
+  const remainingSeconds = seconds % 60;
+  if (minutes < 60) return `${minutes}m ${remainingSeconds}s`;
+  const hours = Math.floor(minutes / 60);
+  const remainingMinutes = minutes % 60;
+  return `${hours}h ${remainingMinutes}m ${remainingSeconds}s`;
+}
+
+/**
+ * Main scoring test orchestration.
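+ *
+ * An invocation sketch (the flags are parsed at the top of this file; how
+ * they are wired into the entrypoint is assumed here):
+ *
+ * ```
+ * bun run index.ts --test=scoring-test            # full model set
+ * bun run index.ts --test=scoring-test --dry-run  # cheap dry-run models
+ * ```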
+ */
+async function runScoringTest(): Promise<void> {
+  usageTracker = createUsageTracker();
+
+  const confirmed = await confirmScoringRun();
+  if (!confirmed) {
+    console.log("\nAborted.");
+    process.exit(0);
+  }
+
+  const overallStart = Date.now();
+
+  const { baseDir, timestamp } = await initArenaRun("scoring-test");
+  console.log(`\nResults will be saved to: ${baseDir}`);
+
+  const topicResults: TopicResults[] = [];
+  const topicTimes: Array<{ topic: string; duration: number }> = [];
+
+  for (let i = 0; i < TOPICS.length; i++) {
+    const topic = TOPICS[i]!;
+    const topicStart = Date.now();
+    const result = await runScoringTopicArena(topic, i, TOPICS.length, baseDir);
+    const topicDuration = Date.now() - topicStart;
+    topicResults.push(result);
+    topicTimes.push({ topic, duration: topicDuration });
+    printScoringTopicResults(result);
+    console.log(`\n  ⏱️ Topic completed in ${formatDuration(topicDuration)}`);
+  }
+
+  console.log("\n\n📊 Calculating aggregate rankings...\n");
+
+  // Log all topic results before aggregate
+  console.log("═".repeat(60));
+  console.log("\n📋 INDIVIDUAL TOPIC RESULTS\n");
+  for (const result of topicResults) {
+    printScoringTopicResults(result);
+    console.log("");
+  }
+  const aggregateRankings = calculateScoringAggregateRankings(topicResults);
+
+  const results: ArenaResults = {
+    timestamp,
+    models: modelsToRun.map((m) => m.name),
+    topics: topicResults,
+    aggregateRankings,
+  };
+
+  await writeResultsJson(baseDir, results);
+  await writeSummary(baseDir, results);
+
+  console.log("═".repeat(60));
+  console.log("\n🏆 AGGREGATE RESULTS\n");
+
+  console.log("📝 Models (as Writers):\n");
+  aggregateRankings.essays.forEach((entry, index) => {
+    const sign = entry.avgImprovement >= 0 ? "+" : "";
+    console.log(
+      `  ${index + 1}. ${entry.author} - ${entry.avgScore.toFixed(
+        2
+      )} avg (${sign}${entry.avgImprovement.toFixed(2)} after feedback)`
+    );
+  });
+
+  console.log("\n🎯 Reviewers (by improvement impact):\n");
+  aggregateRankings.reviewers.forEach((entry, index) => {
+    const sign = entry.avgImprovement >= 0 ? "+" : "";
+    console.log(
+      `  ${index + 1}. ${
+        entry.reviewer
+      } - ${sign}${entry.avgImprovement.toFixed(2)}`
+    );
+  });
+
+  printUsageSummary("scoring-test");
+
+  const overallDuration = Date.now() - overallStart;
+  console.log("\n" + "═".repeat(60));
+  console.log("\n⏱️ RUNTIME SUMMARY\n");
+  topicTimes.forEach((t) => {
+    console.log(
+      `  ${t.topic.slice(0, 40).padEnd(42)} ${formatDuration(t.duration)}`
+    );
+  });
+  console.log(`  ${"─".repeat(50)}`);
+  console.log(`  ${"Total".padEnd(42)} ${formatDuration(overallDuration)}`);
+
+  console.log(`\n✨ Scoring test complete! Results saved to: ${baseDir}`);
+}
+
+// ============================================================================
+// 1V1 TEST SPECIFIC
+// ============================================================================
+
+/**
+ * Counts API calls for 1v1 test.
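+ *
+ * Per topic there are N originals plus N*(N-1) revisions, i.e. N^2 essays
+ * in total; every unordered pair C(N^2, 2) is judged by all N models. For
+ * the dry-run set (N = 3): 9 essays, 36 pairs, 36 * 3 = 108 comparisons.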
+ */
+function countOneVsOneApiCalls() {
+  let essays = 0;
+  let feedback = 0;
+  let revisions = 0;
+  let comparisons = 0;
+
+  const n = modelsToRun.length;
+
+  for (const _topic of TOPICS) {
+    // Phase 1: Essays
+    essays += n;
+
+    // Phase 2: Feedback (each model reviews every other)
+    feedback += n * (n - 1);
+
+    // Phase 3: Revisions (each author revises per reviewer)
+    revisions += n * (n - 1);
+
+    // Phase 4: Comparisons
+    // All essays compete pairwise: n originals + n*(n-1) revisions = n^2
+    // essays, so C(n^2, 2) unique pairs, each judged by all n models
+    // (this mirrors how runPhase4Comparisons builds its pair list).
+    const totalEssayCount = n + n * (n - 1); // = n^2
+    const essayPairs = (totalEssayCount * (totalEssayCount - 1)) / 2;
+    comparisons += essayPairs * n;
+  }
+
+  return {
+    essays,
+    feedback,
+    revisions,
+    comparisons,
+    total: essays + feedback + revisions + comparisons,
+  };
+}
+
+/**
+ * Prompts for 1v1 test confirmation.
+ */
+async function confirmOneVsOneRun(): Promise<boolean> {
+  const { essays, feedback, revisions, comparisons, total } =
+    countOneVsOneApiCalls();
+
+  console.log("\n🏟️ Writing Quality Arena - 1v1 Test\n");
+  if (isDryRun) {
+    console.log("⚡ DRY RUN MODE (using cheap models)\n");
+  }
+  console.log(`Models: ${modelsToRun.length}`);
+  console.log(`Topics: ${TOPICS.length}`);
+  console.log(`\nAPI Call Breakdown (across all ${TOPICS.length} topics):`);
+  console.log(
+    `  Phase 1 - Essays:      ${essays.toString().padStart(6)} calls`
+  );
+  console.log(
+    `  Phase 2 - Feedback:    ${feedback.toString().padStart(6)} calls`
+  );
+  console.log(
+    `  Phase 3 - Revisions:   ${revisions.toString().padStart(6)} calls`
+  );
+  console.log(
+    `  Phase 4 - Comparisons: ${comparisons.toString().padStart(6)} calls`
+  );
+  console.log(`  ────────────────────────────`);
+  console.log(
+    `  Total:                 ${total.toString().padStart(6)} calls\n`
+  );
+  console.log(`Parallelism: ${PARALLEL_LIMIT} concurrent requests\n`);
+
+  process.stdout.write("Proceed? (Y/n): ");
+
+  return new Promise((resolve) => {
+    process.stdin.once("data", (data) => {
+      const input = data.toString().trim().toLowerCase();
+      resolve(input === "" || input === "y" || input === "yes");
+    });
+  });
+}
 
 /**
- * Runs the complete essay pipeline: generation → review → revision.
+ * Phase 4 (1v1): Head-to-head comparisons of all essays.
  */
-async function runEssayPipeline(): Promise<void> {
-  console.log("🎓 Auto-Draftify: Essay Generation Pipeline\n");
+async function runPhase4Comparisons(
+  topic: string,
+  essays: Record<string, string>,
+  revisions: Record<string, Record<string, string>>,
+  topicDir: string
+): Promise<ComparisonResult[]> {
+  const comparisons: ComparisonResult[] = [];
+
+  // Build list of all essays (original + revised)
+  interface EssayEntry {
+    author: string;
+    reviewer?: string;
+    text: string;
+  }
+
+  const allEssays: EssayEntry[] = [];
+
+  // Add original essays
+  for (const author of Object.keys(essays)) {
+    allEssays.push({ author, text: essays[author]! });
 
 /**
- * Runs the complete essay pipeline: generation → review → revision.
+ * Phase 4 (1v1): Head-to-head comparisons of all essays.
  */
-async function runEssayPipeline(): Promise<void> {
-  console.log("🎓 Auto-Draftify: Essay Generation Pipeline\n");
+async function runPhase4Comparisons(
+  topic: string,
+  essays: Record<string, string>,
+  revisions: Record<string, Record<string, string>>,
+  topicDir: string
+): Promise<ComparisonResult[]> {
+  const comparisons: ComparisonResult[] = [];
+
+  // Build list of all essays (original + revised)
+  interface EssayEntry {
+    author: string;
+    reviewer?: string;
+    text: string;
+  }
+
+  const allEssays: EssayEntry[] = [];
+
+  // Add original essays
+  for (const author of Object.keys(essays)) {
+    allEssays.push({ author, text: essays[author]! });
+  }
+
+  // Add revised essays
+  for (const author of Object.keys(revisions)) {
+    for (const reviewer of Object.keys(revisions[author]!)) {
+      allEssays.push({
+        author,
+        reviewer,
+        text: revisions[author]![reviewer]!,
+      });
+    }
+  }
+
+  // Generate all unique pairs
+  const pairs: Array<[EssayEntry, EssayEntry]> = [];
+  for (let i = 0; i < allEssays.length; i++) {
+    for (let j = i + 1; j < allEssays.length; j++) {
+      pairs.push([allEssays[i]!, allEssays[j]!]);
+    }
+  }
+
+  const tasks: Array<Promise<void>> = [];
 
-  // Step 1: Get topic from user
-  const topic = await promptForTopic();
-  console.log(`\n📝 Topic: ${topic}\n`);
+  for (const judge of modelsToRun) {
+    for (const [essayA, essayB] of pairs) {
+      tasks.push(
+        limit(async () => {
+          const labelA = essayA.reviewer
+            ? `${essayA.author}←${essayA.reviewer}`
+            : essayA.author;
+          const labelB = essayB.reviewer
+            ? `${essayB.author}←${essayB.reviewer}`
+            : essayB.author;
 
-  // Step 2: Generate initial essay
-  console.log("Step 1/3: Generating initial essay...");
-  const essayResult = await generateEssay(topic);
-  console.log("✓ Essay generated\n");
+          console.log(`  ${judge.name} comparing ${labelA} vs ${labelB}...`);
+
+          const result = await compareEssays(
+            judge,
+            { author: essayA.author, text: essayA.text },
+            { author: essayB.author, text: essayB.text },
+            topic
+          );
+
+          const comparison: ComparisonResult = {
+            judge: judge.name,
+            essayA: { author: essayA.author, reviewer: essayA.reviewer },
+            essayB: { author: essayB.author, reviewer: essayB.reviewer },
+            winner: result.winner,
+            reasoning: result.reasoning,
+          };
+
+          comparisons.push(comparison);
+          usageTracker.comparisons[judge.name]!.push(result.usage);
+
+          await writeComparison(
+            topicDir,
+            judge.name,
+            { author: essayA.author, reviewer: essayA.reviewer },
+            { author: essayB.author, reviewer: essayB.reviewer },
+            result.winner,
+            result.reasoning
+          );
+
+          const winnerLabel =
+            result.winner === "A"
+              ? labelA
+              : result.winner === "B"
+              ? labelB
+              : "Tie";
+          console.log(
+            `  ✓ ${judge.name}: ${labelA} vs ${labelB} → ${winnerLabel} (${
+              result.usage.totalTokens
+            } tokens, $${result.usage.cost.toFixed(4)})`
+          );
+        })
+      );
+    }
+  }
+
+  await Promise.all(tasks);
+  return comparisons;
+}
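+
+// Rankings key essays as "author" for originals and "author:reviewer" for
+// revisions, so one model can appear once per piece of feedback it received.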
+
+/**
+ * Calculate rankings from comparisons for a single topic.
+ */
+function calculateOneVsOneRankings(
+  comparisons: ComparisonResult[]
+): OneVsOneTopicResults["rankings"] {
+  // Track wins/losses/ties per essay
+  const stats: Record<
+    string,
+    {
+      wins: number;
+      losses: number;
+      ties: number;
+      author: string;
+      reviewer?: string;
+    }
+  > = {};
+
+  function getKey(author: string, reviewer?: string) {
+    return reviewer ? `${author}:${reviewer}` : author;
+  }
+
+  for (const comp of comparisons) {
+    const keyA = getKey(comp.essayA.author, comp.essayA.reviewer);
+    const keyB = getKey(comp.essayB.author, comp.essayB.reviewer);
+
+    if (!stats[keyA]) {
+      stats[keyA] = {
+        wins: 0,
+        losses: 0,
+        ties: 0,
+        author: comp.essayA.author,
+        reviewer: comp.essayA.reviewer,
+      };
+    }
+    if (!stats[keyB]) {
+      stats[keyB] = {
+        wins: 0,
+        losses: 0,
+        ties: 0,
+        author: comp.essayB.author,
+        reviewer: comp.essayB.reviewer,
+      };
+    }
+
+    if (comp.winner === "A") {
+      stats[keyA]!.wins++;
+      stats[keyB]!.losses++;
+    } else if (comp.winner === "B") {
+      stats[keyB]!.wins++;
+      stats[keyA]!.losses++;
+    } else {
+      stats[keyA]!.ties++;
+      stats[keyB]!.ties++;
+    }
+  }
+
+  const essays = Object.values(stats).map((s) => ({
+    author: s.author,
+    reviewer: s.reviewer,
+    wins: s.wins,
+    losses: s.losses,
+    ties: s.ties,
+    winRate:
+      s.wins + s.losses + s.ties > 0
+        ? s.wins / (s.wins + s.losses + s.ties)
+        : 0,
+  }));
+
+  essays.sort((a, b) => b.winRate - a.winRate || b.wins - a.wins);
+
+  return { essays };
+}
+
+/**
+ * Calculate aggregate rankings across all topics for 1v1 test.
+ */
+function calculateOneVsOneAggregateRankings(
+  topics: OneVsOneTopicResults[]
+): OneVsOneResults["aggregateRankings"] {
+  // Aggregate by original author only (not per-revision)
+  const authorStats: Record<
+    string,
+    { wins: number; losses: number; ties: number }
+  > = {};
+
+  // Aggregate by reviewer (how well essays do after being revised by this reviewer)
+  const reviewerStats: Record<
+    string,
+    { wins: number; losses: number; ties: number }
+  > = {};
+
+  // Aggregate by author+reviewer pairing
+  const pairingStats: Record<
+    string,
+    {
+      author: string;
+      reviewer: string;
+      wins: number;
+      losses: number;
+      ties: number;
+    }
+  > = {};
+
+  for (const topic of topics) {
+    for (const entry of topic.rankings.essays) {
+      if (!entry.reviewer) {
+        // Original essay - count for author
+        if (!authorStats[entry.author]) {
+          authorStats[entry.author] = { wins: 0, losses: 0, ties: 0 };
+        }
+        authorStats[entry.author]!.wins += entry.wins;
+        authorStats[entry.author]!.losses += entry.losses;
+        authorStats[entry.author]!.ties += entry.ties;
+      } else {
+        // Revised essay - count for reviewer and pairing
+        if (!reviewerStats[entry.reviewer]) {
+          reviewerStats[entry.reviewer] = { wins: 0, losses: 0, ties: 0 };
+        }
+        reviewerStats[entry.reviewer]!.wins += entry.wins;
+        reviewerStats[entry.reviewer]!.losses += entry.losses;
+        reviewerStats[entry.reviewer]!.ties += entry.ties;
+
+        const pairingKey = `${entry.author}:${entry.reviewer}`;
+        if (!pairingStats[pairingKey]) {
+          pairingStats[pairingKey] = {
+            author: entry.author,
+            reviewer: entry.reviewer,
+            wins: 0,
+            losses: 0,
+            ties: 0,
+          };
+        }
+        pairingStats[pairingKey]!.wins += entry.wins;
+        pairingStats[pairingKey]!.losses += entry.losses;
+        pairingStats[pairingKey]!.ties += entry.ties;
+      }
+    }
+  }
 
-  // Step 3: Review the essay
-  console.log("Step 2/3: Reviewing essay...");
-  const reviewResult = await reviewEssay(essayResult.text);
-  console.log("✓ Review completed\n");
+  const calcWinRate = (s: { wins: number; losses: number; ties: number }) =>
+    s.wins + s.losses + s.ties > 0 ? s.wins / (s.wins + s.losses + s.ties) : 0;
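+  // e.g. a 7W/3L/1T record gives 7 / 11 ≈ 63.6%; ties count as games played,
+  // so they lower the win rate rather than being ignored.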
 
-  // Step 4: Revise the essay
-  console.log("Step 3/3: Revising essay based on feedback...");
-  const revisionResult = await reviseEssay(
+  const essays = Object.entries(authorStats).map(([author, s]) => ({
+    author,
+    wins: s.wins,
+    losses: s.losses,
+    ties: s.ties,
+    winRate: calcWinRate(s),
+  }));
+  essays.sort((a, b) => b.winRate - a.winRate || b.wins - a.wins);
+
+  const reviewers = Object.entries(reviewerStats).map(([reviewer, s]) => ({
+    reviewer,
+    wins: s.wins,
+    losses: s.losses,
+    ties: s.ties,
+    winRate: calcWinRate(s),
+  }));
+  reviewers.sort((a, b) => b.winRate - a.winRate || b.wins - a.wins);
+
+  const pairings = Object.values(pairingStats).map((s) => ({
+    author: s.author,
+    reviewer: s.reviewer,
+    wins: s.wins,
+    losses: s.losses,
+    ties: s.ties,
+    winRate: calcWinRate(s),
+  }));
+  pairings.sort((a, b) => b.winRate - a.winRate || b.wins - a.wins);
+
+  return { essays, reviewers, pairings };
+}
+
+/**
+ * Prints topic results for 1v1 test.
+ */
+function printOneVsOneTopicResults(result: OneVsOneTopicResults) {
+  console.log(`\n  📊 Results for "${result.topic}":\n`);
+
+  console.log("  📝 Essay Rankings (by win rate):");
+  result.rankings.essays.slice(0, 5).forEach((entry, index) => {
+    const label = entry.reviewer
+      ? `${entry.author} ← ${entry.reviewer} (revised)`
+      : `${entry.author} (original)`;
+    console.log(
+      `    ${index + 1}. ${label} - ${entry.wins}W/${entry.losses}L/${
+        entry.ties
+      }T (${(entry.winRate * 100).toFixed(1)}%)`
+    );
+  });
+  if (result.rankings.essays.length > 5) {
+    console.log(`    ... and ${result.rankings.essays.length - 5} more`);
+  }
+}
+
+/**
+ * Run all phases for a single topic (1v1 test).
+ */
+async function runOneVsOneTopicArena(
+  topic: string,
+  topicIndex: number,
+  totalTopics: number,
+  baseDir: string
+): Promise<OneVsOneTopicResults> {
+  console.log(
+    `\n${"═".repeat(60)}\n📚 Topic ${
+      topicIndex + 1
+    }/${totalTopics}: "${topic}"\n${"═".repeat(60)}`
+  );
+
+  const topicDir = await createTopicDirectories(baseDir, topic);
+
+  console.log("\n  📝 Phase 1: Essay Generation");
+  const essays = await runPhase1Essays(topic, topicDir);
+  console.log(`  ✓ Phase 1 complete: ${modelsToRun.length} essays`);
+
+  console.log("\n  📋 Phase 2: Feedback Generation");
+  const feedback = await runPhase2Feedback(topic, essays, topicDir);
+  const feedbackCount = modelsToRun.length * (modelsToRun.length - 1);
+  console.log(`  ✓ Phase 2 complete: ${feedbackCount} feedback pieces`);
+
+  console.log("\n  ✏️ Phase 3: Revisions");
+  const revisions = await runPhase3Revisions(topic, essays, feedback, topicDir);
+  console.log(`  ✓ Phase 3 complete: ${feedbackCount} revisions`);
+
+  console.log("\n  🥊 Phase 4: Head-to-Head Comparisons");
+  const comparisons = await runPhase4Comparisons(
     topic,
-    essayResult.text,
-    reviewResult.text
+    essays,
+    revisions,
+    topicDir
   );
-  console.log("✓ Revision completed\n");
+  console.log(`  ✓ Phase 4 complete: ${comparisons.length} comparisons`);
+
+  const rankings = calculateOneVsOneRankings(comparisons);
+
+  return {
+    topic,
+    essays,
+    feedback,
+    revisions,
+    comparisons,
+    rankings,
+  };
+}
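+
+// Topics run strictly one after another; only the calls inside each phase
+// are parallelized, so the effective concurrency cap stays PARALLEL_LIMIT
+// for the whole run.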
+
+/**
+ * Main 1v1 test orchestration.
+ */
+async function runOneVsOneTest(): Promise<void> {
+  usageTracker = createUsageTracker();
+
+  const confirmed = await confirmOneVsOneRun();
+  if (!confirmed) {
+    console.log("\nAborted.");
+    process.exit(0);
+  }
+
+  const overallStart = Date.now();
+
+  const { baseDir, timestamp } = await initArenaRun("1v1");
+  console.log(`\nResults will be saved to: ${baseDir}`);
+
+  const topicResults: OneVsOneTopicResults[] = [];
+  const topicTimes: Array<{ topic: string; duration: number }> = [];
+
+  for (let i = 0; i < TOPICS.length; i++) {
+    const topic = TOPICS[i]!;
+    const topicStart = Date.now();
+    const result = await runOneVsOneTopicArena(
+      topic,
+      i,
+      TOPICS.length,
+      baseDir
+    );
+    const topicDuration = Date.now() - topicStart;
+    topicResults.push(result);
+    topicTimes.push({ topic, duration: topicDuration });
+    printOneVsOneTopicResults(result);
+    console.log(`\n  ⏱️ Topic completed in ${formatDuration(topicDuration)}`);
+  }
+
+  console.log("\n\n📊 Calculating aggregate rankings...\n");
+
+  // Log all topic results before aggregate
+  console.log("═".repeat(60));
+  console.log("\n📋 INDIVIDUAL TOPIC RESULTS\n");
+  for (const result of topicResults) {
+    printOneVsOneTopicResults(result);
+    console.log("");
+  }
+
+  const aggregateRankings = calculateOneVsOneAggregateRankings(topicResults);
+
+  const results: OneVsOneResults = {
+    timestamp,
+    models: modelsToRun.map((m) => m.name),
+    topics: topicResults,
+    aggregateRankings,
+  };
 
-  // Step 5: Write outputs to files
-  await writePipelineOutputs({
-    essay: essayResult.text,
-    review: reviewResult.text,
-    revision: revisionResult.text,
+  await writeOneVsOneResultsJson(baseDir, results);
+  await writeOneVsOneSummary(baseDir, results);
+
+  console.log("═".repeat(60));
+  console.log("\n🏆 AGGREGATE RESULTS\n");
+
+  console.log("📝 Models (as Writers - Original Essays):\n");
+  aggregateRankings.essays.forEach((entry, index) => {
+    console.log(
+      `  ${index + 1}. ${entry.author} - ${entry.wins}W/${entry.losses}L/${
+        entry.ties
+      }T (${(entry.winRate * 100).toFixed(1)}% win rate)`
+    );
+  });
+
+  console.log("\n🎯 Reviewers (by revised essay performance):\n");
+  aggregateRankings.reviewers.forEach((entry, index) => {
+    console.log(
+      `  ${index + 1}. ${entry.reviewer} - ${entry.wins}W/${entry.losses}L/${
+        entry.ties
+      }T (${(entry.winRate * 100).toFixed(1)}% win rate)`
+    );
+  });
+
+  console.log("\n🤝 Pairings (Author + Reviewer):\n");
+  aggregateRankings.pairings.forEach((entry, index) => {
+    console.log(
+      `  ${index + 1}. ${entry.author} ← ${entry.reviewer} - ${entry.wins}W/${
+        entry.losses
+      }L/${entry.ties}T (${(entry.winRate * 100).toFixed(1)}% win rate)`
+    );
+  });
+
+  printUsageSummary("1v1");
+
+  const overallDuration = Date.now() - overallStart;
+  console.log("\n" + "═".repeat(60));
+  console.log("\n⏱️ RUNTIME SUMMARY\n");
+  topicTimes.forEach((t) => {
+    console.log(
+      `  ${t.topic.slice(0, 40).padEnd(42)} ${formatDuration(t.duration)}`
+    );
   });
+  console.log(`  ${"─".repeat(50)}`);
+  console.log(`  ${"Total".padEnd(42)} ${formatDuration(overallDuration)}`);
+
+  console.log(`\n✨ 1v1 test complete! Results saved to: ${baseDir}`);
+}
+
+// ============================================================================
+// SHARED UTILITIES
+// ============================================================================
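+
+// Assumed shape of TokenUsage (declared in types.ts): the helpers below only
+// read { totalTokens: number; cost: number } from each record.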
+
+/**
+ * Calculates average tokens and summed cost from an array of usage records.
+ */
+function calcAverage(usages: TokenUsage[]) {
+  if (usages.length === 0) return { tokens: 0, cost: 0 };
+  const totalTokens = usages.reduce((sum, u) => sum + u.totalTokens, 0);
+  const totalCost = usages.reduce((sum, u) => sum + u.cost, 0);
+  return {
+    tokens: Math.round(totalTokens / usages.length),
+    cost: totalCost,
+  };
+}
+
+/**
+ * Prints a summary of token usage and costs.
+ */
+function printUsageSummary(testType: TestType) {
+  console.log("\n" + "═".repeat(60));
+  console.log("\n💰 TOKEN USAGE & COST SUMMARY\n");
+
+  let totalEssayCost = 0;
+  let totalReviewCost = 0;
+  let totalRevisionCost = 0;
+  let totalScoreCost = 0;
+  let totalComparisonCost = 0;
+
+  const modelStats: Array<{
+    name: string;
+    essayAvgTokens: number;
+    essayCost: number;
+    reviewAvgTokens: number;
+    reviewCost: number;
+    revisionAvgTokens: number;
+    revisionCost: number;
+    scoreCost: number;
+    comparisonCost: number;
+    totalCost: number;
+  }> = [];
+
+  for (const model of modelsToRun) {
+    const essayStats = calcAverage(usageTracker.essays[model.name]!);
+    const reviewStats = calcAverage(usageTracker.reviews[model.name]!);
+    const revisionStats = calcAverage(usageTracker.revisions[model.name]!);
+    const scoreStats = calcAverage(usageTracker.scores[model.name]!);
+    const comparisonStats = calcAverage(usageTracker.comparisons[model.name]!);
+
+    totalEssayCost += essayStats.cost;
+    totalReviewCost += reviewStats.cost;
+    totalRevisionCost += revisionStats.cost;
+    totalScoreCost += scoreStats.cost;
+    totalComparisonCost += comparisonStats.cost;
+
+    modelStats.push({
+      name: model.name,
+      essayAvgTokens: essayStats.tokens,
+      essayCost: essayStats.cost,
+      reviewAvgTokens: reviewStats.tokens,
+      reviewCost: reviewStats.cost,
+      revisionAvgTokens: revisionStats.tokens,
+      revisionCost: revisionStats.cost,
+      scoreCost: scoreStats.cost,
+      comparisonCost: comparisonStats.cost,
+      totalCost:
+        essayStats.cost +
+        reviewStats.cost +
+        revisionStats.cost +
+        scoreStats.cost +
+        comparisonStats.cost,
+    });
+  }
+
+  const grandTotal =
+    totalEssayCost +
+    totalReviewCost +
+    totalRevisionCost +
+    totalScoreCost +
+    totalComparisonCost;
+
+  console.log("Phase Costs:");
+  console.log(`  Essays (First):     $${totalEssayCost.toFixed(4)}`);
+  console.log(`  Reviews:            $${totalReviewCost.toFixed(4)}`);
+  console.log(`  Revisions (Follow): $${totalRevisionCost.toFixed(4)}`);
+  if (testType === "scoring-test") {
+    console.log(`  Scoring:            $${totalScoreCost.toFixed(4)}`);
+  } else {
+    console.log(`  Comparisons:        $${totalComparisonCost.toFixed(4)}`);
+  }
+  console.log(`  ────────────────────────────`);
+  console.log(`  Total:              $${grandTotal.toFixed(4)}`);
+
+  console.log("\n\nPer-Model Token Averages & Costs:\n");
+  console.log(
+    "  Model".padEnd(32) +
+      "First Essay".padStart(14) +
+      "Reviews".padStart(14) +
+      "Follow-up".padStart(14) +
+      "Total Cost".padStart(12)
+  );
+  console.log("  " + "─".repeat(84));
+
+  for (const stat of modelStats.sort((a, b) => b.totalCost - a.totalCost)) {
+    const essayCol = `${stat.essayAvgTokens} tok`.padStart(14);
+    const reviewCol = `${stat.reviewAvgTokens} tok`.padStart(14);
+    const revisionCol = `${stat.revisionAvgTokens} tok`.padStart(14);
+    const costCol = `$${stat.totalCost.toFixed(4)}`.padStart(12);
+    console.log(
+      `  ${stat.name.padEnd(30)}${essayCol}${reviewCol}${revisionCol}${costCol}`
+    );
+  }
+
+  console.log("\n  " + "─".repeat(84));
+  console.log(`  ${"GRAND TOTAL".padEnd(72)}$${grandTotal.toFixed(4)}`);
+}
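+
+// Note: `modelStats.sort(...)` in the loop header above sorts the array in
+// place, so the per-model rows print in descending total cost.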
+
+// ============================================================================
+// MAIN ENTRY POINT
+// ============================================================================
+
+async function main() {
+  let testType = getTestTypeFromArgs();
+
+  if (!testType) {
+    testType = await selectTestType();
+  }
 
-  console.log("\n✨ Pipeline complete!");
+  if (testType === "scoring-test") {
+    await runScoringTest();
+  } else {
+    await runOneVsOneTest();
+  }
 }
 
-// Run the pipeline
-runEssayPipeline().catch((error) => {
-  console.error("Error running pipeline:", error);
+main().catch((error) => {
+  console.error("Error running arena:", error);
   process.exit(1);
 });
diff --git a/package.json b/package.json
index f21c41c..e0f5ced 100644
--- a/package.json
+++ b/package.json
@@ -5,7 +5,9 @@
   "private": true,
   "dependencies": {
     "ai": "^5.0.0",
-    "@openrouter/ai-sdk-provider": "^1.0.0"
+    "@openrouter/ai-sdk-provider": "^1.0.0",
+    "p-limit": "^6.1.0",
+    "zod": "^3.24.0"
   },
   "devDependencies": {
     "@types/bun": "latest"