From 6343f0e375a27d66604b3758edcdc46b4d1217b3 Mon Sep 17 00:00:00 2001
From: James Chainey <james@runloop.ai>
Date: Wed, 7 Jan 2026 12:06:50 -0800
Subject: [PATCH 1/2] shuffled some things around; deleted some dead md files

---
 DEVPLAN.md                                    | 195 --------
 OG-ARCHITECTURE.md                            | 437 ------------------
 README.md                                     | 169 +------
 TODO                                          |  56 ---
 docs/README.md                                |  27 ++
 .../benchmark/README.md                       |   0
 docs/packages/frontend/README.md              |   1 +
 docs/packages/tax-processing/README.md        |   1 +
 docs/step0/README.md                          |  19 +
 docs/step1/README.md                          | 112 +++++
 STEP2-README.md => docs/step2/README.md       |   0
 STEP3-README.md => docs/step3/README.md       |   0
 12 files changed, 164 insertions(+), 853 deletions(-)
 delete mode 100644 DEVPLAN.md
 delete mode 100644 OG-ARCHITECTURE.md
 delete mode 100644 TODO
 create mode 100644 docs/README.md
 rename README-benchmark.md => docs/benchmark/README.md (100%)
 create mode 120000 docs/packages/frontend/README.md
 create mode 120000 docs/packages/tax-processing/README.md
 create mode 100644 docs/step0/README.md
 create mode 100644 docs/step1/README.md
 rename STEP2-README.md => docs/step2/README.md (100%)
 rename STEP3-README.md => docs/step3/README.md (100%)

diff --git a/DEVPLAN.md b/DEVPLAN.md
deleted file mode 100644
index ce639f7..0000000
--- a/DEVPLAN.md
+++ /dev/null
@@ -1,195 +0,0 @@
-# Short-term development plan
-
-This file describes the short term plan for developing the next
-feature. This file is where developers and LLMs can coordinate on the
-design of the feature and the expected development and testing steps.
-
-## Feature overview: agent code refactor
-
-We are going to refactor the agent code to separate out the prompt
-from the rest of the agent scaffolding. The goal is to wind up with
-the following:
-
-- A tax agent library, which includes the main code needed to interact
-  with the LLM. This is primarily run-agent-turn.ts and the related
-  files. This will also include any tools we want to provide to the LLM.
-- A very lightweight 'agent.txt' which consists just of the natural
-  language text describing how the LLM should handle the tax
-  processing.
-- There will be some shared LLM instructions, for things like parsing
-  the W2s and describing the json output format for the 1040 data.
-  These should be separate text files which are included in the tax
-  agent library. The main agent.txt can instruct the LLM to read these.
-
-Ultimately we will want the user to be able to easily provide
-different versions of agent.txt in order to try out different tax
-preparation behaviors. This will feed into the benchmark portion of
-the demo.
-
-## design review
-
-Please review this plan, including the overview above and the specific
-steps below. Look for things that might have been overlooked. In
-particular, we want the end product to be easy for the demo users to
-use and simple enough to understand to make a good demonstration of
-Runloop's features. Suggest specific development steps below, and I
-will review them before we proceed.
-
-## Implementation Steps
-
-### 1. Analyze and Document Current Structure
-
-- **Goal**: Understand the current agent code organization
-- **Tasks**:
-  - Document the current flow: `run-agent-turn.ts` → `CodexService` → `buildAgentPrompt()`
-  - Identify all prompt components in `tax-agent-prompt.ts` (currently ~376 lines)
-  - Map out which parts are "scaffolding" vs "natural language instructions"
-
-**Current structure:**
-
-- `run-agent-turn.ts`: Entry point with hardcoded systemPrompt (lines 11-26)
-- `tax-agent-prompt.ts`: Contains `TAX_AGENT_SYSTEM_PROMPT` constant and `buildAgentPrompt()` function
-- `CodexService`: Handles LLM interaction (not changing in this refactor)
-
-### 2. Design New File Structure
-
-- **Goal**: Create a clear separation between code and prompts
-- **Proposed structure**:
-  ```
-  packages/tax-processing/
-    src/
-      bin/
-        run-agent-turn.ts          # Entry point (scaffolding)
-      lib/
-        agent-loader.ts            # NEW: Loads and composes agent prompts
-        tax-agent-library.ts       # NEW: Core tax agent logic
-      prompts/
-        agent.txt                  # NEW: Main agent behavior (natural language)
-        shared/
-          w2-parsing.txt          # NEW: W-2 document parsing instructions
-          form1040-schema.txt     # NEW: Form 1040 JSON format specification
-          tax-brackets-2024.txt   # NEW: Tax calculation rules and brackets
-  ```
-
-### 3. Create Shared Instruction Files
-
-- **Goal**: Extract reusable tax domain knowledge into separate files
-- **Tasks**:
-  - Create `prompts/shared/w2-parsing.txt`: Extract W-2 parsing instructions and CLI tool usage (currently lines 18-39 of tax-agent-prompt.ts)
-  - Create `prompts/shared/form1040-schema.txt`: Extract Form 1040 JSON template and field specifications (currently lines 177-243)
-  - Create `prompts/shared/tax-brackets-2024.txt`: Extract tax calculation workflow, brackets, deductions, credits (currently lines 57-175)
-  - Each file should be self-contained and readable as standalone documentation
-
-### 4. Create Main Agent Prompt File
-
-- **Goal**: Create the lightweight `agent.txt` with high-level instructions
-- **Tasks**:
-  - Create `prompts/agent.txt` with:
-    - Agent role and responsibilities (from current systemPrompt)
-    - Overall workflow and approach
-    - References to read the shared instruction files
-    - Output format requirements
-    - Important notes and edge cases (currently lines 332-349)
-  - Should be ~50-100 lines of natural language
-  - Focus on WHAT the agent should do, not HOW to parse formats or calculate taxes
-
-### 5. Implement Agent Loader Library
-
-- **Goal**: Create infrastructure to load and compose prompts
-- **Tasks**:
-  - Create `lib/agent-loader.ts` with:
-    - `loadPromptFile(path: string): string` - Reads prompt files from disk
-    - `composeAgentPrompt(agentFile: string, sharedFiles: string[]): string` - Combines prompts
-    - `buildTaskPrompt(context: TaskContext): string` - Adds task-specific context
-  - Handle relative paths correctly (prompts should be relative to package root)
-  - Include error handling for missing prompt files
-  - Add validation that required files exist
-
-### 6. Refactor Tax Agent Library
-
-- **Goal**: Move tax-specific logic into a library module
-- **Tasks**:
-  - Create `lib/tax-agent-library.ts` with:
-    - Constants for file paths (AGENT_PROMPT_PATH, SHARED_PROMPTS_PATH, etc.)
-    - `getTaxAgentSystemPrompt(): string` - Loads and composes all tax agent prompts
-    - `buildTaxAgentTaskPrompt(params): string` - Adds task-specific context (replaces current `buildAgentPrompt`)
-  - Keep the public API similar to current `buildAgentPrompt()` for minimal disruption
-
-### 7. Update Entry Point
-
-- **Goal**: Simplify `run-agent-turn.ts` to use the new library
-- **Tasks**:
-  - Remove hardcoded `systemPrompt` from `run-agent-turn.ts`
-  - Import and use `getTaxAgentSystemPrompt()` from tax-agent-library
-  - Import and use `buildTaxAgentTaskPrompt()` instead of `buildAgentPrompt()`
-  - The file should be minimal - just CLI arg parsing and orchestration
-
-### 8. Update Build and Packaging
-
-- **Goal**: Ensure prompt files are included in the build and deployable artifacts
-- **Tasks**:
-  - Update `tsconfig.json` to include `prompts/**/*.txt` files
-  - Update `package.json` files section to include prompt files in distribution
-  - Verify that prompt files are copied to `dist/` during build
-  - Test that prompts load correctly in both development and built/packaged scenarios
-
-### 9. Test the Refactored Agent
-
-- **Goal**: Verify functionality is preserved after refactoring
-- **Tasks**:
-  - Run local test with sample W-2: `pnpm test:agent` or similar
-  - Verify agent can load all prompt files successfully
-  - Verify output JSON format is unchanged
-  - Verify agent behavior is the same as before refactor
-  - Test with multiple sample W-2 files (text and PDF)
-  - Check that error messages are clear if prompt files are missing
-
-### 10. Update Step 2 Benchmark Integration
-
-- **Goal**: Ensure benchmarks work with the new agent structure
-- **Tasks**:
-  - Test scenario creation still works
-  - Test benchmark runs successfully with refactored agent
-  - Verify scoring still works correctly
-  - Document how users can provide custom `agent.txt` files for benchmarking
-  - Consider adding a `--agent-prompt` flag to allow specifying alternative agent.txt files
-
-### 11. Documentation Updates
-
-- **Goal**: Document the new architecture for users and developers
-- **Tasks**:
-  - Update `CLAUDE.md` with new architecture diagram and file structure
-  - Add section explaining how to customize agent behavior by editing `agent.txt`
-  - Document the shared instruction files and their purpose
-  - Add examples of customizing the agent for benchmarking
-  - Update `README.md` if needed
-  - Add comments in code explaining the prompt loading system
-
-### 12. Create Example Custom Agent
-
-- **Goal**: Demonstrate the flexibility of the new system
-- **Tasks**:
-  - Create `prompts/examples/agent-conservative.txt` - More careful, detailed calculations
-  - Create `prompts/examples/agent-fast.txt` - Optimized for speed
-  - Add a CLI flag or environment variable to select which agent prompt to use
-  - Document how benchmark users can test different agent behaviors
-
-## Success Criteria
-
-The refactor will be complete when:
-
-1. ✅ All prompt content is in `.txt` files, not hardcoded in TypeScript
-2. ✅ The agent successfully processes W-2 files with identical behavior to before
-3. ✅ Shared instruction files are reusable and well-documented
-4. ✅ Users can easily modify `agent.txt` to customize behavior
-5. ✅ Benchmarks run successfully with the refactored agent
-6. ✅ Build process correctly packages all prompt files
-7. ✅ Documentation clearly explains the new architecture
-
-## Testing Plan
-
-1. **Unit Tests**: Prompt loader functions work correctly
-2. **Integration Tests**: Agent processes sample W-2 files successfully
-3. **Benchmark Tests**: All 4 scenarios pass with expected scores
-4. **Customization Tests**: Can swap `agent.txt` and get different behavior
-5. **Deployment Tests**: Packaged agent works in Runloop devbox environment
diff --git a/OG-ARCHITECTURE.md b/OG-ARCHITECTURE.md
deleted file mode 100644
index 27b9d00..0000000
--- a/OG-ARCHITECTURE.md
+++ /dev/null
@@ -1,437 +0,0 @@
-## System Architecture
-
-This system consists of three main components working together:
-
-### 1. 🖥️ Frontend (Next.js) - User Interface Layer
-
-**Location**: `packages/frontend/`
-
-- **Port**: 3000
-- **Purpose**: Web-based tax preparation dashboard and AI chat interface
-- **Key Components**:
-  - `TaxDashboard`: Step-by-step tax filing workflow
-  - `DocumentUpload`: Drag-and-drop W-2 upload interface
-  - `TaxSummary`: Display tax calculations and refund/owed amounts
-  - `Chat`: Conversational AI interface for tax assistance
-
-### 2. 🤖 Agent Backend (Express + Codex SDK) - Intelligence Layer
-
-**Location**: `packages/tax-processing/`
-
-- **Port**: 3001
-- **Purpose**: Bridges deterministic tax logic with AI capabilities
-
-The agent backend has two distinct parts:
-
-#### Deterministic Components (Traditional Code)
-
-These are predictable, rule-based functions that always produce the same output for the same input:
-
-- **Document Parser** (`packages/tax-processing/src/services/document-parser.ts`)
-  - Extracts W-2 data using pattern matching (e.g., `44629.357631.62` = Box1+Box2)
-  - No AI involved - pure regex and text processing
-- **Tax Engine** (`services/tax-engine.ts`)
-  - Calculates taxes using IRS formulas and 2024 tax brackets
-  - Deterministic calculations: AGI, deductions, tax liability
-- **PDF Generator** (`services/pdf-generator.ts`)
-  - Fills official IRS Form 1040 PDFs using pdf-lib
-  - Field mapping and validation - no AI required
-
-#### AI-Powered Components (Codex SDK)
-
-These use OpenAI's language models for dynamic problem-solving:
-
-- **Codex Service** (`packages/tax-processing/src/services/codex.service.ts`)
-  - Conversational tax guidance and Q&A
-  - Problem-solving when documents fail to parse
-  - Tax planning advice and optimization strategies
-  - Uses tools and prompts to solve unstructured problems
-- **AI Capabilities**:
-  - Natural language understanding for tax questions
-  - Troubleshooting document processing issues
-  - Providing personalized tax advice
-  - Explaining complex tax concepts
-
-### 3. 📁 File System - Data Layer
-
-**Location**: `content/`
-
-- **input/**: Uploaded W-2 PDFs and tax documents
-- **output/**: Generated Form 1040 PDFs and summaries
-
-## How Components Interact
-
-```
-User → Frontend (Next.js) → Agent Backend
-                               ├── Deterministic Path:
-                               │   Upload W-2 → Parse → Calculate → Generate 1040
-                               │   (No AI needed for standard workflow)
-                               │
-                               └── AI-Assisted Path:
-                                   Chat → Codex SDK → LLM → Response
-                                   (For questions, troubleshooting, advice)
-```
-
-## Project Structure
-
-```
-├── content/
-│   ├── input/          # Tax documents (W-2s, 1099s, etc.)
-│   └── output/         # Generated tax forms and summaries
-├── packages/
-│   ├── agent/          # Tax processing server with Codex SDK
-│   │   ├── services/
-│   │   │   ├── document-parser.ts  # Deterministic: W-2 text extraction
-│   │   │   ├── tax-engine.ts       # Deterministic: Tax calculations
-│   │   │   ├── pdf-generator.ts    # Deterministic: Form 1040 filling
-│   │   │   ├── codex.service.ts    # AI-Powered: Chat & assistance
-│   │   │   └── tax-agent.ts        # Orchestrates both paths
-│   │   ├── routes/     # API endpoints
-│   │   │   ├── tax.routes.ts       # Deterministic tax operations
-│   │   │   └── codex.routes.ts     # AI chat endpoints
-│   │   ├── templates/  # Form 1040 data structures
-│   │   └── utils/      # Tax calculation utilities
-│   └── frontend/  # Next.js web application
-│       └── components/ # UI components
-```
-
-## Available Scripts
-
-### Root Commands
-
-- `pnpm dev` - Start all packages in development mode
-- `pnpm dev:agent` - Start only the agent server
-- `pnpm dev:frontend` - Start only the frontend
-- `pnpm build` - Build all packages
-- `pnpm lint` - Lint all packages
-- `pnpm lint:fix` - Lint and fix all packages
-- `pnpm format` - Format code with Prettier
-- `pnpm type-check` - Run TypeScript type checking
-- `pnpm clean` - Clean all build artifacts
-
-## When AI is Used vs Deterministic Code
-
-### Deterministic Operations (No AI)
-
-The following operations use traditional programming without any AI/LLM involvement:
-
-- ✅ **W-2 Parsing**: Pattern matching to extract box values from PDFs
-- ✅ **Tax Calculations**: IRS formulas for AGI, deductions, tax brackets
-- ✅ **Form Generation**: Filling PDF form fields with calculated values
-- ✅ **File Operations**: Reading/writing documents to disk
-- ✅ **API Endpoints**: REST API request/response handling
-
-### AI-Assisted Operations (Codex SDK + LLM)
-
-The AI is activated for these scenarios:
-
-- 🤖 **Tax Questions**: "What deductions can I claim?"
-- 🤖 **Troubleshooting**: "Why isn't my W-2 being recognized?"
-- 🤖 **Tax Planning**: "How can I reduce my tax liability next year?"
-- 🤖 **Document Issues**: Helping when parsing fails or data is missing
-- 🤖 **Explanations**: "Why do I owe taxes this year?"
-
-### Hybrid Operations
-
-Some features use both approaches:
-
-- **Document Processing**: Deterministic parsing first, AI assistance if it fails
-- **Error Recovery**: Deterministic validation, AI-powered error messages
-- **User Guidance**: Deterministic workflow steps, AI for answering questions
-
-## Architecture
-
-### Streaming Flow
-
-1. User sends message via frontend
-2. Frontend establishes SSE connection to agent
-3. Agent calls Codex SDK's `runStreamed()` method
-4. Events stream from Codex → Agent → Frontend
-5. Frontend updates UI in real-time
-
-### Technology Stack
-
-**Agent:**
-
-- Express 4.x
-- @openai/codex-sdk ^0.63.0
-- TypeScript (strict mode)
-- Server-Sent Events (SSE)
-
-**Frontend:**
-
-- Next.js 14+ (App Router)
-- React 18
-- Tailwind CSS
-- react-markdown + remark-gfm
-
-**Dev Tools:**
-
-- pnpm workspaces
-- ESLint + Prettier
-- GitHub Actions CI
-- TypeScript strict mode
-
-## Tax Agent Capabilities
-
-The tax preparation agent provides specialized functionality:
-
-### Document Processing
-
-- **Auto-detection**: Monitors `content/input/` for new tax documents
-- **W-2 Parsing**: Extracts employer info, wages, and tax withholdings
-- **Data Validation**: Verifies extracted information for accuracy
-
-### Tax Calculations
-
-- **Form 1040 Generation**: Calculates AGI, taxable income, and tax liability
-- **Refund/Amount Owed**: Determines if you'll receive a refund or owe taxes
-- **Tax Planning**: Provides optimization suggestions and next-year planning
-
-### Form Generation
-
-- **PDF Output**: Creates completed Form 1040 in official IRS format
-- **Tax Summary**: Generates easy-to-read summary of your tax situation
-- **Filing Instructions**: Provides guidance on next steps for filing
-
-### AI Assistant
-
-The system includes a conversational AI that can:
-
-- Answer tax questions and explain calculations
-- Help troubleshoot document processing issues
-- Provide tax planning advice and strategies
-- Guide you through the filing process
-
-**System Prompt**: Configured for tax expertise - can be modified in `packages/tax-processing/src/routes/codex.routes.ts`.
-
-## Tax API Endpoints
-
-The agent exposes specialized tax processing endpoints:
-
-### Document Processing
-
-- **POST `/tax/process-documents`**: Process uploaded documents in `content/input/`
-- **POST `/tax/upload`**: Manual document upload interface
-- **GET `/tax/w2-summary`**: Get summary of processed W-2 forms
-
-### Tax Calculations
-
-- **POST `/tax/calculate`**: Calculate taxes with taxpayer information
-- **GET `/tax/tax-summary`**: Get tax calculation results
-
-### Form Generation
-
-- **POST `/tax/generate-forms`**: Generate Form 1040 and tax summary PDFs
-- **GET `/tax/download/:filename`**: Download generated forms from `content/output/`
-
-### Status & Monitoring
-
-- **GET `/tax/status`**: Get current processing status and stage
-
-## System Permissions
-
-The Codex agent is configured with the following permissions:
-
-- **File Operations**: Read/write access to `content/` directories for tax documents
-- **Network Access**: IRS API access for tax rates and form updates
-- **PDF Generation**: Fill official IRS Form 1040 using pdf-lib
-- **Document Parsing**: Extract data from uploaded tax documents
-
-Configuration files:
-
-- `~/.codex/config.toml` (created by `setup-codex-config.sh`)
-- Environment variables in `packages/tax-processing/src/services/codex.service.ts`
-
-## Features
-
-### Tax Dashboard
-
-- ✅ Step-by-step tax preparation workflow
-- ✅ Document upload with drag-and-drop
-- ✅ Real-time processing status updates
-- ✅ Tax calculation summary display
-- ✅ Form generation and download
-- ✅ Tax planning recommendations
-
-### AI Tax Assistant
-
-- ✅ Conversational tax guidance
-- ✅ Real-time streaming responses
-- ✅ Tax-specific knowledge and calculations
-- ✅ Document processing troubleshooting
-- ✅ Filing guidance and next steps
-
-### Tax Processing Engine
-
-- ✅ Automatic W-2 document detection
-- ✅ Generic W-2 parser with concatenated pattern extraction
-- ✅ Data extraction and validation
-- ✅ Tax calculation engine with 2024 rates
-- ✅ Official IRS Form 1040 PDF filling with pdf-lib
-- ✅ Tax planning and optimization advice
-- ✅ Multi-employer W-2 support
-
-### Backend Services
-
-- ✅ OpenAI Codex SDK integration
-- ✅ Tax-specific API endpoints
-- ✅ File monitoring for automatic processing
-- ✅ PDF form filling with pdf-lib
-- ✅ Document parsing with pdf-parse and validation
-- ✅ Real-time status tracking
-
-## Continuous Integration
-
-This project includes a GitHub Actions workflow that runs on every pull request and push to main:
-
-- ✅ ESLint - Code linting
-- ✅ Prettier - Code formatting check
-- ✅ TypeScript - Type checking
-- ✅ Build - Verify all packages build successfully
-
-The workflow ensures code quality and prevents broken code from being merged.
-
-## Cloud Mode (Runloop)
-
-Run agents on remote Runloop devboxes instead of locally:
-
-### Setup
-
-1. **Get Runloop API Key**: Sign up at [runloop.ai](https://runloop.ai)
-
-2. **Set environment variables**:
-
-   ```bash
-   export RUN_MODE=cloud
-   export RUNLOOP_API_KEY=your-api-key-here
-   export GITHUB_TOKEN=your-github-token  # For private repos
-   ```
-
-3. **Start frontend**:
-
-   ```bash
-   pnpm dev:frontend
-   ```
-
-   Agents will be created on Runloop devboxes when you start new threads.
-
-### How It Works
-
-- **Local Mode** (default): Spawns agent processes on your machine
-- **Cloud Mode**: Creates Runloop devboxes with your repo, runs agents there
-- Each thread gets its own isolated devbox
-- Tunnels provide HTTPS access to cloud agents
-
-### Deploy Agent to Runloop
-
-This project includes a GitHub Action for deploying the agent as a persistent service on Runloop.
-
-1. **Configure GitHub Secret**:
-
-   Go to your repository Settings → Secrets → Actions and add:
-   - `RUNLOOP_API_KEY` - Your Runloop API key
-
-2. **Trigger Deployment**:
-
-   Go to Actions → Deploy to Runloop → Run workflow
-
-   The workflow will:
-   - Build the agent package
-   - Create a deployment package
-   - Deploy to Runloop as a named agent
-
-3. **Access Deployed Agent**:
-
-   After deployment, the agent will be available at Runloop and can be referenced by name in your applications.
-
-## Development
-
-### Adding New Features
-
-1. **Agent**: Add tools or modify behavior in `packages/tax-processing/src/services/codex.service.ts`
-2. **Frontend**: Add components in `packages/frontend/src/components/`
-
-### Environment Variables
-
-**Agent (`packages/tax-processing/.env`):**
-
-```env
-PORT=3001
-FRONTEND_URL=http://localhost:3000
-OPENAI_API_KEY=your-api-key-here
-```
-
-**Frontend (`packages/frontend/.env.local`):**
-
-```env
-NEXT_PUBLIC_AGENT_URL=http://localhost:3001
-
-# Cloud mode (optional)
-RUN_MODE=local                    # or "cloud" for Runloop
-RUNLOOP_API_KEY=your-key-here     # Required for cloud mode
-RUNLOOP_BASE_URL=https://api.runloop.ai  # Optional
-GITHUB_TOKEN=your-token-here      # Required for private repos in cloud mode
-```
-
-## Troubleshooting
-
-### CORS Errors
-
-Ensure `FRONTEND_URL` in the agent `.env` matches your Next.js dev server URL.
-
-### Connection Issues
-
-1. Check both servers are running (`pnpm dev`)
-2. Verify agent is accessible at http://localhost:3001/health
-3. Check browser console for errors
-
-### Streaming Not Working
-
-- Ensure your browser supports Server-Sent Events (all modern browsers do)
-- Check network tab to see if SSE connection is established
-- Verify no proxy/firewall is blocking the connection
-
-## Tax Preparation Usage
-
-### Example Workflow
-
-1. **Upload W-2**: Place `w2-2024-employer.pdf` in `content/input/`
-2. **Auto-Processing**: Agent extracts employer info and wage data
-3. **Review**: Verify extracted information in Tax Dashboard
-4. **Personal Info**: Provide name, SSN, address, filing status
-5. **Calculate**: Agent computes AGI, taxable income, refund/owed
-6. **Generate**: Download completed Form 1040 from `content/output/`
-
-### Supported Tax Documents (2024)
-
-- ✅ **W-2**: Wage and Tax Statement
-- 🔄 **1099-MISC**: Miscellaneous Income (coming soon)
-- 🔄 **1099-INT**: Interest Income (coming soon)
-- 🔄 **1099-DIV**: Dividend Income (coming soon)
-
-### Sample Files
-
-The repository includes sample tax documents for testing:
-
-- `content/input/sample_w2_2024.txt` - Sample W-2 for testing
-
-## Future Enhancements
-
-### Tax Features
-
-- [ ] Multi-year tax return comparisons
-- [ ] State tax return generation
-- [ ] Itemized deduction support
-- [ ] Tax planning scenarios and projections
-- [ ] Integration with tax software APIs
-- [ ] E-filing capabilities
-
-### Technical Improvements
-
-- [ ] Authentication and user accounts
-- [ ] Persistent storage for tax data
-- [ ] Advanced PDF parsing
-- [ ] Audit trail and compliance logging
-- [ ] Production deployment configuration
-- [ ] Multi-tenant support
diff --git a/README.md b/README.md
index 796d90e..804eb45 100644
--- a/README.md
+++ b/README.md
@@ -4,6 +4,10 @@ Use agentic AI on Runloop to create a (fake) tax preparation service.
 
 **⚠️ IMPORTANT: This is a demonstration system only. Do not use for actual tax preparation or filing. Do not upload real personal or tax information. ⚠️**
 
+## Docs
+
+Full project docs live in [`docs/README.md`](./docs/README.md).
+
 ## Overview
 
 This demo illustrates the power of Agentic AI on the Runloop platform
@@ -133,171 +137,6 @@ best. We can perform quick experiments to measure performance after making chang
    - eg: runloop dashboard w/ created artifacts
    - eg: benchmark scenario runs
 
-## Detailed Development Steps
-
-### Step 0: Manual Tax Prep
-
-Our demo begins with a simple tax preparation website, which you can
-access at http://localhost:3000/step0. This site mimics a basic tax
-preparation workflow:
-
-- A client uses the site to upload their tax information
-- The server saves the tax information locally
-- A tax preparer goes through the client's information and
-  calculates values that go into the client's form 1040
-- The tax preparer sends the form 1040 values through a program which
-  creates a formatted 1040.pdf.
-
-When you visit the [step 0](http://localhost:3000/step0) landing page, you see two options:
-
-1. If you click the "client" option, you are taken to a page which
-   allows you to upload fake W2 information (or select from the
-   included examples).
-2. When you click the "tax preparer" option, you are given a chance to
-   upload form 1040 values which are inserted into a form 1040.pdf file.
-
-### Step 1: The Codex Tax Agent
-
-We want to use a Codex-based agent to replace the manual conversion of
-the client's tax information into form 1040 values. For this step, we
-will utilize a simple tax prep agent running on a Runloop devbox. To
-set up the Runloop environment, run `pnpm step1_runloop_setup`. Under the hood, this
-command does the following:
-
-1. Uploads the demo Agent as a Runloop object
-2. Creates the Agent from the uploaded object
-3. Creates a Runloop Blueprint with the agent mounted and required
-   packages installed. Creating a blueprint ensures that we can launch Devboxes using
-   this agent quickly
-
-After running the setup script, restart the service with `pnpm dev` and visit [step 1](http://localhost:3000/step1) to see this in action.
-
-After running the script, the user flow is as follows:
-
-- A client wanting to file their taxes uses the site to upload their tax information
-- After hitting submit, the server starts a Devbox using the Blueprint for this agent.
-- The server uploads the tax info to the Devbox and uses an exec
-  command to invoke the agent and produce the 1040 json values.
-- The server takes the form 1040 values output by the agent and
-  prepares the 1040.pdf.
-
-Note that the process of generating the 1040 values from input is now completely agent-driven and takes place on demand.
-
-**Key Code Snippets:**
-Step 1 largely reuses code from our original implementation: reading PDFs, parsing and rendering are untouched.
-
-- **API endpoint**:
-  You can walk through the streaming API route that orchestrates the entire processing flow:
-
-  ```84:238:packages/frontend/src/app/api/tax/process-step1-stream/route.ts
-  // Handles file upload, creates devbox, runs agent, generates PDF
-  // Returns Server-Sent Events for real-time progress updates
-  ```
-
-- **Devbox creation and agent execution**:
-  The API endpoint creates an instance of `TaxService` then calls `processTaxReturn`. In turn, this spins up the Runloop devbox, uploads files, and executes the agent:
-
-  ```70:178:packages/frontend/src/lib/tax-processing-service.ts
-  // Creates devbox from blueprint, uploads W-2 file and agent prompt,
-  // executes the agent via execAsync, and retrieves the JSON result
-  ```
-
-  Starting a devbox with the agent and wiring in our OpenAI secret is handled here:
-
-  ```ts
-  this.devbox = await this.runloop.devbox.create({
-    name: `tax-processing-${Date.now()}`,
-    // blueprint created by step1_runloop_setup.ts script
-    blueprint_id: blueprintId,
-    environment_variables: {
-      CODEX_SKIP_GIT_REPO_CHECK: 'true',
-      RUNLOOP_DEVBOX: '1',
-    },
-    // wire in the OpenAI key from the Runloop secret store
-    secrets: { OPENAI_API_KEY: 'OPENAI_API_KEY' },
-  });
-  ```
-
-  After the devbox has been started, we load tax processing instructions as a prompt to the agent here:
-
-  ```ts
-  await this.devbox.file.write({
-    file_path: '/home/user/agent-prompt.txt',
-    contents: agentPromptContent,
-  });
-  ```
-
-  Then the specific user's W2 tax information:
-
-  ```ts
-  // Upload W2 file (use upload() for binary files like PDFs)
-  logger.log(`Uploading ${w2Filename} to devbox...`);
-  await this.devbox.file.upload({
-    path: `/home/user/input/${w2Filename}`,
-    file: w2File,
-  });
-  ```
-
-- **Agent execution script**:
-  After setting up the environment, we invoke the agent using a standalone script that runs on the devbox to process the W-2 and generate Form 1040 JSON:
-
-  ```11:97:packages/tax-processing/src/bin/run-agent-turn.ts
-  // Runs a single agent turn using CodexService to process W-2
-  // and write Form 1040 JSON output to the specified file
-  ```
-
-  Here the script uses a prompt to define the role and instruct the LLM to return output conforming to well-defined JSON schemas. The prompt and the W2 information from the user are used to repeatedly call Codex and stream the output. This is the core agent processing loop.
-
-  Rather than have the LLM perform calculations directly, we instead use the agent to process individual line items and return the results as JSON. This lets us use LLMs to do what they're best at while leveraging traditional code to perform the actual math and generate a PDF.
-
-  Importantly, since Runloop provides a secure isolated environment, Codex is allowed to run
-  with broad permissions: the burden of knowing what commands are safe to run in the execution environment is solved:
-
-  ```ts
-  // RUNLOOP_DEVBOX is set as env var during devbox startup
-  this.sandboxMode =
-    process.env.RUNLOOP_DEVBOX === '1'
-      ? 'danger-full-access'
-      : 'workspace-write';
-  ```
-
-- **PDF generation**:
-  After processing the input, the final step is to generate the 1040 PDF form:
-  ```29:50:packages/frontend/src/lib/pdf-generator.ts
-  // Loads IRS Form 1040 template, fills form fields with agent output,
-  // and saves the completed PDF to the output directory
-  ```
-  This code is the same as in Step 0.
-
-### Step 2: Testing Tax Agent Performance
-
-- create benchmark scenarios
-- create the testing image
-- upload scenarios
-- run scenarios & collect results
-- show the results in some form
-
-- scenarios:
-  - N fake W2s with desired 1040 outputs
-  - maybe augment with text Q&A script
-
-- TS script to create benchmark stuff
-  - create benchmark
-  - create blueprint with desired data (via API calls, Docker build, etc)
-  - create scenarios
-  - run the benchmark w/ initial agent to get a score
-
-### Step 3: Comparing Tax Agents
-
-- take several agent implementations
-- use agent API to add them to your account
-- run benchmarks against each one
-
-### Step \inf: Tax Agent Smith
-
-- self-improving tax agent
-- wanted to take over the world, but satisfied with filing Neo's taxes
-
 ## License
 
 MIT
diff --git a/TODO b/TODO
deleted file mode 100644
index dd94e30..0000000
--- a/TODO
+++ /dev/null
@@ -1,56 +0,0 @@
--*-mode:org-*-
-
-* blueprint creation issues
-** TS API call needs a different timeout; dies after a while, before BP creation has completed?
-** error message when things die is crappy
-** Blueprint building takes longer than it should.  (3-4 min for a simple addition to Ubuntu base?)
-** when creating, BP page shows 'provisioning' for a long time (on dev at least)
-** Blueprint build w/ dockerfile FROM runloop::runloop/starter-x86_64
-*** still takes a long time
-*** build installs a bunch of basic stuff; shouldn't that already be pre-built?
-**** maybe using the raw Dockerfile rather than the previously-built image?
-**** maybe faster on prod?
-*** weird issue w/ different python versions competing: 3.11 vs 3.12
-** We should have an easy-to-find list of public blueprints
-*** minimal ubuntu
-*** fairly complete ubuntu (eg, has recent python, node, glibc, gcc, etc)
-** debugging flow is terrible
-*** current order: create object, create agent, create blueprint
-*** problem with any requires full rebuild of the others
-*** blueprint creation is slow, so really irritating to set up
-*** any error causes full failure; no way to have a devbox to poke around
-
-* step1 setup script:
-** blueprint polling logic is wrong!!
-
-* general plan
-** get step1 working on devbox
-** refactor to move crap out of the agent; simplify 
-
-* devbox creation: should have the API key somwhere not in the env?
-
-
-* stuff to fix
-** create a setup_runloop.ts script
-*** setup blueprint; 
-*** create benchmarks
-*** use tax_demo_ or something as prefix for Runloop artifacts created for this demo
-** duplicate agent def in 
-packages/tax-processing/src/services/codex.service.ts
-packages/tax-processing/src/routes/codex.routes.ts
-packages/tax-processing/src/server.ts
-
-** codex configuration
-*** remove from packages/tax-processing/codex_config.toml; set values in agent setup instead
-
-* weird stuff; maybe don't fix?
-** packages/tax-processing/src: unnecessarily separate CLI from bin dir?
-**
-
-
-
-* questions
-** why cors.middleware.ts?
-** where is the .next directory defined?  
-** what does 'pnpm dev' do?
-** 
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..b363775
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,27 @@
+# Documentation
+
+← Back to the repo README: [`README.md`](../README.md)
+
+This directory contains consolidated docs for the demo project.
+
+## Detailed Development Steps
+
+- **Step 0 (manual flow)**: Upload a W-2, manually prepare the Form 1040 values, and generate a PDF. See [`docs/step0/README.md`](./step0/README.md).
+- **Step 1 (agent-powered flow)**: Use a Codex-based agent running in a Runloop Devbox to generate Form 1040 JSON, then render the PDF. See [`docs/step1/README.md`](./step1/README.md).
+- **Step 2 (benchmarks)**: Set up and run Runloop benchmarks to measure agent performance across scenarios. See [`docs/step2/README.md`](./step2/README.md).
+- **Step 3 (prompt iteration)**: Benchmark multiple prompt versions across standardized scenarios using the harness. See [`docs/step3/README.md`](./step3/README.md).
+
+## Guides
+
+- [Benchmark harness guide](./benchmark/README.md)
+- [Step 0: Manual Tax Prep](./step0/README.md)
+- [Step 1: The Codex Tax Agent](./step1/README.md)
+- [Step 2: Tax Agent Benchmarking](./step2/README.md)
+- [Step 3: Benchmarking Agent Prompt Versions](./step3/README.md)
+
+## Package docs
+
+- [frontend](./packages/frontend/README.md)
+- [tax-processing](./packages/tax-processing/README.md)
+
+
diff --git a/README-benchmark.md b/docs/benchmark/README.md
similarity index 100%
rename from README-benchmark.md
rename to docs/benchmark/README.md
diff --git a/docs/packages/frontend/README.md b/docs/packages/frontend/README.md
new file mode 120000
index 0000000..e3c5dea
--- /dev/null
+++ b/docs/packages/frontend/README.md
@@ -0,0 +1 @@
+../../../packages/frontend/README.md
\ No newline at end of file
diff --git a/docs/packages/tax-processing/README.md b/docs/packages/tax-processing/README.md
new file mode 120000
index 0000000..3c52e01
--- /dev/null
+++ b/docs/packages/tax-processing/README.md
@@ -0,0 +1 @@
+../../../packages/tax-processing/README.md
\ No newline at end of file
diff --git a/docs/step0/README.md b/docs/step0/README.md
new file mode 100644
index 0000000..1d6f4f2
--- /dev/null
+++ b/docs/step0/README.md
@@ -0,0 +1,19 @@
+# Step 0: Manual Tax Prep
+
+← Back to docs index: [`docs/README.md`](../README.md)
+
+Our demo begins with a simple tax preparation website, which you can access at `http://localhost:3000/step0`.
+
+This site mimics a basic tax preparation workflow:
+
+- A client uses the site to upload their tax information
+- The server saves the tax information locally
+- A tax preparer goes through the client's information and calculates values that go into the client's form 1040
+- The tax preparer sends the form 1040 values through a program which creates a formatted 1040.pdf.
+
+When you visit the [step 0](http://localhost:3000/step0) landing page, you see two options:
+
+1. If you click the "client" option, you are taken to a page which allows you to upload fake W2 information (or select from the included examples).
+2. When you click the "tax preparer" option, you are given a chance to upload form 1040 values which are inserted into a form 1040.pdf file.
+
+
diff --git a/docs/step1/README.md b/docs/step1/README.md
new file mode 100644
index 0000000..5786ba1
--- /dev/null
+++ b/docs/step1/README.md
@@ -0,0 +1,112 @@
+# Step 1: The Codex Tax Agent
+
+← Back to docs index: [`docs/README.md`](../README.md)
+
+We want to use a Codex-based agent to replace the manual conversion of the client's tax information into form 1040 values. For this step, we will utilize a simple tax prep agent running on a Runloop devbox.
+
+To set up the Runloop environment, run `pnpm step1_runloop_setup`. Under the hood, this command does the following:
+
+1. Uploads the demo Agent as a Runloop object
+2. Creates the Agent from the uploaded object
+3. Creates a Runloop Blueprint with the agent mounted and required packages installed. Creating a Blueprint ensures that we can quickly launch Devboxes that use this agent.
+
+After running the setup script, restart the service with `pnpm dev` and visit [step 1](http://localhost:3000/step1) to see this in action.
+
+After running the script, the user flow is as follows:
+
+- A client wanting to file their taxes uses the site to upload their tax information
+- After hitting submit, the server starts a Devbox using the Blueprint for this agent.
+- The server uploads the tax info to the Devbox and uses an exec command to invoke the agent and produce the Form 1040 JSON values.
+- The server takes the form 1040 values output by the agent and prepares the 1040.pdf.
+
+Note that the process of generating the 1040 values from input is now completely agent-driven and takes place on demand.
+
+## Key Code Snippets
+
+Step 1 largely reuses code from our original implementation: reading PDFs, parsing and rendering are untouched.
+
+- **API endpoint**:
+  You can walk through the streaming API route that orchestrates the entire processing flow:
+
+  ```84:238:packages/frontend/src/app/api/tax/process-step1-stream/route.ts
+  // Handles file upload, creates devbox, runs agent, generates PDF
+  // Returns Server-Sent Events for real-time progress updates
+  ```
+
+- **Devbox creation and agent execution**:
+  The API endpoint creates an instance of `TaxService` then calls `processTaxReturn`. In turn, this spins up the Runloop devbox, uploads files, and executes the agent:
+
+  ```70:178:packages/frontend/src/lib/tax-processing-service.ts
+  // Creates devbox from blueprint, uploads W-2 file and agent prompt,
+  // executes the agent via execAsync, and retrieves the JSON result
+  ```
+
+  Starting a devbox with the agent and wiring in our OpenAI secret is handled here:
+
+  ```ts
+  this.devbox = await this.runloop.devbox.create({
+    name: `tax-processing-${Date.now()}`,
+    // blueprint created by step1_runloop_setup.ts script
+    blueprint_id: blueprintId,
+    environment_variables: {
+      CODEX_SKIP_GIT_REPO_CHECK: 'true',
+      RUNLOOP_DEVBOX: '1',
+    },
+    // wire in the OpenAI key from the Runloop secret store
+    secrets: { OPENAI_API_KEY: 'OPENAI_API_KEY' },
+  });
+  ```
+
+  After the devbox has been started, we load tax processing instructions as a prompt to the agent here:
+
+  ```ts
+  await this.devbox.file.write({
+    file_path: '/home/user/agent-prompt.txt',
+    contents: agentPromptContent,
+  });
+  ```
+
+  Then we upload the user's W-2 tax information:
+
+  ```ts
+  // Upload W2 file (use upload() for binary files like PDFs)
+  logger.log(`Uploading ${w2Filename} to devbox...`);
+  await this.devbox.file.upload({
+    path: `/home/user/input/${w2Filename}`,
+    file: w2File,
+  });
+  ```
+
+- **Agent execution script**:
+  After setting up the environment, we invoke the agent using a standalone script that runs on the devbox to process the W-2 and generate Form 1040 JSON:
+
+  ```11:97:packages/tax-processing/src/bin/run-agent-turn.ts
+  // Runs a single agent turn using CodexService to process W-2
+  // and write Form 1040 JSON output to the specified file
+  ```
+
+  Here the script uses a prompt to define the agent's role and instruct the LLM to return output conforming to well-defined JSON schemas. The prompt and the user's W-2 information are used to repeatedly call Codex and stream the output. This is the core agent processing loop.
+
+  Rather than have the LLM perform calculations directly, we instead use the agent to process individual line items and return the results as JSON. This lets us use LLMs to do what they're best at while leveraging traditional code to perform the actual math and generate a PDF.
+
+  Importantly, because Runloop provides a secure, isolated environment, Codex is allowed to run with broad permissions. We no longer need to decide in advance which commands are safe to run in the execution environment:
+
+  ```ts
+  // RUNLOOP_DEVBOX is set as env var during devbox startup
+  this.sandboxMode =
+    process.env.RUNLOOP_DEVBOX === '1'
+      ? 'danger-full-access'
+      : 'workspace-write';
+  ```
+
+- **PDF generation**:
+  After processing the input, the final step is to generate the 1040 PDF form:
+
+  ```29:50:packages/frontend/src/lib/pdf-generator.ts
+  // Loads IRS Form 1040 template, fills form fields with agent output,
+  // and saves the completed PDF to the output directory
+  ```
+
+  This code is the same as in Step 0.
+
+
diff --git a/STEP2-README.md b/docs/step2/README.md
similarity index 100%
rename from STEP2-README.md
rename to docs/step2/README.md
diff --git a/STEP3-README.md b/docs/step3/README.md
similarity index 100%
rename from STEP3-README.md
rename to docs/step3/README.md

From 713a39950f235e6af76614490356d0a8069224a2 Mon Sep 17 00:00:00 2001
From: James Chainey <james@runloop.ai>
Date: Wed, 7 Jan 2026 12:15:30 -0800
Subject: [PATCH 2/2] now with formatting!

---
 docs/README.md       | 2 --
 docs/step0/README.md | 2 --
 docs/step1/README.md | 2 --
 3 files changed, 6 deletions(-)

diff --git a/docs/README.md b/docs/README.md
index b363775..ad2f303 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -23,5 +23,3 @@ This directory contains consolidated docs for the demo project.
 
 - [frontend](./packages/frontend/README.md)
 - [tax-processing](./packages/tax-processing/README.md)
-
-
diff --git a/docs/step0/README.md b/docs/step0/README.md
index 1d6f4f2..fcf5123 100644
--- a/docs/step0/README.md
+++ b/docs/step0/README.md
@@ -15,5 +15,3 @@ When you visit the [step 0](http://localhost:3000/step0) landing page, you see t
 
 1. If you click the "client" option, you are taken to a page which allows you to upload fake W2 information (or select from the included examples).
 2. When you click the "tax preparer" option, you are given a chance to upload form 1040 values which are inserted into a form 1040.pdf file.
-
-
diff --git a/docs/step1/README.md b/docs/step1/README.md
index 5786ba1..20ea915 100644
--- a/docs/step1/README.md
+++ b/docs/step1/README.md
@@ -108,5 +108,3 @@ Step 1 largely reuses code from our original implementation: reading PDFs, parsi
   ```
 
   This code is the same as in Step 0.
-
-