20 commits
9132b9e
Introduce Component-System for component optimization, add a dashboar…
Mar 10, 2026
cb5aa72
Introduce Component-System for component optimization, add a dashboar…
Mar 10, 2026
fa8f3b6
Introduce Component-System for component optimization, add a dashboar…
Mar 10, 2026
49568b1
Merge dev into master (accepting all dev changes)
Mar 12, 2026
636b25e
Merge pull request #2 from karpathy/master
LaurenceLong Mar 12, 2026
2b2b45d
P stage: Add gradient clipping for training stability
Mar 11, 2026
0ae358d
Add optional gradient clipping and improve Ralph worktree restoration
Mar 12, 2026
85065a4
Commit uncommitted changes
Mar 12, 2026
506e430
Merge __baseline__ into current branch
Mar 12, 2026
4370961
Replace ReLU² with SwiGLU activation in MLP
Mar 12, 2026
ad41c96
Update component-system: extend DCA timeout to 900s, add baseline-fir…
Mar 12, 2026
4655486
Merge __baseline__ into master
Mar 12, 2026
f866a76
refactor(component-system): improve PDCA workflow, metrics tracking, …
Mar 12, 2026
b409165
Merge branch 'master' of https://github.com/LaurenceLong/autoresearch
Mar 12, 2026
1b0ec7f
Add arxiv dependency
Mar 12, 2026
0e1e85a
Merge branch 'master' of https://github.com/LaurenceLong/autoresearch
Mar 12, 2026
73c75af
Merge origin/master - accept all origin changes
Mar 12, 2026
e4f4895
Add local changes to model, trainer, and clean_history
Mar 12, 2026
63e87c5
Remove .ipynb_checkpoints directories from git tracking
Mar 12, 2026
e0e0e47
feat: improve worktree handling and merge resolution
Mar 12, 2026
11 changes: 11 additions & 0 deletions .gitignore
@@ -1,6 +1,7 @@
# Python-generated files
__pycache__/
*.py[oc]
.pytest_cache/
build/
dist/
wheels/
@@ -16,8 +17,18 @@ queue/
CLAUDE.md
AGENTS.md

# Prompt audit output (generated by tests)
component_system/prompt_audit/

# Experimental code/artifacts
dev/

# Results file
results.tsv

# Component-system runtime artifacts (logs, queue, state, worktrees under history/)
component_system/history/
component_system/baseline_branches.json
component_system/baseline_metrics.json
*.log
.ipynb_checkpoints/
2 changes: 2 additions & 0 deletions README.md
@@ -49,6 +49,8 @@ Hi have a look at program.md and let's kick off a new experiment! let's do the s

The `program.md` file is essentially a super lightweight "skill".

For the component-system workflow, see `component_system/README.md`.

## Project structure

```
75 changes: 75 additions & 0 deletions component_system/PDCA-DO-CHECK-ACTION.md
@@ -0,0 +1,75 @@
# DCA — Do, Check, Action

## Responsibility
Take the generated plan from P, adapt/fix it in the seed worktree,
run the canonical training entrypoint, evaluate results against baseline, and
promote only when the signal is positive. Do not propose new ideas or optimize for better metrics; only adapt/fix so the plan runs and report outcomes.

## Workspace and paths
**CWD = seed worktree.** Read and edit only inside it; use relative paths only. Treat `component_system/` in the worktree as canonical context.

## Input
- Runner prompt (task content).
- Baseline: `component_system/baseline_branches.json`, `component_system/baseline_metrics.json`.
- Worktree-local files only.

## Baseline measurement (seed_id `__baseline__`)
Retry until the run succeeds and you report real metrics. No empty metrics.

- **OOM:** Reduce `device_batch_size` in `component_system/components/trainer.py` (default 128); keep `total_batch_size % (device_batch_size * sequence_length) == 0`. Rerun until training completes.
- Only trivial fixes (e.g. batch size); no model/training logic changes.
- **Commit before reporting.** Uncommitted changes break the follow-up merge.
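When hunting for a smaller `device_batch_size` after an OOM, the divisibility constraint can be checked mechanically. A minimal sketch (the helper name and example numbers are illustrative, not from the repo):

```python
def valid_device_batch_sizes(total_batch_size, sequence_length, max_device_bs=128):
    """Enumerate device batch sizes (largest first) that keep
    total_batch_size % (device_batch_size * sequence_length) == 0."""
    return [bs for bs in range(max_device_bs, 0, -1)
            if total_batch_size % (bs * sequence_length) == 0]

# e.g. a 2**19-token total batch at sequence length 1024:
print(valid_device_batch_sizes(2**19, 1024, 8))  # → [8, 4, 2, 1]
```

Pick the largest size that fits in memory; halving repeatedly always preserves the constraint when the original configuration satisfied it.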

## Workflow
1. Work in the seed worktree (one branch per seed).
2. Adapt/fix until it runs (runtime only: bugs, OOM, imports, config; no model/hyperparameter/training-logic changes for better metrics).
3. Run the canonical command (**≥900s**): `timeout 900 uv run --active component_system/entrypoint.py > training.log 2>&1` (or `... 2>&1 | tee training.log` to also see output). **The command/tool timeout must be set to ≥900s.** After the run, inspect `training.log` to confirm completion and recover or verify metrics.
4. On bug/OOM: fix and rerun; for baseline, retry until success.
5. Commit on seed branch before reporting.
6. Print DCA summary block with `commit_sha` in JSON.
7. Runner evaluates signal and handles promotion.

## Output Format
Print the summary block. Put metrics in JSON; runner falls back to stdout/stderr parsing if missing.

```text
AUTORESEARCH_DCA_SUMMARY_BEGIN
{"checks":["entrypoint"],"notes":"...","completed_at":"YYYY-MM-DD HH:MM:SS","commit_sha":"git sha","metrics":{"val_bpb":1.24,...}}
AUTORESEARCH_DCA_SUMMARY_END
```

If no final metrics, use `"metrics": {}`. Runner extracts from stdout/stderr: `val_bpb`, `training_seconds`, `total_seconds`, `peak_vram_mb`, `mfu_percent`, `total_tokens_M`, `num_steps`, `num_params_M`, `depth`. No metrics → recovery DCA inspects logs; only then treat as failed.
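The sentinel-delimited block is easy to parse mechanically; a sketch of what runner-side extraction could look like (an illustration, not the actual runner code):

```python
import json
import re

DCA_RE = re.compile(
    r"AUTORESEARCH_DCA_SUMMARY_BEGIN\s*(\{.*\})\s*AUTORESEARCH_DCA_SUMMARY_END",
    re.DOTALL,
)

def extract_dca_summary(text):
    """Return the JSON payload between the sentinels, or None if absent."""
    m = DCA_RE.search(text)
    return json.loads(m.group(1)) if m else None

out = """...training log tail...
AUTORESEARCH_DCA_SUMMARY_BEGIN
{"checks": ["entrypoint"], "commit_sha": "abc123", "metrics": {"val_bpb": 1.24}}
AUTORESEARCH_DCA_SUMMARY_END"""
print(extract_dca_summary(out)["metrics"]["val_bpb"])  # → 1.24
```

If this returns `None` or an empty `metrics`, the runner would fall back to scraping stdout/stderr for the named metrics listed above.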

## Check: Signal Rules

| Condition | Signal |
|-----------|--------|
| `val_bpb` drops >= 0.001 vs baseline | `positive_signal` |
| `val_bpb` rises >= 0.001 vs baseline | `negative_signal` |
| difference < 0.001 | `neutral` |
| no historical baseline (best_val_bpb) | `positive_signal` (first recording) |
| metrics missing or training error | `error` |

The threshold is defined in `component_system/config.py` (`PROMOTION_THRESHOLD`).
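The table maps to a small classification function; a sketch assuming `PROMOTION_THRESHOLD = 0.001` (the authoritative value lives in `component_system/config.py`):

```python
PROMOTION_THRESHOLD = 0.001  # assumed to mirror component_system/config.py

def check_signal(val_bpb, baseline_val_bpb, threshold=PROMOTION_THRESHOLD):
    """Classify a run against baseline per the signal table (lower bpb is better)."""
    if val_bpb is None:
        return "error"                 # metrics missing or training error
    if baseline_val_bpb is None:
        return "positive_signal"       # no historical baseline: first recording
    diff = baseline_val_bpb - val_bpb  # positive when the run improved
    if diff >= threshold:
        return "positive_signal"
    if diff <= -threshold:
        return "negative_signal"
    return "neutral"
```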

## Action: Promotion Rules

Only DCA may trigger a merge into baseline; P must not. The runner records `commit_sha`; on a positive signal the workflow merges the seed into baseline first, then updates metrics/state. On a merge conflict, the system queues a merge-resolution DCA.

### Promotion (`positive_signal`)
1. System merges seed into baseline first (you do not run merge).
2. Workflow updates `baseline_metrics.json` / `baseline_branches.json`.
3. Metadata in seed/run state.

### Merge failure
- **Normal seed:** In the seed worktree: `git merge __baseline__`, resolve conflicts, commit, and print the DCA summary for retry.
- **Baseline seed (`__baseline__`):** Merge `__baseline__` into the target branch (e.g. master). Run from the worktree that has the target checked out (`git worktree list`); do not run from the `__baseline__` worktree, and do not run `git merge master` there.

### Non-promotion
`neutral` / `negative_signal` / `error`: log only. Failure info in queue/state logs.

## Constraints
- No model/optimizer/training-logic changes for better metrics; only make the plan run (bugs, OOM, etc.).
- Use `run_mainline_training` (or equivalent); do not skip `val_bpb` evaluation.
- Do not edit baseline JSON files; only DCA promotion updates them.
- Canonical runner: `component_system/entrypoint.py`. Traceability: git + state files.
61 changes: 61 additions & 0 deletions component_system/PDCA-PLAN.md
@@ -0,0 +1,61 @@
# P - Seed Planning And Generation

## Responsibility
Extract exactly one testable improvement hypothesis from the seed prompt,
generate the first implementation in a candidate worktree, and hand the result
to DCA through the runner.

## Workspace and paths
**CWD = seed worktree.** Read and edit only inside it; use relative paths only.

## arXiv search (CLI)

Run from repo root with uv (e.g. `uv run python component_system/run_arxiv.py ...`); arxiv is already a project dependency.

### Search (CLI script)

Use the script in this component:

```bash
uv run python component_system/run_arxiv.py --query "machine learning" --max-results 5
uv run python component_system/run_arxiv.py --id 1605.08386v1 --output json
```

**CLI arguments:** `--query` / `-q`, `--id` (one or more arXiv IDs; overrides query), `--max-results` / `-n`, `--sort-by` (relevance | submittedDate | lastUpdatedDate), `--sort-order` (ascending | descending), `--output` / `-o` (text | json), `--download-dir`, `--verbose` / `-v`.

### Hypothesis from results
1. Read abstracts; pick one concrete change (not just a concept).
2. Map to component: `model`, `optimizer`, or `trainer`.
3. State expected benefit; reduce to one isolated, evaluable improvement.

## Input
- **results.tsv** in cwd (if present) → read first to avoid duplicating tried/discarded ideas.
- arXiv via arxiv-search; past failures in `queue/done/`; manual seed files.

## One-Improvement Rule

One seed = one hypothesis = one causal change. Do not bundle ideas. If the prompt has several options, pick the single best for this run. Prefer the smallest coherent change that tests the hypothesis.

**Good:** one optimizer schedule change; one architectural block; one training heuristic. **Bad:** model + optimizer + batch together; multiple paper ideas in one seed; "cleanup + new feature" in one candidate.

## Output Format
Print a summary block for the runner:
```text
AUTORESEARCH_P_SUMMARY_BEGIN
{"idea":"short title","target_component":"model | optimizer | trainer","description":"change details, hypothesis, expected benefit","source_refs":["arXiv:<id>"],"commit_sha":"git sha","completed_at":"YYYY-MM-DD HH:MM:SS"}
AUTORESEARCH_P_SUMMARY_END
```

## Runner / worktree
Before each P run, the runner syncs the seed worktree with its baseline branch (merge baseline into seed) so P always starts from the latest baseline.

## Steps
1. Read `results.tsv` if present.
2. Refine prompt → one concrete idea → one isolated improvement; name target component.
3. Implement in worktree (from baseline); commit on seed branch.
4. Print summary block (runner records commit). Description must be enough for DCA.

## Constraints
- One component, one improvement per seed. Smallest viable implementation.
- No exploratory cleanup or opportunistic refactors unless required for the one change.
- Commit on seed branch; runner does not merge. **P must never merge;** only DCA triggers merge into baseline.
99 changes: 99 additions & 0 deletions component_system/README.md
@@ -0,0 +1,99 @@
# autoresearch

![teaser](progress.png)

*One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of "group meeting". That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that's right or wrong as the "code" is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began. -@karpathy, March 2026*.

The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model. The training code here is a simplified single-GPU implementation of [nanochat](https://github.com/karpathy/nanochat). The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the `program.md` Markdown files that provide context to the AI agents and set up your autonomous research org. The default `program.md` in this repo is intentionally kept as a bare bones baseline, though it's obvious how one would iterate on it over time to find the "research org code" that achieves the fastest research progress, how you'd add more agents to the mix, etc. A bit more context on this project is here in this [tweet](https://x.com/karpathy/status/2029701092347630069).

## How it works

The repo is deliberately kept small and really only has three files that matter:

- **`prepare.py`** — fixed constants, one-time data prep (downloads training data, trains a BPE tokenizer), and runtime utilities (dataloader, evaluation). Not modified.
- **`train.py`** — the single file the agent edits. Contains the full GPT model, optimizer (Muon + AdamW), and training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc. **This file is edited and iterated on by the agent**.
- **`program.md`** — baseline instructions for one agent. Point your agent here and let it go. **This file is edited and iterated on by the human**.

By design, training runs for a **fixed 5-minute time budget** (wall clock, excluding startup/compilation), regardless of the details of your compute. The metric is **val_bpb** (validation bits per byte) — lower is better, and vocab-size-independent so architectural changes are fairly compared.
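Bits per byte is the mean token-level cross-entropy renormalized from nats per token to bits per byte, which is what makes it vocab-size-independent. A sketch of the conversion (variable names are illustrative, not from the repo):

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """Convert mean cross-entropy (nats/token) to bits/byte:
    total nats = loss * tokens; nats → bits via 1/ln(2); divide by raw byte count."""
    total_bits = mean_loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes

# A loss of ln(4) nats/token is 2 bits/token; 4 tokens over 8 bytes → 1 bit/byte.
print(round(bits_per_byte(math.log(4), 4, 8), 6))  # → 1.0
```

Because the byte count of the validation text is fixed, a tokenizer change only shifts `n_tokens` and the per-token loss in opposite directions, leaving bpb comparable across vocabularies.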

## Quick start

**Requirements:** A single NVIDIA GPU (tested on H100), Python 3.10+, [uv](https://docs.astral.sh/uv/).

```bash

# 1. Install uv project manager (if you don't already have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies
uv sync

# 3. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py

# 4. Manually run a single training experiment (~5 min)
uv run train.py
```

If the above commands all work ok, your setup is working and you can go into autonomous research mode.

## Running the agent

Simply spin up your Claude/Codex or whatever you want in this repo (and disable all permissions), then you can prompt something like:

```
Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.
```

The `program.md` file is essentially a super lightweight "skill".

### Component-system workflow

**Seed → P → DCA** loop: the daemon runs two workers that poll a file queue and dispatch to an external agent (Claude, Codex, or OpenCode).

1. **Dashboard** (optional): `uv run uvicorn component_system.web.app:app --reload`, then open http://127.0.0.1:8000/component-system
2. **Daemon:** `uv run component_system/run.py` (or `PDCA_AGENT=codex|opencode` for other backends)
3. **Bootstrap:** Have the agent follow `component_system/protocol.md`, create a seed and queue it for P, then start the daemon. Do not run P/DCA stages manually in-session.

Seeds flow: `queue/p/` → P → `queue/dca/` → DCA → `state/`. Results in dashboard.

## Project structure

```
prepare.py — constants, data prep + runtime utilities (do not modify)
train.py — model, optimizer, training loop (agent modifies this)
program.md — agent instructions
pyproject.toml — dependencies
```

## Design choices

- **Single file to modify.** The agent only touches `train.py`. This keeps the scope manageable and diffs reviewable.
- **Fixed time budget.** Training always runs for exactly 5 minutes, regardless of your specific platform, so you can expect roughly 12 experiments/hour and roughly 100 experiments while you sleep. This design has two upsides. First, experiments are directly comparable regardless of what the agent changes (model size, batch size, architecture, etc.). Second, autoresearch finds the best model for your platform within that time budget. The downside is that your runs (and results) are not comparable to those of people running on other compute platforms.
- **Self-contained.** No external dependencies beyond PyTorch and a few small packages. No distributed training, no complex configs. One GPU, one file, one metric.

## Platform support

This code currently requires a single NVIDIA GPU. In principle it is quite possible to support CPU, MPS, and other platforms, but this would also bloat the code, and I'm not 100% sure I want to take that on personally right now. People can reference (or have their agents reference) the full/parent nanochat repository, which has wider platform support and shows the various solutions (e.g. a Flash Attention 3 kernel fallback implementation, generic device support, autodetection). Feel free to create forks or discussions for other platforms; I'm happy to link to them here in the README in the notable forks section.

Seeing as there seems to be a lot of interest in tinkering with autoresearch on much smaller compute platforms than an H100, a few extra words. If you're going to try running autoresearch on smaller computers (MacBooks etc.), I'd recommend starting from one of the forks below. On top of that, here are some recommendations for tuning the defaults down to much smaller models, for aspiring forks:

1. To get half-decent results I'd use a dataset with a lot less entropy, e.g. this [TinyStories dataset](https://huggingface.co/datasets/karpathy/tinystories-gpt4-clean) of GPT-4-generated short stories. Because the data is much narrower in scope, you will see reasonable results with much smaller models (if you sample from them after training).
2. You might experiment with decreasing `vocab_size`, e.g. from 8192 down to 4096, 2048, 1024, or even a simple byte-level tokenizer with 256 possible bytes after UTF-8 encoding.
3. In `prepare.py`, you'll want to lower `MAX_SEQ_LEN` a lot, depending on the computer even down to 256 or so. As you lower `MAX_SEQ_LEN`, you may want to experiment with slightly increasing `DEVICE_BATCH_SIZE` in `train.py` to compensate; the number of tokens per fwd/bwd pass is the product of the two.
4. Also in `prepare.py`, you'll want to decrease `EVAL_TOKENS` so that your validation loss is evaluated on much less data.
5. In `train.py`, the primary single knob that controls model complexity is `DEPTH` (default 8). A lot of variables are simply functions of it, so lower it, e.g. down to 4.
6. You'll most likely want a `WINDOW_PATTERN` of just "L", because "SSSL" uses an alternating banded attention pattern that may be very inefficient on your platform. Try it.
7. You'll want to lower `TOTAL_BATCH_SIZE` a lot, but keep it a power of 2, e.g. down to `2**14` (~16K) or so; hard to tell.

I think these are the reasonable hyperparameters to play with. Ask your favorite coding agent for help, and copy-paste it this guide along with the full source code.
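The tokens-per-pass arithmetic from points 3 and 7 can be sanity-checked in a couple of lines (the numbers are hypothetical small-machine values, not repo defaults):

```python
# Tokens per forward/backward pass = MAX_SEQ_LEN * DEVICE_BATCH_SIZE.
max_seq_len = 256          # hypothetical small-machine value
device_batch_size = 32     # hypothetical small-machine value
tokens_per_pass = max_seq_len * device_batch_size

# Keep TOTAL_BATCH_SIZE a power of two and a multiple of tokens_per_pass;
# the trainer then needs total/tokens_per_pass gradient-accumulation steps.
total_batch_size = 2 ** 14
assert total_batch_size % tokens_per_pass == 0
grad_accum_steps = total_batch_size // tokens_per_pass
print(tokens_per_pass, grad_accum_steps)  # → 8192 2
```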

## Notable forks

- [miolini/autoresearch-macos](https://github.com/miolini/autoresearch-macos) (MacOS)
- [trevin-creator/autoresearch-mlx](https://github.com/trevin-creator/autoresearch-mlx) (MacOS)
- [jsegov/autoresearch-win-rtx](https://github.com/jsegov/autoresearch-win-rtx) (Windows)

## License

MIT