20 commits
9132b9e
Introduce Component-System for component optimization, add a dashboar…
Mar 10, 2026
cb5aa72
Introduce Component-System for component optimization, add a dashboar…
Mar 10, 2026
fa8f3b6
Introduce Component-System for component optimization, add a dashboar…
Mar 10, 2026
49568b1
Merge dev into master (accepting all dev changes)
Mar 12, 2026
636b25e
Merge pull request #2 from karpathy/master
LaurenceLong Mar 12, 2026
2b2b45d
P stage: Add gradient clipping for training stability
Mar 11, 2026
0ae358d
Add optional gradient clipping and improve Ralph worktree restoration
Mar 12, 2026
85065a4
Commit uncommitted changes
Mar 12, 2026
506e430
Merge __baseline__ into current branch
Mar 12, 2026
4370961
Replace ReLU² with SwiGLU activation in MLP
Mar 12, 2026
ad41c96
Update component-system: extend DCA timeout to 900s, add baseline-fir…
Mar 12, 2026
4655486
Merge __baseline__ into master
Mar 12, 2026
f866a76
refactor(component-system): improve PDCA workflow, metrics tracking, …
Mar 12, 2026
b409165
Merge branch 'master' of https://github.com/LaurenceLong/autoresearch
Mar 12, 2026
1b0ec7f
Add arxiv dependency
Mar 12, 2026
0e1e85a
Merge branch 'master' of https://github.com/LaurenceLong/autoresearch
Mar 12, 2026
73c75af
Merge origin/master - accept all origin changes
Mar 12, 2026
e4f4895
Add local changes to model, trainer, and clean_history
Mar 12, 2026
63e87c5
Remove .ipynb_checkpoints directories from git tracking
Mar 12, 2026
e0e0e47
feat: improve worktree handling and merge resolution
Mar 12, 2026
11 changes: 11 additions & 0 deletions .gitignore
@@ -1,6 +1,7 @@
# Python-generated files
__pycache__/
*.py[oc]
.pytest_cache/
build/
dist/
wheels/
@@ -16,8 +17,18 @@ queue/
CLAUDE.md
AGENTS.md

# Prompt audit output (generated by tests)
component_system/prompt_audit/

# Experimental code/artifacts
dev/

# Results file
results.tsv

# Component-system runtime artifacts (logs, queue, state, worktrees under history/)
component_system/history/
component_system/baseline_branches.json
component_system/baseline_metrics.json
*.log
.ipynb_checkpoints/
2 changes: 2 additions & 0 deletions README.md
@@ -49,6 +49,8 @@ Hi have a look at program.md and let's kick off a new experiment! let's do the s

The `program.md` file is essentially a super lightweight "skill".

For the component-system workflow, see `component_system/README.md`.

## Project structure

```
75 changes: 75 additions & 0 deletions component_system/PDCA-DO-CHECK-ACTION.md
@@ -0,0 +1,75 @@
# DCA — Do, Check, Action

## Responsibility
Take the generated plan from P, adapt/fix it in the seed worktree,
run the canonical training entrypoint, evaluate results against baseline, and
promote only when the signal is positive. Do not propose new ideas or optimize for better metrics; only adapt/fix so the plan runs and report outcomes.

## Workspace and paths
**CWD = seed worktree.** Read and edit only inside it; use relative paths only. Treat `component_system/` in the worktree as canonical context.

## Input
- Runner prompt (task content).
- Baseline: `component_system/baseline_branches.json`, `component_system/baseline_metrics.json`.
- Worktree-local files only.

## Baseline measurement (seed_id `__baseline__`)
Retry until the run succeeds and you report real metrics. No empty metrics.

- **OOM:** Reduce `device_batch_size` in `component_system/components/trainer.py` (default 128); keep `total_batch_size % (device_batch_size * sequence_length) == 0`. Rerun until training completes.
- Only trivial fixes (e.g. batch size); no model/training logic changes.
- **Commit before reporting.** Uncommitted changes break the follow-up merge.
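When hunting for a smaller `device_batch_size` after an OOM, the divisibility constraint can be checked mechanically. A minimal sketch (the helper name and example numbers are illustrative, not from the repo):

```python
def valid_device_batch_sizes(total_batch_size, sequence_length, max_device_bs=128):
    """Enumerate device batch sizes (largest first) that keep
    total_batch_size % (device_batch_size * sequence_length) == 0."""
    return [bs for bs in range(max_device_bs, 0, -1)
            if total_batch_size % (bs * sequence_length) == 0]

# e.g. a 2**19-token total batch at sequence length 1024:
print(valid_device_batch_sizes(2**19, 1024, 8))  # → [8, 4, 2, 1]
```

Pick the largest size that fits in memory; halving repeatedly always preserves the constraint when the original configuration satisfied it.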

## Workflow
1. Work in the seed worktree (one branch per seed).
2. Adapt/fix until it runs (runtime only: bugs, OOM, imports, config; no model/hyperparameter/training-logic changes for better metrics).
3. Run the canonical command (**≥900s**): `timeout 900 uv run --active component_system/entrypoint.py > training.log 2>&1` (or `... 2>&1 | tee training.log` to also see output). **The command/tool timeout must be set to ≥900s.** After the run, inspect `training.log` to confirm completion and recover or verify metrics.
4. On bug/OOM: fix and rerun; for baseline, retry until success.
5. Commit on seed branch before reporting.
6. Print DCA summary block with `commit_sha` in JSON.
7. Runner evaluates signal and handles promotion.

## Output Format
Print the summary block. Put metrics in JSON; runner falls back to stdout/stderr parsing if missing.

```text
AUTORESEARCH_DCA_SUMMARY_BEGIN
{"checks":["entrypoint"],"notes":"...","completed_at":"YYYY-MM-DD HH:MM:SS","commit_sha":"git sha","metrics":{"val_bpb":1.24,...}}
AUTORESEARCH_DCA_SUMMARY_END
```

If no final metrics, use `"metrics": {}`. Runner extracts from stdout/stderr: `val_bpb`, `training_seconds`, `total_seconds`, `peak_vram_mb`, `mfu_percent`, `total_tokens_M`, `num_steps`, `num_params_M`, `depth`. No metrics → recovery DCA inspects logs; only then treat as failed.
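The sentinel-delimited block is easy to parse mechanically; a sketch of what runner-side extraction could look like (an illustration, not the actual runner code):

```python
import json
import re

DCA_RE = re.compile(
    r"AUTORESEARCH_DCA_SUMMARY_BEGIN\s*(\{.*\})\s*AUTORESEARCH_DCA_SUMMARY_END",
    re.DOTALL,
)

def extract_dca_summary(text):
    """Return the JSON payload between the sentinels, or None if absent."""
    m = DCA_RE.search(text)
    return json.loads(m.group(1)) if m else None

out = """...training log tail...
AUTORESEARCH_DCA_SUMMARY_BEGIN
{"checks": ["entrypoint"], "commit_sha": "abc123", "metrics": {"val_bpb": 1.24}}
AUTORESEARCH_DCA_SUMMARY_END"""
print(extract_dca_summary(out)["metrics"]["val_bpb"])  # → 1.24
```

If this returns `None` or an empty `metrics`, the runner would fall back to scraping stdout/stderr for the named metrics listed above.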

## Check: Signal Rules

| Condition | Signal |
|-----------|--------|
| `val_bpb` drops >= 0.001 vs baseline | `positive_signal` |
| `val_bpb` rises >= 0.001 vs baseline | `negative_signal` |
| difference < 0.001 | `neutral` |
| no historical baseline (best_val_bpb) | `positive_signal` (first recording) |
| metrics missing or training error | `error` |

The threshold is defined in `component_system/config.py` (`PROMOTION_THRESHOLD`).
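The table maps to a small classification function; a sketch assuming `PROMOTION_THRESHOLD = 0.001` (the authoritative value lives in `component_system/config.py`):

```python
PROMOTION_THRESHOLD = 0.001  # assumed to mirror component_system/config.py

def check_signal(val_bpb, baseline_val_bpb, threshold=PROMOTION_THRESHOLD):
    """Classify a run against baseline per the signal table (lower bpb is better)."""
    if val_bpb is None:
        return "error"                 # metrics missing or training error
    if baseline_val_bpb is None:
        return "positive_signal"       # no historical baseline: first recording
    diff = baseline_val_bpb - val_bpb  # positive when the run improved
    if diff >= threshold:
        return "positive_signal"
    if diff <= -threshold:
        return "negative_signal"
    return "neutral"
```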

## Action: Promotion Rules

Only DCA may trigger a merge into baseline; P must not. The runner records `commit_sha`; on a positive signal the workflow merges the seed into baseline first, then updates metrics/state. On a merge conflict, the system queues a merge-resolution DCA.

### Promotion (`positive_signal`)
1. System merges seed into baseline first (you do not run merge).
2. Workflow updates `baseline_metrics.json` / `baseline_branches.json`.
3. Metadata in seed/run state.

### Merge failure
- **Normal seed:** In the seed worktree: `git merge __baseline__`, resolve conflicts, commit, and print the DCA summary for retry.
- **Baseline seed (`__baseline__`):** Merge `__baseline__` into the target branch (e.g. master). Run from the worktree that has the target checked out (`git worktree list`); do not run from the `__baseline__` worktree, and do not run `git merge master` there.

### Non-promotion
`neutral` / `negative_signal` / `error`: log only. Failure info in queue/state logs.

## Constraints
- No model/optimizer/training-logic changes for better metrics; only make the plan run (bugs, OOM, etc.).
- Use `run_mainline_training` (or equivalent); do not skip `val_bpb` evaluation.
- Do not edit baseline JSON files; only DCA promotion updates them.
- Canonical runner: `component_system/entrypoint.py`. Traceability: git + state files.
61 changes: 61 additions & 0 deletions component_system/PDCA-PLAN.md
@@ -0,0 +1,61 @@
# P - Seed Planning And Generation

## Responsibility
Extract exactly one testable improvement hypothesis from the seed prompt,
generate the first implementation in a candidate worktree, and hand the result
to DCA through the runner.

## Workspace and paths
**CWD = seed worktree.** Read and edit only inside it; use relative paths only.

## arXiv search (CLI)

Run from repo root with uv (e.g. `uv run python component_system/run_arxiv.py ...`); arxiv is already a project dependency.

### Search (CLI script)

Use the script in this component:

```bash
uv run python component_system/run_arxiv.py --query "machine learning" --max-results 5
uv run python component_system/run_arxiv.py --id 1605.08386v1 --output json
```

**CLI arguments:** `--query` / `-q`, `--id` (one or more arXiv IDs; overrides query), `--max-results` / `-n`, `--sort-by` (relevance | submittedDate | lastUpdatedDate), `--sort-order` (ascending | descending), `--output` / `-o` (text | json), `--download-dir`, `--verbose` / `-v`.

### Hypothesis from results
1. Read abstracts; pick one concrete change (not just a concept).
2. Map to component: `model`, `optimizer`, or `trainer`.
3. State expected benefit; reduce to one isolated, evaluable improvement.

## Input
- **results.tsv** in cwd (if present) → read first to avoid duplicating tried/discarded ideas.
- arXiv via arxiv-search; past failures in `queue/done/`; manual seed files.

## One-Improvement Rule

One seed = one hypothesis = one causal change. Do not bundle ideas. If the prompt has several options, pick the single best for this run. Prefer the smallest coherent change that tests the hypothesis.

**Good:** one optimizer schedule change; one architectural block; one training heuristic. **Bad:** model + optimizer + batch together; multiple paper ideas in one seed; "cleanup + new feature" in one candidate.

## Output Format
Print a summary block for the runner:
```text
AUTORESEARCH_P_SUMMARY_BEGIN
{"idea":"short title","target_component":"model | optimizer | trainer","description":"change details, hypothesis, expected benefit","source_refs":["arXiv:<id>"],"commit_sha":"git sha","completed_at":"YYYY-MM-DD HH:MM:SS"}
AUTORESEARCH_P_SUMMARY_END
```

## Runner / worktree
Before each P run, the runner syncs the seed worktree with its baseline branch (merge baseline into seed) so P always starts from the latest baseline.

## Steps
1. Read `results.tsv` if present.
2. Refine prompt → one concrete idea → one isolated improvement; name target component.
3. Implement in worktree (from baseline); commit on seed branch.
4. Print summary block (runner records commit). Description must be enough for DCA.

## Constraints
- One component, one improvement per seed. Smallest viable implementation.
- No exploratory cleanup or opportunistic refactors unless required for the one change.
- Commit on seed branch; runner does not merge. **P must never merge;** only DCA triggers merge into baseline.
99 changes: 99 additions & 0 deletions component_system/README.md
@@ -0,0 +1,99 @@
# autoresearch

![teaser](progress.png)

*One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of "group meeting". That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that's right or wrong as the "code" is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began. -@karpathy, March 2026*.

The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model. The training code here is a simplified single-GPU implementation of [nanochat](https://github.com/karpathy/nanochat). The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the `program.md` Markdown files that provide context to the AI agents and set up your autonomous research org. The default `program.md` in this repo is intentionally kept as a bare bones baseline, though it's obvious how one would iterate on it over time to find the "research org code" that achieves the fastest research progress, how you'd add more agents to the mix, etc. A bit more context on this project is here in this [tweet](https://x.com/karpathy/status/2029701092347630069).

## How it works

The repo is deliberately kept small and really only has three files that matter:

- **`prepare.py`** — fixed constants, one-time data prep (downloads training data, trains a BPE tokenizer), and runtime utilities (dataloader, evaluation). Not modified.
- **`train.py`** — the single file the agent edits. Contains the full GPT model, optimizer (Muon + AdamW), and training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc. **This file is edited and iterated on by the agent**.
- **`program.md`** — baseline instructions for one agent. Point your agent here and let it go. **This file is edited and iterated on by the human**.

By design, training runs for a **fixed 5-minute time budget** (wall clock, excluding startup/compilation), regardless of the details of your compute. The metric is **val_bpb** (validation bits per byte) — lower is better, and vocab-size-independent so architectural changes are fairly compared.
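Bits per byte is the mean token-level cross-entropy renormalized from nats per token to bits per byte, which is what makes it vocab-size-independent. A sketch of the conversion (variable names are illustrative, not from the repo):

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """Convert mean cross-entropy (nats/token) to bits/byte:
    total nats = loss * tokens; nats → bits via 1/ln(2); divide by raw byte count."""
    total_bits = mean_loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes

# A loss of ln(4) nats/token is 2 bits/token; 4 tokens over 8 bytes → 1 bit/byte.
print(round(bits_per_byte(math.log(4), 4, 8), 6))  # → 1.0
```

Because the byte count of the validation text is fixed, a tokenizer change only shifts `n_tokens` and the per-token loss in opposite directions, leaving bpb comparable across vocabularies.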

## Quick start

**Requirements:** A single NVIDIA GPU (tested on H100), Python 3.10+, [uv](https://docs.astral.sh/uv/).

```bash

# 1. Install uv project manager (if you don't already have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies
uv sync

# 3. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py

# 4. Manually run a single training experiment (~5 min)
uv run train.py
```

If the above commands all work ok, your setup is working and you can go into autonomous research mode.

## Running the agent

Simply spin up your Claude/Codex or whatever you want in this repo (and disable all permissions), then you can prompt something like:

```
Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.
```

The `program.md` file is essentially a super lightweight "skill".

### Component-system workflow

**Seed → P → DCA** loop: the daemon runs two workers that poll a file queue and dispatch to an external agent (Claude, Codex, or OpenCode).

1. **Dashboard** (optional): `uv run uvicorn component_system.web.app:app --reload`, then open http://127.0.0.1:8000/component-system
2. **Daemon:** `uv run component_system/run.py` (or `PDCA_AGENT=codex|opencode` for other backends)
3. **Bootstrap:** Have the agent follow `component_system/protocol.md`, create a seed and queue it for P, then start the daemon. Do not run P/DCA stages manually in-session.

Seeds flow: `queue/p/` → P → `queue/dca/` → DCA → `state/`. Results in dashboard.

## Project structure

```
prepare.py — constants, data prep + runtime utilities (do not modify)
train.py — model, optimizer, training loop (agent modifies this)
program.md — agent instructions
pyproject.toml — dependencies
```

## Design choices

- **Single file to modify.** The agent only touches `train.py`. This keeps the scope manageable and diffs reviewable.
- **Fixed time budget.** Training always runs for exactly 5 minutes, regardless of your specific platform, so you can expect roughly 12 experiments/hour and roughly 100 experiments while you sleep. This design has two upsides. First, experiments are directly comparable regardless of what the agent changes (model size, batch size, architecture, etc.). Second, autoresearch finds the best model for your platform within that time budget. The downside is that your runs (and results) are not comparable to those of people running on other compute platforms.
- **Self-contained.** No external dependencies beyond PyTorch and a few small packages. No distributed training, no complex configs. One GPU, one file, one metric.

## Platform support

This code currently requires a single NVIDIA GPU. In principle it is quite possible to support CPU, MPS, and other platforms, but this would also bloat the code, and I'm not 100% sure I want to take that on personally right now. People can reference (or have their agents reference) the full/parent nanochat repository, which has wider platform support and shows the various solutions (e.g. a Flash Attention 3 kernel fallback implementation, generic device support, autodetection). Feel free to create forks or discussions for other platforms; I'm happy to link to them here in the README in the notable forks section.

Seeing as there seems to be a lot of interest in tinkering with autoresearch on much smaller compute platforms than an H100, a few extra words. If you're going to try running autoresearch on smaller computers (MacBooks etc.), I'd recommend starting from one of the forks below. On top of that, here are some recommendations for tuning the defaults down to much smaller models, for aspiring forks:

1. To get half-decent results I'd use a dataset with a lot less entropy, e.g. this [TinyStories dataset](https://huggingface.co/datasets/karpathy/tinystories-gpt4-clean) of GPT-4-generated short stories. Because the data is much narrower in scope, you will see reasonable results with much smaller models (if you sample from them after training).
2. You might experiment with decreasing `vocab_size`, e.g. from 8192 down to 4096, 2048, 1024, or even a simple byte-level tokenizer with 256 possible bytes after UTF-8 encoding.
3. In `prepare.py`, you'll want to lower `MAX_SEQ_LEN` a lot, depending on the computer even down to 256 or so. As you lower `MAX_SEQ_LEN`, you may want to experiment with slightly increasing `DEVICE_BATCH_SIZE` in `train.py` to compensate; the number of tokens per fwd/bwd pass is the product of the two.
4. Also in `prepare.py`, you'll want to decrease `EVAL_TOKENS` so that your validation loss is evaluated on much less data.
5. In `train.py`, the primary single knob that controls model complexity is `DEPTH` (default 8). A lot of variables are simply functions of it, so lower it, e.g. down to 4.
6. You'll most likely want a `WINDOW_PATTERN` of just "L", because "SSSL" uses an alternating banded attention pattern that may be very inefficient on your platform. Try it.
7. You'll want to lower `TOTAL_BATCH_SIZE` a lot, but keep it a power of 2, e.g. down to `2**14` (~16K) or so; hard to tell.

I think these are the reasonable hyperparameters to play with. Ask your favorite coding agent for help, and copy-paste it this guide along with the full source code.
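The tokens-per-pass arithmetic from points 3 and 7 can be sanity-checked in a couple of lines (the numbers are hypothetical small-machine values, not repo defaults):

```python
# Tokens per forward/backward pass = MAX_SEQ_LEN * DEVICE_BATCH_SIZE.
max_seq_len = 256          # hypothetical small-machine value
device_batch_size = 32     # hypothetical small-machine value
tokens_per_pass = max_seq_len * device_batch_size

# Keep TOTAL_BATCH_SIZE a power of two and a multiple of tokens_per_pass;
# the trainer then needs total/tokens_per_pass gradient-accumulation steps.
total_batch_size = 2 ** 14
assert total_batch_size % tokens_per_pass == 0
grad_accum_steps = total_batch_size // tokens_per_pass
print(tokens_per_pass, grad_accum_steps)  # → 8192 2
```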

## Notable forks

- [miolini/autoresearch-macos](https://github.com/miolini/autoresearch-macos) (MacOS)
- [trevin-creator/autoresearch-mlx](https://github.com/trevin-creator/autoresearch-mlx) (MacOS)
- [jsegov/autoresearch-win-rtx](https://github.com/jsegov/autoresearch-win-rtx) (Windows)

## License

MIT