autoresearch

Autonomous experiment loop for a small single-GPU language-model training setup.

This fork stays close to karpathy/autoresearch in shape, but it runs more easily on mainstream NVIDIA cards rather than assuming Hopper-class hardware.


Why This Fork Exists

If you're trying to run autoresearch on a normal local NVIDIA GPU, the upstream defaults may not work cleanly as-is. This fork keeps the original small-loop idea, but smooths out the first-run path for consumer cards with a safer attention fallback, small runtime presets, and clearer operator docs.

What This Repo Is

The loop is intentionally small:

  • one mutable target: train.py
  • one fixed evaluation harness: prepare.py
  • one fixed run budget: 5 minutes
  • one scalar outcome: val_bpb (lower is better)
  • one ledger: results.tsv

The point is not to preserve this exact training recipe forever. The useful pattern is the keep/discard experiment loop under a fixed evaluation harness.
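That keep/discard pattern can be sketched in a few lines. The names below (`experiment_loop`, `propose_patch`, `run_experiment`) are illustrative, not the repo's actual API:

```python
# Minimal sketch of the keep/discard loop (illustrative names, not the repo's API).

def experiment_loop(propose_patch, run_experiment, best_bpb=float("inf")):
    """Try one mutation of train.py; keep it only if val_bpb improves."""
    patch = propose_patch()           # mutate the single mutable file
    val_bpb = run_experiment(patch)   # fixed harness, fixed 5-minute budget
    keep = val_bpb < best_bpb         # one scalar outcome, lower is better
    return ("keep" if keep else "discard"), min(val_bpb, best_bpb)

# One iteration: a run scoring 0.95 bpb against a best of 1.0 is kept.
status, best = experiment_loop(lambda: "patch", lambda p: 0.95, best_bpb=1.0)
```

Everything interesting lives in `propose_patch`; the harness and budget never move.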

What This Fork Changes

Upstream vs This Fork

Compared with upstream, this fork adds a few practical changes for local research use:

  • safer attention fallback on non-Hopper NVIDIA GPUs
  • optional consumer-GPU presets through AR_PRESET
  • clearer run logging and startup output
  • log_result.py for appending completed runs to results.tsv
  • more direct operator docs in README.md and program.md

Requirements

  • Linux
  • one NVIDIA GPU
  • Python 3.10+
  • uv

This repo is still GPU-only. It is not intended for CPU training or broad multi-platform support.

Files That Matter

  • prepare.py: data prep, tokenizer training, dataloaders, fixed eval utilities
  • train.py: model, optimizer, training loop, experiment surface
  • program.md: baseline instructions for an autonomous agent
  • log_result.py: append a completed run summary to results.tsv
  • pyproject.toml: dependencies

Quick Start

# install dependencies
uv sync

# one-time data/tokenizer prep
uv run prepare.py

# run the tuned 12 GB baseline used for the latest RTX 5070 validation
AR_PRESET=12gb uv run train.py

# append the summary from the run log to the ledger
uv run log_result.py --log run.log --status keep --description "baseline"

If those commands work, the repo is ready for manual or agent-driven iteration.
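The ledger append that log_result.py performs amounts to writing one TSV row per completed run. This sketch assumes a simple date/description/score/status schema, which is not necessarily the script's actual column layout:

```python
# Sketch of appending one run summary to the results ledger.
# The column schema here is an assumption, not log_result.py's actual format.
import csv
import datetime

def append_result(path, description, val_bpb, status):
    """Append one completed run as a tab-separated row."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow([
            datetime.date.today().isoformat(),  # when the run finished
            description,                        # what was tried
            f"{val_bpb:.4f}",                   # the scalar outcome
            status,                             # keep or discard
        ])

# e.g. append_result("results.tsv", "baseline", 0.9123, "keep")
```

A flat append-only TSV keeps the ledger trivially diffable and greppable, which matters more here than query power.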

Consumer GPU Presets

This fork supports a small preset layer so people on weaker GPUs can get to a working first run without editing source.

Available presets:

  • AR_PRESET=8gb: smaller microbatch, smaller total batch, compile disabled
  • AR_PRESET=12gb: validated RTX 5070-style defaults, compile disabled, WINDOW_PATTERN=L, WARMDOWN_RATIO=0.75, FINAL_LR_FRAC=0.05
  • AR_PRESET=h100: keeps the higher-throughput defaults closer to upstream assumptions

Examples:

# tuned first run on a 12 GB card
AR_PRESET=12gb uv run train.py

# even smaller run shape for tighter VRAM budgets
AR_PRESET=8gb uv run train.py

# override any preset explicitly
AR_PRESET=12gb AR_DEVICE_BATCH_SIZE=2 AR_TOTAL_BATCH_SIZE=16384 uv run train.py

Environment variables supported by train.py:

  • AR_PRESET
  • AR_ASPECT_RATIO
  • AR_HEAD_DIM
  • AR_WINDOW_PATTERN
  • AR_DEPTH
  • AR_DEVICE_BATCH_SIZE
  • AR_TOTAL_BATCH_SIZE
  • AR_COMPILE
  • AR_WARMDOWN_RATIO
  • AR_FINAL_LR_FRAC
  • AR_EMBEDDING_WEIGHT_DECAY
  • AR_VALUE_EMBED_WEIGHT_DECAY
  • AR_LM_HEAD_WEIGHT_DECAY
  • AR_PEAK_FLOPS: optional manual override for peak FLOPS, for when auto-detection is wrong or fixed MFU reporting is wanted

Precedence is simple: explicit AR_* values override the preset.
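That precedence rule can be sketched as a small resolver. The preset tables below are illustrative placeholders, not train.py's actual values:

```python
# Sketch of preset/override precedence: preset supplies defaults,
# explicit AR_* environment variables win. Preset values are assumptions.
PRESETS = {
    "8gb":  {"AR_DEVICE_BATCH_SIZE": "1", "AR_COMPILE": "0"},
    "12gb": {"AR_DEVICE_BATCH_SIZE": "4", "AR_COMPILE": "0",
             "AR_WINDOW_PATTERN": "L", "AR_WARMDOWN_RATIO": "0.75",
             "AR_FINAL_LR_FRAC": "0.05"},
}

def resolve_config(env):
    """Start from the preset's defaults, then let explicit AR_* values win."""
    config = dict(PRESETS.get(env.get("AR_PRESET", ""), {}))
    for key, value in env.items():
        if key.startswith("AR_") and key != "AR_PRESET":
            config[key] = value  # explicit variable overrides the preset
    return config

# AR_DEVICE_BATCH_SIZE comes from the environment; the rest from the preset.
cfg = resolve_config({"AR_PRESET": "12gb", "AR_DEVICE_BATCH_SIZE": "2"})
```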

The current best known 12 GB config from the fixed 5-minute RTX 5070 run is exactly the preset:

AR_PRESET=12gb uv run train.py

Equivalent explicit settings:

AR_WINDOW_PATTERN=L AR_WARMDOWN_RATIO=0.75 AR_FINAL_LR_FRAC=0.05 uv run train.py

Expected Behavior On Consumer GPUs

This repo still assumes a single CUDA GPU and a short fixed-time training run. On non-Hopper GPUs, the fork falls back from Flash Attention 3 to PyTorch SDPA automatically and prints that choice at startup. The SDPA path now reuses precomputed sliding-window masks instead of rebuilding them every forward pass. The startup summary also prints the detected GPU name, compute capability, and peak FLOPS source used for MFU reporting.
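The fallback decision reduces to a compute-capability check. This sketch assumes Flash Attention 3 is usable only on Hopper (sm_90) parts, and takes the capability tuple as an argument rather than querying torch.cuda.get_device_capability():

```python
# Sketch of the attention-backend choice printed at startup.
# Assumption: Flash Attention 3 requires Hopper (compute capability 9.0);
# every other GPU falls back to PyTorch SDPA.
def pick_attention_backend(compute_capability):
    """Return the attention backend for a (major, minor) capability tuple."""
    return "flash_attention_3" if compute_capability == (9, 0) else "sdpa"

# H100 (9, 0) keeps FA3; an RTX 5070 (12, 0) or Ampere card (8, 6) gets SDPA.
```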

A few practical notes:

  • start with AR_PRESET=12gb on 12 GB cards
  • if compilation is slow or unstable, keep AR_COMPILE=0
  • if VRAM is tight, lower AR_DEVICE_BATCH_SIZE first
  • the fork now uses the validated L window pattern and longer warmdown by default for the 12gb preset
  • AR_PEAK_FLOPS can be used to override the detected peak FLOPS if needed
  • this fork is meant to be easier to start, not universally portable

This path has been validated on an RTX 5070 12 GB setup.

Agent Loop

The reusable idea here is the loop shape:

  1. keep one mutable file
  2. keep the evaluation harness fixed
  3. run on a fixed time budget
  4. record a compact summary
  5. keep or discard based on the result

program.md contains the baseline agent instructions for that loop.

Project Structure

prepare.py      fixed data prep and evaluation utilities
train.py        mutable experiment surface
program.md      agent instructions
log_result.py   append summaries to results.tsv
pyproject.toml  dependencies
results.tsv     experiment ledger

Design Constraints

  • single machine
  • single NVIDIA GPU
  • short runs over maximum throughput
  • inspectable code over configuration sprawl
  • minimal dependencies

If you need distributed training, broad hardware portability, or a larger experiment platform, this repo is the wrong base.


License

MIT
