Autonomous experiment loop for a small single-GPU language-model training setup.
This fork stays close to karpathy/autoresearch in shape, but targets mainstream NVIDIA cards instead of assuming Hopper-class hardware. Upstream defaults may not run cleanly on a typical local GPU; this fork keeps the original small-loop idea while smoothing the first-run path for consumer cards with a safer attention fallback, small runtime presets, and clearer operator docs.
The loop is intentionally small:

- one mutable target: `train.py`
- one fixed evaluation harness: `prepare.py`
- one fixed run budget: 5 minutes
- one scalar outcome: `val_bpb` (lower is better)
- one ledger: `results.tsv`
The point is not to preserve this exact training recipe forever. The useful pattern is the keep/discard experiment loop under a fixed evaluation harness.
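That keep/discard pattern can be sketched in a few lines of Python. The helper names, the log-line format, and the ledger columns below are illustrative assumptions, not this repo's actual parsing code:

```python
import csv
import os

def parse_val_bpb(log_path: str) -> float:
    """Return the last val_bpb value found in a run log.

    Assumes log lines shaped like: "step 1200 val_bpb 1.0432".
    """
    value = None
    with open(log_path) as f:
        for line in f:
            if "val_bpb" in line:
                value = float(line.rsplit("val_bpb", 1)[1].split()[0])
    if value is None:
        raise ValueError(f"no val_bpb found in {log_path}")
    return value

def keep_or_discard(candidate_bpb: float, ledger: str = "results.tsv") -> bool:
    """Keep a change only if it beats every prior run (lower is better)."""
    if not os.path.exists(ledger):
        return True
    with open(ledger) as f:
        prior = [float(row["val_bpb"]) for row in csv.DictReader(f, delimiter="\t")]
    return not prior or candidate_bpb < min(prior)
```

The decision rule is the whole point: the harness and budget stay fixed, so a single scalar comparison is enough to accept or reject an edit to the mutable file.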
Compared with upstream, this fork adds a few practical changes for local research use:

- safer attention fallback on non-Hopper NVIDIA GPUs
- optional consumer-GPU presets through `AR_PRESET`
- clearer run logging and startup output
- `log_result.py` for appending completed runs to `results.tsv`
- more direct operator docs in `README.md` and `program.md`
Requirements:

- Linux
- one NVIDIA GPU
- Python 3.10+
- `uv`
This repo is still GPU-only. It is not intended for CPU training or broad multi-platform support.
- `prepare.py`: data prep, tokenizer training, dataloaders, fixed eval utilities
- `train.py`: model, optimizer, training loop, experiment surface
- `program.md`: baseline instructions for an autonomous agent
- `log_result.py`: append a completed run summary to `results.tsv`
- `pyproject.toml`: dependencies
```sh
# install dependencies
uv sync

# one-time data/tokenizer prep
uv run prepare.py

# run the tuned 12 GB baseline used for the latest RTX 5070 validation
AR_PRESET=12gb uv run train.py

# append the summary from the run log to the ledger
python log_result.py --log run.log --status keep --description "baseline"
```

If those commands work, the repo is ready for manual or agent-driven iteration.
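The ledger-append step is simple enough to sketch. The column set below is an assumption for illustration; the real columns live in `log_result.py` and may differ:

```python
import csv
import os

def append_result(ledger: str, val_bpb: float, status: str, description: str) -> None:
    """Append one run summary row to a TSV ledger, writing a header when the file is new."""
    write_header = not os.path.exists(ledger)
    with open(ledger, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        if write_header:
            writer.writerow(["val_bpb", "status", "description"])
        writer.writerow([f"{val_bpb:.4f}", status, description])
```

Appending (rather than rewriting) keeps the ledger an immutable history of every attempt, kept or discarded.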
This fork supports a small preset layer so people on weaker GPUs can get to a working first run without editing source.
Available presets:
- `AR_PRESET=8gb`: smaller microbatch, smaller total batch, compile disabled
- `AR_PRESET=12gb`: validated RTX 5070-style defaults, compile disabled, `WINDOW_PATTERN=L`, `WARMDOWN_RATIO=0.75`, `FINAL_LR_FRAC=0.05`
- `AR_PRESET=h100`: keeps the higher-throughput defaults closer to upstream assumptions
Examples:
```sh
# tuned first run on a 12 GB card
AR_PRESET=12gb uv run train.py

# even smaller run shape for tighter VRAM budgets
AR_PRESET=8gb uv run train.py

# override any preset explicitly
AR_PRESET=12gb AR_DEVICE_BATCH_SIZE=2 AR_TOTAL_BATCH_SIZE=16384 uv run train.py
```

Environment variables supported by `train.py`:

- `AR_PRESET`
- `AR_ASPECT_RATIO`
- `AR_HEAD_DIM`
- `AR_WINDOW_PATTERN`
- `AR_DEPTH`
- `AR_DEVICE_BATCH_SIZE`
- `AR_TOTAL_BATCH_SIZE`
- `AR_COMPILE`
- `AR_WARMDOWN_RATIO`
- `AR_FINAL_LR_FRAC`
- `AR_EMBEDDING_WEIGHT_DECAY`
- `AR_VALUE_EMBED_WEIGHT_DECAY`
- `AR_LM_HEAD_WEIGHT_DECAY`
- `AR_PEAK_FLOPS`: optional manual override for peak FLOPS if auto-detection is wrong or you want fixed MFU reporting
Precedence is simple: explicit `AR_*` values override the preset.
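That precedence rule is easy to picture as code. The preset tables below are illustrative placeholder values, not the fork's actual numbers:

```python
import os

# Hypothetical preset defaults; the real values live in train.py.
PRESETS = {
    "8gb":  {"AR_DEVICE_BATCH_SIZE": "1", "AR_COMPILE": "0"},
    "12gb": {"AR_DEVICE_BATCH_SIZE": "4", "AR_COMPILE": "0",
             "AR_WINDOW_PATTERN": "L", "AR_WARMDOWN_RATIO": "0.75",
             "AR_FINAL_LR_FRAC": "0.05"},
}

def resolve_config(env=os.environ) -> dict:
    """Start from the preset's defaults, then let explicit AR_* values win."""
    cfg = dict(PRESETS.get(env.get("AR_PRESET", ""), {}))
    for key, value in env.items():
        if key.startswith("AR_") and key != "AR_PRESET":
            cfg[key] = value
    return cfg
```

So `AR_PRESET=12gb AR_DEVICE_BATCH_SIZE=2` resolves to the `12gb` defaults with the batch size overridden.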
The current best known 12 GB config from the fixed 5-minute RTX 5070 run is exactly the preset:
```sh
AR_PRESET=12gb uv run train.py
```

Equivalent explicit settings:

```sh
AR_WINDOW_PATTERN=L AR_WARMDOWN_RATIO=0.75 AR_FINAL_LR_FRAC=0.05 uv run train.py
```

This repo still assumes a single CUDA GPU and a short fixed-time training run. On non-Hopper GPUs, the fork falls back from Flash Attention 3 to PyTorch SDPA automatically and prints that choice at startup. The SDPA path now reuses precomputed sliding-window masks instead of rebuilding them every forward pass. The startup summary also prints the detected GPU name, compute capability, and peak FLOPS source used for MFU reporting.
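The mask-reuse idea is simple: a sliding-window mask depends only on sequence length and window size, so it can be built once and cached. This dependency-free sketch shows the caching pattern; the actual fallback would build a tensor and pass it as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`, and the window convention here is an assumption:

```python
from functools import lru_cache

@lru_cache(maxsize=8)
def sliding_window_mask(seq_len: int, window: int):
    """Boolean mask: position i may attend to j iff i - window < j <= i.

    Cached, so repeated forward passes with the same shape reuse one mask
    instead of rebuilding it every call.
    """
    return tuple(
        tuple(i - window < j <= i for j in range(seq_len))
        for i in range(seq_len)
    )
```

Because the cache key is `(seq_len, window)`, a fixed-shape training run pays the construction cost exactly once.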
A few practical notes:
- start with `AR_PRESET=12gb` on 12 GB cards
- if compilation is slow or unstable, keep `AR_COMPILE=0`
- if VRAM is tight, lower `AR_DEVICE_BATCH_SIZE` first
- the fork now uses the validated `L` window pattern and longer warmdown by default for the `12gb` preset
- `AR_PEAK_FLOPS` can be used to override the detected peak FLOPS if needed
- this fork is meant to be easier to start, not universally portable
This path has been validated on an RTX 5070 12 GB setup.
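For context on what the `AR_PEAK_FLOPS` override affects, MFU is just achieved FLOP/s divided by peak FLOP/s. This sketch uses the standard ~6N FLOPs-per-token estimate for forward+backward; the function name and signature are illustrative, not `train.py`'s actual code:

```python
import os

def mfu(n_params: int, tokens_per_sec: float, detected_peak_flops: float) -> float:
    """Model FLOPs utilization: achieved FLOP/s over peak FLOP/s.

    AR_PEAK_FLOPS, when set, overrides the detected peak, which is useful
    when auto-detection is wrong or you want comparable MFU across runs.
    """
    peak = float(os.environ.get("AR_PEAK_FLOPS", detected_peak_flops))
    achieved = 6 * n_params * tokens_per_sec  # ~6N FLOPs per token (fwd+bwd)
    return achieved / peak
```

Pinning the peak via `AR_PEAK_FLOPS` makes MFU numbers comparable across machines even if detection differs.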
The reusable idea here is the loop shape:
- keep one mutable file
- keep the evaluation harness fixed
- run on a fixed time budget
- record a compact summary
- keep or discard based on the result
`program.md` contains the baseline agent instructions for that loop.
- `prepare.py`: fixed data prep and evaluation utilities
- `train.py`: mutable experiment surface
- `program.md`: agent instructions
- `log_result.py`: append summaries to `results.tsv`
- `pyproject.toml`: dependencies
- `results.tsv`: experiment ledger
- single machine
- single NVIDIA GPU
- short runs over maximum throughput
- inspectable code over configuration sprawl
- minimal dependencies
If you need distributed training, broad hardware portability, or a larger experiment platform, this repo is the wrong base.
MIT

