karpathy · aniruddhaadak80 · Mar 10, 2026 · Copilot · Mar 10, 2026 · Copilot
diff --git a/program.md b/program.md
@@ -93,21 +93,20 @@ The experiment runs on a dedicated branch (e.g. `autoresearch/mar5` or `autorese
 
 LOOP FOREVER:
 
-1. Look at the git state: the current branch/commit we're on
-2. Tune `train.py` with an experimental idea by directly hacking the code.
-3. git commit
-4. Run the experiment: `uv run train.py > run.log 2>&1` (redirect everything — do NOT use tee or let output flood your context)
-5. Read out the results: `grep "^val_bpb:\|^peak_vram_mb:" run.log`
-6. If the grep output is empty, the run crashed. Run `tail -n 50 run.log` to read the Python stack trace and attempt a fix. If you can't get things to work after more than a few attempts, give up.
-7. Record the results in the tsv (NOTE: do not commit the results.tsv file, leave it untracked by git)
-8. If val_bpb improved (lower), you "advance" the branch, keeping the git commit
-9. If val_bpb is equal or worse, you git reset back to where you started
-
-The idea is that you are a completely autonomous researcher trying things out. If they work, keep. If they don't, discard. And you're advancing the branch so that you can iterate. If you feel like you're getting stuck in some way, you can rewind but you should probably do this very very sparingly (if ever).
-
-**Timeout**: Each experiment should take ~5 minutes total (+ a few seconds for startup and eval overhead). If a run exceeds 10 minutes, kill it and treat it as a failure (discard and revert).
-
-**Crashes**: If a run crashes (OOM, or a bug, or etc.), use your judgment: If it's something dumb and easy to fix (e.g. a typo, a missing import), fix it and re-run. If the idea itself is fundamentally broken, just skip it, log "crash" as the status in the tsv, and move on.
+1. Tune `train.py` with an experimental idea by directly hacking the code.
+2. Run the Critic Sandbox: `python sandbox.py "Short description of what you changed"`
+3. The sandbox will:
+   - Lint the code for syntax checks.
+   - Run the 5 minute experiment.
+   - Extract `val_bpb` and `peak_vram_mb`.
+   - Log the outcome directly to `results.tsv`.
+   - Automatically `git commit` (if improved) or `git reset` (if degraded/crashed).
+4. Read the `results.tsv` tail to understand what the Sandbox just did.
+5. If the outcome was a crash you think you can easily fix (e.g., typo or minor syntax error seen in `run.log`), re-tune `train.py` and run the sandbox again.
+
+The idea is that you are a completely autonomous researcher trying things out. The Sandbox handles the Git and execution discipline. If things work, the Sandbox keeps them. If they don't, the Sandbox discards them. If you feel like you're getting stuck in some way, review the code and brainstorm radically new directions.
+
+**Timeout**: The Sandbox handles the timeout. If a run crashes (OOM or bug) and the Sandbox rolled back, read `run.log` to diagnose, adapt your idea, and quickly try again. If it is fundamentally broken, just skip it—the Sandbox already logged "crash"—and move on to a new idea.
 
 **NEVER STOP**: Once the experiment loop has begun (after the initial setup), do NOT pause to ask the human if you should continue. Do NOT ask "should I keep going?" or "is this a good stopping point?". The human might be asleep, or gone from a computer and expects you to continue working *indefinitely* until you are manually stopped. You are autonomous. If you run out of ideas, think harder — read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes. The loop runs until the human interrupts you, period.
 

diff --git a/sandbox.py b/sandbox.py
@@ -0,0 +1,105 @@
+#!/usr/bin/env python3
+import subprocess
+import sys
+import os
+
+def get_best_bpb():
+    best_bpb = float('inf')
+    if os.path.exists('results.tsv'):
+        with open('results.tsv', 'r', encoding='utf-8') as f:
+            for line in f.readlines()[1:]: # skip header
+                parts = line.strip().split('\t')
+                if len(parts) >= 4 and parts[3].upper() == 'KEEP':
+                    try:
+                        bpb = float(parts[1])
+                        if bpb < best_bpb:
+                            best_bpb = bpb
+                    except ValueError:
+                        pass
+    return best_bpb
+
+def run_experiment(description):
+    print(f"🔬 Sandbox Critic: Starting experiment '{description}'...")
+
+    # 1. Check syntax first (Critic: Linter)
+    print("🧹 Critic Phase 1: Linting and compiling train.py...")
+    comp = subprocess.run([sys.executable, "-m", "py_compile", "train.py"], capture_output=True, text=True)
+    if comp.returncode != 0:
+        print("❌ CRASH: train.py has Python syntax errors!")
+        print(comp.stderr)
+        rollback()
+        log_result("N/A", "N/A", "CRASH", description)
+        return
+
+    # 2. Run the training script
+    print("⏳ Critic Phase 2: Running 5-minute training budget... (tail run.log for live output)")
+    with open("run.log", "w", encoding="utf-8") as f:
+        train = subprocess.run(["uv", "run", "train.py"], stdout=f, stderr=subprocess.STDOUT)
+
+    # 3. Analyze the outcomes
+    print("📊 Critic Phase 3: Analyzing results...")
+
+    # Extract metrics
+    val_bpb = None
+    peak_vram = None
+    if os.path.exists("run.log"):
+        with open("run.log", "r", encoding="utf-8") as f:
+            for line in f:
+                if line.startswith("val_bpb:"):
+                    try:
+                        val_bpb = float(line.split(":")[1].strip())
+                    except ValueError: pass
+                elif line.startswith("peak_vram_mb:"):
+                    try:
+                        peak_vram = float(line.split(":")[1].strip())
+                    except ValueError: pass
+
+    if train.returncode != 0 or val_bpb is None:
+        print("❌ CRASH: Runtime error, OOM, or missing metrics detected in run.log.")
+        rollback()
+        log_result(val_bpb if val_bpb else "N/A", peak_vram if peak_vram else "N/A", "CRASH", description)
+        return
+
+    print(f"📈 Result: val_bpb = {val_bpb:.6f}, RAM = {peak_vram}MB")
+
+    best_bpb = get_best_bpb()
+    if best_bpb == float('inf'):
+        print("ℹ️ No previous KEEP records found in results.tsv. Setting baseline.")
+        best_bpb = 5.0 # Loose initial fallback
+
+    if val_bpb < best_bpb:
+        print(f"✅ SUCCESS: BPB ({val_bpb:.4f}) improved over best ({best_bpb:.4f})! Keeping changes.")
+        subprocess.run(["git", "commit", "-am", f"KEEP: {description} (BPB: {val_bpb:.4f})"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+        log_result(val_bpb, peak_vram, "KEEP", description)
+    else:
+        print(f"📉 FAIL: BPB {val_bpb:.6f} is not better than best {best_bpb:.6f}. Discarding.")
+        rollback()
+        log_result(val_bpb, peak_vram, "DISCARD", description)
+
+def log_result(bpb, vram, status, desc):
+    if not os.path.exists("results.tsv"):
+        with open("results.tsv", "w", encoding="utf-8") as f:
+            f.write("commit\tval_bpb\tmemory_gb\tstatus\tdescription\n")
+
+    commit_hash = "N/A"
+    try:
+        commit_hash = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode('utf-8').strip()
+    except subprocess.CalledProcessError: pass
+
+    mem_gb = f"{float(vram) / 1024:.2f}" if vram != "N/A" else "N/A"
+    bpb_str = f"{float(bpb):.6f}" if bpb != "N/A" else "N/A"
+
+    with open("results.tsv", "a", encoding="utf-8") as f:
+        f.write(f"{commit_hash}\t{bpb_str}\t{mem_gb}\t{status}\t{desc}\n")
+
+def rollback():
+    print("⏪ Rolling back to previous stable git state...")
+    subprocess.run(["git", "reset", "--hard", "HEAD"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+    subprocess.run(["git", "clean", "-fd"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        print("Usage: python sandbox.py \"Description of your experiment\"")
+        sys.exit(1)
+
+    run_experiment(" ".join(sys.argv[1:]))