feat: Actor-Critic Sandbox execution loop by aniruddhaadak80 · Pull Request #116 · karpathy/autoresearch

aniruddhaadak80 · 2026-03-10T07:45:14Z

Summary

Introduces a robust sandbox.py Actor-Critic orchestration layer that safely encapsulates LLM code execution. Previously, LLMs directly interacted with uv run train.py, which caused runaway errors, unlogged crashes, and merge/git state fragmentation.

Key Capabilities

Fail-Safe Pre-flight: Runs py_compile to catch syntax errors before allocating heavy GPU resources.
Automated Validation: Parses run.log directly and handles results.tsv management.
Git Branch Enforcement: Issues strict git commit checkpoints when val_bpb successfully improves, and actively issues git reset --hard when degrading or crashing, safely returning to a known working state.
Updated Prompting: Modified program.md specifically instructing the AI researchers to rely on this new sandbox.

Usage

python sandbox.py "Short description of what idea was changed"

Copilot

Pull request overview

Adds a new sandbox.py orchestration script to automate the experiment execution loop (syntax-check → run → parse metrics → log results → keep/discard via git), and updates the documented workflow in program.md to route experiments through this sandbox.

Changes:

Introduce sandbox.py to run train.py, parse val_bpb / peak_vram_mb, append to results.tsv, and auto keep/discard changes with git.
Update program.md experiment loop instructions to use the sandbox instead of manual commands.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.

File	Description
sandbox.py	New experiment runner that compiles, executes training, parses metrics, logs results, and performs git keep/discard actions.
program.md	Updates the experiment loop instructions to invoke `sandbox.py` as the orchestration layer.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-10T07:52:12Z

program.md

+The idea is that you are a completely autonomous researcher trying things out. The Sandbox handles the Git and execution discipline. If things work, the Sandbox keeps them. If they don't, the Sandbox discards them. If you feel like you're getting stuck in some way, review the code and brainstorm radically new directions.
+
+**Timeout**: The Sandbox handles the timeout. If a run crashes (OOM or bug) and the Sandbox rolled back, read `run.log` to diagnose, adapt your idea, and quickly try again. If it is fundamentally broken, just skip it—the Sandbox already logged "crash"—and move on to a new idea.


This section says the Sandbox "handles the timeout" and suggests reading run.log after a crash, but sandbox.py currently has no timeout enforcement and its rollback deletes run.log via git clean -fd. Please align the docs with the implementation (or update the implementation to match).

Copilot · 2026-03-10T07:52:12Z

sandbox.py

+def rollback():
+    print("⏪ Rolling back to previous stable git state...")
+    subprocess.run(["git", "reset", "--hard", "HEAD"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+    subprocess.run(["git", "clean", "-fd"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


rollback() runs git clean -fd, which will delete run.log because it is an untracked, non-ignored file. That makes post-crash diagnosis via run.log impossible (and contradicts the doc guidance). Consider excluding run.log (and any other diagnostics you want to keep) from cleaning, moving logs into an ignored directory, or dropping git clean here.

Copilot · 2026-03-10T07:52:12Z

sandbox.py

+    print(f"📈 Result: val_bpb = {val_bpb:.6f}, RAM = {peak_vram}MB")
+
+    best_bpb = get_best_bpb()
+    if best_bpb == float('inf'):
+        print("ℹ️ No previous KEEP records found in results.tsv. Setting baseline.")
+        best_bpb = 5.0 # Loose initial fallback
+
+    if val_bpb < best_bpb:
+        print(f"✅ SUCCESS: BPB ({val_bpb:.4f}) improved over best ({best_bpb:.4f})! Keeping changes.")
+        subprocess.run(["git", "commit", "-am", f"KEEP: {description} (BPB: {val_bpb:.4f})"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+        log_result(val_bpb, peak_vram, "KEEP", description)


On the success path you call log_result(val_bpb, peak_vram, ...) even if peak_vram was never parsed (still None). log_result() then does float(vram) which will raise TypeError for None. Either require peak_vram to be present before treating a run as successful, or make log_result robust to None (e.g., treat as N/A / NaN).

Copilot · 2026-03-10T07:52:12Z

sandbox.py

+    if val_bpb < best_bpb:
+        print(f"✅ SUCCESS: BPB ({val_bpb:.4f}) improved over best ({best_bpb:.4f})! Keeping changes.")
+        subprocess.run(["git", "commit", "-am", f"KEEP: {description} (BPB: {val_bpb:.4f})"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+        log_result(val_bpb, peak_vram, "KEEP", description)


The git commit call ignores failures (e.g., nothing staged/changed, not in a git repo, hooks rejecting the commit). If the commit fails, log_result() will still record the current HEAD, which can misattribute results. Capture and check the commit return code (and/or stage the intended files explicitly) so KEEP results are reliably associated with a new commit.

Copilot · 2026-03-10T07:52:13Z

sandbox.py

+def log_result(bpb, vram, status, desc):
+    if not os.path.exists("results.tsv"):
+        with open("results.tsv", "w", encoding="utf-8") as f:
+            f.write("commit\tval_bpb\tmemory_gb\tstatus\tdescription\n")
+
+    commit_hash = "N/A"
+    try:
+        commit_hash = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode('utf-8').strip()
+    except subprocess.CalledProcessError: pass
+
+    mem_gb = f"{float(vram) / 1024:.2f}" if vram != "N/A" else "N/A"
+    bpb_str = f"{float(bpb):.6f}" if bpb != "N/A" else "N/A"
+
+    with open("results.tsv", "a", encoding="utf-8") as f:
+        f.write(f"{commit_hash}\t{bpb_str}\t{mem_gb}\t{status}\t{desc}\n")


log_result() writes "N/A" strings for BPB/memory on crashes and uses uppercase statuses (KEEP/DISCARD/CRASH). That disagrees with the program.md logging spec (numeric 0.0 values; lowercase status) and can make downstream analysis more brittle. Consider logging numeric defaults for crashes and/or keeping status casing consistent with the documented format.

Copilot · 2026-03-10T07:52:13Z

sandbox.py

@@ -0,0 +1,105 @@
+#!/usr/bin/env python3


The file starts with a UTF-8 BOM before the shebang (#!/usr/bin/env python3). That BOM can prevent the shebang from being recognized when executing the script directly. Please remove the BOM so the first bytes of the file are #!.

Copilot · 2026-03-10T07:52:13Z

sandbox.py

+    print("⏳ Critic Phase 2: Running 5-minute training budget... (tail run.log for live output)")
+    with open("run.log", "w", encoding="utf-8") as f:
+        train = subprocess.run(["uv", "run", "train.py"], stdout=f, stderr=subprocess.STDOUT)


subprocess.run(["uv", "run", "train.py"], ...) does not enforce the advertised 5-minute budget; it can hang indefinitely, and program.md claims the sandbox handles timeouts. Add a timeout and handle subprocess.TimeoutExpired (treat as CRASH, rollback, and log).

Copilot · 2026-03-10T07:52:13Z

sandbox.py

+    if train.returncode != 0 or val_bpb is None:
+        print("❌ CRASH: Runtime error, OOM, or missing metrics detected in run.log.")
+        rollback()
+        log_result(val_bpb if val_bpb else "N/A", peak_vram if peak_vram else "N/A", "CRASH", description)
+        return


In the crash path you use val_bpb if val_bpb else "N/A" and the same for peak_vram. This treats legitimate 0.0 values as missing because 0.0 is falsy. Use an explicit is not None check (or normalize to numeric defaults) so the log output is correct.

Copilot · 2026-03-10T07:52:13Z

sandbox.py

+    best_bpb = get_best_bpb()
+    if best_bpb == float('inf'):
+        print("ℹ️ No previous KEEP records found in results.tsv. Setting baseline.")
+        best_bpb = 5.0 # Loose initial fallback
+


If there are no prior KEEP records, best_bpb is set to 5.0. That can cause the very first successful run to be discarded when val_bpb >= 5.0, which conflicts with the idea of establishing a baseline on the first run. Consider treating inf as “no baseline yet” and KEEP the first non-crashing result (or initialize best_bpb from the first run’s val_bpb).

aniruddhaadak80 · 2026-03-10T08:35:17Z

@copilot open a new pull request to apply changes based on the comments in this thread

feat: implement Actor-Critic sandbox loop orchestration

6ac040d

Copilot AI review requested due to automatic review settings March 10, 2026 07:45

Copilot started reviewing on behalf of aniruddhaadak80 March 10, 2026 07:45 View session

Copilot AI reviewed Mar 10, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Actor-Critic Sandbox execution loop#116

feat: Actor-Critic Sandbox execution loop#116
aniruddhaadak80 wants to merge 1 commit intokarpathy:masterfrom
aniruddhaadak80:feat/actor-critic-loop

aniruddhaadak80 commented Mar 10, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 10, 2026

Uh oh!

Copilot AI Mar 10, 2026

Uh oh!

Copilot AI Mar 10, 2026

Uh oh!

Copilot AI Mar 10, 2026

Uh oh!

Copilot AI Mar 10, 2026

Uh oh!

Copilot AI Mar 10, 2026

Uh oh!

Copilot AI Mar 10, 2026

Uh oh!

Copilot AI Mar 10, 2026

Uh oh!

Copilot AI Mar 10, 2026

Uh oh!

aniruddhaadak80 commented Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		The idea is that you are a completely autonomous researcher trying things out. The Sandbox handles the Git and execution discipline. If things work, the Sandbox keeps them. If they don't, the Sandbox discards them. If you feel like you're getting stuck in some way, review the code and brainstorm radically new directions.

		Timeout: The Sandbox handles the timeout. If a run crashes (OOM or bug) and the Sandbox rolled back, read `run.log` to diagnose, adapt your idea, and quickly try again. If it is fundamentally broken, just skip it—the Sandbox already logged "crash"—and move on to a new idea.

Conversation

aniruddhaadak80 commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key Capabilities

Usage

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

Uh oh!

aniruddhaadak80 commented Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

aniruddhaadak80 commented Mar 10, 2026 •

edited

Loading