Skip to content

feat: Actor-Critic Sandbox execution loop#116

Open
aniruddhaadak80 wants to merge 1 commit intokarpathy:masterfrom
aniruddhaadak80:feat/actor-critic-loop
Open

feat: Actor-Critic Sandbox execution loop#116
aniruddhaadak80 wants to merge 1 commit intokarpathy:masterfrom
aniruddhaadak80:feat/actor-critic-loop

Conversation

@aniruddhaadak80
Copy link

@aniruddhaadak80 aniruddhaadak80 commented Mar 10, 2026

Summary

Introduces a robust sandbox.py Actor-Critic orchestration layer that safely encapsulates LLM code execution. Previously, LLMs directly interacted with uv run train.py, which caused runaway errors, unlogged crashes, and merge/git state fragmentation.

Key Capabilities

  • Fail-Safe Pre-flight: Runs py_compile to catch syntax errors before allocating heavy GPU resources.
  • Automated Validation: Parses run.log directly and handles results.tsv management.
  • Git Branch Enforcement: Issues strict git commit checkpoints when val_bpb successfully improves, and actively issues git reset --hard when degrading or crashing, safely returning to a known working state.
  • Updated Prompting: Modified program.md specifically instructing the AI researchers to rely on this new sandbox.

Usage

python sandbox.py "Short description of what idea was changed"

Copilot AI review requested due to automatic review settings March 10, 2026 07:45
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new sandbox.py orchestration script to automate the experiment execution loop (syntax-check → run → parse metrics → log results → keep/discard via git), and updates the documented workflow in program.md to route experiments through this sandbox.

Changes:

  • Introduce sandbox.py to run train.py, parse val_bpb / peak_vram_mb, append to results.tsv, and auto keep/discard changes with git.
  • Update program.md experiment loop instructions to use the sandbox instead of manual commands.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.

File Description
sandbox.py New experiment runner that compiles, executes training, parses metrics, logs results, and performs git keep/discard actions.
program.md Updates the experiment loop instructions to invoke sandbox.py as the orchestration layer.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +107 to +109
The idea is that you are a completely autonomous researcher trying things out. The Sandbox handles the Git and execution discipline. If things work, the Sandbox keeps them. If they don't, the Sandbox discards them. If you feel like you're getting stuck in some way, review the code and brainstorm radically new directions.

**Timeout**: The Sandbox handles the timeout. If a run crashes (OOM or bug) and the Sandbox rolled back, read `run.log` to diagnose, adapt your idea, and quickly try again. If it is fundamentally broken, just skip it—the Sandbox already logged "crash"—and move on to a new idea.
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section says the Sandbox "handles the timeout" and suggests reading run.log after a crash, but sandbox.py currently has no timeout enforcement and its rollback deletes run.log via git clean -fd. Please align the docs with the implementation (or update the implementation to match).

Copilot uses AI. Check for mistakes.
Comment on lines +95 to +98
def rollback():
print("⏪ Rolling back to previous stable git state...")
subprocess.run(["git", "reset", "--hard", "HEAD"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.run(["git", "clean", "-fd"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rollback() runs git clean -fd, which will delete run.log because it is an untracked, non-ignored file. That makes post-crash diagnosis via run.log impossible (and contradicts the doc guidance). Consider excluding run.log (and any other diagnostics you want to keep) from cleaning, moving logs into an ignored directory, or dropping git clean here.

Copilot uses AI. Check for mistakes.
Comment on lines +63 to +73
print(f"📈 Result: val_bpb = {val_bpb:.6f}, RAM = {peak_vram}MB")

best_bpb = get_best_bpb()
if best_bpb == float('inf'):
print("ℹ️ No previous KEEP records found in results.tsv. Setting baseline.")
best_bpb = 5.0 # Loose initial fallback

if val_bpb < best_bpb:
print(f"✅ SUCCESS: BPB ({val_bpb:.4f}) improved over best ({best_bpb:.4f})! Keeping changes.")
subprocess.run(["git", "commit", "-am", f"KEEP: {description} (BPB: {val_bpb:.4f})"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
log_result(val_bpb, peak_vram, "KEEP", description)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the success path you call log_result(val_bpb, peak_vram, ...) even if peak_vram was never parsed (still None). log_result() then does float(vram) which will raise TypeError for None. Either require peak_vram to be present before treating a run as successful, or make log_result robust to None (e.g., treat as N/A / NaN).

Copilot uses AI. Check for mistakes.
Comment on lines +70 to +73
if val_bpb < best_bpb:
print(f"✅ SUCCESS: BPB ({val_bpb:.4f}) improved over best ({best_bpb:.4f})! Keeping changes.")
subprocess.run(["git", "commit", "-am", f"KEEP: {description} (BPB: {val_bpb:.4f})"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
log_result(val_bpb, peak_vram, "KEEP", description)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The git commit call ignores failures (e.g., nothing staged/changed, not in a git repo, hooks rejecting the commit). If the commit fails, log_result() will still record the current HEAD, which can misattribute results. Capture and check the commit return code (and/or stage the intended files explicitly) so KEEP results are reliably associated with a new commit.

Copilot uses AI. Check for mistakes.
Comment on lines +79 to +93
def log_result(bpb, vram, status, desc):
if not os.path.exists("results.tsv"):
with open("results.tsv", "w", encoding="utf-8") as f:
f.write("commit\tval_bpb\tmemory_gb\tstatus\tdescription\n")

commit_hash = "N/A"
try:
commit_hash = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode('utf-8').strip()
except subprocess.CalledProcessError: pass

mem_gb = f"{float(vram) / 1024:.2f}" if vram != "N/A" else "N/A"
bpb_str = f"{float(bpb):.6f}" if bpb != "N/A" else "N/A"

with open("results.tsv", "a", encoding="utf-8") as f:
f.write(f"{commit_hash}\t{bpb_str}\t{mem_gb}\t{status}\t{desc}\n")
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log_result() writes "N/A" strings for BPB/memory on crashes and uses uppercase statuses (KEEP/DISCARD/CRASH). That disagrees with the program.md logging spec (numeric 0.0 values; lowercase status) and can make downstream analysis more brittle. Consider logging numeric defaults for crashes and/or keeping status casing consistent with the documented format.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,105 @@
#!/usr/bin/env python3
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file starts with a UTF-8 BOM before the shebang (#!/usr/bin/env python3). That BOM can prevent the shebang from being recognized when executing the script directly. Please remove the BOM so the first bytes of the file are #!.

Copilot uses AI. Check for mistakes.
Comment on lines +35 to +37
print("⏳ Critic Phase 2: Running 5-minute training budget... (tail run.log for live output)")
with open("run.log", "w", encoding="utf-8") as f:
train = subprocess.run(["uv", "run", "train.py"], stdout=f, stderr=subprocess.STDOUT)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subprocess.run(["uv", "run", "train.py"], ...) does not enforce the advertised 5-minute budget; it can hang indefinitely, and program.md claims the sandbox handles timeouts. Add a timeout and handle subprocess.TimeoutExpired (treat as CRASH, rollback, and log).

Copilot uses AI. Check for mistakes.
Comment on lines +57 to +61
if train.returncode != 0 or val_bpb is None:
print("❌ CRASH: Runtime error, OOM, or missing metrics detected in run.log.")
rollback()
log_result(val_bpb if val_bpb else "N/A", peak_vram if peak_vram else "N/A", "CRASH", description)
return
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the crash path you use val_bpb if val_bpb else "N/A" and the same for peak_vram. This treats legitimate 0.0 values as missing because 0.0 is falsy. Use an explicit is not None check (or normalize to numeric defaults) so the log output is correct.

Copilot uses AI. Check for mistakes.
Comment on lines +65 to +69
best_bpb = get_best_bpb()
if best_bpb == float('inf'):
print("ℹ️ No previous KEEP records found in results.tsv. Setting baseline.")
best_bpb = 5.0 # Loose initial fallback

Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are no prior KEEP records, best_bpb is set to 5.0. That can cause the very first successful run to be discarded when val_bpb >= 5.0, which conflicts with the idea of establishing a baseline on the first run. Consider treating inf as “no baseline yet” and KEEP the first non-crashing result (or initialize best_bpb from the first run’s val_bpb).

Copilot uses AI. Check for mistakes.
@aniruddhaadak80
Copy link
Author

@copilot open a new pull request to apply changes based on the comments in this thread

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants