29 changes: 14 additions & 15 deletions program.md
@@ -93,21 +93,20 @@ The experiment runs on a dedicated branch (e.g. `autoresearch/mar5` or `autorese

LOOP FOREVER:

1. Look at the git state: the current branch/commit we're on
2. Tune `train.py` with an experimental idea by directly hacking the code.
3. git commit
4. Run the experiment: `uv run train.py > run.log 2>&1` (redirect everything — do NOT use tee or let output flood your context)
5. Read out the results: `grep "^val_bpb:\|^peak_vram_mb:" run.log`
6. If the grep output is empty, the run crashed. Run `tail -n 50 run.log` to read the Python stack trace and attempt a fix. If you can't get things to work after more than a few attempts, give up.
7. Record the results in the tsv (NOTE: do not commit the results.tsv file, leave it untracked by git)
8. If val_bpb improved (lower), you "advance" the branch, keeping the git commit
9. If val_bpb is equal or worse, you git reset back to where you started

The idea is that you are a completely autonomous researcher trying things out. If they work, keep. If they don't, discard. And you're advancing the branch so that you can iterate. If you feel like you're getting stuck in some way, you can rewind but you should probably do this very very sparingly (if ever).

**Timeout**: Each experiment should take ~5 minutes total (+ a few seconds for startup and eval overhead). If a run exceeds 10 minutes, kill it and treat it as a failure (discard and revert).

**Crashes**: If a run crashes (OOM, or a bug, or etc.), use your judgment: If it's something dumb and easy to fix (e.g. a typo, a missing import), fix it and re-run. If the idea itself is fundamentally broken, just skip it, log "crash" as the status in the tsv, and move on.
1. Tune `train.py` with an experimental idea by directly hacking the code.
2. Run the Critic Sandbox: `python sandbox.py "Short description of what you changed"`
3. The sandbox will:
   - Lint the code for syntax checks.
   - Run the 5 minute experiment.
   - Extract `val_bpb` and `peak_vram_mb`.
   - Log the outcome directly to `results.tsv`.
   - Automatically `git commit` (if improved) or `git reset` (if degraded/crashed).
4. Read the `results.tsv` tail to understand what the Sandbox just did.
5. If the outcome was a crash you think you can easily fix (e.g., typo or minor syntax error seen in `run.log`), re-tune `train.py` and run the sandbox again.

The idea is that you are a completely autonomous researcher trying things out. The Sandbox handles the Git and execution discipline. If things work, the Sandbox keeps them. If they don't, the Sandbox discards them. If you feel like you're getting stuck in some way, review the code and brainstorm radically new directions.

**Timeout**: The Sandbox handles the timeout. If a run crashes (OOM or bug) and the Sandbox rolled back, read `run.log` to diagnose, adapt your idea, and quickly try again. If it is fundamentally broken, just skip it—the Sandbox already logged "crash"—and move on to a new idea.
Comment on lines +107 to +109
Copilot AI Mar 10, 2026

This section says the Sandbox "handles the timeout" and suggests reading run.log after a crash, but sandbox.py currently has no timeout enforcement and its rollback deletes run.log via git clean -fd. Please align the docs with the implementation (or update the implementation to match).

Copilot uses AI. Check for mistakes.

**NEVER STOP**: Once the experiment loop has begun (after the initial setup), do NOT pause to ask the human if you should continue. Do NOT ask "should I keep going?" or "is this a good stopping point?". The human might be asleep, or gone from a computer and expects you to continue working *indefinitely* until you are manually stopped. You are autonomous. If you run out of ideas, think harder — read papers referenced in the code, re-read the in-scope files for new angles, try combining previous near-misses, try more radical architectural changes. The loop runs until the human interrupts you, period.

105 changes: 105 additions & 0 deletions sandbox.py
@@ -0,0 +1,105 @@
#!/usr/bin/env python3
Copilot AI Mar 10, 2026

The file starts with a UTF-8 BOM before the shebang (#!/usr/bin/env python3). That BOM can prevent the shebang from being recognized when executing the script directly. Please remove the BOM so the first bytes of the file are #!.
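One way to act on this is a small one-off cleanup, sketched below. The helper names `strip_utf8_bom` and `fix_file` are hypothetical and not part of this PR; the sketch only assumes the BOM is the standard three-byte UTF-8 sequence.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def fix_file(path: str) -> bool:
    """Rewrite the file at path without its BOM; return True if it was changed."""
    with open(path, "rb") as f:
        original = f.read()
    cleaned = strip_utf8_bom(original)
    if cleaned == original:
        return False
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

Running `fix_file("sandbox.py")` once and committing the result should leave `#!` as the first two bytes of the file.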

import subprocess
import sys
import os

def get_best_bpb():
    best_bpb = float('inf')
    if os.path.exists('results.tsv'):
        with open('results.tsv', 'r', encoding='utf-8') as f:
            for line in f.readlines()[1:]: # skip header
                parts = line.strip().split('\t')
                if len(parts) >= 4 and parts[3].upper() == 'KEEP':
                    try:
                        bpb = float(parts[1])
                        if bpb < best_bpb:
                            best_bpb = bpb
                    except ValueError:
                        pass
    return best_bpb

def run_experiment(description):
    print(f"🔬 Sandbox Critic: Starting experiment '{description}'...")

    # 1. Check syntax first (Critic: Linter)
    print("🧹 Critic Phase 1: Linting and compiling train.py...")
    comp = subprocess.run([sys.executable, "-m", "py_compile", "train.py"], capture_output=True, text=True)
    if comp.returncode != 0:
        print("❌ CRASH: train.py has Python syntax errors!")
        print(comp.stderr)
        rollback()
        log_result("N/A", "N/A", "CRASH", description)
        return

    # 2. Run the training script
    print("⏳ Critic Phase 2: Running 5-minute training budget... (tail run.log for live output)")
    with open("run.log", "w", encoding="utf-8") as f:
        train = subprocess.run(["uv", "run", "train.py"], stdout=f, stderr=subprocess.STDOUT)
Comment on lines +35 to +37
Copilot AI Mar 10, 2026

subprocess.run(["uv", "run", "train.py"], ...) does not enforce the advertised 5-minute budget; it can hang indefinitely, and program.md claims the sandbox handles timeouts. Add a timeout and handle subprocess.TimeoutExpired (treat as CRASH, rollback, and log).
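A minimal sketch of what that could look like, not the PR author's implementation. The helper name `run_with_timeout` is hypothetical, and the 600-second default is an assumption taken from the previous program.md wording ("If a run exceeds 10 minutes, kill it").

```python
import subprocess


def run_with_timeout(cmd, log_path="run.log", timeout_s=600):
    """Run cmd with stdout/stderr redirected to log_path.

    Returns the process exit code, or None if the run exceeded timeout_s.
    On timeout, subprocess.run kills the child it started before raising.
    """
    with open(log_path, "w", encoding="utf-8") as log:
        try:
            proc = subprocess.run(
                cmd, stdout=log, stderr=subprocess.STDOUT, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            return None
    return proc.returncode
```

The caller would treat `None` the same way as a non-zero exit code: roll back and log a crash. One caveat: the kill is sent to the direct child (`uv` here), so it may be worth checking that the training process itself does not outlive it.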


    # 3. Analyze the outcomes
    print("📊 Critic Phase 3: Analyzing results...")

    # Extract metrics
    val_bpb = None
    peak_vram = None
    if os.path.exists("run.log"):
        with open("run.log", "r", encoding="utf-8") as f:
            for line in f:
                if line.startswith("val_bpb:"):
                    try:
                        val_bpb = float(line.split(":")[1].strip())
                    except ValueError: pass
                elif line.startswith("peak_vram_mb:"):
                    try:
                        peak_vram = float(line.split(":")[1].strip())
                    except ValueError: pass

    if train.returncode != 0 or val_bpb is None:
        print("❌ CRASH: Runtime error, OOM, or missing metrics detected in run.log.")
        rollback()
        log_result(val_bpb if val_bpb else "N/A", peak_vram if peak_vram else "N/A", "CRASH", description)
        return
Comment on lines +57 to +61
Copilot AI Mar 10, 2026

In the crash path you use val_bpb if val_bpb else "N/A" and the same for peak_vram. This treats legitimate 0.0 values as missing because 0.0 is falsy. Use an explicit is not None check (or normalize to numeric defaults) so the log output is correct.
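The difference between the two checks can be sketched with a tiny helper. `fmt_metric` is a hypothetical name used for illustration, not something defined in sandbox.py.

```python
def fmt_metric(value):
    """Return "N/A" only when the metric is truly missing (None).

    A truthiness test (`value if value else "N/A"`) would also replace a
    legitimate 0.0, because 0.0 is falsy in Python.
    """
    return "N/A" if value is None else value
```

The crash path could then call `log_result(fmt_metric(val_bpb), fmt_metric(peak_vram), "CRASH", description)`.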


    print(f"📈 Result: val_bpb = {val_bpb:.6f}, RAM = {peak_vram}MB")

    best_bpb = get_best_bpb()
    if best_bpb == float('inf'):
        print("ℹ️ No previous KEEP records found in results.tsv. Setting baseline.")
        best_bpb = 5.0 # Loose initial fallback

Comment on lines +65 to +69
Copilot AI Mar 10, 2026

If there are no prior KEEP records, best_bpb is set to 5.0. That can cause the very first successful run to be discarded when val_bpb >= 5.0, which conflicts with the idea of establishing a baseline on the first run. Consider treating inf as “no baseline yet” and KEEP the first non-crashing result (or initialize best_bpb from the first run’s val_bpb).
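A sketch of the "no baseline yet" handling, assuming the 5.0 fallback is simply dropped. `should_keep` is a hypothetical helper name.

```python
import math


def should_keep(val_bpb: float, best_bpb: float) -> bool:
    """Decide whether a non-crashing run should be kept.

    best_bpb is float('inf') when results.tsv has no KEEP rows yet; in that
    case the first successful run becomes the baseline unconditionally.
    """
    if math.isinf(best_bpb):
        return True
    return val_bpb < best_bpb
```

Any finite `val_bpb` already compares less than `float('inf')`, so removing the `best_bpb = 5.0` assignment would be enough on its own; the explicit branch only makes the intent visible.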

    if val_bpb < best_bpb:
        print(f"✅ SUCCESS: BPB ({val_bpb:.4f}) improved over best ({best_bpb:.4f})! Keeping changes.")
        subprocess.run(["git", "commit", "-am", f"KEEP: {description} (BPB: {val_bpb:.4f})"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        log_result(val_bpb, peak_vram, "KEEP", description)
Comment on lines +63 to +73
Copilot AI Mar 10, 2026

On the success path you call log_result(val_bpb, peak_vram, ...) even if peak_vram was never parsed (still None). log_result() then does float(vram) which will raise TypeError for None. Either require peak_vram to be present before treating a run as successful, or make log_result robust to None (e.g., treat as N/A / NaN).
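One possible shape for the second option, making the formatting tolerant of both `None` and the existing `"N/A"` sentinel. `format_row_fields` is a hypothetical helper; the `.6f` and `.2f` formats mirror what `log_result()` already uses.

```python
def format_row_fields(bpb, vram_mb):
    """Format val_bpb and memory (GB) for results.tsv.

    None and the string "N/A" are both treated as missing; 0.0 is a real value.
    """
    missing = (None, "N/A")
    bpb_str = "N/A" if bpb in missing else f"{float(bpb):.6f}"
    mem_gb = "N/A" if vram_mb in missing else f"{float(vram_mb) / 1024:.2f}"
    return bpb_str, mem_gb
```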

Comment on lines +70 to +73
Copilot AI Mar 10, 2026

The git commit call ignores failures (e.g., nothing staged/changed, not in a git repo, hooks rejecting the commit). If the commit fails, log_result() will still record the current HEAD, which can misattribute results. Capture and check the commit return code (and/or stage the intended files explicitly) so KEEP results are reliably associated with a new commit.
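A hedged sketch of a commit step that reports failure instead of hiding it. `commit_changes` is a hypothetical helper, staging only `train.py` is an assumption about which files the sandbox should track, and the injectable `run` parameter exists only so the logic can be exercised without a real repository.

```python
import subprocess


def commit_changes(message, paths=("train.py",), run=subprocess.run):
    """Stage paths and commit them.

    Returns the new short commit hash, or None if any git step failed
    (nothing to commit, not a repository, a hook rejected the commit, ...).
    """
    steps = (
        ["git", "add", "--", *paths],
        ["git", "commit", "-m", message],
    )
    for cmd in steps:
        result = run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return None
    head = run(["git", "rev-parse", "--short", "HEAD"], capture_output=True, text=True)
    return head.stdout.strip() if head.returncode == 0 else None
```

With this shape, the success path would log KEEP only when a hash comes back, and would pass that hash to the logger rather than re-reading `HEAD`.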

    else:
        print(f"📉 FAIL: BPB {val_bpb:.6f} is not better than best {best_bpb:.6f}. Discarding.")
        rollback()
        log_result(val_bpb, peak_vram, "DISCARD", description)

def log_result(bpb, vram, status, desc):
    if not os.path.exists("results.tsv"):
        with open("results.tsv", "w", encoding="utf-8") as f:
            f.write("commit\tval_bpb\tmemory_gb\tstatus\tdescription\n")

    commit_hash = "N/A"
    try:
        commit_hash = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode('utf-8').strip()
    except subprocess.CalledProcessError: pass

    mem_gb = f"{float(vram) / 1024:.2f}" if vram != "N/A" else "N/A"
    bpb_str = f"{float(bpb):.6f}" if bpb != "N/A" else "N/A"

    with open("results.tsv", "a", encoding="utf-8") as f:
        f.write(f"{commit_hash}\t{bpb_str}\t{mem_gb}\t{status}\t{desc}\n")
Comment on lines +79 to +93
Copilot AI Mar 10, 2026

log_result() writes "N/A" strings for BPB/memory on crashes and uses uppercase statuses (KEEP/DISCARD/CRASH). That disagrees with the program.md logging spec (numeric 0.0 values; lowercase status) and can make downstream analysis more brittle. Consider logging numeric defaults for crashes and/or keeping status casing consistent with the documented format.
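A sketch of a row formatter following the convention this comment describes (numeric 0.0 for missing values, lowercase status). The exact program.md logging spec is not shown in this diff, so the column formats below are an assumption copied from the current `log_result()`; `format_tsv_row` is a hypothetical name.

```python
def format_tsv_row(commit, val_bpb, peak_vram_mb, status, description):
    """Build one results.tsv line: commit, val_bpb, memory_gb, status, description."""
    bpb = 0.0 if val_bpb is None else float(val_bpb)
    mem_gb = 0.0 if peak_vram_mb is None else float(peak_vram_mb) / 1024
    # Tabs or newlines inside the description would break the TSV layout.
    desc = description.replace("\t", " ").replace("\n", " ")
    return f"{commit}\t{bpb:.6f}\t{mem_gb:.2f}\t{status.lower()}\t{desc}\n"
```

Lowercase statuses remain compatible with `get_best_bpb()`, which compares `parts[3].upper()` against `'KEEP'`, and 0.0 crash rows cannot become the best score because only KEEP rows are considered.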


def rollback():
    print("⏪ Rolling back to previous stable git state...")
    subprocess.run(["git", "reset", "--hard", "HEAD"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    subprocess.run(["git", "clean", "-fd"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
Comment on lines +95 to +98
Copilot AI Mar 10, 2026

rollback() runs git clean -fd, which will delete run.log because it is an untracked, non-ignored file. That makes post-crash diagnosis via run.log impossible (and contradicts the doc guidance). Consider excluding run.log (and any other diagnostics you want to keep) from cleaning, moving logs into an ignored directory, or dropping git clean here.
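The first option could be sketched with `git clean`'s `-e <pattern>` exclude flag. Including `results.tsv` in the keep list is an assumption: program.md says that file stays untracked, and this diff does not show whether it is already covered by `.gitignore`. The `run` parameter is only there to make the sketch testable.

```python
import subprocess

# Untracked files that should survive a rollback (assumed list).
KEEP_UNTRACKED = ("run.log", "results.tsv")


def clean_command(keep=KEEP_UNTRACKED):
    """Build a `git clean -fd` command that skips the given patterns."""
    cmd = ["git", "clean", "-fd"]
    for pattern in keep:
        cmd += ["-e", pattern]
    return cmd


def rollback(run=subprocess.run):
    """Discard tracked changes, then remove untracked files except the kept ones."""
    run(["git", "reset", "--hard", "HEAD"], check=False)
    run(clean_command(), check=False)
```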


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python sandbox.py \"Description of your experiment\"")
        sys.exit(1)

    run_experiment(" ".join(sys.argv[1:]))