AGENTS.md (new file, +96 lines)
# Agent Runbook: Autoresearch Execution

Use `workflows/run_experiment.py` for all autoresearch execution. Do not use `workflow/dag.json` or task-graph-planner for running experiments.

## Core Rules

1. Keep all run artifacts under `workflows/runs/`.
2. Modify only `train.py` during experiments.
3. Never modify `prepare.py`.
4. Always run training through the script, which invokes `uv run train.py`.

## Natural-language to Command Mapping

- User says: "Start running the experiment, run 5 loops"
- Run: `python workflows/run_experiment.py start --loops 5`

- User says: "Run another 5 iterations"
- Run: `python workflows/run_experiment.py resume --loops 5`

- User says: "Resume run <run_id> and run 5 loops"
- Run: `python workflows/run_experiment.py resume --run-id <run_id> --loops 5`

- User says: "Only run setup and baseline"
- Run: `python workflows/run_experiment.py start --only setup,baseline`

- User says: "Only run training and decision parts in loops for 3 iterations"
- Run: `python workflows/run_experiment.py resume --loops 3 --only loop --loop-only train,record,decide`

- User says: "Show run status"
- Run: `python workflows/run_experiment.py status`

## Stage Controls

- Top-level stages: `setup`, `baseline`, `loop`
- Loop stages: `propose`, `apply`, `commit`, `train`, `triage`, `record`, `decide`

Supported control flags:

- `--only <comma-list>`: run only selected stages
- `--from-stage <setup|baseline|loop>` + `--to-stage <...>`: run a top-level stage range
- `--loop-only <comma-list>`: limit loop internals to selected stages
- `--loops N`: run `N` loop iterations

## Human Proposal Override

- You can inject a human-authored proposal for the next `propose` stage.
- Option A (run-scoped default): write JSON to `workflows/runs/<run_id>/next_proposal.json`.
- Option B (explicit path): pass `--proposal-file <path>` on `start`/`resume`.
- Proposal JSON must include exactly these keys:
- `status` (`ok` or `need_input`)
- `description`
- `change_plan`
- `commit_description`
- Canonical schema: `workflows/schemas/proposal.schema.json`
- Example:
```json
{
"status": "ok",
"description": "Try slightly higher learning rate with warmup unchanged.",
"change_plan": "In train.py, increase base LR by ~10% and keep all other settings unchanged.",
"commit_description": "experiment: bump LR by 10%"
}
```
- Precedence in propose stage:
1) `--proposal-file` (if present)
2) `workflows/runs/<run_id>/next_proposal.json`
3) stochastic proposal
4) deterministic fallback (`--no-stochastic`)
- When `next_proposal.json` is consumed, it is moved to `workflows/runs/<run_id>/consumed_proposals/iter_<NNNN>.json`.
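The "exactly these keys" requirement above can be checked with a few lines of Python. This is an illustrative helper, not part of the runner; the canonical rules live in `workflows/schemas/proposal.schema.json`:

```python
import json

REQUIRED_KEYS = {"status", "description", "change_plan", "commit_description"}
VALID_STATUSES = {"ok", "need_input"}

def validate_proposal(raw: str) -> dict:
    """Parse proposal JSON and require exactly the four documented keys."""
    proposal = json.loads(raw)
    keys = set(proposal)
    if keys != REQUIRED_KEYS:
        raise ValueError(f"expected exactly {sorted(REQUIRED_KEYS)}, got {sorted(keys)}")
    if proposal["status"] not in VALID_STATUSES:
        raise ValueError(f"status must be one of {sorted(VALID_STATUSES)}")
    return proposal

raw = json.dumps({
    "status": "ok",
    "description": "Try slightly higher learning rate with warmup unchanged.",
    "change_plan": "In train.py, increase base LR by ~10% and keep all other settings unchanged.",
    "commit_description": "experiment: bump LR by 10%",
})
print(validate_proposal(raw)["status"])  # ok
```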

## Resume Behavior

- The script checkpoints state at `workflows/runs/<run_id>/state.json`.
- If a loop iteration is partially complete, `resume` continues that iteration from the next pending stage.
- "Run another N iterations" means execute N more loop iterations from current state.
- Training runs start in the background by default (`--background-train`), and `resume` polls and continues in-flight baseline/train jobs.
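The "next pending stage" behavior can be sketched as a scan over the loop stage order. The `completed_stages` layout here is an assumption for illustration; the real `state.json` schema may differ:

```python
LOOP_STAGES = ["propose", "apply", "commit", "train", "triage", "record", "decide"]

def next_pending_stage(state: dict) -> "str | None":
    """Return the first loop stage not yet completed in this iteration.

    `completed_stages` is a hypothetical field name, not the actual
    state.json schema.
    """
    done = set(state.get("completed_stages", []))
    for stage in LOOP_STAGES:
        if stage not in done:
            return stage
    return None  # iteration fully complete

state = {"completed_stages": ["propose", "apply", "commit"]}
print(next_pending_stage(state))  # train
```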

## Logging and Observability

- Human-readable execution log: `workflows/runs/<run_id>/runner.log`
- Structured event log: `workflows/runs/<run_id>/history.jsonl`
- Checkpoint state: `workflows/runs/<run_id>/state.json`
- Per-iteration details (including opencode raw outputs): `workflows/runs/<run_id>/iterations/<NNNN>/`
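Because `history.jsonl` is one JSON object per line, it can be scanned with stdlib Python. The event field names in this sample are assumptions, not the runner's actual schema:

```python
import json

# Sample stand-in for workflows/runs/<run_id>/history.jsonl content
sample = "\n".join([
    '{"event": "stage_start", "stage": "train", "iteration": 3}',
    '{"event": "stage_end", "stage": "train", "iteration": 3, "status": "ok"}',
])

def iter_events(jsonl_text: str):
    """Yield one dict per non-empty line of a JSONL event log."""
    for line in jsonl_text.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)

events = list(iter_events(sample))
print(events[-1]["stage"])  # train
```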

## Run ID Policy

- Default run id: `<branch-slug>-rNNN`
- Example: branch `autoresearch/mar10` -> `autoresearch-mar10-r001`
- On `resume` without `--run-id`, the script picks the latest run for the current branch.
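A minimal sketch of the run-id derivation, matching the documented example. The exact slug rules are an assumption based on that example:

```python
import re

def default_run_id(branch: str, seq: int = 1) -> str:
    """Slugify a branch name and append a zero-padded run sequence number."""
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    return f"{slug}-r{seq:03d}"

print(default_run_id("autoresearch/mar10"))  # autoresearch-mar10-r001
```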

## Notes

- Use `--no-stochastic` only when opencode stochastic execution is unavailable.
- Setup auto-runs `uv run prepare.py` if cache/tokenizer are missing (disable via `--no-auto-prepare`).
- Background training is enabled by default; disable with `--no-background-train` for fully foreground execution.
- `results.tsv` is maintained in the repo root and should remain untracked.
PR.md (new file, +37 lines)
## Summary

This PR adds a fork-specific execution workflow for autonomous experiments, replacing session-by-session `program.md` interpretation with a deterministic runner plus an agent runbook.

### What’s included

- Add `workflows/run_experiment.py` as the single experiment orchestrator:
- `start`, `resume`, `status` commands
- top-level stage controls: `setup`, `baseline`, `loop`
- loop sub-stage controls: `propose`, `apply`, `commit`, `train`, `triage`, `record`, `decide`
- resumable checkpointing under `workflows/runs/<run_id>/`
- run-id policy: `<branch-slug>-rNNN`
- Add `AGENTS.md` runbook with explicit natural-language to command mapping for agent sessions.
- Improve observability:
- `runner.log` (human-readable timeline)
- `history.jsonl` (structured events)
- per-iteration artifacts under `workflows/runs/<run_id>/iterations/<NNNN>/`
- raw stochastic stage traces (`propose/apply/triage` OpenCode outputs)
- Add setup robustness:
- auto-run `uv run prepare.py` when cache/tokenizer is missing (default on, opt-out via `--no-auto-prepare`)
- explicit setup precondition checks before baseline/loop
- Add background training support:
- training stages start in background by default (`--background-train`)
- `resume` polls/continues in-flight baseline/train jobs
- Update README:
- OpenCode quickstart instructions
- fork-specific explanation of why this workflow layer (`run_experiment.py` + `AGENTS.md`) is better than a program.md-only execution style

## Why

In long-running autonomous sessions, prose-only execution is fragile and inconsistent.
This PR makes runs repeatable, resumable, and inspectable while preserving `program.md` as the policy/objective layer.

## Notes

- `results.tsv` remains untracked and is updated as run output.
- `prepare.py` is never modified; only `train.py` is intended to change during experiments.
README.md (+48 lines, excerpt)

The `program.md` file is essentially a super lightweight "skill".

### OpenCode quickstart

If you want to run this with OpenCode, use this flow:

1. Clone the repo.
2. Install OpenCode.
3. `cd` into the repo.
4. Run `opencode`.
5. Type: `Run the experiment for 1 loop`.

The agent will use `workflows/run_experiment.py` and kick off the run.

### Human proposal override

In this fork, you can provide your own proposal JSON for the next experiment iteration.

- Run-scoped default: `workflows/runs/<run_id>/next_proposal.json`
- Explicit file: pass `--proposal-file <path>` to `start`/`resume`
- Proposal schema: `workflows/schemas/proposal.schema.json`

Minimal example:

```json
{
"status": "ok",
"description": "Try a slightly higher base learning rate.",
"change_plan": "In train.py increase base LR by about 10% and keep other settings unchanged.",
"commit_description": "experiment: increase LR by 10%"
}
```
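To use the run-scoped default, drop the JSON into the run directory before the next `propose` stage. The run id below is illustrative:

```shell
# Write a run-scoped proposal for the next propose stage
RUN_DIR=workflows/runs/autoresearch-mar10-r001
mkdir -p "$RUN_DIR"
cat > "$RUN_DIR/next_proposal.json" <<'EOF'
{
  "status": "ok",
  "description": "Try a slightly higher base learning rate.",
  "change_plan": "In train.py increase base LR by about 10% and keep other settings unchanged.",
  "commit_description": "experiment: increase LR by 10%"
}
EOF
# The next propose stage consumes it, e.g.:
#   python workflows/run_experiment.py resume --loops 1
```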

### Why this workflow is better

In **this fork** (`buzypi/autoresearch`), autonomous runs are executed through a dedicated workflow script (`workflows/run_experiment.py`) and an explicit agent runbook (`AGENTS.md`).

That is different from the older "just follow `program.md` directly" execution style, where each agent session had to repeatedly infer process details from prose instructions.

Benefits:

- **Operational consistency vs prose-only execution.** Instead of relying on session-by-session interpretation of `program.md`, one script now encodes setup, baseline, loop control, and resume behavior.
- **Natural-language intent still works.** `AGENTS.md` maps prompts like "run another 5 iterations" to deterministic commands, so agents stay aligned.
- **Reliable continuation.** Runs persist state under `workflows/runs/<run_id>/`, so partial iterations can resume from the next pending stage.
- **Controlled execution surface.** You can target only selected top-level stages or loop sub-stages while keeping the same run state.
- **Long-run friendly.** Training can run in background, and later `resume` invocations poll/continue in-flight work.
- **Clear audit trail.** `runner.log`, `history.jsonl`, `state.json`, and per-iteration artifacts make each decision traceable.

`program.md` is still the research policy and objective layer (what to optimize, constraints, judgment criteria). In this fork, `workflows/run_experiment.py` + `AGENTS.md` provide the execution layer (how runs are actually carried out repeatably).

## Project structure
