AGENTS.md (new file, +69 lines)
# Agent Runbook: Autoresearch Execution

Use `workflows/run_experiment.py` for all autoresearch execution. Do not use `workflow/dag.json` or task-graph-planner for running experiments.

## Core Rules

1. Keep all run artifacts under `workflows/runs/`.
2. Modify only `train.py` during experiments.
3. Never modify `prepare.py`.
4. Always launch training as `uv run train.py`, invoked through the runner script.

## Natural-language to Command Mapping

- User says: "Start running the experiment, run 5 loops"
- Run: `python workflows/run_experiment.py start --loops 5`

- User says: "Run another 5 iterations"
- Run: `python workflows/run_experiment.py resume --loops 5`

- User says: "Resume run <run_id> and run 5 loops"
- Run: `python workflows/run_experiment.py resume --run-id <run_id> --loops 5`

- User says: "Only run setup and baseline"
- Run: `python workflows/run_experiment.py start --only setup,baseline`

- User says: "Only run training and decision parts in loops for 3 iterations"
- Run: `python workflows/run_experiment.py resume --loops 3 --only loop --loop-only train,record,decide`

- User says: "Show run status"
- Run: `python workflows/run_experiment.py status`
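The mapping above implies a small CLI surface. The following is a hypothetical `argparse` sketch of that surface for illustration only; the actual `workflows/run_experiment.py` may name or structure its options differently:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of the subcommands and flags implied by the mapping above.
    parser = argparse.ArgumentParser(prog="run_experiment.py")
    sub = parser.add_subparsers(dest="command", required=True)

    start = sub.add_parser("start", help="begin a new run")
    resume = sub.add_parser("resume", help="continue an existing run")
    sub.add_parser("status", help="show run status")

    for p in (start, resume):
        p.add_argument("--loops", type=int, default=0)
        p.add_argument("--only", help="comma-separated top-level stages")
        p.add_argument("--loop-only", help="comma-separated loop sub-stages")
    resume.add_argument("--run-id")
    return parser

# "Resume run abc-r001 and run 5 loops" maps onto:
args = build_parser().parse_args(["resume", "--run-id", "abc-r001", "--loops", "5"])
print(args.command, args.run_id, args.loops)
```

A skeleton like this is what lets the natural-language phrases resolve to one deterministic command each.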

## Stage Controls

- Top-level stages: `setup`, `baseline`, `loop`
- Loop stages: `propose`, `apply`, `commit`, `train`, `triage`, `record`, `decide`

Supported control flags:

- `--only <comma-list>`: run only selected stages
- `--from-stage <setup|baseline|loop>` + `--to-stage <...>`: run a top-level stage range
- `--loop-only <comma-list>`: limit loop internals to selected stages
- `--loops N`: run `N` loop iterations
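One way the stage-selection flags could compose — `--only` as an explicit allowlist, `--from-stage`/`--to-stage` as a range over the canonical order — is sketched below. The stage names come from the runbook; the helper itself is an assumption, not the script's actual code:

```python
# Canonical top-level stage order from the runbook.
STAGES = ["setup", "baseline", "loop"]

def select_stages(only=None, from_stage=None, to_stage=None):
    if only:
        # --only wins: keep only the requested stages, in canonical order.
        requested = [s.strip() for s in only.split(",")]
        return [s for s in STAGES if s in requested]
    # Otherwise interpret --from-stage/--to-stage as an inclusive range.
    lo = STAGES.index(from_stage) if from_stage else 0
    hi = STAGES.index(to_stage) if to_stage else len(STAGES) - 1
    return STAGES[lo:hi + 1]

print(select_stages(only="setup,baseline"))   # ['setup', 'baseline']
print(select_stages(from_stage="baseline"))   # ['baseline', 'loop']
```

Keeping canonical order even for `--only` means users can list stages in any order without changing execution order.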

## Resume Behavior

- The script checkpoints state at `workflows/runs/<run_id>/state.json`.
- If a loop iteration is partially complete, `resume` continues that iteration from the next pending stage.
- "Run another N iterations" means execute N more loop iterations from current state.
- Training runs are started in the background by default (`--background-train`), and `resume` polls/continues in-flight baseline/train jobs.
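"Continues that iteration from the next pending stage" reduces to a simple scan over the loop-stage order. The sketch below assumes a `state.json` schema with a `completed_stages` list, which is an illustration rather than the script's real schema:

```python
import json
import pathlib
import tempfile

# Loop-stage order from the runbook.
LOOP_STAGES = ["propose", "apply", "commit", "train", "triage", "record", "decide"]

def next_pending(state):
    # A partially complete iteration resumes at the first stage not yet done.
    done = set(state.get("completed_stages", []))
    for stage in LOOP_STAGES:
        if stage not in done:
            return stage
    return None  # iteration fully complete

# Simulate a checkpoint like workflows/runs/<run_id>/state.json
# (field names here are assumptions for illustration).
state_path = pathlib.Path(tempfile.mkdtemp()) / "state.json"
state_path.write_text(json.dumps(
    {"iteration": 3, "completed_stages": ["propose", "apply", "commit"]}
))
state = json.loads(state_path.read_text())
print(next_pending(state))  # train
```

Because the checkpoint is re-read on every `resume`, a crash mid-iteration costs at most one stage of rework.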

## Logging and Observability

- Human-readable execution log: `workflows/runs/<run_id>/runner.log`
- Structured event log: `workflows/runs/<run_id>/history.jsonl`
- Checkpoint state: `workflows/runs/<run_id>/state.json`
- Per-iteration details (including opencode raw outputs): `workflows/runs/<run_id>/iterations/<NNNN>/`
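The structured event log is a standard append-only JSONL pattern: one self-describing record per line. A minimal sketch, with field names that are illustrative rather than the script's actual schema:

```python
import json
import pathlib
import tempfile
from datetime import datetime, timezone

def log_event(history_path, event, **fields):
    # Append one structured record per line (JSONL): readable with any
    # line-oriented tool, and safe to append to mid-run.
    record = {"ts": datetime.now(timezone.utc).isoformat(), "event": event, **fields}
    with history_path.open("a") as fh:
        fh.write(json.dumps(record) + "\n")

# Simulate workflows/runs/<run_id>/history.jsonl in a temp dir.
history = pathlib.Path(tempfile.mkdtemp()) / "history.jsonl"
log_event(history, "stage_start", stage="train", iteration=2)
log_event(history, "stage_end", stage="train", iteration=2, status="ok")

lines = history.read_text().splitlines()
print(len(lines))  # 2
```

Pairing this with the human-readable `runner.log` gives both a grep-friendly narrative and a machine-parseable audit trail.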

## Run ID Policy

- Default run id: `<branch-slug>-rNNN`
- Example: branch `autoresearch/mar10` -> `autoresearch-mar10-r001`
- On `resume` without `--run-id`, the script picks the latest run for the current branch.
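The `<branch-slug>-rNNN` policy can be sketched as slugifying the branch name and incrementing a zero-padded counter over existing run ids. The helper below is an assumed implementation, not the script's actual one:

```python
import re

def default_run_id(branch, existing):
    # Slugify the branch: "autoresearch/mar10" -> "autoresearch-mar10".
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    # Find existing counters for this slug and take the next one.
    nums = [int(m.group(1)) for rid in existing
            if (m := re.fullmatch(re.escape(slug) + r"-r(\d{3})", rid))]
    return f"{slug}-r{max(nums, default=0) + 1:03d}"

print(default_run_id("autoresearch/mar10", []))
# autoresearch-mar10-r001
print(default_run_id("autoresearch/mar10", ["autoresearch-mar10-r001"]))
# autoresearch-mar10-r002
```

Deriving the id from the branch keeps runs from different experiment branches from colliding under `workflows/runs/`.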

## Notes

- Use `--no-stochastic` only when opencode stochastic execution is unavailable.
- Setup auto-runs `uv run prepare.py` if cache/tokenizer are missing (disable via `--no-auto-prepare`).
- Background training is enabled by default; disable with `--no-background-train` for fully foreground execution.
- `results.tsv` is maintained in repo root and should remain untracked.
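The background-train pattern above amounts to launching the training process detached and letting a later `resume` poll it. A minimal standalone sketch (the command is a stand-in, not the real training invocation):

```python
import subprocess
import sys
import time

# Start a "training job" in the background; a short sleep stands in for
# the real `uv run train.py` invocation.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.2)"])
print("started pid", proc.pid)

# A later `resume` would poll the recorded job until it finishes.
while proc.poll() is None:
    time.sleep(0.05)
print("exit code", proc.returncode)
```

In the real runner the PID would be recorded in `state.json` so a separate `resume` invocation can pick the job back up.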
README.md (+29 lines, diff excerpt begins mid-file at line 49)

The `program.md` file is essentially a super lightweight "skill".

### OpenCode quickstart

If you want to run this with OpenCode, use this flow:

1. Clone the repo.
2. Install OpenCode.
3. `cd` into the repo.
4. Run `opencode`.
5. Type: `Run the experiment for 1 loop`.

The agent will use `workflows/run_experiment.py` and kick off the run.

### Why this workflow is better

In **this fork** (`buzypi/autoresearch`), autonomous runs are executed through a dedicated workflow script (`workflows/run_experiment.py`) and an explicit agent runbook (`AGENTS.md`).

That is different from the older "just follow `program.md` directly" execution style, where each agent session had to repeatedly infer process details from prose instructions.

Benefits:

- **Operational consistency vs prose-only execution.** Instead of relying on session-by-session interpretation of `program.md`, one script now encodes setup, baseline, loop control, and resume behavior.
- **Natural-language intent still works.** `AGENTS.md` maps prompts like "run another 5 iterations" to deterministic commands, so agents stay aligned.
- **Reliable continuation.** Runs persist state under `workflows/runs/<run_id>/`, so partial iterations can resume from the next pending stage.
- **Controlled execution surface.** You can target only selected top-level stages or loop sub-stages while keeping the same run state.
- **Long-run friendly.** Training can run in background, and later `resume` invocations poll/continue in-flight work.
- **Clear audit trail.** `runner.log`, `history.jsonl`, `state.json`, and per-iteration artifacts make each decision traceable.

`program.md` is still the research policy and objective layer (what to optimize, constraints, judgment criteria). In this fork, `workflows/run_experiment.py` + `AGENTS.md` provide the execution layer (how runs are actually carried out repeatably).

## Project structure
