greyhaven-ai · jayscambler · Jun 9, 2026
diff --git a/README.md b/README.md
@@ -332,6 +332,7 @@ Yes. Wire `autoctx mcp-serve` (or `bunx autoctx mcp-serve`) into Claude Code, Cu
 - Repo layout for coding agents: [AGENTS.md](AGENTS.md)
 - Local + cross-platform model training (MLX and TRL backends): [autocontext/docs/mlx-training.md](autocontext/docs/mlx-training.md)
 - Validated training result (on-policy distillation vs RLVR on GSM8K): [autocontext/docs/case-study-on-policy-distillation.md](autocontext/docs/case-study-on-policy-distillation.md)
+- Recursive loop closed end to end on local MLX (train -> auto-serve -> improve): [autocontext/docs/case-study-recursive-loop.md](autocontext/docs/case-study-recursive-loop.md)
 - Sandbox and executor notes: [autocontext/docs/sandbox.md](autocontext/docs/sandbox.md)
 - Persistent host worker: [autocontext/docs/persistent-host.md](autocontext/docs/persistent-host.md)
 - License: [LICENSE](LICENSE)

diff --git a/autocontext/README.md b/autocontext/README.md
@@ -699,6 +699,7 @@ For the TypeScript equivalent, see `ts/src/integrations/anthropic/STABILITY.md`.
 - [Sandbox modes](docs/sandbox.md)
 - [Persistent host worker](docs/persistent-host.md)
 - [MLX host training](docs/mlx-training.md)
+- [Case study: recursive loop closed on local MLX](docs/case-study-recursive-loop.md)
 - [TypeScript package guide](../ts/README.md) — `analyze`, mission control, and interactive TUI surfaces
 - [Demo data notes](demo_data/README.md)
 - [Copy-paste examples](../examples/README.md)

diff --git a/autocontext/docs/case-study-recursive-loop.md b/autocontext/docs/case-study-recursive-loop.md
@@ -0,0 +1,99 @@
+# Case study: the recursive loop, closed end to end on local MLX
+
+autocontext's premise is a loop: an agent attempts a task, the verifier scores the attempts,
+the best trajectories train a model, and the _next_ run uses that trained model — with no human
+in the middle. This is that loop running end to end on a single Mac: train a small LoRA adapter
+on a scenario's verifier-scored strategies, publish and auto-activate it in the model registry,
+and have the agent provider auto-resolve and serve it on the next run. The served model's strategies
+score **41.9% higher** (mean verifier score) than the untrained base's, and nothing about which model
+to serve is hardcoded: it is resolved from the registry the training run wrote to.
+
+## Result
+
+`grid_ctf` scenario, base model `mlx-community/Qwen2.5-0.5B-Instruct-4bit`, 8 strategies
+sampled per measurement and scored by the scenario's own verifier:
+
+| Stage                                  | Mean verifier score  | Valid JSON rate |
+| -------------------------------------- | -------------------- | --------------- |
+| **run N** — base model as the agent    | 0.5809               | 75%             |
+| **run N+1** — auto-served LoRA adapter | **0.8241**           | 100%            |
+| delta                                  | **+0.2432 (+41.9%)** |                 |
+
+The adapter was fine-tuned for 80 LoRA steps on the 60 highest-scoring strategies the loop
+accumulated (mean verifier score 0.849). The whole loop (train, publish, auto-resolve, serve,
+re-measure) ran in **43 seconds**. The in-training assessment (0.8565) landed within 0.033 of
+the served-adapter measurement (0.8241), each on 8 samples, so the metric the training run
+reports closely tracks the score the served model actually delivers.
+
+## What "closed loop" means here
+
+The point is not that fine-tuning improves a model — that is expected. The point is that the
+next run picks up the trained model **on its own**:
+
+```
+run N      base Qwen2.5-0.5B-Instruct proposes grid_ctf strategies        -> 0.58
+train      LoRA SFT on the elite verifier-scored strategies               (38s)
+publish    register + activate the adapter; record base_model on it       -> state=active
+bridge     scenario_bound resolver -> plan_local_client -> MLXLMClient     -> auto-selected
+run N+1    AUTOCONTEXT_AGENT_PROVIDER=mlx serves base + adapter            -> 0.82
+```
+
+The `bridge` step is the load-bearing one. The serving run is given no model path. It calls
+`_resolve_local_record(settings, scenario)`, which finds the active record the training run
+published, and `plan_local_client(record)`, which routes an `mlxlm`/`opd` adapter to
+`MLXLMClient(base=record.metadata["base_model"], adapter_path=record.checkpoint_path, ...)`.
+That is why the registry record has to carry the base model the adapter was trained against —
+an adapter checkpoint is useless without it — and why the publish step records it.
+
+## Does it compound? Multi-generation self-improvement
+
+The single step above trains on a curated near-optimal elite (strategies sampled across the
+space and scored by the verifier), so it shows the ceiling the loop can reach with good data.
+The stronger claim is that the loop improves on its OWN output with no external data: each
+generation, the currently-served model proposes, the verifier scores, the best of everything
+proposed so far becomes the next adapter's training set, and the next generation is served by
+that adapter. Three generations bootstrapping from the base model's cold-start proposals:
+
+| Generation     | Mean   | Best   | Valid JSON |
+| -------------- | ------ | ------ | ---------- |
+| gen 0 (base)   | 0.5767 | 0.6885 | 13/20      |
+| gen 1 (served) | 0.5952 | 0.7254 | 20/20      |
+| gen 2 (served) | 0.5998 | 0.7351 | 20/20      |
+| gen 3 (served) | 0.6194 | 0.7369 | 20/20      |
+
+Mean and best both rise monotonically, the valid-JSON rate goes from 13/20 to 20/20, and the
+whole 3-generation run takes 33s. The gen 0 -> gen 3 gain in mean score (+7.4%) is far smaller
+than the single-step +41.9%, as expected: bootstrapping from the base model's own weak,
+low-diversity cold-start distribution is slower than training on a globally curated elite. The
+point is the shape, not the magnitude: the loop compounds on its own verifier-scored output with
+no human and no external data. The only external signal is the verifier, which scores but never generates.
+
+## Reproduce
+
+Requires Apple Silicon with the mlx extra plus mlx-lm (`uv pip install mlx mlx-lm`). The base
+model downloads once from the `mlx-community` Hugging Face repo.
+
+```bash
+uv run python scripts/demo_recursive_loop.py            # single step (train -> serve -> +41.9%)
+uv run python scripts/demo_recursive_loop_multigen.py    # multi-generation self-improvement
+```
+
+The single-step script is self-contained: it builds the elite training set from the scenario's
+verifier, calls `run_mlxlm_training`, publishes via `publish_training_output(..., auto_activate=True)`,
+then resolves and serves the adapter through the exact code path the agent provider uses
+(`scenario_bound_clients`), and prints the before/after verifier scores.
+
+## Two fixes this surfaced
+
+Running the loop on a game scenario exposed two real gaps, both fixed alongside this demo:
+
+1. **Game scenarios produced an empty task prompt.** `ScenarioInterface` scenarios expose
+   `describe_rules` / `describe_strategy_interface` / `describe_evaluation_criteria` but no
+   `get_task_prompt` or `description`, so `resolve_scenario_context` returned `""` — every game
+   scenario was untrainable on the adapter backends. It now composes the `describe_*` methods
+   into a task instruction.
+2. **The in-training assessment fed the model a raw prompt.** `_assess_mlxlm` passed the bare
+   task string to `generate()`, but mlx-lm's LoRA trainer and the serving path both apply the
+   instruct chat template. An instruct model given a raw prompt emits prose, not scorable JSON,
+   so the in-training metric read ~0 even when the adapter was good. Assessment now applies the
+   chat template (`format_assess_prompt`), matching training and serving.
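
Fix 1 in the case study above (composing the `describe_*` methods into a task instruction) can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `ToyScenario`, `compose_task_prompt`, the section labels, and the fallback order are all hypothetical, and the real `resolve_scenario_context` may compose the text differently.

```python
from __future__ import annotations


class ToyScenario:
    """Hypothetical game scenario that exposes only describe_* methods."""

    def describe_rules(self) -> str:
        return "Capture the flag on a grid."

    def describe_strategy_interface(self) -> str:
        return 'Return JSON with keys "aggression", "defense", "path_bias".'

    def describe_evaluation_criteria(self) -> str:
        return "Higher verifier score is better."


def compose_task_prompt(scenario: object) -> str:
    """Prefer an explicit prompt; otherwise join whichever describe_* methods exist.

    The preference order here is an assumption made for this sketch.
    """
    getter = getattr(scenario, "get_task_prompt", None)
    if callable(getter):
        return str(getter())
    description = getattr(scenario, "description", None)
    if isinstance(description, str) and description:
        return description
    parts: list[str] = []
    for label, name in (
        ("Rules", "describe_rules"),
        ("Strategy interface", "describe_strategy_interface"),
        ("Evaluation", "describe_evaluation_criteria"),
    ):
        method = getattr(scenario, name, None)
        if callable(method):
            text = str(method()).strip()
            if text:
                parts.append(f"{label}: {text}")
    # An object with none of these hooks still yields "", which is the
    # failure mode the fix removes for game scenarios.
    return "\n".join(parts)


prompt = compose_task_prompt(ToyScenario())
print(prompt)
```

The fallback chain keeps scenarios that already define a prompt unchanged and only synthesizes one when nothing else is available.
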
diff --git a/autocontext/scripts/demo_recursive_loop.py b/autocontext/scripts/demo_recursive_loop.py
@@ -0,0 +1,205 @@
+"""Live end-to-end demo of the autocontext recursive loop on local MLX.
+
+Closes the loop the PRs built: train a small mlx-lm LoRA adapter on a scenario's
+verifier-scored strategies, PUBLISH + AUTO-ACTIVATE it in the model registry, then have
+the scenario-bound resolver AUTO-SERVE it as the agent (no hardcoded path) and show the
+served model proposes better strategies than the untrained base.
+
+    run N   = base Qwen2.5-0.5B-Instruct proposing grid_ctf strategies
+    train   = LoRA SFT on the elite (verifier-scored) strategies the loop accumulates
+    publish = register + activate the adapter (records base_model + score_conditioned)
+    bridge  = scenario_bound resolver -> plan_local_client -> MLXLMClient(base, adapter)
+    run N+1 = the AUTO-RESOLVED served adapter proposing grid_ctf strategies
+
+Requires the mlx extra + mlx-lm:  uv pip install mlx mlx-lm
+Run (from the package root):  uv run python scripts/demo_recursive_loop.py
+"""
+
+from __future__ import annotations
+
+import json
+import random
+import statistics
+import tempfile
+import time
+from pathlib import Path
+
+from autocontext.agents.llm_client import MLXLMClient
+from autocontext.agents.scenario_bound_clients import _build_planned_client, _resolve_local_record, plan_local_client
+from autocontext.config.settings import AppSettings
+from autocontext.scenarios import SCENARIO_REGISTRY
+from autocontext.training.autoresearch.mlxlm_backend import (
+    DEFAULT_BASE_MODEL,
+    run_mlxlm_training,
+    scenario_task_prompt,
+)
+from autocontext.training.autoresearch.sequence_format import extract_json_object
+from autocontext.training.backends import default_backend_registry
+from autocontext.training.model_registry import (
+    ModelRegistry,
+    TrainingCompletionOutput,
+    publish_training_output,
+)
+
+SCENARIO = "grid_ctf"
+N_SAMPLES = 8  # strategies generated per measurement
+TRAIN_STEPS = 80
+N_TRAIN_RECORDS = 60
+
+
+def banner(msg: str) -> None:
+    print(f"\n{'=' * 78}\n{msg}\n{'=' * 78}", flush=True)
+
+
+def print_measure(label: str, m: dict) -> None:
+    print(
+        f"{label}: mean={m['mean']:.4f}  best={m['best']:.4f}  valid={m['valid_rate']:.0%}  scores={m['scores']}",
+        flush=True,
+    )
+
+
+def measure(client, scenario, task_prompt: str, *, n: int) -> dict:
+    """Generate n strategies through the REAL agent client, score each via the verifier.
+
+    ``client`` is an MLXLMClient (base-only for run N, base+adapter for run N+1) so the demo
+    exercises the actual serving path -- including format_mlxlm_prompt's chat-template wrap,
+    which an instruct model needs to emit parseable JSON."""
+    scores: list[float] = []
+    valid = 0
+    for i in range(n):
+        try:
+            resp = client.generate(model="", prompt=task_prompt, max_tokens=128, temperature=0.7)
+            strategy = extract_json_object(resp.text)
+            if strategy is None:
+                continue
+            ok, _ = scenario.validate_actions(scenario.initial_state(seed=0), "challenger", strategy)
+            if not ok:
+                continue
+            valid += 1
+            scores.append(scenario.execute_match(strategy, seed=i).score)
+        except Exception as exc:  # noqa: BLE001 - demo: surface and continue
+            print(f"  (sample {i}: {type(exc).__name__}: {exc})", flush=True)
+            continue
+    return {
+        "mean": statistics.fmean(scores) if scores else 0.0,
+        "best": max(scores) if scores else 0.0,
+        "valid_rate": valid / n,
+        "scores": [round(s, 3) for s in scores],
+    }
+
+
+def build_elite_training_set(scenario, path: Path, *, n_records: int) -> float:
+    """Sample the strategy space, score with the real verifier, keep the elite as training data.
+
+    Represents the verifier-scored trajectories the loop accumulates over generations. Returns
+    the mean score of the kept elite (what the adapter is taught to reproduce)."""
+    rng = random.Random(0)
+    candidates = []
+    for _ in range(n_records * 8):
+        a = round(rng.uniform(0.0, 1.0), 3)
+        d = round(rng.uniform(0.0, min(1.0, 1.4 - a)), 3)  # honor aggression + defense <= 1.4
+        p = round(rng.uniform(0.0, 1.0), 3)
+        strat = {"aggression": a, "defense": d, "path_bias": p}
+        score = scenario.execute_match(strat, seed=0).score
+        candidates.append((score, strat))
+    candidates.sort(key=lambda x: x[0], reverse=True)
+    elite = candidates[:n_records]
+    with path.open("w", encoding="utf-8") as f:
+        for i, (score, strat) in enumerate(elite):
+            f.write(
+                json.dumps({"run_id": f"elite_{i // 10}", "scenario": SCENARIO, "strategy": strat, "score": score, "context": {}})
+                + "\n"
+            )
+    return statistics.fmean(s for s, _ in elite)
+
+
+def main() -> None:
+    t0 = time.time()
+    scenario = SCENARIO_REGISTRY[SCENARIO]()
+    task_prompt = scenario_task_prompt(scenario)
+    base = DEFAULT_BASE_MODEL
+
+    banner(f"autocontext recursive loop — live demo on local MLX\nscenario={SCENARIO}  base={base}")
+    print(f"task prompt:\n  {task_prompt}", flush=True)
+
+    # --- run N: the untrained base model is the agent ---------------------------------------
+    banner("RUN N — baseline: base model proposes strategies (no trained model yet)")
+    before = measure(MLXLMClient(base), scenario, task_prompt, n=N_SAMPLES)
+    print_measure("base model", before)
+
+    with tempfile.TemporaryDirectory() as tmp:
+        tmp_path = Path(tmp)
+        data_path = tmp_path / "training_data.jsonl"
+        out_dir = tmp_path / "mlxlm_run"
+        knowledge_root = tmp_path / "knowledge"
+        knowledge_root.mkdir()
+
+        # --- accumulate the verifier-scored elite, then train an adapter on it ---------------
+        banner(f"TRAIN — LoRA SFT on the elite {N_TRAIN_RECORDS} verifier-scored strategies")
+        elite_mean = build_elite_training_set(scenario, data_path, n_records=N_TRAIN_RECORDS)
+        print(f"elite training set: {N_TRAIN_RECORDS} strategies, mean verifier score={elite_mean:.4f}", flush=True)
+        print(f"fine-tuning {base} for {TRAIN_STEPS} steps ...", flush=True)
+        metrics = run_mlxlm_training(
+            scenario_name=SCENARIO,
+            data_path=data_path,
+            output_dir=out_dir,
+            time_budget=900,
+            memory_limit_mb=16384,
+            train_steps=TRAIN_STEPS,
+            base_model=base,
+            assess_samples=N_SAMPLES,
+            assess_temperature=0.7,
+        )
+        adapter_dir = out_dir / "adapters"
+        print(
+            f"trained. in-training assessment: avg_score={metrics['avg_score']:.4f}  "
+            f"valid_rate={metrics['valid_rate']:.0%}  ({metrics['training_seconds']:.0f}s)",
+            flush=True,
+        )
+
+        # --- publish + auto-activate (the recursive loop's hand-off) -------------------------
+        banner("PUBLISH — register + auto-activate the adapter in the model registry")
+        registry = ModelRegistry(knowledge_root)
+        completion = TrainingCompletionOutput(
+            run_id="demo-run",
+            checkpoint_path=str(adapter_dir),
+            backend="mlxlm",
+            scenario=SCENARIO,
+            scenario_family="game",
+            runtime_types=default_backend_registry().get("mlxlm").supported_runtime_types(),
+            training_metrics={"avg_score": metrics["avg_score"]},
+            metadata={"base_model": base, "score_conditioned": False},
+        )
+        record = publish_training_output(completion, registry, artifacts_root=None, auto_activate=True)
+        print(f"published: artifact={record.artifact_id}  backend={record.backend}  state={record.activation_state}", flush=True)
+        print(f"  runtime_types={record.runtime_types}  base_model={record.metadata.get('base_model')!r}", flush=True)
+
+        # --- the BRIDGE: resolve + route purely from the registry (no hardcoded path) --------
+        banner("BRIDGE — scenario_bound resolver auto-selects the trained adapter")
+        settings = AppSettings(agent_provider="mlx", mlx_model_path="", knowledge_root=knowledge_root)
+        resolved = _resolve_local_record(settings, SCENARIO)
+        assert resolved is not None, "resolver failed to find the active adapter"
+        plan = plan_local_client(resolved)
+        assert plan is not None, "router could not plan a client for the record"
+        print(f"resolved active record -> kind={plan.kind!r}  base={plan.model!r}", flush=True)
+        print(f"  adapter_path={plan.adapter_path}  score_conditioned={plan.score_conditioned}", flush=True)
+        print("  => AUTOCONTEXT_AGENT_PROVIDER=mlx would now serve MLXLMClient(base, adapter)", flush=True)
+
+        # --- run N+1: the auto-served adapter is the agent -----------------------------------
+        banner("RUN N+1 — the auto-resolved served adapter proposes strategies")
+        served_client = _build_planned_client(plan, settings)  # the bridge's real client construction
+        after = measure(served_client, scenario, task_prompt, n=N_SAMPLES)
+        print_measure("served adapter", after)
+
+        # --- verdict -------------------------------------------------------------------------
+        banner("VERDICT")
+        delta = after["mean"] - before["mean"]
+        print(f"  run N   (base model)      mean score = {before['mean']:.4f}", flush=True)
+        print(f"  run N+1 (served adapter)  mean score = {after['mean']:.4f}", flush=True)
+        print(f"  delta                                = {delta:+.4f}  ({delta / max(before['mean'], 1e-9):+.1%})", flush=True)
+        print(f"  loop closed: train -> publish -> auto-resolve -> serve, in {time.time() - t0:.0f}s", flush=True)
+        print(f"\n  {'IMPROVED ✓' if delta > 0 else 'NO IMPROVEMENT'} — N+1 {'>' if delta > 0 else '<='} N", flush=True)
+
+
+if __name__ == "__main__":
+    main()
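
The BRIDGE step in the script above can be pictured with a toy registry. This is a minimal sketch, not autocontext's implementation: `ToyRecord`, `ToyPlan`, `resolve_active`, and `plan_client` are hypothetical stand-ins for the registry record, `_resolve_local_record`, and `plan_local_client`; the `"candidate"` state name is invented for the example; and the field names only mirror what the demo prints.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyRecord:
    """Hypothetical registry record; fields mirror the ones the demo prints."""
    scenario: str
    backend: str
    checkpoint_path: str
    activation_state: str = "candidate"
    metadata: dict = field(default_factory=dict)


@dataclass
class ToyPlan:
    kind: str
    model: str
    adapter_path: str


def resolve_active(records: list[ToyRecord], scenario: str) -> ToyRecord | None:
    """Return the active record for a scenario, or None if nothing is published."""
    for record in records:
        if record.scenario == scenario and record.activation_state == "active":
            return record
    return None


def plan_client(record: ToyRecord) -> ToyPlan | None:
    """Route an adapter record to a base+adapter plan; needs the recorded base model."""
    base = record.metadata.get("base_model")
    if record.backend not in {"mlxlm", "opd"} or not base:
        return None  # an adapter checkpoint alone cannot be served
    return ToyPlan(kind="mlxlm", model=base, adapter_path=record.checkpoint_path)


registry = [
    ToyRecord("grid_ctf", "mlxlm", "/tmp/old/adapters", "candidate", {}),
    ToyRecord(
        "grid_ctf",
        "mlxlm",
        "/tmp/run/adapters",
        "active",
        {"base_model": "mlx-community/Qwen2.5-0.5B-Instruct-4bit"},
    ),
]
record = resolve_active(registry, "grid_ctf")
plan = plan_client(record) if record is not None else None
print(plan)
```

The serving side never receives a model path: it derives both the base model and the adapter path from the active record, which is why the publish step has to store `base_model` in the record's metadata.
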