fix(adapter-backends): game scenarios trainable/servable + meaningful in-training assessment by jayscambler · Pull Request #1061 · greyhaven-ai/autocontext

jayscambler · 2026-06-09T19:38:23Z

What

A live end-to-end recursive-loop demo on local MLX (train → publish → auto-resolve → serve) surfaced two real gaps on the mlxlm/opd adapter path, both of which silently broke game scenarios (most of SCENARIO_REGISTRY). Both fixed here, with the demo + a writeup.

Fix 1 — game scenarios produced an empty task prompt

ScenarioInterface (game) scenarios expose describe_rules / describe_strategy_interface / describe_evaluation_criteria but no get_task_prompt or description. resolve_scenario_context only checked the latter two, so it returned "" — every game scenario trained and served on an empty prompt. It now composes the describe_* methods into a task instruction.

Fix 2 — the in-training assessment fed the model a raw prompt

_assess_mlxlm passed the bare task string to generate(), but mlx-lm's LoRA trainer and the serving path (MLXLMProvider.format_mlxlm_prompt) both apply the instruct chat template. An instruct model given a raw prompt emits prose, not verifier-scorable JSON, so the in-training avg_score read ~0 even for a good adapter. Extracted format_assess_prompt (chat template + optional quality prefix, with a raw fallback for non-instruct tokenizers) and use it in assessment — so train / assess / serve all format identically.

Live result

grid_ctf, base Qwen2.5-0.5B-Instruct-4bit, 8 verifier-scored samples per stage:

Stage	Mean verifier score	Valid JSON
run N — base model as agent	0.5809	75%
run N+1 — auto-served LoRA adapter	0.8241	100%
delta	+0.2432 (+41.9%)

Whole loop (train → publish → auto-resolve → serve → re-measure) in 43s. After Fix 2 the in-training assessment (0.857) agrees with the independent served-adapter measurement (0.824). The serving run is given no model path — the adapter is resolved purely from the registry the training run published to, via the bridge (_resolve_local_record → plan_local_client → MLXLMClient).

Added

scripts/demo_recursive_loop.py — self-contained, runnable end-to-end loop.
docs/case-study-recursive-loop.md — the writeup.

Tests (CI-safe, no mlx)

resolve_scenario_context: describe_* composition, get_task_prompt/description precedence, empty fallback. format_assess_prompt: chat-template application, quality-prefix when score-conditioned, raw fallback when the tokenizer has no chat template. Full touched-area regression green; ruff + mypy clean.

…ngful in-training assessment Surfaced by a live recursive-loop demo on local MLX (train -> publish -> auto-resolve -> serve). Two real gaps on the mlxlm/opd adapter path, both blocking game scenarios: 1. resolve_scenario_context returned '' for ScenarioInterface (game) scenarios -- they expose describe_rules/strategy_interface/evaluation_criteria but no get_task_prompt or description, so every game scenario trained/served on an EMPTY prompt. Now composes the describe_* methods into a task instruction. 2. _assess_mlxlm fed the model a RAW prompt, but mlx-lm's LoRA trainer and the serving path (MLXLMProvider.format_mlxlm_prompt) both apply the instruct chat template. An instruct model given a raw prompt emits prose, not verifier-scorable JSON, so the in-training metric read ~0 even for a good adapter. Extracted format_assess_prompt (chat-template + optional quality prefix, raw fallback for non-instruct tokenizers) and use it in assess. Adds scripts/demo_recursive_loop.py (self-contained end-to-end loop on grid_ctf) and docs/case-study-recursive-loop.md. Live result: base 0.58 -> auto-served adapter 0.82 (+41.9%) in 43s; in-training assessment (0.857) now agrees with the served measurement. Tests: resolve_scenario_context describe_* composition + precedence + empty fallback; format_assess_prompt chat-template + quality-prefix + raw fallback. CI-safe (no mlx).

#1061) The new case-study-recursive-loop.md was orphaned from the public docs path. Link it beside the MLX-training / OPD-case-study entries in the root README, docs/README, and the package README so users discover it. Root README re-applied without the formatter to avoid mangling the synced whats-new block + Surfaces table.

…tudy trajectory Take the loop further: scripts/demo_recursive_loop_multigen.py runs the genuine recursive loop, where each generation trains ONLY on the model's own verifier-curated proposals (no external data, no human). The served model proposes -> verifier scores -> best-so-far becomes the next adapter's training set -> publish + auto-activate -> next gen is served by it. Live result (grid_ctf, Qwen2.5-0.5B, 3 generations, 33s): mean 0.577 -> 0.619 and best 0.689 -> 0.737, monotonic, valid-JSON 13/20 -> 20/20. The +7.4% is smaller than the single step's +41.9% and honestly so -- bootstrapping from a weak cold-start distribution is slower than training on a globally-curated elite; the point is the shape (it compounds on its own output), not the magnitude. Documented as a new section in the case study.

jayscambler added 3 commits June 9, 2026 14:38

jayscambler merged commit f296732 into main Jun 9, 2026
33 of 34 checks passed

jayscambler deleted the demo/recursive-loop branch June 9, 2026 20:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(adapter-backends): game scenarios trainable/servable + meaningful in-training assessment#1061

fix(adapter-backends): game scenarios trainable/servable + meaningful in-training assessment#1061
jayscambler merged 3 commits into
mainfrom
demo/recursive-loop

jayscambler commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jayscambler commented Jun 9, 2026

What

Fix 1 — game scenarios produced an empty task prompt

Fix 2 — the in-training assessment fed the model a raw prompt

Live result

Added

Tests (CI-safe, no mlx)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant