feat(self-improve): generalize ReST-EM loop to mlxlm backend + agent-task scenarios by jayscambler · Pull Request #1064 · greyhaven-ai/autocontext

jayscambler · 2026-06-09T21:51:29Z

What

The self-improving loop (autoctx self-improve, ReST-EM / expert iteration) was hardcoded to the from-scratch mlx backend and game scenarios. This generalizes it along both axes — the two follow-ons chosen after closing the recursive loop.

Backend: `mlx` → `{mlx, mlxlm}`

The loop now drives either SFT backend: mlx (from-scratch GPT) or mlxlm (LoRA on a pretrained base, --backend mlxlm). mlxlm's in-scenario assessment now collects {strategy, score} samples to a collect_samples_path (the adapter analogue of the existing mlx collect path), so ReST-EM can keep the elite and retrain. train.py permits collect_samples_path for mlx+mlxlm and rejects it for the online-RL/distillation backends (grpo/opd/trl), which have no SFT sample stream.

Scenarios: game → game + agent-task

Agent-task scenarios (scored via evaluate_output on free text) are now first-class. _assess_mlxlm collects the raw text as the sample's strategy, and records_to_completions uses a string strategy verbatim as the completion (json.dumps would quote/escape it). Game scenarios (JSON strategies) are unchanged.

CLI

autoctx self-improve gains --backend, --base-model, --fine-tune-type, --num-layers, and --batch-size (lower it for mlxlm on small seeds: mlx-lm requires the validation split to hold ≥ batch_size examples).

Verified live

autoctx self-improve --backend mlxlm on grid_ctf (cached Qwen2.5-0.5B), 2 rounds: dataset grew 24→28, per-round avg_score 0.755 → 0.878. The loop collects samples, filters elite, grows the dataset, and retrains the adapter each round — end to end through the CLI.

Tests (CI-safe, no mlx)

records_to_completions verbatim-text vs JSON-strategy; loop rejects non-SFT backends; loop threads backend + adapter params into run_training (mocked); collect_samples_path rejected for non-SFT backends. Updated the prior MLX-only guard test to the new mlx and mlxlm message. Documents the new flags + backend/scenario coverage in mlx-training.md. ruff + mypy clean.

… scenarios The self-improving loop (autoctx self-improve) was hardcoded to the from-scratch mlx backend and game scenarios. Generalize it along both axes: Backend: the loop now drives either SFT backend -- mlx (from-scratch GPT) or mlxlm (LoRA on a pretrained base). mlxlm's in-scenario assessment now collects {strategy, score} samples to a collect_samples_path (the adapter analogue of the mlx collect path), so ReST-EM can keep the elite and retrain. train.py allows collect_samples_path for mlx + mlxlm and rejects it for the online-RL / distillation backends (grpo/opd/trl), which have no SFT sample stream. Scenarios: agent-task scenarios (scored via evaluate_output on free text) are now first-class. _assess_mlxlm collects the raw text as the sample's strategy, and records_to_completions uses a string strategy verbatim as the completion (json.dumps would quote/escape it). Game scenarios (JSON strategies) are unchanged. CLI: autoctx self-improve gains --backend, --base-model, --fine-tune-type, --num-layers, and --batch-size (lower it for mlxlm on small seeds: mlx-lm requires the validation split to hold >= batch_size examples). Verified live: autoctx self-improve --backend mlxlm on grid_ctf ran 2 rounds, growing the dataset 24->28 and improving avg_score 0.755 -> 0.878 across rounds. Tests (CI-safe): records_to_completions verbatim-text vs JSON strategy; loop rejects non-SFT backends; loop threads backend + adapter params into run_training (mocked); collect_samples_path rejected for non-SFT backends. Documents the new flags + backend/scenario coverage in mlx-training.md.

…d after scoring (review #1064) [P2] The agent-task assessment called scenario.evaluate_output(output=...) WITHOUT the required state arg, so it raised for AgentTaskInterface scenarios, got swallowed by the broad except, and could report valid_rate>0 while collecting nothing. Fixes: - scenario_task_prompt_and_state resolves the prompt AND the state it was built from; the mlxlm assessment passes that same state into evaluate_output(output, state). - valid is incremented only AFTER scoring succeeds (both mlxlm and the mlx/prepare path), so a scoring error can no longer inflate valid_rate. - the from-scratch mlx path (prepare.py) likewise resolves + passes state. [P2] Docs over-claimed agent-task free-text across both SFT backends. The from-scratch mlx backend emits the structured <|...|> token contract (JSON strategies), so free-text agent tasks fit the mlxlm backend (pretrained instruct model). Narrowed mlx-training.md: both backends handle game/JSON-strategy scenarios; free-text agent tasks use --backend mlxlm. Tests: scenario_task_prompt_and_state returns state for agent tasks / None for games.

…K) (#1070) Two bugs found live on a GSM8K GRPO run made every reward 0 -> no gradient (RLVR silently learned nothing), on top of the length issue fixed in #1067: 1. GRPO prompts were raw strings, so TRL skipped the chat template -> the instruct model never entered chat mode, never emitted EOS, and every completion ran to max_completion_length (clipped_ratio=1, mean_terminated_length=0) with no parseable answer. build_prompt_dataset_rows now emits CONVERSATIONAL prompts ([{role:user,content}]) so TRL applies the chat template and completions terminate. 2. score_completions required a JSON construction (extract_json_object) for ALL scenarios, so a free-text GSM8K answer scored 0 even when correct. It is now hybrid for agent tasks: pass the extracted JSON when present (back-compat for JSON-construction agent tasks) else the RAW text, and it accepts conversational (message-list) completions. Tests: free-text agent-task scoring (no JSON), conversational completion handling, conversational prompt rows. Both bugs are the same game-vs-agent-task split fixed earlier in the mlxlm assessment path (#1064).

jayscambler added 2 commits June 9, 2026 16:51

jayscambler merged commit 221c2e4 into main Jun 9, 2026
17 checks passed

jayscambler deleted the feat/self-improve-mlxlm branch June 9, 2026 22:20

jayscambler mentioned this pull request Jun 11, 2026

fix(grpo): make RLVR work on free-text reasoning tasks (GSM8K — non-termination + JSON-only reward) #1070

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(self-improve): generalize ReST-EM loop to mlxlm backend + agent-task scenarios#1064

feat(self-improve): generalize ReST-EM loop to mlxlm backend + agent-task scenarios#1064
jayscambler merged 2 commits into
mainfrom
feat/self-improve-mlxlm

jayscambler commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jayscambler commented Jun 9, 2026

What

Backend: mlx → {mlx, mlxlm}

Scenarios: game → game + agent-task

CLI

Verified live

Tests (CI-safe, no mlx)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Backend: `mlx` → `{mlx, mlxlm}`