Skip to content

feat(self-improve): generalize ReST-EM loop to mlxlm backend + agent-task scenarios#1064

Merged
jayscambler merged 2 commits into
mainfrom
feat/self-improve-mlxlm
Jun 9, 2026
Merged

feat(self-improve): generalize ReST-EM loop to mlxlm backend + agent-task scenarios#1064
jayscambler merged 2 commits into
mainfrom
feat/self-improve-mlxlm

Conversation

@jayscambler

Copy link
Copy Markdown
Contributor

What

The self-improving loop (autoctx self-improve, ReST-EM / expert iteration) was hardcoded to the from-scratch mlx backend and game scenarios. This generalizes it along both axes — the two follow-ons chosen after closing the recursive loop.

Backend: mlx{mlx, mlxlm}

The loop now drives either SFT backend: mlx (from-scratch GPT) or mlxlm (LoRA on a pretrained base, --backend mlxlm). mlxlm's in-scenario assessment now collects {strategy, score} samples to a collect_samples_path (the adapter analogue of the existing mlx collect path), so ReST-EM can keep the elite and retrain. train.py permits collect_samples_path for mlx+mlxlm and rejects it for the online-RL/distillation backends (grpo/opd/trl), which have no SFT sample stream.

Scenarios: game → game + agent-task

Agent-task scenarios (scored via evaluate_output on free text) are now first-class. _assess_mlxlm collects the raw text as the sample's strategy, and records_to_completions uses a string strategy verbatim as the completion (json.dumps would quote/escape it). Game scenarios (JSON strategies) are unchanged.

CLI

autoctx self-improve gains --backend, --base-model, --fine-tune-type, --num-layers, and --batch-size (lower it for mlxlm on small seeds: mlx-lm requires the validation split to hold ≥ batch_size examples).

Verified live

autoctx self-improve --backend mlxlm on grid_ctf (cached Qwen2.5-0.5B), 2 rounds: dataset grew 24→28, per-round avg_score 0.755 → 0.878. The loop collects samples, filters elite, grows the dataset, and retrains the adapter each round — end to end through the CLI.

Tests (CI-safe, no mlx)

records_to_completions verbatim-text vs JSON-strategy; loop rejects non-SFT backends; loop threads backend + adapter params into run_training (mocked); collect_samples_path rejected for non-SFT backends. Updated the prior MLX-only guard test to the new mlx and mlxlm message. Documents the new flags + backend/scenario coverage in mlx-training.md. ruff + mypy clean.

… scenarios

The self-improving loop (autoctx self-improve) was hardcoded to the from-scratch mlx
backend and game scenarios. Generalize it along both axes:

Backend: the loop now drives either SFT backend -- mlx (from-scratch GPT) or mlxlm (LoRA on
a pretrained base). mlxlm's in-scenario assessment now collects {strategy, score} samples to
a collect_samples_path (the adapter analogue of the mlx collect path), so ReST-EM can keep the
elite and retrain. train.py allows collect_samples_path for mlx + mlxlm and rejects it for the
online-RL / distillation backends (grpo/opd/trl), which have no SFT sample stream.

Scenarios: agent-task scenarios (scored via evaluate_output on free text) are now first-class.
_assess_mlxlm collects the raw text as the sample's strategy, and records_to_completions uses a
string strategy verbatim as the completion (json.dumps would quote/escape it). Game scenarios
(JSON strategies) are unchanged.

CLI: autoctx self-improve gains --backend, --base-model, --fine-tune-type, --num-layers, and
--batch-size (lower it for mlxlm on small seeds: mlx-lm requires the validation split to hold
>= batch_size examples).

Verified live: autoctx self-improve --backend mlxlm on grid_ctf ran 2 rounds, growing the
dataset 24->28 and improving avg_score 0.755 -> 0.878 across rounds.

Tests (CI-safe): records_to_completions verbatim-text vs JSON strategy; loop rejects non-SFT
backends; loop threads backend + adapter params into run_training (mocked); collect_samples_path
rejected for non-SFT backends. Documents the new flags + backend/scenario coverage in mlx-training.md.
…d after scoring (review #1064)

[P2] The agent-task assessment called scenario.evaluate_output(output=...) WITHOUT the
required state arg, so it raised for AgentTaskInterface scenarios, got swallowed by the broad
except, and could report valid_rate>0 while collecting nothing. Fixes:
- scenario_task_prompt_and_state resolves the prompt AND the state it was built from; the
  mlxlm assessment passes that same state into evaluate_output(output, state).
- valid is incremented only AFTER scoring succeeds (both mlxlm and the mlx/prepare path), so a
  scoring error can no longer inflate valid_rate.
- the from-scratch mlx path (prepare.py) likewise resolves + passes state.

[P2] Docs over-claimed agent-task free-text across both SFT backends. The from-scratch mlx
backend emits the structured <|...|> token contract (JSON strategies), so free-text agent
tasks fit the mlxlm backend (pretrained instruct model). Narrowed mlx-training.md: both
backends handle game/JSON-strategy scenarios; free-text agent tasks use --backend mlxlm.

Tests: scenario_task_prompt_and_state returns state for agent tasks / None for games.
@jayscambler jayscambler merged commit 221c2e4 into main Jun 9, 2026
17 checks passed
@jayscambler jayscambler deleted the feat/self-improve-mlxlm branch June 9, 2026 22:20
jayscambler added a commit that referenced this pull request Jun 11, 2026
…K) (#1070)

Two bugs found live on a GSM8K GRPO run made every reward 0 -> no gradient (RLVR silently
learned nothing), on top of the length issue fixed in #1067:

1. GRPO prompts were raw strings, so TRL skipped the chat template -> the instruct model never
   entered chat mode, never emitted EOS, and every completion ran to max_completion_length
   (clipped_ratio=1, mean_terminated_length=0) with no parseable answer. build_prompt_dataset_rows
   now emits CONVERSATIONAL prompts ([{role:user,content}]) so TRL applies the chat template and
   completions terminate.
2. score_completions required a JSON construction (extract_json_object) for ALL scenarios, so a
   free-text GSM8K answer scored 0 even when correct. It is now hybrid for agent tasks: pass the
   extracted JSON when present (back-compat for JSON-construction agent tasks) else the RAW text,
   and it accepts conversational (message-list) completions.

Tests: free-text agent-task scoring (no JSON), conversational completion handling, conversational
prompt rows. Both bugs are the same game-vs-agent-task split fixed earlier in the mlxlm assessment
path (#1064).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant