feat(mlxlm): per-record prompts for dataset-style agent tasks (GSM8K STaR) #1065

Merged: jayscambler merged 2 commits into main from feat/gsm8k-star on Jun 10, 2026
@jayscambler

What

The mlxlm SFT path shared one scenario-level task_prompt across all training records. That fits single-task scenarios (cap-sets, grid_ctf) but not dataset-style agent tasks like GSM8K, where each record solves a different problem. records_to_completions now uses a record's own "prompt" field when present (falling back to the scenario task_prompt), so each completion trains on its own instruction. This is the missing piece for running STaR / ReST-EM over a problem distribution through run_mlxlm_training.
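
A minimal sketch of the routing described above, not the actual implementation. The function name `records_to_completions_sketch`, the `"completion"` output key, and treating an empty prompt as absent are assumptions made for illustration; only the `"prompt"` field, the `strategy` field, and the `task_prompt` fallback come from this PR.

```python
from typing import Any


def records_to_completions_sketch(
    records: list[dict[str, Any]], task_prompt: str
) -> list[dict[str, str]]:
    """Pair each record's answer with the prompt it should be trained against."""
    completions = []
    for record in records:
        # Prefer the record's own prompt; fall back to the scenario-level prompt.
        # Assumption: a missing or empty "prompt" both count as "not present".
        prompt = record.get("prompt") or task_prompt
        completions.append({"prompt": prompt, "completion": record["strategy"]})
    return completions
```

With this shape, a GSM8K-style record carrying its own problem text trains on that problem, while a single-task record with no `"prompt"` still trains on the shared scenario prompt.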

Tests: per-record prompt routing + fallback; end-to-end through write_completion_dataset.

Motivation + honest finding

This was built to run a STaR self-improvement loop on GSM8K locally (Qwen2.5-0.5B-4bit): the model samples solutions to train problems, the exact-integer verifier keeps the correct chains, the model is LoRA-SFT'd on them, and the cycle repeats; evaluation is on a disjoint held-out test split.
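
The sample-and-filter half of one such round can be sketched as below. This is an illustration under stated assumptions, not the code that was run: `star_round`, the `sample` and `verify` callables, the `"answer"` field, and `samples_per_problem` are all hypothetical names, and the LoRA-SFT step on the kept records is left out.

```python
from typing import Any, Callable


def star_round(
    problems: list[dict[str, Any]],
    sample: Callable[[str], str],
    verify: Callable[[str, int], bool],
    samples_per_problem: int = 4,
) -> list[dict[str, Any]]:
    """Sample chains for each train problem and keep only the verified ones."""
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            chain = sample(problem["prompt"])  # model generates a solution
            if verify(chain, problem["answer"]):  # keep only correct chains
                # Each kept record carries its own prompt for the SFT step.
                kept.append(
                    {"prompt": problem["prompt"], "strategy": chain, "score": 1.0}
                )
    return kept
```

The kept records would then be fine-tuned on, and the updated model would sample the next round.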

The wiring works end to end (the per-problem prompts thread correctly through training). But the result was a clean negative: held-out accuracy regressed 25% → 15% → 12.5% over two rounds. The likely causes are well-understood and point to scale, not a bug:

  • Catastrophic forgetting — a 4-bit 0.5B LoRA-SFT'd on only ~25 of its own chains drifts off the instruct prior.
  • Verifier false positives — GSM8K answers are integers, so lucky-wrong reasoning passes the verifier and trains bad chains.
  • Sub-STaR-regime scale — STaR's results are on far larger models and thousands of problems; 0.5B / 48 problems is well below where it bootstraps.
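
To make the false-positive failure mode concrete, here is a hedged sketch of an exact-integer check. The extraction rule (take the last integer in the text) and the names `extract_final_integer` and `verify_exact_integer` are assumptions for illustration; the verifier used in the run may extract answers differently.

```python
import re
from typing import Optional


def extract_final_integer(text: str) -> Optional[int]:
    """Return the last integer in the text, or None if there is none."""
    # Assumption: the final answer is the last integer mentioned in the chain.
    matches = re.findall(r"-?\d[\d,]*", text)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def verify_exact_integer(chain: str, gold: int) -> bool:
    """Pass a chain only if its final integer equals the gold answer."""
    return extract_final_integer(chain) == gold


# The reasoning here is wrong (5 + 5 - 2 is 8), but the final number matches
# the gold answer, so the chain passes and would be kept as training data.
lucky_wrong = "She has 5 + 5 = 10 apples, minus 2 leaves 18. The answer is 18"
lucky_wrong_passes = verify_exact_integer(lucky_wrong, 18)
```

A check like this only inspects the final number, which is why filtering on reasoning quality is listed as a requirement for a positive result.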

So this PR ships the reusable infrastructure (per-record prompts) that the STaR attempt validated, independent of the negative local result. A positive GSM8K self-improvement demo needs the regime STaR actually requires (a 3B+ base, thousands of problems, gentler training, false-positive filtering) — i.e. a GPU/Modal-scale run, not local 0.5B.

…SM8K)

The mlxlm SFT path shared one scenario-level task_prompt across all records, which fits
single-task scenarios but not dataset-style agent tasks (GSM8K: each record solves a DIFFERENT
problem). records_to_completions now uses a record's own 'prompt' field when present (falling
back to the scenario task_prompt), so each completion trains on its own instruction. This is
the missing piece for running STaR / ReST-EM over a problem distribution through run_mlxlm_training.

Tests: per-record prompt routing + fallback; end-to-end through write_completion_dataset.
… + prompt-aware dedupe (review #1065)

[P2] Per-record prompts only helped seed records: _assess_mlxlm collected {strategy, score}
without the prompt, and samples_to_records dropped it, so later ReST-EM rounds trained generated
GSM8K-style answers against the fallback scenario prompt, not the problem they were scored on. Now:
- _assess_mlxlm draws a problem PER sample for agent tasks (so the loop explores the dataset) and
  collects {prompt, strategy, score} -- the exact problem each answer was scored against.
- samples_to_records preserves the per-sample prompt (single-task samples carry none, unaffected).
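
A sketch of the prompt-preserving conversion, under assumptions: `samples_to_records_sketch` is a hypothetical stand-in, and the sample dictionaries are assumed to carry exactly the `prompt`, `strategy`, and `score` keys named above.

```python
from typing import Any


def samples_to_records_sketch(samples: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Convert scored samples to training records, keeping any per-sample prompt."""
    records = []
    for sample in samples:
        record = {"strategy": sample["strategy"], "score": sample["score"]}
        # Only agent-task samples carry a prompt; single-task samples stay as before.
        if sample.get("prompt") is not None:
            record["prompt"] = sample["prompt"]
        records.append(record)
    return records
```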

[P2] Curation deduped by strategy only, so two different problems sharing a completion/answer
collapsed under dedupe=True (run_self_improving_loop's default). _strategy_key now includes the
record's prompt when present: dataset records key on (problem, answer); games/single-task key on
strategy alone (prior behavior preserved).
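
The keying rule can be sketched as follows. `strategy_key_sketch` and `dedupe_sketch` are hypothetical names, and keeping the highest-scoring record per key is an assumed policy, not something this PR states.

```python
from typing import Any, Hashable


def strategy_key_sketch(record: dict[str, Any]) -> Hashable:
    """Key dataset records on (prompt, strategy); others on strategy alone."""
    prompt = record.get("prompt")
    if prompt is not None:
        return (prompt, record["strategy"])
    return record["strategy"]


def dedupe_sketch(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collapse records that share a key, keeping the best-scoring one."""
    best: dict[Hashable, dict[str, Any]] = {}
    for record in records:
        key = strategy_key_sketch(record)
        # Assumption: ties keep the earlier record.
        if key not in best or record["score"] > best[key]["score"]:
            best[key] = record
    return list(best.values())
```

Under this keying, two different problems whose answers are both "42" survive as separate records, while two identical answers to the same problem collapse to one.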

Tests: samples_to_records preserves/omits prompt; dedupe keeps distinct problems with the same
answer but still collapses same-problem duplicates.
@jayscambler jayscambler merged commit 26b36ee into main Jun 10, 2026
17 checks passed
@jayscambler jayscambler deleted the feat/gsm8k-star branch June 10, 2026 04:52