feat(mlxlm): per-record prompts for dataset-style agent tasks (GSM8K STaR) by jayscambler · Pull Request #1065 · greyhaven-ai/autocontext

jayscambler · 2026-06-10T03:17:34Z

What

The mlxlm SFT path shared one scenario-level task_prompt across all training records. That fits single-task scenarios (cap-sets, grid_ctf) but not dataset-style agent tasks like GSM8K, where each record solves a different problem. records_to_completions now uses a record's own "prompt" field when present (falling back to the scenario task_prompt), so each completion trains on its own instruction. This is the missing piece for running STaR / ReST-EM over a problem distribution through run_mlxlm_training.
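The routing described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual records_to_completions: the function name, its signature, the "strategy" field as the completion text, and the output keys are all assumptions made for the example.

```python
from typing import Any


def records_to_completions_sketch(
    records: list[dict[str, Any]],
    task_prompt: str,
) -> list[dict[str, str]]:
    """Pair each completion with its own prompt, else the scenario-level prompt.

    Hypothetical stand-in for records_to_completions; field names are assumed.
    """
    rows = []
    for record in records:
        # Prefer the record's own non-empty "prompt"; otherwise fall back.
        prompt = record.get("prompt") or task_prompt
        rows.append({"prompt": prompt, "completion": str(record["strategy"])})
    return rows


records = [
    # Dataset-style record: carries the problem it answers.
    {"prompt": "Tom has 3 apples and buys 4 more. How many now?",
     "strategy": "3 + 4 = 7. Answer: 7"},
    # Single-task record: no prompt of its own, so the fallback applies.
    {"strategy": "move north"},
]
rows = records_to_completions_sketch(records, task_prompt="Solve the scenario task.")
```

Under these assumptions, the first row trains against its own problem text and the second against the shared scenario prompt, which is the prior behavior.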

Tests: per-record prompt routing + fallback; end-to-end through write_completion_dataset.

Motivation + honest finding

This was built to run a STaR self-improvement loop on GSM8K locally (Qwen2.5-0.5B-4bit): the model samples solutions to training problems, the exact-integer verifier keeps the chains whose final answer is correct, the model is LoRA-SFT'd on those chains, and the cycle repeats. Evaluation uses a disjoint held-out test split.
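One round of the sample-and-filter step could look roughly like the sketch below. It is only an outline under stated assumptions: star_round, the problem dict layout ("question", "answer"), the stub sampler, and the last-integer verifier are all hypothetical, and the LoRA-SFT step that would follow each round is deliberately left out.

```python
import re
from typing import Callable


def last_int_matches(chain: str, answer: int) -> bool:
    """Assumed verifier: the last integer in the chain must equal the gold answer."""
    numbers = re.findall(r"-?\d+", chain)
    return bool(numbers) and int(numbers[-1]) == answer


def star_round(
    problems: list[dict],
    sample: Callable[[str], str],
    verify: Callable[[str, int], bool],
    samples_per_problem: int = 4,
) -> list[dict]:
    """Collect at most one verified chain per problem, keyed to its own prompt."""
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            chain = sample(problem["question"])
            if verify(chain, problem["answer"]):
                # Keep the problem text so training uses a per-record prompt.
                kept.append({"prompt": problem["question"], "strategy": chain})
                break
    return kept


problems = [
    {"question": "What is 2 + 2?", "answer": 4},
    {"question": "What is 3 * 5?", "answer": 15},
]


def stub_sample(question: str) -> str:
    """Stand-in for model sampling; always returns the same chain."""
    return "2 + 2 = 4"


kept = star_round(problems, stub_sample, last_int_matches)
# A real loop would fine-tune on `kept` here, then resample with the new weights.
```

With the stub sampler, only the first problem yields a verified chain, so the second contributes nothing to the training set for that round.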

The wiring works end to end (the per-problem prompts thread correctly through training). But the result was a clean negative: held-out accuracy regressed 25% → 15% → 12.5% over two rounds. The likely causes are well-understood and point to scale, not a bug:

Catastrophic forgetting — a 4-bit 0.5B LoRA-SFT'd on only ~25 of its own chains drifts off the instruct prior.
Verifier false positives — GSM8K answers are integers, so lucky-wrong reasoning passes the verifier and trains bad chains.
Sub-STaR-regime scale — STaR's results are on far larger models and thousands of problems; 0.5B / 48 problems is well below where it bootstraps.
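The verifier false-positive failure mode is easy to reproduce in miniature. The sketch below assumes a verifier that compares only the last integer in a chain against the gold answer; final_integer and exact_integer_verify are hypothetical names, and the two chains are made up for illustration.

```python
import re
from typing import Optional


def final_integer(chain: str) -> Optional[int]:
    """Return the last integer that appears in the chain, if any."""
    numbers = re.findall(r"-?\d+", chain)
    return int(numbers[-1]) if numbers else None


def exact_integer_verify(chain: str, gold: int) -> bool:
    """Assumed verifier: only the final integer is checked, not the reasoning."""
    return final_integer(chain) == gold


# Made-up problem: 12 pens at 3 each with a discount of 6, so the gold answer is 30.
gold = 30
sound_chain = "12 * 3 = 36, then 36 - 6 = 30. Answer: 30"
lucky_chain = "12 + 3 = 15, then 15 * 2 = 30. Answer: 30"  # wrong steps, right number

sound_ok = exact_integer_verify(sound_chain, gold)
lucky_ok = exact_integer_verify(lucky_chain, gold)
```

Both chains pass, so a filter of this shape would admit the second chain into the training set even though its reasoning is wrong.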

So this PR ships the reusable infrastructure (per-record prompts) that the STaR attempt validated, independent of the negative local result. A positive GSM8K self-improvement demo needs the regime STaR actually requires (a 3B+ base, thousands of problems, gentler training, false-positive filtering) — i.e. a GPU/Modal-scale run, not local 0.5B.

…SM8K)

The mlxlm SFT path shared one scenario-level task_prompt across all records, which fits single-task scenarios but not dataset-style agent tasks (GSM8K: each record solves a DIFFERENT problem). records_to_completions now uses a record's own 'prompt' field when present (falling back to the scenario task_prompt), so each completion trains on its own instruction. This is the missing piece for running STaR / ReST-EM over a problem distribution through run_mlxlm_training.

Tests: per-record prompt routing + fallback; end-to-end through write_completion_dataset.

… + prompt-aware dedupe (review #1065)

[P2] Per-record prompts only helped seed records: _assess_mlxlm collected {strategy, score} without the prompt, and samples_to_records dropped it, so later ReST-EM rounds trained generated GSM8K-style answers against the fallback scenario prompt, not the problem they were scored on. Now:
- _assess_mlxlm draws a problem PER sample for agent tasks (so the loop explores the dataset) and collects {prompt, strategy, score} -- the exact problem each answer was scored against.
- samples_to_records preserves the per-sample prompt (single-task samples carry none, unaffected).

[P2] Curation deduped by strategy only, so two different problems sharing a completion/answer collapsed under dedupe=True (run_self_improving_loop's default). _strategy_key now includes the record's prompt when present: dataset records key on (problem, answer); games/single-task key on strategy alone (prior behavior preserved).

Tests: samples_to_records preserves/omits prompt; dedupe keeps distinct problems with the same answer but still collapses same-problem duplicates.
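The prompt-aware dedupe key described in this commit might look something like the sketch below. It is an assumption-laden illustration rather than the actual _strategy_key: the function names, the tuple-shaped key, and the record layout are hypothetical.

```python
def strategy_key_sketch(record: dict) -> tuple:
    """Key on (prompt, strategy) when a prompt is present, else on strategy alone."""
    strategy = str(record["strategy"])
    prompt = record.get("prompt")
    return (prompt, strategy) if prompt else (strategy,)


def dedupe_sketch(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = strategy_key_sketch(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"prompt": "What is 2 + 2?", "strategy": "Answer: 4", "score": 1.0},
    {"prompt": "What is 1 + 3?", "strategy": "Answer: 4", "score": 1.0},  # different problem, same answer
    {"prompt": "What is 2 + 2?", "strategy": "Answer: 4", "score": 1.0},  # same-problem duplicate
    {"strategy": "move north", "score": 0.5},  # single-task: keyed on strategy only
    {"strategy": "move north", "score": 0.7},  # collapses with the record above
]
unique = dedupe_sketch(records)
```

Here the two distinct problems that share "Answer: 4" both survive, while the repeated problem and the repeated single-task strategy each collapse to one record.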

jayscambler added 2 commits June 9, 2026 22:17

jayscambler merged commit 26b36ee into main Jun 10, 2026
17 checks passed

jayscambler deleted the feat/gsm8k-star branch June 10, 2026 04:52
