…eline

- sweep_agent.py: fix script target (train_gpt_mlx.py → train_gpt_mlx_kl.py), fix parser key (int8_zlib → int6_zstd), update DEFAULT_CONFIG to budget-safe 11L/2xMLP/BIGRAM_HASH_SIZE=6144, replace WEIGHT_DECAY with USE_SMEARGATE in SEARCH_SPACE, add EVAL_MODE=standard + VAL_LOSS_EVERY=0 to SMOKE_ENV for speed
- run_ablation.sh: fix three grep/regex keys from int8_zlib to int6_zstd in both the shell section and the embedded Python summary parser
- train_gpt_mlx_kl.py: add USE_SMEARGATE alias for the SMEAR_ENABLED env var so smoke tests can disable SmearGate; add USE_SWA/SWA_DECAY Hyperparameters fields and implement an SWA accumulation loop starting at 60% of iterations using EMABuffer with swa_decay; SWA takes priority over EMA for eval and final save when enabled; update the innovations log line to include use_swa/swa_decay

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
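The SWA accumulation described in this commit can be sketched roughly as below. This is a minimal illustration, assuming the commit's description (EMA-style buffer with `swa_decay`, started at 60% of iterations, updated each step); the function name `swa_accumulate` and the list-of-floats parameter representation are hypothetical, not the script's actual `EMABuffer` API:

```python
def swa_accumulate(param_stream, total_iters, swa_decay=0.4, start_frac=0.6):
    """Sketch of warm-started SWA: ignore early steps, then blend each new
    parameter snapshot into the buffer with decay `swa_decay`."""
    swa_start = int(total_iters * start_frac)  # 60% of training by default
    swa = None
    for step, params in enumerate(param_stream):
        if step < swa_start:
            continue                     # accumulation has not started yet
        if swa is None:
            swa = list(params)           # initialize buffer from current weights
        else:
            swa = [swa_decay * s + (1.0 - swa_decay) * p
                   for s, p in zip(swa, params)]
    return swa                           # preferred over EMA at eval/save when enabled
```

At eval/save time the script, per the commit message, would prefer this buffer over the plain EMA weights when USE_SWA is enabled.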
Port all Phase 1 innovations from train_gpt_mlx_kl.py into train_gpt_kl.py:

Hyperparameters:
- warmdown_iters: 1200 → 3500 [OFFICIAL-PROVEN -0.0002 BPB]
- bigram_hash_size: 10240 → 16384
- Add USE_SMEARGATE env alias for smear_enabled
- Add rope_dims=16 (partial RoPE)
- Add ln_scale_enabled=True (1/sqrt(layer+1) depth scale)
- Add use_swa=False + swa_decay=0.4
- Add muon_weight_decay=0.04 + adam_weight_decay=0.04
- Add late_qat_threshold=0.15 (replaces qat_start_frac logic)
- Add use_gptq_lite=True

Architecture:
- Block: add ln_scale_factor (1/sqrt(layer_idx+1)) applied before attn+MLP norm
- CausalSelfAttention: add rope_dims param for partial RoPE (rotates first N dims only)
- GPT constructor: pass rope_dims + ln_scale_enabled through to Block/Attn

Training loop:
- QAT now triggered by lr_mul < late_qat_threshold (warmdown-relative), not a fixed fraction
- SWA buffer init at 60% of training when USE_SWA=1; updated each step; preferred over EMA at eval/save
- Adam → AdamW with weight_decay for all scalar/embed/head optimizers
- Muon: add weight_decay support to step()

Quantizer:
- Add quantize_float_tensor_gptq_lite(): 5-percentile MSE sweep per row [-0.0006 BPB OFFICIAL-PROVEN]
- quantize_state_dict_int6() accepts a use_gptq_lite flag

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
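The "5-percentile MSE sweep per row" behind GPTQ-lite can be sketched as follows. This is an illustrative standalone version, assuming symmetric per-row int6 quantization; the function name, the exact percentile set, and the return shape are guesses, since the real `quantize_float_tensor_gptq_lite` signature is not shown in the diff:

```python
import numpy as np

def gptq_lite_quantize_row(row, bits=6,
                           percentiles=(1.0, 0.999, 0.995, 0.99, 0.98)):
    """Per-row clip-percentile sweep: for each candidate clip level, quantize
    symmetrically and keep the clip that minimizes round-trip MSE."""
    qmax = 2 ** (bits - 1) - 1           # 31 for int6
    abs_row = np.abs(row)
    best = None
    for p in percentiles:
        clip = abs_row.max() if p >= 1.0 else np.quantile(abs_row, p)
        if clip == 0:
            continue
        scale = clip / qmax
        q = np.clip(np.round(row / scale), -qmax - 1, qmax)
        mse = np.mean((q * scale - row) ** 2)
        if best is None or mse < best[0]:
            best = (float(mse), q.astype(np.int8), float(scale))
    return best                           # (mse, quantized row, scale)
```

Clipping outliers slightly shrinks the scale, which reduces rounding error for the bulk of each row; sweeping a handful of percentiles per row and keeping the MSE winner is what plausibly buys the quoted -0.0006 BPB.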
Update train_gpt_kl.py defaults to frontier settings:
- TRAIN_SEQ_LEN: 1024 → 2048
- TRAIN_BATCH_TOKENS: 524288 → 786432
- MUON_MOMENTUM: 0.95 → 0.99
- MUON_MOMENTUM_WARMUP_STEPS: 500 → 1500
- MATRIX_LR: 0.04 → 0.02
- SCALAR_LR: 0.04 → 0.02
- TIED_EMBED_LR: 0.05 → 0.03
- GRAD_CLIP_NORM: 0.0 → 0.3

All values are env-overridable. Script: 67KB, leaving 14.72MB for the model.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
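"Env-overridable" defaults typically follow a pattern like the sketch below. The helper name `env_default` is hypothetical; the commit only states that each value can be overridden per run via environment variables:

```python
import os

def env_default(name, default, cast=float):
    """Read a hyperparameter from the environment, falling back to the
    frontier default; every value stays overridable per run."""
    raw = os.environ.get(name)
    return cast(raw) if raw is not None else default

# Frontier defaults from the commit, each overridable via the environment:
TRAIN_SEQ_LEN = env_default("TRAIN_SEQ_LEN", 2048, int)
MUON_MOMENTUM = env_default("MUON_MOMENTUM", 0.99, float)
GRAD_CLIP_NORM = env_default("GRAD_CLIP_NORM", 0.3, float)
```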
…learned int6 scales
Agent-Logs-Url: https://github.com/kailean/parameter-golf/sessions/4932a578-a9a9-478a-8b7b-88b67ca885ab Co-authored-by: kailean <49617037+kailean@users.noreply.github.com>
…int8/int6+brotli

Architecture: 640d, 11L, 10Q/5KV heads, MLP_MULT=4, depth recurrence L3-5×2
Quantization: mixed int8 (embeddings) + int6 (weights) + brotli
No QAT (LATE_QAT_THRESHOLD=0.0), GPTQ-lite post-hoc
Score-first TTT (rollback if worse), SmearGate, partial RoPE, XSA
val_bpb: ~1.16 (WIP, training in progress)
Pull request overview
Adds KaiLean’s “Parameter Golf” training stack and associated tooling/documentation to the repo, including a CUDA/PyTorch training entrypoint with advanced model/quantization techniques plus local sweep/ablation helpers.
Changes:
- Added two CUDA/PyTorch training scripts (train_gpt_kl.py, train_gpt_kl_v2.py) implementing multiple modeling + quantization techniques (EMA/SWA, int6 packing, bigram hash bias, etc.).
- Added local experimentation utilities (sweep_agent.py, run_ablation.sh) and supporting MLX innovation notes (kl_innovations.py).
- Added/updated submission and project docs (SUBMISSION.md, CLAUDE.md) and expanded .gitignore.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| train_gpt_kl_v2.py | New CUDA/PyTorch training script variant including EMA + int6 quantization/roundtrip eval. |
| train_gpt_kl.py | New primary CUDA/PyTorch training script with additional features (depth recurrence, parallel residuals, TTT eval, multiple artifact formats). |
| sweep_agent.py | Adds a greedy hyperparameter sweep runner intended for the MLX training script. |
| run_ablation.sh | Adds an overnight ablation runner and summary generator. |
| kl_innovations.py | Adds an MLX-focused "innovation stack" helper module. |
| SUBMISSION.md | Documents the intended model, training, and quantization approach for the submission. |
| CLAUDE.md | Project usage notes/commands for local runs. |
| .gitignore | Ignores logs and various infra/experimental artifacts. |
```python
# Collect Q and V weight matrices for LoRA
qv_pairs: list[tuple[str, str, int, int]] = []  # (q_name, v_name, out_f, in_f)
for block in base_model.blocks:
    q_w = block.attn.c_q.weight  # (dim, dim)
    v_w = block.attn.c_v.weight  # (kv_dim, dim)
```
qv_pairs is annotated/commented as containing names and dimensions, but it actually stores weight tensors (block.attn.c_q.weight, block.attn.c_v.weight). The current type hint list[tuple[str, str, int, int]] is incorrect and q_w/v_w locals are unused. Update the annotation/comment to match the real contents (e.g., a list of (Tensor, Tensor)), and drop the unused locals to avoid confusion.
Suggested change:

```diff
- # Collect Q and V weight matrices for LoRA
- qv_pairs: list[tuple[str, str, int, int]] = []  # (q_name, v_name, out_f, in_f)
- for block in base_model.blocks:
-     q_w = block.attn.c_q.weight  # (dim, dim)
-     v_w = block.attn.c_v.weight  # (kv_dim, dim)
+ # Collect Q and V weight tensor pairs for LoRA
+ qv_pairs: list[tuple[Tensor, Tensor]] = []
+ for block in base_model.blocks:
```
```python
ent_weight = args.ttt_entropy_weight
max_ent = math.log(args.vocab_size)  # max possible entropy
```
ent_weight and max_ent are computed but never used in the current TTT implementation (loss is standard NLL). Consider removing these unused variables (or reintroduce entropy-weighting) to keep the evaluation code clear.
Suggested change (delete the unused lines):

```diff
- ent_weight = args.ttt_entropy_weight
- max_ent = math.log(args.vocab_size)  # max possible entropy
```
```shell
MAX_WALLCLOCK_SECONDS=0 \
WARMDOWN_ITERS=100 \
"$@" \
python3 train_gpt_mlx_kl.py
```
run_one executes python3 train_gpt_mlx_kl.py, but train_gpt_mlx_kl.py is not present in the repository. This makes the ablation script fail on a clean checkout; update it to call an existing entrypoint (e.g., train_gpt_mlx.py) or add/commit the missing file.
Suggested change:

```diff
- python3 train_gpt_mlx_kl.py
+ python3 train_gpt_mlx.py
```
```markdown
Baseline: train_gpt_mlx.py (stock), my version: train_gpt_mlx_kl.py

## Commands
- Smoke test: RUN_ID=test ITERATIONS=100 TRAIN_BATCH_TOKENS=8192 VAL_LOSS_EVERY=0 VAL_BATCH_SIZE=8192 WARMUP_STEPS=3 python3 train_gpt_mlx_kl.py
```
This doc references train_gpt_mlx_kl.py as the main training script and in the smoke-test command, but that file is not present in the repository. Update the documentation to point at the actual entrypoint(s) that exist (e.g., train_gpt_mlx.py / train_gpt_kl.py) or add/commit train_gpt_mlx_kl.py so these commands work.
Suggested change:

```diff
- Baseline: train_gpt_mlx.py (stock), my version: train_gpt_mlx_kl.py
+ Baseline: train_gpt_mlx.py (stock), my version: train_gpt_kl.py
  ## Commands
- - Smoke test: RUN_ID=test ITERATIONS=100 TRAIN_BATCH_TOKENS=8192 VAL_LOSS_EVERY=0 VAL_BATCH_SIZE=8192 WARMUP_STEPS=3 python3 train_gpt_mlx_kl.py
+ - Smoke test: RUN_ID=test ITERATIONS=100 TRAIN_BATCH_TOKENS=8192 VAL_LOSS_EVERY=0 VAL_BATCH_SIZE=8192 WARMUP_STEPS=3 python3 train_gpt_kl.py
```
```markdown
## Quantization
- **Mixed int8/int6 + brotli**: int8 for embedding matrices (tok_emb/lm_head), int6 packed per-row for all other weights
- **GPTQ-lite**: per-row clip percentile search for optimal quantization
- **Competition format**: int6+brotli roundtrip (primary), also outputs int8+zlib and int6+zstd for comparison
```
The submission doc claims the primary artifact is mixed int8/int6 + brotli (int8 embeddings + int6 for the rest), but train_gpt_kl.py currently writes (1) final_model.int8.ptz (int8 for all weights) and (2) final_model.int6.brotli.ptz (int6 for all weights). Either update the doc to match the produced artifacts, or update serialization to actually use the mixed quantization path (e.g., quantize_state_dict_int6(..., embed_bits=8) with brotli).
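The mixed path this comment asks for can be sketched as below. This is an illustrative standalone version, assuming per-row symmetric quantization; the function name `quantize_state_dict_mixed` and the key-matching heuristic are hypothetical (the script's real entry point is `quantize_state_dict_int6` with an `embed_bits`-style flag, per the comment):

```python
import numpy as np

def quantize_state_dict_mixed(state_dict, embed_keys=("tok_emb", "lm_head")):
    """int8 for embedding-like tensors, per-row symmetric int6 for the rest.

    Returns {name: (quantized int array, per-row scales, bit width)}.
    """
    out = {}
    for name, w in state_dict.items():
        bits = 8 if any(k in name for k in embed_keys) else 6
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(w).max(axis=-1, keepdims=True) / qmax
        scale[scale == 0] = 1.0  # guard all-zero rows
        q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
        out[name] = (q, scale, bits)
    return out
```

A real serialization path would additionally pack the int6 values 4-per-3-bytes and compress with brotli, as the submission doc describes.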
```python
def make_lora_forward(w, a, b, orig_fwd):
    def lora_forward(x):
        return F.linear(x, w.to(x.dtype), None) + (x @ b.T.to(x.dtype)) @ a.T.to(x.dtype)
```
make_lora_forward accepts orig_fwd but never uses it. Either remove the unused parameter/argument, or use it to preserve any special behavior from CastedLinear.forward (e.g., QAT branch, bias handling) and add the LoRA delta on top.
Suggested change:

```diff
-         return F.linear(x, w.to(x.dtype), None) + (x @ b.T.to(x.dtype)) @ a.T.to(x.dtype)
+         base = orig_fwd(x)
+         delta = (x @ b.T.to(x.dtype)) @ a.T.to(x.dtype)
+         return base + delta
```
```python
result = subprocess.run(
    ["python3", "train_gpt_mlx_kl.py"],
    env=env, capture_output=True, text=True,
```
This script invokes train_gpt_mlx_kl.py, but that file is not present in the repository (only train_gpt_mlx.py, train_gpt_kl.py, and train_gpt_kl_v2.py exist). As-is, the sweep will fail on a clean checkout. Update the command to point at an existing training entrypoint or add/commit the missing train_gpt_mlx_kl.py implementation.
```shell
cd ~/parameter-golf
source ~/pg_env/bin/activate
```
This script hard-codes local machine paths (cd ~/parameter-golf and source ~/pg_env/bin/activate). That will break for anyone running from a different checkout location or venv path. Consider making these relative to the script location (e.g., cd "$(dirname "$0")") and/or guarding activation behind an env var or a check so the script is portable.
Suggested change:

```diff
- cd ~/parameter-golf
- source ~/pg_env/bin/activate
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ cd "$SCRIPT_DIR"
+ if [ -n "${VENV_PATH:-}" ] && [ -f "$VENV_PATH/bin/activate" ]; then
+   # Allow callers to specify a virtualenv explicitly.
+   # shellcheck disable=SC1090
+   source "$VENV_PATH/bin/activate"
+ elif [ -f "$SCRIPT_DIR/.venv/bin/activate" ]; then
+   # Prefer a repo-local virtualenv when present.
+   # shellcheck disable=SC1091
+   source "$SCRIPT_DIR/.venv/bin/activate"
+ fi
```
```python
import zstandard
from pathlib import Path
```
This file imports zstandard, but requirements.txt in this repo currently doesn’t include it. On a clean install, the script will fail at import time. Either add zstandard to dependencies or make the int6+zstd serialization path optional.
Suggested change:

```diff
- import zstandard
  from pathlib import Path
+
+ try:
+     import zstandard
+ except ModuleNotFoundError:
+     class _MissingZstandard:
+         def __getattr__(self, name):
+             raise ModuleNotFoundError(
+                 "Optional dependency 'zstandard' is required for the int6+zstd "
+                 "serialization path. Install it to enable zstd-compressed serialization."
+             )
+     zstandard = _MissingZstandard()
```
```python
# Toggle QAT when lr_mul drops below late_qat_threshold (warmdown-triggered)
if not use_qat_active and scale < args.late_qat_threshold:
    use_qat_active = True
```
scale is referenced before it is assigned. In this loop, if not use_qat_active and scale < args.late_qat_threshold: executes before scale = lr_mul(step, elapsed_ms), which will raise UnboundLocalError on the first iteration. Compute scale before this check (and update the log line accordingly), or move the QAT-toggle block to after the scale = ... assignment (similar to train_gpt_kl.py).
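The correct ordering can be sketched as below. The helper name `training_step_order` is hypothetical; `lr_mul` stands in for the script's schedule function, and the point is only that `scale` is bound before the QAT check reads it:

```python
def training_step_order(step, elapsed_ms, lr_mul, late_qat_threshold, use_qat_active):
    """Compute the LR scale first, then use it for the warmdown-triggered
    QAT toggle, so `scale` is never read before assignment."""
    scale = lr_mul(step, elapsed_ms)               # assign before the check
    if not use_qat_active and scale < late_qat_threshold:
        use_qat_active = True                      # late QAT kicks in during warmdown
    return scale, use_qat_active
```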