…eline

- sweep_agent.py: fix script target (train_gpt_mlx.py → train_gpt_mlx_kl.py), fix parser key (int8_zlib → int6_zstd), update DEFAULT_CONFIG to budget-safe 11L/2xMLP/BIGRAM_HASH_SIZE=6144, replace WEIGHT_DECAY with USE_SMEARGATE in SEARCH_SPACE, add EVAL_MODE=standard + VAL_LOSS_EVERY=0 to SMOKE_ENV for speed
- run_ablation.sh: fix three grep/regex keys from int8_zlib to int6_zstd in both the shell section and the embedded Python summary parser
- train_gpt_mlx_kl.py: add USE_SMEARGATE alias for the SMEAR_ENABLED env var so smoke tests can disable SmearGate; add USE_SWA/SWA_DECAY Hyperparameters fields and implement an SWA accumulation loop starting at 60% of iterations using EMABuffer with swa_decay; SWA takes priority over EMA for eval and final save when enabled; update the innovations log line to include use_swa/swa_decay

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
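The SWA accumulation described in this commit can be sketched roughly as below. This is a minimal illustration, assuming the commit's description (EMA-style buffer with `swa_decay`, started at 60% of iterations, updated each step); the function name `swa_accumulate` and the list-of-floats parameter representation are hypothetical, not the script's actual `EMABuffer` API:

```python
def swa_accumulate(param_stream, total_iters, swa_decay=0.4, start_frac=0.6):
    """Sketch of warm-started SWA: ignore early steps, then blend each new
    parameter snapshot into the buffer with decay `swa_decay`."""
    swa_start = int(total_iters * start_frac)  # 60% of training by default
    swa = None
    for step, params in enumerate(param_stream):
        if step < swa_start:
            continue                     # accumulation has not started yet
        if swa is None:
            swa = list(params)           # initialize buffer from current weights
        else:
            swa = [swa_decay * s + (1.0 - swa_decay) * p
                   for s, p in zip(swa, params)]
    return swa                           # preferred over EMA at eval/save when enabled
```

At eval/save time the script, per the commit message, would prefer this buffer over the plain EMA weights when USE_SWA is enabled.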
Port all Phase 1 innovations from train_gpt_mlx_kl.py into train_gpt_kl.py:

Hyperparameters:
- warmdown_iters: 1200 → 3500 [OFFICIAL-PROVEN -0.0002 BPB]
- bigram_hash_size: 10240 → 16384
- Add USE_SMEARGATE env alias for smear_enabled
- Add rope_dims=16 (partial RoPE)
- Add ln_scale_enabled=True (1/sqrt(layer+1) depth scale)
- Add use_swa=False + swa_decay=0.4
- Add muon_weight_decay=0.04 + adam_weight_decay=0.04
- Add late_qat_threshold=0.15 (replaces qat_start_frac logic)
- Add use_gptq_lite=True

Architecture:
- Block: add ln_scale_factor (1/sqrt(layer_idx+1)) applied before attn+MLP norm
- CausalSelfAttention: add rope_dims param for partial RoPE (rotates first N dims only)
- GPT constructor: pass rope_dims + ln_scale_enabled through to Block/Attn

Training loop:
- QAT now triggered by lr_mul < late_qat_threshold (warmdown-relative), not a fixed fraction
- SWA buffer init at 60% of training when USE_SWA=1; updated each step; preferred over EMA at eval/save
- Adam → AdamW with weight_decay for all scalar/embed/head optimizers
- Muon: add weight_decay support to step()

Quantizer:
- Add quantize_float_tensor_gptq_lite(): 5-percentile MSE sweep per row [-0.0006 BPB OFFICIAL-PROVEN]
- quantize_state_dict_int6() accepts a use_gptq_lite flag

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
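The "5-percentile MSE sweep per row" behind GPTQ-lite can be sketched as follows. This is an illustrative standalone version, assuming symmetric per-row int6 quantization; the function name, the exact percentile set, and the return shape are guesses, since the real `quantize_float_tensor_gptq_lite` signature is not shown in the diff:

```python
import numpy as np

def gptq_lite_quantize_row(row, bits=6,
                           percentiles=(1.0, 0.999, 0.995, 0.99, 0.98)):
    """Per-row clip-percentile sweep: for each candidate clip level, quantize
    symmetrically and keep the clip that minimizes round-trip MSE."""
    qmax = 2 ** (bits - 1) - 1           # 31 for int6
    abs_row = np.abs(row)
    best = None
    for p in percentiles:
        clip = abs_row.max() if p >= 1.0 else np.quantile(abs_row, p)
        if clip == 0:
            continue
        scale = clip / qmax
        q = np.clip(np.round(row / scale), -qmax - 1, qmax)
        mse = np.mean((q * scale - row) ** 2)
        if best is None or mse < best[0]:
            best = (float(mse), q.astype(np.int8), float(scale))
    return best                           # (mse, quantized row, scale)
```

Clipping outliers slightly shrinks the scale, which reduces rounding error for the bulk of each row; sweeping a handful of percentiles per row and keeping the MSE winner is what plausibly buys the quoted -0.0006 BPB.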
Update train_gpt_kl.py defaults to frontier settings:
- TRAIN_SEQ_LEN: 1024 → 2048
- TRAIN_BATCH_TOKENS: 524288 → 786432
- MUON_MOMENTUM: 0.95 → 0.99
- MUON_MOMENTUM_WARMUP_STEPS: 500 → 1500
- MATRIX_LR: 0.04 → 0.02
- SCALAR_LR: 0.04 → 0.02
- TIED_EMBED_LR: 0.05 → 0.03
- GRAD_CLIP_NORM: 0.0 → 0.3

All values are env-overridable. Script: 67KB, leaving 14.72MB for the model.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
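"Env-overridable" defaults typically follow a pattern like the sketch below. The helper name `env_default` is hypothetical; the commit only states that each value can be overridden per run via environment variables:

```python
import os

def env_default(name, default, cast=float):
    """Read a hyperparameter from the environment, falling back to the
    frontier default; every value stays overridable per run."""
    raw = os.environ.get(name)
    return cast(raw) if raw is not None else default

# Frontier defaults from the commit, each overridable via the environment:
TRAIN_SEQ_LEN = env_default("TRAIN_SEQ_LEN", 2048, int)
MUON_MOMENTUM = env_default("MUON_MOMENTUM", 0.99, float)
GRAD_CLIP_NORM = env_default("GRAD_CLIP_NORM", 0.3, float)
```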
…learned int6 scales
Agent-Logs-Url: https://github.com/kailean/parameter-golf/sessions/4932a578-a9a9-478a-8b7b-88b67ca885ab Co-authored-by: kailean <49617037+kailean@users.noreply.github.com>
…int8/int6+brotli

Architecture: 640d, 11L, 10Q/5KV heads, MLP_MULT=4, depth recurrence L3-5×2
Quantization: mixed int8 (embeddings) + int6 (weights) + brotli
No QAT (LATE_QAT_THRESHOLD=0.0), GPTQ-lite post-hoc
Score-first TTT (rollback if worse), SmearGate, partial RoPE, XSA
val_bpb: ~1.16 (WIP, training in progress)
Pull request overview
Adds KaiLean’s “Parameter Golf” training stack and associated tooling/documentation to the repo, including a CUDA/PyTorch training entrypoint with advanced model/quantization techniques plus local sweep/ablation helpers.
Changes:
- Added two CUDA/PyTorch training scripts (train_gpt_kl.py, train_gpt_kl_v2.py) implementing multiple modeling + quantization techniques (EMA/SWA, int6 packing, bigram hash bias, etc.).
- Added local experimentation utilities (sweep_agent.py, run_ablation.sh) and supporting MLX innovation notes (kl_innovations.py).
- Added/updated submission and project docs (SUBMISSION.md, CLAUDE.md) and expanded .gitignore.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| train_gpt_kl_v2.py | New CUDA/PyTorch training script variant including EMA + int6 quantization/roundtrip eval. |
| train_gpt_kl.py | New primary CUDA/PyTorch training script with additional features (depth recurrence, parallel residuals, TTT eval, multiple artifact formats). |
| sweep_agent.py | Adds a greedy hyperparameter sweep runner intended for the MLX training script. |
| run_ablation.sh | Adds an overnight ablation runner and summary generator. |
| kl_innovations.py | Adds an MLX-focused "innovation stack" helper module. |
| SUBMISSION.md | Documents the intended model, training, and quantization approach for the submission. |
| CLAUDE.md | Project usage notes/commands for local runs. |
| .gitignore | Ignores logs and various infra/experimental artifacts. |
```python
# Collect Q and V weight matrices for LoRA
qv_pairs: list[tuple[str, str, int, int]] = []  # (q_name, v_name, out_f, in_f)
for block in base_model.blocks:
    q_w = block.attn.c_q.weight  # (dim, dim)
    v_w = block.attn.c_v.weight  # (kv_dim, dim)
```
qv_pairs is annotated/commented as containing names and dimensions, but it actually stores weight tensors (block.attn.c_q.weight, block.attn.c_v.weight). The current type hint list[tuple[str, str, int, int]] is incorrect and q_w/v_w locals are unused. Update the annotation/comment to match the real contents (e.g., a list of (Tensor, Tensor)), and drop the unused locals to avoid confusion.
Suggested change:

```diff
- # Collect Q and V weight matrices for LoRA
- qv_pairs: list[tuple[str, str, int, int]] = []  # (q_name, v_name, out_f, in_f)
- for block in base_model.blocks:
-     q_w = block.attn.c_q.weight  # (dim, dim)
-     v_w = block.attn.c_v.weight  # (kv_dim, dim)
+ # Collect Q and V weight tensor pairs for LoRA
+ qv_pairs: list[tuple[Tensor, Tensor]] = []
+ for block in base_model.blocks:
```
```python
ent_weight = args.ttt_entropy_weight
max_ent = math.log(args.vocab_size)  # max possible entropy
```
ent_weight and max_ent are computed but never used in the current TTT implementation (loss is standard NLL). Consider removing these unused variables (or reintroduce entropy-weighting) to keep the evaluation code clear.
Suggested change (delete the unused lines):

```diff
- ent_weight = args.ttt_entropy_weight
- max_ent = math.log(args.vocab_size)  # max possible entropy
```
```shell
MAX_WALLCLOCK_SECONDS=0 \
WARMDOWN_ITERS=100 \
"$@" \
python3 train_gpt_mlx_kl.py
```
run_one executes python3 train_gpt_mlx_kl.py, but train_gpt_mlx_kl.py is not present in the repository. This makes the ablation script fail on a clean checkout; update it to call an existing entrypoint (e.g., train_gpt_mlx.py) or add/commit the missing file.
Suggested change:

```diff
- python3 train_gpt_mlx_kl.py
+ python3 train_gpt_mlx.py
```
```markdown
Baseline: train_gpt_mlx.py (stock), my version: train_gpt_mlx_kl.py

## Commands
- Smoke test: RUN_ID=test ITERATIONS=100 TRAIN_BATCH_TOKENS=8192 VAL_LOSS_EVERY=0 VAL_BATCH_SIZE=8192 WARMUP_STEPS=3 python3 train_gpt_mlx_kl.py
```
This doc references train_gpt_mlx_kl.py as the main training script and in the smoke-test command, but that file is not present in the repository. Update the documentation to point at the actual entrypoint(s) that exist (e.g., train_gpt_mlx.py / train_gpt_kl.py) or add/commit train_gpt_mlx_kl.py so these commands work.
Suggested change:

```diff
- Baseline: train_gpt_mlx.py (stock), my version: train_gpt_mlx_kl.py
+ Baseline: train_gpt_mlx.py (stock), my version: train_gpt_kl.py
  ## Commands
- - Smoke test: RUN_ID=test ITERATIONS=100 TRAIN_BATCH_TOKENS=8192 VAL_LOSS_EVERY=0 VAL_BATCH_SIZE=8192 WARMUP_STEPS=3 python3 train_gpt_mlx_kl.py
+ - Smoke test: RUN_ID=test ITERATIONS=100 TRAIN_BATCH_TOKENS=8192 VAL_LOSS_EVERY=0 VAL_BATCH_SIZE=8192 WARMUP_STEPS=3 python3 train_gpt_kl.py
```
```markdown
## Quantization
- **Mixed int8/int6 + brotli**: int8 for embedding matrices (tok_emb/lm_head), int6 packed per-row for all other weights
- **GPTQ-lite**: per-row clip percentile search for optimal quantization
- **Competition format**: int6+brotli roundtrip (primary), also outputs int8+zlib and int6+zstd for comparison
```
The submission doc claims the primary artifact is mixed int8/int6 + brotli (int8 embeddings + int6 for the rest), but train_gpt_kl.py currently writes (1) final_model.int8.ptz (int8 for all weights) and (2) final_model.int6.brotli.ptz (int6 for all weights). Either update the doc to match the produced artifacts, or update serialization to actually use the mixed quantization path (e.g., quantize_state_dict_int6(..., embed_bits=8) with brotli).
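The mixed path this comment asks for can be sketched as below. This is an illustrative standalone version, assuming per-row symmetric quantization; the function name `quantize_state_dict_mixed` and the key-matching heuristic are hypothetical (the script's real entry point is `quantize_state_dict_int6` with an `embed_bits`-style flag, per the comment):

```python
import numpy as np

def quantize_state_dict_mixed(state_dict, embed_keys=("tok_emb", "lm_head")):
    """int8 for embedding-like tensors, per-row symmetric int6 for the rest.

    Returns {name: (quantized int array, per-row scales, bit width)}.
    """
    out = {}
    for name, w in state_dict.items():
        bits = 8 if any(k in name for k in embed_keys) else 6
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(w).max(axis=-1, keepdims=True) / qmax
        scale[scale == 0] = 1.0  # guard all-zero rows
        q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
        out[name] = (q, scale, bits)
    return out
```

A real serialization path would additionally pack the int6 values 4-per-3-bytes and compress with brotli, as the submission doc describes.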
```python
def make_lora_forward(w, a, b, orig_fwd):
    def lora_forward(x):
        return F.linear(x, w.to(x.dtype), None) + (x @ b.T.to(x.dtype)) @ a.T.to(x.dtype)
```
make_lora_forward accepts orig_fwd but never uses it. Either remove the unused parameter/argument, or use it to preserve any special behavior from CastedLinear.forward (e.g., QAT branch, bias handling) and add the LoRA delta on top.
Suggested change:

```diff
-         return F.linear(x, w.to(x.dtype), None) + (x @ b.T.to(x.dtype)) @ a.T.to(x.dtype)
+         base = orig_fwd(x)
+         delta = (x @ b.T.to(x.dtype)) @ a.T.to(x.dtype)
+         return base + delta
```
```python
result = subprocess.run(
    ["python3", "train_gpt_mlx_kl.py"],
    env=env, capture_output=True, text=True,
```
This script invokes train_gpt_mlx_kl.py, but that file is not present in the repository (only train_gpt_mlx.py, train_gpt_kl.py, and train_gpt_kl_v2.py exist). As-is, the sweep will fail on a clean checkout. Update the command to point at an existing training entrypoint or add/commit the missing train_gpt_mlx_kl.py implementation.
```shell
cd ~/parameter-golf
source ~/pg_env/bin/activate
```
This script hard-codes local machine paths (cd ~/parameter-golf and source ~/pg_env/bin/activate). That will break for anyone running from a different checkout location or venv path. Consider making these relative to the script location (e.g., cd "$(dirname "$0")") and/or guarding activation behind an env var or a check so the script is portable.
Suggested change:

```diff
- cd ~/parameter-golf
- source ~/pg_env/bin/activate
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ cd "$SCRIPT_DIR"
+ if [ -n "${VENV_PATH:-}" ] && [ -f "$VENV_PATH/bin/activate" ]; then
+   # Allow callers to specify a virtualenv explicitly.
+   # shellcheck disable=SC1090
+   source "$VENV_PATH/bin/activate"
+ elif [ -f "$SCRIPT_DIR/.venv/bin/activate" ]; then
+   # Prefer a repo-local virtualenv when present.
+   # shellcheck disable=SC1091
+   source "$SCRIPT_DIR/.venv/bin/activate"
+ fi
```
```python
import zstandard
from pathlib import Path
```
This file imports zstandard, but requirements.txt in this repo currently doesn’t include it. On a clean install, the script will fail at import time. Either add zstandard to dependencies or make the int6+zstd serialization path optional.
Suggested change:

```diff
- import zstandard
  from pathlib import Path
+
+ try:
+     import zstandard
+ except ModuleNotFoundError:
+     class _MissingZstandard:
+         def __getattr__(self, name):
+             raise ModuleNotFoundError(
+                 "Optional dependency 'zstandard' is required for the int6+zstd "
+                 "serialization path. Install it to enable zstd-compressed serialization."
+             )
+     zstandard = _MissingZstandard()
```
```python
# Toggle QAT when lr_mul drops below late_qat_threshold (warmdown-triggered)
if not use_qat_active and scale < args.late_qat_threshold:
    use_qat_active = True
```
scale is referenced before it is assigned. In this loop, if not use_qat_active and scale < args.late_qat_threshold: executes before scale = lr_mul(step, elapsed_ms), which will raise UnboundLocalError on the first iteration. Compute scale before this check (and update the log line accordingly), or move the QAT-toggle block to after the scale = ... assignment (similar to train_gpt_kl.py).
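The correct ordering can be sketched as below. The helper name `training_step_order` is hypothetical; `lr_mul` stands in for the script's schedule function, and the point is only that `scale` is bound before the QAT check reads it:

```python
def training_step_order(step, elapsed_ms, lr_mul, late_qat_threshold, use_qat_active):
    """Compute the LR scale first, then use it for the warmdown-triggered
    QAT toggle, so `scale` is never read before assignment."""
    scale = lr_mul(step, elapsed_ms)               # assign before the check
    if not use_qat_active and scale < late_qat_threshold:
        use_qat_active = True                      # late QAT kicks in during warmdown
    return scale, use_qat_active
```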