
Kailean/submission v2#1599

Open
kailean wants to merge 9 commits into openai:main from kailean:kailean/submission-v2

Conversation


@kailean kailean commented Apr 13, 2026

No description provided.

kailean and others added 9 commits March 27, 2026 21:31
…eline

- sweep_agent.py: fix script target (train_gpt_mlx.py → train_gpt_mlx_kl.py),
  fix parser key (int8_zlib → int6_zstd), update DEFAULT_CONFIG to budget-safe
  11L/2xMLP/BIGRAM_HASH_SIZE=6144, replace WEIGHT_DECAY with USE_SMEARGATE in
  SEARCH_SPACE, add EVAL_MODE=standard + VAL_LOSS_EVERY=0 to SMOKE_ENV for speed

- run_ablation.sh: fix three grep/regex keys from int8_zlib to int6_zstd in both
  shell section and embedded Python summary parser

- train_gpt_mlx_kl.py: add USE_SMEARGATE alias for SMEAR_ENABLED env var so smoke
  tests can disable SmearGate; add USE_SWA/SWA_DECAY Hyperparameters fields and
  implement SWA accumulation loop starting at 60% of iterations using EMABuffer
  with swa_decay; SWA takes priority over EMA for eval and final save when enabled;
  update innovations log line to include use_swa/swa_decay

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
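The SWA accumulation this commit describes (start at 60% of iterations, update an EMABuffer with swa_decay, prefer SWA over EMA at eval/save) can be sketched roughly as follows. This is a minimal illustration, assuming EMABuffer behaves like a plain exponential moving average; the function name and shapes are not the script's actual API:

```python
import numpy as np

def swa_accumulate(param_stream, total_iters, swa_decay=0.4, start_frac=0.6):
    # Sketch: begin averaging at start_frac of training, then update an
    # EMA-style buffer with swa_decay on every subsequent step.
    # (Hypothetical helper; the script inlines this in its training loop.)
    start = int(total_iters * start_frac)
    buf = None
    for step, p in enumerate(param_stream):
        if step < start:
            continue
        p = np.asarray(p, dtype=np.float64)
        buf = p.copy() if buf is None else swa_decay * buf + (1.0 - swa_decay) * p
    return buf
```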
Port all Phase 1 innovations from train_gpt_mlx_kl.py into train_gpt_kl.py:

Hyperparameters:
- warmdown_iters: 1200 → 3500 [OFFICIAL-PROVEN -0.0002 BPB]
- bigram_hash_size: 10240 → 16384
- Add USE_SMEARGATE env alias for smear_enabled
- Add rope_dims=16 (partial RoPE)
- Add ln_scale_enabled=True (1/sqrt(layer+1) depth scale)
- Add use_swa=False + swa_decay=0.4
- Add muon_weight_decay=0.04 + adam_weight_decay=0.04
- Add late_qat_threshold=0.15 (replaces qat_start_frac logic)
- Add use_gptq_lite=True

Architecture:
- Block: add ln_scale_factor (1/sqrt(layer_idx+1)) applied before attn+MLP norm
- CausalSelfAttention: add rope_dims param for partial RoPE (rotates first N dims only)
- GPT constructor: pass rope_dims + ln_scale_enabled through to Block/Attn
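The partial-RoPE item above rotates only the first rope_dims channels and passes the rest through. A minimal numpy sketch under that assumption (per-position, single head; the script applies this inside CausalSelfAttention):

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    # Sketch of partial RoPE: rotate the first rope_dims channels of each
    # position's vector, leave the remaining channels unchanged.
    # (Illustrative names/shapes, not the script's actual implementation.)
    T, _ = x.shape
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # (half,)
    ang = np.arange(T)[:, None] * inv_freq[None, :]     # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rope_dims:]], axis=-1)
```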

Training loop:
- QAT now triggered by lr_mul < late_qat_threshold (warmdown-relative) not fixed fraction
- SWA buffer init at 60% of training when USE_SWA=1; update each step; preferred over EMA at eval/save
- Adam → AdamW with weight_decay for all scalar/embed/head optimizers
- Muon: add weight_decay support to step()
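The warmdown-relative QAT trigger above amounts to a latch on the learning-rate multiplier. A hypothetical sketch of that condition (the script inlines this check rather than exposing a helper):

```python
def qat_should_activate(lr_mul, qat_active, late_qat_threshold=0.15):
    # QAT turns on once the LR multiplier decays below the threshold (i.e.
    # late in warmdown) and then stays on; a threshold of 0.0 disables it.
    return qat_active or (late_qat_threshold > 0.0 and lr_mul < late_qat_threshold)
```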

Quantizer:
- Add quantize_float_tensor_gptq_lite(): 5-percentile MSE sweep per row [-0.0006 BPB OFFICIAL-PROVEN]
- quantize_state_dict_int6() accepts use_gptq_lite flag

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
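The GPTQ-lite routine above (5-percentile MSE sweep per row) can be sketched as follows. This is an assumption-laden illustration: the percentile candidates and symmetric-scale layout are guesses, not the script's exact quantizer:

```python
import numpy as np

def quantize_row_gptq_lite(row, bits=6,
                           percentiles=(99.0, 99.5, 99.9, 99.99, 100.0)):
    # For each candidate clip percentile, quantize the row symmetrically and
    # keep the scale with the lowest round-trip MSE.
    qmax = 2 ** (bits - 1) - 1            # e.g. +/-31 for int6
    best, best_err = None, np.inf
    for pct in percentiles:
        clip = np.percentile(np.abs(row), pct)
        if clip == 0.0:
            continue
        scale = clip / qmax
        q = np.clip(np.round(row / scale), -qmax, qmax)
        err = float(np.mean((q * scale - row) ** 2))
        if err < best_err:
            best, best_err = (q.astype(np.int8), float(scale)), err
    return best
```

Clipping at a sub-100 percentile trades a little outlier error for finer resolution on the bulk of the row, which is where the claimed BPB gain would come from.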
Update train_gpt_kl.py defaults to frontier settings:
- TRAIN_SEQ_LEN: 1024 → 2048
- TRAIN_BATCH_TOKENS: 524288 → 786432
- MUON_MOMENTUM: 0.95 → 0.99
- MUON_MOMENTUM_WARMUP_STEPS: 500 → 1500
- MATRIX_LR: 0.04 → 0.02
- SCALAR_LR: 0.04 → 0.02
- TIED_EMBED_LR: 0.05 → 0.03
- GRAD_CLIP_NORM: 0.0 → 0.3

All values env-overridable. Script: 67KB, leaving 14.72MB for model.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
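The "all values env-overridable" pattern above can be sketched like this; the helper name is hypothetical (the script reads os.environ directly for each default):

```python
import os

def env_override(name, default, cast=float):
    # Take NAME from the environment when set; otherwise keep the in-file default.
    raw = os.environ.get(name)
    return default if raw is None else cast(raw)

TRAIN_SEQ_LEN = env_override("TRAIN_SEQ_LEN", 2048, int)
MUON_MOMENTUM = env_override("MUON_MOMENTUM", 0.99, float)
```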
…int8/int6+brotli

Architecture: 640d, 11L, 10Q/5KV heads, MLP_MULT=4, depth recurrence L3-5×2
Quantization: Mixed int8 (embeddings) + int6 (weights) + brotli
No QAT (LATE_QAT_THRESHOLD=0.0), GPTQ-lite post-hoc
Score-first TTT (rollback if worse), SmearGate, partial RoPE, XSA
val_bpb: ~1.16 (WIP, training in progress)
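The "int6 packed per-row" format mentioned above fits 4 six-bit codes into 3 bytes. A hypothetical round-trip sketch (offset-binary encoding and big-endian byte order are assumptions; the script's actual layout may differ):

```python
import numpy as np

def pack_int6(vals):
    # Pack int6 codes in [-32, 31] as offset-binary (0..63), 4 codes per 3 bytes.
    # Assumes len(vals) is a multiple of 4.
    u = (np.asarray(vals, dtype=np.int64) + 32).astype(np.uint32).reshape(-1, 4)
    word = (u[:, 0] << 18) | (u[:, 1] << 12) | (u[:, 2] << 6) | u[:, 3]
    out = np.empty((word.size, 3), dtype=np.uint8)
    out[:, 0], out[:, 1], out[:, 2] = word >> 16, (word >> 8) & 0xFF, word & 0xFF
    return out.tobytes()

def unpack_int6(data, n):
    # Inverse of pack_int6: rebuild the 24-bit words and split out n codes.
    b = np.frombuffer(data, dtype=np.uint8).astype(np.uint32).reshape(-1, 3)
    word = (b[:, 0] << 16) | (b[:, 1] << 8) | b[:, 2]
    u = np.stack([word >> 18, (word >> 12) & 63, (word >> 6) & 63, word & 63], 1)
    return u.ravel()[:n].astype(np.int64) - 32
```

Packing before compression matters because brotli/zstd then see a dense 6-bit stream rather than sparsely used int8 bytes.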
Copilot AI review requested due to automatic review settings April 13, 2026 16:54
Contributor

Copilot AI left a comment


Pull request overview

Adds KaiLean’s “Parameter Golf” training stack and associated tooling/documentation to the repo, including a CUDA/PyTorch training entrypoint with advanced model/quantization techniques plus local sweep/ablation helpers.

Changes:

  • Added two CUDA/PyTorch training scripts (train_gpt_kl.py, train_gpt_kl_v2.py) implementing multiple modeling + quantization techniques (EMA/SWA, int6 packing, bigram hash bias, etc.).
  • Added local experimentation utilities (sweep_agent.py, run_ablation.sh) and supporting MLX innovation notes (kl_innovations.py).
  • Added/updated submission and project docs (SUBMISSION.md, CLAUDE.md) and expanded .gitignore.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 12 comments.

Show a summary per file

File                 Description
train_gpt_kl_v2.py   New CUDA/PyTorch training script variant including EMA + int6 quantization/roundtrip eval.
train_gpt_kl.py      New primary CUDA/PyTorch training script with additional features (depth recurrence, parallel residuals, TTT eval, multiple artifact formats).
sweep_agent.py       Adds a greedy hyperparameter sweep runner intended for the MLX training script.
run_ablation.sh      Adds an overnight ablation runner and summary generator.
kl_innovations.py    Adds an MLX-focused "innovation stack" helper module.
SUBMISSION.md        Documents the intended model, training, and quantization approach for the submission.
CLAUDE.md            Project usage notes/commands for local runs.
.gitignore           Ignores logs and various infra/experimental artifacts.


Comment on lines +560 to +564
# Collect Q and V weight matrices for LoRA
qv_pairs: list[tuple[str, str, int, int]] = [] # (q_name, v_name, out_f, in_f)
for block in base_model.blocks:
q_w = block.attn.c_q.weight # (dim, dim)
v_w = block.attn.c_v.weight # (kv_dim, dim)

Copilot AI Apr 13, 2026


qv_pairs is annotated/commented as containing names and dimensions, but it actually stores weight tensors (block.attn.c_q.weight, block.attn.c_v.weight). The current type hint list[tuple[str, str, int, int]] is incorrect and q_w/v_w locals are unused. Update the annotation/comment to match the real contents (e.g., a list of (Tensor, Tensor)), and drop the unused locals to avoid confusion.

Suggested change
-# Collect Q and V weight matrices for LoRA
-qv_pairs: list[tuple[str, str, int, int]] = []  # (q_name, v_name, out_f, in_f)
-for block in base_model.blocks:
-    q_w = block.attn.c_q.weight  # (dim, dim)
-    v_w = block.attn.c_v.weight  # (kv_dim, dim)
+# Collect Q and V weight tensor pairs for LoRA
+qv_pairs: list[tuple[Tensor, Tensor]] = []
+for block in base_model.blocks:

Comment on lines +570 to +571
ent_weight = args.ttt_entropy_weight
max_ent = math.log(args.vocab_size) # max possible entropy

Copilot AI Apr 13, 2026


ent_weight and max_ent are computed but never used in the current TTT implementation (loss is standard NLL). Consider removing these unused variables (or reintroduce entropy-weighting) to keep the evaluation code clear.

Suggested change
-ent_weight = args.ttt_entropy_weight
-max_ent = math.log(args.vocab_size)  # max possible entropy

MAX_WALLCLOCK_SECONDS=0 \
WARMDOWN_ITERS=100 \
"$@" \
python3 train_gpt_mlx_kl.py

Copilot AI Apr 13, 2026


run_one executes python3 train_gpt_mlx_kl.py, but train_gpt_mlx_kl.py is not present in the repository. This makes the ablation script fail on a clean checkout; update it to call an existing entrypoint (e.g., train_gpt_mlx.py) or add/commit the missing file.

Suggested change
-python3 train_gpt_mlx_kl.py
+python3 train_gpt_mlx.py

Comment on lines +5 to +8
Baseline: train_gpt_mlx.py (stock), my version: train_gpt_mlx_kl.py

## Commands
- Smoke test: RUN_ID=test ITERATIONS=100 TRAIN_BATCH_TOKENS=8192 VAL_LOSS_EVERY=0 VAL_BATCH_SIZE=8192 WARMUP_STEPS=3 python3 train_gpt_mlx_kl.py

Copilot AI Apr 13, 2026


This doc references train_gpt_mlx_kl.py as the main training script and in the smoke-test command, but that file is not present in the repository. Update the documentation to point at the actual entrypoint(s) that exist (e.g., train_gpt_mlx.py / train_gpt_kl.py) or add/commit train_gpt_mlx_kl.py so these commands work.

Suggested change
-Baseline: train_gpt_mlx.py (stock), my version: train_gpt_mlx_kl.py
+Baseline: train_gpt_mlx.py (stock), my version: train_gpt_kl.py
 ## Commands
-- Smoke test: RUN_ID=test ITERATIONS=100 TRAIN_BATCH_TOKENS=8192 VAL_LOSS_EVERY=0 VAL_BATCH_SIZE=8192 WARMUP_STEPS=3 python3 train_gpt_mlx_kl.py
+- Smoke test: RUN_ID=test ITERATIONS=100 TRAIN_BATCH_TOKENS=8192 VAL_LOSS_EVERY=0 VAL_BATCH_SIZE=8192 WARMUP_STEPS=3 python3 train_gpt_kl.py

Comment on lines +20 to +23
## Quantization
- **Mixed int8/int6 + brotli**: int8 for embedding matrices (tok_emb/lm_head), int6 packed per-row for all other weights
- **GPTQ-lite**: Per-row clip percentile search for optimal quantization
- **Competition format**: int6+brotli roundtrip (primary), also outputs int8+zlib and int6+zstd for comparison

Copilot AI Apr 13, 2026


The submission doc claims the primary artifact is mixed int8/int6 + brotli (int8 embeddings + int6 for the rest), but train_gpt_kl.py currently writes (1) final_model.int8.ptz (int8 for all weights) and (2) final_model.int6.brotli.ptz (int6 for all weights). Either update the doc to match the produced artifacts, or update serialization to actually use the mixed quantization path (e.g., quantize_state_dict_int6(..., embed_bits=8) with brotli).


def make_lora_forward(w, a, b, orig_fwd):
def lora_forward(x):
return F.linear(x, w.to(x.dtype), None) + (x @ b.T.to(x.dtype)) @ a.T.to(x.dtype)

Copilot AI Apr 13, 2026


make_lora_forward accepts orig_fwd but never uses it. Either remove the unused parameter/argument, or use it to preserve any special behavior from CastedLinear.forward (e.g., QAT branch, bias handling) and add the LoRA delta on top.

Suggested change
-    return F.linear(x, w.to(x.dtype), None) + (x @ b.T.to(x.dtype)) @ a.T.to(x.dtype)
+    base = orig_fwd(x)
+    delta = (x @ b.T.to(x.dtype)) @ a.T.to(x.dtype)
+    return base + delta

Comment on lines +39 to +41
result = subprocess.run(
["python3", "train_gpt_mlx_kl.py"],
env=env, capture_output=True, text=True,

Copilot AI Apr 13, 2026


This script invokes train_gpt_mlx_kl.py, but that file is not present in the repository (only train_gpt_mlx.py, train_gpt_kl.py, and train_gpt_kl_v2.py exist). As-is, the sweep will fail on a clean checkout. Update the command to point at an existing training entrypoint or add/commit the missing train_gpt_mlx_kl.py implementation.

Comment on lines +5 to +7
cd ~/parameter-golf
source ~/pg_env/bin/activate


Copilot AI Apr 13, 2026


This script hard-codes local machine paths (cd ~/parameter-golf and source ~/pg_env/bin/activate). That will break for anyone running from a different checkout location or venv path. Consider making these relative to the script location (e.g., cd "$(dirname "$0")") and/or guarding activation behind an env var or a check so the script is portable.

Suggested change
-cd ~/parameter-golf
-source ~/pg_env/bin/activate
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "$SCRIPT_DIR"
+if [ -n "${VENV_PATH:-}" ] && [ -f "$VENV_PATH/bin/activate" ]; then
+  # Allow callers to specify a virtualenv explicitly.
+  # shellcheck disable=SC1090
+  source "$VENV_PATH/bin/activate"
+elif [ -f "$SCRIPT_DIR/.venv/bin/activate" ]; then
+  # Prefer a repo-local virtualenv when present.
+  # shellcheck disable=SC1091
+  source "$SCRIPT_DIR/.venv/bin/activate"
+fi

Comment on lines +28 to +30
import zstandard
from pathlib import Path


Copilot AI Apr 13, 2026


This file imports zstandard, but requirements.txt in this repo currently doesn’t include it. On a clean install, the script will fail at import time. Either add zstandard to dependencies or make the int6+zstd serialization path optional.

Suggested change
-import zstandard
 from pathlib import Path
+
+try:
+    import zstandard
+except ModuleNotFoundError:
+    class _MissingZstandard:
+        def __getattr__(self, name):
+            raise ModuleNotFoundError(
+                "Optional dependency 'zstandard' is required for the int6+zstd "
+                "serialization path. Install it to enable zstd-compressed serialization."
+            )
+    zstandard = _MissingZstandard()

Comment on lines +1268 to +1270
# Toggle QAT when lr_mul drops below late_qat_threshold (warmdown-triggered)
if not use_qat_active and scale < args.late_qat_threshold:
use_qat_active = True

Copilot AI Apr 13, 2026


scale is referenced before it is assigned. In this loop, if not use_qat_active and scale < args.late_qat_threshold: executes before scale = lr_mul(step, elapsed_ms), which will raise UnboundLocalError on the first iteration. Compute scale before this check (and update the log line accordingly), or move the QAT-toggle block to after the scale = ... assignment (similar to train_gpt_kl.py).
