SKC-600 ternary: split engram (packed-lite + eval) and harden build pipeline#1609
Open
Akhilesh-Gogikar wants to merge 65 commits into openai:main from
Conversation
12L/768d ternary U-Net with iterative Hadamard-gated backward semantic correction, recurrent capsule state, XSA, LeakyReLU², and full eval stack. Author: Aki Gogikar (OneNewAI) Co-Authored-By: Claude Opus 4.6 <[email protected]>
- train_gpt_mlx.py: Full MLX port of Ternary Reasoner architecture
- Ternary STE quantization, U-Net encoder-decoder, feedback adapters
- Recurrent capsule bank, XSA, VRL, BigramHash, Partial RoPE
- LeakyReLU², LN Scale Damping, EMA, GPTQ-lite
- Fixed mx.eval(model.state) → mx.eval(*params) for MLX ≥ 0.22
- MAX_VAL_TOKENS env var to limit validation data for smoke tests
- Tested end-to-end on MLX 0.31.1 arm64 (Apple Silicon)
- run_mlx_reasoner.sh: Launch script with Mac-appropriate defaults
- Auto-detects arm64 Python (/opt/homebrew/bin/python3)
- MAX_VAL_TOKENS=65536 default for quick local smoke tests

Smoke test results (4L/256d, 10 iterations, all features enabled):
- Training: 10 steps in ~1.3s, loss 6.93→6.60
- Validation BPB: 4.10 (initial), 4.12 (after 10 steps)
- Artifact: 1.48MB / 16MB budget

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…indow carry

First-principles innovations added to both CUDA and MLX versions:
- KoopmanDynamics: diagonal + low-rank (D + UV^T) stable linear dynamics in capsule space. Predicts next-pass capsule state instead of naive gated blending. 640 params total (0.008% of budget). Initialized at ρ(D)=0.9 for critical damping stability.
- Capsule consistency loss (λ=0.005): auxiliary MSE loss training the Koopman module to predict capsule evolution accurately. Analogous to Anderson acceleration for iterative solvers.
- Adaptive halting (eval only): capsule convergence norm δ < 0.05 decides when to stop correction passes. Easy tokens get 1 pass, hard tokens get 2-3.
- Cross-window capsule carry: exponential decay (0.8) persistence of capsule state across sliding eval windows. 4KB runtime overhead.
- Koopman diagonal stability clamping after each optimizer step (|D_i| < 0.999) in both CUDA and MLX.
- CUDA capsule_enabled default changed to 1 (was 0).

MLX 200-step ablation: KoopCaps-HRM achieves val_bpb 2.6973 vs plain ternary 2.8615 (+0.164 BPB improvement).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add dedicated Adam optimizer for Koopman diagonal at LR=0.01 (stability-critical)
- Fix adaptive halting to only trigger at correction_pass >= 2 (always run blind + 1 feedback)
- Add model.eval()/model.train() in eval_val to enable halting
- Log Koopman config (rank, diag_init, consistency weight, halting, carry)

3-seed validation (200 steps, sp1024): mean val_bpb=2.6919, std=0.0088

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…am memory

Engram upgrade (from DeepSeek Engram paper):
- Multi-head hashing: 4 hash heads per n-gram order; distinct primes per head reduce the collision rate vs single-hash BigramHash
- Multi-order: bigram + trigram tables (2 orders × 4 heads = 8 tables)
- Context-aware gating: sigmoid gate from RMSNorm(hidden) · RMSNorm(key) suppresses noisy/collision-affected lookups using transformer context
- Internal-layer injection at encoder layer 1 (not just the input layer) gives a richer hidden state for gating decisions

3-seed validation (200 steps, sp1024): mean val_bpb = 2.6532, std = 0.0098
+0.039 BPB over KoopCaps-HRM alone (2.6919)
+0.208 BPB over plain ternary baseline (2.8615)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
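A toy sketch of the multi-head n-gram hashing idea: each head uses a distinct prime multiplier, so a bucket collision in one head is unlikely to repeat in the others. The primes, bucket count, and function name are illustrative, not taken from the repo.

```python
# Illustrative primes, one per hash head (the commit uses 4 heads).
HEAD_PRIMES = [10007, 10009, 10037, 10039]

def ngram_bucket(tokens, head, num_buckets=8192):
    """Hash an n-gram (tuple of token ids) into a table bucket for one head.
    Polynomial rolling hash with a head-specific prime multiplier."""
    h = 0
    for t in tokens:
        h = (h * HEAD_PRIMES[head] + t) % num_buckets
    return h

# bigram + trigram, 4 heads each -> 2 orders x 4 heads = 8 table lookups
bigram_buckets = [ngram_bucket((17, 93), head) for head in range(4)]
trigram_buckets = [ngram_bucket((17, 93, 5), head) for head in range(4)]
```

Because the heads disagree on where a given n-gram lands, a context-aware gate downstream can learn to trust the heads whose lookups are consistent and suppress the collision-affected ones.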
TurboQuant (Zandieh et al. 2025): Hadamard rotation before ternary quantization provably reduces MSE for post-training quantization. However, it HURTS convergence during STE training because the model adapts its weights to the standard ternary grid — rotation disrupts this learned alignment.

3-seed result: TurboQuant ON = 2.9635 val_bpb vs OFF = 2.6532 (+0.31 worse)

Code preserved (TURBO_QUANT=0 default) for potential export-only use.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Split turbo_quant into train/export flags:
- TURBO_QUANT_EXPORT=1 (default ON): Hadamard-rotate weights before ternary quantization at export time. Proven 29.3% MSE reduction.
- TURBO_QUANT_TRAIN=0 (default OFF): Training-time Hadamard hurts STE convergence (+0.31 BPB in 3-seed test).

3-seed validation (200 steps, 4L/256d):
KoopCaps + TurboQuant(export): 2.6502 ± 0.015 BPB
KoopCaps + TurboQuant(train):  2.9635 ± 0.031 BPB
Plain ternary baseline:        2.8615 BPB

Co-Authored-By: Claude Opus 4.6 <[email protected]>
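A self-contained sketch of the export-time path: rotate with a normalized (self-inverse) Hadamard matrix, ternary-quantize the rotated weights, and rotate back at dequantization. The threshold, scale rule, and function names are assumptions for illustration; only the rotate-then-quantize idea comes from the commit.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction, normalized to orthonormal; n a power of two.
    The result is symmetric and orthonormal, hence its own inverse."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def ternary_quantize(w, thr=0.05):
    """Illustrative ternary quantizer: sign above a threshold, one scale."""
    q = np.where(np.abs(w) > thr, np.sign(w), 0.0)
    scale = np.abs(w[q != 0]).mean() if np.any(q != 0) else 1.0
    return q, scale

# Export-time path (TURBO_QUANT_EXPORT=1): rotate, quantize, store (q, scale).
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
H = hadamard(8)
q, s = ternary_quantize(H @ w)
w_hat = H @ (q * s)   # dequantize: H is self-inverse, so H @ H @ w = w
```

Rotating spreads each weight's energy across all coordinates, which is what reduces quantization MSE post-training; during STE training, though, the model has already aligned itself to the unrotated ternary grid, matching the ablation above.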
… to CUDA

- Replace BigramHash with multi-head, multi-order EngramHash (4 heads, bigram+trigram)
- Add TurboQuant Hadamard rotation for export quantization and optional STE training
- Add capsule carry with exponential decay in sliding-window TTT eval
- Add FEEDBACK_EVERY config for feedback adapter interleaving
- Add Koopman diagonal separate optimizer routing (lower LR)
- Update run scripts with new env vars

MLX 3-seed validation (200 steps, 4L/256d): seeds 42/1337/7: val_bpb 2.66 ± 0.005

50-step ablation (4L/256d):
A (plain ternary):     3.0115
B (ternary + cheap):   3.0089
C (full KoopCaps-HRM): 3.0066

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Two fixes that turn capsules from net negative to net positive:

1. Cold-start gate: alpha init -5 (sigmoid≈0.007) instead of 0 (sigmoid=0.5)
   - Capsules start OFF and learn when to activate
   - Prevents early random noise injection into the forward pass
2. Hadamard rotation on Koopman dynamics:
   - Rotate capsules before diagonal+low-rank evolution, rotate back after
   - Equalizes variance across dimensions so the diagonal operator uses its full capacity
   - Zero extra parameters; H is self-inverse

Ablation results (200 steps, 4L/256d, seed=42):
No capsules:                      val_bpb 2.3769
KoopCaps old init:                val_bpb 2.3817 (+0.005, capsules hurt)
KoopCaps + cold-start:            val_bpb 2.3754 (-0.002, marginal)
KoopCaps + cold-start + Hadamard: val_bpb 2.3641 (-0.013, capsules help)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
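The cold-start arithmetic in fix 1 can be checked directly. This is a plain sigmoid evaluation, nothing repo-specific:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Gate logit init -5: the capsule path starts essentially OFF (~0.7% open)
alpha_cold = sigmoid(-5.0)   # ~0.0067
# Old init of 0 mixed 50% of the (still random) capsule signal from step one
alpha_naive = sigmoid(0.0)   # 0.5
```

Starting the gate near zero lets the optimizer open it only once the capsule features become useful, rather than forcing the rest of the network to learn around injected noise.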
…10min)

Validated E_Shatter_Expectations_Final config on MacBook MLX:
- 10-min run: val_bpb=2.1518, 0.87MB artifact (hybrid_mac_benchmark.log)
- 1-hr marathon: val_bpb=1.9653, confirming strong post-10min convergence
- Curriculum seq 256→512→1024 causes a temporary regression at ~1800 steps
- Recovers and breaks through, hitting 1.97 at step 3100, 1.9653 final
- submission.json updated to 1-hr result: val_loss=3.3287, val_bpb=1.9653

Architecture: 8-layer hybrid (4 attention + 4 Koopman-SSM), shared_blocks=2, MoE (3 experts top-k=1), KoopCaps speculator, 16-capsule bank, EngramHash (4h×3o), curriculum, stochastic depth, ternary noise, EMA, TTT, sliding eval, ngram cache. 2.14M ternary params, 0.87MB artifact.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Systematic 15-min ablation across 4 configs revealed:
- Lower LR (0.03 vs 0.04): smoother curriculum transitions, better calibration
- Lower ternary noise (0.02 vs 0.05): cleaner gradients, T=1.00 vs T=1.05
- Fast curriculum (phase1=15%, phase2=40%): more time at seq=1024
- Shorter warmdown (35% vs 50%): more full-LR steps at long sequences
- EMA start at 40% (vs 30%): avoids accumulating curriculum-shock parameters

15-min sweep results: A=2.0972, B=2.1183, C=2.1202, D=2.0664
1hr D config final: val_bpb=1.9589, sliding=1.9601 (prev best: 1.9653/1.9911)
Sliding gap closed from 0.026 to 0.001 — much better calibration.

Next: LR=0.035 to resolve the plateau seen in the seq=1024 phase (steps 2500-3100).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
LR sweep confirms 0.035 as the optimum for fast-curriculum 1hr runs:
LR=0.04 (baseline): 1.9653 final / 1.9911 sliding
LR=0.03 (D config): 1.9589 final / 1.9601 sliding
LR=0.035 (this):    1.9538 final / 1.9564 sliding ← new best

Final config: fast curriculum (15/40 split), MATRIX_LR=0.035, TIED_EMBED_LR=0.052, TERNARY_NOISE_SCALE=0.02, WARMDOWN_FRACTION=0.35, EMA_START_FRACTION=0.40. Sliding gap closed to 0.003 (was 0.026).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Critical fixes:
- Added 6 missing Hyperparameters attrs (koopman_speculator_weight, koopman_speculator_enabled, koopman_speculator_steps, koopman_consistency_weight, adaptive_halt_enabled/threshold) that caused NameError crashes on all CUDA runs
- Made ~60 hardcoded attrs env-var configurable via _e() for runtime tuning
- Auto-detect bfloat16 on compute capability >= 8 (H100/A100), fall back to float32 for older GPUs (GTX 1650 Ti)
- Default data paths changed from Windows C:\ to relative ./data/
- Defaults updated to match the proven TKA-H v5 config from the MLX sweep

New: run_runpod_8xh100.sh - 8×H100 SXM competition launch script
- MODEL_DIM=768 (6× larger than MacBook's 128)
- 786K batch tokens, seq=2048, torch.compile=default
- Full convergence stack: MoE(3,1), curriculum(15/40), stochastic depth, ternary noise, EMA, KoopCaps, EngramHash, TTT, ngram cache
- LR=0.025 (scaled down from 0.035 for the larger model via mu-param)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Export proxy eval during training: selects the best checkpoint by round-trip BPB
- Per-tensor threshold/scale calibration (calibrate_ternary) with grid search
- Fixed threshold range from 0.2-0.5 to 0.02-0.15 (previous range was too aggressive, caused 0 calibrated tensors)
- Export-aligned training phase: uses calibrated (thr, scale_mult) in the forward pass
- strict=False in round-trip reload to handle lm_head weight tying
- uint16 fix in proxy eval: cast to int64 before model forward
- MoE restricted to the upper 1/3 of layers (moe_layer_frac)
- SpectralTernaryAuxLoss wired into the training loss
- Capsule carry across training windows
- Muon dead momentum code removed
- Various run scripts and experiment logs

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- torch.backends.cuda.matmul.allow_tf32 = True (was False)
- torch.backends.cudnn.allow_tf32 = True (was False)
- Remove .float() forced on the model at construction and on all nn.Linear modules
- Wrap training forward passes (warmup + main loop) in torch.autocast(bfloat16)
- Eval was already under autocast; training was not — this closes the gap

This is the highest-ROI throughput change: on H100/A40 with bfloat16 + TF32, matmul throughput is 2-4x higher than float32. Net effect: more steps per 10-minute budget = better final BPB.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
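A hedged sketch of the autocast wrap described above: forward under bfloat16 autocast, backward and optimizer step outside. The tiny model and the CPU fallback are illustrative only; the commit applies this to the repo's warmup and main training loops on CUDA.

```python
import torch
import torch.nn as nn

# Device type chosen dynamically so the sketch also runs on CPU-only builds
# (CPU autocast supports bfloat16 too).
device_type = "cuda" if torch.cuda.is_available() else "cpu"

def train_step(model, x, y, opt):
    # Forward pass under bfloat16 autocast: matmuls run in bf16/TF32
    with torch.autocast(device_type, dtype=torch.bfloat16):
        loss = ((model(x) - y) ** 2).mean()
    # Backward and step run outside the autocast region, as usual
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
    return loss

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = train_step(model, torch.randn(8, 4), torch.randn(8, 1), opt)
```

Note that only the forward pass needs the context manager; autograd records the autocast casts and replays them correctly in backward.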
train_gpt.py:
- DistributedTokenLoader: cast to int64 on CPU before pin_memory (avoids a GPU kernel for the dtype cast); add a dedicated CUDA copy stream for host→device transfers; add double-buffered prefetch so the next batch loads while the current step computes

run_runpod_8xh100.sh:
- TRAIN_BATCH_TOKENS: 16384 → 524288 (32 seqs/GPU on H100 80GB). 16K tokens/step = 1 seq/GPU severely under-utilizes H100 tensor cores with bf16; 512K tokens/step fills the compute pipeline while staying well within the VRAM budget
- Enable export fidelity pipeline: EXPORT_ALIGNED_TRAIN=1, TERNARY_THRESHOLD_SEARCH=1, TERNARY_SCALE_SEARCH=1 with fixed ranges (0.02-0.15), EXPORT_PROXY_EVAL every 2000 steps

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Try flash_attn_interface (v3) first, then flash_attn v2 (which is on PyPI as flash-attn 2.x), then fall back to F.scaled_dot_product_attention. flash_attn v2 uses (B, T, H, D) layout natively — no transpose needed. GQA handled via repeat_interleave before calling v2 func. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
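The fallback chain above can be sketched as a try/except import ladder. The wrapper that adapts SDPA's (B, H, T, D) layout to flash_attn's (B, T, H, D) is an assumption about call-site conventions, not the repo's exact code:

```python
import torch.nn.functional as F

def get_attention_fn():
    """Pick the fastest available attention backend:
    FlashAttention-3 interface -> flash-attn 2.x (PyPI) -> PyTorch SDPA."""
    try:
        from flash_attn_interface import flash_attn_func  # FA3
        return flash_attn_func
    except ImportError:
        pass
    try:
        from flash_attn import flash_attn_func  # FA2 (PyPI: flash-attn)
        return flash_attn_func
    except ImportError:
        pass

    # Fallback: SDPA wants (B, H, T, D); flash_attn uses (B, T, H, D)
    def sdpa(q, k, v, causal=False):
        out = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
            is_causal=causal,
        )
        return out.transpose(1, 2)

    return sdpa
```

Because flash_attn v2 is natively (B, T, H, D), only the SDPA fallback needs transposes, which matches the commit's "no transpose needed" note.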
Architecture:
- H100 script: MODEL_DIM 320 → 2304, NUM_HEADS 8 → 16, NUM_KV_HEADS 4 → 8
- Empirically verified: 8L/D=2304/8MoE → 92M params → 14.91MB artifact (fits 16MB)
- Previous D=320 was only 2.73MB — 5.5x under budget

Calibration overhaul (calibrate_ternary):
- Rank candidates by proxy ΔBPB sensitivity, not tensor size (sensitive tensors have the most BPB impact when miscalibrated)
- Two-pass search: pass 1 greedy forward, pass 2 re-optimizes selected tensors with full calib context (joint rather than independent)
- Probe step uses the grid midpoint to measure sensitivity before committing

Verification printout at serialization:
- Logs ternary_candidates count/params, fp params, estimated raw/compressed MB
- Catches config drift (wrong D, layers, experts) immediately

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Rank 0 runs calibration (serial inference for 100s-600s) while rank 1 continues the training loop and hits dist.all_reduce on gradients, causing NCCL watchdog timeout after 600s. Fix: add dist.barrier() before and after the calibration block so all ranks park at the barrier while rank 0 calibrates. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Sensitivity ranking was running proxy evals on ALL ternary candidates (~50 tensors at D=2304), taking 250+ seconds just for ranking. Fix: sort candidates by size first, only run sensitivity evals on top_n*4 largest tensors. With TOP_N=8 this is 32 evals vs 50+. For H100 submission with TOP_N=5: 20 evals instead of 50+. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Set NCCL timeout via init_process_group(timeout=timedelta(seconds=7200)), controlled by the TORCH_NCCL_TIMEOUT_SEC env var (default 7200s)
- Remove broken/incorrect dist.barrier() code around calibration (the barriers would deadlock: rank 1 hits one every iteration, rank 0 only once)
- Fix EMA handling during calibration: properly restore the EMA shadow
- Calibration now runs safely under the 2h NCCL timeout

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Root cause: calibrate_ternary() runs inside the training loop on rank 0 only. Rank 1 continues iterating and issuing gradient all-reduces. Since rank 0 is stuck in calibration, the NCCL collective sequence numbers diverge, causing NCCL watchdog timeout (~600s default). Fix: remove calibration from the training loop entirely. Add it to the post-training section after both ranks have exited the loop and hit the sync barrier. Rank 0 calibrates; rank 1 waits at a second barrier before proceeding to EMA + serialization. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
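The fixed control flow can be sketched as below: both ranks leave the training loop, meet at a barrier, rank 0 calibrates, and everyone syncs again before serialization. `calibrate` stands in for the repo's calibrate_ternary; the guards let the sketch also run single-process:

```python
import torch.distributed as dist

def post_training_calibration(rank, model, calibrate):
    """Run calibration strictly AFTER the training loop, so the NCCL
    collective sequence stays identical across ranks."""
    if dist.is_available() and dist.is_initialized():
        dist.barrier()          # all ranks exit the training loop together
    if rank == 0:
        calibrate(model)        # serial proxy evals, rank 0 only
    if dist.is_available() and dist.is_initialized():
        dist.barrier()          # other ranks wait before EMA + serialization

calls = []
post_training_calibration(0, "model", lambda m: calls.append(m))
post_training_calibration(1, "model", lambda m: calls.append(m))
```

The key property is that no rank issues a gradient all-reduce while another rank is inside calibration, which is exactly what desynchronized the collective sequence numbers before.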
D=2304 on 8xH100 yields only ~1200 steps in 599s — insufficient updates for quality. D=1536 gives ~2x more optimizer steps at the same wall-clock.

Changes per analysis:
- MODEL_DIM: 2304 → 1536 (head_dim=64, KV=6=num_heads/4)
- MLP_MULT: 4 → 3 (better compute/quality tradeoff)
- MoE: 8 experts top-4 → 4 experts top-1 (lighter)
- TRAIN_BATCH_TOKENS: 524288 → 262144 (more frequent updates)
- BIGRAM_HASH_DIM: 128 → 48 (scaled for D=1536)
- ENGRAM_NUM_ORDERS: 3 → 2 (collision-heavy with V=1024 small vocab)
- BIGRAM_HASH_BUCKETS: 16384 → 8192 (sane floor for V=1024)
- SKC: 8×64 → 24×64 (capsule_num ≈ model_dim/64)
- TORCH_NCCL_TIMEOUT_SEC=7200 for post-training calibration

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- EMBED_DIM=256: FP embedding in the 192-256 band for V=1024 (small vocab)
- PARTIAL_ROPE_DIMS=32: sufficient for seq_len=2048

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Previous implementation ran unbounded proxy evals: 1 baseline + N sensitivity probes + top_n × grid_size × 2 passes = 200+ evals at 5s each = 20+ minutes. This held the pod hostage after training.

New design:
- CALIB_MAX_EVALS=32 hard eval limit (default)
- CALIB_MAX_SECONDS=90 hard time limit (default)
- CALIB_PROXY_MAX_TOK=4096 proxy token budget (was 32768)
- CALIB_PREFILTER_MULT=2 / CALIB_MAX_CANDIDATES=12 size prefilter
- CALIB_SECOND_PASS=0 disabled by default
- Budget check wraps every _proxy_roundtrip_bpb call
- Falls back to size ordering if the budget is hit during sensitivity ranking
- Prints a budget summary at the end

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
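A minimal sketch of the budget guard pattern: every would-be proxy eval asks a checker first, and the checker stops calibration once either the eval count or the wall-clock limit is hit. Class and method names are illustrative:

```python
import time

class CalibBudget:
    """Hard eval-count and time limits for calibration proxy evals,
    mirroring CALIB_MAX_EVALS / CALIB_MAX_SECONDS above."""
    def __init__(self, max_evals=32, max_seconds=90.0):
        self.max_evals = max_evals
        self.max_seconds = max_seconds
        self.evals = 0
        self.t0 = time.monotonic()

    def allow(self) -> bool:
        # Check both limits BEFORE charging the eval to the budget
        if self.evals >= self.max_evals:
            return False
        if time.monotonic() - self.t0 >= self.max_seconds:
            return False
        self.evals += 1
        return True

b = CalibBudget(max_evals=2, max_seconds=60)
results = [b.allow() for _ in range(3)]   # third call is refused
```

Wrapping every `_proxy_roundtrip_bpb` call in such a check bounds worst-case post-training time to `min(max_evals * eval_cost, max_seconds)` regardless of how many candidate tensors exist.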
The scheduling knob (MoE gating, elapsed_fraction-based activation) was effectively always 1.0 due to:
1. _run_block unique-block mode: missing elapsed_fraction in the block call
2. _run_block shared-block mode: SKC and MLP calls missing elapsed_fraction
3. _decoder_pass: called without elapsed_fraction, silently defaulted to 1.0
4. _decoder_pass call site in _compute_hidden: missing elapsed_fraction

Also reduce the MOE_ROUTER_AUX_LOSS_COEF default: 0.01 → 0.001 (reduces turbulence in the displayed loss from aux-term contamination).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…redundant .to(device)

1. elapsed_frac: was using time.perf_counter()-t0 (resets after validation); now uses cumulative training_time_ms + elapsed since the last checkpoint — matches the LR scheduler
2. DDP find_unused_parameters: now True when moe_enabled; sparse top-k routing leaves non-selected expert params unused every step, causing DDP hangs without this
3. spec_aux.item(): removed — caused a host-device sync every training step; the tensor is added directly
4. torch.compile: now passes mode=args.compile_mode; was ignoring the tuning knob
5. Redundant .to(device): the loader already moves batches to GPU; removed the double-copy in the training loop and warmup loop

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
… batches

The chunked scan reshapes (B, T, S) -> (B, T//W, W, S), which requires T to be divisible by W=32. Val eval and proxy calibration can produce batches with T not divisible by 32 (e.g. T=511 = 15*32+31 for the final partial batch).

Fix: pad T up to the next multiple of 32, run the scan, strip the padding from the output.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
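The pad-run-strip fix can be sketched as follows; the helper name is illustrative, and the zero-padding convention is an assumption (padded positions must be masked or stripped before they affect the loss):

```python
import torch
import torch.nn.functional as F

def pad_time_to_multiple(x: torch.Tensor, w: int = 32):
    """Pad the T dimension of a (B, T, S) tensor up to a multiple of w.
    Returns the padded tensor and the original T for stripping later."""
    B, T, S = x.shape
    pad = (-T) % w                    # 0 when T is already a multiple of w
    if pad:
        # F.pad pads from the last dim backward: (S_left, S_right, T_left, T_right)
        x = F.pad(x, (0, 0, 0, pad))
    return x, T

x = torch.randn(2, 511, 8)            # the failing case: 511 = 15*32 + 31
xp, T = pad_time_to_multiple(x)
# ... run the chunked scan on xp (T now divisible by 32) ...
out = xp[:, :T]                       # strip the padding from the result
```

The `(-T) % w` idiom yields exactly the shortfall to the next multiple and is zero when no padding is needed, so the fast path is untouched.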
inference_mode creates tensors (e.g. RoPE cos/sin cache) that cannot be saved for backward. When the model returns to training mode after proxy eval, the next forward pass crashes with 'Inference tensors cannot be saved for backward'. Switch to no_grad which doesn't have this restriction. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
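A minimal reproduction of this failure mode, and the fix. The RoPE-cache framing is from the commit; the tensors here are toy stand-ins:

```python
import torch

# A tensor created under inference_mode (e.g. a lazily-built RoPE cos/sin
# cache) cannot later be saved for backward.
with torch.inference_mode():
    cache_bad = torch.ones(4)

x = torch.ones(4, requires_grad=True)
try:
    (x * cache_bad).sum().backward()
    crashed = False
except RuntimeError:
    # "Inference tensors cannot be saved for backward. ..."
    crashed = True

# The fix: build such caches under no_grad instead. no_grad tensors are
# ordinary tensors and participate in autograd graphs without complaint.
with torch.no_grad():
    cache_ok = torch.ones(4)
(x * cache_ok).sum().backward()   # works
```

`inference_mode` trades this restriction for lower overhead; it is the right tool only when no output will ever touch autograd, which is not true for caches reused in training mode.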
1. Silent label corruption: replaced the safe_targets clamp + manual gather with F.cross_entropy(), which raises on out-of-range targets instead of silently backpropping wrong-label gradients.
2. Silent token corruption in loader: removed the hardcoded clamp(0, 1023) — bad token IDs should fail fast, not silently corrupt inputs with wrong-vocab data.
3. Dead duplicate curriculum config: removed the first (unused) curriculum_enabled block (lines 175-179). The training loop uses curr_enabled/curr_p* exclusively; the earlier block had different defaults and was dead code.
4. COMPILER_WARMUP_STEPS now wired through: added a compiler_warmup_steps param; the warmup loop uses it when nonzero and falls back to warmup_steps. The shell contract now matches the Python behavior.

Bonus: TernaryLinear.forward now asserts weight divisibility by group_size at runtime, surfacing architecture misconfigs as hard errors instead of silent reshapes.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
1. DDP distributed divergence (showstopper): _EXPORT_CALIB is a Python global read by TernaryLinear.forward on every rank. Mid-training calibration only ran on rank 0, leaving all other ranks with empty thresholds. After the calibration trigger step, rank 0 was training with a calibrated quantizer while ranks 1-7 were training with none — making the all-reduced gradients mathematically inconsistent. Fix: after rank 0 runs calibrate_ternary, broadcast the result to all ranks via dist.broadcast_object_list before resuming training. Every rank sets _EXPORT_CALIB from the broadcast result and logs confirmation, then all ranks sync at a final barrier.

2. Launcher EXPORT_PROXY_EVAL double-assignment: run_runpod_8xh100.sh set EXPORT_PROXY_EVAL=0 at line 192 with the comment "purely post-training", then silently overrode it to 1 at line 217. The first assignment and its misleading comment are removed; the single canonical assignment lives in the ternary calibration block with an accurate description.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…gram blindness
1. SKCLayer amnesia (critical): _run_block dropped prev_capsules so Koopman
dynamics always received None, resetting to zero on every forward pass and
completely nullifying cross-window temporal recurrence. Fixed by:
- Adding prev_capsules param to _run_block; routing it to SKCLayer in both
unique-block mode (isinstance check) and shared-block mode (skc branch)
- Adding prev_capsules param to _decoder_pass and threading carry_capsules
through from _compute_hidden to every _run_block call in both encoder
loop and decoder pass
SKCLayer already accepted prev_capsules in its signature — it just never
received a non-None value until now.
2. SKCLayer validation memory leak: spectral_aux_loss was set during training
but never cleared on eval transitions, permanently holding a reference to
the last training micro-batch's full computational graph for the entire
validation loop. Fixed by explicitly setting it to None in the else branch.
3. Proxy eval GPU memory spike: _proxy_roundtrip_bpb cloned the entire model
state dict on GPU before loading the ternary round-trip state, doubling
resident VRAM at the worst possible moment. Fixed by offloading backup to
CPU (.cpu().clone()); load_state_dict restores to GPU automatically.
4. N-gram cache boundary blindness: cache.update(chunk[1:]) never saw the
context spanning consecutive chunk boundaries, so every seam token was
predicted from an empty n-gram context, depressing augmented BPB at every
window edge (every seq_len tokens). Fixed by maintaining _ngram_boundary_ctx
(last max_order-1 tokens from previous chunk) and prepending it before each
update call so the cache learns cross-boundary transitions.
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
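The boundary-context fix in item 4 can be sketched in a few lines: keep the last max_order-1 tokens of each chunk and prepend them to the next update, so seam-spanning n-grams become visible to the cache. Function names here are illustrative:

```python
def update_with_boundary(cache_update, chunks, max_order=3):
    """Feed chunks to an n-gram cache while carrying the last
    max_order-1 tokens across chunk boundaries (_ngram_boundary_ctx)."""
    ctx = []                                   # boundary context, initially empty
    for chunk in chunks:
        cache_update(ctx + list(chunk))        # cache sees cross-boundary n-grams
        ctx = list(chunk)[-(max_order - 1):]   # keep the tail for the next chunk

seen = []
update_with_boundary(seen.append, [[1, 2, 3], [4, 5, 6]])
# second update receives [2, 3, 4, 5, 6]: the (2,3,4) and (3,4,5) trigrams
# that span the seam are now learned instead of starting from empty context
```

Without the carried tail, every token right after a window edge is predicted from an empty n-gram context, which is exactly the periodic BPB depression the commit describes.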
…eedbackAdapter/Muon

Bug 5 — Newton-Schulz NaN on near-zero gradients: ns_orth normalised by X.norm() + eps, so for sparse MoE gradients where norm ≈ 0, X explodes and NS diverges to NaN within a few iterations. Fix: check norm < eps before dividing; return the zero tensor unchanged — no update is correct when there is no gradient signal (unused expert weights).

Bug 8 — torch.compile recompilation on curriculum shape changes: every active_seq_len change calls torch._dynamo.reset() and triggers a full recompile, burning wall-clock budget that should go to weight updates. Fix: add dynamic=True to torch.compile so varying sequence lengths and batch sizes are handled without recompilation.

Bug 10 — lzma overhead in proxy eval: lzma.compress at even preset=1 takes seconds per call on CPU; when export_proxy_every is ~300-2000 steps, this directly eats training time. Proxy eval only needs quantisation round-trip fidelity, not submission-format compression. Fix: drop lzma for proxy, use raw torch.save/load (the q_sd → deq_sd exercise is preserved, only the size estimate is skipped).

Bug 2 — FeedbackAdapter: memory_matrix is correctly initialised to zero per forward call for full-sequence non-autoregressive processing. Clarified with a comment explaining why cross-call persistence is impractical (batch-variable (B,D,D) buffer) and why zero-init is the correct design here.

Bug 1 — Muon all_reduce: clarified in a comment that the 'distributed' guard is not data-dependent, so all ranks take the same branch and there is no deadlock risk. MoE sparse-routing zero-grad slots are handled correctly by SUM reduction (zero contributes nothing).

Already resolved: Bug 3 (SKC carry threading, fd0bdaa), Bug 4 (pad=T_pad-T, bec80f2), Bug 6 (LR timer excludes validation time by design), Bug 7 (LUT range check, c4e8dad), Bug 9 (p.ndim < 2 routes 1D params to AdamW).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
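A sketch of the Bug 5 guard in a toy Newton-Schulz orthogonalizer. The iteration constants and step count are illustrative (the repo's Muon variant likely differs); the point is the early return before the division:

```python
import torch

def ns_orth(x: torch.Tensor, steps: int = 5, eps: float = 1e-7):
    """Newton-Schulz orthogonalization sketch with the zero-norm guard:
    a near-zero gradient (an MoE expert that saw no tokens this step)
    is returned unchanged instead of being divided by ~0."""
    norm = x.norm()
    if norm < eps:
        return x                       # no gradient signal -> no update
    x = x / (norm + eps)               # safe now: norm is bounded away from 0
    for _ in range(steps):
        a = x @ x.transpose(-2, -1)
        x = 1.5 * x - 0.5 * a @ x      # classic NS step toward an orthogonal factor
    return x

g_unused = torch.zeros(4, 4)           # gradient of an expert with no routed tokens
u = ns_orth(g_unused)                  # returned unchanged, no NaN
```

Returning the zero tensor is semantically correct here: an expert that received no tokens has no gradient signal, so the right orthogonalized update is no update at all.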
… SKC SkipInit, TTT carry, spectral decay

1. Block.forward crash (fatal): mlp was called with elapsed_fraction=elapsed_fraction unconditionally, crashing with TypeError when moe_enabled=False and self.mlp is a plain MLP. Fixed with an isinstance(self.mlp, TernaryMoE) guard, matching _run_block.

2. Muon destroys dormant MoE experts: weight decay was applied to every parameter unconditionally. Experts receiving no routing tokens have p.grad=None, so their update is zero — but the mul_(1-lr*wd) still fires every step, iteratively decaying them to zero permanently. Fixed by tracking has_grad (per-rank and post-all_reduce via update norm) and only applying weight decay when the parameter had an active gradient this step.

3. GPU data race on first batch: _load_raw issues a non-blocking PCIe copy on _copy_stream, but next_batch returned immediately on the first call (no prefetch yet), so the main stream began the forward pass before the host→device transfer completed, training on uninitialized GPU memory. Fixed by adding wait_stream in the else branch, matching the existing wait in the prefetch branch.

4. FeedbackPooler zero-pad sketch corruption: adaptive_avg_pool1d pooled over the entire seq_len including zero-padded positions, contaminating the sketch with sentinel <unk> embeddings and corrupting all token predictions via the FeedbackAdapter broadcast. Fixed by adding a valid_len param to FeedbackPooler.forward (truncate before pooling) and threading it through _compute_hidden / forward_logits / forward_logits_with_carry into eval_val_sliding.

5. Untied embeddings frozen: lm_head.weight was excluded from opt_head at init (tied weight). requires_grad_(True) enabled gradients at untie time but the optimizer was blind to it — the head never received any weight updates. Fixed by appending lm_head.weight to opt_head.param_groups[0]['params'] at untie.

6. Shared SKC blocks break SkipInit: all layers sharing one SKCLayer block used its internal self.skc_scale — the same residual scale at every depth, collapsing SkipInit's depth-wise gradient invariance. Fixed by adding per_layer_skc_scales (zero-init ParameterList) in GPT.__init__ for shared-block mode, and passing the per-layer scale as external_skc_scale to SKCLayer.forward to override the internal scale. SKCLayer.forward now accepts external_skc_scale=None (None = use internal, preserving non-shared mode behaviour unchanged).

7. TTT carry state stagnation: the TTT sub-batch loop called base_model(x, y, ...), which returns only the loss and discards capsule_state. Every sub-batch received the same frozen carry from the previous chunk end — Koopman acted as a static anchor instead of flowing temporal state, breaking BPTT sequence continuity. Fixed by using forward_logits_with_carry in the TTT loop, computing the loss from logits, and updating carry_capsules sequentially from the returned capsule_state.

8. Causal spectral decay is a step function: prefix_states was broadcast across the block_sz dimension and multiplied by a single decay value, so every token in the block received prefix * decay regardless of its time offset — a flat bias, not a causal scan. Fixed by computing time_decay = decay ** arange(1, block_sz+1) and broadcasting (block_sz, D) so token t sees prefix * decay^(t+1), implementing the correct exponential decay of inter-block state across intra-block time steps.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
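The fix-8 broadcast can be checked in a few lines. Dimensions and the decay value are illustrative; the formula (`time_decay = decay ** arange(1, block_sz+1)`) is the one the commit states:

```python
import torch

decay, block_sz, D = 0.9, 4, 2
prefix = torch.ones(D)                 # inter-block prefix state, toy values

# Per-offset exponential decay: token t in the block sees prefix * decay^(t+1)
time_decay = decay ** torch.arange(1, block_sz + 1, dtype=torch.float32)
contrib = prefix * time_decay.unsqueeze(-1)        # (block_sz, D)

# The buggy version multiplied by a single scalar instead, giving every
# token in the block the same flat prefix * decay bias.
flat = prefix * decay
```

With the fix, the prefix contribution decays geometrically across intra-block time steps, which is what a causal scan of the inter-block state requires.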
…h, dead flag, eval batch

1. SyntaxError — parse blocker: _EXPORT_CALIB was assigned and then declared global twice more in the function body. Python raises a SyntaxError when a global declaration appears after the name has already been assigned in the same scope, preventing the module from loading. Fixed by hoisting global _TURBO_QUANT_TRAIN, _TURBO_QUANT_KV, _EXPORT_CALIB into a single declaration at the top of main(); removed both later duplicate global _EXPORT_CALIB lines.

2. Duplicate Hyperparameters defaults: val_batch_size, val_loss_every, and train_log_every were each defined twice. The first definitions (lines 78-80) were silently overwritten by the second set (lines 214-216), misleading anyone reading the top of the config block and making the apparent defaults wrong. Removed the dead early definitions and added a comment pointing to the canonical definitions further down.

3. torch.cuda.synchronize() crash on CPU/non-CUDA builds: every synchronize call was unconditional, a hard crash on CPU-only or non-CUDA builds. All 13 call sites are now guarded with if device.type == "cuda":

4. _aligned_phase_started never set to True: the flag was initialized False and the export path checked it to skip redundant final calibration, but the flag was never flipped, making the "already calibrated" branch permanently dead and always burning extra calibration wall-clock at export. Fixed by setting _aligned_phase_started = True immediately after mid-training calibration completes and _EXPORT_CALIB is populated.

5. eval_val batch size tied to grad_accum_steps: local_batch_tokens was divided by (world_size * grad_accum_steps). grad_accum_steps is a training microbatch concern with no relevance to evaluation batching; this silently shrank eval throughput as accumulation increased, wasting wall-clock and adding noise to tuning decisions. Fixed by dividing by world_size only.

6. Bug 2 (Block.forward moe_enabled=False crash) was already fixed in ae16ca2; verified the isinstance guard is in place.
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…dynamo resets

Train↔export FP_STORAGE mismatch (roundtrip BPB): q_sd() FP4/FP8-quantises any 2D tensor not in fp16_names when fp_storage is enabled. Several plain nn.Linear layers (MoE router, embed_proj, embed_proj_rev, gate_k in EngramHash) were never wrapped in QATLinear, so they trained without STE exposure but received post-training FP compression at export — a direct train↔export mismatch degrading roundtrip BPB.

Two-pronged fix:
1. export_fp16_param_names() now walks named_modules() and protects every plain nn.Linear (identified by the absence of an fp_storage attribute) by adding its weight to the fp16_names bypass set. Conservative policy: if the layer did not opt in to QAT, it should not be compressed at export either.
2. EngramHash.gate_k converted from nn.Linear to QATLinear, so it trains through the same FP approximation it will see at export when fp_storage is enabled.

MoE router and embed_proj/embed_proj_rev are correctly handled by (1) above — router weights participate in softmax routing and should not be STE-distorted during training; the export bypass is sufficient.

Redundant torch._dynamo.reset() calls removed: two reset() calls fired on curriculum seq-len changes and the one-time seq-len switch. With compile(..., dynamic=True) already in place these were no-ops for correctness but still triggered graph cache invalidation, burning compile budget inside the 599-second training window. Removed both; added comments explaining why. The untie reset is kept (structural graph change — a new requires_grad param, not a shape change) with a comment noting it fires at most once per run.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
1. Async pinned-memory race (data corruption): DistributedTokenLoader used a single pinned staging buffer. After wait_stream() synchronised the GPU, the CPU immediately called _load_raw() for the next prefetch, which copied new data into the same physical pages while the previous DMA transfer may still have been in flight. Replaced the single buffer with a double-buffer pair (_pinned_buffers[0/1]). _get_pinned_buf() toggles the write index on every call so the CPU always writes to the buffer the GPU is NOT currently reading, eliminating the race without adding any synchronisation overhead.

2. SKC SkipInit MLP gap: the previous fix extracted external_skc_scale for the spectral attention-equivalent branch but left SKCLayer.mlp_scale shared across all layers reusing the same block. Deeper layers could not independently gate their MLP contributions, breaking SkipInit depth-wise variance and bottlenecking gradient flow. Fixed by adding an external_mlp_scale parameter to SKCLayer.forward (uses internal self.mlp_scale when None, preserving non-shared mode) and passing per_layer_mlp_scales[layer_idx] from _run_block alongside the existing per_layer_skc_scales[layer_idx].

3. VRL alpha initialization shock: vrl_alpha = tensor(0.5) → sigmoid(0.5) ≈ 0.62, blending 38% V0 into every value computation from step 0 and shocking early momentum buffers before the model has learned when to open the shortcut. Changed to tensor(4.0) → sigmoid(4.0) ≈ 0.98, keeping the network grounded at init while still allowing the gate to open organically during training.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
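The double-buffer toggle in fix 1 reduces to a tiny index-flipping pattern. This sketch uses plain Python objects in place of pinned CUDA tensors; the class and method names are illustrative:

```python
class DoubleBuffer:
    """Alternate between two staging buffers so the producer never
    overwrites pages the consumer (here, an in-flight DMA) may still
    be reading. Stands in for _pinned_buffers[0/1] / _get_pinned_buf()."""
    def __init__(self, make_buf):
        self._bufs = [make_buf(), make_buf()]
        self._idx = 0

    def next(self):
        self._idx ^= 1               # toggle the write index on every call
        return self._bufs[self._idx]

db = DoubleBuffer(lambda: bytearray(8))
buf_a, buf_b = db.next(), db.next()  # consecutive calls return different buffers
```

Because the CPU only ever writes to the buffer the GPU is not reading, no extra synchronisation is needed: the existing stream waits already order reads against the *previous* write to the same buffer, two calls earlier.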
gate_scale was an unconstrained nn.Parameter initialised at 1.0. If the optimiser drives it large, sigmoid(logits * gate_scale) saturates toward hard 0/1 and kills gradient flow through the Engram path entirely. Replace with _gate_scale_raw: effective_scale = clamp(softplus(raw) + 0.1, max=4.0).
- Lower bound 0.1: the gate never fully collapses and always passes some gradient.
- Upper bound 4.0: the gate cannot saturate hard enough to block learning.
- softplus ensures the effective value is always positive.
- Initial raw value 0.541 → softplus(0.541) + 0.1 ≈ 1.1, close to the old default of 1.0.
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
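The bounded reparameterisation above is a one-liner. A minimal sketch of the stated formula, with the bounds checked at the extremes (the function name is a stand-in):

```python
import math

# Sketch of the bounded gate scale: effective = clamp(softplus(raw) + 0.1, max=4.0).
# The lower bound keeps gradient flowing; the upper bound prevents hard saturation.

def effective_gate_scale(raw):
    softplus = math.log1p(math.exp(raw))     # always positive, smooth
    return min(softplus + 0.1, 4.0)

print(effective_gate_scale(-50.0))  # ~0.1: gate never fully collapses
print(effective_gate_scale(50.0))   # 4.0: gate cannot saturate past the cap
```

Because softplus is monotone and unbounded above, the optimiser retains a usable gradient on `raw` everywhere except at the hard 4.0 cap.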
- TernaryMoE: skip routing_weights normalization when top_k=1; the division forced the weight to exactly 1.0, zeroing the CE gradient through the router
- Adaptive halting: roll back grounded_capsule_state to prev_capsule_state on halt so decoder features and capsule state stay in sync (pass N)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
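The top_k=1 bug is easy to see on scalars: dividing a single selected routing weight by its own sum yields exactly 1.0, a constant with no gradient back to the router. A pure-Python sketch (the function and flag names are illustrative; the real code operates on router logit tensors):

```python
# Sketch of the top_k=1 routing fix: normalising a single top-k weight by its
# own sum forces it to exactly 1.0, detaching the router from the CE loss.
# Skipping normalisation keeps the softmax probability (and its gradient) live.

def routing_weights(probs_topk, top_k, normalize_fix=True):
    if normalize_fix and top_k == 1:
        return probs_topk                   # keep raw probability, gradient intact
    s = sum(probs_topk)
    return [p / s for p in probs_topk]      # renormalise across selected experts

print(routing_weights([0.37], top_k=1))                       # [0.37]
print(routing_weights([0.37], top_k=1, normalize_fix=False))  # [1.0] -- constant
```

With top_k > 1 the renormalisation is still meaningful, since the selected probabilities no longer sum to the per-expert softmax mass.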
… NCCL smoke deadlock

- VRL: remove the .detach() on the v0 capture so deep layers can backprop into the Layer 0 representation via the value-residual super-highway
- KoopmanBlock: add an elapsed_fraction param and guard the MoE call behind isinstance(TernaryMoE) so Koopman-layer experts obey the moe_start_fraction curriculum instead of being live from step 0
- eval_val smoke test: replace the wall-clock time breakout with a deterministic batch-count limit (FAST_SMOKE_BATCHES, default 128); time-based breaks are unsynchronized across ranks and cause an NCCL deadlock at all_reduce

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…psules

- eval_val_sliding_ttt: switch from temporal rank-split to document-level parallelism (ci % world_size == rank owns the full chunk). Remove dist.all_reduce on TTT gradients — cross-rank grad averaging was mixing future-chunk gradients into past-chunk optimizer steps, contaminating BPB
- _compute_hidden/_decoder_pass: pass prev_capsules=None to _run_block. carry_capsules is the global CapsuleBank summary (high-level manifold) and must not be injected into per-layer SKCLayer Koopman dynamics, which live on a different low-level state manifold

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- gptq_lite_clip_search: measure MSE against the original t, not t_clipped; measuring clipped vs recon made catastrophic clipping appear optimal
- eval_val_sliding_ttt: partition val_tokens into contiguous rank segments (rank_start/rank_end) instead of round-robin chunks; interleaved chunks break Koopman carry continuity and corrupt TTT convergence across chunks
- GPT.__init__: if layer 0 is not attn (e.g. SKC), disable vrl_enabled on all attention blocks; layer 0 always returns v_out=None, so vrl_alpha params in deeper layers are dead weight that never participate in VRL

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
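The contiguous rank partition can be sketched in a few lines. This is an illustrative version (the helper name is hypothetical): each rank owns one unbroken span of chunk indices, so carry state evolves over a contiguous token stream rather than every world_size-th chunk.

```python
# Sketch of contiguous rank segmentation: rank r owns chunks
# [rank_start, rank_end), with the remainder spread over the first ranks.
# Contiguity preserves Koopman carry / TTT state continuity within a rank.

def rank_segment(n_chunks, world_size, rank):
    base, rem = divmod(n_chunks, world_size)
    rank_start = rank * base + min(rank, rem)
    rank_end = rank_start + base + (1 if rank < rem else 0)
    return range(rank_start, rank_end)

segs = [list(rank_segment(10, 4, r)) for r in range(4)]
print(segs)  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

The segments are disjoint and cover every chunk exactly once, so the per-rank BPB sums can still be reduced at the end without double counting.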
… DDP zero-fill

- TernaryMoE/Block/KoopmanBlock/SKCLayer: return aux losses functionally as a 3rd return element instead of storing them in self.aux_loss / self.spectral_aux_loss. _run_block → _decoder_pass → _compute_hidden propagate and aggregate them; GPT.forward reads block_aux from the _compute_hidden return — no module-state mutation inside the compiled graph, so no Dynamo side-effect trap/recompile. The MoE aux is pre-scaled by aux_loss_coef (== moe_router_aux_loss_coef) set at TernaryMoE construction; the SKC spectral aux is pre-scaled by weight=0.01 in SpectralTernaryAuxLoss — both are added directly to ce_loss without extra scaling.
- TernaryLinear: register calib_thr / calib_scale_mult as nn buffers (persistent=False). Add GPT.apply_export_calib() to write values in-place; forward reads the buffers instead of the _EXPORT_CALIB global dict. Dynamo tracks buffer tensors dynamically; the global dict is baked into the CUDA graph at step 0 and stays stale after mid-training calibration.
- Muon.step: guard active_param_mask with p.grad.norm() > 1e-7 instead of is not None. DDP find_unused_parameters=True zero-fills gradients for dormant experts, so is not None is always True — the magnitude check distinguishes real gradients from zero-fill.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
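The third bullet's magnitude guard is worth spelling out, since `grad is not None` looks like a reasonable activity check until DDP's zero-fill makes it vacuous. A plain-float sketch (the real check is `p.grad.norm() > 1e-7` on tensors):

```python
import math

# Sketch of the Muon active-mask guard: under DDP find_unused_parameters=True,
# dormant experts receive zero-filled gradients, so `grad is not None` is
# always true; a small-magnitude threshold distinguishes real gradients
# from zero-fill. Gradients are plain float lists here for illustration.

def is_active(grad, eps=1e-7):
    if grad is None:
        return False
    return math.sqrt(sum(g * g for g in grad)) > eps

print(is_active([0.0, 0.0, 0.0]))     # False: DDP zero-fill, expert dormant
print(is_active([0.01, -0.02, 0.0]))  # True: real gradient
print(is_active(None))                # False: no gradient at all
```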
…asts

- eval_val_sliding_ttt: hoist SGD optimizer construction outside the chunk loop so momentum buffers persist across chunks; per-chunk recreation silently reset accumulated curvature and degraded TTT convergence
- EngramHash.forward: remove redundant .float() casts on hidden/memory before the cosine-similarity gate; normalize in the native compute dtype (bf16/fp16) since cosine similarity is scale-invariant, and cast gate_scale to match instead

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
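Why hoisting the optimizer matters can be shown with a tiny momentum-SGD simulation on plain floats (the function is a stand-in, not the real TTT loop): a momentum buffer that carries across chunks produces a larger second-chunk step than one that is silently zeroed by per-chunk recreation.

```python
# Plain-float simulation of momentum SGD (mu=0.9): recreating the optimizer
# per chunk resets the momentum buffer, so identical gradients produce a
# smaller update than when momentum is carried across chunks.

def sgd_steps(grads, mu=0.9, lr=0.1, momentum=0.0):
    updates = []
    for g in grads:
        momentum = mu * momentum + g
        updates.append(lr * momentum)
    return updates, momentum

# Chunk 1 in both regimes:
u1, m = sgd_steps([1.0])
# Chunk 2, hoisted optimizer: momentum carries over.
u2_carry, _ = sgd_steps([1.0], momentum=m)
# Chunk 2, per-chunk recreation: momentum silently reset to zero.
u2_reset, _ = sgd_steps([1.0])
print(u2_carry[0], u2_reset[0])  # 0.19 vs 0.1: carried momentum steps harder
```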
…ault, post-training overhead

run_runpod_8xh100.sh:
- TURBO_QUANT_TRAIN=1 (was 0) — must match EXPORT=1; a mismatched Hadamard rotation corrupts the ternary round-trip weight space and roundtrip BPB
- Fix curriculum comment: CURRICULUM_ENABLED does drive active_seq_len in the current trainer; it is intentionally disabled here, not dead code

run_small_skc_2gpu.sh:
- TURBO_QUANT_TRAIN=1 default (was 0) — same round-trip fidelity fix
- WEIGHT_SHARING=0, INSIDE_OUT_TRAINING=0 — both are in the trainer's unsupported-flag reject list and cause immediate startup failure
- NPROC_PER_NODE default 2 (was 1) — the "2-GPU script" defaulted to 1 GPU
- TTT_ENABLED=0, SLIDING_EVAL=0, TEMP_SCALING=0, NGRAM_CACHE_ENABLED=0 defaults for proxy runs — post-training eval paths extend total wall-clock beyond 10 minutes even though they run outside the 599s training budget

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
After final_evaluation applies EMA shadow, the original weights were never restored before the export path called apply_shadow again. The second call saved shadow-weights as _ema_original and re-applied shadow (no-op), leaking the true originals. Restore via _final_eval_ema_orig so the export path applies EMA exactly once from the true trained weights. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
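The required pairing is apply → restore → apply, so that the export-path snapshot sees the true trained weights. A minimal sketch with plain dicts standing in for parameter tensors (class and method names mirror the description above but are illustrative):

```python
# Sketch of the EMA restore fix: apply_shadow() snapshots current weights as
# "originals" before overwriting them with the shadow. If the first apply is
# never restored, a second apply snapshots shadow weights as originals and
# the true trained weights leak away permanently.

class EMA:
    def __init__(self, shadow):
        self.shadow = dict(shadow)
        self._orig = None

    def apply_shadow(self, weights):
        self._orig = dict(weights)      # snapshot whatever is live right now
        weights.update(self.shadow)

    def restore(self, weights):
        weights.update(self._orig)
        self._orig = None

w = {"w": 1.0}                          # true trained weight
ema = EMA({"w": 0.5})                   # EMA shadow
ema.apply_shadow(w)                     # final evaluation runs on the shadow
ema.restore(w)                          # the fix: restore BEFORE export
ema.apply_shadow(w)                     # export snapshots the TRUE originals
print(ema._orig["w"], w["w"])           # 1.0 0.5
```

Without the `restore` call, the second `apply_shadow` would record `0.5` as the original, exactly the leak described above.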
Both are unimplemented experimental flags not consumed by the trainer; explicit =0 prevents trainer defaults from accidentally enabling them. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…librate_ternary On slow GPUs (RTX 4090), a single _proxy_roundtrip_bpb eval can exceed the calib_max_seconds soft budget, leaving rank 1 hung at broadcast_object_list indefinitely. Wrap the calibrate_ternary call in a daemon thread with a hard wall-clock deadline (calib_max_seconds + 30s grace) so rank 0 always returns to the collective within the NCCL timeout. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
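The daemon-thread deadline pattern can be sketched independently of the calibration code. `slow_calibration` below is a stand-in for the real calibrate_ternary call, and the deadline value is arbitrary; the essential property is that the caller always returns within the deadline, abandoning the worker if it overruns.

```python
import threading
import time

# Sketch of the hard wall-clock deadline: run the possibly-slow call in a
# daemon thread and join with a timeout, so the calling rank always returns
# to the NCCL collective. An overrunning worker is abandoned (daemon=True
# means it cannot keep the process alive).

def run_with_deadline(fn, deadline_s):
    result = {}
    t = threading.Thread(target=lambda: result.update(value=fn()), daemon=True)
    t.start()
    t.join(timeout=deadline_s)
    return result.get("value")          # None if the deadline expired

def slow_calibration():
    time.sleep(5.0)                     # simulates a slow-GPU eval overrun
    return "calibrated"

print(run_with_deadline(lambda: "fast", 1.0))    # 'fast'
print(run_with_deadline(slow_calibration, 0.2))  # None: deadline hit
```

The trade-off is that an abandoned worker's partial results are discarded; the grace margin (the +30s in the commit) exists so this only triggers on genuinely pathological runs.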
Phase 0-8 implementation of the SKC competition branch:

Tokenizer/data:
- tokenizer_specs.json: add sp_bpe_8192 (fineweb10B_sp8192, vocab=8192)

train_gpt.py:
- New config flags: EXPORT_MODE, RECURRENCE_DEPTH, RECURRENCE_START_FRACTION, SKC_PARALLEL_RESIDUAL
- Architecture: 'skc_competition' alias for 'skc' (enables competition context)
- ParallelSKCBlock: parallel SKC+MLP paths merge before the residual add (parallel lanes read the same normed input; zero-init merge scalars)
- Dynamic recurrence scheduling: training_depth_recurrence activates at RECURRENCE_START_FRACTION of wall-clock (stable early, virtual depth mid-run)
- TTT scope expansion: 'skc_safe' scope includes decay_rates, resid_mix, mixer_conv, per_layer_skc_scales — competition-grade beyond feedback-only
- Competition export path: EXPORT_MODE=competition_gptq uses brotli-first compression (fallback to LZMA), writes final_model.competition.ptz
- Startup banner logs tokenizer regime, arch, recurrence, export, and TTT mode

Shell scripts:
- run_skc_competition_8xh100.sh: SP8192, skc_competition, parallel residual, depth recurrence, frontier HP, legal score-first TTT (skc_safe), competition export — the canonical competition launcher
- run_skc_competition_2gpu_proxy.sh: same config at D=640 for ablation proxies
- best_known_competition.env: documented best-known operating point
- competition_ablation_manifest.sh: mandatory 7-ablation program with sweep grid

Branch state:
- skc_research_ternary_sp1024 tag: freezes the pre-competition research baseline
- skc_competition_sp8192 branch: this competition stack

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…T scope

Phase 1 — MuonEq-R optimizer (train_gpt.py ns_orth):
Add per-row L2 normalization before the global normalization in Newton-Schulz orthogonalization. This clamps row-wise variance from KoopmanTokenMixer/SKCLayer spectral ops, allowing higher sustained matrix LRs without divergence. The row-norm clamp(min=eps) also handles dormant MoE experts' zero-gradient rows.

Phase 2 — Entropy reduction / capacity reinvestment (all run scripts):
- MUON_WD + ADAM_WD: 0.04 → 0.090 across all scripts (forces weight sparsity, dramatically improving LZMA/brotli compression efficiency)
- MLP_MULT: 2/3 → 4 across all scripts (reinvests artifact headroom into the MLP with lrelu2 activation to prevent dead neurons under extreme regularization)

Phase 3 — EngramLite migration (train_gpt.py + competition scripts):
- Default bigram_hash_buckets: 4096 → 3072
- Default bigram_hash_dim: 64 → 112 (28 per head at 4 total heads)
- EngramHash.forward(): add an unrolled static fast-path for num_orders=2, num_heads=2 (4 total heads). Pre-allocates a single h_indices tensor with two explicit writes (bigrams cols 0:2, trigrams cols 2:4) — eliminates the Python loop and dynamic tensor slices that risk torch.compile graph breaks. Falls back to the existing generalized loop for other configurations.
- Competition scripts: ENGRAM_NUM_HEADS=2 (2×2=4 total → fast-path active)

Phase 4 — TTT capsule_bank scope + SGD verification (train_gpt.py):
- Add "capsule_bank" as a TTT scope: the most restrictive option, updating only capsule_bank.* params, with zero risk of SP8192 embedding-table OOM. A scope guard ensures capsule_bank mode skips all blocks.* and per_layer_*_scales.* checks.
- Verified: ttt_optimizer remains torch.optim.SGD (not AdamW).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
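The Phase 1 pre-step can be sketched with pure-Python lists standing in for the real tensors. This shows only the normalization ordering (per-row L2 with an eps clamp, then the usual global normalization before the Newton-Schulz iterations); the function name is a stand-in for the real ns_orth entry.

```python
import math

# Sketch of the MuonEq-R pre-step: per-row L2 normalization (clamped at eps
# so all-zero rows from dormant MoE experts stay finite) applied before the
# global Frobenius normalization that Newton-Schulz orthogonalization expects.

def row_then_global_normalize(mat, eps=1e-7):
    # Per-row L2 norm, clamped so an all-zero row divides by eps, not zero.
    rows = []
    for row in mat:
        n = max(math.sqrt(sum(x * x for x in row)), eps)
        rows.append([x / n for x in row])
    # Global Frobenius normalization, as before the change.
    g = max(math.sqrt(sum(x * x for r in rows for x in r)), eps)
    return [[x / g for x in r] for r in rows]

out = row_then_global_normalize([[3.0, 4.0], [0.0, 0.0]])
print(out)  # nonzero row scaled to unit L2; zero row stays zero, no NaN
```

After the row step every nonzero row contributes equal energy, which is what bounds the row-wise variance the commit describes.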
Experiment run directories (small_skc_multigpu_20260410_*):
13 proxy run records from 2026-04-10 sweeps including logs, GPU candidate
lists, orchestrator state, and one complete run artifact with submission.json.
Sweep scripts:
sweep_batch_size_small_skc.sh — batch size ablation sweep
sweep_compiler_modes_small_skc.sh — torch.compile mode comparison
sweep_performance_full.sh — full performance sweep harness
Orchestrate improvements (orchestrate_small_skc_multigpu_runpod.sh):
- GPU watchdog (watch_gpu_utilization): detects stuck-at-0% GPU utilization
after 10-minute compile grace period; terminates pod after 3 consecutive
zero-util polls to avoid burning credits on a hung run
- BALANCE_FLOOR: 30 → 10 (less aggressive early termination)
- SMOKE_TIMEOUT_SECONDS: 180 → 600 (accommodate longer compile warmup)
- die() now cleans up both GPU_WATCH_PID and BALANCE_WATCH_PID before exit
- Forward additional env vars: EMA_START_FRACTION, MOE_ROUTER_AUX_LOSS_COEF,
WARMDOWN_FRACTION
sweep.pid: updated to reflect current sweep state
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
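The watchdog's decision logic from the orchestrator changes above can be isolated as a small function. Poll values are injected for illustration (the real script polls GPU utilization in a shell loop); the grace period and streak threshold mirror the described behaviour.

```python
# Sketch of the GPU watchdog decision: polls inside the compile grace period
# are ignored; afterwards, a streak of consecutive 0% utilization polls of
# length `threshold` (3 in the orchestrator) triggers pod termination.

def should_terminate(util_polls, grace_polls, threshold=3):
    zero_streak = 0
    for i, util in enumerate(util_polls):
        if i < grace_polls:
            continue                    # still inside the compile grace period
        zero_streak = zero_streak + 1 if util == 0 else 0
        if zero_streak >= threshold:
            return True
    return False

print(should_terminate([0, 0, 0, 0, 0], grace_polls=4))    # False: grace covers it
print(should_terminate([90, 85, 0, 0, 0], grace_polls=0))  # True: 3-poll hang
print(should_terminate([90, 0, 0, 80, 0], grace_polls=0))  # False: recovers
```

Requiring consecutive zeros (rather than a count of zeros) is what makes the watchdog robust to brief utilization dips between compile phases.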
Add an EvalEngram module (VRAM-resident, val-stream-populated, Laplace-smoothed) that complements the packed EngramHash via additive entropy-gated logit correction, designed to stack with Legal TTT. Gated behind EVAL_ENGRAM_ENABLED (default off). Add FREEZE_PACKED_ENGRAM snapshot/restore around eval so TTT cannot drift the exported tables.

Also harden the extraction pipeline: add a --dce flag to build_submission.py, prune unused imports post-DCE, and fail fast via py_compile if extraction produces invalid source.

Delete stale train_gpt_mlx.py / train_gpt_min.py / train_gpt_extracted.py from root. Archive the latest SKC competition ablation records and orchestration tweaks.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Align EvalEngram population with the competition's Legal TTT rule ("you may
only test-time train on val tokens you've already evaluated"). Absorb happens
strictly after a chunk's SCORE phase completes under torch.no_grad(), never
mid-chunk. Last chunk is scored but not absorbed (no consumer). This makes
EvalEngram complementary to Legal TTT's SGD adapter-update on the same
already-graded chunks.
Also snapshot the packed EngramHash tables on eval start and restore them in
finally (warn-and-restore in prod, raise under FREEZE_CHECK_STRICT=1 in dev)
so Legal TTT SGD cannot drift the artifact-exported tables.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
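The absorb-after-score ordering and Laplace smoothing can be sketched together. This is an illustrative toy, not the real module: the class, method names, and the simple bigram context are assumptions; the real EvalEngram is VRAM-resident and entropy-gated. What the sketch shows is the legal ordering (a chunk's tokens are only absorbed after that chunk is scored) and the smoothed count-to-logit conversion.

```python
import math
from collections import defaultdict

# Toy sketch of a Laplace-smoothed context->token count table with an additive
# log-probability correction. absorb() is called strictly AFTER a chunk's
# SCORE phase, matching the Legal TTT rule that only already-evaluated val
# tokens may influence later predictions.

class EvalEngramSketch:
    def __init__(self, vocab_size, alpha=1.0):
        self.vocab_size, self.alpha = vocab_size, alpha
        self.counts = defaultdict(lambda: defaultdict(int))

    def absorb(self, tokens):
        # Post-score population: count (context, next-token) bigrams.
        for ctx, nxt in zip(tokens, tokens[1:]):
            self.counts[ctx][nxt] += 1

    def logit_correction(self, ctx, token):
        c = self.counts[ctx]
        total = sum(c.values())
        # Laplace smoothing: unseen pairs get alpha pseudo-counts, so the
        # correction is finite even for contexts never observed in the stream.
        p = (c[token] + self.alpha) / (total + self.alpha * self.vocab_size)
        return math.log(p)

eng = EvalEngramSketch(vocab_size=4)
eng.absorb([1, 2, 1, 2, 1, 3])  # chunk already scored, legal to absorb
print(eng.logit_correction(1, 2) > eng.logit_correction(1, 0))  # True
```

The last chunk is scored but never absorbed, exactly as described: with no later chunk to consume the counts, absorbing it would be wasted work with no legality benefit.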
Summary
- `EvalEngram` (VRAM-resident, val-stream-populated, Laplace-smoothed) complements the packed `EngramHash` via entropy-gated additive logit correction, designed to stack with Legal TTT. Gated behind `EVAL_ENGRAM_ENABLED` (default off).
- `FREEZE_PACKED_ENGRAM` snapshot/restore around eval so TTT cannot drift the exported tables (strict-assert in dev via `FREEZE_CHECK_STRICT=1`, warn-and-restore in prod).
- `build_submission.py --dce` flag, unused-import pruning, py_compile smoke post-extraction.
- Deleted stale root scripts (`train_gpt_mlx.py`, `train_gpt_min.py`, `train_gpt_extracted.py`).

Budget impact (SKC-600, 12L, ternary + FP4)
`PackedEngramLite` at `BUCKETS=4096` fits cleanly with ~128 KB headroom.

Test plan
`train_gpt.py`

🤖 Generated with Claude Code