
Non-Record v2: 7L UNet + Int8 QAT + EMA + Long Train — 1.3969 BPB (DGX Spark)#1606

Open
AlirezaAlampour wants to merge 1 commit into openai:main from AlirezaAlampour:submission/v2-int8-7l-4xmlp-ema

Conversation

@AlirezaAlampour

Non-Record v2: 7-Layer UNet + Int8 QAT + EMA + 4-Hour Training

Track: non_record_16mb
Hardware: NVIDIA DGX Spark (1× GB10 Blackwell, 128GB unified memory)
Artifact size: ~15.5 MB (int8+zlib)
Improvement over v1 (PR #1486): -0.27 BPB (1.6656 → 1.3969)

Results

Seed   val_bpb (int8+zlib roundtrip)   Steps   Artifact
1337   1.39982649                      1033    15,518,077 B
42     1.39564112                      1041    15,559,281 B
314    1.39532841                      1040    15,565,024 B
Mean   1.39693

Seed range: 0.0045 BPB (~0.32% of mean) — very stable configuration.
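The mean and seed range above can be recomputed directly from the table (a quick arithmetic check, not part of the training code):

```python
# Per-seed validation BPB from the results table above.
val_bpb = {1337: 1.39982649, 42: 1.39564112, 314: 1.39532841}

mean = sum(val_bpb.values()) / len(val_bpb)
spread = max(val_bpb.values()) - min(val_bpb.values())

print(round(mean, 5))                 # 1.39693
print(round(spread, 5))               # 0.0045
print(round(100 * spread / mean, 2))  # 0.32 (% of mean)
```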

What changed from v1

My v1 submission (PR #1486) hit 1.6656 BPB at only ~320 steps due to a
restrictive 80-minute wallclock. v2 addresses the main bottleneck —
training length — and tunes the configuration for the ~1000-step regime
the Spark can reach in 4 hours.

Config changes:

  • 9 layers → 7 layers (frees budget for wider MLP)
  • MLP mult 2 → MLP mult 4 (more capacity per layer)
  • Matrix LR 0.04 → 0.0826 (higher LR works better with fewer, wider layers)
  • Muon momentum 0.95 → 0.9383
  • Warmdown iters 1200 → 1558
  • Head LR 0.008 → 0.0 (tied embeddings handle the output projection)
  • Training wallclock 80 min → 4 hours (~1040 steps vs ~320)
  • Kept: Int8 QAT, EMA (0.997), Muon optimizer, LeakyReLU², U-Net skips
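Of the kept components, the EMA weight averaging is the simplest to sketch. A minimal numpy stand-in (the actual code presumably updates torch tensors in the training loop):

```python
import numpy as np

# EMA weight averaging with decay = 0.997, as kept from v1.
DECAY = 0.997

def ema_update(ema_params, params, decay=DECAY):
    """In-place EMA: ema <- decay * ema + (1 - decay) * current weights."""
    for e, p in zip(ema_params, params):
        e *= decay
        e += (1.0 - decay) * p

rng = np.random.default_rng(0)
params = [rng.normal(size=(4, 4))]
ema = [p.copy() for p in params]

for _ in range(100):        # simulate 100 training steps
    params[0] += 0.01       # stand-in for an optimizer update
    ema_update(ema, params)

# The EMA copy lags the raw weights, smoothing step-to-step noise.
lag = float(np.abs(params[0] - ema[0]).mean())
print(lag)
```

At evaluation time the EMA weights, not the raw weights, would be quantized and serialized.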

The configuration was found through Optuna hyperparameter sweeps and
cross-validated on the DGX Spark's step-time characteristics (~13.8s/step).
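As a sanity check on the step budget, assuming the ~13.8 s/step figure above:

```python
# Step budget implied by ~13.8 s/step over a 4-hour wallclock.
SECONDS_PER_STEP = 13.8
WALLCLOCK_S = 4 * 3600

steps = int(WALLCLOCK_S / SECONDS_PER_STEP)
print(steps)  # 1043, consistent with the 1033-1041 steps observed per seed
```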

Techniques

  • U-Net skip connections with learned per-block residual mixing
  • 7 layers, d=512, 8 heads, 4 KV heads (GQA), MLP 4×
  • LeakyReLU² activation (negative_slope=0.5, squared)
  • Muon optimizer with Newton-Schulz orthogonalization + Adam for embeddings
  • Int8 per-row QAT with straight-through estimator
  • EMA weight averaging (decay=0.997)
  • zlib compression on serialized checkpoint
  • Logit softcap (tanh, cap=30), RoPE, tied embeddings
  • SentencePiece BPE vocab 1024, seq_len 1024
  • ~20.7M parameters
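A minimal sketch of the per-row int8 quantization behind the QAT item (numpy, forward pass only; during training a straight-through estimator would treat the rounding as identity in the backward pass):

```python
import numpy as np

def quantize_per_row_int8(w):
    """Symmetric per-row int8 quantization: one scale per output row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-12)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(512, 512)).astype(np.float32)

q, scale = quantize_per_row_int8(w)
w_hat = dequantize(q, scale)

# Rounding to the nearest multiple of the row scale bounds the
# roundtrip error by half a quantization step per row.
err = np.abs(w - w_hat).max()
print(f"max roundtrip error: {err:.2e}")
```

In QAT, the forward pass uses `w_hat` so the network learns weights that survive this roundtrip, which is why the int8+zlib checkpoint evaluates at the reported BPB.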

What's still interesting

Still developed from a zero ML background using AI-assisted coding across
Claude, GPT, and Gemini. All training ran on a single consumer Blackwell
GB10 GPU (DGX Spark), with no H100 access. The Spark runs at ~13.8 s/step
vs ~87 ms on 8×H100, limiting training to ~1040 steps in 4 hours.

The 0.27 BPB improvement came from two insights: (1) the previous
submission was step-starved, not misconfigured, and (2) seven wider
layers outperform nine narrower layers at low step counts on this hardware.

3-seed validation on DGX Spark (GB10 Blackwell, single GPU):
  seed 1337: val_bpb 1.39982649 (step 1033)
  seed 42:   val_bpb 1.39564112 (step 1041)
  seed 314:  val_bpb 1.39532841 (step 1040)

Mean 1.39693 BPB, range 0.00450 BPB across seeds. 0.27 BPB improvement
over the 2026-04-07 v1 submission (1.6656 -> 1.3969) via a shallower,
wider 7-layer model, 4x MLP multiplier, and a 4-hour training budget
(vs ~80 min in v1).

Artifact mean 15.55 MB (under 16 MB cap). Same U-Net + Muon + EMA +
LeakyReLU^2 + int8 QAT recipe as v1, retuned hyperparameters.
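The packing step can be sketched as follows. The tensor names and shapes here are hypothetical, but the mechanism matches the recipe described: int8 arrays plus float scales, serialized, then zlib-compressed, and losslessly recoverable for the roundtrip eval.

```python
import io
import zlib
import numpy as np

# Hypothetical mini-checkpoint: an int8 weight tensor plus per-row scales.
rng = np.random.default_rng(0)
state = {
    "embed.q": rng.integers(-127, 128, size=(1024, 512), dtype=np.int8),
    "embed.s": rng.normal(scale=0.02, size=(1024, 1)).astype(np.float32),
}

buf = io.BytesIO()
np.savez(buf, **state)                    # serialize all arrays into one blob
blob = zlib.compress(buf.getvalue(), 9)   # zlib at max compression level

print(f"raw: {buf.getbuffer().nbytes:,} B, compressed: {len(blob):,} B")

# zlib is lossless: decompressing restores the checkpoint bit-exactly,
# which is what the "int8+zlib roundtrip" val_bpb numbers measure.
restored = np.load(io.BytesIO(zlib.decompress(blob)))
assert (restored["embed.q"] == state["embed.q"]).all()
```

The real artifact would pack every layer's tensors this way and check the total against the 16 MB cap before submission.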

Co-Authored-By: Claude Opus 4.6 <[email protected]>