
Non-Record v2: 7L UNet + Int8 QAT + EMA + Long Train — 1.3969 BPB (DGX Spark)#1606

Open
AlirezaAlampour wants to merge 1 commit into openai:main from AlirezaAlampour:submission/v2-int8-7l-4xmlp-ema

Conversation

@AlirezaAlampour

Non-Record v2: 7-Layer UNet + Int8 QAT + EMA + 4-Hour Training

Track: non_record_16mb
Hardware: NVIDIA DGX Spark (1× GB10 Blackwell, 128GB unified memory)
Artifact size: ~15.5 MB (int8+zlib)
Improvement over v1 (PR #1486): -0.27 BPB (1.6656 → 1.3969)

Results

Seed   val_bpb (int8+zlib roundtrip)   Steps   Artifact
1337   1.39982649                      1033    15,518,077 B
42     1.39564112                      1041    15,559,281 B
314    1.39532841                      1040    15,565,024 B
Mean   1.39693

Seed range: 0.0045 BPB (~0.32% of mean) — very stable configuration.
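The mean and seed range above can be recomputed directly from the table (a quick arithmetic check, not part of the training code):

```python
# Per-seed validation BPB from the results table above.
val_bpb = {1337: 1.39982649, 42: 1.39564112, 314: 1.39532841}

mean = sum(val_bpb.values()) / len(val_bpb)
spread = max(val_bpb.values()) - min(val_bpb.values())

print(round(mean, 5))                 # 1.39693
print(round(spread, 5))               # 0.0045
print(round(100 * spread / mean, 2))  # 0.32 (% of mean)
```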

What changed from v1

My v1 submission (PR #1486) hit 1.6656 BPB at only ~320 steps due to a
restrictive 80-minute wallclock. v2 addresses the main bottleneck —
training length — and tunes the configuration for the ~1000-step regime
the Spark can reach in 4 hours.

Config changes:

  • 9 layers → 7 layers (frees budget for wider MLP)
  • MLP mult 2 → MLP mult 4 (more capacity per layer)
  • Matrix LR 0.04 → 0.0826 (higher LR works better with fewer, wider layers)
  • Muon momentum 0.95 → 0.9383
  • Warmdown iters 1200 → 1558
  • Head LR 0.008 → 0.0 (tied embeddings handle the output projection)
  • Training wallclock 80 min → 4 hours (~1040 steps vs ~320)
  • Kept: Int8 QAT, EMA (0.997), Muon optimizer, LeakyReLU², U-Net skips
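Of the kept components, the EMA weight averaging is the simplest to sketch. A minimal numpy stand-in (the actual code presumably updates torch tensors in the training loop):

```python
import numpy as np

# EMA weight averaging with decay = 0.997, as kept from v1.
DECAY = 0.997

def ema_update(ema_params, params, decay=DECAY):
    """In-place EMA: ema <- decay * ema + (1 - decay) * current weights."""
    for e, p in zip(ema_params, params):
        e *= decay
        e += (1.0 - decay) * p

rng = np.random.default_rng(0)
params = [rng.normal(size=(4, 4))]
ema = [p.copy() for p in params]

for _ in range(100):        # simulate 100 training steps
    params[0] += 0.01       # stand-in for an optimizer update
    ema_update(ema, params)

# The EMA copy lags the raw weights, smoothing step-to-step noise.
lag = float(np.abs(params[0] - ema[0]).mean())
print(lag)
```

At evaluation time the EMA weights, not the raw weights, would be quantized and serialized.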

The configuration was found through Optuna hyperparameter sweeps and
cross-validated on the DGX Spark's step-time characteristics (~13.8s/step).
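As a sanity check on the step budget, assuming the ~13.8 s/step figure above:

```python
# Step budget implied by ~13.8 s/step over a 4-hour wallclock.
SECONDS_PER_STEP = 13.8
WALLCLOCK_S = 4 * 3600

steps = int(WALLCLOCK_S / SECONDS_PER_STEP)
print(steps)  # 1043, consistent with the 1033-1041 steps observed per seed
```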

Techniques

  • U-Net skip connections with learned per-block residual mixing
  • 7 layers, d=512, 8 heads, 4 KV heads (GQA), MLP 4×
  • LeakyReLU² activation (negative_slope=0.5, squared)
  • Muon optimizer with Newton-Schulz orthogonalization + Adam for embeddings
  • Int8 per-row QAT with straight-through estimator
  • EMA weight averaging (decay=0.997)
  • zlib compression on serialized checkpoint
  • Logit softcap (tanh, cap=30), RoPE, tied embeddings
  • SentencePiece BPE vocab 1024, seq_len 1024
  • ~20.7M parameters
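A minimal sketch of the per-row int8 quantization behind the QAT item (numpy, forward pass only; during training a straight-through estimator would treat the rounding as identity in the backward pass):

```python
import numpy as np

def quantize_per_row_int8(w):
    """Symmetric per-row int8 quantization: one scale per output row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-12)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(512, 512)).astype(np.float32)

q, scale = quantize_per_row_int8(w)
w_hat = dequantize(q, scale)

# Rounding to the nearest multiple of the row scale bounds the
# roundtrip error by half a quantization step per row.
err = np.abs(w - w_hat).max()
print(f"max roundtrip error: {err:.2e}")
```

In QAT, the forward pass uses `w_hat` so the network learns weights that survive this roundtrip, which is why the int8+zlib checkpoint evaluates at the reported BPB.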

What's still interesting

Still developed from a zero ML background using AI-assisted coding across
Claude, GPT, and Gemini. All training ran on a single consumer Blackwell
GB10 GPU (DGX Spark), with no H100 access. The Spark runs at ~13.8 s/step
vs ~87 ms on 8×H100, limiting training to ~1040 steps in 4 hours.

The 0.27 BPB improvement came from two insights: (1) the previous
submission was step-starved, not misconfigured, and (2) seven wider
layers outperform nine narrower layers at low step counts on this hardware.

3-seed validation on DGX Spark (GB10 Blackwell, single GPU):
  seed 1337: val_bpb 1.39982649 (step 1033)
  seed 42:   val_bpb 1.39564112 (step 1041)
  seed 314:  val_bpb 1.39532841 (step 1040)

Mean 1.39693 BPB, range 0.00450 BPB across seeds. 0.27 BPB improvement
over the 2026-04-07 v1 submission (1.6656 -> 1.3969) via a shallower,
wider 7-layer model, 4x MLP multiplier, and a 4-hour training budget
(vs ~80 min in v1).

Artifact mean 15.55 MB (under 16 MB cap). Same U-Net + Muon + EMA +
LeakyReLU^2 + int8 QAT recipe as v1, retuned hyperparameters.
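The packing step can be sketched as follows. The tensor names and shapes here are hypothetical, but the mechanism matches the recipe described: int8 arrays plus float scales, serialized, then zlib-compressed, and losslessly recoverable for the roundtrip eval.

```python
import io
import zlib
import numpy as np

# Hypothetical mini-checkpoint: an int8 weight tensor plus per-row scales.
rng = np.random.default_rng(0)
state = {
    "embed.q": rng.integers(-127, 128, size=(1024, 512), dtype=np.int8),
    "embed.s": rng.normal(scale=0.02, size=(1024, 1)).astype(np.float32),
}

buf = io.BytesIO()
np.savez(buf, **state)                    # serialize all arrays into one blob
blob = zlib.compress(buf.getvalue(), 9)   # zlib at max compression level

print(f"raw: {buf.getbuffer().nbytes:,} B, compressed: {len(blob):,} B")

# zlib is lossless: decompressing restores the checkpoint bit-exactly,
# which is what the "int8+zlib roundtrip" val_bpb numbers measure.
restored = np.load(io.BytesIO(zlib.decompress(blob)))
assert (restored["embed.q"] == state["embed.q"]).all()
```

The real artifact would pack every layer's tensors this way and check the total against the 16 MB cap before submission.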

Co-Authored-By: Claude Opus 4.6 <[email protected]>