
Non-record: Nemotron-H Mamba-3 Hybrid + First SSM Depth Recurrence (1.4765 BPB)#1607

Open
inin-zou wants to merge 1 commit into openai:main from inin-zou:submission/nemotron-h-mamba3-depth-recurrence

Conversation

inin-zou commented Apr 14, 2026

Summary

  • First Mamba depth recurrence in the competition (checks off "State-space models" from Requests for PRs)
  • Nemotron-H inspired hybrid: 7 Mamba-3 SISO + 1 Attention (8 physical layers → 12 virtual via hinge-point recurrence)
  • Novel hinge-point multi-recurrence: layers 3,4 repeated 2x at U-Net hinge, outperforms spread recurrence
  • val_bpb: 1.4765 post-quant (1000 steps, 1xH100, GPTQ int6+LZMA, 8.2MB artifact)
  • Systematic ablation of 6 recurrence configs, 3 quantization strategies, and 3 architectural variants
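The GPTQ int6 + LZMA artifact pipeline mentioned above can be sketched roughly as follows. This is a hedged stand-in: it uses naive round-to-nearest symmetric quantization in place of Hessian-aware GPTQ, and the function names (`quantize_int6`, `packed_size_bytes`) are illustrative, not the PR's actual code. It only shows the quantize-then-compress shape behind the 8.2MB artifact-size check.

```python
# Sketch of the quantize-then-compress artifact pipeline (assumption:
# round-to-nearest stands in for the real Hessian-aware GPTQ step).
import lzma

def quantize_int6(weights):
    """Symmetric round-to-nearest quantization to the 6-bit range [-31, 31]."""
    scale = max(abs(w) for w in weights) / 31.0 or 1.0  # guard all-zero tensors
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def packed_size_bytes(weights):
    """LZMA-compress the quantized weights (stored one int per byte here,
    whereas a real int6 packer would bit-pack before compressing)."""
    q, _ = quantize_int6(weights)
    payload = bytes(v & 0xFF for v in q)
    return len(lzma.compress(payload, preset=9))
```

In the submission itself, the artifact-size budget (< 16MB) is checked against the compressed payload rather than the raw checkpoint.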

Key Findings

| Finding | Detail |
| --- | --- |
| Mamba depth recurrence works | −0.0092 BPB vs. no recurrence (first SSM recurrence result in the competition) |
| Focused > spread recurrence | Hinge ×2 (1.2824) beats 4-layer ×1 (1.2864) at the same virtual depth |
| Ternary Mamba not viable at 26M | +0.397 BPB worse (literature suggests ≈1.3B params minimum) |
| Q-Mamba DSQ not needed | Standard full-Hessian GPTQ already handles SSM outliers (0.082 vs. 0.148 quant loss) |
| RoPE removal hurts at small scale | +0.072 BPB worse (unlike Jamba 1.3B, where it is neutral) |
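For readers comparing the deltas above: bits-per-byte is the validation cross-entropy expressed in bits per UTF-8 byte. A minimal conversion sketch, assuming the loss is reported in nats (the helper names and the tokens/bytes rescaling are illustrative, not from this PR's code):

```python
# Sketch: converting mean cross-entropy (in nats) to bits-per-byte (BPB).
import math

def bpb(loss_nats_per_byte: float) -> float:
    """Byte-level loss in nats -> bits per byte (divide by ln 2)."""
    return loss_nats_per_byte / math.log(2)

def bpb_from_token_loss(loss_nats_per_token: float, tokens: int, n_bytes: int) -> float:
    """Token-level loss rescaled by the dataset's tokens-per-byte ratio."""
    return loss_nats_per_token * tokens / (n_bytes * math.log(2))
```

Under this definition, a −0.0092 BPB delta is a direct reduction in compressed bits per byte of validation text.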

Architecture

```
Physical: [Mamba3_0, Mamba3_1, Mamba3_2, Mamba3_3, Attn_4, Mamba3_5, Mamba3_6, Mamba3_7]
Virtual:  [M0, M1, M2, M3, A4, M3, A4, M3, A4, M5, M6, M7]  (12 layers, 0 extra params)
```
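The virtual schedule above can be sketched as follows. This is an illustrative expansion of the hinge-point multi-recurrence, not the PR's actual forward pass; the layer labels match the diagram, and `virtual_schedule` is a hypothetical helper name.

```python
# Sketch: expand 8 physical layers into 12 virtual layers by repeating the
# (M3, A4) hinge pair twice after its first pass. No extra parameters: the
# schedule only reuses existing layer indices.
PHYSICAL = ["M0", "M1", "M2", "M3", "A4", "M5", "M6", "M7"]

def virtual_schedule(physical, hinge=("M3", "A4"), repeats=2):
    """Build the virtual layer order, looping the hinge pair `repeats` times."""
    sched = []
    for layer in physical:
        sched.append(layer)
        if layer == hinge[1]:  # after the attention hinge, loop back
            for _ in range(repeats):
                sched.extend(hinge)
    return sched
```

At forward time the same module weights are invoked at each repeated position, which is how 12 virtual layers come from 8 physical ones with zero added parameters.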

Credits

Built on PR #1355 (best SSM) pipeline. Inspired by NVIDIA Nemotron-H (arXiv 2504.03624), Mamba-3 (ICLR 2026), and PR #1204 (depth recurrence concept).

Test plan

  • Verify the script runs: `torchrun --standalone --nproc_per_node=1 train_nemotron_hybrid.py` with env vars from the README
  • Check artifact < 16MB (currently 8.2MB)
  • Pending: 8xH100 10-min run (awaiting OpenAI compute grant)


First Mamba depth recurrence in Parameter Golf.
7 Mamba-3 + 1 Attention hybrid with hinge-point multi-recurrence
(12 virtual layers from 8 physical, zero extra params).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
