Ablation results for all 7 OpenAI-requested research architectures #1500
Replies: 1 comment
Final Results (All 22 Runs Complete)

All runs finished on the DGX Spark. Here is the complete data across all 7 architectures. Full logs and CSV are available as a public gist: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08

Combined Results Table (Sorted by val_bpb)
Final Findings Per Architecture

1. Universal Transformer: Doubling iterations from 6 to 24 (4x compute per step) produces identical BPB (3.2483 vs 3.2490). Full weight sharing plateaus immediately. Mini depth recurrence is the way.

2. Text Diffusion: All three AR/diffusion ratios produce identical BPB to 4 decimal places. The diffusion loss contributes literally nothing to causal eval. A fundamental mismatch, not a tuning problem.

3. Random Adapters: Frozen random orthogonal projections with ~600K trainable params reach 2.51 BPB. That is 1.5 BPB better than chance and suggests the transformer architecture itself (attention, residuals, norms) carries most of the inductive bias. Wider adapters (RND-2) actually hurt, landing at 2.63.

4. JEPA: Three different JEPA weights (10%, 30%, 50%) produce identical BPB. JEPA as a concurrent auxiliary loss has zero effect at 200 steps. It might need a pre-training-stage curriculum.

5. Mamba SSM: Winner on raw BPB at 2.0295, but with a catch. The pure PyTorch SSM runs at 37 seconds per step (50x slower than attention). At that speed, only ~5 effective training steps completed in 200 iterations, yet it still reached the lowest BPB. Strong signal that fast SSM kernels (Triton/CUDA selective scan) would be competitive. SSM-4 with a larger state also hit 2.18 but never completed training.

6. H-Net: Fastest architecture at 513ms per step while still reaching 2.06 BPB. But the chunker configuration makes zero difference: all three variants (default, large chunker, boundary regularizer) produce identical BPB to 4 decimal places. The chunker learns the identity function regardless of what you do to it.

7. Megakernels: Without Triton kernels (unavailable on ARM), we tested PyTorch-equivalent configurations. MEGA-2 with d=640 hit 2.16 BPB, beating MEGA-3 with 11 layers at 2.20. Wider is better than deeper at this model size. Note: the actual Triton kernel speedup would be additive on H100.

Cross-Architecture Insights
Caveats
Raw Data Access

Everything is in the gist: https://gist.github.com/dentity007/324ac35505c27acd18e7ffb468f4fa08
Public domain. Use for any purpose. If you rerun with different configurations or longer training, please share back; I'm interested to see how the ordering shifts.
Sharing early results from an overnight ablation study on all 7 of the "Requests for PRs" architectures from the README. These are non-record submissions (PRs #1191-#1197) but I wanted to go deeper than the initial proof-of-concept runs.
All tests were run on a single NVIDIA DGX Spark GB10 (128GB unified memory, no torch.compile), sp1024 data, 200 training steps, SEED=42. The hardware is slower than 8xH100, but the relative ordering between configurations should hold. No TTT, no SLOT, no eval-time tricks.
Results So Far (13 of 22 runs complete)
1. Universal Transformer (PR #1193)
Finding: More iterations do not help. 24 iterations is 4x slower per step than 6 and achieves virtually identical BPB (3.2490 vs 3.2483). The shared-weight architecture hits a ceiling quickly. This aligns with PR #363's earlier findings. The practical approach is mini depth recurrence (repeat 2-3 specific layers) rather than full weight sharing, which is what PR #1204 and PR #1334 ended up doing.
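The mini-depth-recurrence idea can be sketched as a small wrapper module. This is a hypothetical sketch (the class name and structure are mine, not the actual code from PRs #1204/#1334); `blocks` stands in for 2-3 distinct transformer blocks whose weights get reused across repeats:

```python
import torch
import torch.nn as nn

class MiniDepthRecurrence(nn.Module):
    """Loop over a small set of distinct layers instead of sharing
    weights across the full depth (as the Universal Transformer does)."""
    def __init__(self, blocks: nn.ModuleList, repeats: int):
        super().__init__()
        self.blocks = blocks    # e.g. 2-3 distinct blocks with their own weights
        self.repeats = repeats  # extra effective depth at no extra parameter cost

    def forward(self, x):
        for _ in range(self.repeats):
            for block in self.blocks:
                x = block(x)    # same weights revisited each repeat
        return x
```

The parameter count stays that of the few distinct blocks, while effective depth is `len(blocks) * repeats`; full weight sharing is the degenerate case where every layer in the model is one of these repeated blocks.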
2. Text Diffusion (PR #1194)
Finding: The diffusion loss contributes nothing. All three configurations produce identical BPB at 200 steps. The diffusion head learns to predict masked tokens during training, but since eval is purely autoregressive (causal, left-to-right), none of that knowledge transfers. The 70/30 split is actually slower (1388ms vs 997ms) because the diffusion forward pass adds overhead. Diffusion for text compression appears to be a dead end unless the eval protocol changes.
3. Random Linear Map Adapters (PR #1195)
Finding: Random projections provide a surprisingly strong starting point. All configurations land near 2.51 BPB regardless of adapter configuration. Wider adapters (RND-2) actually hurt, suggesting the default diagonal scale+shift is already well-matched to random orthogonal projections. The progressive unfreezing strategy (RND-4) shows no benefit at 200 steps. The gap from 2.51 to the trained baseline (~1.57 at 200 steps) represents the value of learning actual projection directions, not just scales.
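A minimal sketch of the frozen-random-projection setup described above, assuming the adapter is a fixed random orthogonal matrix with only a trainable per-channel scale and shift (the class and names are mine for illustration, not PR #1195's actual code):

```python
import torch
import torch.nn as nn

class RandomOrthogonalAdapter(nn.Module):
    """Frozen random orthogonal projection; only a diagonal
    scale+shift (2*dim params) is trained."""
    def __init__(self, dim: int):
        super().__init__()
        # QR of a Gaussian matrix gives a random orthogonal matrix.
        q, _ = torch.linalg.qr(torch.randn(dim, dim))
        self.register_buffer("proj", q)          # frozen: a buffer, not a Parameter
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return x @ self.proj * self.scale + self.shift
```

Registering the projection as a buffer (rather than a `Parameter`) is what keeps it out of the optimizer, so all the learning capacity sits in the diagonal terms.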
4. JEPA (PR #1196)
Finding: JEPA weight has zero effect. Whether the JEPA auxiliary loss is 10%, 30%, or 50% of the total, val_bpb is identical to 4 decimal places. The JEPA predictor learns something (its loss decreases during training), but that knowledge does not transfer to the AR objective within 200 steps. This could change with longer training or with JEPA as a pre-training stage rather than a concurrent auxiliary.
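The weight ablation above amounts to sweeping the mixing coefficient in a convex combination of the two losses. A trivial sketch, assuming this is how the total loss is formed (the function is mine, not the PR's training code):

```python
def combined_loss(ar_loss: float, jepa_loss: float, w: float) -> float:
    """Total = (1 - w) * autoregressive loss + w * JEPA auxiliary loss.
    The ablation swept w over 0.1, 0.3, 0.5 with no effect on val_bpb."""
    return (1.0 - w) * ar_loss + w * jepa_loss
```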
5. Mamba SSM Hybrid (PR #1197) - still running
Note: Pure PyTorch SSM is extremely slow without custom CUDA kernels (35s per step vs ~700ms for attention). The BPB at step 100 (2.2066) is actually competitive with JEPA, suggesting SSMs have real potential if the speed issue is solved with Triton/CUDA selective scan kernels.
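The 35s/step cost comes from the sequential recurrence: without a fused kernel, the scan is a Python-level loop over time steps. A reference sketch of a diagonal-SSM scan (hypothetical shapes and names, not the PR's implementation) shows the bottleneck:

```python
import torch

def selective_scan_reference(x, A, B, C):
    """Naive scan: h[t] = A * h[t-1] + B * x[t],  y[t] = sum_n C * h[t].
    Shapes: x (T, d); A, B, C (d, n) for a diagonal state update.
    The per-timestep Python loop is what Triton/CUDA selective-scan
    kernels eliminate by fusing the recurrence on-device."""
    T, d = x.shape
    n = A.shape[1]
    h = torch.zeros(d, n)
    ys = []
    for t in range(T):                       # strictly sequential over time
        h = A * h + B * x[t].unsqueeze(-1)   # elementwise state update
        ys.append((h * C).sum(-1))           # readout over the state dim
    return torch.stack(ys)
```

Each of the T iterations launches several small ops and synchronizes with Python, which is why per-step time balloons from ~700ms to tens of seconds at realistic sequence lengths.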
Still Running (9 more runs)
Will update this thread when they finish.
Early Conclusions
Depth recurrence plateaus fast. 6 iterations match 24 at a fraction of the cost. Mini recurrence (2-3 layers), as in the record PR (ParallelResiduals + MiniDepthRecurrence: 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA #1204), is the right approach.
Diffusion is incompatible with causal eval. The knowledge from bidirectional masked prediction does not help left-to-right scoring. This is a fundamental mismatch, not a tuning problem.
Random projections are surprisingly capable. Diagonal adapters on frozen random orthogonal matrices reach 2.51 BPB. That is far better than chance and suggests the transformer architecture itself (attention patterns, residual connections, normalization) provides most of the inductive bias. The learned weights add ~1 BPB of improvement on top.
JEPA is neutral as an auxiliary loss. Neither helps nor hurts at any weight. May need a curriculum (pre-train with JEPA, then switch to AR) rather than concurrent training.
SSMs are promising but bottlenecked by implementation. The 2.21 BPB at step 100 (with barely any training due to 35s/step) hints that a fast SSM implementation could be competitive. Someone with Triton skills could make this work.
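For readers comparing the nats and BPB figures quoted in these conclusions: the standard conversion divides the mean cross-entropy by ln 2 and by the bytes-per-token ratio of the tokenizer/corpus. A sketch (the function name is mine, and the exact bytes-per-token constant is dataset-specific, so treat it as an assumption):

```python
import math

def nats_per_token_to_bpb(loss_nats: float, bytes_per_token: float) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte."""
    bits_per_token = loss_nats / math.log(2)
    return bits_per_token / bytes_per_token
```

If this is the conversion in use, the quoted pair (1.8679 nats, 1.1063 BPB) implies roughly 2.44 bytes per token.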
Hardware Note
All runs on a DGX Spark GB10 (single GPU, 128GB unified memory, ARM architecture). No torch.compile (Triton/inductor unsupported on aarch64), no flash attention (SDPA fallback). This hardware is ~6x slower than 8xH100 per step, so absolute BPB numbers are higher than competition runs. The relative rankings between configurations are what matter here.
Full logs available on request. PRs: #1191, #1192, #1193, #1194, #1195, #1196, #1197.
Corrections and feedback welcome, especially if anyone has tried these architectures with different configurations.