Conversation

@Adefioye
Collaborator

Experimental Results for MoE vs Dense

MLA Variants

| Config | Perplexity | SST Accuracy | Training Time |
|--------|------------|--------------|---------------|
| MLA (dense baseline) | 34.90 | 85.44 | 1h 42m |
| n=8, k=2, moe=128 | 35.09 | 86.58 | 3h 26m |
| n=4, k=2, moe=256 | 33.94 | 86.12 | 2h 15m |

MLA-O Variants

| Config | Perplexity | SST Accuracy | Training Time |
|--------|------------|--------------|---------------|
| MLA-O (dense baseline) | 34.98 | 86.47 | 1h 41m |
| n=4, k=2, moe=256 | 36.16 | 84.52 | 2h 18m |

NOTE: n = number of routed experts, k = number of routed experts activated per token, and moe = MoE intermediate size. The goal was to find combinations that do two things: balance the parameter budget, and balance the combined parameter and compute budget.
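
As a concrete reference for the n=4, k=2, moe=256 setting, here is a minimal sketch of how those three knobs might appear in a DeepSeekV3-style config. The field names are an assumption borrowed from Hugging Face's DeepseekV3Config naming, not necessarily the exact schema used by the configs in this repo:

```
// Illustrative only: field names follow Hugging Face's DeepseekV3Config
// convention and may not match this repo's config schema.
{
  "n_routed_experts": 4,          // n: number of routed experts
  "num_experts_per_tok": 2,       // k: routed experts activated per token
  "moe_intermediate_size": 256    // moe: intermediate size of each expert's FFN
}
```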

Insights:

  • For MLA, using MoE on a single GPU increased training time. However, the n=4, k=2, moe=256 config reduced perplexity and increased SST accuracy.
  • For MLA-O, training time also increased consistently, and unlike the MLA case, the n=4, k=2, moe=256 config increased perplexity and decreased SST accuracy.

My thoughts:

  • For MLA-O, where MoE hurt both perplexity and SST accuracy, there is a possibility that the result would flip to be consistent with the MLA results if the experiments were run on multiple GPUs, since the literature suggests MoEs do well in multi-GPU settings.

@chrisjmccormick
Owner

chrisjmccormick commented Oct 3, 2025

@Adefioye: Hey Koko! For the config files, can you update the filenames and the "shorthand" and "notes" fields to reflect the experiment?

For the shorthand (see the example below), there's a term "mlp.1024", which refers to dense FFNs with an intermediate size of 1024.

"shorthand": "seqlen.128 - mla-on.96.64.96 - mlp.1024 - model.256.lyr.6 - ah.8.32",

For yours, you could change this to something like moe.4.256 to represent 4 experts with intermediate size 256?
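
For example, taking the shorthand above and swapping mlp.1024 for the suggested moe term, the n=4, moe=256 run might look like this (illustrative only, not a committed config):

```
"shorthand": "seqlen.128 - mla-on.96.64.96 - moe.4.256 - model.256.lyr.6 - ah.8.32",
```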

You can make a similar tweak to the filenames, and the output directories as well.

@chrisjmccormick
Owner

@Adefioye - Since we moved this DeepSeekV3-based implementation into its own directory, maybe what makes the most sense here is to add your experiment results to the deepseek/README.md? We could add your config files there as well.
