Conversation

@Adefioye
Collaborator

Experimental Results for MoE vs Dense

MLA Variants

| Config | Perplexity | SST Accuracy | Training Time |
|--------|------------|--------------|---------------|
| MLA (dense baseline) | 34.90 | 85.44 | 1h 42m |
| n=8, k=2, moe=128 | 35.09 | 86.58 | 3h 26m |
| n=4, k=2, moe=256 | 33.94 | 86.12 | 2h 15m |

MLA-O Variants

| Config | Perplexity | SST Accuracy | Training Time |
|--------|------------|--------------|---------------|
| MLA-O (dense baseline) | 34.98 | 86.47 | 1h 41m |
| n=4, k=2, moe=256 | 36.16 | 84.52 | 2h 18m |

NOTE: n = number of routed experts, k = number of routed experts activated per token, and moe = MoE intermediate size. The goal was to find combinations that do two things: balance the parameter budget, and balance the combined parameter and compute budget.
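
As a concrete reference for the n=4, k=2, moe=256 setting, here is a minimal sketch of how those three knobs might appear in a DeepSeekV3-style config. The field names are an assumption borrowed from Hugging Face's DeepseekV3Config naming, not necessarily the exact schema used by the configs in this repo:

```
// Illustrative only: field names follow Hugging Face's DeepseekV3Config
// convention and may not match this repo's config schema.
{
  "n_routed_experts": 4,          // n: number of routed experts
  "num_experts_per_tok": 2,       // k: routed experts activated per token
  "moe_intermediate_size": 256    // moe: intermediate size of each expert's FFN
}
```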

Insights:

  • For MLA, using MoE on a single GPU increased training time. However, the n=4, k=2, moe=256 config reduced perplexity and increased SST accuracy.
  • For MLA-O, training time also increased consistently, and unlike the MLA case, the n=4, k=2, moe=256 config increased perplexity and decreased SST accuracy.

My thoughts:

  • For MLA-O, where MoE hurt both perplexity and SST accuracy, there is a possibility that the result would flip to be consistent with the MLA results if the experiments were run on multiple GPUs, since the literature suggests MoEs do well in multi-GPU settings.

@chrisjmccormick
Owner

chrisjmccormick commented Oct 3, 2025

@Adefioye: Hey Koko! For the config files, can you update the filenames and the "shorthand" and "notes" fields to reflect the experiment?

For the shorthand (see the example below), there's a term "mlp.1024", which refers to dense FFNs with an intermediate size of 1024.

"shorthand": "seqlen.128 - mla-on.96.64.96 - mlp.1024 - model.256.lyr.6 - ah.8.32",

For yours, you could change this to something like moe.4.256 to represent 4 experts with intermediate size 256?
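
For example, taking the shorthand above and swapping mlp.1024 for the suggested moe term, the n=4, moe=256 run might look like this (illustrative only, not a committed config):

```
"shorthand": "seqlen.128 - mla-on.96.64.96 - moe.4.256 - model.256.lyr.6 - ah.8.32",
```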

You can make a similar tweak to the filenames, and the output directories as well.

@chrisjmccormick
Owner

@Adefioye - Since we moved this DeepSeekV3-based implementation into its own directory, maybe what makes the most sense here is to add your experiment results to the deepseek/README.md? We could add your config files there as well.
