Description
Investigate the relative impact of Query latent space vs Output latent space decomposition.
Decomposing the query space is an already-established technique in DeepSeek-V3.
Let's say we find that decomposing the Query space:
- Doesn't improve throughput.
- Hurts performance on the benchmark task.
(We might expect to see this given that DeepSeek-V2-Lite doesn't use a query latent space!)
If that's the case, it would be helpful context for our results.
In particular, it would be interesting to compare the impact of the two decompositions. Does one seem more beneficial than the other?
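For context on what "decomposing" a projection changes, here is a rough sketch of the weight counts involved. It assumes the decomposition is a plain low-rank factorization (a down-projection into the latent space followed by an up-projection), and every size below (`d_model`, `n_heads`, `d_head`, `q_rank`, `o_rank`) is a hypothetical value chosen for illustration, not a setting from our models. It also ignores details such as separate RoPE dimensions and latent-space norms.

```python
from typing import Optional


def projection_params(d_in: int, d_out: int, rank: Optional[int] = None) -> int:
    """Weight count for a dense projection, or for its low-rank factorization.

    With rank=None this is a single d_in x d_out matrix. Otherwise it is a
    d_in x rank down-projection followed by a rank x d_out up-projection.
    """
    if rank is None:
        return d_in * d_out
    return d_in * rank + rank * d_out


# Hypothetical sizes, for illustration only.
d_model = 512
n_heads, d_head = 8, 64
q_rank = 128  # assumed Query latent size
o_rank = 128  # assumed Output latent size

attn_width = n_heads * d_head

# Query projection maps d_model -> n_heads * d_head.
dense_q = projection_params(d_model, attn_width)
latent_q = projection_params(d_model, attn_width, q_rank)

# Output projection maps n_heads * d_head -> d_model.
dense_o = projection_params(attn_width, d_model)
latent_o = projection_params(attn_width, d_model, o_rank)

print(f"Query:  dense={dense_q}, latent={latent_q}")
print(f"Output: dense={dense_o}, latent={latent_o}")
```

Fewer weights do not by themselves imply higher throughput, since the factorized form replaces one matrix multiply with two. That is part of why the comparison needs to be measured rather than assumed.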
Approach
- We could evaluate this in the Encoder model, the Decoder, or both.
- There are 8 possible variations, and we've already done 3.
  - No decompositions (standard MHA)
  - KV only
  - Query only
  - Output only
  - Query and KV (standard MLA)
  - Query and Output
  - KV and Output
  - Query, KV, and Output (MLA-o)

Tasks:
- [ ] Some variants to compare (we already have some of these):
  - I think that comparing "Query only" and "Output only" and standard MHA might be the most direct comparison.
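
The eight variations listed under Approach are every on/off combination of the three decompositions (Query, KV, Output). A minimal sketch of enumerating them, for example to drive a sweep script, could look like the following. The function names and the tuple representation are assumptions made for illustration and do not refer to existing code in the repo. The labels "standard MHA", "standard MLA", and "MLA-o" are taken from the list above.

```python
from itertools import product

COMPONENTS = ("Query", "KV", "Output")

# Names used in this issue for specific combinations.
KNOWN_NAMES = {
    (): "standard MHA",
    ("Query", "KV"): "standard MLA",
    ("Query", "KV", "Output"): "MLA-o",
}


def enumerate_variants():
    """Return all 2**3 subsets of COMPONENTS, fewest decompositions first."""
    variants = []
    for flags in product((False, True), repeat=len(COMPONENTS)):
        decomposed = tuple(name for name, on in zip(COMPONENTS, flags) if on)
        variants.append(decomposed)
    return sorted(variants, key=len)


def label(decomposed):
    """Human-readable label for one variant (hypothetical helper)."""
    if decomposed in KNOWN_NAMES:
        return KNOWN_NAMES[decomposed]
    if len(decomposed) == 1:
        return f"{decomposed[0]} only"
    return " and ".join(decomposed)


variants = enumerate_variants()
for v in variants:
    print(label(v))
```

The most direct comparison suggested above would then be the three variants `()`, `("Query",)`, and `("Output",)`.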