
Overfit on benchmarks? #119

@Kamizm

Description


Hi!

First, thank you for releasing the model and benchmark scripts.

I have been testing DFlash for Qwen3-8B with the SGLang backend. On the benchmark datasets provided in the repo, namely GSM8K and MATH500, I observe a significant speedup compared to EAGLE-3: roughly 1.7x higher throughput on GSM8K and 2.4x on MATH500.
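For context, this is how I read the numbers: each run sends the 128 prompts one at a time (concurrency 1) against a running SGLang server, and throughput is total generated tokens divided by wall-clock latency. The actual results below come from the repo's benchmark script; the sketch here is just to show the measurement, and the server URL, port, model name and `max_tokens` are illustrative assumptions on my part:

```python
# Minimal timing sketch (concurrency = 1, so requests are sequential).
# URL, port, model name and max_tokens are assumptions for illustration;
# the repo's benchmark script produced the numbers reported below.
import time
import requests

def time_run(prompts, url="http://localhost:30000/v1/chat/completions",
             model="Qwen/Qwen3-8B", max_tokens=1024):
    output_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        resp = requests.post(url, json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        }).json()
        output_tokens += resp["usage"]["completion_tokens"]
    latency = time.perf_counter() - start
    # Throughput as reported below: generated tokens / wall-clock time
    return latency, output_tokens, output_tokens / latency
```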

DFlash:

| Dataset | Backend | Num prompts | Concurrency | Latency (s) | Output tokens | Throughput (tok/s) | Accept length | Spec verify ct |
|---|---|---|---|---|---|---|---|---|
| gsm8k | sglang | 128 | 1 | 84.8 | 39253 | 462.95 | 6.360 | 6428 |
| math500 | sglang | 128 | 1 | 156.8 | 94809 | 604.57 | 8.049 | 12077 |

EAGLE-3:

| Dataset | Backend | Num prompts | Concurrency | Latency (s) | Output tokens | Throughput (tok/s) | Accept length | Spec verify ct |
|---|---|---|---|---|---|---|---|---|
| gsm8k | sglang | 128 | 1 | 146.2 | 40818 | 279.11 | 4.817 | 8483 |
| math500 | sglang | 128 | 1 | 368.7 | 92960 | 252.10 | 4.392 | 21743 |

But I then added two additional datasets from Hugging Face, Alpaca and IFEval, using the same setup. On these datasets, I no longer observe the same acceleration.
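For reproducibility, this is roughly how I build the prompt lists for the two extra datasets. The Hugging Face dataset IDs (`tatsu-lab/alpaca`, `google/IFEval`) and field names are my own assumptions, so let me know if you would expect different sources:

```python
# Rough sketch of how the Alpaca / IFEval prompts are assembled.
# Dataset IDs and field names are assumptions; swap in whatever the
# repo's benchmark script expects.
from datasets import load_dataset

def load_prompts(name: str, n: int = 128) -> list[str]:
    if name == "alpaca":
        ds = load_dataset("tatsu-lab/alpaca", split="train")
        prompts = [
            row["instruction"] + ("\n" + row["input"] if row["input"] else "")
            for row in ds
        ]
    elif name == "ifeval":
        ds = load_dataset("google/IFEval", split="train")
        prompts = [row["prompt"] for row in ds]
    else:
        raise ValueError(f"unknown dataset: {name}")
    return prompts[:n]
```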

DFlash:

| Dataset | Backend | Num prompts | Concurrency | Latency (s) | Output tokens | Throughput (tok/s) | Accept length | Spec verify ct |
|---|---|---|---|---|---|---|---|---|
| alpaca | sglang | 128 | 1 | 183.5 | 41673 | 227.09 | 3.621 | 14249 |
| ifeval | sglang | 128 | 1 | 260.5 | 46532 | 178.60 | 2.288 | 20334 |

EAGLE-3:

| Dataset | Backend | Num prompts | Concurrency | Latency (s) | Output tokens | Throughput (tok/s) | Accept length | Spec verify ct |
|---|---|---|---|---|---|---|---|---|
| alpaca | sglang | 128 | 1 | 167.2 | 41493 | 248.21 | 4.271 | 9778 |
| ifeval | sglang | 128 | 1 | 264.7 | 46298 | 174.92 | 3.245 | 15527 |

On GSM8K and MATH500, DFlash is clearly faster than EAGLE-3 and has a higher accept length. However, on Alpaca and IFEval, the advantage disappears (see the per-dataset speedup check after the bullets):

  • On Alpaca, EAGLE-3 is slightly faster and has a higher accept length.
  • On IFEval, throughput is similar, but DFlash has a lower accept length and a higher speculative verification count.
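
To make the gap concrete, here is a quick arithmetic check on the numbers above (throughput is output tokens divided by latency, and the speedup is the ratio of DFlash to EAGLE-3 throughput per dataset):

```python
# Per-dataset DFlash-over-EAGLE-3 speedup, recomputed from (latency, output tokens).
runs = {
    ("dflash", "gsm8k"):   (84.8, 39253),
    ("dflash", "math500"): (156.8, 94809),
    ("dflash", "alpaca"):  (183.5, 41673),
    ("dflash", "ifeval"):  (260.5, 46532),
    ("eagle3", "gsm8k"):   (146.2, 40818),
    ("eagle3", "math500"): (368.7, 92960),
    ("eagle3", "alpaca"):  (167.2, 41493),
    ("eagle3", "ifeval"):  (264.7, 46298),
}
for ds in ["gsm8k", "math500", "alpaca", "ifeval"]:
    d_lat, d_tok = runs[("dflash", ds)]
    e_lat, e_tok = runs[("eagle3", ds)]
    d_tp, e_tp = d_tok / d_lat, e_tok / e_lat
    print(f"{ds}: DFlash {d_tp:.1f} tok/s vs EAGLE-3 {e_tp:.1f} tok/s -> {d_tp / e_tp:.2f}x")
# gsm8k:   ~1.66x
# math500: ~2.40x
# alpaca:  ~0.91x
# ifeval:  ~1.02x
```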

This makes me wonder whether the current DFlash model is particularly optimized for math/reasoning-style datasets, or whether there may be some dataset-specific effect in the benchmark setup.

Questions

  1. Is the released DFlash model expected to generalize to instruction-following datasets such as Alpaca and IFEval?
  2. Were GSM8K and MATH500, or similar math/reasoning datasets, used during training or tuning of the draft/speculative model?
  3. Do you have recommended settings for non-math instruction-following workloads?
  4. Are there any additional benchmarks on more general instruction datasets that you would recommend comparing against?
