
Overfit on benchmarks? #119

@Kamizm

Description


Hi!

First, thank you for releasing the model and benchmark scripts.

I have been testing DFlash for Qwen3-8B with the SGLang backend. On the benchmark datasets provided in the repo, namely GSM8K and MATH500, I observe a significant speedup compared to EAGLE-3: roughly 1.7x higher throughput on GSM8K and 2.4x on MATH500.
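For context, this is how I read the numbers: each run sends the 128 prompts one at a time (concurrency 1) against a running SGLang server, and throughput is total generated tokens divided by wall-clock latency. The actual results below come from the repo's benchmark script; the sketch here is just to show the measurement, and the server URL, port, model name and `max_tokens` are illustrative assumptions on my part:

```python
# Minimal timing sketch (concurrency = 1, so requests are sequential).
# URL, port, model name and max_tokens are assumptions for illustration;
# the repo's benchmark script produced the numbers reported below.
import time
import requests

def time_run(prompts, url="http://localhost:30000/v1/chat/completions",
             model="Qwen/Qwen3-8B", max_tokens=1024):
    output_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        resp = requests.post(url, json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        }).json()
        output_tokens += resp["usage"]["completion_tokens"]
    latency = time.perf_counter() - start
    # Throughput as reported below: generated tokens / wall-clock time
    return latency, output_tokens, output_tokens / latency
```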

DFlash:

| Dataset | Backend | Num prompts | Concurrency | Latency (s) | Output tokens | Throughput (tok/s) | Accept length | Spec verify ct |
|---|---|---|---|---|---|---|---|---|
| gsm8k | sglang | 128 | 1 | 84.8 | 39253 | 462.95 | 6.360 | 6428 |
| math500 | sglang | 128 | 1 | 156.8 | 94809 | 604.57 | 8.049 | 12077 |

EAGLE-3:

| Dataset | Backend | Num prompts | Concurrency | Latency (s) | Output tokens | Throughput (tok/s) | Accept length | Spec verify ct |
|---|---|---|---|---|---|---|---|---|
| gsm8k | sglang | 128 | 1 | 146.2 | 40818 | 279.11 | 4.817 | 8483 |
| math500 | sglang | 128 | 1 | 368.7 | 92960 | 252.10 | 4.392 | 21743 |

But I then added two additional datasets from Hugging Face, Alpaca and IFEval, using the same setup. On these datasets, I no longer observe the same acceleration.
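For reproducibility, this is roughly how I build the prompt lists for the two extra datasets. The Hugging Face dataset IDs (`tatsu-lab/alpaca`, `google/IFEval`) and field names are my own assumptions, so let me know if you would expect different sources:

```python
# Rough sketch of how the Alpaca / IFEval prompts are assembled.
# Dataset IDs and field names are assumptions; swap in whatever the
# repo's benchmark script expects.
from datasets import load_dataset

def load_prompts(name: str, n: int = 128) -> list[str]:
    if name == "alpaca":
        ds = load_dataset("tatsu-lab/alpaca", split="train")
        prompts = [
            row["instruction"] + ("\n" + row["input"] if row["input"] else "")
            for row in ds
        ]
    elif name == "ifeval":
        ds = load_dataset("google/IFEval", split="train")
        prompts = [row["prompt"] for row in ds]
    else:
        raise ValueError(f"unknown dataset: {name}")
    return prompts[:n]
```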

DFlash:

| Dataset | Backend | Num prompts | Concurrency | Latency (s) | Output tokens | Throughput (tok/s) | Accept length | Spec verify ct |
|---|---|---|---|---|---|---|---|---|
| alpaca | sglang | 128 | 1 | 183.5 | 41673 | 227.09 | 3.621 | 14249 |
| ifeval | sglang | 128 | 1 | 260.5 | 46532 | 178.60 | 2.288 | 20334 |

EAGLE-3:

| Dataset | Backend | Num prompts | Concurrency | Latency (s) | Output tokens | Throughput (tok/s) | Accept length | Spec verify ct |
|---|---|---|---|---|---|---|---|---|
| alpaca | sglang | 128 | 1 | 167.2 | 41493 | 248.21 | 4.271 | 9778 |
| ifeval | sglang | 128 | 1 | 264.7 | 46298 | 174.92 | 3.245 | 15527 |

On GSM8K and MATH500, DFlash is clearly faster than EAGLE-3 and has a higher accept length. However, on Alpaca and IFEval, the advantage disappears (see the per-dataset speedup check after the bullets):

  • On Alpaca, EAGLE-3 is slightly faster and has a higher accept length.
  • On IFEval, throughput is similar, but DFlash has a lower accept length and a higher speculative verification count.
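
To make the gap concrete, here is a quick arithmetic check on the numbers above (throughput is output tokens divided by latency, and the speedup is the ratio of DFlash to EAGLE-3 throughput per dataset):

```python
# Per-dataset DFlash-over-EAGLE-3 speedup, recomputed from (latency, output tokens).
runs = {
    ("dflash", "gsm8k"):   (84.8, 39253),
    ("dflash", "math500"): (156.8, 94809),
    ("dflash", "alpaca"):  (183.5, 41673),
    ("dflash", "ifeval"):  (260.5, 46532),
    ("eagle3", "gsm8k"):   (146.2, 40818),
    ("eagle3", "math500"): (368.7, 92960),
    ("eagle3", "alpaca"):  (167.2, 41493),
    ("eagle3", "ifeval"):  (264.7, 46298),
}
for ds in ["gsm8k", "math500", "alpaca", "ifeval"]:
    d_lat, d_tok = runs[("dflash", ds)]
    e_lat, e_tok = runs[("eagle3", ds)]
    d_tp, e_tp = d_tok / d_lat, e_tok / e_lat
    print(f"{ds}: DFlash {d_tp:.1f} tok/s vs EAGLE-3 {e_tp:.1f} tok/s -> {d_tp / e_tp:.2f}x")
# gsm8k:   ~1.66x
# math500: ~2.40x
# alpaca:  ~0.91x
# ifeval:  ~1.02x
```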

This makes me wonder whether the current DFlash model is particularly optimized for math/reasoning-style datasets, or whether there may be some dataset-specific effect in the benchmark setup.

Questions

  1. Is the released DFlash model expected to generalize to instruction-following datasets such as Alpaca and IFEval?
  2. Were GSM8K and MATH500, or similar math/reasoning datasets, used during training or tuning of the draft/speculative model?
  3. Do you have recommended settings for non-math instruction-following workloads?
  4. Are there any additional benchmarks on more general instruction datasets that you would recommend comparing against?
