Hi!
First, thank you for releasing the model and benchmark scripts.
I have been testing DFlash with Qwen3-8B on the SGLang backend. On the benchmark datasets provided in the repo, GSM8K and MATH500, I observe a significant speedup over EAGLE-3: roughly 1.7x on GSM8K and 2.4x on MATH500 in throughput.
All runs below use the SGLang backend with 128 prompts at concurrency 1.

| Model | Dataset | Latency (s) | Output tokens | Throughput (tok/s) | Accept length | Spec verify ct |
|---|---|---|---|---|---|---|
| DFlash | gsm8k | 84.8 | 39253 | 462.95 | 6.360 | 6428 |
| DFlash | math500 | 156.8 | 94809 | 604.57 | 8.049 | 12077 |
| EAGLE-3 | gsm8k | 146.2 | 40818 | 279.11 | 4.817 | 8483 |
| EAGLE-3 | math500 | 368.7 | 92960 | 252.10 | 4.392 | 21743 |
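For concreteness, the speedups implied by these numbers:

```python
# Throughput speedup of DFlash over EAGLE-3, from the table above.
print(f"gsm8k:   {462.95 / 279.11:.2f}x")  # 1.66x
print(f"math500: {604.57 / 252.10:.2f}x")  # 2.40x
```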
I then added two more datasets from Hugging Face, Alpaca and IFEval, using the same setup. On these datasets I no longer observe the same acceleration.
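For reproducibility, this is roughly how the prompts can be loaded; the hub IDs `tatsu-lab/alpaca` and `google/IFEval` and their field names are assumptions based on the standard Hub versions, so adjust if your copies differ:

```python
from datasets import load_dataset

# Alpaca: the instruction plus the optional input field forms the prompt.
alpaca = load_dataset("tatsu-lab/alpaca", split="train").select(range(128))
alpaca_prompts = [
    ex["instruction"] + ("\n" + ex["input"] if ex["input"] else "")
    for ex in alpaca
]

# IFEval: the "prompt" field is used as-is.
ifeval = load_dataset("google/IFEval", split="train").select(range(128))
ifeval_prompts = [ex["prompt"] for ex in ifeval]
```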
| Model | Dataset | Latency (s) | Output tokens | Throughput (tok/s) | Accept length | Spec verify ct |
|---|---|---|---|---|---|---|
| DFlash | alpaca | 183.5 | 41673 | 227.09 | 3.621 | 14249 |
| DFlash | ifeval | 260.5 | 46532 | 178.60 | 2.288 | 20334 |
| EAGLE-3 | alpaca | 167.2 | 41493 | 248.21 | 4.271 | 9778 |
| EAGLE-3 | ifeval | 264.7 | 46298 | 174.92 | 3.245 | 15527 |
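The corresponding throughput ratios for these runs:

```python
# Throughput ratio of DFlash over EAGLE-3, from the table above.
print(f"alpaca: {227.09 / 248.21:.2f}x")  # 0.91x (DFlash slightly slower)
print(f"ifeval: {178.60 / 174.92:.2f}x")  # 1.02x (roughly parity)
```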
On GSM8K and MATH500, DFlash is clearly faster than EAGLE-3 and has a higher accept length. However, on Alpaca and IFEval, the advantage disappears:
- On Alpaca, EAGLE-3 is slightly faster and has a higher accept length.
- On IFEval, throughput is similar, but DFlash has a lower accept length and a higher speculative verification count.
This makes me wonder whether the current DFlash model is particularly optimized for math/reasoning-style datasets, or whether there may be some dataset-specific effect in the benchmark setup.
Questions:
- Is the released DFlash model expected to generalize to instruction-following datasets such as Alpaca and IFEval?
- Were GSM8K and MATH500, or similar math/reasoning datasets, used during training or tuning of the draft/speculative model?
- Do you have recommended settings for non-math instruction-following workloads?
- Are there any additional benchmarks on more general instruction datasets that you would recommend comparing against?