
feat: support post-norm EAGLE + add speculative decoding docs #174

Open
Dogacel wants to merge 1 commit into lightseekorg:main from Dogacel:post-norm-eagle

Dogacel commented May 17, 2026

Summary

  1. Add documentation with recipes for running and benchmarking speculative decoding (tested so far with Llama 3.1 8B and gpt-oss 20B).
  2. Add the MT-Bench dataset (mtbench) to the bench script and report the mean acceptance length.
  3. Support post-norm speculative decoding draft models, as published with the Attention Drift paper.
  4. Warn the user if any model weight is not loaded properly for EAGLE-3 speculative decoding models.
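
The weight-loading warning in item 4 can be pictured as a comparison between the parameter names the draft model expects and the tensor names the checkpoint provides. The following is only a minimal sketch of that idea, not the code in this PR: the function names, the logger name, and the example parameter names are all hypothetical.

```python
import logging

logger = logging.getLogger("eagle3_loader")  # hypothetical logger name


def find_unloaded_weights(expected_names, checkpoint_names):
    """Return (missing, unexpected) parameter names, each sorted."""
    expected = set(expected_names)
    provided = set(checkpoint_names)
    missing = sorted(expected - provided)     # model params the checkpoint never filled
    unexpected = sorted(provided - expected)  # checkpoint tensors the model did not use
    return missing, unexpected


def warn_on_unloaded_weights(expected_names, checkpoint_names):
    """Log one warning per category instead of failing silently."""
    missing, unexpected = find_unloaded_weights(expected_names, checkpoint_names)
    if missing:
        logger.warning("Draft model parameters left uninitialized: %s", missing)
    if unexpected:
        logger.warning("Checkpoint tensors not used by the draft model: %s", unexpected)
    return missing, unexpected
```

A mismatch such as a checkpoint that ships `hidden_norm.weight` while the model expects `norm.weight` (both names invented for illustration) would show up in both lists, which is the kind of silent failure a warning is meant to surface.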

Test Plan

  1. Test standard speculative decoding with one of the existing checkpoints (lmsys's Llama 3.1 8B EAGLE-3 draft model was used).
  2. Test speculative decoding with one of the post-norm checkpoints (one from the SpecDrift collection was used).
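
For readers unfamiliar with the term, the sketch below shows the generic difference between pre-norm and post-norm residual blocks as the terms are commonly used for transformers. It is an illustration under that assumption only: the exact layer layout of the post-norm draft models tested here is not described in this PR text, and the helper names are hypothetical.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale: x / sqrt(mean(x^2) + eps)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]


def pre_norm_block(x, sublayer):
    """Pre-norm: normalize the sublayer input; the residual stream stays unnormalized."""
    return [a + b for a, b in zip(x, sublayer(rms_norm(x)))]


def post_norm_block(x, sublayer):
    """Post-norm: add the residual first, then normalize the block output."""
    return rms_norm([a + b for a, b in zip(x, sublayer(x))])


# Toy sublayer standing in for attention or an MLP (assumption for illustration).
double = lambda h: [2.0 * v for v in h]
x = [3.0, 4.0]
pre = pre_norm_block(x, double)
post = post_norm_block(x, double)
```

In this toy setup the post-norm output always has a root-mean-square close to 1, while the pre-norm output keeps growing with the residual stream.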

Commands (2 terminals):

Terminal 1, running the model (replace the draft model with lmsys/sglang-EAGLE3-LLaMA3.1-Instruct-8B to test the standard architecture):

tokenspeed serve nreHieW/Llama-3.1-8B-Instruct --speculative-algorithm EAGLE3 --speculative-draft-model-path Dogacel/specdrift-llama3-8b-eagle3-post-norm --speculative-num-steps 7 --host 0.0.0.0 --dtype bfloat16 --kvstore-size 16 --port 8999

Terminal 2, running the benchmark:

tokenspeed bench serve --backend openai-chat --endpoint /v1/chat/completions --host 127.0.0.1 --port 8999 --dataset-name mtbench --input-len 2048 --output-len 2048 --num-prompts 80 --max-concurrency 32 --save-result --save-detailed --result-dir results/ --extra-body '{"temperature": 0}'

Results:

============ Serving Benchmark Result ============
Successful requests:                     80
Failed requests:                         0
Maximum request concurrency:             32
Benchmark duration (s):                  17.74
Total input tokens:                      6053
Total generated tokens:                  34407
Request throughput (req/s):              4.51
Output token throughput (tok/s):         1939.40
Peak output token throughput (tok/s):    962.00
Peak concurrent requests:                38.00
Total token throughput (tok/s):          2280.58
Mean accept length (tok/step):           3.61
---------------Time to First Token----------------
Mean TTFT (ms):                          399.01
Median TTFT (ms):                        258.22
P99 TTFT (ms):                           788.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.76
Median TPOT (ms):                        9.70
P99 TPOT (ms):                           18.83
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.50
Median ITL (ms):                         25.36
P99 ITL (ms):                            153.76
==================================================
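
As a quick sanity check on the table above, the aggregate throughput figures can be recomputed from the token counts and the benchmark duration. This sketch assumes each throughput is a plain count divided by the duration; that is an assumption about how the bench script derives them, and `derived_throughput` is a hypothetical helper.

```python
def derived_throughput(num_requests, input_tokens, output_tokens, duration_s):
    """Recompute aggregate throughput, assuming each is count / duration."""
    return {
        "request_throughput": num_requests / duration_s,
        "output_token_throughput": output_tokens / duration_s,
        "total_token_throughput": (input_tokens + output_tokens) / duration_s,
    }


# Values copied from the result table above.
metrics = derived_throughput(
    num_requests=80,
    input_tokens=6053,
    output_tokens=34407,
    duration_s=17.74,
)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The recomputed values agree with the reported ones to within about 0.2 tok/s; the small gap is consistent with the duration being rounded to two decimals in the report.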

Dogacel requested a review from a team as a code owner May 17, 2026 19:39
Dogacel force-pushed the post-norm-eagle branch from 2177e75 to 0327176 May 17, 2026 19:41