[vLLM-ATOM] Support Eagle 3.1 spec decoding#1201

Open
kliuae wants to merge 11 commits into ROCm:main from kliuae:kliuae/plugin_enable_eagle31_merge

Conversation

kliuae commented Jun 12, 2026

Contributor

Motivation

This PR adds EAGLE 3/3.1 speculative decoding support to vLLM-ATOM and enables it for Kimi-K2.6.
Supported configurations are an MLA target model paired with either an MLA or an MHA draft model. When an MLA target is paired with an MHA draft, the two use different per-token KV cache layouts, so separate KV cache pools are built and vLLM does not attempt to unify their page sizes.
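To illustrate why an MLA target and an MHA draft end up with different page sizes, here is a minimal sketch that computes per-page KV cache bytes for each layout and groups layers by page size. This is not the PR's implementation. The MLA shape (kv_lora_rank=512, qk_rope_head_dim=64) follows the DeepSeek-V3-style layout that Kimi-K2 models are commonly reported to use; the MHA draft shape (8 KV heads, head_dim 128) and the 16-token block size are hypothetical placeholders, and `KVSpec`, `mla_spec`, `mha_spec`, and `group_into_pools` are illustrative names only.

```python
from dataclasses import dataclass

# Assumptions for illustration only (not read from the PR):
BLOCK_SIZE = 16   # tokens per KV cache page (hypothetical)
FP8_BYTES = 1     # --kv-cache-dtype fp8 stores 1 byte per element


@dataclass(frozen=True)
class KVSpec:
    name: str
    elems_per_token: int  # KV elements cached per token, per layer

    def page_bytes(self, block_size: int = BLOCK_SIZE, dtype_bytes: int = FP8_BYTES) -> int:
        # One page holds block_size tokens worth of this layer's KV entries.
        return block_size * self.elems_per_token * dtype_bytes


def mla_spec(kv_lora_rank: int, rope_dim: int) -> KVSpec:
    # MLA caches one compressed latent vector plus a shared RoPE key per token.
    return KVSpec("mla", kv_lora_rank + rope_dim)


def mha_spec(num_kv_heads: int, head_dim: int) -> KVSpec:
    # MHA caches a separate K and V vector for every KV head.
    return KVSpec("mha", 2 * num_kv_heads * head_dim)


def group_into_pools(specs: list[KVSpec]) -> dict[int, list[str]]:
    """Group layers by page size; each distinct page size gets its own pool."""
    pools: dict[int, list[str]] = {}
    for spec in specs:
        pools.setdefault(spec.page_bytes(), []).append(spec.name)
    return pools


target = mla_spec(kv_lora_rank=512, rope_dim=64)  # 576 elements per token
draft = mha_spec(num_kv_heads=8, head_dim=128)    # hypothetical: 2048 elements per token
print(target.page_bytes(), draft.page_bytes())    # 9216 32768
print(group_into_pools([target, draft]))
```

With an MLA draft, both specs share the same page size and collapse into one pool; with an MHA draft, the sizes differ, which is the case the PR handles by keeping the pools separate.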

Technical Details

Test Plan

Target model: amd/Kimi-K2.6-MXFP4
Draft models: lightseekorg/kimi-k2.6-eagle3.1-mla (MLA), lightseekorg/kimi-k2.6-eagle3 (MHA)

Server command

vllm serve amd/Kimi-K2.6-MXFP4 \
    --host localhost \
    --port 8000 \
    --async-scheduling \
    --trust-remote-code \
    --gpu_memory_utilization 0.8 \
    --kv-cache-dtype fp8 \
    --no-enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2  \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
    --tensor-parallel-size 8 \
    --speculative-config.method eagle3 \
    --speculative-config.model "$DRAFT_MODEL" \
    --speculative-config.num_speculative_tokens 3
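The command above leaves `$DRAFT_MODEL` to be set per run, with `--speculative-config.method eagle3` used for both drafts. As a minimal sketch, assuming you want to script both test runs from Python, the argument list can be built once and only the draft checkpoint swapped. `DRAFT_MODELS`, `BASE_ARGS`, and `serve_argv` are hypothetical helpers, not part of vLLM or ATOM.

```python
import shlex

# Draft checkpoints from the test plan; one is substituted for $DRAFT_MODEL per run.
DRAFT_MODELS = {
    "mla": "lightseekorg/kimi-k2.6-eagle3.1-mla",
    "mha": "lightseekorg/kimi-k2.6-eagle3",
}

# Flags copied from the server command above (everything except the spec-decode flags).
BASE_ARGS = [
    "vllm", "serve", "amd/Kimi-K2.6-MXFP4",
    "--host", "localhost",
    "--port", "8000",
    "--async-scheduling",
    "--trust-remote-code",
    "--gpu_memory_utilization", "0.8",
    "--kv-cache-dtype", "fp8",
    "--no-enable-prefix-caching",
    "--enable-auto-tool-choice",
    "--tool-call-parser", "kimi_k2",
    "--reasoning-parser", "kimi_k2",
    "--compilation-config", '{"cudagraph_mode": "FULL_AND_PIECEWISE"}',
    "--tensor-parallel-size", "8",
]


def serve_argv(draft_model: str, num_spec_tokens: int = 3) -> list[str]:
    """Return a new argv list for one run; BASE_ARGS is not mutated."""
    return BASE_ARGS + [
        "--speculative-config.method", "eagle3",
        "--speculative-config.model", draft_model,
        "--speculative-config.num_speculative_tokens", str(num_spec_tokens),
    ]


for kind, model in DRAFT_MODELS.items():
    # shlex.join prints a copy-pasteable shell line, quoting the JSON value.
    print(kind, shlex.join(serve_argv(model)))
```

To launch a run, the list could be passed directly to `subprocess.Popen(serve_argv(...))`; passing an argv list avoids the shell quoting that the JSON `--compilation-config` value otherwise needs.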

lm_eval

lm_eval --model local-completions --model_args model=amd/Kimi-K2.6-MXFP4,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False,trust_remote_code=True --tasks gsm8k --num_fewshot 5

Test Result

lm_eval with gsm8k

lightseekorg/kimi-k2.6-eagle3.1-mla

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|-----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.9416|±|0.0065|
| | |strict-match|5|exact_match|↑|0.9409|±|0.0065|

Per-position acceptance rate: 0.800, 0.602, 0.429, Avg Draft acceptance rate: 61.0%

lightseekorg/kimi-k2.6-eagle3

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|------|-----:|------|---|-----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.9348|±|0.0068|
| | |strict-match|5|exact_match|↑|0.9356|±|0.0068|

Per-position acceptance rate: 0.819, 0.605, 0.388, Avg Draft acceptance rate: 60.4%
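The reported averages can be cross-checked from the per-position rates. A minimal sketch, assuming each per-position rate is (accepted drafts at that position) / (number of draft rounds) and that every round proposes all 3 speculative tokens: the average acceptance rate is then the plain mean, and 1 + the sum gives the expected tokens emitted per target forward pass. The tokens-per-step figure is derived here, not reported in the PR, and it ignores the cost of running the draft model. `RESULTS` and `summarize` are illustrative names.

```python
from statistics import mean

# Per-position draft acceptance rates reported in the test results above.
RESULTS = {
    "kimi-k2.6-eagle3.1-mla": [0.800, 0.602, 0.429],
    "kimi-k2.6-eagle3": [0.819, 0.605, 0.388],
}


def summarize(per_position: list[float]) -> tuple[float, float]:
    """Return (average draft acceptance rate, expected tokens per target step).

    The sum of per-position rates is the expected number of accepted draft
    tokens per round; the target forward pass always adds one more token.
    """
    avg = mean(per_position)
    tokens_per_step = 1.0 + sum(per_position)
    return avg, tokens_per_step


for name, rates in RESULTS.items():
    avg, tps = summarize(rates)
    print(f"{name}: avg acceptance {avg:.1%}, ~{tps:.3f} tokens/step")
```

Under these assumptions the MLA draft gives about 2.831 tokens per step versus about 2.812 for the MHA draft, consistent with the small gap between the reported 61.0% and 60.4% averages.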

Submission Checklist

kliuae added 5 commits June 13, 2026 00:18
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
zejunchen-zejun requested review from PerryZhang01, whx-sjtu and zejunchen-zejun and removed request for whx-sjtu June 14, 2026 13:49
try:
    from atom.plugin.vllm.model_wrapper import ATOMModelBase
    import vllm.v1.spec_decode.llm_base_proposer as llm_base_proposer
except Exception:
    # Skip patching if ATOM or the vLLM proposer module cannot be imported.
    return

Collaborator

Maybe we need to raise a warning here if the import path no longer exists after upgrading vLLM, to avoid the silent return.

Contributor Author

Good point, added a warning message.
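A minimal sketch of the pattern discussed in this thread: instead of returning silently when an import fails, log a warning that names the missing module so a vLLM upgrade that moves `vllm.v1.spec_decode.llm_base_proposer` is visible in the logs. The helper names `import_or_warn` and `maybe_patch_proposer` are hypothetical, and using the standard `logging` module is an assumption; the PR's actual function, mechanism, and warning text are not shown here.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

logger = logging.getLogger(__name__)


def import_or_warn(module_path: str) -> Optional[ModuleType]:
    """Import module_path, or log a warning and return None if it is missing."""
    try:
        return importlib.import_module(module_path)
    except ImportError as exc:
        logger.warning(
            "Skipping EAGLE spec-decode patch: could not import %s (%s). "
            "The vLLM module layout may have changed after an upgrade.",
            module_path,
            exc,
        )
        return None


def maybe_patch_proposer() -> bool:
    """Hypothetical patch entry point; returns False when a dependency is missing."""
    wrapper = import_or_warn("atom.plugin.vllm.model_wrapper")
    proposer = import_or_warn("vllm.v1.spec_decode.llm_base_proposer")
    if wrapper is None or proposer is None:
        return False
    # The actual patching logic is not shown in this thread.
    return True
```

This sketch catches `ImportError` rather than the broader `Exception` used in the reviewed snippet, so that a genuine bug raised while a module initializes still surfaces instead of being reported as a missing import.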

kliuae added 6 commits June 18, 2026 02:14
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Signed-off-by: miloice <kuanfu.liu@embeddedllm.com>
kliuae marked this pull request as ready for review June 25, 2026 07:26

Labels

None yet

Projects

None yet

2 participants