
Conversation

@luccafong (Collaborator) commented Sep 30, 2025

Purpose

  1. CUDA graph + MTP support for DeepSeek V3.2 is still in progress, so enable eager mode by default for the MTP part for now. Acceptance rate and eval results have been verified, and piecewise CUDA graph is still used for the main model when MTP is enabled.
  2. Fix the Eagle tests as well as the MTP tests (@njhill).

Test Plan

VLLM_SKIP_DEEP_GEMM_WARMUP=1 vllm serve "deepseek-ai/DeepSeek-V3.2-Exp" --max_model_len=20000 --gpu_memory_utilization=0.9 --tensor_parallel_size 8 --max_num_seqs=256 --speculative_config '{"num_speculative_tokens":1, "method":"mtp"}'  --no-enable-prefix-caching --compilation_config
lm_eval --model local-completions --tasks gsm8k     --model_args model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=200,max_retries=3,tokenized_requests=False --batch_size 32 --num_fewshot 20 

Test Result

Tasks  Version  Filter            n-shot  Metric       Value   Stderr
gsm8k  3        flexible-extract  20      exact_match  0.9507  ± 0.006
                strict-match      20      exact_match  0.9500  ± 0.006

Note: the gsm8k eval takes 9:58 with MTP vs. 12:08 without MTP (12:08 / 9:58 ≈ 1.22×, roughly a 22% speedup).

(APIServer pid=1007953) INFO 09-30 12:31:58 [metrics.py:96] SpecDecoding metrics: Mean acceptance length: 1.92, Accepted throughput: 80.90 tokens/s, Drafted throughput: 88.00 tokens/s, Accepted: 809 tokens, Drafted: 880 tokens, Per-position acceptance rate: 0.919, Avg Draft acceptance rate: 91.9%
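As a sanity check, the metrics in the log line above are internally consistent. With num_speculative_tokens=1 each step drafts one token, so the mean acceptance length is one verified token plus the per-position acceptance rate. A minimal sketch of that arithmetic (values taken from the log):

```python
# Reproduce the SpecDecoding metrics arithmetic from the server log above.
accepted = 809  # tokens accepted (from the log)
drafted = 880   # tokens drafted (from the log)

# Per-position acceptance rate: fraction of drafted tokens accepted.
acceptance_rate = accepted / drafted

# With num_speculative_tokens=1, each step emits 1 verified token plus
# the accepted draft token, so the mean acceptance length is 1 + rate.
mean_acceptance_length = 1 + acceptance_rate

print(f"Per-position acceptance rate: {acceptance_rate:.3f}")   # 0.919
print(f"Mean acceptance length: {mean_acceptance_length:.2f}")  # 1.92
```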

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Lu Fang <[email protected]>
@luccafong force-pushed the v3.2_mtp_enforce_eager branch from 69b3255 to f3536c6 on September 30, 2025 19:52
@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces a fallback to eager execution mode for DeepSeek V3.2 models when using Multi-Token Prediction (MTP) for speculative decoding. This is a temporary measure to address an issue where CUDA graph is not yet supported for this specific combination.

The changes are well-contained and logical:

  1. A new enforce_eager flag is added to SpeculativeConfig to allow overriding the default execution mode.
  2. This flag is automatically enabled for DeepSeek V3.2 models when MTP is used.
  3. The EagleProposer, which handles MTP, now checks this flag to disable CUDA graph when necessary.

The implementation is correct and effectively resolves the issue described. The changes are specific to the problematic configuration and should not affect other models or execution paths. Overall, this is a good, targeted fix.
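The three changes the review lists can be sketched roughly as follows. This is an illustrative sketch only, not the actual vLLM code: the class names SpeculativeConfig, EagleProposer, and the enforce_eager flag come from the review above, but the finalize() method and field layout here are hypothetical simplifications.

```python
# Hypothetical sketch of the change described in the review; the real vLLM
# implementation differs in structure and naming.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeculativeConfig:
    method: str = "mtp"
    num_speculative_tokens: int = 1
    enforce_eager: Optional[bool] = None  # None -> decide automatically

    def finalize(self, model_name: str) -> None:
        # CUDA graph + MTP is not yet supported for DeepSeek V3.2, so fall
        # back to eager execution for the draft model in that combination.
        if self.enforce_eager is None:
            self.enforce_eager = (
                self.method == "mtp" and "DeepSeek-V3.2" in model_name
            )


class EagleProposer:
    """Handles MTP drafting; skips CUDA graph capture when eager is enforced."""

    def __init__(self, spec_config: SpeculativeConfig):
        self.use_cuda_graph = not spec_config.enforce_eager


cfg = SpeculativeConfig()
cfg.finalize("deepseek-ai/DeepSeek-V3.2-Exp")
proposer = EagleProposer(cfg)
print(cfg.enforce_eager, proposer.use_cuda_graph)  # True False
```

Because the flag is only auto-set for this one model/method combination, other models and execution paths keep their existing CUDA graph behavior, which matches the review's observation that the fix is well-contained.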

@LucasWilkinson (Collaborator) left a comment:

LGTM; thanks!

@luccafong added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) Sep 30, 2025
Signed-off-by: Lu Fang <[email protected]>
@heheda12345 heheda12345 added this to the v0.11.0 Cherry Picks milestone Sep 30, 2025
Signed-off-by: Lu Fang <[email protected]>
@luccafong force-pushed the v3.2_mtp_enforce_eager branch from 276e573 to ef7e8d4 on September 30, 2025 22:29
@luccafong luccafong enabled auto-merge (squash) September 30, 2025 22:30
Signed-off-by: Nick Hill <[email protected]>
@njhill (Member) commented Oct 1, 2025

@luccafong fyi I pushed a similar fix that was needed to test_mtp.py.

@luccafong (Collaborator, Author) replied:

> @luccafong fyi I pushed a similar fix that was needed to test_mtp.py.

thx!

@luccafong luccafong merged commit 001e50c into vllm-project:main Oct 1, 2025
48 checks passed
simon-mo pushed a commit that referenced this pull request Oct 1, 2025
Labels: deepseek (Related to DeepSeek models), ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding, v1