Add DSR1 FP4 B300 vLLM TokenSpeed benchmark #1562
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,95 @@ | ||||||||||||||||||||
| #!/usr/bin/env bash | ||||||||||||||||||||
|
|
||||||||||||||||||||
| # DeepSeek-R1 FP4 B300 vLLM run for the TokenSpeed MLA and fastokens paths | ||||||||||||||||||||
| # introduced in vLLM v0.21.0. TokenSpeed MLA requires Blackwell and FP8 KV. | ||||||||||||||||||||
|
|
||||||||||||||||||||
| source "$(dirname "$0")/../benchmark_lib.sh" | ||||||||||||||||||||
|
|
||||||||||||||||||||
| check_env_vars \ | ||||||||||||||||||||
| MODEL \ | ||||||||||||||||||||
| TP \ | ||||||||||||||||||||
| CONC \ | ||||||||||||||||||||
| ISL \ | ||||||||||||||||||||
| OSL \ | ||||||||||||||||||||
| MAX_MODEL_LEN \ | ||||||||||||||||||||
| RANDOM_RANGE_RATIO \ | ||||||||||||||||||||
| RESULT_FILENAME \ | ||||||||||||||||||||
| EP_SIZE | ||||||||||||||||||||
|
|
||||||||||||||||||||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||||||||||||||||||||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||||||||||||||||||||
| fi | ||||||||||||||||||||
|
|
||||||||||||||||||||
| nvidia-smi | ||||||||||||||||||||
|
|
||||||||||||||||||||
| if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi | ||||||||||||||||||||
|
|
||||||||||||||||||||
| SERVER_LOG=/workspace/server.log | ||||||||||||||||||||
| PORT=${PORT:-8888} | ||||||||||||||||||||
|
|
||||||||||||||||||||
| # Model loading can exceed the default timeout for the 394B FP4 checkpoint. | ||||||||||||||||||||
| export VLLM_ENGINE_READY_TIMEOUT_S=3600 | ||||||||||||||||||||
|
|
||||||||||||||||||||
| EP_ARGS=() | ||||||||||||||||||||
| if [ "$EP_SIZE" -gt 1 ]; then | ||||||||||||||||||||
| EP_ARGS=(--enable-expert-parallel) | ||||||||||||||||||||
| fi | ||||||||||||||||||||
|
|
||||||||||||||||||||
| BENCHMARK_MAX_MODEL_LEN="$MAX_MODEL_LEN" | ||||||||||||||||||||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||||||||||||||||||||
| EVAL_MAX_MODEL_LEN=$(compute_eval_context_length "$MODEL" "$BENCHMARK_MAX_MODEL_LEN") | ||||||||||||||||||||
| export EVAL_MAX_MODEL_LEN | ||||||||||||||||||||
| SERVE_MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN" | ||||||||||||||||||||
| else | ||||||||||||||||||||
| SERVE_MAX_MODEL_LEN="$BENCHMARK_MAX_MODEL_LEN" | ||||||||||||||||||||
| fi | ||||||||||||||||||||
|
|
||||||||||||||||||||
| MAX_NUM_BATCHED_TOKENS=$(( ISL * 2 )) | ||||||||||||||||||||
|
|
||||||||||||||||||||
| # vLLM v0.21.0 integrates the pre-v0.2 fastokens patching API as an optional package. | ||||||||||||||||||||
| pip install -q 'fastokens>=0.1.1,<0.2.0' datasets pandas | ||||||||||||||||||||
|
Comment on lines +49 to +50
Contributor
🔴 Line 50
Supply-chain risk: unverified third-party package loaded into the serving process
What the bug is
Why
|
||||||||||||||||||||
|
|
||||||||||||||||||||
| # Start GPU monitoring (power, temperature, clocks every second) | ||||||||||||||||||||
| start_gpu_monitor | ||||||||||||||||||||
|
|
||||||||||||||||||||
| set -x | ||||||||||||||||||||
| vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \ | ||||||||||||||||||||
| --tensor-parallel-size "$TP" \ | ||||||||||||||||||||
| --pipeline-parallel-size 1 \ | ||||||||||||||||||||
| --kv-cache-dtype fp8 \ | ||||||||||||||||||||
| --trust-remote-code \ | ||||||||||||||||||||
| --no-enable-prefix-caching \ | ||||||||||||||||||||
| "${EP_ARGS[@]}" \ | ||||||||||||||||||||
| --attention-backend TOKENSPEED_MLA \ | ||||||||||||||||||||
| --attention-config.mla_prefill_backend TOKENSPEED_MLA \ | ||||||||||||||||||||
|
Comment on lines +63 to +64
Contributor
🔴 The script selects
What the bug is
Lines 63-64 of
--attention-backend TOKENSPEED_MLA \
--attention-config.mla_prefill_backend TOKENSPEED_MLA \
This combines two distinct problems: a backend name that does not exist in vLLM, and a dotted-namespace prefix that uses the wrong separator.
Problem 1 —
|
||||||||||||||||||||
| File | Backend |
|---|---|
| benchmarks/single_node/dsr1_fp4_b300.sh:52 | trtllm_mla |
| benchmarks/single_node/dsr1_fp4_b200.sh:48 | trtllm_mla |
| benchmarks/single_node/dsr1_fp4_b200_mtp.sh:63 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b300.sh:84 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b300_mtp.sh:88 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b200.sh:80 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b200_mtp.sh:84 | trtllm_mla |
| benchmarks/single_node/agentic/dsr1_fp4_b200.sh:58 | trtllm_mla |
vLLM's published MLA backend set is {trtllm_mla, flashinfer_mla, flash_attn_mla, triton_mla, cutlass_mla, flashmla}. The PR description's claim that "vLLM v0.21.0 includes TokenSpeed MLA for DeepSeek-R1-shaped MLA on Blackwell" has no upstream support — the same vllm/vllm-openai:v0.21.0 image is already used by other DSR1 scripts in this repo and they all select conventional backends. The companion --tokenizer-mode fastokens + pip install fastokens claim is similarly absent from vLLM's valid --tokenizer-mode values (auto/slow/mistral/custom), reinforcing that this recipe was not actually run against a vLLM server.
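A pre-flight check along these lines could surface an unknown backend name before the server is launched. This is only a sketch: the allowed list is copied from the review comment above rather than read from vLLM, the function name `is_known_mla_backend` is hypothetical, and the case-insensitive comparison is an assumption made so that both spellings seen in this PR are handled.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: flag an unknown MLA backend before starting the server.
# ASSUMPTION: this list is copied from the review comment, not queried from vLLM.
KNOWN_MLA_BACKENDS=(trtllm_mla flashinfer_mla flash_attn_mla triton_mla cutlass_mla flashmla)

# is_known_mla_backend NAME
# Returns 0 if NAME (compared case-insensitively) is in KNOWN_MLA_BACKENDS, 1 otherwise.
is_known_mla_backend() {
    local candidate backend
    candidate=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    for backend in "${KNOWN_MLA_BACKENDS[@]}"; do
        [[ "$candidate" == "$backend" ]] && return 0
    done
    return 1
}

# Example: check the backend used by the existing recipes and the one in this PR.
for name in trtllm_mla TOKENSPEED_MLA; do
    if is_known_mla_backend "$name"; then
        echo "$name: in the reviewed backend list"
    else
        echo "$name: not in the reviewed backend list"
    fi
done
```

A hard-coded list goes stale as backends are added upstream, so a check like this is best treated as a review aid, not a source of truth.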
Problem 2 — --attention-config.X should be --attention_config.X
vLLM's nested-config CLI uses the Python attribute name as the prefix. The pydantic field is attention_config (underscore). Every other call site in the repo follows that:
- benchmarks/single_node/dsv4_fp4_b300_vllm.sh:80: --attention_config.use_fp4_indexer_cache True
- benchmarks/single_node/dsv4_fp4_b300_vllm_mtp.sh:73
- benchmarks/single_node/dsv4_fp4_b200_vllm.sh:88
- benchmarks/single_node/dsv4_fp4_b200_vllm_mtp.sh:84
- benchmarks/single_node/agentic/dsv4_fp4_b300_vllm.sh:132
- benchmarks/single_node/agentic/dsv4_fp4_b200_vllm.sh:132
- perf-changelog.yaml:1750, 1951
Only this new file uses --attention-config.mla_prefill_backend (hyphen between attention and config, but underscore inside the sub-key). Mixing styles inside a single dotted token will not match the registered prefix; argparse will either reject it as unknown or silently drop it, leaving the prefill backend at whatever --attention-backend selected.
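The style mismatch described above can be caught mechanically. The sketch below only enforces the repo convention cited in this comment (underscore prefix before the dot); it makes no claim about what vLLM's parser accepts, and the function name `has_hyphenated_dotted_prefix` is hypothetical.

```bash
#!/usr/bin/env bash
# has_hyphenated_dotted_prefix FLAG
# Returns 0 if FLAG has the form --prefix.key and the prefix contains a hyphen
# (e.g. --attention-config.x), 1 otherwise. Plain flags without a dot never match.
has_hyphenated_dotted_prefix() {
    local flag="$1" prefix
    [[ "$flag" == --*.* ]] || return 1
    prefix="${flag#--}"      # strip the leading dashes
    prefix="${prefix%%.*}"   # keep only the part before the first dot
    [[ "$prefix" == *-* ]]
}

# Example: the flag from this PR versus the form used elsewhere in the repo.
for flag in --attention-config.mla_prefill_backend --attention_config.mla_prefill_backend; do
    if has_hyphenated_dotted_prefix "$flag"; then
        echo "$flag: hyphenated prefix, differs from repo convention"
    else
        echo "$flag: ok"
    fi
done
```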
Step-by-step proof of failure
1. The CI launcher invokes the script with MODEL=nvidia/DeepSeek-R1-0528-FP4-V2 and sets TP, EP_SIZE, etc. from the dsr1-fp4-b300-vllm entry in nvidia-master.yaml.
2. The script runs pip install -q 'fastokens>=0.1.1,<0.2.0'. This either fails (no such package on the official index for vLLM) or installs a third-party package unrelated to vLLM.
3. The script execs vllm serve … --attention-backend TOKENSPEED_MLA ….
4. vLLM validates --attention-backend against a closed enum during engine init. TOKENSPEED_MLA is not in the enum, so the engine raises and exits with an unknown-backend error before any model weights are loaded.
5. wait_for_server_ready times out (or sees the dead PID), so the benchmark never reaches run_benchmark_serving.
6. Even in the counterfactual where the backend existed, step 3 would still fail to apply the prefill setting because --attention-config.mla_prefill_backend does not match the registered attention_config prefix.
Impact
The entire dsr1-fp4-b300-vllm recipe is non-functional as written. It will fail at server-start time in CI and produce no benchmark data. Since the PR's stated purpose is to add this benchmark, the recipe needs working flags before it provides any value.
How to fix
Replace the two flags with a real backend and the correct dotted prefix, mirroring the existing DSR1 FP4 B300 recipe:
--attention-backend trtllm_mla \
--attention_config.mla_prefill_backend trtllm_mla \
Also drop --tokenizer-mode fastokens and the pip install fastokens line unless a real upstream feature is identified; the default tokenizer mode is auto and works with this checkpoint. The PR description, nvidia-master.yaml comment, and perf-changelog.yaml entry should be updated to reflect the actual backend used.
🟡 The comment on line 30 says "the 394B FP4 checkpoint," but the model loaded here (nvidia/DeepSeek-R1-0528-FP4-V2) is DeepSeek-R1, a 671B-parameter MoE. The repo's own multi-node SGLang recipe at benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:55 cites "671B FP4 weights" for the same checkpoint family. The 3600s VLLM_ENGINE_READY_TIMEOUT_S export still works, so this is purely a comment fix, but worth correcting so the justification for the bumped timeout does not mislead future readers tuning this for related models.
What the bug is

benchmarks/single_node/dsr1_fp4_b300_vllm.sh:30 contains the comment:

    # Model loading can exceed the default timeout for the 394B FP4 checkpoint.
    export VLLM_ENGINE_READY_TIMEOUT_S=3600

The parameter count is wrong. The script loads nvidia/DeepSeek-R1-0528-FP4-V2 (set via the MODEL env var; the dsr1-fp4-b300-vllm entry in .github/configs/nvidia-master.yaml selects exactly this checkpoint). DeepSeek-R1 / R1-0528 is the publicly documented DeepSeek-V3-architecture MoE with 671B total parameters (37B activated), not 394B.

How it manifests

No runtime impact: VLLM_ENGINE_READY_TIMEOUT_S=3600 is exported regardless of the comment text. The bug is purely documentation: any future reader trying to understand why the timeout was bumped to 3600s will see a parameter count that does not match the checkpoint, and may either lose confidence in the comment or copy the same incorrect figure into related scripts.

Why existing code does not prevent it

Nothing else in the script references the parameter count; the comment is free-form text. There is also a clear in-repo convention for citing this number correctly:

- benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:55 references "671B FP4 weights" for the same DSR1 FP4 checkpoint family and uses that to justify a similar loading-timeout note.

So the 394B figure looks like a copy-paste artifact from a different model variant rather than an intentional choice.

Step-by-step proof

1. Open benchmarks/single_node/dsr1_fp4_b300_vllm.sh, line 30: the comment says "# Model loading can exceed the default timeout for the 394B FP4 checkpoint."
2. The script reads MODEL from the environment. In .github/configs/nvidia-master.yaml (the diff in this PR, lines added at 1894–1916), the new dsr1-fp4-b300-vllm entry sets model: nvidia/DeepSeek-R1-0528-FP4-V2.
3. nvidia/DeepSeek-R1-0528-FP4-V2 is an FP4 quantization of DeepSeek-R1-0528, which is the DeepSeek-V3-architecture MoE with 671B total parameters.
4. Cross-check the convention in this same repo: benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:55 says "671B FP4 weights" for the same family.
5. Therefore the 394B figure does not match the loaded checkpoint.

Impact

None at runtime. Misleading documentation only. Severity is nit, but worth fixing because the comment is the only justification for the 3600s timeout and a wrong parameter count weakens that justification for anyone porting this recipe to other DeepSeek variants.

How to fix

Replace 394B with 671B (matching the repo's own convention) on line 30, e.g.:

    # Model loading can exceed the default timeout for the 671B FP4 checkpoint.
    export VLLM_ENGINE_READY_TIMEOUT_S=3600

Alternatively, generalize to "large FP4 checkpoint" to avoid the same drift when the recipe is adapted to other variants.