Add DSR1 FP4 B300 vLLM TokenSpeed benchmark #1562
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
2da063a to 895c106
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26383934297
# vLLM v0.21.0 integrates the pre-v0.2 fastokens patching API as an optional package.
pip install -q 'fastokens>=0.1.1,<0.2.0' datasets pandas
🔴 Line 50 pip installs the unattributed fastokens>=0.1.1,<0.2.0 package from PyPI on every CI run and then loads it into the running vLLM server process via --tokenizer-mode fastokens (line 65, alongside --trust-remote-code). The package has empty PyPI metadata (no author, email, maintainer, license, or description) and a homepage at github.com/Atero-ai/fastokens — a namespace with no apparent affiliation with vllm-project — making this the only package in the entire benchmarks/** directory installed from an unfamiliar/unattributed source. Please pin by hash and document provenance (or vendor under vllm-project) before landing.
Extended reasoning...
Supply-chain risk: unverified third-party package loaded into the serving process
What the bug is
benchmarks/single_node/dsr1_fp4_b300_vllm.sh:49-50 performs an unhashed PyPI install of fastokens>=0.1.1,<0.2.0, and line 65 activates the package's code path inside the running vllm serve process via --tokenizer-mode fastokens. Because the same vllm serve invocation also passes --trust-remote-code (line 60), the unverified tokenizer code is loaded into a context that already trusts arbitrary remote Python — there is no sandbox or process boundary between the third-party tokenizer and the rest of the model server.
Why fastokens is not a normal dependency
Direct PyPI metadata inspection of fastokens shows:
- author = null, author_email = null, maintainer = null
- license = null, description = empty string, summary = null
- home_page = https://github.com/Atero-ai/fastokens, an org with no apparent affiliation with vllm-project
- Version history: 0.1.1 (2026-04-18), 0.1.2 (2026-05-07), 0.2.0 (2026-05-17). This is a brand-new package with no track record, shipping cp313 + cp39-abi3 native-extension wheels.
By contrast, every other pip install across benchmarks/**/*.sh (transformers, huggingface-hub, datasets, pandas, flashinfer_python, amd-quark) is a well-known upstream from a known maintainer (HF, NVIDIA, AMD). fastokens is qualitatively the only totally-anonymous package in that pattern.
Why the PR's framing does not justify it
The YAML comment at .github/configs/nvidia-master.yaml calls fastokens vLLM's optional dependency, and the shell-script comment calls it the pre-v0.2 fastokens patching API integrated by vLLM v0.21.0. But vLLM v0.21.0's published requires_dist does not list fastokens at all, and no upstream source (release notes, RFC, GitHub PR) is cited. So at best this is a third-party plug-in, not a vLLM-maintained component — the YAML's framing overstates the package's provenance.
Concrete attack path
1. An attacker compromises the Atero-ai GitHub/PyPI account, or registers a malicious successor version inside the unpinned range >=0.1.1,<0.2.0.
2. CI runs pip install -q 'fastokens>=0.1.1,<0.2.0' ... If pip builds from an sdist, the install step alone can execute arbitrary Python via setup hooks; with a wheel, the package's __init__ runs on first import.
3. vllm serve ... --tokenizer-mode fastokens then imports the package into the long-running server process holding GPU and any cloud/registry credentials present on the benchmark runner.
4. Arbitrary code now runs on every B300 benchmark execution.
How to fix
At minimum:
- Pin by hash using pip install --require-hashes with a curated requirements.txt, so a malicious 0.1.3 upload cannot silently roll in via the <0.2.0 range, or
- Vendor the tokenizer code under vllm-project (or this repo) so its provenance is auditable, or
- Document the package's origin and the security review that approved it in the YAML comment.
If fastokens is not actually necessary for the benchmark (the sibling dsv4_fp4_b300_vllm.sh recipe runs without it), the cleanest fix is to drop the pip install and --tokenizer-mode fastokens flag entirely and use the default tokenizer mode.
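The hash-pinning option above could look roughly like the following. This is a minimal sketch, not the repo's actual tooling: the helper name `write_hashed_requirement`, the `wheels/` directory, the `requirements-fastokens.txt` file name, and the choice of version 0.1.2 are illustrative assumptions. The helper computes the SHA-256 of an already downloaded and reviewed artifact and prints a requirements line in the `--hash=sha256:...` form that pip's hash-checking mode verifies.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print a hash-pinned requirements line for one
# locally downloaded artifact.
#   $1 = package name, $2 = exact version, $3 = path to the wheel or sdist
write_hashed_requirement() {
    local name="$1" version="$2" artifact="$3"
    local digest
    # sha256sum prints "<digest>  <path>"; keep only the digest.
    digest="$(sha256sum "$artifact" | awk '{print $1}')"
    printf '%s==%s --hash=sha256:%s\n' "$name" "$version" "$digest"
}

# Example workflow (commented out because it needs network access and a
# manual review of the downloaded file; paths and version are assumptions):
#   pip download --no-deps --dest wheels 'fastokens==0.1.2'
#   write_hashed_requirement fastokens 0.1.2 wheels/<downloaded file> \
#       > requirements-fastokens.txt
#   pip install --require-hashes -r requirements-fastokens.txt
```

Two caveats apply to this sketch. In hash-checking mode pip expects a hash for every requirement in the install, including transitive dependencies, so datasets and pandas would need their own pinned entries or a separate install step. A package that ships several wheels (for example per Python ABI) needs one `--hash` option per artifact on the same requirement line.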
On the duplicate-of-bug_002 refutation
A separate verifier flagged this as a duplicate of an earlier functional-failure report. The two findings overlap on the same lines but describe distinct concerns: the functional report covers the recipe will not run as written (CLI rejection / runtime failure), while this one covers if the recipe does run, it will execute unattributed third-party code inside the model server alongside --trust-remote-code. The remediation set is different too — removing the flag fixes the functional issue, but if maintainers decide to keep fastokens (e.g. once upstream support is confirmed), the supply-chain remediation (hash-pinning or vendoring) still applies. Filing this separately so the security framing is not lost if the functional concern is resolved by adding the flag rather than removing it.
--attention-backend TOKENSPEED_MLA \
--attention-config.mla_prefill_backend TOKENSPEED_MLA \
🔴 The script selects --attention-backend TOKENSPEED_MLA and --attention-config.mla_prefill_backend TOKENSPEED_MLA on lines 63-64, but neither flag is valid as written: TOKENSPEED_MLA is not a registered vLLM MLA backend (every other DSR1 MLA recipe uses trtllm_mla or flashinfer_mla), and the --attention-config.X form uses a hyphen where every other call site in the repo uses --attention_config.X (underscore, matching the Python attribute name). vllm serve will exit at startup with an unknown-backend error, so the benchmark never runs; even if it did, the hyphenated nested-config flag would silently fall back to the default prefill backend. Replace with a real backend (e.g. trtllm_mla) and use the underscore prefix --attention_config.mla_prefill_backend.
Extended reasoning...
What the bug is
Lines 63-64 of benchmarks/single_node/dsr1_fp4_b300_vllm.sh configure the attention backend with two flags that do not work as written:
    --attention-backend TOKENSPEED_MLA \
    --attention-config.mla_prefill_backend TOKENSPEED_MLA \

This combines two distinct problems: a backend name that does not exist in vLLM, and a dotted-namespace prefix that uses the wrong separator.
Problem 1 — TOKENSPEED_MLA is not a real vLLM backend
TOKENSPEED_MLA appears only in this PR (the script, the nvidia-master.yaml entry, and the perf-changelog.yaml entry). Every other DeepSeek MLA recipe in this repo uses a known upstream backend:
| File | Backend |
|---|---|
| benchmarks/single_node/dsr1_fp4_b300.sh:52 | trtllm_mla |
| benchmarks/single_node/dsr1_fp4_b200.sh:48 | trtllm_mla |
| benchmarks/single_node/dsr1_fp4_b200_mtp.sh:63 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b300.sh:84 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b300_mtp.sh:88 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b200.sh:80 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b200_mtp.sh:84 | trtllm_mla |
| benchmarks/single_node/agentic/dsr1_fp4_b200.sh:58 | trtllm_mla |
vLLM's published MLA backend set is {trtllm_mla, flashinfer_mla, flash_attn_mla, triton_mla, cutlass_mla, flashmla}. The PR description's claim that "vLLM v0.21.0 includes TokenSpeed MLA for DeepSeek-R1-shaped MLA on Blackwell" has no upstream support — the same vllm/vllm-openai:v0.21.0 image is already used by other DSR1 scripts in this repo and they all select conventional backends. The companion --tokenizer-mode fastokens + pip install fastokens claim is similarly absent from vLLM's valid --tokenizer-mode values (auto/slow/mistral/custom), reinforcing that this recipe was not actually run against a vLLM server.
Problem 2 — --attention-config.X should be --attention_config.X
vLLM's nested-config CLI uses the Python attribute name as the prefix. The pydantic field is attention_config (underscore). Every other call site in the repo follows that:
- benchmarks/single_node/dsv4_fp4_b300_vllm.sh:80 (--attention_config.use_fp4_indexer_cache True)
- benchmarks/single_node/dsv4_fp4_b300_vllm_mtp.sh:73
- benchmarks/single_node/dsv4_fp4_b200_vllm.sh:88
- benchmarks/single_node/dsv4_fp4_b200_vllm_mtp.sh:84
- benchmarks/single_node/agentic/dsv4_fp4_b300_vllm.sh:132
- benchmarks/single_node/agentic/dsv4_fp4_b200_vllm.sh:132
- perf-changelog.yaml:1750, 1951
Only this new file uses --attention-config.mla_prefill_backend (hyphen between attention and config, but underscore inside the sub-key). Mixing styles inside a single dotted token will not match the registered prefix; argparse will either reject it as unknown or silently drop it, leaving the prefill backend at whatever --attention-backend selected.
Step-by-step proof of failure
1. The CI launcher invokes the script with MODEL=nvidia/DeepSeek-R1-0528-FP4-V2, sets TP, EP_SIZE, etc. from the dsr1-fp4-b300-vllm entry in nvidia-master.yaml.
2. The script runs pip install -q 'fastokens>=0.1.1,<0.2.0'. This either fails (no such package on the official index for vLLM) or installs a third-party package unrelated to vLLM.
3. The script execs vllm serve … --attention-backend TOKENSPEED_MLA ….
4. vLLM validates --attention-backend against a closed enum during engine init. TOKENSPEED_MLA is not in the enum, so the engine raises and exits with an unknown-backend error before any model weights are loaded.
5. wait_for_server_ready times out (or sees the dead PID), so the benchmark never reaches run_benchmark_serving.
6. Even in the counterfactual where the backend existed, step 3 would still fail to apply the prefill setting because --attention-config.mla_prefill_backend does not match the registered attention_config prefix.
Impact
The entire dsr1-fp4-b300-vllm recipe is non-functional as written. It will fail at server-start time in CI and produce no benchmark data. Since the PR's stated purpose is to add this benchmark, the recipe needs working flags before it provides any value.
How to fix
Replace the two flags with a real backend and the correct dotted prefix, mirroring the existing DSR1 FP4 B300 recipe:
    --attention-backend trtllm_mla \
    --attention_config.mla_prefill_backend trtllm_mla \

Also drop --tokenizer-mode fastokens and the pip install fastokens line unless a real upstream feature is identified. The default tokenizer mode is auto and works with this checkpoint. The PR description, nvidia-master.yaml comment, and perf-changelog.yaml entry should be updated to reflect the actual backend used.
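A small pre-merge check along these lines could catch the prefix mismatch described above before CI does. It is only a sketch: the function name `find_hyphenated_attention_config` is hypothetical, and it encodes this repo's stated convention (the underscore `--attention_config.` prefix) rather than anything verified against vLLM's argument parser.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical lint: print lines that use the hyphenated nested-config prefix
# (--attention-config.<key>) instead of the underscore form used elsewhere
# in the repo (--attention_config.<key>).
# Exit status follows grep: 0 if offending lines were found, 1 if none.
find_hyphenated_attention_config() {
    # "--" ends grep's option parsing so the pattern may start with dashes.
    grep -nE -- '--attention-config\.[A-Za-z_]+' "$@"
}

# Example use in a CI step (the path glob is an assumption):
#   if find_hyphenated_attention_config benchmarks/single_node/*.sh; then
#       echo "use --attention_config.<key> (underscore) instead" >&2
#       exit 1
#   fi
```

The check is purely textual, so it cannot tell whether a backend name such as TOKENSPEED_MLA is registered; that still has to be confirmed against the vLLM image the recipe pins.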
SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Model loading can exceed the default timeout for the 394B FP4 checkpoint.
🟡 The comment on line 30 says "the 394B FP4 checkpoint," but the model loaded here (nvidia/DeepSeek-R1-0528-FP4-V2) is DeepSeek-R1, a 671B-parameter MoE — the repo's own multi-node SGLang recipe at benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:55 cites "671B FP4 weights" for the same checkpoint family. The 3600s VLLM_ENGINE_READY_TIMEOUT_S export still works, so this is purely a comment fix, but worth correcting so the justification for the bumped timeout does not mislead future readers tuning this for related models.
Extended reasoning...
What the bug is

benchmarks/single_node/dsr1_fp4_b300_vllm.sh:30 contains the comment:

    # Model loading can exceed the default timeout for the 394B FP4 checkpoint.
    export VLLM_ENGINE_READY_TIMEOUT_S=3600

The parameter count is wrong. The script loads nvidia/DeepSeek-R1-0528-FP4-V2 (set via the MODEL env var; the dsr1-fp4-b300-vllm entry in .github/configs/nvidia-master.yaml selects exactly this checkpoint). DeepSeek-R1 / R1-0528 is the publicly documented DeepSeek-V3-architecture MoE with 671B total parameters (37B activated), not 394B.

How it manifests

No runtime impact: VLLM_ENGINE_READY_TIMEOUT_S=3600 is exported regardless of the comment text. The bug is purely documentation: any future reader trying to understand why the timeout was bumped to 3600s will see a parameter count that does not match the checkpoint, and may either lose confidence in the comment or copy the same incorrect figure into related scripts.

Why existing code does not prevent it

Nothing else in the script references the parameter count; the comment is free-form text. There is also a clear in-repo convention for citing this number correctly:

- benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:55 references "671B FP4 weights" for the same DSR1 FP4 checkpoint family and uses that to justify a similar loading-timeout note.

So the 394B figure looks like a copy-paste artifact from a different model variant rather than an intentional choice.

Step-by-step proof

1. Open benchmarks/single_node/dsr1_fp4_b300_vllm.sh, line 30: the comment says # Model loading can exceed the default timeout for the 394B FP4 checkpoint.
2. The script reads MODEL from the environment. In .github/configs/nvidia-master.yaml (the diff in this PR, lines added at 1894–1916), the new dsr1-fp4-b300-vllm entry sets model: nvidia/DeepSeek-R1-0528-FP4-V2.
3. nvidia/DeepSeek-R1-0528-FP4-V2 is an FP4 quantization of DeepSeek-R1-0528, which is the DeepSeek-V3-architecture MoE with 671B total parameters.
4. Cross-check the convention in this same repo: benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:55 says "671B FP4 weights" for the same family.
5. Therefore the 394B figure does not match the loaded checkpoint.

Impact

None at runtime. Misleading documentation only. Severity is nit, but worth fixing because the comment is the only justification for the 3600s timeout and a wrong parameter count weakens that justification for anyone porting this recipe to other DeepSeek variants.

How to fix

Replace 394B with 671B (matching the repo's own convention) on line 30, e.g.:

    # Model loading can exceed the default timeout for the 671B FP4 checkpoint.
    export VLLM_ENGINE_READY_TIMEOUT_S=3600

Alternatively, generalize to "large FP4 checkpoint" to avoid the same drift when the recipe is adapted to other variants.
functionstackx
left a comment
Vllm for Dsr1? Is ur /loop wack?
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26383949033
Summary
- Add dsr1-fp4-b300-vllm on vllm/vllm-openai:v0.21.0, mirroring the existing B300 DSR1 FP4 sweep coverage
- Use TOKENSPEED_MLA for both MLA decode and prefill with the required FP8 KV cache
- Install the fastokens package and enable --tokenizer-mode fastokens

Validation
- bash -n benchmarks/single_node/dsr1_fp4_b300_vllm.sh
- python utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml --model-prefix dsr1 --framework vllm --precision fp4 --runner-type b300 --no-evals
- python -m pytest utils/matrix_logic/ -v (151 passed)

Note
Low Risk
Benchmark and CI matrix config only; no production serving or auth paths changed.

Overview
Adds a DeepSeek-R1 FP4 on B300 vLLM benchmark path so the matrix can sweep the same fixed-seq-len scenarios as the existing B300 SGLang recipe.

New config key dsr1-fp4-b300-vllm pins vllm/vllm-openai:v0.21.0 and the nvidia/DeepSeek-R1-0528-FP4-V2 model with TP/EP search spaces at 1k/1k and 8k/1k. The launcher script installs fastokens, serves with FP8 KV, TOKENSPEED_MLA for MLA prefill/decode, and fastokens tokenization, then runs the standard serving benchmark (and optional eval). perf-changelog.yaml documents the addition.

Reviewed by Cursor Bugbot for commit 895c106. Bugbot is set up for automated code reviews on this repo.