Add DSR1 FP4 B300 vLLM TokenSpeed benchmark #1562

Closed: Oseltamivir wants to merge 1 commit into main from add-dsr1-fp4-b300-vllm-tokenspeed


Oseltamivir (Collaborator) commented May 25, 2026

Summary

  • add dsr1-fp4-b300-vllm on vllm/vllm-openai:v0.21.0, mirroring the existing B300 DSR1 FP4 sweep coverage
  • select TOKENSPEED_MLA for both MLA decode and prefill with the required FP8 KV cache
  • install the vLLM v0.21-compatible fastokens package and enable --tokenizer-mode fastokens

Validation

  • bash -n benchmarks/single_node/dsr1_fp4_b300_vllm.sh
  • python utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml --model-prefix dsr1 --framework vllm --precision fp4 --runner-type b300 --no-evals
  • python -m pytest utils/matrix_logic/ -v (151 passed)

Note

Low Risk
Benchmark and CI matrix config only; no production serving or auth paths changed.

Overview
Adds a vLLM benchmark path for DeepSeek-R1 FP4 on B300 so the matrix can sweep the same fixed-seq-len scenarios as the existing B300 SGLang recipe.

New config key dsr1-fp4-b300-vllm pins vllm/vllm-openai:v0.21.0 and the nvidia/DeepSeek-R1-0528-FP4-V2 model with TP/EP search spaces at 1k/1k and 8k/1k. The launcher script installs fastokens, serves with FP8 KV, TOKENSPEED_MLA for MLA prefill/decode, and fastokens tokenization, then runs the standard serving benchmark (and optional eval). perf-changelog.yaml documents the addition.

Reviewed by Cursor Bugbot for commit 895c106. Bugbot is set up for automated code reviews on this repo.

@github-actions (Contributor)

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

2 similar comments

@github-actions (Contributor)

Comment on lines +49 to +50
# vLLM v0.21.0 integrates the pre-v0.2 fastokens patching API as an optional package.
pip install -q 'fastokens>=0.1.1,<0.2.0' datasets pandas

🔴 Line 50 pip installs the unattributed fastokens>=0.1.1,<0.2.0 package from PyPI on every CI run and then loads it into the running vLLM server process via --tokenizer-mode fastokens (line 65, alongside --trust-remote-code). The package has empty PyPI metadata (no author, email, maintainer, license, or description) and a homepage at github.com/Atero-ai/fastokens — a namespace with no apparent affiliation with vllm-project — making this the only package in the entire benchmarks/** directory installed from an unfamiliar/unattributed source. Please pin by hash and document provenance (or vendor under vllm-project) before landing.

Extended reasoning...

Supply-chain risk: unverified third-party package loaded into the serving process

What the bug is

benchmarks/single_node/dsr1_fp4_b300_vllm.sh:49-50 performs an unhashed PyPI install of fastokens>=0.1.1,<0.2.0, and line 65 activates the package's code path inside the running vllm serve process via --tokenizer-mode fastokens. Because the same vllm serve invocation also passes --trust-remote-code (line 60), the unverified tokenizer code is loaded into a context that already trusts arbitrary remote Python — there is no sandbox or process boundary between the third-party tokenizer and the rest of the model server.

Why fastokens is not a normal dependency

Direct PyPI metadata inspection of fastokens shows:

  • author = null, author_email = null, maintainer = null
  • license = null, description = empty string, summary = null
  • home_page = https://github.com/Atero-ai/fastokens — an org with no apparent affiliation with vllm-project
  • Version history: 0.1.1 (2026-04-18), 0.1.2 (2026-05-07), 0.2.0 (2026-05-17) — a brand-new package with no track record, shipping cp313 + cp39-abi3 native-extension wheels

By contrast, every other pip install across benchmarks/**/*.sh (transformers, huggingface-hub, datasets, pandas, flashinfer_python, amd-quark) is a well-known upstream package from a known maintainer (HF, NVIDIA, AMD). fastokens is the only package in that set with no attributed author or maintainer.

Why the PR's framing does not justify it

The YAML comment at .github/configs/nvidia-master.yaml calls fastokens vLLM's optional dependency, and the shell-script comment calls it the pre-v0.2 fastokens patching API integrated by vLLM v0.21.0. But vLLM v0.21.0's published requires_dist does not list fastokens at all, and no upstream source (release notes, RFC, GitHub PR) is cited. So at best this is a third-party plug-in, not a vLLM-maintained component — the YAML's framing overstates the package's provenance.

Concrete attack path

  1. An attacker compromises the Atero-ai GitHub/PyPI account, or registers a malicious successor version inside the unpinned range >=0.1.1,<0.2.0.
  2. CI runs pip install -q 'fastokens>=0.1.1,<0.2.0' ... — the install step alone can execute arbitrary Python via setup hooks or post-install code in the wheel's __init__.
  3. vllm serve ... --tokenizer-mode fastokens then imports the package into the long-running server process holding GPU and any cloud/registry credentials present on the benchmark runner.
  4. Arbitrary code now runs on every B300 benchmark execution.

How to fix

At minimum:

  • Pin by hash using pip install --require-hashes with a curated requirements.txt, so a malicious 0.1.3 upload cannot silently roll in via the <0.2.0 range, or
  • Vendor the tokenizer code under vllm-project (or this repo) so its provenance is auditable, or
  • Document the package's origin and the security review that approved it in the YAML comment.

If fastokens is not actually necessary for the benchmark (the sibling dsv4_fp4_b300_vllm.sh recipe runs without it), the cleanest fix is to drop the pip install and --tokenizer-mode fastokens flag entirely and use the default tokenizer mode.
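
The hash-pinning option could be enforced with a small pre-flight check that runs before the install step. The sketch below is one assumed way to do that: `unpinned_requirements` is a hypothetical helper (not part of this repo or of pip), it only understands simple `name==version --hash=sha256:...` entries, and the all-zero digest in the usage lines is a placeholder, not a real hash for fastokens.

```python
import re

# Hypothetical helper for illustration; not an existing repo or pip API.
# An entry counts as pinned only if it has an exact '==' version and a sha256 hash.
PIN_RE = re.compile(r"^[A-Za-z0-9_.\-\[\]]+==[^\s;]+")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement entries that lack an exact pin or a sha256 hash."""
    # Join backslash-continued lines so each requirement is one logical entry.
    logical = text.replace("\\\n", " ")
    bad = []
    for raw in logical.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = re.sub(r"(^|\s)#.*$", "", raw).strip()
        # Skip blanks and option lines such as '--index-url ...' or '-r other.txt'.
        if not line or line.startswith("-"):
            continue
        if not PIN_RE.match(line) or "--hash=sha256:" not in line:
            bad.append(line)
    return bad


# Placeholder digest (64 zeros); a real file would carry the published wheel hashes.
pinned = "fastokens==0.1.2 \\\n    --hash=sha256:" + "0" * 64 + "\n"
ranged = "fastokens>=0.1.1,<0.2.0\n"

print(unpinned_requirements(pinned))
print(unpinned_requirements(ranged))
```

A CI step could fail the job whenever the returned list is non-empty, and only then run pip install --require-hashes -r against the same file.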

On the duplicate-of-bug_002 refutation

A separate verifier flagged this as a duplicate of an earlier functional-failure report. The two findings overlap on the same lines but describe distinct concerns: the functional report covers the recipe will not run as written (CLI rejection / runtime failure), while this one covers if the recipe does run, it will execute unattributed third-party code inside the model server alongside --trust-remote-code. The remediation set is different too — removing the flag fixes the functional issue, but if maintainers decide to keep fastokens (e.g. once upstream support is confirmed), the supply-chain remediation (hash-pinning or vendoring) still applies. Filing this separately so the security framing is not lost if the functional concern is resolved by adding the flag rather than removing it.

Comment on lines +63 to +64
--attention-backend TOKENSPEED_MLA \
--attention-config.mla_prefill_backend TOKENSPEED_MLA \

🔴 The script selects --attention-backend TOKENSPEED_MLA and --attention-config.mla_prefill_backend TOKENSPEED_MLA on lines 63-64, but neither flag is valid as written: TOKENSPEED_MLA is not a registered vLLM MLA backend (every other DSR1 MLA recipe uses trtllm_mla or flashinfer_mla), and the --attention-config.X form uses a hyphen where every other call site in the repo uses --attention_config.X (underscore, matching the Python attribute name). vllm serve will exit at startup with an unknown-backend error, so the benchmark never runs; even if it did, the hyphenated nested-config flag would silently fall back to the default prefill backend. Replace with a real backend (e.g. trtllm_mla) and use the underscore prefix --attention_config.mla_prefill_backend.

Extended reasoning...

What the bug is

Lines 63-64 of benchmarks/single_node/dsr1_fp4_b300_vllm.sh configure the attention backend with two flags that do not work as written:

--attention-backend TOKENSPEED_MLA \
--attention-config.mla_prefill_backend TOKENSPEED_MLA \

This combines two distinct problems: a backend name that does not exist in vLLM, and a dotted-namespace prefix that uses the wrong separator.

Problem 1 — TOKENSPEED_MLA is not a real vLLM backend

TOKENSPEED_MLA appears only in this PR (the script, the nvidia-master.yaml entry, and the perf-changelog.yaml entry). Every other DeepSeek MLA recipe in this repo uses a known upstream backend:

| File | Backend |
| --- | --- |
| benchmarks/single_node/dsr1_fp4_b300.sh:52 | trtllm_mla |
| benchmarks/single_node/dsr1_fp4_b200.sh:48 | trtllm_mla |
| benchmarks/single_node/dsr1_fp4_b200_mtp.sh:63 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b300.sh:84 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b300_mtp.sh:88 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b200.sh:80 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b200_mtp.sh:84 | trtllm_mla |
| benchmarks/single_node/agentic/dsr1_fp4_b200.sh:58 | trtllm_mla |

vLLM's published MLA backend set is {trtllm_mla, flashinfer_mla, flash_attn_mla, triton_mla, cutlass_mla, flashmla}. The PR description's claim that "vLLM v0.21.0 includes TokenSpeed MLA for DeepSeek-R1-shaped MLA on Blackwell" has no upstream support — the same vllm/vllm-openai:v0.21.0 image is already used by other DSR1 scripts in this repo and they all select conventional backends. The companion --tokenizer-mode fastokens + pip install fastokens claim is similarly absent from vLLM's valid --tokenizer-mode values (auto/slow/mistral/custom), reinforcing that this recipe was not actually run against a vLLM server.
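
A recipe could be screened for unrecognized backend names before it reaches CI. The sketch below is illustrative only: the allow-list is copied from the set quoted in this comment and has not been checked against vLLM itself, the case-insensitive comparison is an assumption, and `unknown_backends` is a hypothetical helper name.

```python
import shlex

# Backend names as quoted in the review comment; NOT verified against vLLM.
KNOWN_MLA_BACKENDS = {
    "trtllm_mla",
    "flashinfer_mla",
    "flash_attn_mla",
    "triton_mla",
    "cutlass_mla",
    "flashmla",
}


def unknown_backends(command: str, flag: str = "--attention-backend") -> list[str]:
    """Return values passed to `flag` that are not in the allow-list."""
    # Join shell line continuations before tokenizing.
    tokens = shlex.split(command.replace("\\\n", " "))
    found = []
    for i, tok in enumerate(tokens):
        if tok == flag and i + 1 < len(tokens):
            value = tokens[i + 1]          # form: --flag VALUE
        elif tok.startswith(flag + "="):
            value = tok.split("=", 1)[1]   # form: --flag=VALUE
        else:
            continue
        # Assumption: backend names are compared case-insensitively.
        if value.lower() not in KNOWN_MLA_BACKENDS:
            found.append(value)
    return found


cmd = "vllm serve model \\\n  --attention-backend TOKENSPEED_MLA \\\n  --port 8888"
print(unknown_backends(cmd))
```

If a new backend is later confirmed upstream, the fix is to extend the allow-list, which keeps the evidence for that backend in one reviewable place.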

Problem 2 — --attention-config.X should be --attention_config.X

vLLM's nested-config CLI uses the Python attribute name as the prefix. The pydantic field is attention_config (underscore). Every other call site in the repo follows that:

  • benchmarks/single_node/dsv4_fp4_b300_vllm.sh:80 (--attention_config.use_fp4_indexer_cache True)
  • benchmarks/single_node/dsv4_fp4_b300_vllm_mtp.sh:73
  • benchmarks/single_node/dsv4_fp4_b200_vllm.sh:88
  • benchmarks/single_node/dsv4_fp4_b200_vllm_mtp.sh:84
  • benchmarks/single_node/agentic/dsv4_fp4_b300_vllm.sh:132
  • benchmarks/single_node/agentic/dsv4_fp4_b200_vllm.sh:132
  • perf-changelog.yaml:1750, 1951

Only this new file uses --attention-config.mla_prefill_backend (hyphen between attention and config, but underscore inside the sub-key). Mixing styles inside a single dotted token will not match the registered prefix; argparse will either reject it as unknown or silently drop it, leaving the prefill backend at whatever --attention-backend selected.
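
The prefix convention described here can be checked mechanically. The following is a minimal lint sketch, assuming the underscore form is the convention to enforce in this repo; it makes no claim about how vLLM's parser treats the hyphenated form, and `hyphenated_dotted_flags` is a hypothetical name.

```python
import re

# Matches dotted flags such as --attention-config.mla_prefill_backend:
# group 1 is the prefix before the first dot, group 2 is the dotted sub-key.
DOTTED_FLAG_RE = re.compile(r"--([A-Za-z0-9_-]+)\.([A-Za-z0-9_.]+)")


def hyphenated_dotted_flags(script_text: str) -> list[tuple[str, str]]:
    """Return (found, suggested) pairs for dotted flags whose prefix has a hyphen."""
    fixes = []
    for match in DOTTED_FLAG_RE.finditer(script_text):
        prefix, rest = match.group(1), match.group(2)
        if "-" in prefix:
            suggestion = f"--{prefix.replace('-', '_')}.{rest}"
            fixes.append((match.group(0), suggestion))
    return fixes


script = (
    "--attention-backend TOKENSPEED_MLA \\\n"
    "--attention-config.mla_prefill_backend TOKENSPEED_MLA \\\n"
    "--attention_config.use_fp4_indexer_cache True\n"
)
for found, suggested in hyphenated_dotted_flags(script):
    print(f"{found} -> {suggested}")
```

Plain flags such as --attention-backend are left alone because they contain no dot; only the prefix of a dotted flag is inspected.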

Step-by-step proof of failure

  1. The CI launcher invokes the script with MODEL=nvidia/DeepSeek-R1-0528-FP4-V2, sets TP, EP_SIZE, etc. from the dsr1-fp4-b300-vllm entry in nvidia-master.yaml.
  2. The script runs pip install -q 'fastokens>=0.1.1,<0.2.0' — this either fails (no such package on the official index for vLLM) or installs a third-party package unrelated to vLLM.
  3. The script execs vllm serve … --attention-backend TOKENSPEED_MLA ….
  4. vLLM validates --attention-backend against a closed enum during engine init. TOKENSPEED_MLA is not in the enum → engine raises and exits with an unknown-backend error before any model weights are loaded.
  5. wait_for_server_ready times out (or sees the dead PID) → the benchmark never reaches run_benchmark_serving.
  6. Even in the counterfactual where the backend existed, step 3 would still fail to apply the prefill setting because --attention-config.mla_prefill_backend does not match the registered attention_config prefix.

Impact

The entire dsr1-fp4-b300-vllm recipe is non-functional as written. It will fail at server-start time in CI and produce no benchmark data. Since the PR's stated purpose is to add this benchmark, the recipe needs working flags before it provides any value.

How to fix

Replace the two flags with a real backend and the correct dotted prefix, mirroring the existing DSR1 FP4 B300 recipe:

--attention-backend trtllm_mla \
--attention_config.mla_prefill_backend trtllm_mla \

Also drop --tokenizer-mode fastokens and the pip install fastokens line unless a real upstream feature is identified — the default tokenizer mode is auto and works with this checkpoint. The PR description, nvidia-master.yaml comment, and perf-changelog.yaml entry should be updated to reflect the actual backend used.

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Model loading can exceed the default timeout for the 394B FP4 checkpoint.

🟡 The comment on line 30 says "the 394B FP4 checkpoint," but the model loaded here (nvidia/DeepSeek-R1-0528-FP4-V2) is DeepSeek-R1, a 671B-parameter MoE — the repo's own multi-node SGLang recipe at benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:55 cites "671B FP4 weights" for the same checkpoint family. The 3600s VLLM_ENGINE_READY_TIMEOUT_S export still works, so this is purely a comment fix, but worth correcting so the justification for the bumped timeout does not mislead future readers tuning this for related models.

Extended reasoning...

What the bug is

benchmarks/single_node/dsr1_fp4_b300_vllm.sh:30 contains the comment:

# Model loading can exceed the default timeout for the 394B FP4 checkpoint.
export VLLM_ENGINE_READY_TIMEOUT_S=3600

The parameter count is wrong. The script loads nvidia/DeepSeek-R1-0528-FP4-V2 (set via the MODEL env var; the dsr1-fp4-b300-vllm entry in .github/configs/nvidia-master.yaml selects exactly this checkpoint). DeepSeek-R1 / R1-0528 is the publicly documented DeepSeek-V3-architecture MoE with 671B total parameters (37B activated), not 394B.

How it manifests

No runtime impact: VLLM_ENGINE_READY_TIMEOUT_S=3600 is exported regardless of the comment text. The bug is purely documentation: any future reader trying to understand why the timeout was bumped to 3600s will see a parameter count that does not match the checkpoint, and may either lose confidence in the comment or copy the same incorrect figure into related scripts.

Why existing code does not prevent it

Nothing else in the script references the parameter count; the comment is free-form text. There is also a clear in-repo convention for citing this number correctly:

  • benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:55 references "671B FP4 weights" for the same DSR1 FP4 checkpoint family and uses that to justify a similar loading-timeout note.

So the 394B figure looks like a copy-paste artifact from a different model variant rather than an intentional choice.

Step-by-step proof

  1. Open benchmarks/single_node/dsr1_fp4_b300_vllm.sh, line 30: comment says # Model loading can exceed the default timeout for the 394B FP4 checkpoint.
  2. The script reads MODEL from the environment. In .github/configs/nvidia-master.yaml (the diff in this PR, lines added at 1894–1916), the new dsr1-fp4-b300-vllm entry sets model: nvidia/DeepSeek-R1-0528-FP4-V2.
  3. nvidia/DeepSeek-R1-0528-FP4-V2 is an FP4 quantization of DeepSeek-R1-0528, which is the DeepSeek-V3-architecture MoE with 671B total parameters.
  4. Cross-check the convention in this same repo: benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:55 says "671B FP4 weights" for the same family.
  5. Therefore the 394B figure does not match the loaded checkpoint.

Impact

None at runtime. Misleading documentation only. Severity is nit, but worth fixing because the comment is the only justification for the 3600s timeout and a wrong parameter count weakens that justification for anyone porting this recipe to other DeepSeek variants.

How to fix

Replace 394B with 671B (matching the repo's own convention) on line 30, e.g.:

# Model loading can exceed the default timeout for the 671B FP4 checkpoint.
export VLLM_ENGINE_READY_TIMEOUT_S=3600

Alternatively, generalize to "large FP4 checkpoint" to avoid the same drift when the recipe is adapted to other variants.
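
A back-of-the-envelope estimate shows why the parameter count matters for the timeout justification. This sketch assumes 4 bits for every parameter and ignores quantization scales and any layers kept in higher precision, so real checkpoint sizes will differ; `approx_weight_gb` is a hypothetical helper.

```python
def approx_weight_gb(total_params: float, bits_per_param: float = 4.0) -> float:
    """Rough weight footprint in decimal gigabytes (bytes / 1e9).

    Assumption: every parameter is stored at `bits_per_param` bits,
    with no scales, metadata, or higher-precision layers.
    """
    return total_params * bits_per_param / 8 / 1e9


# Compare the figure in the comment (394B) with the documented count (671B).
for label, params in (("394B", 394e9), ("671B", 671e9)):
    print(f"{label}: ~{approx_weight_gb(params):.1f} GB at 4 bits/param")
```

Under these assumptions the two figures differ by well over 100 GB of weights to load, which is the kind of gap that changes how a reader would reason about the timeout.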

@functionstackx (Collaborator) left a comment

Vllm for Dsr1? Is ur /loop wack?


Oseltamivir deleted the add-dsr1-fp4-b300-vllm-tokenspeed branch on May 25, 2026 22:28