23 changes: 23 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -1891,6 +1891,29 @@ dsr1-fp4-b300-sglang:
          - { tp: 4, ep: 4, conc-start: 1, conc-end: 128 }
          - { tp: 8, ep: 8, conc-start: 1, conc-end: 16 }

# vLLM v0.21.0 includes TokenSpeed MLA for DeepSeek-R1-shaped MLA on
# Blackwell with FP8 KV cache; the launcher installs its optional fastokens dependency.
dsr1-fp4-b300-vllm:
  image: vllm/vllm-openai:v0.21.0
  model: nvidia/DeepSeek-R1-0528-FP4-V2
  model-prefix: dsr1
  runner: b300
  precision: fp4
  framework: vllm
  multinode: false
  scenarios:
    fixed-seq-len:
      - isl: 1024
        osl: 1024
        search-space:
          - { tp: 4, ep: 4, conc-start: 1, conc-end: 128 }
          - { tp: 8, ep: 8, conc-start: 1, conc-end: 128 }
      - isl: 8192
        osl: 1024
        search-space:
          - { tp: 4, ep: 4, conc-start: 1, conc-end: 128 }
          - { tp: 8, ep: 8, conc-start: 1, conc-end: 16 }

dsr1-fp4-b200-trt:
  image: nvcr.io#nvidia/tensorrt-llm/release:1.3.0rc14
  model: nvidia/DeepSeek-R1-0528-FP4-V2
95 changes: 95 additions & 0 deletions benchmarks/single_node/dsr1_fp4_b300_vllm.sh
@@ -0,0 +1,95 @@
#!/usr/bin/env bash

# DeepSeek-R1 FP4 B300 vLLM run for the TokenSpeed MLA and fastokens paths
# introduced in vLLM v0.21.0. TokenSpeed MLA requires Blackwell and FP8 KV.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    MAX_MODEL_LEN \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Model loading can exceed the default timeout for the 394B FP4 checkpoint.
Contributor

🟡 The comment on line 30 says "the 394B FP4 checkpoint," but the model loaded here (nvidia/DeepSeek-R1-0528-FP4-V2) is DeepSeek-R1, a 671B-parameter MoE — the repo's own multi-node SGLang recipe at benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:55 cites "671B FP4 weights" for the same checkpoint family. The 3600s VLLM_ENGINE_READY_TIMEOUT_S export still works, so this is purely a comment fix, but worth correcting so the justification for the bumped timeout does not mislead future readers tuning this for related models.

Extended reasoning...

What the bug is

benchmarks/single_node/dsr1_fp4_b300_vllm.sh:30 contains the comment:

# Model loading can exceed the default timeout for the 394B FP4 checkpoint.
export VLLM_ENGINE_READY_TIMEOUT_S=3600

The parameter count is wrong. The script loads nvidia/DeepSeek-R1-0528-FP4-V2 (set via the MODEL env var; the dsr1-fp4-b300-vllm entry in .github/configs/nvidia-master.yaml selects exactly this checkpoint). DeepSeek-R1 / R1-0528 is the publicly documented DeepSeek-V3-architecture MoE with 671B total parameters (37B activated), not 394B.

How it manifests

No runtime impact: VLLM_ENGINE_READY_TIMEOUT_S=3600 is exported regardless of the comment text. The bug is purely documentation: any future reader trying to understand why the timeout was bumped to 3600s will see a parameter count that does not match the checkpoint, and may either lose confidence in the comment or copy the same incorrect figure into related scripts.

Why existing code does not prevent it

Nothing else in the script references the parameter count; the comment is free-form text. There is also a clear in-repo convention for citing this number correctly:

  • benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:55 references "671B FP4 weights" for the same DSR1 FP4 checkpoint family and uses that to justify a similar loading-timeout note.

So the 394B figure looks like a copy-paste artifact from a different model variant rather than an intentional choice.

Step-by-step proof

  1. Open benchmarks/single_node/dsr1_fp4_b300_vllm.sh, line 30: the comment says # Model loading can exceed the default timeout for the 394B FP4 checkpoint.
  2. The script reads MODEL from the environment. In .github/configs/nvidia-master.yaml (the diff in this PR, lines added at 1894–1916), the new dsr1-fp4-b300-vllm entry sets model: nvidia/DeepSeek-R1-0528-FP4-V2.
  3. nvidia/DeepSeek-R1-0528-FP4-V2 is an FP4 quantization of DeepSeek-R1-0528, which is the DeepSeek-V3-architecture MoE with 671B total parameters.
  4. Cross-check the convention in this same repo: benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-gb300-1p1d-tp4-tp4-2-c1.yaml:55 says "671B FP4 weights" for the same family.
  5. Therefore the 394B figure does not match the loaded checkpoint.

Impact

None at runtime. Misleading documentation only. Severity is nit, but worth fixing because the comment is the only justification for the 3600s timeout and a wrong parameter count weakens that justification for anyone porting this recipe to other DeepSeek variants.

How to fix

Replace 394B with 671B (matching the repo's own convention) on line 30, e.g.:

# Model loading can exceed the default timeout for the 671B FP4 checkpoint.
export VLLM_ENGINE_READY_TIMEOUT_S=3600

Alternatively, generalize to "large FP4 checkpoint" to avoid the same drift when the recipe is adapted to other variants.
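
To give a rough sense of why the parameter count matters for the timeout justification, here is a back-of-the-envelope sketch. It assumes every parameter is stored at exactly 4 bits, which ignores quantization scales and any layers kept at higher precision, so it likely understates the real checkpoint size. approx_weight_gb is a hypothetical helper, not something in this repo.

```shell
# approx_weight_gb PARAMS_IN_BILLIONS BITS_PER_PARAM
# Prints params * bits / 8 as whole decimal gigabytes (integer division truncates).
# Assumption: uniform bit width for all weights; scales and metadata are ignored.
approx_weight_gb() {
    echo $(( $1 * $2 / 8 ))
}

approx_weight_gb 671 4   # parameter count the review recommends citing; prints 335
approx_weight_gb 394 4   # parameter count in the current script comment; prints 197
```

Under that assumption the two figures differ by well over 100 GB of weights to load, which is why a reader sizing the timeout for a related model could be misled by the smaller number.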

export VLLM_ENGINE_READY_TIMEOUT_S=3600

EP_ARGS=()
if [ "$EP_SIZE" -gt 1 ]; then
    EP_ARGS=(--enable-expert-parallel)
fi

BENCHMARK_MAX_MODEL_LEN="$MAX_MODEL_LEN"
if [ "${EVAL_ONLY}" = "true" ]; then
    EVAL_MAX_MODEL_LEN=$(compute_eval_context_length "$MODEL" "$BENCHMARK_MAX_MODEL_LEN")
    export EVAL_MAX_MODEL_LEN
    SERVE_MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
else
    SERVE_MAX_MODEL_LEN="$BENCHMARK_MAX_MODEL_LEN"
fi

MAX_NUM_BATCHED_TOKENS=$(( ISL * 2 ))

# vLLM v0.21.0 integrates the pre-v0.2 fastokens patching API as an optional package.
pip install -q 'fastokens>=0.1.1,<0.2.0' datasets pandas
Comment on lines +49 to +50
Contributor

🔴 Line 50 pip installs the unattributed fastokens>=0.1.1,<0.2.0 package from PyPI on every CI run and then loads it into the running vLLM server process via --tokenizer-mode fastokens (line 65, alongside --trust-remote-code). The package has empty PyPI metadata (no author, email, maintainer, license, or description) and a homepage at github.com/Atero-ai/fastokens — a namespace with no apparent affiliation with vllm-project — making this the only package in the entire benchmarks/** directory installed from an unfamiliar/unattributed source. Please pin by hash and document provenance (or vendor under vllm-project) before landing.

Extended reasoning...

Supply-chain risk: unverified third-party package loaded into the serving process

What the bug is

benchmarks/single_node/dsr1_fp4_b300_vllm.sh:49-50 performs an unhashed PyPI install of fastokens>=0.1.1,<0.2.0, and line 65 activates the package's code path inside the running vllm serve process via --tokenizer-mode fastokens. Because the same vllm serve invocation also passes --trust-remote-code (line 60), the unverified tokenizer code is loaded into a context that already trusts arbitrary remote Python — there is no sandbox or process boundary between the third-party tokenizer and the rest of the model server.

Why fastokens is not a normal dependency

Direct PyPI metadata inspection of fastokens shows:

  • author = null, author_email = null, maintainer = null
  • license = null, description = empty string, summary = null
  • home_page = https://github.com/Atero-ai/fastokens — an org with no apparent affiliation with vllm-project
  • Version history: 0.1.1 (2026-04-18), 0.1.2 (2026-05-07), 0.2.0 (2026-05-17) — a brand-new package with no track record, shipping cp313 + cp39-abi3 native-extension wheels

By contrast, every other pip install across benchmarks/**/*.sh (transformers, huggingface-hub, datasets, pandas, flashinfer_python, amd-quark) pulls a well-known package from an identifiable maintainer (HF, NVIDIA, AMD). fastokens is the only package in that set with no attributed author or maintainer.

Why the PR's framing does not justify it

The YAML comment at .github/configs/nvidia-master.yaml calls fastokens vLLM's optional dependency, and the shell-script comment calls it the pre-v0.2 fastokens patching API integrated by vLLM v0.21.0. But vLLM v0.21.0's published requires_dist does not list fastokens at all, and no upstream source (release notes, RFC, GitHub PR) is cited. So at best this is a third-party plug-in, not a vLLM-maintained component — the YAML's framing overstates the package's provenance.

Concrete attack path

  1. An attacker compromises the Atero-ai GitHub/PyPI account, or registers a malicious successor version inside the unpinned range >=0.1.1,<0.2.0.
  2. CI runs pip install -q 'fastokens>=0.1.1,<0.2.0' ... — the install step alone can execute arbitrary Python via setup hooks or post-install code in the wheel's __init__.
  3. vllm serve ... --tokenizer-mode fastokens then imports the package into the long-running server process holding GPU and any cloud/registry credentials present on the benchmark runner.
  4. Arbitrary code now runs on every B300 benchmark execution.

How to fix

At minimum:

  • Pin by hash using pip install --require-hashes with a curated requirements.txt, so a malicious 0.1.3 upload cannot silently roll in via the <0.2.0 range, or
  • Vendor the tokenizer code under vllm-project (or this repo) so its provenance is auditable, or
  • Document the package's origin and the security review that approved it in the YAML comment.

If fastokens is not actually necessary for the benchmark (the sibling dsv4_fp4_b300_vllm.sh recipe runs without it), the cleanest fix is to drop the pip install and --tokenizer-mode fastokens flag entirely and use the default tokenizer mode.
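
The hash-pinning option could look roughly like the sketch below. It assumes the wheel has already been downloaded and reviewed locally. pin_line is a hypothetical helper (not part of pip or this repo) that turns a wheel file into a requirements line in pip's hash-checking format, and the paths in the usage comments are placeholders. If fastokens pulls in dependencies, they would need hashes as well, because --require-hashes applies to every requirement in the install.

```shell
# pin_line NAME VERSION WHEEL_PATH
# Prints "NAME==VERSION --hash=sha256:<digest>" for one local wheel file.
# Assumes GNU coreutils sha256sum is available on the runner.
pin_line() {
    [ -f "$3" ] || { echo "no such file: $3" >&2; return 1; }
    # sha256sum prints "<digest>  <path>"; awk keeps only the digest.
    printf '%s==%s --hash=sha256:%s\n' "$1" "$2" "$(sha256sum "$3" | awk '{print $1}')"
}

# Intended use (needs network access, so it is left commented out here):
#   pip download --no-deps --only-binary=:all: -d /tmp/wheels 'fastokens==0.1.2'
#   pin_line fastokens 0.1.2 /tmp/wheels/<downloaded-wheel>.whl > requirements-fastokens.txt
#   pip install --require-hashes -r requirements-fastokens.txt
```

The generated file would be committed and reviewed, so a later upload inside the old version range cannot change what CI installs. When a release ships several wheels, the requirement line needs one --hash option per wheel that a runner might select.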

On the duplicate-of-bug_002 refutation

A separate verifier flagged this as a duplicate of an earlier functional-failure report. The two findings overlap on the same lines but describe distinct concerns. The functional report covers the claim that the recipe will not run as written (CLI rejection / runtime failure). This one covers the case where the recipe does run: it would then execute unattributed third-party code inside the model server alongside --trust-remote-code. The remediation set is different too: removing the flag fixes the functional issue, but if maintainers decide to keep fastokens (e.g. once upstream support is confirmed), the supply-chain remediation (hash-pinning or vendoring) still applies. Filing this separately so the security framing is not lost if the functional concern is resolved by adding the flag rather than removing it.


# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
    --tensor-parallel-size "$TP" \
    --pipeline-parallel-size 1 \
    --kv-cache-dtype fp8 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    "${EP_ARGS[@]}" \
    --attention-backend TOKENSPEED_MLA \
    --attention-config.mla_prefill_backend TOKENSPEED_MLA \
Comment on lines +63 to +64
Contributor

🔴 The script selects --attention-backend TOKENSPEED_MLA and --attention-config.mla_prefill_backend TOKENSPEED_MLA on lines 63-64, but neither flag is valid as written: TOKENSPEED_MLA is not a registered vLLM MLA backend (every other DSR1 MLA recipe uses trtllm_mla or flashinfer_mla), and the --attention-config.X form uses a hyphen where every other call site in the repo uses --attention_config.X (underscore, matching the Python attribute name). vllm serve will exit at startup with an unknown-backend error, so the benchmark never runs; even if it did, the hyphenated nested-config flag would silently fall back to the default prefill backend. Replace with a real backend (e.g. trtllm_mla) and use the underscore prefix --attention_config.mla_prefill_backend.

Extended reasoning...

What the bug is

Lines 63-64 of benchmarks/single_node/dsr1_fp4_b300_vllm.sh configure the attention backend with two flags that do not work as written:

--attention-backend TOKENSPEED_MLA \
--attention-config.mla_prefill_backend TOKENSPEED_MLA \

This combines two distinct problems: a backend name that does not exist in vLLM, and a dotted-namespace prefix that uses the wrong separator.

Problem 1 — TOKENSPEED_MLA is not a real vLLM backend

TOKENSPEED_MLA appears only in this PR (the script, the nvidia-master.yaml entry, and the perf-changelog.yaml entry). Every other DeepSeek MLA recipe in this repo uses a known upstream backend:

| File | Backend |
| --- | --- |
| benchmarks/single_node/dsr1_fp4_b300.sh:52 | trtllm_mla |
| benchmarks/single_node/dsr1_fp4_b200.sh:48 | trtllm_mla |
| benchmarks/single_node/dsr1_fp4_b200_mtp.sh:63 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b300.sh:84 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b300_mtp.sh:88 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b200.sh:80 | trtllm_mla |
| benchmarks/single_node/dsr1_fp8_b200_mtp.sh:84 | trtllm_mla |
| benchmarks/single_node/agentic/dsr1_fp4_b200.sh:58 | trtllm_mla |

vLLM's published MLA backend set is {trtllm_mla, flashinfer_mla, flash_attn_mla, triton_mla, cutlass_mla, flashmla}. The PR description's claim that "vLLM v0.21.0 includes TokenSpeed MLA for DeepSeek-R1-shaped MLA on Blackwell" has no upstream support — the same vllm/vllm-openai:v0.21.0 image is already used by other DSR1 scripts in this repo and they all select conventional backends. The companion --tokenizer-mode fastokens + pip install fastokens claim is similarly absent from vLLM's valid --tokenizer-mode values (auto/slow/mistral/custom), reinforcing that this recipe was not actually run against a vLLM server.

Problem 2 — --attention-config.X should be --attention_config.X

vLLM's nested-config CLI uses the Python attribute name as the prefix. The pydantic field is attention_config (underscore). Every other call site in the repo follows that:

  • benchmarks/single_node/dsv4_fp4_b300_vllm.sh:80: --attention_config.use_fp4_indexer_cache True
  • benchmarks/single_node/dsv4_fp4_b300_vllm_mtp.sh:73
  • benchmarks/single_node/dsv4_fp4_b200_vllm.sh:88
  • benchmarks/single_node/dsv4_fp4_b200_vllm_mtp.sh:84
  • benchmarks/single_node/agentic/dsv4_fp4_b300_vllm.sh:132
  • benchmarks/single_node/agentic/dsv4_fp4_b200_vllm.sh:132
  • perf-changelog.yaml:1750, 1951

Only this new file uses --attention-config.mla_prefill_backend (hyphen between attention and config, but underscore inside the sub-key). Mixing styles inside a single dotted token will not match the registered prefix; argparse will either reject it as unknown or silently drop it, leaving the prefill backend at whatever --attention-backend selected.

Step-by-step proof of failure

  1. The CI launcher invokes the script with MODEL=nvidia/DeepSeek-R1-0528-FP4-V2, sets TP, EP_SIZE, etc. from the dsr1-fp4-b300-vllm entry in nvidia-master.yaml.
  2. The script runs pip install -q 'fastokens>=0.1.1,<0.2.0' — this either fails (no such package on the official index for vLLM) or installs a third-party package unrelated to vLLM.
  3. The script execs vllm serve … --attention-backend TOKENSPEED_MLA ….
  4. vLLM validates --attention-backend against a closed enum during engine init. TOKENSPEED_MLA is not in the enum → engine raises and exits with an unknown-backend error before any model weights are loaded.
  5. wait_for_server_ready times out (or sees the dead PID) → the benchmark never reaches run_benchmark_serving.
  6. Even in the counterfactual where the backend existed, step 3 would still fail to apply the prefill setting because --attention-config.mla_prefill_backend does not match the registered attention_config prefix.

Impact

The entire dsr1-fp4-b300-vllm recipe is non-functional as written. It will fail at server-start time in CI and produce no benchmark data. Since the PR's stated purpose is to add this benchmark, the recipe needs working flags before it provides any value.

How to fix

Replace the two flags with a real backend and the correct dotted prefix, mirroring the existing DSR1 FP4 B300 recipe:

--attention-backend trtllm_mla \
--attention_config.mla_prefill_backend trtllm_mla \

Also drop --tokenizer-mode fastokens and the pip install fastokens line unless a real upstream feature is identified — the default tokenizer mode is auto and works with this checkpoint. The PR description, nvidia-master.yaml comment, and perf-changelog.yaml entry should be updated to reflect the actual backend used.
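
One way to surface this kind of mistake before the server is launched is a small pre-flight check, sketched below. The allow-list is copied from the backend set quoted earlier in this comment and has not been checked against the vLLM build in the image, so treat it as an assumption. is_known_mla_backend and attention_args are hypothetical helpers, not vLLM or benchmark_lib.sh functions.

```shell
# Returns 0 if $1 is in the allow-list, 1 otherwise.
# Assumption: the list mirrors the backend set quoted in the review comment.
is_known_mla_backend() {
    case "$1" in
        trtllm_mla|flashinfer_mla|flash_attn_mla|triton_mla|cutlass_mla|flashmla) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints the two attention flags, one token per line, or fails with a message
# on stderr so the script can stop before loading any weights.
attention_args() {
    if ! is_known_mla_backend "$1"; then
        echo "unknown MLA backend: $1" >&2
        return 1
    fi
    printf '%s\n' --attention-backend "$1" --attention_config.mla_prefill_backend "$1"
}

# Possible use inside the launcher (tokens contain no spaces, so the
# unquoted expansion below splits them as intended):
#   ATTENTION_ARGS="$(attention_args trtllm_mla)" || exit 1
#   vllm serve "$MODEL" $ATTENTION_ARGS ...
```

Failing in the script gives a one-line error at the point of the typo instead of a server-start failure that has to be dug out of the server log.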

    --tokenizer-mode fastokens \
    --max-model-len "$SERVE_MAX_MODEL_LEN" \
    --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS" > "$SERVER_LOG" 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
6 changes: 6 additions & 0 deletions perf-changelog.yaml
@@ -3136,3 +3136,9 @@
  description:
    - "Add --use-chat-template to run_benchmark_serving so prompts are formatted with the Qwen chat template (matching the other Qwen MTP recipes)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1555

- config-keys:
    - dsr1-fp4-b300-vllm
  description:
    - "Add DeepSeek-R1 FP4 B300 vLLM benchmark on vllm/vllm-openai:v0.21.0 with TOKENSPEED_MLA prefill/decode and fastokens tokenization"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1562