[OMNIML-4969] specdec_bench cell t0_d3 — nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 / dflash / vllm (NVIDIA#1656)

ChenhanYu · web-flow · commit 111b7ebd29a6 · 2026-06-09T23:59:50.000Z
## Summary
- Add specdec_bench cell t0_d3 for NVIDIA-Nemotron-3-Super-120B-A12B-BF16 dflash/vLLM

## Testing
- Not run (cell YAML only)

## Summary by CodeRabbit

* **New Features**
  * Added a new benchmark configuration that runs NVIDIA Nemotron-3-Super-120B with vLLM using the DFlash speculative-decoding method.
  * Provides both qualitative and high-throughput benchmarking scenarios, tunable concurrency and request counts, and automated job execution with Slurm-compatible orchestration and output saving.

---------

Signed-off-by: chenhany <chenhany@nvidia.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Signed-off-by: Chenhan D. Yu <5185878+ChenhanYu@users.noreply.github.com>
diff --git a/tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/specdec_bench_dflash_vllm.yaml b/tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/specdec_bench_dflash_vllm.yaml
@@ -0,0 +1,82 @@
+# SPEED-bench DFlash speculative-decoding run for NVIDIA-Nemotron-3-Super-120B-A12B-BF16 via vLLM.
+#
+# vLLM v0.22.0+ required for DFlash spec-decode method
+# (vllm/v1/spec_decode/dflash.py). Use vllm/vllm-openai:v0.22.1 — earlier
+# images reject method='dflash' with `argparse: invalid choice: 'dflash'
+# (choose from ... DFLASH ...)` on the engine side.
+#
+# Nemotron-3-Super-120B-A12B is 120B total params (MoE; 12B active per
+# token). BF16 weights = 240 GB total, so tp_size=4 minimum on 80 GB
+# H100/A100. Surfaced on OMNIML-4969: tp_size=2 OOMed at model load
+# (~120 GB of weights per GPU at tp=2, only 79.11 GiB available).
+#
+# Slurm run on cw_dfw — cells override per-cell runtime_params,
+# --save_dir, --block_size, --num_requests via pipeline.task_N.args+=[...]:
+#
+#   uv run slurm.py --yaml modules/Model-Optimizer/tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/specdec_bench_dflash_vllm.yaml --yes detach=true pipeline.task_0.args+=["--runtime_params common/specdec_bench/_cells/<sweep_name>.yaml","--save_dir /scratchspace/<sweep>/qualitative","--block_size 4"] pipeline.task_1.args+=["--runtime_params common/specdec_bench/_cells/<sweep_name>.yaml","--save_dir /scratchspace/<sweep>/throughput_32k","--num_requests 80","--block_size 4"]
+#
+# Reference run: cicd_1781024226 (cw_dfw) produced
+#   qualitative   Average_AL = 2.7316  (11-cat breakdown)
+#   throughput_32k Average_AL = 1.1803  (3-band breakdown)
+# See OMNIML-4969 INTERN-ARTIFACTS worklog for the full payload.
+
+job_name: NVIDIA-Nemotron-3-Super-120B-A12B-BF16_specdec_bench_dflash_vllm
+
+pipeline:
+  global_vars:
+    hf_model: /hf-local/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
+
+  # task_0: SPEED qualitative split
+  task_0:
+    script: common/specdec_bench/run.sh
+    args:
+      - --dataset speed
+      - --dataset_path /hf-local/nvidia/SPEED-Bench-Internal/qualitative
+      - --engine VLLM
+      - --speculative_algorithm DFLASH
+      - --draft_model_dir /hf-local/nvidia/dflash-nemotron-3-super-data-v2-step120k
+      - --block_size 4
+      - --tp_size 4
+      - --ep_size 1
+      - --concurrency 32
+      - --output_length 4096
+      - --aa_timing
+      - --show_progress
+      - --save_dir /scratchspace/{sweep_name_default}/qualitative
+    environment:
+      - HF_MODEL_CKPT: <<global_vars.hf_model>>
+      - HF_LOCAL: /hf-local
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 4
+      container: vllm/vllm-openai:v0.22.1
+
+  # task_1: SPEED throughput_32k split
+  task_1:
+    script: common/specdec_bench/run.sh
+    args:
+      - --dataset speed
+      - --dataset_path /hf-local/nvidia/SPEED-Bench-Internal/throughput_32k
+      - --engine VLLM
+      - --speculative_algorithm DFLASH
+      - --draft_model_dir /hf-local/nvidia/dflash-nemotron-3-super-data-v2-step120k
+      - --block_size 4
+      - --tp_size 4
+      - --ep_size 1
+      - --concurrency 8
+      - --num_requests 80
+      - --output_length 4096
+      - --aa_timing
+      - --show_progress
+      - --save_dir /scratchspace/{sweep_name_default}/throughput_32k
+    environment:
+      - HF_MODEL_CKPT: <<global_vars.hf_model>>
+      - HF_LOCAL: /hf-local
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 4
+      container: vllm/vllm-openai:v0.22.1
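
The `tp_size 4` choice in the YAML header comes down to arithmetic on the BF16 weight footprint. A minimal sketch of that check follows, assuming weights shard evenly across tensor-parallel ranks and ignoring KV cache, activations, and the draft model (so real usage is higher). The helper names `weight_gb_per_gpu` and `fits` are hypothetical and not part of specdec_bench or vLLM.

```python
# Back-of-the-envelope BF16 weight memory per GPU under tensor parallelism.
# Assumption: weights shard evenly across TP ranks; KV cache, activations,
# and the draft model are not counted.

BYTES_PER_BF16_PARAM = 2
GIB = 1024 ** 3


def weight_gb_per_gpu(total_params: float, tp_size: int) -> float:
    """Decimal GB of BF16 weights held by each GPU at the given TP size."""
    return total_params * BYTES_PER_BF16_PARAM / tp_size / 1e9


def fits(total_params: float, tp_size: int, usable_gib: float) -> bool:
    """True if the weight shard alone is smaller than the usable memory."""
    shard_bytes = total_params * BYTES_PER_BF16_PARAM / tp_size
    return shard_bytes < usable_gib * GIB


if __name__ == "__main__":
    total_params = 120e9  # 120B total parameters, from the YAML comment
    usable_gib = 79.11    # per-GPU figure quoted for the tp_size=2 OOM
    for tp in (2, 4):
        gb = weight_gb_per_gpu(total_params, tp)
        print(f"tp={tp}: {gb:.0f} GB/GPU, fits={fits(total_params, tp, usable_gib)}")
```

Under these assumptions, the tp=2 shard exceeds a single 80 GB device while the tp=4 shard leaves headroom for the KV cache, which is consistent with the OOM reported on OMNIML-4969.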