Commit 111b7eb

[OMNIML-4969] specdec_bench cell t0_d3 — nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 / dflash / vllm (NVIDIA#1656)
## Summary
- Add specdec_bench cell t0_d3 for NVIDIA-Nemotron-3-Super-120B-A12B-BF16 dflash/vLLM

## Testing
- Not run (cell YAML only)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Added a new benchmark configuration to run Nemotron-3-Super-120B on NVIDIA with vLLM using the DFlash speculative-decoding method.
  * Provides both qualitative and high-throughput benchmarking scenarios, tunable concurrency and request counts, and automated job execution with Slurm-compatible orchestration and output saving.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: chenhany <chenhany@nvidia.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Signed-off-by: Chenhan D. Yu <5185878+ChenhanYu@users.noreply.github.com>
1 parent 5584ce4 commit 111b7eb

1 file changed

Lines changed: 82 additions & 0 deletions

@@ -0,0 +1,82 @@
+# SPEED-bench DFlash speculative-decoding run for NVIDIA-Nemotron-3-Super-120B-A12B-BF16 via vLLM.
+#
+# vLLM v0.22.0+ required for DFlash spec-decode method
+# (vllm/v1/spec_decode/dflash.py). Use vllm/vllm-openai:v0.22.1; earlier
+# images reject method='dflash' with `argparse: invalid choice: 'dflash'
+# (choose from ... DFLASH ...)` on the engine side.
+#
+# Nemotron-3-Super-120B-A12B is 120B total params (MoE; 12B active per
+# token). BF16 weights = 240 GB total, so tp_size=4 minimum on 80 GB
+# H100/A100. Surfaced on OMNIML-4969: tp_size=2 OOMed at model load
+# (~120 GB of weights per GPU at tp=2, only 79.11 GiB available).
+#
+# Slurm run on cw_dfw; cells override per-cell runtime_params,
+# --save_dir, --block_size, --num_requests via pipeline.task_N.args+=[...]:
+#
+# uv run slurm.py # --yaml modules/Model-Optimizer/tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/specdec_bench_dflash_vllm.yaml # --yes detach=true # pipeline.task_0.args+=["--runtime_params common/specdec_bench/_cells/<sweep_name>.yaml","--save_dir /scratchspace/<sweep>/qualitative","--block_size 4"] # pipeline.task_1.args+=["--runtime_params common/specdec_bench/_cells/<sweep_name>.yaml","--save_dir /scratchspace/<sweep>/throughput_32k","--num_requests 80","--block_size 4"]
+#
+# Reference run: cicd_1781024226 (cw_dfw) produced
+# qualitative Average_AL = 2.7316 (11-cat breakdown)
+# throughput_32k Average_AL = 1.1803 (3-band breakdown)
+# See OMNIML-4969 INTERN-ARTIFACTS worklog for the full payload.
+
+job_name: NVIDIA-Nemotron-3-Super-120B-A12B-BF16_specdec_bench_dflash_vllm
+
+pipeline:
+  global_vars:
+    hf_model: /hf-local/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
+
+  # task_0: SPEED qualitative split
+  task_0:
+    script: common/specdec_bench/run.sh
+    args:
+      - --dataset speed
+      - --dataset_path /hf-local/nvidia/SPEED-Bench-Internal/qualitative
+      - --engine VLLM
+      - --speculative_algorithm DFLASH
+      - --draft_model_dir /hf-local/nvidia/dflash-nemotron-3-super-data-v2-step120k
+      - --block_size 4
+      - --tp_size 4
+      - --ep_size 1
+      - --concurrency 32
+      - --output_length 4096
+      - --aa_timing
+      - --show_progress
+      - --save_dir /scratchspace/{sweep_name_default}/qualitative
+    environment:
+      - HF_MODEL_CKPT: <<global_vars.hf_model>>
+      - HF_LOCAL: /hf-local
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 4
+      container: vllm/vllm-openai:v0.22.1
+
+  # task_1: SPEED throughput_32k split
+  task_1:
+    script: common/specdec_bench/run.sh
+    args:
+      - --dataset speed
+      - --dataset_path /hf-local/nvidia/SPEED-Bench-Internal/throughput_32k
+      - --engine VLLM
+      - --speculative_algorithm DFLASH
+      - --draft_model_dir /hf-local/nvidia/dflash-nemotron-3-super-data-v2-step120k
+      - --block_size 4
+      - --tp_size 4
+      - --ep_size 1
+      - --concurrency 8
+      - --num_requests 80
+      - --output_length 4096
+      - --aa_timing
+      - --show_progress
+      - --save_dir /scratchspace/{sweep_name_default}/throughput_32k
+    environment:
+      - HF_MODEL_CKPT: <<global_vars.hf_model>>
+      - HF_LOCAL: /hf-local
+    slurm_config:
+      _factory_: "slurm_factory"
+      nodes: 1
+      ntasks_per_node: 1
+      gpus_per_node: 4
+      container: vllm/vllm-openai:v0.22.1
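
Note on the tp_size choice: the header comment's sizing argument (120B params in BF16, 80 GB GPUs) can be checked with a rough calculation. The sketch below is illustrative only and is not part of the commit. It assumes weights shard evenly across tensor-parallel ranks and ignores KV cache, activations, and the draft model, so real headroom is smaller than it suggests. The function names and the power-of-two candidate list are assumptions made for this example.

```python
# Rough weight-footprint check for tensor-parallel sharding.
# Assumption: weights split evenly across ranks; KV cache, activations,
# and the DFlash draft model are NOT counted here.

GIB_TO_GB = 2**30 / 1e9  # 1 GiB expressed in decimal GB


def weight_gb_per_gpu(total_params_b: float, bytes_per_param: int, tp_size: int) -> float:
    """Approximate weight size per GPU in decimal GB for a given tp_size."""
    return total_params_b * bytes_per_param / tp_size


def min_tp_size(total_params_b, bytes_per_param, gpu_gb, candidates=(1, 2, 4, 8)):
    """Smallest candidate tp_size whose per-GPU weights fit in gpu_gb, else None."""
    for tp in candidates:
        if weight_gb_per_gpu(total_params_b, bytes_per_param, tp) < gpu_gb:
            return tp
    return None


if __name__ == "__main__":
    usable_gb = 79.11 * GIB_TO_GB  # the 79.11 GiB reported in the OOM, in GB
    for tp in (2, 4):
        need = weight_gb_per_gpu(120, 2, tp)  # 120B params, 2 bytes each (BF16)
        print(f"tp={tp}: ~{need:.0f} GB weights per GPU, {usable_gb:.1f} GB usable")
    print("min tp_size:", min_tp_size(120, 2, usable_gb))
```

Under these assumptions tp=2 needs about 120 GB of weights per GPU and cannot fit, while tp=4 needs about 60 GB, which matches the `--tp_size 4` and `gpus_per_node: 4` settings in both tasks.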

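Note on the per-cell overrides: the launch command in the header comment is hard to read on one line, so the sketch below rebuilds it as an argument list. It only formats the strings shown in that comment and does not call or model `slurm.py` itself. The helper names are hypothetical, as is the choice to treat `<sweep_name>` (the cell file) and `<sweep>` (the scratch directory) as two separate inputs, and the example values `t0_d3` and `demo` are placeholders.

```python
import shlex

# Path copied from the header comment of the YAML in this commit.
YAML_PATH = (
    "modules/Model-Optimizer/tools/launcher/examples/nvidia/"
    "NVIDIA-Nemotron-3-Super-120B-A12B-BF16/specdec_bench_dflash_vllm.yaml"
)


def task_override(task_index: int, args: list[str]) -> str:
    """Format one pipeline.task_N.args+=[...] override as a single token."""
    quoted = ",".join(f'"{a}"' for a in args)
    return f"pipeline.task_{task_index}.args+=[{quoted}]"


def build_command(cell_name: str, sweep_dir: str, block_size: int = 4,
                  num_requests: int = 80) -> list[str]:
    """Assemble the argv list for the launch shown in the header comment."""
    runtime = f"--runtime_params common/specdec_bench/_cells/{cell_name}.yaml"
    task0 = [runtime,
             f"--save_dir /scratchspace/{sweep_dir}/qualitative",
             f"--block_size {block_size}"]
    task1 = [runtime,
             f"--save_dir /scratchspace/{sweep_dir}/throughput_32k",
             f"--num_requests {num_requests}",
             f"--block_size {block_size}"]
    return ["uv", "run", "slurm.py", "--yaml", YAML_PATH, "--yes", "detach=true",
            task_override(0, task0), task_override(1, task1)]


if __name__ == "__main__":
    # Placeholder inputs; substitute the real cell and sweep names.
    print(shlex.join(build_command("t0_d3", "demo")))
```

Passing the result to `shlex.join` quotes each override as one shell word, which matters because the override strings contain spaces, brackets, and double quotes.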