diff --git a/benchmark/vllm/README.md b/benchmark/vllm/README.md
index 522a104..c735530 100644
--- a/benchmark/vllm/README.md
+++ b/benchmark/vllm/README.md
@@ -14,7 +14,7 @@ This Docker image packages vLLM with PyTorch for AMD Instinct™ MI300X, MI325X,
 accelerators. It includes:
 
 - ✅ ROCm™ 7.0.0
-- ✅ vLLM 0.17.1
+- ✅ vLLM 0.20.0
 - ✅ PyTorch 2.9.0 (2.9.0a0+git1c57644)
 - ✅ hipBLASLt 1.0
 
@@ -58,7 +58,7 @@ To override the benchmark configs, specify a certain benchmark to use, or add yo
 The following command pulls the Docker image from Docker Hub.
 
 ```sh
-docker pull vllm/vllm-openai-rocm:v0.17.1
+docker pull vllm/vllm-openai-rocm:v0.20.0
 ```
 
 ### MAD-integrated benchmarking
@@ -86,7 +86,17 @@ users can also directly run the vLLm benchmark scripts and change the benchmarki
 #### Available models
 
 >[!NOTE]
->The MXFP4 models are only supported on the gfx950 architecture i.e. MI350X/MI355X accelerators.
+>The MXFP4 models are only supported on the gfx950 architecture i.e. MI350X/MI355X accelerators.
+
+>[!NOTE]
+>Gemma 4 models (`pyt_vllm_gemma-4-*`) use the standard `docker/pyt_vllm` stack (vLLM 0.20.0, which bundles Transformers v5 with native Gemma 4 support). Accept Google’s Gemma license on Hugging Face and set `MAD_SECRETS_HFTOKEN` for gated weight downloads.
+
+Serving recipes for Gemma 4 live in [`scripts/vllm/configs/default.yaml`](../../scripts/vllm/configs/default.yaml). Both Gemma 4 entries use **tensor parallel size 1**, **`TRITON_ATTN`**, **`float16` on gfx942** (via `arch_overrides`), **`--max-model-len` 32768**, text-only multimodal limits (`--limit-mm-per-prompt`), and **`VLLM_ROCM_USE_AITER=1`**.
+
+| Model | Notes |
+| ----- | ----- |
+| **google/gemma-4-31B-it** | Dense instruct. Full serving sweep: **`max_concurrency` 1, 8, 32, 128** (four cold starts). |
+| **google/gemma-4-26B-A4B-it** | Sparse MoE (“A4B”). **AITER fused MoE is disabled** via **`VLLM_ROCM_USE_AITER_MOE=0`** so MoE runs on the **Triton** path. Full concurrency sweep: **`max_concurrency` 1, 8, 32, 128**. |
 
 | MAD model name | Model repo |
 | -------------------------------------- | -------------------------------------- |
@@ -112,6 +122,8 @@ users can also directly run the vLLm benchmark scripts and change the benchmarki
 | pyt_vllm_mixtral-8x22b | [mistralai/Mixtral-8x22B-Instruct-v0.1](https://hugggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1) |
 | pyt_vllm_mixtral-8x22b_fp8 | [amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV](https://hugggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV) |
 | pyt_vllm_phi-4 | [microsoft/phi-4](https://huggingface.co/microsoft/phi-4) |
+| pyt_vllm_gemma-4-26b-a4b-it | [google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it) |
+| pyt_vllm_gemma-4-31b-it | [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) |
 | pyt_vllm_qwen3-8b | [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) |
 | pyt_vllm_qwen3-32b | [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) |
 | pyt_vllm_qwen3-30b-a3b | [Qwen/Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) |
@@ -127,11 +139,13 @@ Users also can run the benchmark tool after they launch a Docker container. For
 #### Docker launch
 
 ```sh
-docker pull vllm/vllm-openai-rocm:v0.17.1
+docker pull vllm/vllm-openai-rocm:v0.20.0
 
-docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env VLLM_ROCM_USE_AITER=1 --env HUGGINGFACE_HUB_CACHE=/workspace --name test vllm/vllm-openai-rocm:v0.17.1
+docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env VLLM_ROCM_USE_AITER=1 --env HUGGINGFACE_HUB_CACHE=/workspace --name test vllm/vllm-openai-rocm:v0.20.0
 ```
 
+For **`google/gemma-4-26B-A4B-it`** standalone runs, also set **`VLLM_ROCM_USE_AITER_MOE=0`** (same as the MAD `default.yaml` recipe) so MoE does not use AITER’s fused path.
+
 >[!NOTE]
 >We enable [AITER](https://github.com/ROCm/aiter) during `docker run` via `--env VLLM_ROCM_USE_AITER=1` for best performance
 >on MI3xx (i.e. gfx942 and gfx950) platforms. If you're using this docker image on other AMD GPUs e.g. MI2xx or Radeon,
@@ -345,6 +359,10 @@ owners and are only mentioned for informative purposes.
 ----------
 This release note summarizes notable changes since the previous docker release.
+MAD `pyt_vllm_gemma-4-*` configs (see [`default.yaml`](../../scripts/vllm/configs/default.yaml)):
+- **gemma-4-26B-A4B-it:** set `VLLM_ROCM_USE_AITER_MOE=0` (Triton MoE); full concurrency sweep **1 8 32 128**.
+- **gemma-4-31B-it:** unchanged full sweep **1 8 32 128**; no `VLLM_ROCM_USE_AITER_MOE` override.
+
 v0.17.1 release:
 - Includes documentation and patches for upstream releases. Please track https://github.com/vllm-project/vllm/releases for all future release notes.
diff --git a/docker/pyt_vllm.ubuntu.amd.Dockerfile b/docker/pyt_vllm.ubuntu.amd.Dockerfile
index eb59155..4707632 100644
--- a/docker/pyt_vllm.ubuntu.amd.Dockerfile
+++ b/docker/pyt_vllm.ubuntu.amd.Dockerfile
@@ -24,7 +24,7 @@
 # SOFTWARE.
 #
 #################################################################################
-ARG BASE_DOCKER=vllm/vllm-openai-rocm:v0.17.1
+ARG BASE_DOCKER=vllm/vllm-openai-rocm:v0.20.0
 FROM $BASE_DOCKER
 
 USER root
@@ -36,4 +36,4 @@ WORKDIR $WORKSPACE_DIR
 RUN pip3 list
 
 # Specify entrypoint to override upstream
-ENTRYPOINT [""]
+ENTRYPOINT []
diff --git a/models.json b/models.json
index b24ac6c..6b1c99a 100644
--- a/models.json
+++ b/models.json
@@ -487,6 +487,44 @@
         "args":
             "--model_repo Qwen/Qwen3-8B --config configs/extended.yaml"
     },
+    {
+        "name": "pyt_vllm_gemma-4-26b-a4b-it",
+        "data": "huggingface",
+        "dockerfile": "docker/pyt_vllm",
+        "scripts": "scripts/vllm/run.sh",
+        "n_gpus": "-1",
+        "owner": "mad.support@amd.com",
+        "training_precision": "",
+        "multiple_results": "perf_gemma-4-26B-A4B-it.csv",
+        "tags": [
+            "pyt",
+            "vllm",
+            "vllm_extended",
+            "inference"
+        ],
+        "timeout": -1,
+        "args":
+            "--model_repo google/gemma-4-26B-A4B-it --config configs/default.yaml"
+    },
+    {
+        "name": "pyt_vllm_gemma-4-31b-it",
+        "data": "huggingface",
+        "dockerfile": "docker/pyt_vllm",
+        "scripts": "scripts/vllm/run.sh",
+        "n_gpus": "-1",
+        "owner": "mad.support@amd.com",
+        "training_precision": "",
+        "multiple_results": "perf_gemma-4-31B-it.csv",
+        "tags": [
+            "pyt",
+            "vllm",
+            "vllm_extended",
+            "inference"
+        ],
+        "timeout": -1,
+        "args":
+            "--model_repo google/gemma-4-31B-it --config configs/default.yaml"
+    },
     {
         "name": "pyt_vllm_qwen3-32b",
         "data": "huggingface",
diff --git a/scripts/vllm/configs/default.yaml b/scripts/vllm/configs/default.yaml
index 038480d..e556918 100644
--- a/scripts/vllm/configs/default.yaml
+++ b/scripts/vllm/configs/default.yaml
@@ -94,4 +94,46 @@
     --attention-backend: ROCM_ATTN
   arch_overrides:
     gfx942:
-      dtype: float16
\ No newline at end of file
+      dtype: float16
+
+## Gemma 4: vLLM recipe recommends 1x MI300-class GPU (BF16); tp 1 for text-only bench
+## Use TRITON_ATTN (Gemma4 default); ROCM_ATTN does not support head_dim 512 without extra work
+## --limit-mm-per-prompt is JSON for vLLM (json.loads), not image=0,audio=0
+## 26B-A4B is MoE: disable AITER fused MoE only (unsupported CK GEMM for this shape); other AITER features unchanged
+## Full concurrency sweep (1 8 32 128)
+- benchmark: serving
+  model: google/gemma-4-26B-A4B-it
+  tp: 1
+  inp: 1024
+  out: 1024
+  dtype: auto
+  max_concurrency: 1 8 32 128
+  env:
+    VLLM_ROCM_USE_AITER: 1
+    VLLM_ROCM_USE_AITER_MOE: 0
+  extra_args:
+    --attention-backend: TRITON_ATTN
+    --max-model-len: 32768
+    --gpu-memory-utilization: 0.90
+    --limit-mm-per-prompt: '{"image":0,"audio":0}'
+  arch_overrides:
+    gfx942:
+      dtype: float16
+
+- benchmark: serving
+  model: google/gemma-4-31B-it
+  tp: 1
+  inp: 1024
+  out: 1024
+  dtype: auto
+  max_concurrency: 1 8 32 128
+  env:
+    VLLM_ROCM_USE_AITER: 1
+  extra_args:
+    --attention-backend: TRITON_ATTN
+    --max-model-len: 32768
+    --gpu-memory-utilization: 0.90
+    --limit-mm-per-prompt: '{"image":0,"audio":0}'
+  arch_overrides:
+    gfx942:
+      dtype: float16
diff --git a/scripts/vllm/run_vllm.py b/scripts/vllm/run_vllm.py
index b5b5db0..fd96e0c 100644
--- a/scripts/vllm/run_vllm.py
+++ b/scripts/vllm/run_vllm.py
@@ -34,10 +34,24 @@
 import signal
 import argparse
 import itertools
+import shlex
 import subprocess
 from typing import List, Dict
 
 SUPPORTED_LIST_ARGS = ['model', 'tp', 'inp', 'out', 'bs', 'num_prompts', 'max_concurrency']
+
+
+def build_extra_args_str(extra_args: Dict) -> str:
+    parts = []
+    for k, v in extra_args.items():
+        if isinstance(v, bool):
+            if v:
+                parts.append(k)
+        else:
+            parts.append(f"{k} {shlex.quote(str(v))}")
+    return " ".join(parts)
+
+
 CSV_HEADER = [
     "model",
     "benchmark",
@@ -485,12 +499,7 @@ def main():
     env_vars = config.get("env", {})
     extra_args = config.get("extra_args", {})
     env_vars_str = " ".join(f"{k}={v}" for k, v in env_vars.items())
-    extra_args_str = ""
-    for k, v in extra_args.items():
-        if isinstance(v, bool):
-            extra_args_str += f" {k}"
-        else:
-            extra_args_str += f" {k} {v}"
+    extra_args_str = build_extra_args_str(extra_args)
     config["env"] = env_vars_str
     config["extra_args"] = extra_args_str
diff --git a/tests/vllm/__init__.py b/tests/vllm/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/vllm/test_run_vllm_extra_args.py b/tests/vllm/test_run_vllm_extra_args.py
new file mode 100644
index 0000000..7bcb1c3
--- /dev/null
+++ b/tests/vllm/test_run_vllm_extra_args.py
@@ -0,0 +1,61 @@
+import sys
+import os
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "scripts", "vllm"))
+from run_vllm import build_extra_args_str
+
+
+def test_simple_string_value():
+    result = build_extra_args_str({"--attention-backend": "TRITON_ATTN"})
+    assert result == "--attention-backend TRITON_ATTN"
+
+
+def test_json_value_is_quoted():
+    result = build_extra_args_str({"--limit-mm-per-prompt": '{"image":0,"audio":0}'})
+    assert "'" in result or "\\" in result
+    assert "--limit-mm-per-prompt" in result
+
+
+def test_bool_true_emits_flag():
+    result = build_extra_args_str({"--enable-prefix-caching": True})
+    assert result == "--enable-prefix-caching"
+
+
+def test_bool_false_skips_flag():
+    result = build_extra_args_str({"--enable-prefix-caching": False})
+    assert result == ""
+
+
+def test_numeric_value():
+    result = build_extra_args_str({"--max-model-len": 32768})
+    assert result == "--max-model-len 32768"
+
+
+def test_mixed_args():
+    args = {
+        "--attention-backend": "TRITON_ATTN",
+        "--enable-prefix-caching": True,
+        "--disable-log-stats": False,
+        "--max-model-len": 32768,
+        "--limit-mm-per-prompt": '{"image":0,"audio":0}',
+    }
+    result = build_extra_args_str(args)
+    assert "--attention-backend TRITON_ATTN" in result
+    assert "--enable-prefix-caching" in result
+    assert "--disable-log-stats" not in result
+    assert "--max-model-len 32768" in result
+    assert "--limit-mm-per-prompt" in result
+
+
+def test_empty_args():
+    assert build_extra_args_str({}) == ""
+
+
+def test_value_with_spaces_is_quoted():
+    result = build_extra_args_str({"--chat-template": "path with spaces/template.jinja"})
+    assert "'" in result or "\\" in result
+
+
+def test_shell_metacharacters_are_quoted():
+    result = build_extra_args_str({"--some-arg": "value;rm -rf /"})
+    assert "'" in result or "\\" in result
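
A round-trip check, separate from the patch, illustrates what the `shlex.quote` change is meant to guarantee. The sketch below copies `build_extra_args_str` from the `run_vllm.py` hunk and parses its output with `shlex.split`, which follows POSIX shell word-splitting rules. It assumes the rendered string is eventually interpreted by a POSIX-style shell. The `extra_args` dict is illustrative, assembled from values that appear in `default.yaml` and the tests.

```python
import json
import shlex
from typing import Dict


def build_extra_args_str(extra_args: Dict) -> str:
    # Copy of the helper from the run_vllm.py hunk in this diff.
    parts = []
    for k, v in extra_args.items():
        if isinstance(v, bool):
            # True emits a bare flag; False emits nothing.
            if v:
                parts.append(k)
        else:
            parts.append(f"{k} {shlex.quote(str(v))}")
    return " ".join(parts)


# Illustrative input (not a config entry from the repo).
extra_args = {
    "--attention-backend": "TRITON_ATTN",
    "--limit-mm-per-prompt": '{"image":0,"audio":0}',
    "--enable-prefix-caching": True,
}

rendered = build_extra_args_str(extra_args)

# Split the way a POSIX shell would; the JSON must survive as one token.
tokens = shlex.split(rendered)
limits = json.loads(tokens[3])

print(rendered)
print(tokens)
print(limits)
```

With the pre-patch string concatenation the JSON value would reach the shell unquoted, so its double quotes would be consumed during word splitting and the argument would no longer parse as JSON. The existing tests only assert that a quote or backslash appears somewhere in the result; a round-trip assertion of this kind would pin down the behavior more tightly.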