26 changes: 25 additions & 1 deletion benchmark/vllm/README.md
@@ -61,6 +61,12 @@ The following command pulls the Docker image from Docker Hub.
docker pull vllm/vllm-openai-rocm:v0.17.1
```

Gemma 4 models are served from the standard vLLM image (vLLM ≥0.20.0 ships Transformers v5 with native Gemma 4 support):

```sh
docker pull vllm/vllm-openai-rocm:v0.20.0
```

### MAD-integrated benchmarking

Clone the ROCm Model Automation and Dashboarding (MAD) repository to a local directory and install the required packages on the host machine.
@@ -86,7 +92,17 @@ users can also directly run the vLLm benchmark scripts and change the benchmarki
#### Available models

>[!NOTE]
>The MXFP4 models are only supported on the gfx950 architecture i.e. MI350X/MI355X accelerators.

>[!NOTE]
>Gemma 4 models (`pyt_vllm_gemma-4-*`) use the standard `docker/pyt_vllm` stack (vLLM ≥0.20.0 / Transformers ≥5.5.0). Accept Google’s Gemma license on Hugging Face and set `MAD_SECRETS_HFTOKEN` for gated weight downloads.

Serving recipes for Gemma 4 live in [`scripts/vllm/configs/default.yaml`](../../scripts/vllm/configs/default.yaml). Both Gemma 4 entries use **tensor parallel size 1**, **`TRITON_ATTN`**, **`float16` on gfx942** (via `arch_overrides`), **`--max-model-len` 32768**, text-only multimodal limits (`--limit-mm-per-prompt`), and **`VLLM_ROCM_USE_AITER=1`** where supported.

| Model | Notes |
| ----- | ----- |
| **google/gemma-4-31B-it** | Dense instruct. Full serving sweep: **`max_concurrency` 1, 8, 32, 128** (four cold starts). |
| **google/gemma-4-26B-A4B-it** | Sparse MoE (“A4B”). **AITER fused MoE is disabled** via **`VLLM_ROCM_USE_AITER_MOE=0`**, so MoE runs on the **Triton** path. The concurrency sweep is narrowed to **`max_concurrency` 1 and 8** to stay within typical MAD Docker memory limits. |

| MAD model name | Model repo |
| -------------------------------------- | -------------------------------------- |
@@ -112,6 +128,8 @@ users can also directly run the vLLm benchmark scripts and change the benchmarki
| pyt_vllm_mixtral-8x22b | [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1) |
| pyt_vllm_mixtral-8x22b_fp8 | [amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV](https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV) |
| pyt_vllm_phi-4 | [microsoft/phi-4](https://huggingface.co/microsoft/phi-4) |
| pyt_vllm_gemma-4-26b-a4b-it | [google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it) |
| pyt_vllm_gemma-4-31b-it | [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) |
| pyt_vllm_qwen3-8b | [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) |
| pyt_vllm_qwen3-32b | [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) |
| pyt_vllm_qwen3-30b-a3b | [Qwen/Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) |
@@ -132,6 +150,8 @@ docker pull vllm/vllm-openai-rocm:v0.17.1
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env VLLM_ROCM_USE_AITER=1 --env HUGGINGFACE_HUB_CACHE=/workspace --name test vllm/vllm-openai-rocm:v0.17.1
```

For Gemma 4 standalone runs use `vllm/vllm-openai-rocm:v0.20.0` (same image as other vLLM models). For **`google/gemma-4-26B-A4B-it`** only, also set **`VLLM_ROCM_USE_AITER_MOE=0`** (same as the MAD `default.yaml` recipe) so MoE does not use AITER’s fused path.
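
For anyone scripting the standalone run, the per-model difference can be sketched as below. This is an illustration only: the helper name `gemma4_docker_run` and the `workdir` parameter are hypothetical, and the flags are copied from the `docker run` line above with only the image tag and the extra MoE variable changed.

```python
import shlex

# Image tag named in this README for Gemma 4 runs.
IMAGE = "vllm/vllm-openai-rocm:v0.20.0"

# Device and security flags copied from the existing `docker run` example.
BASE_FLAGS = [
    "--device=/dev/kfd", "--device=/dev/dri",
    "--group-add", "video",
    "--shm-size", "16G",
    "--security-opt", "seccomp=unconfined",
    "--security-opt", "apparmor=unconfined",
    "--cap-add=SYS_PTRACE",
]


def gemma4_docker_run(model_repo: str, workdir: str) -> list:
    """Hypothetical helper: build the docker argv for a Gemma 4 standalone run."""
    env = {"VLLM_ROCM_USE_AITER": "1", "HUGGINGFACE_HUB_CACHE": "/workspace"}
    # Only the sparse MoE checkpoint turns off AITER's fused MoE path.
    if model_repo == "google/gemma-4-26B-A4B-it":
        env["VLLM_ROCM_USE_AITER_MOE"] = "0"
    argv = ["docker", "run", "-it", *BASE_FLAGS, "-v", f"{workdir}:/workspace"]
    for key, value in env.items():
        argv += ["--env", f"{key}={value}"]
    argv += ["--name", "test", IMAGE]
    return argv


if __name__ == "__main__":
    # Print a copy-pasteable command line for the MoE model.
    print(shlex.join(gemma4_docker_run("google/gemma-4-26B-A4B-it", "/tmp/work")))
```

Building the command as a list and joining it with `shlex.join` keeps each argument as one shell word, whatever characters the working directory contains.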

>[!NOTE]
>We enable [AITER](https://github.com/ROCm/aiter) during `docker run` via `--env VLLM_ROCM_USE_AITER=1` for best performance
>on MI3xx (i.e. gfx942 and gfx950) platforms. If you're using this docker image on other AMD GPUs e.g. MI2xx or Radeon,
@@ -345,6 +365,10 @@ owners and are only mentioned for informative purposes.
----------
This release note summarizes notable changes since the previous docker release.

MAD `pyt_vllm_gemma-4-*` configs (see [`default.yaml`](../../scripts/vllm/configs/default.yaml)):
- **gemma-4-26B-A4B-it:** set `VLLM_ROCM_USE_AITER_MOE=0` (Triton MoE); narrowed default `max_concurrency` to **1 8** to avoid OOM on repeated server restarts.
- **gemma-4-31B-it:** unchanged full sweep **1 8 32 128**; no `VLLM_ROCM_USE_AITER_MOE` override.

v0.17.1 release:
- Includes documentation and patches for upstream releases. Please track https://github.com/vllm-project/vllm/releases
for all future release notes.
3 changes: 2 additions & 1 deletion docker/pyt_vllm.ubuntu.amd.Dockerfile
@@ -24,10 +24,11 @@
# SOFTWARE.
#
#################################################################################
ARG BASE_DOCKER=vllm/vllm-openai-rocm:v0.17.1
ARG BASE_DOCKER=vllm/vllm-openai-rocm:v0.20.0
FROM $BASE_DOCKER

USER root
RUN pip3 install --no-cache-dir "transformers>=5.5.0"
ENV WORKSPACE_DIR=/workspace
RUN mkdir -p $WORKSPACE_DIR
WORKDIR $WORKSPACE_DIR
38 changes: 38 additions & 0 deletions models.json
@@ -487,6 +487,44 @@
"args":
"--model_repo Qwen/Qwen3-8B --config configs/extended.yaml"
},
{
"name": "pyt_vllm_gemma-4-26b-a4b-it",
"data": "huggingface",
"dockerfile": "docker/pyt_vllm",
"scripts": "scripts/vllm/run.sh",
"n_gpus": "-1",
"owner": "[email protected]",
"training_precision": "",
"multiple_results": "perf_gemma-4-26B-A4B-it.csv",
"tags": [
"pyt",
"vllm",
"vllm_extended",
"inference"
],
"timeout": -1,
"args":
"--model_repo google/gemma-4-26B-A4B-it --config configs/default.yaml"
},
{
"name": "pyt_vllm_gemma-4-31b-it",
"data": "huggingface",
"dockerfile": "docker/pyt_vllm",
"scripts": "scripts/vllm/run.sh",
"n_gpus": "-1",
"owner": "[email protected]",
"training_precision": "",
"multiple_results": "perf_gemma-4-31B-it.csv",
"tags": [
"pyt",
"vllm",
"vllm_extended",
"inference"
],
"timeout": -1,
"args":
"--model_repo google/gemma-4-31B-it --config configs/default.yaml"
},
{
"name": "pyt_vllm_qwen3-32b",
"data": "huggingface",
41 changes: 41 additions & 0 deletions scripts/vllm/configs/default.yaml
@@ -92,6 +92,47 @@
  env:
    VLLM_ROCM_USE_AITER: 1
  extra_args:
    --attention-backend: ROCM_ATTN
  arch_overrides:
    gfx942:
      dtype: float16

## Gemma 4: vLLM recipe recommends 1x MI300-class GPU (BF16); tp 1 for text-only bench
## Use TRITON_ATTN (Gemma4 default); 26B-A4B MoE: VLLM_ROCM_USE_AITER_MOE=0; narrow concurrency for 26B to avoid OOM
- benchmark: serving
  model: google/gemma-4-26B-A4B-it
  tp: 1
  inp: 1024
  out: 1024
  dtype: auto
  max_concurrency: 1 8
  env:
    VLLM_ROCM_USE_AITER: 1
    VLLM_ROCM_USE_AITER_MOE: 0
  extra_args:
    --attention-backend: TRITON_ATTN
    --max-model-len: 32768
    --gpu-memory-utilization: 0.90
    --limit-mm-per-prompt: '{"image":0,"audio":0}'
    --async-scheduling: True
  arch_overrides:
    gfx942:
      dtype: float16

- benchmark: serving
  model: google/gemma-4-31B-it
  tp: 1
  inp: 1024
  out: 1024
  dtype: auto
  max_concurrency: 1 8 32 128
  env:
    VLLM_ROCM_USE_AITER: 1
  extra_args:
    --attention-backend: TRITON_ATTN
    --max-model-len: 32768
    --gpu-memory-utilization: 0.90
    --limit-mm-per-prompt: '{"image":0,"audio":0}'
    --async-scheduling: True
  arch_overrides:
    gfx942:
      dtype: float16
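
The `arch_overrides` block in these entries reads as a per-architecture patch on top of the base recipe. A minimal sketch of that reading, assuming a shallow merge keyed by the detected GPU architecture; the helper name `apply_arch_overrides` is hypothetical, and the merge logic actually used by `run_vllm.py` is not shown in this diff:

```python
from copy import deepcopy


def apply_arch_overrides(entry: dict, arch: str) -> dict:
    """Hypothetical helper: return a copy of a config entry with the
    overrides for `arch` applied on top of the base keys (shallow merge)."""
    resolved = deepcopy(entry)
    # Drop the override table from the result and pick this arch's patch, if any.
    overrides = resolved.pop("arch_overrides", {}).get(arch, {})
    resolved.update(overrides)
    return resolved


# Trimmed copy of the gemma-4-31B-it entry above.
gemma_entry = {
    "model": "google/gemma-4-31B-it",
    "tp": 1,
    "dtype": "auto",
    "arch_overrides": {"gfx942": {"dtype": "float16"}},
}

if __name__ == "__main__":
    print(apply_arch_overrides(gemma_entry, "gfx942")["dtype"])
    print(apply_arch_overrides(gemma_entry, "gfx950")["dtype"])
```

Under this assumption, `dtype: auto` stays in effect on every architecture except gfx942, which is the behavior the README describes as "`float16` on gfx942".
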
3 changes: 2 additions & 1 deletion scripts/vllm/run_vllm.py
@@ -34,6 +34,7 @@
import signal
import argparse
import itertools
import shlex
import subprocess
from typing import List, Dict

@@ -490,7 +491,7 @@ def main():
if isinstance(v, bool):
extra_args_str += f" {k}"
else:
extra_args_str += f" {k} {v}"
extra_args_str += f" {k} {shlex.quote(str(v))}"
config["env"] = env_vars_str
config["extra_args"] = extra_args_str
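
To see why the quoting matters for the Gemma 4 recipe, here is a small standalone sketch. It mirrors the loop above in a hypothetical helper, `build_extra_args`, and uses `shlex.split` as a stand-in for the shell's word splitting; it assumes the resulting string is later interpreted by a POSIX shell.

```python
import shlex


def build_extra_args(extra_args: dict) -> str:
    """Hypothetical mirror of the post-fix loop: a bare flag for bool
    values, and a shell-quoted value for everything else."""
    parts = []
    for key, value in extra_args.items():
        if isinstance(value, bool):
            parts.append(key)
        else:
            parts.append(f"{key} {shlex.quote(str(value))}")
    return " ".join(parts)


# Values taken from the Gemma 4 entries in default.yaml.
extra_args = {
    "--limit-mm-per-prompt": '{"image":0,"audio":0}',
    "--max-model-len": 32768,
    "--async-scheduling": True,
}

quoted = build_extra_args(extra_args)
# Splitting the quoted string gives back the original JSON as one token.
tokens = shlex.split(quoted)

# Without quoting, POSIX word splitting consumes the double quotes,
# so the server would receive invalid JSON.
unquoted_tokens = shlex.split('--limit-mm-per-prompt {"image":0,"audio":0}')

if __name__ == "__main__":
    print(tokens)
    print(unquoted_tokens)
```

The round trip through `shlex.split` is the property worth checking: whatever string is in the YAML should arrive at `vllm serve` unchanged.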

Empty file added tests/vllm/__init__.py
Empty file.
95 changes: 95 additions & 0 deletions tests/vllm/test_run_vllm_extra_args.py
@@ -0,0 +1,95 @@
"""Tests for extra_args quoting in run_vllm.py."""
import sys
import os
import shlex

import pytest


def build_extra_args_str_old(extra_args: dict) -> str:
    """Replicates the OLD selective-quoting logic from run_vllm.py (pre-fix)."""
    extra_args_str = ""
    for k, v in extra_args.items():
        if isinstance(v, bool):
            extra_args_str += f" {k}"
        else:
            s = str(v)
            st = s.strip()
            if (
                k == "--limit-mm-per-prompt"
                or (st[:1] in "{[")
                or any(ch.isspace() for ch in s)
            ):
                extra_args_str += f" {k} {shlex.quote(s)}"
            else:
                extra_args_str += f" {k} {s}"
    return extra_args_str.strip()


def build_extra_args_str_new(extra_args: dict) -> str:
    """Replicates the NEW universal-quoting logic (post-fix)."""
    extra_args_str = ""
    for k, v in extra_args.items():
        if isinstance(v, bool):
            extra_args_str += f" {k}"
        else:
            extra_args_str += f" {k} {shlex.quote(str(v))}"
    return extra_args_str.strip()


# --- Tests that FAIL with the old logic, PASS with the new ---

def test_shell_metachar_no_space_is_quoted_by_new():
    """Values with shell metacharacters but no spaces are NOT quoted by old code.

    The old code only quotes when there's whitespace, a JSON-like prefix, or the
    --limit-mm-per-prompt key. A value like 'foo;bar' (no space) slips through
    unquoted, allowing shell injection. The new code always quotes.
    """
    args = {"--some-arg": "foo;bar"}
    old = build_extra_args_str_old(args)
    new = build_extra_args_str_new(args)
    # Old code: no whitespace -> not quoted, semicolon is a live shell metachar
    assert old == "--some-arg foo;bar", f"unexpected old output: {old!r}"
    # New code: shlex.quote wraps the value in single quotes
    assert new == "--some-arg 'foo;bar'", f"unexpected new output: {new!r}"
    assert old != new


def test_plain_string_with_metachar_is_unquoted_by_old():
    """Old code leaves plain strings with $ unquoted (variable expansion risk)."""
    args = {"--trust-remote-code": "yes$HOME"}
    old = build_extra_args_str_old(args)
    new = build_extra_args_str_new(args)
    # Old code: no whitespace, no JSON prefix -> raw string passed to shell
    assert old == "--trust-remote-code yes$HOME", f"unexpected old output: {old!r}"
    # New code: always quoted
    assert new == "--trust-remote-code 'yes$HOME'", f"unexpected new output: {new!r}"


# --- Tests that PASS with BOTH old and new logic ---

def test_json_value_is_quoted():
    args = {"--limit-mm-per-prompt": '{"image":0,"audio":0}'}
    result = build_extra_args_str_new(args)
    assert result == """--limit-mm-per-prompt '{"image":0,"audio":0}'"""


def test_bool_flag_has_no_value():
    args = {"--async-scheduling": True}
    result = build_extra_args_str_new(args)
    assert result == "--async-scheduling"


def test_string_with_space_is_quoted():
    args = {"--served-model-name": "my model"}
    result = build_extra_args_str_new(args)
    assert result == "--served-model-name 'my model'"


def test_plain_safe_scalar_passthrough():
    """shlex.quote does not add quotes to safe alphanumeric values."""
    args = {"--max-model-len": 32768}
    result = build_extra_args_str_new(args)
    # shlex.quote('32768') == '32768' (no shell quoting needed for pure digits)
    assert result == "--max-model-len 32768"