ROCm · coketaste · Apr 14, 2026 · Apr 30, 2026 · Apr 30, 2026 · Apr 30, 2026
@@ -14,7 +14,7 @@ This Docker image packages vLLM with PyTorch for AMD Instinct™ MI300X, MI325X,
 accelerators. It includes:
 
 -   ✅ ROCm™ 7.0.0
--   ✅ vLLM 0.17.1
+-   ✅ vLLM 0.20.0
 -   ✅ PyTorch 2.9.0 (2.9.0a0+git1c57644)
 -   ✅ hipBLASLt 1.0
 
@@ -58,7 +58,7 @@ To override the benchmark configs, specify a certain benchmark to use, or add yo
 The following command pulls the Docker image from Docker Hub.
 
 ```sh
-docker pull vllm/vllm-openai-rocm:v0.17.1
+docker pull vllm/vllm-openai-rocm:v0.20.0
 ```
 
 ### MAD-integrated benchmarking
@@ -86,7 +86,17 @@ users can also directly run the vLLm benchmark scripts and change the benchmarki
 #### Available models
 
 >[!NOTE]
->The MXFP4 models are only supported on the gfx950 architecture i.e. MI350X/MI355X accelerators.	
+>The MXFP4 models are only supported on the gfx950 architecture i.e. MI350X/MI355X accelerators.
+
+>[!NOTE]
+>Gemma 4 models (`pyt_vllm_gemma-4-*`) use the standard `docker/pyt_vllm` stack (vLLM 0.20.0, which bundles Transformers v5 with native Gemma 4 support). Accept Google’s Gemma license on Hugging Face and set `MAD_SECRETS_HFTOKEN` for gated weight downloads.
+
+Serving recipes for Gemma 4 live in [`scripts/vllm/configs/default.yaml`](../../scripts/vllm/configs/default.yaml). Both Gemma 4 entries use **tensor parallel size 1**, the **`TRITON_ATTN`** attention backend, **`dtype: auto`** overridden to **`float16` on gfx942** (via `arch_overrides`), **`--max-model-len` 32768**, **`--gpu-memory-utilization` 0.90**, image and audio inputs disabled for text-only benchmarking (`--limit-mm-per-prompt '{"image":0,"audio":0}'`), and **`VLLM_ROCM_USE_AITER=1`**.
+
+| Model | Notes |
+| ----- | ----- |
+| **google/gemma-4-31B-it** | Dense instruct. Full serving sweep: **`max_concurrency` 1, 8, 32, 128** (four cold starts). |
+| **google/gemma-4-26B-A4B-it** | Sparse MoE (“A4B”). **AITER fused MoE is disabled** via **`VLLM_ROCM_USE_AITER_MOE=0`** so MoE runs on the **Triton** path. Full concurrency sweep: **`max_concurrency` 1, 8, 32, 128**. |
 
 | MAD model name                         | Model repo                             |
 | -------------------------------------- | -------------------------------------- |
@@ -112,6 +122,8 @@ users can also directly run the vLLm benchmark scripts and change the benchmarki
 | pyt_vllm_mixtral-8x22b                 | [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1) |
 | pyt_vllm_mixtral-8x22b_fp8             | [amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV](https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV) |
 | pyt_vllm_phi-4                         | [microsoft/phi-4](https://huggingface.co/microsoft/phi-4) |
+| pyt_vllm_gemma-4-26b-a4b-it            | [google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it) |
+| pyt_vllm_gemma-4-31b-it                | [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) |
 | pyt_vllm_qwen3-8b                      | [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) |
 | pyt_vllm_qwen3-32b                     | [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) |
 | pyt_vllm_qwen3-30b-a3b                 | [Qwen/Qwen3-30B-A3B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) |
@@ -127,11 +139,13 @@ Users also can run the benchmark tool after they launch a Docker container. For
 
 #### Docker launch
 ```sh
-docker pull vllm/vllm-openai-rocm:v0.17.1
+docker pull vllm/vllm-openai-rocm:v0.20.0
 
-docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env VLLM_ROCM_USE_AITER=1 --env HUGGINGFACE_HUB_CACHE=/workspace --name test vllm/vllm-openai-rocm:v0.17.1
+docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env VLLM_ROCM_USE_AITER=1 --env HUGGINGFACE_HUB_CACHE=/workspace --name test vllm/vllm-openai-rocm:v0.20.0
 ```
 
+For **`google/gemma-4-26B-A4B-it`** standalone runs, also set **`VLLM_ROCM_USE_AITER_MOE=0`** (same as the MAD `default.yaml` recipe) so MoE does not use AITER’s fused path.
+
 >[!NOTE]
 >We enable [AITER](https://github.com/ROCm/aiter) during `docker run` via `--env VLLM_ROCM_USE_AITER=1` for best performance
 >on MI3xx (i.e. gfx942 and gfx950) platforms. If you're using this docker image on other AMD GPUs e.g. MI2xx or Radeon,
@@ -345,6 +359,10 @@ owners and are only mentioned for informative purposes.
 ----------
 This release note summarizes notable changes since the previous docker release.
 
+MAD `pyt_vllm_gemma-4-*` configs (see [`default.yaml`](../../scripts/vllm/configs/default.yaml)):
+- **gemma-4-26B-A4B-it:** set `VLLM_ROCM_USE_AITER_MOE=0` (Triton MoE); full concurrency sweep **1 8 32 128**.
+- **gemma-4-31B-it:** full concurrency sweep **1 8 32 128**; no `VLLM_ROCM_USE_AITER_MOE` override.
+
 v0.17.1 release:
 - Includes documentation and patches for upstream releases. Please track https://github.com/vllm-project/vllm/releases
   for all future release notes.

@@ -24,7 +24,7 @@
 # SOFTWARE.
 #
 #################################################################################
-ARG BASE_DOCKER=vllm/vllm-openai-rocm:v0.17.1
+ARG BASE_DOCKER=vllm/vllm-openai-rocm:v0.20.0
 FROM $BASE_DOCKER
 
 USER root
@@ -36,4 +36,4 @@ WORKDIR $WORKSPACE_DIR
 RUN pip3 list
 
 # Specify entrypoint to override upstream
-ENTRYPOINT [""]
+ENTRYPOINT []
@@ -487,6 +487,44 @@
     "args":
     "--model_repo Qwen/Qwen3-8B --config configs/extended.yaml"
   },
+  {
+    "name": "pyt_vllm_gemma-4-26b-a4b-it",
+    "data": "huggingface",
+    "dockerfile": "docker/pyt_vllm",
+    "scripts": "scripts/vllm/run.sh",
+    "n_gpus": "-1",
+    "owner": "mad.support@amd.com",
+    "training_precision": "",
+    "multiple_results": "perf_gemma-4-26B-A4B-it.csv",
+    "tags": [
+      "pyt",
+      "vllm",
+      "vllm_extended",
+      "inference"
+    ],
+    "timeout": -1,
+    "args":
+    "--model_repo google/gemma-4-26B-A4B-it --config configs/default.yaml"
+  },
+  {
+    "name": "pyt_vllm_gemma-4-31b-it",
+    "data": "huggingface",
+    "dockerfile": "docker/pyt_vllm",
+    "scripts": "scripts/vllm/run.sh",
+    "n_gpus": "-1",
+    "owner": "mad.support@amd.com",
+    "training_precision": "",
+    "multiple_results": "perf_gemma-4-31B-it.csv",
+    "tags": [
+      "pyt",
+      "vllm",
+      "vllm_extended",
+      "inference"
+    ],
+    "timeout": -1,
+    "args":
+    "--model_repo google/gemma-4-31B-it --config configs/default.yaml"
+  },
   {
     "name": "pyt_vllm_qwen3-32b",
     "data": "huggingface",

@@ -94,4 +94,46 @@
     --attention-backend: ROCM_ATTN
   arch_overrides:
     gfx942:
-      dtype: float16
+      dtype: float16
+
+## Gemma 4: vLLM recipe recommends 1x MI300-class GPU (BF16); tp 1 for text-only bench
+## Use TRITON_ATTN (Gemma4 default); ROCM_ATTN does not support head_dim 512 without extra work
+## --limit-mm-per-prompt is JSON for vLLM (json.loads), not image=0,audio=0
+## 26B-A4B is MoE: disable AITER fused MoE only (unsupported CK GEMM for this shape); other AITER features unchanged
+## Full concurrency sweep (1 8 32 128)
+- benchmark: serving
+  model: google/gemma-4-26B-A4B-it
+  tp: 1
+  inp: 1024
+  out: 1024
+  dtype: auto
+  max_concurrency: 1 8 32 128
+  env:
+    VLLM_ROCM_USE_AITER: 1
+    VLLM_ROCM_USE_AITER_MOE: 0
+  extra_args:
+    --attention-backend: TRITON_ATTN
+    --max-model-len: 32768
+    --gpu-memory-utilization: 0.90
+    --limit-mm-per-prompt: '{"image":0,"audio":0}'
+  arch_overrides:
+    gfx942:
+      dtype: float16
+
+- benchmark: serving
+  model: google/gemma-4-31B-it
+  tp: 1
+  inp: 1024
+  out: 1024
+  dtype: auto
+  max_concurrency: 1 8 32 128
+  env:
+    VLLM_ROCM_USE_AITER: 1
+  extra_args:
+    --attention-backend: TRITON_ATTN
+    --max-model-len: 32768
+    --gpu-memory-utilization: 0.90
+    --limit-mm-per-prompt: '{"image":0,"audio":0}'
+  arch_overrides:
+    gfx942:
+      dtype: float16
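
The two Gemma 4 entries above combine a space-separated `max_concurrency` sweep with a per-architecture `dtype` override. The following is a minimal sketch of how such an entry could expand into individual runs. It is not the logic in `run_vllm.py`: `apply_arch_overrides` and `expand_sweep` are hypothetical helpers written for illustration, and only `SUPPORTED_LIST_ARGS` and the config values are taken from this diff.

```python
import copy
import itertools
from typing import Dict, List

# Keys that may hold a space-separated list of values (copied from run_vllm.py).
SUPPORTED_LIST_ARGS = ['model', 'tp', 'inp', 'out', 'bs', 'num_prompts', 'max_concurrency']


def apply_arch_overrides(config: Dict, arch: str) -> Dict:
    """Hypothetical helper: overlay arch_overrides[arch] onto the top-level keys."""
    merged = copy.deepcopy(config)  # leave the caller's config untouched
    overrides = merged.pop("arch_overrides", {}).get(arch, {})
    merged.update(overrides)
    return merged


def expand_sweep(config: Dict) -> List[Dict]:
    """Hypothetical helper: one run per combination of space-separated values."""
    keys = [k for k in SUPPORTED_LIST_ARGS if k in config]
    # Values are stringified before splitting, so swept keys come back as strings.
    values = [str(config[k]).split() for k in keys]
    runs = []
    for combo in itertools.product(*values):
        run = dict(config)
        run.update(zip(keys, combo))
        runs.append(run)
    return runs


# Mirrors the google/gemma-4-26B-A4B-it entry; env and extra_args omitted for brevity.
gemma_moe = {
    "benchmark": "serving",
    "model": "google/gemma-4-26B-A4B-it",
    "tp": 1,
    "inp": 1024,
    "out": 1024,
    "dtype": "auto",
    "max_concurrency": "1 8 32 128",
    "arch_overrides": {"gfx942": {"dtype": "float16"}},
}

runs = expand_sweep(apply_arch_overrides(gemma_moe, "gfx942"))
for run in runs:
    print(run["model"], run["dtype"], run["max_concurrency"])
```

Under these assumptions the entry yields four runs on gfx942, all with `dtype` set to `float16`, while any other architecture keeps `dtype: auto`.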
@@ -34,10 +34,24 @@
 import signal
 import argparse
 import itertools
+import shlex
 import subprocess
 from typing import List, Dict
 
 SUPPORTED_LIST_ARGS = ['model', 'tp', 'inp', 'out', 'bs', 'num_prompts', 'max_concurrency']
+
+
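+def build_extra_args_str(extra_args: Dict) -> str:
+    """Render extra_args as a shell-safe CLI string.
+
+    Boolean values act as switches: True emits the bare flag, False omits it.
+    Other values are stringified and passed through shlex.quote.
+    """
+    parts = []
+    for k, v in extra_args.items():
+        if isinstance(v, bool):
+            if v:
+                parts.append(k)
+        else:
+            parts.append(f"{k} {shlex.quote(str(v))}")
+    return " ".join(parts)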
+
+
 CSV_HEADER = [
     "model",
     "benchmark",
@@ -485,12 +499,7 @@ def main():
             env_vars = config.get("env", {})
             extra_args = config.get("extra_args", {})
             env_vars_str = " ".join(f"{k}={v}" for k, v in env_vars.items())
-            extra_args_str = ""
-            for k, v in extra_args.items():
-                if isinstance(v, bool):
-                    extra_args_str += f" {k}"
-                else:
-                    extra_args_str += f" {k} {v}"
+            extra_args_str = build_extra_args_str(extra_args)
             config["env"] = env_vars_str
             config["extra_args"] = extra_args_str
 

@@ -0,0 +1,61 @@
+import sys
+import os
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "scripts", "vllm"))
+from run_vllm import build_extra_args_str
+
+
+def test_simple_string_value():
+    result = build_extra_args_str({"--attention-backend": "TRITON_ATTN"})
+    assert result == "--attention-backend TRITON_ATTN"
+
+
+def test_json_value_is_quoted():
+    result = build_extra_args_str({"--limit-mm-per-prompt": '{"image":0,"audio":0}'})
+    assert "'" in result or "\\" in result
+    assert "--limit-mm-per-prompt" in result
+
+
+def test_bool_true_emits_flag():
+    result = build_extra_args_str({"--enable-prefix-caching": True})
+    assert result == "--enable-prefix-caching"
+
+
+def test_bool_false_skips_flag():
+    result = build_extra_args_str({"--enable-prefix-caching": False})
+    assert result == ""
+
+
+def test_numeric_value():
+    result = build_extra_args_str({"--max-model-len": 32768})
+    assert result == "--max-model-len 32768"
+
+
+def test_mixed_args():
+    args = {
+        "--attention-backend": "TRITON_ATTN",
+        "--enable-prefix-caching": True,
+        "--disable-log-stats": False,
+        "--max-model-len": 32768,
+        "--limit-mm-per-prompt": '{"image":0,"audio":0}',
+    }
+    result = build_extra_args_str(args)
+    assert "--attention-backend TRITON_ATTN" in result
+    assert "--enable-prefix-caching" in result
+    assert "--disable-log-stats" not in result
+    assert "--max-model-len 32768" in result
+    assert "--limit-mm-per-prompt" in result
+
+
+def test_empty_args():
+    assert build_extra_args_str({}) == ""
+
+
+def test_value_with_spaces_is_quoted():
+    result = build_extra_args_str({"--chat-template": "path with spaces/template.jinja"})
+    assert "'" in result or "\\" in result
+
+
+def test_shell_metacharacters_are_quoted():
+    result = build_extra_args_str({"--some-arg": "value;rm -rf /"})
+    assert "'" in result or "\\" in result
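
The quoting tests above only check that a quote or backslash appears somewhere in the output. A round-trip check is a stricter way to see why `shlex.quote` matters for the JSON-valued `--limit-mm-per-prompt` flag. The sketch below copies `build_extra_args_str` from this diff so it runs on its own, and uses `shlex.split` as an approximation of POSIX shell word splitting. A real shell may treat the unquoted value differently again, for example through brace expansion.

```python
import json
import shlex
from typing import Dict


def build_extra_args_str(extra_args: Dict) -> str:
    # Same logic as the helper added to run_vllm.py in this diff.
    parts = []
    for k, v in extra_args.items():
        if isinstance(v, bool):
            if v:
                parts.append(k)
        else:
            parts.append(f"{k} {shlex.quote(str(v))}")
    return " ".join(parts)


extra_args = {
    "--attention-backend": "TRITON_ATTN",
    "--max-model-len": 32768,
    "--limit-mm-per-prompt": '{"image":0,"audio":0}',
}

# Quoted path: the JSON value survives word splitting as a single argument.
quoted = build_extra_args_str(extra_args)
argv = shlex.split(quoted)
limits = json.loads(argv[-1])

# Pre-change formatting (no quoting): the inner double quotes are consumed
# during word splitting, so the value is no longer valid JSON.
unquoted = " ".join(f"{k} {v}" for k, v in extra_args.items())
broken = shlex.split(unquoted)[-1]

try:
    json.loads(broken)
    survived = True
except json.JSONDecodeError:
    survived = False

print(argv)
print(limits)
print(broken, survived)
```

Asserting on `shlex.split(result)` in the test file, rather than on the presence of a quote character, would tie the tests to the arguments the server actually receives.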