Problem Description
Summary
With the Atom plugin enabled, prefix caching on, and MiniMax M2.5 running on MI350X, the first greedy chat completion is coherent. Sending the same request again to the same server produces degenerate repetition (a \ \ token loop) instead of coherent text.
Stock vLLM v0.22 (baseline) and Atom with ATOM_DISABLE_VLLM_PLUGIN=1 remain stable across both requests.
Single-request testing does not surface this; it appears tied to a second request with a warm prefix / KV cache.
Environment
| Item | Value |
| --- | --- |
| GPU | 8× MI350X (HIP_VISIBLE_DEVICES=0,1, TP=2) |
| Model | MiniMaxAI/MiniMax-M2.5 |
| Baseline image | vllm/vllm-openai-rocm:v0.22.0 |
| Atom image | rocm/atom-dev:vllm-v0.22.0-nightly_20260604 |
| Repro stamp | 20260616_165800_kv_cache_repro |
Serve configuration
AITER_QUICK_REDUCE_QUANTIZATION=INT4
ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1 # experiments B/C (Atom)
VLLM_FLOAT32_MATMUL_PRECISION=high
PYTHONHASHSEED=0
VLLM_ROCM_USE_AITER=1
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1
# Experiment C only: ATOM_DISABLE_VLLM_PLUGIN=1
vllm serve /path/to/minimax-m2.5 \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size=2 \
--async-scheduling \
--compilation-config='{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--max-num-batched-tokens=16384 \
--tool-call-parser=minimax_m2 \
--reasoning-parser=minimax_m2_append_think \
--enable-auto-tool-choice \
--trust-remote-code \
--safetensors-load-strategy=eager \
--enable-prefix-caching \
--kv-cache-dtype=fp8 \
--attention-backend=ROCM_AITER_UNIFIED_ATTN \
--gpu-memory-utilization=0.78
Steps to reproduce
- Start vLLM with Atom plugin ON (experiment B config above).
- Send the same chat completion request twice (1 s pause), greedy decode:
{
"model": "/dev/shm/hf-cache/models/minimax-m2.5",
"messages": [{
"role": "user",
"content": "Provide a comprehensive explanation of how modern CPUs execute instructions, including the instruction pipeline, branch prediction, speculative execution, out-of-order execution, register renaming, cache hierarchies (L1/L2/L3), memory barriers, and SIMD instructions. Explain how these concepts affect code performance."
}],
"max_tokens": 1024,
"temperature": 0,
"stream": false
}
- Compare choices[0].message.content from request 1 vs request 2.
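The steps above could be scripted roughly as follows. This is a minimal sketch, not the script used for the report: it assumes the server from the serve configuration is reachable at `http://localhost:8000` and exposes vLLM's OpenAI-compatible `/v1/chat/completions` route. The helper names (`build_payload`, `send`, `run_twice`, `main`) are made up for illustration.

```python
import json
import time
import urllib.request

# Assumption: server started with --port 8000 and reachable on localhost.
URL = "http://localhost:8000/v1/chat/completions"

# Prompt copied from the request body in the steps above.
PROMPT = (
    "Provide a comprehensive explanation of how modern CPUs execute "
    "instructions, including the instruction pipeline, branch prediction, "
    "speculative execution, out-of-order execution, register renaming, "
    "cache hierarchies (L1/L2/L3), memory barriers, and SIMD instructions. "
    "Explain how these concepts affect code performance."
)


def build_payload(model: str) -> dict:
    """Build the greedy, non-streaming chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 1024,
        "temperature": 0,
        "stream": False,
    }


def send(payload: dict, url: str = URL, timeout: float = 600.0) -> str:
    """POST one request and return choices[0].message.content."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"] or ""


def run_twice(model: str, pause_s: float = 1.0) -> tuple[str, str]:
    """Send the identical request twice with a short pause in between."""
    payload = build_payload(model)
    first = send(payload)
    time.sleep(pause_s)
    second = send(payload)
    return first, second


def main() -> None:
    # Model name as used in the request body above.
    first, second = run_twice("/dev/shm/hf-cache/models/minimax-m2.5")
    for idx, text in enumerate((first, second), start=1):
        print(f"request {idx}: {len(text)} chars, starts with {text[:50]!r}")
```

Calling `main()` against a running server prints the character count and the first 50 characters of each response, which are the two quantities reported in the Results and Analysis sections.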
Control experiments
| ID | Image | Atom plugin |
| --- | --- | --- |
| A | vllm/vllm-openai-rocm:v0.22.0 | n/a (baseline) |
| B | rocm/atom-dev:vllm-v0.22.0-nightly_20260604 | ON |
| C | rocm/atom-dev:vllm-v0.22.0-nightly_20260604 | OFF (ATOM_DISABLE_VLLM_PLUGIN=1) |
Results
Two identical greedy requests per experiment (temperature=0, max_tokens=1024). Output length is the character count of choices[0].message.content. "Output 1 ≈ Output 2" means both responses are coherent and non-degenerate; it does not mean they are byte-identical.
| Experiment | Output length (req 1 / req 2) | Output 1 ≈ Output 2 |
| --- | --- | --- |
| A (vLLM baseline) | 4886 / 5315 | ✅ |
| B (Atom plugin ON) | 5225 / 1544 | ❌ |
| C (Atom plugin OFF) | 5191 / 5289 | ✅ |
Experiment B is the only case where request 2 is shorter and qualitatively broken (degenerate \ \ repetition vs coherent reasoning on request 1).
Analysis
First 50 characters of choices[0].message.content for each request:
| Experiment | Request 1 | Request 2 |
| --- | --- | --- |
| A (vLLM baseline) | <think>Hmm, this is a complex and technical questi | <think>Hmm, this is a complex and detailed questio |
| B (Atom plugin ON) | <think>Hmm, this is a complex and detailed questio | <think> ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵ |
| C (Atom plugin OFF) | <think>Hmm, this is a complex and detailed questio | <think>Hmm, this is a complex and detailed questio |
Only experiment B’s second request diverges immediately into a \ \ repetition pattern instead of coherent text.
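For automated regression checks, the "coherent vs. degenerate" judgment could be approximated with a simple heuristic. The sketch below is an assumption, not the method used for the table above: it splits the response on whitespace (not model tokens) and flags it when a single chunk dominates. The function names and the 0.5 threshold are arbitrary illustrative choices.

```python
from collections import Counter


def top_token_share(text: str) -> float:
    """Fraction of whitespace-separated chunks equal to the most common one."""
    tokens = text.split()
    if not tokens:
        return 0.0
    _, count = Counter(tokens).most_common(1)[0]
    return count / len(tokens)


def looks_degenerate(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: flag text when one chunk makes up at least `threshold` of it.

    The threshold is an illustrative assumption, not a tuned value.
    """
    return top_token_share(text) >= threshold
```

A response consisting of `<think>` followed by many lines of a lone backslash scores close to 1.0 and is flagged, while ordinary prose scores far below the threshold. The heuristic would miss loops over longer repeated phrases, so it is only a first-pass filter.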
Additional notes
- Single-request runs on the same stack did not show this regression; two identical requests were required.
Operating System
Ubuntu 24.04
CPU
AMD EPYC 9575F 64-Core Processor x 2
GPU
MI350X
ROCm Version
7.2.2