[Issue]: Output degeneration on second identical request with prefix caching (MiniMax M2.5, MI350X)

### Problem Description

## Summary

With the **Atom plugin enabled**, **prefix caching on**, and **MiniMax M2.5** served on MI350X, the **first** greedy chat completion is coherent. Sending the **same request again** to the same server produces **degenerate repetition** (a `\ \` token loop) instead of coherent text.

Stock vLLM v0.22.0 (baseline) and the Atom image with `ATOM_DISABLE_VLLM_PLUGIN=1` both stay coherent across the two requests.

Single-request testing does not surface this; it appears tied to a **second request with a warm prefix / KV cache**.

## Environment

| Item           | Value                                         |
| -------------- | --------------------------------------------- |
| GPU            | 8× MI350X (`HIP_VISIBLE_DEVICES=0,1`, TP=2)   |
| Model          | `MiniMaxAI/MiniMax-M2.5`                      |
| Baseline image | `vllm/vllm-openai-rocm:v0.22.0`               |
| Atom image     | `rocm/atom-dev:vllm-v0.22.0-nightly_20260604` |
| Repro stamp    | `20260616_165800_kv_cache_repro`              |

### Serve configuration

```bash
# Environment variables, exported so that `vllm serve` inherits them.
export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1   # experiments B/C (Atom)
export VLLM_FLOAT32_MATMUL_PRECISION=high
export PYTHONHASHSEED=0
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1
# Experiment C only:
# export ATOM_DISABLE_VLLM_PLUGIN=1

vllm serve /path/to/minimax-m2.5 \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size=2 \
  --async-scheduling \
  --compilation-config='{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --max-num-batched-tokens=16384 \
  --tool-call-parser=minimax_m2 \
  --reasoning-parser=minimax_m2_append_think \
  --enable-auto-tool-choice \
  --trust-remote-code \
  --safetensors-load-strategy=eager \
  --enable-prefix-caching \
  --kv-cache-dtype=fp8 \
  --attention-backend=ROCM_AITER_UNIFIED_ATTN \
  --gpu-memory-utilization=0.78
```

## Steps to reproduce

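1. Start vLLM with the Atom plugin **ON** (experiment B, using the serve configuration above).
2. Send the same chat completion request **twice**, 1 s apart, with greedy decoding (`temperature=0`). The `model` field must match the path passed to `vllm serve` (or `--served-model-name`, if set):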

```json
{
  "model": "/dev/shm/hf-cache/models/minimax-m2.5",
  "messages": [{
    "role": "user",
    "content": "Provide a comprehensive explanation of how modern CPUs execute instructions, including the instruction pipeline, branch prediction, speculative execution, out-of-order execution, register renaming, cache hierarchies (L1/L2/L3), memory barriers, and SIMD instructions. Explain how these concepts affect code performance."
  }],
  "max_tokens": 1024,
  "temperature": 0,
  "stream": false
}
```

3. Compare `choices[0].message.content` from request 1 vs request 2.
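The steps above can be scripted. The following is a minimal sketch, not the harness used for the original run: it assumes the server is reachable at `localhost:8000` (the port from the serve command) through vLLM's OpenAI-compatible `/v1/chat/completions` route, and the helper names `build_payload`, `post_chat`, and `run_repro` are hypothetical.

```python
import json
import time
import urllib.request

# Assumed endpoint: port 8000 from the serve command plus the
# OpenAI-compatible chat completions route.
URL = "http://localhost:8000/v1/chat/completions"


def build_payload(model, prompt):
    """Build the greedy, non-streaming request body from step 2."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "temperature": 0,
        "stream": False,
    }


def post_chat(payload, url=URL, timeout=600):
    """POST one chat completion and return choices[0].message.content."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


def run_repro(payload, send=post_chat, pause_s=1.0):
    """Send the identical payload twice, pausing in between, and return both outputs."""
    first = send(payload)
    time.sleep(pause_s)
    second = send(payload)
    return first, second
```

Calling `run_repro(build_payload(model, prompt))` with the served model path and the full prompt from step 2 returns the two strings to compare in step 3. The `send` parameter exists only so the loop can be exercised without a live server.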

### Control experiments

| ID  | Image                                         | Atom plugin                        |
| --- | --------------------------------------------- | ---------------------------------- |
| A   | `vllm/vllm-openai-rocm:v0.22.0`               | n/a (baseline)                     |
| B   | `rocm/atom-dev:vllm-v0.22.0-nightly_20260604` | ON                                 |
| C   | `rocm/atom-dev:vllm-v0.22.0-nightly_20260604` | OFF (`ATOM_DISABLE_VLLM_PLUGIN=1`) |
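The value of experiment C is that it shares an image with B and differs only in the plugin switch. The sketch below encodes the three experiments as data to make that explicit. It assumes the unannotated variables in the serve configuration apply to all three experiments, including baseline A; the names `COMMON_ENV`, `EXPERIMENTS`, `env_for`, and `changed_keys` are hypothetical.

```python
# Variables listed without an experiment annotation in the serve configuration.
# Assumption: these were also set for baseline experiment A.
COMMON_ENV = {
    "AITER_QUICK_REDUCE_QUANTIZATION": "INT4",
    "VLLM_FLOAT32_MATMUL_PRECISION": "high",
    "PYTHONHASHSEED": "0",
    "VLLM_ROCM_USE_AITER": "1",
    "VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION": "1",
}

ATOM_IMAGE = "rocm/atom-dev:vllm-v0.22.0-nightly_20260604"
FUSION = {"ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION": "1"}  # B/C only

EXPERIMENTS = {
    "A": {"image": "vllm/vllm-openai-rocm:v0.22.0", "env": {}},
    "B": {"image": ATOM_IMAGE, "env": dict(FUSION)},
    "C": {"image": ATOM_IMAGE, "env": {**FUSION, "ATOM_DISABLE_VLLM_PLUGIN": "1"}},
}


def env_for(exp_id):
    """Full environment for one experiment: common variables plus its overrides."""
    return {**COMMON_ENV, **EXPERIMENTS[exp_id]["env"]}


def changed_keys(exp_a, exp_b):
    """Names of variables that are set differently between two experiments."""
    env_a, env_b = env_for(exp_a), env_for(exp_b)
    return {k for k in env_a.keys() | env_b.keys() if env_a.get(k) != env_b.get(k)}
```

Under these assumptions, `changed_keys("B", "C")` contains only `ATOM_DISABLE_VLLM_PLUGIN`, so a difference in behavior between B and C points at the plugin rather than at the image.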

## Results

Two identical greedy requests were sent per experiment (`temperature=0`, `max_tokens=1024`). Output length is the character count of `choices[0].message.content`. **Output 1 ≈ Output 2** means that both responses are coherent and non-degenerate; it does not require them to be byte-identical.

| Experiment          | Output length (req 1 / req 2) | Output 1 ≈ Output 2 |
| ------------------- | ----------------------------- | ------------------- |
| A — vLLM baseline   | 4886 / 5315                   | ✅                   |
| B — Atom plugin ON  | 5225 / 1544                   | ❌                   |
| C — Atom plugin OFF | 5191 / 5289                   | ✅                   |

Experiment B is the only case where request 2 is markedly shorter (1544 vs 5225 characters) and qualitatively broken: it degenerates into `\ \` repetition, whereas request 1 contains coherent reasoning.
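The ✅/❌ column reflects a qualitative judgment of the outputs. A rough automated stand-in could flag text that is dominated by one short repeating pattern. The sketch below is a heuristic under that assumption, not the check used for the table; `longest_periodic_run`, `looks_degenerate`, and the thresholds `max_period` and `min_fraction` are hypothetical.

```python
def longest_periodic_run(text, max_period=8):
    """Longest stretch in which every character equals the one `period` positions
    earlier, maximized over all periods from 1 to max_period."""
    best = 0
    for period in range(1, max_period + 1):
        run = 0
        for i in range(period, len(text)):
            if text[i] == text[i - period]:
                run += 1
                best = max(best, run)
            else:
                run = 0
    return best


def looks_degenerate(text, max_period=8, min_fraction=0.5):
    """True if a single short repeating pattern covers at least min_fraction of the text."""
    if not text:
        return False
    return longest_periodic_run(text, max_period) / len(text) >= min_fraction
```

Applied to both responses of an experiment, a ✅ would correspond to `looks_degenerate` returning `False` for each. The thresholds are guesses and would need tuning against real outputs.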

## Analysis

First 50 characters of `choices[0].message.content` for each request:

| Experiment          | Request 1                                            | Request 2                                            |
| ------------------- | ---------------------------------------------------- | ---------------------------------------------------- |
| A — vLLM baseline   | `<think>Hmm, this is a complex and technical questi` | `<think>Hmm, this is a complex and detailed questio` |
| B — Atom plugin ON  | `<think>Hmm, this is a complex and detailed questio` | `<think>   ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵` |
| C — Atom plugin OFF | `<think>Hmm, this is a complex and detailed questio` | `<think>Hmm, this is a complex and detailed questio` |

Only experiment B’s second request diverges right after the opening `<think>` tag, falling into a `\ \` repetition pattern instead of coherent text.
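The previews in the table above can be regenerated from the raw responses, and the point where two responses start to differ can be located mechanically. This is a small sketch assuming the table renders each newline as `↵`; `preview` and `first_divergence` are hypothetical helper names.

```python
def preview(text, n=50):
    """First n characters, with newlines shown as ↵ (assumed table convention)."""
    return text[:n].replace("\n", "↵")


def first_divergence(a, b):
    """Index of the first character where a and b differ, or None if they are equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    if len(a) != len(b):
        # One string is a strict prefix of the other.
        return min(len(a), len(b))
    return None
```

For experiment B, `first_divergence(request_1, request_2)` would be expected to land just after `<think>`, matching the previews above.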

## Additional notes

- Single-request runs on the same stack did not show this regression; **two identical requests** were required.


### Operating System

Ubuntu 24.04

### CPU

AMD EPYC 9575F 64-Core Processor x 2

### GPU

MI350X

### ROCm Version

7.2.2

### ROCm Component

_No response_

### Steps to Reproduce

_No response_

### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

_No response_

### Additional Information

_No response_


[Issue]: Output degeneration on second identical request with prefix caching (MiniMax M2.5, MI350X) #1239
