[Issue]: Output degeneration on second identical request with prefix caching (MiniMax M2.5, MI350X) #1239

Description

@jamesETsmith

Problem Description

Summary

With Atom plugin enabled, prefix caching on, and MiniMax M2.5 on MI350X, the first greedy chat completion is coherent. Sending the same request again on the same server produces degenerate repetition (\ \ token loop) instead of coherent text.

Stock vLLM v0.22 (baseline) and Atom with ATOM_DISABLE_VLLM_PLUGIN=1 remain stable across both requests.

Single-request testing does not surface this; it appears tied to a second request with a warm prefix / KV cache.

Environment

| Item | Value |
| --- | --- |
| GPU | 8× MI350X (HIP_VISIBLE_DEVICES=0,1, TP=2) |
| Model | MiniMaxAI/MiniMax-M2.5 |
| Baseline image | vllm/vllm-openai-rocm:v0.22.0 |
| Atom image | rocm/atom-dev:vllm-v0.22.0-nightly_20260604 |
| Repro stamp | 20260616_165800_kv_cache_repro |

Serve configuration

AITER_QUICK_REDUCE_QUANTIZATION=INT4
ATOM_ENABLE_QK_NORM_ROPE_CACHE_QUANT_FUSION=1   # experiments B/C (Atom)
VLLM_FLOAT32_MATMUL_PRECISION=high
PYTHONHASHSEED=0
VLLM_ROCM_USE_AITER=1
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1
# Experiment C only: ATOM_DISABLE_VLLM_PLUGIN=1

vllm serve /path/to/minimax-m2.5 \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size=2 \
  --async-scheduling \
  --compilation-config='{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --max-num-batched-tokens=16384 \
  --tool-call-parser=minimax_m2 \
  --reasoning-parser=minimax_m2_append_think \
  --enable-auto-tool-choice \
  --trust-remote-code \
  --safetensors-load-strategy=eager \
  --enable-prefix-caching \
  --kv-cache-dtype=fp8 \
  --attention-backend=ROCM_AITER_UNIFIED_ATTN \
  --gpu-memory-utilization=0.78

Steps to reproduce

  1. Start vLLM with Atom plugin ON (experiment B config above).
  2. Send the same chat completion request twice (1 s pause), greedy decode:
{
  "model": "/dev/shm/hf-cache/models/minimax-m2.5",
  "messages": [{
    "role": "user",
    "content": "Provide a comprehensive explanation of how modern CPUs execute instructions, including the instruction pipeline, branch prediction, speculative execution, out-of-order execution, register renaming, cache hierarchies (L1/L2/L3), memory barriers, and SIMD instructions. Explain how these concepts affect code performance."
  }],
  "max_tokens": 1024,
  "temperature": 0,
  "stream": false
}
  3. Compare choices[0].message.content from request 1 vs request 2.
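
The two-request sequence above can be scripted. The following is a minimal sketch, assuming the server is reachable at http://localhost:8000 and serves vLLM's OpenAI-compatible /v1/chat/completions route. The helper names (build_payload, post_json, run_twice) are illustrative and are not part of the original repro harness.

```python
import json
import time
import urllib.request

# Assumption: host/port match the serve command; adjust if the server runs elsewhere.
URL = "http://localhost:8000/v1/chat/completions"

# Prompt copied from the request body in the steps above.
PROMPT = (
    "Provide a comprehensive explanation of how modern CPUs execute "
    "instructions, including the instruction pipeline, branch prediction, "
    "speculative execution, out-of-order execution, register renaming, "
    "cache hierarchies (L1/L2/L3), memory barriers, and SIMD instructions. "
    "Explain how these concepts affect code performance."
)


def build_payload(model, prompt):
    """Build the greedy, non-streaming chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "temperature": 0,
        "stream": False,
    }


def post_json(url, payload, timeout=600):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_twice(payload, post=post_json, url=URL, pause_s=1.0):
    """Send the identical payload twice and return both message contents."""
    outputs = []
    for attempt in range(2):
        body = post(url, payload)
        outputs.append(body["choices"][0]["message"]["content"])
        if attempt == 0:
            time.sleep(pause_s)  # pause between request 1 and request 2
    return outputs
```

Calling run_twice(build_payload("/dev/shm/hf-cache/models/minimax-m2.5", PROMPT)) returns the two responses in order; len(text) and text[:50] for each are the quantities compared in this report. The post parameter is injectable only so the loop can be exercised without a live server.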

Control experiments

| ID | Image | Atom plugin |
| --- | --- | --- |
| A | vllm/vllm-openai-rocm:v0.22.0 | n/a (baseline) |
| B | rocm/atom-dev:vllm-v0.22.0-nightly_20260604 | ON |
| C | rocm/atom-dev:vllm-v0.22.0-nightly_20260604 | OFF (ATOM_DISABLE_VLLM_PLUGIN=1) |

Results

Each experiment sends two identical greedy requests (temperature=0, max_tokens=1024). Output length is the character count of choices[0].message.content. "Output 1 ≈ Output 2" means both responses are coherent and non-degenerate; it does not mean they are byte-identical.

| Experiment | Output length (req 1 / req 2) | Output 1 ≈ Output 2 |
| --- | --- | --- |
| A (vLLM baseline) | 4886 / 5315 | Yes |
| B (Atom plugin ON) | 5225 / 1544 | No |
| C (Atom plugin OFF) | 5191 / 5289 | Yes |

Experiment B is the only case where request 2 is shorter and qualitatively broken (degenerate \ \ repetition vs coherent reasoning on request 1).

Analysis

First 50 characters of choices[0].message.content for each request:

| Experiment | Request 1 | Request 2 |
| --- | --- | --- |
| A (vLLM baseline) | `<think>Hmm, this is a complex and technical questi` | `<think>Hmm, this is a complex and detailed questio` |
| B (Atom plugin ON) | `<think>Hmm, this is a complex and detailed questio` | `<think> ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵\ ↵` |
| C (Atom plugin OFF) | `<think>Hmm, this is a complex and detailed questio` | `<think>Hmm, this is a complex and detailed questio` |

Only experiment B’s second request diverges immediately into a \ \ repetition pattern instead of coherent text.
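
The "degenerate repetition" judgment above was made by reading the outputs. For automated runs it could be approximated with a simple heuristic. The sketch below is hypothetical and was not used to produce the results in this report: it flags a response when one run of a short repeating unit covers most of the text. The function names, the maximum period of 8 characters, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def longest_periodic_run(text, max_period=8):
    """Length of the longest substring that repeats with a period <= max_period."""
    best = 0
    for period in range(1, max_period + 1):
        run = 0
        for i in range(period, len(text)):
            if text[i] == text[i - period]:
                run += 1
                # A run of `run` matches spans `run + period` characters.
                best = max(best, run + period)
            else:
                run = 0
    return best


def looks_degenerate(text, max_period=8, threshold=0.5):
    """Heuristic: True if one short repeating pattern covers >= threshold of text."""
    if not text:
        return False
    return longest_periodic_run(text, max_period) / len(text) >= threshold
```

Applied to a string shaped like experiment B's second response (a `<think>` tag followed by a long newline-backslash-space loop) the heuristic returns True, while ordinary prose such as the opening of the coherent responses stays well below the threshold.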

Additional notes

  • Single-request runs on the same stack did not show this regression; two identical requests were required.

Operating System

Ubuntu 24.04

CPU

AMD EPYC 9575F 64-Core Processor x 2

GPU

MI350X

ROCm Version

7.2.2
