[Issue]: Running MiniMax 2.5/2.7 using tip of tree (ATOM/AITER/vLLM) or the latest nightly version with TP=8 results in error: Invalid qk hidden dim layout for qknorm_allreduce_fusion_kernel_2stage kernel

### Problem Description
I am using vLLM + AITER + ATOM at tip of tree. I tried the following images:

1) `rocm/atom-dev:vllm-latest` (works)
2) `rocm/atom-dev:vllm-v0.22.0-nightly_20260616` (doesn't work, exhibits this problem)

I ran with the following configuration:
```
export HF_TOKEN=<hf token>
TP=8
vllm serve MiniMaxAI/MiniMax-M2.5 \
    --trust-remote-code \
    --tensor-parallel-size ${TP} \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --enable-expert-parallel \
    --enforce-eager \
    --gpu-memory-utilization 0.55
```

I got the following error:

```
(EngineCore pid=373365) ERROR 06-15 17:28:30 [multiproc_executor.py:284] Worker proc VllmWorker-6 died unexpectedly, shutting down executor.
(EngineCore pid=373365) INFO 06-15 17:28:30 [multiproc_executor.py:428] [shutdown] Executor: waiting for worker exit count=8
(EngineCore pid=373365) Process EngineCore:
(EngineCore pid=373365) Traceback (most recent call last):
(EngineCore pid=373365)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=373365)     self.run()
(EngineCore pid=373365)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=373365)     self._target(*self._args, **self._kwargs)
(EngineCore pid=373365)   File "/vllm-upstream-atom/vllm/v1/engine/core.py", line 1199, in run_engine_core
(EngineCore pid=373365)     raise e
(EngineCore pid=373365)   File "/vllm-upstream-atom/vllm/v1/engine/core.py", line 1164, in run_engine_core
(EngineCore pid=373365)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=373365)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=373365)   File "/vllm-upstream-atom/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=373365)     return func(*args, **kwargs)
(EngineCore pid=373365)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=373365)   File "/vllm-upstream-atom/vllm/v1/engine/core.py", line 930, in __init__
(EngineCore pid=373365)     super().__init__(
(EngineCore pid=373365)   File "/vllm-upstream-atom/vllm/v1/engine/core.py", line 132, in __init__
(EngineCore pid=373365)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=373365)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=373365)   File "/vllm-upstream-atom/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=373365)     return func(*args, **kwargs)
(EngineCore pid=373365)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=373365)   File "/vllm-upstream-atom/vllm/v1/engine/core.py", line 257, in _initialize_kv_caches
(EngineCore pid=373365)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=373365)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=373365)   File "/vllm-upstream-atom/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=373365)     return self.collective_rpc("determine_available_memory")
(EngineCore pid=373365)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=373365)   File "/vllm-upstream-atom/vllm/v1/executor/multiproc_executor.py", line 404, in collective_rpc
(EngineCore pid=373365)     return future if non_block else future.result()
(EngineCore pid=373365)                                     ^^^^^^^^^^^^^^^
(EngineCore pid=373365)   File "/vllm-upstream-atom/vllm/v1/executor/multiproc_executor.py", line 91, in result
(EngineCore pid=373365)     return super().result()
(EngineCore pid=373365)            ^^^^^^^^^^^^^^^^
(EngineCore pid=373365)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=373365)     return self.__get_result()
(EngineCore pid=373365)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=373365)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=373365)     raise self._exception
(EngineCore pid=373365)   File "/vllm-upstream-atom/vllm/v1/executor/multiproc_executor.py", line 95, in _wait_for_response
(EngineCore pid=373365)     response = self.aggregate(self.get_response())
(EngineCore pid=373365)                               ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=373365)   File "/vllm-upstream-atom/vllm/v1/executor/multiproc_executor.py", line 391, in get_response
(EngineCore pid=373365)     raise RuntimeError(
(EngineCore pid=373365) RuntimeError: Worker failed with error 'Invalid qk hidden dim layout for qknorm_allreduce_fusion_kernel_2stage kernel', please check the stack trace above for the root cause
```
This happens because `hidden_dim_q = 768`, `hidden_dim_k = 128`, `hidden_dim_v = 128` when `TP=8`.
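To make the dependence on `TP` concrete, here is a minimal sketch of how the per-rank sizes could arise. The head counts and head dimension below are assumptions (48 query heads, 8 KV heads, head size 128), chosen because they reproduce the 768/128/128 values reported above at `TP=8`; they are not taken from this issue, and `per_rank_hidden_dims` is a hypothetical helper, not a vLLM or AITER function.

```python
# Assumed attention shape (not from the issue): picked so that TP=8
# reproduces hidden_dim_q = 768 and hidden_dim_k = hidden_dim_v = 128.
NUM_Q_HEADS = 48
NUM_KV_HEADS = 8
HEAD_DIM = 128


def per_rank_hidden_dims(tp: int) -> dict[str, int]:
    """Per-rank q/k/v widths when heads are split evenly across `tp` ranks."""
    if NUM_Q_HEADS % tp or NUM_KV_HEADS % tp:
        raise ValueError(f"head counts are not divisible by tp={tp}")
    q = (NUM_Q_HEADS // tp) * HEAD_DIM
    kv = (NUM_KV_HEADS // tp) * HEAD_DIM
    return {"hidden_dim_q": q, "hidden_dim_k": kv, "hidden_dim_v": kv}


# Show how the per-rank widths shrink as the tensor-parallel size grows.
for tp in (1, 2, 4, 8):
    print(tp, per_rank_hidden_dims(tp))
```

Under these assumptions, `TP=8` leaves a single KV head per rank, which is where the 128-wide `k` and `v` inputs come from.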

The kernel launcher then expects every hidden dimension to be a multiple of:

`constexpr int WARP_WORK_SIZE = WARP_SIZE * PACK_SIZE`

with
`constexpr int PACK_SIZE  = 16 / sizeof(T);`

The inputs are of type `torch.bfloat16` (2 bytes per element), so `PACK_SIZE` is `16 / 2 = 8` and `WARP_WORK_SIZE` is `WARP_SIZE (32) * PACK_SIZE (8) = 256`. `hidden_dim_q = 768` is a multiple of 256, but `hidden_dim_k = 128` and `hidden_dim_v = 128` are not, which causes the error to be raised.
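The arithmetic can be checked with a short sketch. It assumes the launcher's check is a plain "multiple of `WARP_WORK_SIZE`" test and uses the `WARP_SIZE` of 32 quoted in this report; `warp_work_size` and `misaligned_dims` are hypothetical names for illustration, not AITER functions.

```python
WARP_SIZE = 32        # value quoted in this report (assumption for the target GPU)
BYTES_PER_PACK = 16   # from `PACK_SIZE = 16 / sizeof(T)`


def warp_work_size(elem_bytes: int) -> int:
    """Elements one warp handles per step: WARP_SIZE * (16 / sizeof(T))."""
    pack_size = BYTES_PER_PACK // elem_bytes
    return WARP_SIZE * pack_size


def misaligned_dims(dims: dict[str, int], elem_bytes: int = 2) -> dict[str, int]:
    """Return the dimensions that are not a multiple of the warp work size."""
    work = warp_work_size(elem_bytes)
    return {name: size for name, size in dims.items() if size % work != 0}


# Per-rank sizes reported in this issue for TP=8, with 2-byte bfloat16 inputs.
dims = {"hidden_dim_q": 768, "hidden_dim_k": 128, "hidden_dim_v": 128}
print(warp_work_size(2))      # 32 * (16 // 2) = 256
print(misaligned_dims(dims))  # the k and v widths fail the check
```

With these numbers only `hidden_dim_q` passes, so any per-rank `k`/`v` width below 256 would trip the same check.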

### Operating System

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

### CPU

AMD EPYC 9575F 64-Core Processor

### GPU

AMD EPYC 9575F 64-Core Processor

### ROCm Version

ROCm 7.2.2

### ROCm Component

_No response_

### Steps to Reproduce

```
export HF_TOKEN=<hf token>
TP=8
vllm serve MiniMaxAI/MiniMax-M2.5 \
    --trust-remote-code \
    --tensor-parallel-size ${TP} \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --enable-expert-parallel \
    --enforce-eager \
    --gpu-memory-utilization 0.55
```



### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

<details>
<summary>rocminfo --support output</summary>

```
Paste output here
```

</details>


### Additional Information

_No response_

[Issue]: Running Minimax 2.5/2.7 using tip of tree (ATOM/AITER/vLLM) or latest nightly version with TP=8 on results in error: Invalid qk hidden dim layout Invalid qk hidden dim layout for qknorm_allreduce_fusion_kernel_2stage kernel: #1223
