[BUG] Performance degradation and garbled text output when benchmarking DeepSeek-V4-Pro with ATOM backend on MI355X

### Problem Description

Issue Description
I attempted to replicate the benchmark results for backend=ATOM and model=DeepSeek-V4-Pro on hardware=mi355x following the parameters and docker images specified in the [ROCm ATOM Benchmark Dashboard](https://www.google.com/search?q=https://rocm.github.io/ATOM/benchmark-dashboard/%23backend%3DATOM%26model%3DDeepSeek-V4-Pro%26hardware%3Dmi355x).

However, the performance in my environment is far below the dashboard numbers: total token throughput is roughly 13.5x lower (138.07 vs. 1868.5 tok/s). In addition, newer image versions produce garbled, corrupted text output.

Environment & Reproduce Steps
1. Server Start Command
```bash
docker run -itd \
  --name deepseek-v4-pro \
  --device amd.com/gpu=all \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --privileged \
  --security-opt seccomp=unconfined \
  --ipc=host \
  --network=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --entrypoint=/bin/bash \
  -v /data/:/data/ \
  -e ATOM_DISABLE_MMAP=true \
  -e AITER_BF16_FP8_MOE_BOUND=0 \
  -e ATOM_MOE_GU_ITLV=1 \
  rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.3 \
  -c "python -m atom.entrypoints.openai_server \
    --model /data/DeepSeek-V4-Pro \
    --kv_cache_dtype fp8 \
    -tp 8 \
    &>> /data/atom-log/vllm_deepseek_v4_pro.log 2>&1"
```
2. Benchmark Command
```bash
python -m atom.benchmarks.benchmark_serving \
  --model=/data/DeepSeek-V4-Pro \
  --backend=vllm \
  --base-url=http://localhost:8000 \
  --dataset-name=random \
  --random-input-len=8192 \
  --random-output-len=1024 \
  --random-range-ratio=0.8 \
  --max-concurrency=4 \
  --num-prompts=40 \
  --trust-remote-code \
  --num-warmups=8 \
  --request-rate=inf \
  --ignore-eos \
  --save-result \
  --percentile-metrics=ttft,tpot,itl,e2el \
  --result-dir=. \
  --result-filename=benchmark_result.json
```
Actual Results (atom0.1.3)
```text
============ Serving Benchmark Result ============
Successful requests:                      40        
Benchmark duration (s):                  1337.81   
Total input tokens:                      163098    
Total generated tokens:                  21612     
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         16.15     
Total Token throughput (tok/s):          138.07    
---------------Time to First Token----------------
Mean TTFT (ms):                          3683.35   
Median TTFT (ms):                        3114.96   
P99 TTFT (ms):                           10459.66  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          232.95    
Median TPOT (ms):                        231.82    
P99 TPOT (ms):                           255.16    
---------------Inter-token Latency----------------
Mean ITL (ms):                           233.59    
Median ITL (ms):                         221.04    
P99 ITL (ms):                            229.28    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          129894.08 
Median E2EL (ms):                        123802.28 
P99 E2EL (ms):                           229624.88 
==================================================
```
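As a sanity check on the report above, here is a minimal sketch that parses a few of its lines and recomputes the throughput figures. The `parse_report` helper is hypothetical (it is not part of ATOM), and the decode bound assumes all 4 concurrent slots (`--max-concurrency=4`) stay busy for the whole run.

```python
import re

# A few lines copied from the benchmark report above.
REPORT = """\
Successful requests:                      40
Benchmark duration (s):                  1337.81
Total input tokens:                      163098
Total generated tokens:                  21612
Output token throughput (tok/s):         16.15
Total Token throughput (tok/s):          138.07
Mean TPOT (ms):                          232.95
"""


def parse_report(text):
    """Map each 'label: value' line to a float; lines without a value are skipped."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"^(.+?):\s+([0-9.]+)\s*$", line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


metrics = parse_report(REPORT)
duration = metrics["Benchmark duration (s)"]

# Recompute throughput from raw token counts and wall-clock duration.
total_tps = (metrics["Total input tokens"] + metrics["Total generated tokens"]) / duration
output_tps = metrics["Total generated tokens"] / duration

# Rough upper bound on decode throughput: 4 concurrent streams (assumed always
# busy), each producing one token per mean TPOT.
decode_bound = 4 / (metrics["Mean TPOT (ms)"] / 1000.0)

print(f"recomputed total throughput:  {total_tps:.2f} tok/s")
print(f"recomputed output throughput: {output_tps:.2f} tok/s")
print(f"decode bound at concurrency 4: {decode_bound:.2f} tok/s")
```

The recomputed values agree with the reported ones, and the observed output throughput sits close to the bound implied by TPOT. This suggests that per-token decode latency, rather than prefill, accounts for most of the gap.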
Metrics Comparison
| Source | Total Token Throughput (tok/s) | Median TTFT (ms) | Median TPOT (ms) |
|-------|-------|-------|-------|
| Expected (Dashboard) | 1868.5 | 869.8 | 15.68 |
| Actual | 138.07 | 3114.96 | 231.82 |
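To put a number on the gap per metric, the following sketch computes the slowdown factor for each column of the comparison. The dictionary keys and the `slowdown` helper are names chosen for illustration; the values are the dashboard figures and the medians from the benchmark report.

```python
# Dashboard figures vs. the values measured with the atom0.1.3 image.
EXPECTED = {"total_tps": 1868.5, "median_ttft_ms": 869.8, "median_tpot_ms": 15.68}
ACTUAL = {"total_tps": 138.07, "median_ttft_ms": 3114.96, "median_tpot_ms": 231.82}

# Throughput is better when higher; the latency metrics are better when lower.
HIGHER_IS_BETTER = {"total_tps": True, "median_ttft_ms": False, "median_tpot_ms": False}


def slowdown(expected, actual, higher_is_better):
    """Return how many times worse `actual` is than `expected` (1.0 means equal)."""
    return expected / actual if higher_is_better else actual / expected


ratios = {
    name: slowdown(EXPECTED[name], ACTUAL[name], HIGHER_IS_BETTER[name])
    for name in EXPECTED
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x worse than dashboard")
```

Total throughput and TPOT are both off by more than 13x, while TTFT is off by a smaller factor of about 3.6x.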

Additional Observations (Garbled Output on Newer Images)
With the same configuration, several newer image versions return garbled, nonsensical text. Affected images:
- rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.4
- rocm/atom-dev:nightly_202606201539
- rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.4_20260612

Are there any missing environment variables, hardware pre-configurations (e.g., RCCL/NUMA tuning), or model-specific flags needed to reproduce the dashboard performance and resolve the text corruption?

Example request and the garbled response it returned:

```bash
time curl -X POST http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
    "model": "/data/DeepSeek-v4-Flash",
    "messages": [
        {"role": "user", "content": "你的名字叫什么"}
    ],
   "stream": false
}'
{"id":"chatcmpl-aaa78e6ac7af41728bc318852ff1c593","object":"chat.completion","created":1782018338,"model":"/data/DeepSeek-v4-Flash","choices":[{"index":0,"message":{"role":"assistant","content":" the names? thinking the names is, we我们本事了大部分人 的名字? We说是名字记录， verschiedenen records on, einen single. as??? wieYeah??Sun??andurned?sumified???um?isO???&maybe?? Forme-Siter and thinking-t-a?Form-in:`att?iat##F-one? (F-functionrunin-rerecord-en.- Essay- Scv-representative! The len! ofglobalche-Cxs? mul? (((mak-S-m-ent)ASess?))Nik?? (Nic?glob?? ^(i-?nebe&D_&&ie list-s-?w)???ThemeE??Process???(Ref-M?!! ), it-R? ??? &!.tre? and43x?Collections? icna-nameac-De-:--Tri()bow-,Anchor?‘order-Tnit-Aellungandsp-???’ for-atio) “example?ch??:-of????ph - though-p`s-b & {danz?HC==Jist]-A being-h-? (tat-iconco~‘to fund->=S possession’Is}to-Esurfoundoutpu?aqueSS??w?rol???&Tpertension???and?.eh?pi??ew?.l-gener~*eer-def-y-R\n & description ? -+N??S???*??the-TOPTENT,:xe-ofoptI;-?It-ck??+omker?\n-are-mat-D zer:-√#and?as-Le-thoffs????1?ing??.\n\nmetalAN???,-ge-la-f?md-bdg?ns???$-v?lis-tab___LT-h-V?m-I???whatlet???, ch????vs??Purpose????w&sh.now-n-bornxp-g?: ?PLid??the-doors??Is?,an??-?mFe?‘R?-??T?!?»read(thvmrmin??)ny?metal \n?&-PV:?SV???&LI,.sw?periodic-w.??E-anw-E??Estryfo??;spl?abp?-??nez-C-and????- ~-s¢-!?c????v???l???-?-???Stop??+??l-???iSu.??MWW?**(?edw.!by??try==LL?dd-in-L-brg?! point-O,[?&- erstle-mHL-)WaBut$)W-?!p?Shu???gu?-!?!!       ebii!:?part-buh??&& ?-(?~.??43?&-???raWa—or ____),b-but-. figure?dy?; look-ma�?act?? is&e?{rsMon9)\n   -ou?“p??ayusion-??LAGASS??!!I ?-&&?p*OurcedeNin??_)®one-source??→Some?Main????and->- -?-*-hWaThis ????Will-S$w~-(v? mostSi-P?).??? ?P\\); -she+><hndju??)?~h-bCh??-hol-s!? -> (ce,gt-fat?-fvir?AO-M?s?? ),??.?Tr-S?asi?-?BuckD?--?? br ?Y-y-lhJa??-menelling??only-h?as?Ro-VI? (????See-l)Sl-h???snb-Un?ce??e(?-se?{-^»acUVWiththough??nor: .acWLevel�),vu?R-d?d?Turn-/?UseJR× VI?????uses-andcolumn?—profile :??ht?pk?tunsyc?-tg-tac·??─------ SC;act__???gin‐? 
?sc-s??OR`IU-ol?:oft?👉o- sem-L―)?oper? �GET,?MP|?edile’?"},"finish_reason":"max_tokens"}],"usage":{"prompt_tokens":7,"completion_tokens":8192,"total_tokens":8199,"ttft_s":0.3518,"tpot_s":0.1127,"latency_s":923.3105},"kv_transfer_params":null}
real    15m23.320s
user    0m0.011s
sys     0m0.014s
```
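The `usage` block in the response above can be cross-checked against the wall-clock time. This is a minimal sketch that assumes `tpot_s` is the mean time per output token excluding the first token; the response does not state this, so treat it as my reading of the field names.

```python
# Values copied from the "usage" object in the response above.
usage = {
    "prompt_tokens": 7,
    "completion_tokens": 8192,
    "ttft_s": 0.3518,
    "tpot_s": 0.1127,
    "latency_s": 923.3105,
}

# Assumption: latency ~= time to first token + (remaining tokens * time per token).
estimated_latency_s = usage["ttft_s"] + (usage["completion_tokens"] - 1) * usage["tpot_s"]

# Single-request decode rate implied by tpot_s.
decode_tok_per_s = 1.0 / usage["tpot_s"]

print(f"estimated latency: {estimated_latency_s:.1f} s (reported {usage['latency_s']:.1f} s)")
print(f"decode rate: {decode_tok_per_s:.2f} tok/s")
```

Under that assumption the estimate matches the reported latency to within 1%. Nearly all of the 15 minutes is spent decoding garbled tokens until the request stops with `finish_reason` `max_tokens` at 8192 completion tokens.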

### Operating System

Ubuntu 24.04 LTS

### CPU

AMD EPYC 9575F

### GPU

AMD Instinct MI355X

### ROCm Version

7.2.4

### ROCm Component

_No response_

### Steps to Reproduce

_No response_

### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

<details>
<summary>rocminfo --support output</summary>

```
Paste output here
```

</details>


### Additional Information

_No response_

[BUG] Performance degradation and garbled text output when benchmarking DeepSeek-V4-Pro with ATOM backend on MI355X #1302
