
[BUG] Performance degradation and garbled text output when benchmarking DeepSeek-V4-Pro with ATOM backend on MI355X #1302


@luqitao

Problem Description

Issue Description
I attempted to replicate the benchmark results for backend=ATOM and model=DeepSeek-V4-Pro on hardware=mi355x following the parameters and docker images specified in the ROCm ATOM Benchmark Dashboard.

However, the performance I measured is far below the dashboard figures: total token throughput is roughly 13.5x lower (138.07 vs. 1868.5 tok/s). In addition, newer image versions return garbled text.

Environment & Reproduce Steps

  1. Server Start Command
docker run -itd \
  --name deepseek-v4-pro \
  --device amd.com/gpu=all \
  --device /dev/kfd \
  --device /dev/dri \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --privileged \
  --security-opt seccomp=unconfined \
  --ipc=host \
  --network=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --entrypoint=/bin/bash \
  -v /data/:/data/ \
  -e ATOM_DISABLE_MMAP=true \
  -e AITER_BF16_FP8_MOE_BOUND=0 \
  -e ATOM_MOE_GU_ITLV=1 \
  rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.3 \
  -c "python -m atom.entrypoints.openai_server \
    --model /data/DeepSeek-V4-Pro \
    --kv_cache_dtype fp8 \
    -tp 8 \
    &>> /data/atom-log/vllm_deepseek_v4_pro.log 2>&1"
  2. Benchmark Command
python -m atom.benchmarks.benchmark_serving \
  --model=/data/DeepSeek-V4-Pro \
  --backend=vllm \
  --base-url=http://localhost:8000 \
  --dataset-name=random \
  --random-input-len=8192 \
  --random-output-len=1024 \
  --random-range-ratio=0.8 \
  --max-concurrency=4 \
  --num-prompts=40 \
  --trust-remote-code \
  --num-warmups=8 \
  --request-rate=inf \
  --ignore-eos \
  --save-result \
  --percentile-metrics=ttft,tpot,itl,e2el \
  --result-dir=. \
  --result-filename=benchmark_result.json

Actual Results (atom0.1.3)

============ Serving Benchmark Result ============
Successful requests:                      40        
Benchmark duration (s):                  1337.81   
Total input tokens:                      163098    
Total generated tokens:                  21612     
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         16.15     
Total Token throughput (tok/s):          138.07    
---------------Time to First Token----------------
Mean TTFT (ms):                          3683.35   
Median TTFT (ms):                        3114.96   
P99 TTFT (ms):                           10459.66  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          232.95    
Median TPOT (ms):                        231.82    
P99 TPOT (ms):                           255.16    
---------------Inter-token Latency----------------
Mean ITL (ms):                           233.59    
Median ITL (ms):                         221.04    
P99 ITL (ms):                            229.28    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          129894.08 
Median E2EL (ms):                        123802.28 
P99 E2EL (ms):                           229624.88 
==================================================
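
As a quick sanity check, the reported throughput figures can be recomputed from the token counts and duration in the summary above. The sketch below uses only numbers copied from that summary. The "decode bound" is an illustrative estimate, not something the benchmark tool reports: it assumes all 4 concurrent requests decode continuously at the median TPOT and ignores prefill time.

```python
# Figures copied from the benchmark summary above.
duration_s = 1337.81
input_tokens = 163098
output_tokens = 21612
max_concurrency = 4        # from --max-concurrency=4
median_tpot_ms = 231.82

# Throughput as tokens divided by wall-clock duration.
output_tput = output_tokens / duration_s
total_tput = (input_tokens + output_tokens) / duration_s

# Rough upper bound on decode rate (assumption: every slot decodes
# continuously at the median TPOT, with no time spent on prefill).
decode_bound = max_concurrency * 1000.0 / median_tpot_ms

print(f"output throughput: {output_tput:.2f} tok/s")
print(f"total throughput:  {total_tput:.2f} tok/s")
print(f"decode bound:      {decode_bound:.2f} tok/s")
```

The recomputed values match the reported 16.15 and 138.07 tok/s, and the output throughput sits just under the decode bound. This suggests the low throughput follows directly from the per-token latency rather than from queuing or failed requests.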

Metrics Comparison

Metric                 Total Throughput (tok/s)   Median TTFT (ms)   Median TPOT (ms)
Expected (Dashboard)   1868.5                     869.8              15.68
Actual                 138.07                     3114.96            231.82
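
To make the gap per metric explicit, the ratios can be computed directly. This is a minimal sketch using the dashboard values quoted above and the measured values from the benchmark summary. The dictionary keys are names chosen for this illustration only.

```python
# Dashboard values as quoted in this issue, and measured values
# from the benchmark summary (median TTFT taken as 3114.96 ms).
expected = {"total_tput": 1868.5, "median_ttft_ms": 869.8, "median_tpot_ms": 15.68}
actual = {"total_tput": 138.07, "median_ttft_ms": 3114.96, "median_tpot_ms": 231.82}

# Throughput: higher is better, so slowdown = expected / actual.
tput_slowdown = expected["total_tput"] / actual["total_tput"]
# Latencies: lower is better, so slowdown = actual / expected.
ttft_slowdown = actual["median_ttft_ms"] / expected["median_ttft_ms"]
tpot_slowdown = actual["median_tpot_ms"] / expected["median_tpot_ms"]

print(f"total throughput: {tput_slowdown:.1f}x lower")
print(f"median TTFT:      {ttft_slowdown:.1f}x higher")
print(f"median TPOT:      {tpot_slowdown:.1f}x higher")
```

The decode path (TPOT, about 14.8x) regresses far more than prefill (TTFT, about 3.6x), which may help narrow down where the slowdown originates.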

Additional Observations (Garbled Output on Newer Images)
Furthermore, when testing the same configuration across several newer image versions, the model returns garbled text/nonsense outputs. Affected images include:

  • rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.4
  • rocm/atom-dev:nightly_202606201539
  • rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.4_20260612

Are there any missing environment variables, hardware pre-configurations (e.g., RCCL/NUMA tuning), or model-specific flags needed to reproduce the dashboard performance and resolve the text corruption issues?

Example request and the garbled response (note that this request targets /data/DeepSeek-v4-Flash):

time curl -X POST http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
    "model": "/data/DeepSeek-v4-Flash",
    "messages": [
        {"role": "user", "content": "你的名字叫什么"}
    ],
   "stream": false
}'
{"id":"chatcmpl-aaa78e6ac7af41728bc318852ff1c593","object":"chat.completion","created":1782018338,"model":"/data/DeepSeek-v4-Flash","choices":[{"index":0,"message":{"role":"assistant","content":" the names? thinking the names is, we我们本事了大部分人 的名字? We说是名字记录, verschiedenen records on, einen single. as??? wieYeah??Sun??andurned?sumified???um?isO???&maybe?? Forme-Siter and thinking-t-a?Form-in:`att?iat##F-one? (F-functionrunin-rerecord-en.- Essay- Scv-representative! The len! ofglobalche-Cxs? mul? (((mak-S-m-ent)ASess?))Nik?? (Nic?glob?? ^(i-?nebe&D_&&ie list-s-?w)???ThemeE??Process???(Ref-M?!! ), it-R? ??? &!.tre? and43x?Collections? icna-nameac-De-:--Tri()bow-,Anchor?‘order-Tnit-Aellungandsp-???’ for-atio) “example?ch??:-of????ph - though-p`s-b & {danz?HC==Jist]-A being-h-? (tat-iconco~‘to fund->=S possession’Is}to-Esurfoundoutpu?aqueSS??w?rol???&Tpertension???and?.eh?pi??ew?.l-gener~*eer-def-y-R\n & description ? -+N??S???*??the-TOPTENT,:xe-ofoptI;-?It-ck??+omker?\n-are-mat-D zer:-√#and?as-Le-thoffs????1?ing??.\n\nmetalAN???,-ge-la-f?md-bdg?ns???$-v?lis-tab___LT-h-V?m-I???whatlet???, ch????vs??Purpose????w&sh.now-n-bornxp-g?: ?PLid??the-doors??Is?,an??-?mFe?‘R?-??T?!?»read(thvmrmin??)ny?metal \n?&-PV:?SV???&LI,.sw?periodic-w.??E-anw-E??Estryfo??;spl?abp?-??nez-C-and????- ~-s¢-!?c????v???l???-?-???Stop??+??l-???iSu.??MWW?**(?edw.!by??try==LL?dd-in-L-brg?! point-O,[?&- erstle-mHL-)WaBut$)W-?!p?Shu???gu?-!?!!       ebii!:?part-buh??&& ?-(?~.??43?&-???raWa—or ____),b-but-. figure?dy?; look-ma�?act?? is&e?{rsMon9)\n   -ou?“p??ayusion-??LAGASS??!!I ?-&&?p*OurcedeNin??_)®one-source??→Some?Main????and->- -?-*-hWaThis ????Will-S$w~-(v? mostSi-P?).??? ?P\\); -she+><hndju??)?~h-bCh??-hol-s!? -> (ce,gt-fat?-fvir?AO-M?s?? ),??.?Tr-S?asi?-?BuckD?--?? br ?Y-y-lhJa??-menelling??only-h?as?Ro-VI? (????See-l)Sl-h???snb-Un?ce??e(?-se?{-^»acUVWiththough??nor: .acWLevel�),vu?R-d?d?Turn-/?UseJR× VI?????uses-andcolumn?—profile :??ht?pk?tunsyc?-tg-tac·??─------ SC;act__???gin‐? 
?sc-s??OR`IU-ol?:oft?👉o- sem-L―)?oper? �GET,?MP|?edile’?"},"finish_reason":"max_tokens"}],"usage":{"prompt_tokens":7,"completion_tokens":8192,"total_tokens":8199,"ttft_s":0.3518,"tpot_s":0.1127,"latency_s":923.3105},"kv_transfer_params":null}
real    15m23.320s
user    0m0.011s
sys     0m0.014s
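
The 15-minute wall time of the request above is consistent with the `usage` fields in the response: the model ran to the 8192-token cap at about 113 ms per token. A small sketch of that arithmetic, using only values copied from the response (the formula itself is a simple approximation, not something the server reports):

```python
# Values copied from the "usage" object in the response above.
completion_tokens = 8192
ttft_s = 0.3518
tpot_s = 0.1127
latency_s = 923.3105

# Approximate end-to-end latency: time to the first token, plus one
# TPOT interval for each remaining token.
estimated_latency_s = ttft_s + (completion_tokens - 1) * tpot_s

# Single-stream decode rate implied by the reported TPOT.
decode_rate = 1.0 / tpot_s

print(f"estimated latency: {estimated_latency_s:.1f} s (reported {latency_s:.1f} s)")
print(f"decode rate:       {decode_rate:.2f} tok/s")
```

Because the response ended with finish_reason "max_tokens", the garbled generation never produced a stop token. If the server honors the OpenAI-style `max_tokens` request field (an assumption, not verified here), setting a small value should make the corruption reproducible in seconds instead of minutes.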

Operating System

Ubuntu 24.04 LTS

CPU

AMD EPYC 9575F

GPU

AMD Instinct MI355X

ROCm Version

7.2.4

