Problem Description
Issue Description
I tried to reproduce the benchmark results for backend=ATOM, model=DeepSeek-V4-Pro, hardware=mi355x, using the parameters and docker images listed in the ROCm ATOM Benchmark Dashboard.
However, performance in my environment is far below the dashboard figures: total token throughput is roughly 13.5x lower (138.07 vs. 1868.5 tok/s). In addition, newer image versions produce garbled, corrupted text output.
Environment & Reproduce Steps
- Server Start Command
docker run -itd \
--name deepseek-v4-pro \
--device amd.com/gpu=all \
--device /dev/kfd \
--device /dev/dri \
--group-add=video \
--cap-add=SYS_PTRACE \
--privileged \
--security-opt seccomp=unconfined \
--ipc=host \
--network=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--entrypoint=/bin/bash \
-v /data/:/data/ \
-e ATOM_DISABLE_MMAP=true \
-e AITER_BF16_FP8_MOE_BOUND=0 \
-e ATOM_MOE_GU_ITLV=1 \
rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.3 \
-c "python -m atom.entrypoints.openai_server \
--model /data/DeepSeek-V4-Pro \
--kv_cache_dtype fp8 \
-tp 8 \
&>> /data/atom-log/vllm_deepseek_v4_pro.log 2>&1"
- Benchmark Command
python -m atom.benchmarks.benchmark_serving \
--model=/data/DeepSeek-V4-Pro \
--backend=vllm \
--base-url=http://localhost:8000 \
--dataset-name=random \
--random-input-len=8192 \
--random-output-len=1024 \
--random-range-ratio=0.8 \
--max-concurrency=4 \
--num-prompts=40 \
--trust-remote-code \
--num-warmups=8 \
--request-rate=inf \
--ignore-eos \
--save-result \
--percentile-metrics=ttft,tpot,itl,e2el \
--result-dir=. \
--result-filename=benchmark_result.json
Actual Results (atom0.1.3)
============ Serving Benchmark Result ============
Successful requests: 40
Benchmark duration (s): 1337.81
Total input tokens: 163098
Total generated tokens: 21612
Request throughput (req/s): 0.03
Output token throughput (tok/s): 16.15
Total Token throughput (tok/s): 138.07
---------------Time to First Token----------------
Mean TTFT (ms): 3683.35
Median TTFT (ms): 3114.96
P99 TTFT (ms): 10459.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 232.95
Median TPOT (ms): 231.82
P99 TPOT (ms): 255.16
---------------Inter-token Latency----------------
Mean ITL (ms): 233.59
Median ITL (ms): 221.04
P99 ITL (ms): 229.28
----------------End-to-end Latency----------------
Mean E2EL (ms): 129894.08
Median E2EL (ms): 123802.28
P99 E2EL (ms): 229624.88
==================================================
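As a quick consistency check on the numbers above (a sketch, not part of the benchmark tool), the reported throughput can be re-derived from the token counts and the duration, and compared with a rough decode-bound estimate of `max_concurrency / median TPOT`. The estimate ignores prefill time, which is an assumption, so treat it only as an approximate upper bound.

```python
# Values copied from the benchmark output above.
duration_s = 1337.81
input_tokens = 163098
generated_tokens = 21612
median_tpot_ms = 231.82
max_concurrency = 4  # from --max-concurrency=4

output_tput = generated_tokens / duration_s
total_tput = (input_tokens + generated_tokens) / duration_s

# Rough decode-bound estimate: each concurrent request emits one token
# per TPOT interval. Prefill time is ignored (assumption).
decode_bound_tput = max_concurrency * 1000.0 / median_tpot_ms

print(f"output throughput:     {output_tput:.2f} tok/s")
print(f"total throughput:      {total_tput:.2f} tok/s")
print(f"decode-bound estimate: {decode_bound_tput:.2f} tok/s")
```

The measured output throughput (16.15 tok/s) sits close to the decode-bound estimate of about 17 tok/s, which suggests the slowdown is dominated by per-token decode latency rather than by queueing or prefill.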
Metrics Comparison
| Metric | Total Throughput (tok/s) | Median TTFT (ms) | Median TPOT (ms) |
| --- | --- | --- | --- |
| Expected (Dashboard) | 1868.5 | 869.8 | 15.68 |
| Actual | 138.07 | 3114.96 | 231.82 |
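The size of the gap for each metric can be worked out directly from the expected and actual values in the comparison. The helper name `gap` below is illustrative only:

```python
# (expected, actual) pairs copied from the metrics comparison.
metrics = {
    "Total throughput (tok/s)": (1868.5, 138.07),
    "Median TTFT (ms)": (869.8, 3114.96),
    "Median TPOT (ms)": (15.68, 231.82),
}


def gap(name, expected, actual):
    """Return how many times worse the actual value is than expected."""
    # Throughput: higher is better. Latencies: lower is better.
    if "throughput" in name.lower():
        return expected / actual
    return actual / expected


gaps = {name: gap(name, exp, act) for name, (exp, act) in metrics.items()}
for name, factor in gaps.items():
    print(f"{name}: {factor:.1f}x worse than dashboard")
```

Throughput and TPOT are off by similar factors (about 13.5x and 14.8x), while TTFT is off by a much smaller factor (about 3.6x), so the decode path appears to account for most of the difference.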
Additional Observations (Garbled Output on Newer Images)
With the same configuration on several newer images, the model returns garbled, nonsensical output. Affected images:
- rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.4
- rocm/atom-dev:nightly_202606201539
- rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.4_20260612
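Are there any missing environment variables, hardware pre-configuration steps (e.g., RCCL/NUMA tuning), or model-specific flags needed to reproduce the dashboard performance and fix the garbled output?
Example request and response showing the garbled output (note that this request was sent to /data/DeepSeek-v4-Flash; generation ran until the 8192-token limit and took about 15 minutes):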
time curl -X POST http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "/data/DeepSeek-v4-Flash",
"messages": [
{"role": "user", "content": "你的名字叫什么"}
],
"stream": false
}'
{"id":"chatcmpl-aaa78e6ac7af41728bc318852ff1c593","object":"chat.completion","created":1782018338,"model":"/data/DeepSeek-v4-Flash","choices":[{"index":0,"message":{"role":"assistant","content":" the names? thinking the names is, we我们本事了大部分人 的名字? We说是名字记录, verschiedenen records on, einen single. as??? wieYeah??Sun??andurned?sumified???um?isO???&maybe?? Forme-Siter and thinking-t-a?Form-in:`att?iat##F-one? (F-functionrunin-rerecord-en.- Essay- Scv-representative! The len! ofglobalche-Cxs? mul? (((mak-S-m-ent)ASess?))Nik?? (Nic?glob?? ^(i-?nebe&D_&&ie list-s-?w)???ThemeE??Process???(Ref-M?!! ), it-R? ??? &!.tre? and43x?Collections? icna-nameac-De-:--Tri()bow-,Anchor?‘order-Tnit-Aellungandsp-???’ for-atio) “example?ch??:-of????ph - though-p`s-b & {danz?HC==Jist]-A being-h-? (tat-iconco~‘to fund->=S possession’Is}to-Esurfoundoutpu?aqueSS??w?rol???&Tpertension???and?.eh?pi??ew?.l-gener~*eer-def-y-R\n & description ? -+N??S???*??the-TOPTENT,:xe-ofoptI;-?It-ck??+omker?\n-are-mat-D zer:-√#and?as-Le-thoffs????1?ing??.\n\nmetalAN???,-ge-la-f?md-bdg?ns???$-v?lis-tab___LT-h-V?m-I???whatlet???, ch????vs??Purpose????w&sh.now-n-bornxp-g?: ?PLid??the-doors??Is?,an??-?mFe?‘R?-??T?!?»read(thvmrmin??)ny?metal \n?&-PV:?SV???&LI,.sw?periodic-w.??E-anw-E??Estryfo??;spl?abp?-??nez-C-and????- ~-s¢-!?c????v???l???-?-???Stop??+??l-???iSu.??MWW?**(?edw.!by??try==LL?dd-in-L-brg?! point-O,[?&- erstle-mHL-)WaBut$)W-?!p?Shu???gu?-!?!! ebii!:?part-buh??&& ?-(?~.??43?&-???raWa—or ____),b-but-. figure?dy?; look-ma�?act?? is&e?{rsMon9)\n -ou?“p??ayusion-??LAGASS??!!I ?-&&?p*OurcedeNin??_)®one-source??→Some?Main????and->- -?-*-hWaThis ????Will-S$w~-(v? mostSi-P?).??? ?P\\); -she+><hndju??)?~h-bCh??-hol-s!? -> (ce,gt-fat?-fvir?AO-M?s?? ),??.?Tr-S?asi?-?BuckD?--?? br ?Y-y-lhJa??-menelling??only-h?as?Ro-VI? (????See-l)Sl-h???snb-Un?ce??e(?-se?{-^»acUVWiththough??nor: .acWLevel�),vu?R-d?d?Turn-/?UseJR× VI?????uses-andcolumn?—profile :??ht?pk?tunsyc?-tg-tac·??─------ SC;act__???gin‐? 
?sc-s??OR`IU-ol?:oft?👉o- sem-L―)?oper? �GET,?MP|?edile’?"},"finish_reason":"max_tokens"}],"usage":{"prompt_tokens":7,"completion_tokens":8192,"total_tokens":8199,"ttft_s":0.3518,"tpot_s":0.1127,"latency_s":923.3105},"kv_transfer_params":null}
real 15m23.320s
user 0m0.011s
sys 0m0.014s
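Because the request above ran until the 8192-token limit, each reproduction attempt takes over 15 minutes. A shorter repro could cap the output length and use greedy decoding. The sketch below assumes the server honors the standard OpenAI-style `max_tokens` and `temperature` fields, which I have not verified for ATOM; `build_payload` and `send` are illustrative names, not part of ATOM.

```python
import json
import urllib.request


def build_payload(model, prompt, max_tokens=64):
    """Build a chat-completions request body with a small output cap."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # Standard OpenAI-style fields; assumed to be honored by the server.
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def send(base_url, payload, timeout_s=120):
    """POST the payload and return the assistant message text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# payload = build_payload("/data/DeepSeek-v4-Flash", "你的名字叫什么")
# print(send("http://127.0.0.1:8000", payload))
```

If the output is still garbled with a short cap and temperature 0, sampling settings are unlikely to be the cause.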
Operating System
Ubuntu 24.04 LTS
CPU
AMD EPYC 9575F
GPU
AMD Instinct MI355X
ROCm Version
7.2.4
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
rocminfo --support output
Additional Information
No response