[Perf Regression] 26 config(s) regressed @ e79fe6f5 #1251

Description

@github-actions

Performance Regression Detected

Commit: e79fe6f5
Run: https://github.com/ROCm/ATOM/actions/runs/27639353976
Date: 2026-06-17T07:25:14.061422+00:00

Regressed Configurations

| Model | ISL/OSL | Conc | Tput (cur) | Tput (base) | Tput Δ% | TPOT (cur) | TPOT (base) | TPOT Δ% |
|-------|---------|------|------------|-------------|---------|------------|-------------|---------|
| DeepSeek-R1-0528 | 1024/1024 | 4 | 260.0 | 337.8 | -23.0% | 14.96 | 11.40 | 31.2% |
| DeepSeek-R1-0528 | 1024/1024 | 256 | 6316.2 | 6342.6 | -0.4% | 38.90 | 38.80 | 0.3% |
| DeepSeek-R1-0528 | 8192/1024 | 8 | 535.8 | 585.6 | -8.5% | 13.96 | 13.02 | 7.2% |
| DeepSeek-R1-0528 MTP3 | 8192/1024 | 4 | 457.8 | 585.0 | -21.7% | 8.06 | 6.32 | 27.5% |
| DeepSeek-R1-0528 MTP3 | 8192/1024 | 16 | 1218.5 | 1284.4 | -5.1% | 12.29 | 11.62 | 5.8% |
| DeepSeek-R1-0528-MXFP4 | 1024/1024 | 32 | 1791.7 | 1770.2 | 1.2% | 17.23 | 17.49 | -1.5% |
| DeepSeek-R1-0528-MXFP4 | 1024/1024 | 128 | 4078.9 | 4042.4 | 0.9% | 30.12 | 30.50 | -1.3% |
| DeepSeek-V4-Pro DPA | 1024/1024 | 64 | 1952.3 | 1924.9 | 1.4% | 30.35 | 31.15 | -2.6% |
| DeepSeek-V4-Pro DPA | 1024/1024 | 1024 | 11607.8 | 12515.8 | -7.3% | 83.37 | 68.98 | 20.9% |
| DeepSeek-V4-Pro DPA | 8192/1024 | 1024 | 5016.6 | 4499.5 | 11.5% | 129.46 | 210.23 | -38.4% |
| DeepSeek-V4-Pro DPA TBO | 8192/1024 | 1024 | 5377.1 | 4538.3 | 18.5% | 131.64 | 211.77 | -37.8% |
| DeepSeek-V4-Pro MTP3 | 8192/1024 | 4 | 437.0 | 435.1 | 0.5% | 8.18 | 8.22 | -0.5% |
| GLM-5.1-MXFP4 | 1024/1024 | 64 | 1854.6 | 1858.3 | -0.2% | 33.08 | 33.13 | -0.2% |
| GLM-5.1-MXFP4 | 1024/1024 | 256 | 4123.8 | 4126.4 | -0.1% | 59.62 | 59.75 | -0.2% |
| Kimi-K2.5-MXFP4 | 1024/1024 | 32 | 1698.8 | 1708.0 | -0.5% | 18.19 | 18.15 | 0.3% |
| Kimi-K2.5-MXFP4 | 8192/1024 | 256 | 2198.1 | 2186.7 | 0.5% | 109.50 | 111.40 | -1.7% |
| Llama-3.3-70B-Instruct-MXFP4 | 1024/1024 | 8 | 527.0 | 529.9 | -0.5% | 14.69 | 14.63 | 0.5% |
| Qwen3.5-397B-A17B-FP8 | 1024/1024 | 16 | 1216.3 | 1235.0 | -1.5% | 12.74 | 12.56 | 1.4% |
| Qwen3.5-397B-A17B-FP8 MTP3 | 8192/1024 | 8 | 1255.8 | 1242.5 | 1.1% | 5.77 | 5.84 | -1.0% |
| Qwen3.5-397B-A17B-MXFP4 | 1024/1024 | 16 | 1301.0 | 1308.6 | -0.6% | 11.84 | 11.82 | 0.2% |
| Qwen3.5-397B-A17B-MXFP4 | 1024/1024 | 32 | 2035.9 | 2078.4 | -2.0% | 15.12 | 14.84 | 1.9% |
| gpt-oss-120b | 1024/1024 | 16 | 2764.1 | 2811.9 | -1.7% | 5.57 | 5.51 | 0.9% |
| gpt-oss-120b | 1024/1024 | 32 | 4278.9 | 4328.0 | -1.1% | 7.18 | 7.10 | 1.0% |
| gpt-oss-120b | 1024/1024 | 256 | 12047.0 | 12777.3 | -5.7% | 20.27 | 18.17 | 11.6% |
| gpt-oss-120b | 8192/1024 | 64 | 4735.0 | 4721.0 | 0.3% | 12.80 | 12.90 | -0.8% |
| gpt-oss-120b | 8192/1024 | 128 | 5881.9 | 5714.5 | 2.9% | 20.46 | 21.36 | -4.2% |
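
The Δ% columns above appear to be the relative change of the current value against the baseline, `(cur - base) / base`. The sketch below recomputes them for three rows copied from the table and applies a cutoff to separate large swings from noise. The issue does not state what criterion the workflow uses to flag a configuration, so `THRESHOLD_PCT` and the `is_regression` rule are assumptions made for illustration only.

```python
def pct_change(cur: float, base: float) -> float:
    """Relative change of cur versus base, in percent, rounded to one decimal."""
    return round((cur - base) / base * 100.0, 1)


# (model, ISL/OSL, concurrency, tput_cur, tput_base, tpot_cur, tpot_base)
# Values copied from three rows of the regression table.
ROWS = [
    ("DeepSeek-R1-0528", "1024/1024", 4, 260.0, 337.8, 14.96, 11.40),
    ("DeepSeek-V4-Pro DPA", "8192/1024", 1024, 5016.6, 4499.5, 129.46, 210.23),
    ("gpt-oss-120b", "1024/1024", 256, 12047.0, 12777.3, 20.27, 18.17),
]

# Hypothetical cutoff: NOT taken from the workflow, chosen only for this sketch.
THRESHOLD_PCT = 5.0


def is_regression(row, threshold: float = THRESHOLD_PCT) -> bool:
    """Assumed rule: throughput dropped, or TPOT rose, by at least `threshold` percent."""
    _, _, _, tput_cur, tput_base, tpot_cur, tpot_base = row
    tput_delta = pct_change(tput_cur, tput_base)
    tpot_delta = pct_change(tpot_cur, tpot_base)
    return tput_delta <= -threshold or tpot_delta >= threshold


for row in ROWS:
    model, shape, conc = row[:3]
    print(
        f"{model} {shape} conc={conc}: "
        f"tput {pct_change(row[3], row[4]):+.1f}%, "
        f"tpot {pct_change(row[5], row[6]):+.1f}%, "
        f"regression={is_regression(row)}"
    )
```

Under this assumed rule the first and third rows count as regressions, while the DeepSeek-V4-Pro DPA 8192/1024 row is an improvement on both metrics.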

Performance Summary

# Trace Performance Summary

**File:** `DeepSeek-R1-0528_ts_20260617_073624_594.pt.trace.json.gz`

## Prefill

| # | Label | Duration |
|---|-------|----------|
| 0 | `prefill[bs=1 tok=866 ctx=866]` | 95.72 ms |
| 1 | `prefill[bs=3 tok=1962 ctx=[991, 1011, 936]]` | 89.99 ms |
| 2 | `prefill[bs=1 tok=886 ctx=886]` | 90.09 ms |
| 3 | `prefill[bs=1 tok=1014 ctx=1014]` | 83.89 ms |
| 4 | `prefill[bs=1 tok=922 ctx=922]` | 86.88 ms |
| 5 | `prefill[bs=1 tok=828 ctx=828]` | 86.26 ms |

**Total prefill:** 532.83 ms

## Decode

- **Iterations:** 1919
- **Mean:** 1.08 ms
- **Min:** 905.3 us
- **Max:** 3.61 ms
- **Total:** 2077.46 ms
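
As a quick sanity check, the summary figures can be re-derived from the per-phase numbers: the prefill total should equal the sum of the six prefill durations, and the decode mean should equal the decode total divided by the iteration count. This is a minimal sketch using only values quoted in this summary; the variable names are illustrative.

```python
# Prefill durations (ms) for rows 0-5 of the Prefill table.
prefill_ms = [95.72, 89.99, 90.09, 83.89, 86.88, 86.26]

# Decode figures from the Decode summary.
decode_iterations = 1919
decode_total_ms = 2077.46

total_prefill_ms = round(sum(prefill_ms), 2)
mean_decode_ms = round(decode_total_ms / decode_iterations, 2)

print(f"total prefill: {total_prefill_ms} ms")  # matches the reported 532.83 ms
print(f"mean decode:   {mean_decode_ms} ms")    # matches the reported 1.08 ms
```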

Profiler Traces

Download the traces from the workflow artifacts.
Open them in Perfetto UI or in Chrome at chrome://tracing for analysis.

Next Steps

  1. Download the profiler-analysis-27639353976 artifact
  2. Open trace files in Perfetto UI
  3. Compare kernel durations against previous traces
  4. Identify bottleneck changes
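
Steps 3 and 4 can also be scripted instead of done by eye. The sketch below assumes the `.pt.trace.json.gz` files are gzipped Chrome trace-event JSON and that GPU kernels appear as complete events (`"ph": "X"`) with `"cat": "kernel"` and a `dur` in microseconds, as PyTorch profiler exports typically do; verify this against an actual trace before relying on it. The function names and the baseline file path are hypothetical.

```python
import gzip
import json
from collections import defaultdict


def kernel_totals(trace_path):
    """Sum durations (microseconds) of kernel events in a gzipped trace, keyed by name."""
    with gzip.open(trace_path, "rt", encoding="utf-8") as fh:
        trace = json.load(fh)
    # Chrome trace JSON is either an object with "traceEvents" or a bare event list.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for ev in events:
        # Assumption: kernels are complete events tagged cat == "kernel".
        if ev.get("ph") == "X" and ev.get("cat") == "kernel":
            totals[ev["name"]] += ev.get("dur", 0.0)
    return dict(totals)


def largest_increases(cur, base, top=10):
    """Return (name, base_us, cur_us, delta_us) tuples, largest slowdown first."""
    names = set(cur) | set(base)
    rows = [
        (n, base.get(n, 0.0), cur.get(n, 0.0), cur.get(n, 0.0) - base.get(n, 0.0))
        for n in names
    ]
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows[:top]


# Example usage (the baseline path is a placeholder, not a real artifact name):
# cur = kernel_totals("DeepSeek-R1-0528_ts_20260617_073624_594.pt.trace.json.gz")
# base = kernel_totals("baseline.pt.trace.json.gz")
# for name, base_us, cur_us, delta_us in largest_increases(cur, base):
#     print(f"{delta_us:+10.1f} us  {name}")
```

Summing per kernel name hides per-call variation, so a kernel that is simply launched more often will also rank high; comparing call counts alongside totals would separate the two cases.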
