TeamRedBench is a config-driven benchmark repo for AMD RDNA and CDNA GPUs on ROCm. It focuses on three benchmark families out of the box:
- HBM bandwidth
- Intra-node and inter-node communication bandwidth
- MFU-style compute utilization from GEMM throughput
The repo is built to stay adaptable when hardware, ROCm, or metrics change:
- Hardware profiles live in YAML instead of code.
- Runtime profiles capture ROCm/library assumptions separately from hardware.
- Benchmarks and metrics are both registry-driven, so new modules can be added without editing the runner.
- Native HIP/C++ kernels can be compiled on demand for HBM and MFU paths, or swapped for your own executable.
- Profiling engines can wrap a suite run and attach external artifacts such as rocprof traces.
- Dtype support is discovered dynamically from the installed torch build, including optional float8 types when present.
.
├── configs/
│ ├── profiles/
│ │ ├── hardware/
│ │ └── rocm/
│ └── suites/
├── docs/
├── examples/
├── src/teamredbench/
│ ├── benchmarks/
│ └── metrics/
└── tests/
- Install the package and runtime dependencies:

      python -m venv .venv
      source .venv/bin/activate
      pip install -e ".[dev,torch]"

- Inspect the detected ROCm/PyTorch environment:

      teamredbench discover

- Copy or edit a hardware profile under configs/profiles/hardware/ and fill in the peak numbers for the target GPU. Concrete published profiles are included for AMD Instinct MI300X, MI325X, MI350X, and MI355X. If a suite still points at a generic profile, the runner will auto-select a matching published profile when it recognizes the local GPU SKU.
- Run a suite:

      teamredbench run configs/suites/smoke.yaml

Results are written to results/ in JSON and CSV, plus a *.metadata.json sidecar with the run command, config contents, environment snapshot, git state, and software/runtime versions needed to reproduce the run.
When the hardware profile defines peak bandwidth or compute values, the live run output also prints the percentage of theoretical peak. If a peak is not configured, the percentage is shown as n/a.
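The percent-of-peak behavior described above can be sketched in a few lines of Python. The function name `format_peak_pct` and the one-decimal formatting are assumptions made for illustration, not the repo's actual implementation.

```python
from typing import Optional


def format_peak_pct(achieved: float, peak: Optional[float]) -> str:
    """Format achieved/peak as a percentage, or "n/a" when no usable peak is configured."""
    # Hypothetical helper: a missing or non-positive peak means nothing to compare against.
    if peak is None or peak <= 0:
        return "n/a"
    return f"{100.0 * achieved / peak:.1f}%"


# Illustrative numbers only, not measurements from any GPU.
print(format_peak_pct(50.0, 200.0))
print(format_peak_pct(50.0, None))
```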
The hbm benchmark uses large tensor kernels to stress device memory traffic. It supports:
- copy
- scale
- triad
The default backend is torch. Set params.backend: native to run the built-in HIP kernel instead.
Built-in native kernel name: hbm_hip.
Each result reports raw counters and derived metrics such as:
- hbm_bandwidth_gbps
- hbm_efficiency_pct
- latency_us
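As a rough illustration of how the derived HBM metrics relate to raw counters, the sketch below assumes STREAM-style byte accounting (copy and scale touch two arrays per element, triad touches three). The repo may count bytes differently, and the function signatures here are hypothetical.

```python
# Arrays read or written per element, following the STREAM convention.
# This accounting is an assumption, not taken from the repo.
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "triad": 3}


def hbm_bandwidth_gbps(mode: str, num_elements: int, bytes_per_element: int, elapsed_s: float) -> float:
    """Return achieved bandwidth in GB/s (1 GB = 1e9 bytes) for one kernel run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    bytes_moved = ARRAYS_TOUCHED[mode] * num_elements * bytes_per_element
    return bytes_moved / elapsed_s / 1e9


def hbm_efficiency_pct(bandwidth_gbps: float, peak_gbps: float) -> float:
    """Return achieved bandwidth as a percentage of the configured peak."""
    return 100.0 * bandwidth_gbps / peak_gbps


# Illustrative inputs: one billion float32 elements processed in half a second.
copy_bw = hbm_bandwidth_gbps("copy", 1_000_000_000, 4, 0.5)
print(copy_bw, hbm_efficiency_pct(copy_bw, 64.0))
```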
The collective benchmark uses torch.distributed with RCCL. It supports:
- all_reduce
- all_gather
- broadcast
The benchmark classifies the run as intra-node or inter-node by gathering hostnames after process-group init. It reports:
- payload_bandwidth_gbps
- bus_bandwidth_gbps
- link_efficiency_pct
- latency_us
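The sketch below illustrates the two ideas in this section: classifying a run from the gathered hostnames, and deriving bus bandwidth from payload bandwidth. The correction factors follow the convention used by the nccl-tests/rccl-tests tools. Whether TeamRedBench uses the same factors is an assumption, and both function names are hypothetical.

```python
from typing import Iterable


def classify_scope(hostnames: Iterable[str]) -> str:
    """Return "intra-node" when every rank reports the same hostname, else "inter-node"."""
    # In a real run the hostnames could come from socket.gethostname() on each rank,
    # collected with torch.distributed.all_gather_object after process-group init.
    unique_hosts = set(hostnames)
    if not unique_hosts:
        raise ValueError("at least one hostname is required")
    return "intra-node" if len(unique_hosts) == 1 else "inter-node"


def bus_bandwidth_gbps(op: str, payload_gbps: float, world_size: int) -> float:
    """Scale payload bandwidth by the per-collective factor used by nccl-tests."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    n = world_size
    factors = {
        "all_reduce": 2 * (n - 1) / n,
        "all_gather": (n - 1) / n,
        "broadcast": 1.0,
    }
    return payload_gbps * factors[op]


print(classify_scope(["node-a"] * 8))
print(bus_bandwidth_gbps("all_reduce", 100.0, 8))
```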
Launch collectives with torchrun, srun, or another distributed launcher that sets the standard environment variables.
Use backend: rccl in suite params when you need to override the default.
The mfu benchmark runs GEMM sweeps and compares achieved throughput against per-dtype theoretical peaks from the selected hardware profile. It reports:
- achieved_tops
- mfu_pct
- latency_us
The default backend is torch. Set params.backend: native to run the built-in hipBLAS-backed GEMM kernel instead.
Built-in native kernel name: mfu_hipblas.
For integer and complex dtypes, the repo uses dtype-specific operation-count factors so MFU remains tied to the configured theoretical peak.
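A minimal sketch of the MFU arithmetic, assuming the usual 2·M·N·K operation count for a real GEMM and 8 real operations per complex multiply-accumulate. The actual per-dtype factors used by the repo are not stated here, so the `OPS_PER_MAC` table and both function names are hypothetical.

```python
# Real-valued operations per multiply-accumulate.
# The complex factor (4 multiplies + 4 adds) is an assumption for illustration.
OPS_PER_MAC = {"real": 2, "complex": 8}


def achieved_tops(m: int, n: int, k: int, elapsed_s: float, kind: str = "real") -> float:
    """Return achieved throughput in tera-operations per second for one GEMM."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    ops = OPS_PER_MAC[kind] * m * n * k
    return ops / elapsed_s / 1e12


def mfu_pct(achieved: float, peak_tops: float) -> float:
    """Return achieved throughput as a percentage of the per-dtype theoretical peak."""
    return 100.0 * achieved / peak_tops


# Illustrative inputs only: a 4096^3 GEMM timed at one millisecond.
tops = achieved_tops(4096, 4096, 4096, 1e-3)
print(tops, mfu_pct(tops, 1000.0))
```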
teamredbench list-dtypes shows every dtype the local torch build exposes. The repo tries to cover:
- bool
- integer types
- float16, bfloat16, float32, float64
- complex64, complex128
- float8 variants when the installed torch exposes them
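Dynamic dtype discovery can be sketched as an attribute probe against the torch module. The candidate list and the `discover_dtypes` helper below are assumptions, and the example uses a stand-in object so it runs without torch installed; passing the real `torch` module works the same way.

```python
from types import SimpleNamespace

# Candidate attribute names to probe; float8 names exist only in newer torch builds.
CANDIDATE_DTYPES = [
    "bool", "int8", "int16", "int32", "int64", "uint8",
    "float16", "bfloat16", "float32", "float64",
    "complex64", "complex128",
    "float8_e4m3fn", "float8_e5m2", "float8_e4m3fnuz", "float8_e5m2fnuz",
]


def discover_dtypes(module, candidates=CANDIDATE_DTYPES):
    """Return the candidate dtype names that the given module actually exposes."""
    return [name for name in candidates if hasattr(module, name)]


# Stand-in for an older torch build that lacks float8 support.
fake_torch = SimpleNamespace(float16=object(), float32=object(), bool=object())
print(discover_dtypes(fake_torch))
```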
Some dtype and benchmark combinations are not valid on every ROCm stack. Those cases are recorded as skipped with an error message instead of aborting the whole suite.
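One way to record an invalid combination as skipped without aborting the suite is to catch the failure per case. This is a sketch under assumed names (`run_cases`, dictionary-shaped records), not the repo's actual runner.

```python
def run_cases(cases, run_one):
    """Run each case; record failures as skipped with the error message instead of raising."""
    records = []
    for case in cases:
        try:
            records.append({"case": case, "status": "ok", "value": run_one(case)})
        except Exception as exc:  # one unsupported combination must not stop the suite
            records.append(
                {"case": case, "status": "skipped", "error": f"{type(exc).__name__}: {exc}"}
            )
    return records


def fake_benchmark(dtype):
    """Hypothetical stand-in for a benchmark that rejects one dtype."""
    if dtype == "float8_e4m3fn":
        raise RuntimeError("dtype not supported on this stack")
    return 1.0


results = run_cases(["float16", "float8_e4m3fn", "float32"], fake_benchmark)
print([r["status"] for r in results])
```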
Suites can optionally run under a profiling engine. Built-in support includes:
- rocprof
Example:

    profiling:
      enabled: true
      engine: rocprof
      params:
        stats: true
        hip_trace: true

When profiling is enabled, TeamRedBench launches an internal child run under the selected profiler, then attaches the profiling artifact directory to the normal metadata output. The profile artifact path is also added as profiling in the output map inside the *.metadata.json file.
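To make the wrapping step concrete, the sketch below builds a profiler command line around a child run. It assumes the `stats` and `hip_trace` params map to rocprof's `--stats` and `--hip-trace` flags and that output goes through `-o`. The helper name and this mapping are illustrative assumptions, and the command is only constructed, not executed.

```python
def build_rocprof_command(child_argv, params, output_csv):
    """Return an argv list that runs child_argv under rocprof (assumed flag mapping)."""
    cmd = ["rocprof"]
    if params.get("stats"):
        cmd.append("--stats")
    if params.get("hip_trace"):
        cmd.append("--hip-trace")
    cmd += ["-o", output_csv]
    cmd += list(child_argv)
    return cmd


# The child command and output path are illustrative.
child = ["teamredbench", "run", "configs/suites/smoke.yaml"]
print(build_rocprof_command(child, {"stats": True, "hip_trace": True}, "results/profile.csv"))
```

A real launcher would hand this list to something like `subprocess.run` and then record the artifact location in the metadata sidecar.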
List registered profiling engines:

    teamredbench list-profile-engines

Hardware-specific numbers are isolated in YAML:
- peak HBM bandwidth
- peak communication link bandwidths
- peak per-dtype compute throughput
Published SKU profiles are provided for:
- configs/profiles/hardware/amd_instinct_mi300x.yaml
- configs/profiles/hardware/amd_instinct_mi325x.yaml
- configs/profiles/hardware/amd_instinct_mi350x.yaml
- configs/profiles/hardware/amd_instinct_mi355x.yaml
To bring up a new accelerator:
- Run teamredbench discover.
- Copy the closest profile from configs/profiles/hardware/.
- Fill in the target GPU's peak numbers.
- Point the suite at the new profile.
Nothing in the benchmark runner is hard-coded to MI2xx, MI3xx, or RDNA SKUs.
For inter-node communication, the network peak remains system-specific because it depends on the installed NIC and fabric rather than the GPU alone.
ROCm assumptions live under configs/profiles/rocm/. Keep runtime-specific items there:
- expected ROCm version
- library versions or notes
- RCCL-specific environment overrides when the target stack needs them
This keeps runtime drift separate from hardware drift.
Benchmarks and metrics register themselves at import time.
New benchmark:
- Add a module under src/teamredbench/benchmarks/.
- Decorate the class with @register_benchmark("name").
- Return BenchmarkRecord objects with raw counters.
New metric:
- Add a function under src/teamredbench/metrics/.
- Decorate it with @register_metric("metric_name", "...").
- Compute from the benchmark's raw counters.
External modules can also be loaded via the suite plugins: field.
More detail is in docs/extending.md.
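For readers unfamiliar with import-time registration, the pattern can be sketched as a pair of decorators that store classes and functions in module-level dictionaries. The decorator names match the ones above, but their internals, the toy benchmark, and the use of plain dictionaries in place of BenchmarkRecord are assumptions for illustration only.

```python
BENCHMARKS = {}
METRICS = {}


def register_benchmark(name):
    """Class decorator that stores the benchmark class under the given name."""
    def decorator(cls):
        if name in BENCHMARKS:
            raise ValueError(f"benchmark {name!r} is already registered")
        BENCHMARKS[name] = cls
        return cls
    return decorator


def register_metric(name, description):
    """Function decorator that stores a metric together with its description."""
    def decorator(func):
        METRICS[name] = (description, func)
        return func
    return decorator


@register_benchmark("toy_copy")
class ToyCopy:
    def run(self):
        # Plain dicts stand in for BenchmarkRecord objects carrying raw counters.
        return [{"bytes_moved": 8_000_000_000, "elapsed_s": 0.5}]


@register_metric("toy_bandwidth_gbps", "Bytes moved per second, in GB/s")
def toy_bandwidth_gbps(record):
    return record["bytes_moved"] / record["elapsed_s"] / 1e9


# A runner can look everything up by name without importing concrete classes.
record = BENCHMARKS["toy_copy"]().run()[0]
print(METRICS["toy_bandwidth_gbps"][1](record))
```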
Native kernels are optional. TeamRedBench will compile them with hipcc when you select backend: native.
The compiler can come from:
- params.native.compiler
- TEAMREDBENCH_HIPCC
- hipcc in PATH
- /opt/rocm/bin/hipcc
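The lookup over these sources might be sketched as follows, assuming they are tried in the order listed and the first one that is set wins. The function name and the injectable `env` and `which` parameters are hypothetical conveniences for testing.

```python
import os
import shutil


def resolve_hipcc(params_compiler=None, env=None, which=shutil.which):
    """Return the first configured compiler path, trying the sources in listed order."""
    env = os.environ if env is None else env
    candidates = [
        params_compiler,                # params.native.compiler from the suite
        env.get("TEAMREDBENCH_HIPCC"),  # environment override
        which("hipcc"),                 # hipcc found in PATH
        "/opt/rocm/bin/hipcc",          # conventional ROCm install location
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    raise RuntimeError("no hipcc candidate available")


print(resolve_hipcc(env={"TEAMREDBENCH_HIPCC": "/custom/hipcc"}, which=lambda name: None))
```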
You can also point directly at a prebuilt executable with params.native.binary.
Example:

    benchmarks:
      - benchmark: hbm
        params:
          backend: native
          dtypes: [float32]
          modes: [copy, scale, triad]
          size_mib: 4096
          native:
            kernel: hbm_hip
      - benchmark: mfu
        params:
          backend: native
          dtypes: [float16, bfloat16, float32, float64]
          shapes:
            - [4096, 4096, 4096]
          native:
            kernel: mfu_hipblas

teamredbench list-native-kernels shows the registered native kernels.
See configs/suites/rocprof_hbm_smoke.yaml for a minimal HBM run wrapped by rocprof.
List built-ins:

    teamredbench list-benchmarks
    teamredbench list-metrics
    teamredbench list-native-kernels
    teamredbench list-profile-engines
    teamredbench list-dtypes

Run the example rocprof suite:

    teamredbench run configs/suites/rocprof_hbm_smoke.yaml

Run the full suite:

    teamredbench run configs/suites/full.yaml

Multi-node collective example:

    sbatch examples/slurm/multi_node_collective.sh