TeamRedBench

TeamRedBench is a config-driven benchmark repo for AMD RDNA and CDNA GPUs on ROCm. It focuses on three benchmark families out of the box:

HBM bandwidth
Intra-node and inter-node communication bandwidth
MFU-style compute utilization from GEMM throughput

The repo is built to stay adaptable when hardware, ROCm, or metrics change:

Hardware profiles live in YAML instead of code.
Runtime profiles capture ROCm/library assumptions separately from hardware.
Benchmarks and metrics are both registry-driven, so new modules can be added without editing the runner.
Native HIP/C++ kernels can be compiled on demand for HBM and MFU paths, or swapped for your own executable.
Profiling engines can wrap a suite run and attach external artifacts such as rocprof traces.
Dtype support is discovered dynamically from the installed torch build, including optional float8 types when present.

Repo Layout

.
├── configs/
│   ├── profiles/
│   │   ├── hardware/
│   │   └── rocm/
│   └── suites/
├── docs/
├── examples/
├── src/teamredbench/
│   ├── benchmarks/
│   └── metrics/
└── tests/

Quick Start

Install the package and runtime dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,torch]"

Inspect the detected ROCm/PyTorch environment:

teamredbench discover

Copy or edit a hardware profile under configs/profiles/hardware/ and fill in the peak numbers for the target GPU. Concrete published profiles are included for AMD Instinct MI300X, MI325X, MI350X, and MI355X. If a suite still points at a generic profile, the runner auto-selects a matching published profile when it recognizes the local GPU SKU.

Run a suite:

teamredbench run configs/suites/smoke.yaml

Results are written to results/ in JSON and CSV, plus a *.metadata.json sidecar with the run command, config contents, environment snapshot, git state, and software/runtime versions needed to reproduce the run. When the hardware profile defines peak bandwidth or compute values, the live run output also prints the percentage of theoretical peak. If a peak is not configured, the percentage is shown as n/a.
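
The percentage-of-peak column can be pictured with a small helper. This is only a sketch of the behaviour described above, not the runner's code: the name format_peak_pct and the one-decimal formatting are assumptions.

```python
from typing import Optional


def format_peak_pct(achieved: float, peak: Optional[float]) -> str:
    """Return achieved/peak as a percentage string, or "n/a" without a peak."""
    # A missing or non-positive peak is treated as "not configured".
    if peak is None or peak <= 0:
        return "n/a"
    return f"{100.0 * achieved / peak:.1f}%"


# Achieved and peak values must use the same unit (for example GB/s).
print(format_peak_pct(4000.0, 5300.0))
print(format_peak_pct(4000.0, None))
```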

Benchmark Coverage

HBM bandwidth

The hbm benchmark uses large tensor kernels to stress device memory traffic. It supports:

copy
scale
triad

The default backend is torch. Set params.backend: native to run the built-in HIP kernel instead. Built-in native kernel name: hbm_hip.

Each result reports raw counters and derived metrics such as:

hbm_bandwidth_gbps
hbm_efficiency_pct
latency_us
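
As a rough illustration of how hbm_bandwidth_gbps could be derived from raw counters, the sketch below uses STREAM-style byte accounting. The per-mode array counts and the function name are assumptions; the benchmark's actual accounting may differ.

```python
# Arrays touched per element, STREAM-style (assumed accounting):
#   copy:  b = a          -> 1 read + 1 write
#   scale: b = s * a      -> 1 read + 1 write
#   triad: c = a + s * b  -> 2 reads + 1 write
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "triad": 3}


def hbm_bandwidth_gbps(mode: str, num_elements: int, itemsize: int, seconds: float) -> float:
    """Bytes moved divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    bytes_moved = ARRAYS_TOUCHED[mode] * num_elements * itemsize
    return bytes_moved / seconds / 1e9


# Example: a float32 triad over 2**30 elements that took 10 ms.
print(hbm_bandwidth_gbps("triad", 2**30, 4, 0.010))
```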

Communication bandwidth

The collective benchmark uses torch.distributed with RCCL. It supports:

all_reduce
all_gather
broadcast

The benchmark classifies the run as intra-node or inter-node by gathering hostnames after process-group init. It reports:

payload_bandwidth_gbps
bus_bandwidth_gbps
link_efficiency_pct
latency_us
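
The difference between payload and bus bandwidth can be sketched as follows. This assumes the scaling factors used by the nccl-tests/rccl-tests convention; whether TeamRedBench uses exactly these factors is not stated here, and the function name is hypothetical.

```python
def bus_bandwidth_gbps(op: str, payload_bytes: int, seconds: float, world_size: int) -> float:
    """Scale payload bandwidth by a per-collective factor (nccl-tests convention)."""
    payload_gbps = payload_bytes / seconds / 1e9
    n = world_size
    # Factors reflect how much data each link must carry relative to the payload.
    factors = {
        "all_reduce": 2 * (n - 1) / n,
        "all_gather": (n - 1) / n,
        "broadcast": 1.0,
    }
    return payload_gbps * factors[op]


# Example: a 1 GiB all_reduce across 8 ranks that took 20 ms.
print(bus_bandwidth_gbps("all_reduce", 2**30, 0.020, 8))
```

Dividing the bus bandwidth by the configured link peak would then give a figure comparable to link_efficiency_pct.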

Launch collectives with torchrun, srun, or another distributed launcher that sets the standard environment variables. Use backend: rccl in suite params when you need to override the default.
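
A minimal sketch of the hostname-based classification and of reading the launcher environment is shown below. The function names are hypothetical. RANK, WORLD_SIZE, and LOCAL_RANK are the variables torchrun exports; in a real run the hostnames would be collected from every rank (for example with torch.distributed.all_gather_object) after the process group is initialized.

```python
import os
import socket


def classify_topology(hostnames):
    """Return "intra-node" when every rank reports the same host."""
    return "intra-node" if len(set(hostnames)) <= 1 else "inter-node"


def launcher_env(environ=os.environ):
    """Read the rank layout that torchrun-style launchers export."""
    return {
        "rank": int(environ.get("RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
    }


# Single-process example: only the local hostname is known.
print(classify_topology([socket.gethostname()]))
print(launcher_env())
```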

MFU

The mfu benchmark runs GEMM sweeps and compares achieved throughput against per-dtype theoretical peaks from the selected hardware profile. It reports:

achieved_tops
mfu_pct
latency_us

The default backend is torch. Set params.backend: native to run the built-in rocWMMA/MFMA GEMM kernel instead. Built-in native kernel name: mfu_hipblas.

For integer and complex dtypes, the repo uses dtype-specific operation-count factors so MFU remains tied to the configured theoretical peak.
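
The MFU arithmetic can be sketched as below. The 2*M*N*K operation count is the standard figure for a real-valued GEMM. The ops_factor parameter and the function name are assumptions standing in for the dtype-specific factors mentioned above, whose actual values are not given here.

```python
def gemm_mfu_pct(m, n, k, seconds, peak_tops, ops_factor=1.0):
    """Return (achieved_tops, mfu_pct) for one M x N x K GEMM."""
    # One multiply and one add per inner-product term.
    ops = 2 * m * n * k * ops_factor
    achieved_tops = ops / seconds / 1e12
    mfu_pct = 100.0 * achieved_tops / peak_tops
    return achieved_tops, mfu_pct


# Example: a 4096^3 GEMM that took 0.25 ms, against an illustrative 1000 TOPS peak.
tops, pct = gemm_mfu_pct(4096, 4096, 4096, 0.25e-3, peak_tops=1000.0)
print(tops, pct)
```

The peak value in the example is illustrative only; real peaks come from the selected hardware profile.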

Dtype Strategy

teamredbench list-dtypes shows every dtype the local torch build exposes. The repo tries to cover:

bool
integer types
float16, bfloat16, float32, float64
complex64, complex128
float8 variants when the installed torch exposes them
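
Dynamic dtype discovery can be approximated by scanning a module for attributes of the dtype type. The helper below is a sketch with a hypothetical name, written generically so it runs without PyTorch; how list-dtypes is actually implemented is not shown in this document.

```python
def discover_dtypes(module, dtype_type):
    """Collect every attribute of `module` that is an instance of `dtype_type`."""
    found = {}
    for name in dir(module):
        value = getattr(module, name, None)
        if isinstance(value, dtype_type):
            found[name] = value
    return found


# With PyTorch installed, this would be called as:
#   import torch
#   dtypes = discover_dtypes(torch, torch.dtype)
#   has_fp8 = any(name.startswith("float8") for name in dtypes)
# Aliases such as torch.float and torch.float32 appear under both names.
```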

Some dtype and benchmark combinations are not valid on every ROCm stack. Those cases are recorded as skipped with an error message instead of aborting the whole suite.
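
The skip-instead-of-abort behaviour amounts to catching failures per case. A minimal sketch, with hypothetical names and a plain dict standing in for the real record type:

```python
def run_cases(cases, run_one):
    """Run every case; record failures as skipped rather than raising."""
    records = []
    for case in cases:
        try:
            records.append({"case": case, "status": "ok", "result": run_one(case)})
        except Exception as exc:  # e.g. an unsupported dtype/benchmark combination
            records.append({"case": case, "status": "skipped", "error": str(exc)})
    return records


def fake_run(case):
    # Stand-in workload: pretend float8 is unsupported on this stack.
    if case == "float8":
        raise RuntimeError("dtype not supported on this stack")
    return 1.0


for record in run_cases(["float16", "float8", "float32"], fake_run):
    print(record)
```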

Profiling

Suites can optionally run under a profiling engine. Built-in support includes:

rocprof

Example:

profiling:
  enabled: true
  engine: rocprof
  params:
    stats: true
    hip_trace: true

When profiling is enabled, TeamRedBench launches an internal child run under the selected profiler, then attaches the profiling artifact directory to the normal metadata output. The artifact path is also recorded under the profiling key of the output map in the *.metadata.json file.
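
Wrapping a child run under rocprof could look roughly like the sketch below. The mapping from the stats and hip_trace params to the rocprof --stats and --hip-trace flags is an assumption, as are the function name, the child command, and the output path.

```python
def build_rocprof_command(child_argv, params, output_csv):
    """Build an argv list that runs `child_argv` under rocprof."""
    cmd = ["rocprof"]
    if params.get("stats"):
        cmd.append("--stats")      # per-kernel timing statistics
    if params.get("hip_trace"):
        cmd.append("--hip-trace")  # trace HIP API calls
    cmd += ["-o", output_csv]      # where rocprof writes its CSV output
    return cmd + list(child_argv)


# Hypothetical child command; the real internal invocation may differ.
argv = build_rocprof_command(
    ["teamredbench", "run", "configs/suites/smoke.yaml"],
    {"stats": True, "hip_trace": True},
    "results/rocprof/results.csv",
)
print(" ".join(argv))
```

The resulting list can be passed to subprocess.run without a shell.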

List registered profiling engines:

teamredbench list-profile-engines

Adapting to New Hardware

Hardware-specific numbers are isolated in YAML:

peak HBM bandwidth
peak communication link bandwidths
peak per-dtype compute throughput

Published SKU profiles are provided for:

configs/profiles/hardware/amd_instinct_mi300x.yaml
configs/profiles/hardware/amd_instinct_mi325x.yaml
configs/profiles/hardware/amd_instinct_mi350x.yaml
configs/profiles/hardware/amd_instinct_mi355x.yaml
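
Auto-selecting a published profile from a detected GPU name can be sketched as a simple substring match. The matching rule and function name are assumptions; only the four file paths come from the list above.

```python
from typing import Optional

PUBLISHED_PROFILES = {
    "mi300x": "configs/profiles/hardware/amd_instinct_mi300x.yaml",
    "mi325x": "configs/profiles/hardware/amd_instinct_mi325x.yaml",
    "mi350x": "configs/profiles/hardware/amd_instinct_mi350x.yaml",
    "mi355x": "configs/profiles/hardware/amd_instinct_mi355x.yaml",
}


def match_published_profile(device_name: str) -> Optional[str]:
    """Return the profile path whose SKU appears in the device name, if any."""
    normalized = device_name.lower().replace(" ", "")
    for sku, path in PUBLISHED_PROFILES.items():
        if sku in normalized:
            return path
    return None  # unknown SKU: keep whatever profile the suite points at


print(match_published_profile("AMD Instinct MI300X"))
```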

To bring up a new accelerator:

Run teamredbench discover.
Copy the closest profile from configs/profiles/hardware/.
Fill in the target GPU's peak numbers.
Point the suite at the new profile.

Nothing in the benchmark runner is hard-coded to MI2xx, MI3xx, or RDNA SKUs. For inter-node communication, the network peak remains system-specific because it depends on the installed NIC and fabric rather than the GPU alone.

Adapting to New ROCm Versions

ROCm assumptions live under configs/profiles/rocm/. Keep runtime-specific items there:

expected ROCm version
library versions or notes
RCCL-specific environment overrides when the target stack needs them

This keeps runtime drift separate from hardware drift.

Adding a New Benchmark or Metric

Benchmarks and metrics register themselves at import time.

New benchmark:

Add a module under src/teamredbench/benchmarks/.
Decorate the class with @register_benchmark("name").
Return BenchmarkRecord objects with raw counters.

New metric:

Add a function under src/teamredbench/metrics/.
Decorate it with @register_metric("metric_name", "...").
Compute from the benchmark's raw counters.
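
The registration pattern described above can be illustrated with a self-contained sketch. The decorators here are stand-ins written for this example, not the package's own implementations: the second argument of register_metric is assumed to be a description, and a plain dict replaces BenchmarkRecord.

```python
BENCHMARKS = {}
METRICS = {}


def register_benchmark(name):
    """Class decorator that stores the benchmark under `name`."""
    def decorator(cls):
        BENCHMARKS[name] = cls
        return cls
    return decorator


def register_metric(name, description):
    """Function decorator that stores the metric and its description."""
    def decorator(fn):
        METRICS[name] = (fn, description)
        return fn
    return decorator


@register_benchmark("noop")
class NoopBenchmark:
    def run(self, params):
        # Raw counters only; derived values are left to metrics.
        return [{"bytes_moved": 1024, "seconds": 0.001}]


@register_metric("bytes_per_second", "Raw throughput from counters")
def bytes_per_second(record):
    return record["bytes_moved"] / record["seconds"]


# A runner can now look everything up by name without knowing the modules.
record = BENCHMARKS["noop"]().run({})[0]
print(METRICS["bytes_per_second"][0](record))
```

Because registration happens when the module is imported, a runner only has to import the module for the new name to become available.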

External modules can also be loaded via the suite plugins: field.

More detail is in docs/extending.md.

Native Backends

Native kernels are optional. TeamRedBench will compile them with hipcc when you select backend: native. The compiler can come from:

params.native.compiler
TEAMREDBENCH_HIPCC
hipcc in PATH
/opt/rocm/bin/hipcc
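
Assuming the list above is in priority order, compiler resolution could be sketched as follows. The function name is hypothetical, and whether the tool verifies that the chosen path exists is not stated here.

```python
import os
import shutil


def resolve_hipcc(params, environ=None, which=shutil.which):
    """Return the first configured hipcc candidate, in assumed priority order."""
    environ = os.environ if environ is None else environ
    candidates = [
        params.get("native", {}).get("compiler"),  # params.native.compiler
        environ.get("TEAMREDBENCH_HIPCC"),         # environment override
        which("hipcc"),                            # hipcc found in PATH
        "/opt/rocm/bin/hipcc",                     # default ROCm location
    ]
    for candidate in candidates:
        if candidate:
            return candidate


print(resolve_hipcc({"native": {"kernel": "hbm_hip"}}))
```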

You can also point directly at a prebuilt executable with params.native.binary.

Example:

benchmarks:
  - benchmark: hbm
    params:
      backend: native
      dtypes: [float32]
      modes: [copy, scale, triad]
      size_mib: 4096
      native:
        kernel: hbm_hip

  - benchmark: mfu
    params:
      backend: native
      dtypes: [float16, bfloat16, float32, float64]
      shapes:
        - [4096, 4096, 4096]
      native:
        kernel: mfu_hipblas

teamredbench list-native-kernels shows the registered native kernels.

rocprof Example

See configs/suites/rocprof_hbm_smoke.yaml for a minimal HBM run wrapped by rocprof.

Example Commands

List built-ins:

teamredbench list-benchmarks
teamredbench list-metrics
teamredbench list-native-kernels
teamredbench list-profile-engines
teamredbench list-dtypes

Run the example rocprof suite:

teamredbench run configs/suites/rocprof_hbm_smoke.yaml

Run the full suite:

teamredbench run configs/suites/full.yaml

Multi-node collective example:

sbatch examples/slurm/multi_node_collective.sh
