Fangzhou-Ai/TeamRedBench

TeamRedBench

TeamRedBench is a config-driven benchmark repo for AMD RDNA and CDNA GPUs on ROCm. It focuses on three benchmark families out of the box:

  • HBM bandwidth
  • Intra-node and inter-node communication bandwidth
  • MFU-style compute utilization from GEMM throughput

The repo is built to stay adaptable when hardware, ROCm, or metrics change:

  • Hardware profiles live in YAML instead of code.
  • Runtime profiles capture ROCm/library assumptions separately from hardware.
  • Benchmarks and metrics are both registry-driven, so new modules can be added without editing the runner.
  • Native HIP/C++ kernels can be compiled on demand for HBM and MFU paths, or swapped for your own executable.
  • Profiling engines can wrap a suite run and attach external artifacts such as rocprof traces.
  • Dtype support is discovered dynamically from the installed torch build, including optional float8 types when present.

Repo Layout

.
├── configs/
│   ├── profiles/
│   │   ├── hardware/
│   │   └── rocm/
│   └── suites/
├── docs/
├── examples/
├── src/teamredbench/
│   ├── benchmarks/
│   └── metrics/
└── tests/

Quick Start

  1. Install the package and runtime dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,torch]"
  2. Inspect the detected ROCm/PyTorch environment:
teamredbench discover
  3. Copy or edit a hardware profile under configs/profiles/hardware/ and fill in the peak numbers for the target GPU. Concrete published profiles are included for AMD Instinct MI300X, MI325X, MI350X, and MI355X. If a suite still points at a generic profile, the runner will auto-select a matching published profile when it recognizes the local GPU SKU.

  4. Run a suite:

teamredbench run configs/suites/smoke.yaml

Results are written to results/ in JSON and CSV, plus a *.metadata.json sidecar with the run command, config contents, environment snapshot, git state, and software/runtime versions needed to reproduce the run. When the hardware profile defines peak bandwidth or compute values, the live run output also prints the percentage of theoretical peak. If a peak is not configured, the percentage is shown as n/a.
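
As a rough illustration of the sidecar and the percentage-of-peak column described above, here is a minimal Python sketch. The function names and the dictionary keys are hypothetical and are not taken from the TeamRedBench source; the actual *.metadata.json layout may differ.

```python
import json
import platform
import subprocess
import sys


def git_commit():
    """Return the current commit hash, or None outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def build_metadata(command, config_text):
    """Collect reproduction details into one JSON-serializable dict.

    The key names are assumptions made for this sketch.
    """
    return {
        "command": list(command),
        "config": config_text,
        "environment": {
            "python": sys.version,
            "platform": platform.platform(),
        },
        "git": {"commit": git_commit()},
    }


def format_pct_of_peak(achieved, peak):
    """Format achieved/peak as a percentage, or 'n/a' when no peak is set."""
    if not peak:
        return "n/a"
    return f"{100.0 * achieved / peak:.1f}%"


if __name__ == "__main__":
    meta = build_metadata(sys.argv, "name: smoke\n")
    print(json.dumps(meta, indent=2))
    print(format_pct_of_peak(4000.0, 5000.0))
    print(format_pct_of_peak(4000.0, None))
```

Keeping the formatting helper separate from the measurement makes the n/a case explicit: a missing peak affects only the displayed percentage, not the raw result.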

Benchmark Coverage

HBM bandwidth

The hbm benchmark uses large tensor kernels to stress device memory traffic. It supports:

  • copy
  • scale
  • triad

The default backend is torch. Set params.backend: native to run the built-in HIP kernel instead. The built-in native kernel is registered as hbm_hip.

Each result reports raw counters and derived metrics such as:

  • hbm_bandwidth_gbps
  • hbm_efficiency_pct
  • latency_us
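
The derived HBM metrics can be sketched as follows. This is a minimal example that assumes the common STREAM byte-counting convention (copy and scale touch two arrays, triad touches three) and treats "gbps" as gigabytes per second (1e9 bytes/s). Both are assumptions; the repo's own accounting may differ.

```python
# Arrays touched per element under the STREAM convention (assumed here):
#   copy:  a[i] = b[i]             -> 1 read + 1 write
#   scale: a[i] = s * b[i]         -> 1 read + 1 write
#   triad: a[i] = b[i] + s * c[i]  -> 2 reads + 1 write
ARRAYS_TOUCHED = {"copy": 2, "scale": 2, "triad": 3}


def hbm_bandwidth_gbps(mode, num_elements, element_bytes, seconds):
    """Bytes moved per second, in GB/s, for one timed kernel launch."""
    bytes_moved = ARRAYS_TOUCHED[mode] * num_elements * element_bytes
    return bytes_moved / seconds / 1e9


def hbm_efficiency_pct(bandwidth_gbps, peak_gbps):
    """Percentage of the profile's peak, or None when no peak is configured."""
    if not peak_gbps:
        return None
    return 100.0 * bandwidth_gbps / peak_gbps


if __name__ == "__main__":
    # 1e9 float32 elements, triad, 0.5 s: 12e9 bytes moved in total.
    bw = hbm_bandwidth_gbps("triad", 1_000_000_000, 4, 0.5)
    print(bw, hbm_efficiency_pct(bw, 48.0))
```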

Communication bandwidth

The collective benchmark uses torch.distributed with RCCL. It supports:

  • all_reduce
  • all_gather
  • broadcast

The benchmark classifies the run as intra-node or inter-node by gathering hostnames after process-group init. It reports:

  • payload_bandwidth_gbps
  • bus_bandwidth_gbps
  • link_efficiency_pct
  • latency_us
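
For readers unfamiliar with the difference between payload and bus bandwidth, the sketch below uses the correction factors documented by the nccl-tests project. Whether TeamRedBench applies exactly these factors is an assumption; the function names are hypothetical.

```python
def bus_bandwidth_factor(op, world_size):
    """Factor that converts payload (algorithm) bandwidth to bus bandwidth.

    Factors follow the nccl-tests convention, where n is the number of ranks.
    """
    n = world_size
    if n < 1:
        raise ValueError("world_size must be at least 1")
    if op == "all_reduce":
        return 2.0 * (n - 1) / n
    if op == "all_gather":
        return (n - 1) / n
    if op == "broadcast":
        return 1.0
    raise ValueError(f"unsupported collective: {op}")


def payload_bandwidth_gbps(payload_bytes, seconds):
    """Payload bytes per second, in GB/s (1e9 bytes per second)."""
    return payload_bytes / seconds / 1e9


def bus_bandwidth_gbps(op, world_size, payload_bytes, seconds):
    """Bus bandwidth: payload bandwidth scaled by the collective's factor."""
    factor = bus_bandwidth_factor(op, world_size)
    return factor * payload_bandwidth_gbps(payload_bytes, seconds)


if __name__ == "__main__":
    # 8 ranks, 4 GB all_reduce payload completed in 0.5 s.
    print(bus_bandwidth_gbps("all_reduce", 8, 4_000_000_000, 0.5))
```

The bus figure is the one that can be compared against a link's hardware peak, because it accounts for the extra traffic a collective generates beyond the payload itself.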

Launch collectives with torchrun, srun, or another distributed launcher that sets the standard environment variables. Use backend: rccl in suite params when you need to override the default.
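
The intra-node versus inter-node classification mentioned above reduces to checking how many distinct hostnames the ranks report. A minimal sketch, with a hypothetical function name:

```python
import socket


def classify_topology(hostnames):
    """Return 'intra-node' if every rank reports the same host."""
    unique_hosts = set(hostnames)
    if not unique_hosts:
        raise ValueError("at least one hostname is required")
    return "intra-node" if len(unique_hosts) == 1 else "inter-node"


if __name__ == "__main__":
    # Single-process demonstration: one rank, one host.
    print(classify_topology([socket.gethostname()]))
```

In a distributed run, each rank would contribute socket.gethostname() and the list would be collected after process-group init, for example with torch.distributed.all_gather_object. How TeamRedBench gathers the names internally is not shown here.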

MFU

The mfu benchmark runs GEMM sweeps and compares achieved throughput against per-dtype theoretical peaks from the selected hardware profile. It reports:

  • achieved_tops
  • mfu_pct
  • latency_us

The default backend is torch. Set params.backend: native to run the built-in rocWMMA/MFMA GEMM kernel instead. The built-in native kernel is registered as mfu_hipblas.

For integer and complex dtypes, the repo uses dtype-specific operation-count factors so MFU remains tied to the configured theoretical peak.
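
The MFU arithmetic can be sketched as below. The sketch assumes the usual count of 2 operations per multiply-accumulate for real and integer GEMMs, and 8 real operations per complex multiply-accumulate (6 for the multiply, 2 for the add). These factors are conventional values chosen for illustration; the factors the repo actually uses are not stated in this README.

```python
# Operations per multiply-accumulate (assumed conventional values).
OPS_PER_MAC = {"real": 2, "integer": 2, "complex": 8}


def gemm_ops(m, n, k, kind="real"):
    """Total operations for an (m x k) @ (k x n) matrix multiply."""
    return OPS_PER_MAC[kind] * m * n * k


def achieved_tops(m, n, k, seconds, kind="real"):
    """Achieved throughput in tera-operations per second."""
    return gemm_ops(m, n, k, kind) / seconds / 1e12


def mfu_pct(achieved, peak_tops):
    """Achieved throughput as a percentage of the per-dtype peak.

    Returns None when the hardware profile has no peak for this dtype.
    """
    if not peak_tops:
        return None
    return 100.0 * achieved / peak_tops


if __name__ == "__main__":
    tops = achieved_tops(4096, 4096, 4096, seconds=0.001)
    print(round(tops, 2), mfu_pct(tops, 1000.0))
```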

Dtype Strategy

teamredbench list-dtypes shows every dtype the local torch build exposes. The repo tries to cover:

  • bool
  • integer types
  • float16, bfloat16, float32, float64
  • complex64, complex128
  • float8 variants when the installed torch exposes them

Some dtype and benchmark combinations are not valid on every ROCm stack. Those cases are recorded as skipped with an error message instead of aborting the whole suite.
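
Dynamic dtype discovery and the skip-instead-of-abort behavior can be sketched as follows. The candidate list, function names, and record layout are hypothetical. The float8 names are attributes that recent torch builds expose, and the probe simply ignores any that are absent.

```python
CANDIDATE_DTYPES = (
    "bool", "uint8", "int8", "int16", "int32", "int64",
    "float16", "bfloat16", "float32", "float64",
    "complex64", "complex128",
    # Optional float8 variants; only present on some torch builds.
    "float8_e4m3fn", "float8_e5m2", "float8_e4m3fnuz", "float8_e5m2fnuz",
)


def discover_dtypes(namespace, candidates=CANDIDATE_DTYPES):
    """Return {name: dtype} for every candidate the namespace exposes."""
    return {
        name: getattr(namespace, name)
        for name in candidates
        if hasattr(namespace, name)
    }


def run_or_skip(fn, *args):
    """Run one dtype/benchmark case; record a skip instead of raising."""
    try:
        return {"status": "ok", "value": fn(*args)}
    except (RuntimeError, TypeError, NotImplementedError) as exc:
        return {"status": "skipped", "error": str(exc)}


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed; nothing to discover")
    else:
        print(sorted(discover_dtypes(torch)))
```

Probing with hasattr rather than hard-coding a list keeps the same code working across torch builds that differ in float8 support.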

Profiling

Suites can optionally run under a profiling engine. Built-in support includes:

  • rocprof

Example:

profiling:
  enabled: true
  engine: rocprof
  params:
    stats: true
    hip_trace: true

When profiling is enabled, TeamRedBench launches an internal child run under the selected profiler, then attaches the profiling artifact directory to the normal metadata output. The artifact path is also recorded under the profiling key of the output map in the *.metadata.json file.
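
To make the wrapping step concrete, the sketch below maps the params from the example above onto a profiler command line. It assumes the legacy rocprof CLI flags -o, --stats, and --hip-trace; check rocprof --help on your ROCm install. The function name, the results.csv file name, and the child command are hypothetical; the repo's internal child-run invocation is not documented here.

```python
import subprocess


def build_rocprof_command(child_command, artifact_dir, params):
    """Prefix a child command with rocprof and the requested options."""
    cmd = ["rocprof", "-o", f"{artifact_dir}/results.csv"]
    if params.get("stats"):
        cmd.append("--stats")
    if params.get("hip_trace"):
        cmd.append("--hip-trace")
    return cmd + list(child_command)


def run_profiled(child_command, artifact_dir, params):
    """Launch the child under rocprof and return its exit code."""
    cmd = build_rocprof_command(child_command, artifact_dir, params)
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    print(build_rocprof_command(
        ["teamredbench", "run", "configs/suites/rocprof_hbm_smoke.yaml"],
        "results/rocprof",
        {"stats": True, "hip_trace": True},
    ))
```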

List registered profiling engines:

teamredbench list-profile-engines

Adapting to New Hardware

Hardware-specific numbers are isolated in YAML:

  • peak HBM bandwidth
  • peak communication link bandwidths
  • peak per-dtype compute throughput

Published SKU profiles are provided for:

  • configs/profiles/hardware/amd_instinct_mi300x.yaml
  • configs/profiles/hardware/amd_instinct_mi325x.yaml
  • configs/profiles/hardware/amd_instinct_mi350x.yaml
  • configs/profiles/hardware/amd_instinct_mi355x.yaml
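
Auto-selection of a published profile could work along these lines. This is a sketch under the assumption that matching is done on the device name string; the function name and the matching rule are hypothetical, and only the four profile paths come from this README.

```python
PUBLISHED_PROFILES = {
    "mi300x": "configs/profiles/hardware/amd_instinct_mi300x.yaml",
    "mi325x": "configs/profiles/hardware/amd_instinct_mi325x.yaml",
    "mi350x": "configs/profiles/hardware/amd_instinct_mi350x.yaml",
    "mi355x": "configs/profiles/hardware/amd_instinct_mi355x.yaml",
}


def match_published_profile(device_name):
    """Return the published profile path for a device name, or None."""
    normalized = device_name.lower()
    for sku, path in PUBLISHED_PROFILES.items():
        if sku in normalized:
            return path
    return None


if __name__ == "__main__":
    print(match_published_profile("AMD Instinct MI300X"))
    print(match_published_profile("AMD Radeon RX 7900 XTX"))
```

On a ROCm build of PyTorch, the device name would typically come from torch.cuda.get_device_name(0). A None result means the generic profile stays in effect.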

To bring up a new accelerator:

  1. Run teamredbench discover.
  2. Copy the closest profile from configs/profiles/hardware/.
  3. Fill in the target GPU's peak numbers.
  4. Point the suite at the new profile.

Nothing in the benchmark runner is hard-coded to MI2xx, MI3xx, or RDNA SKUs. For inter-node communication, the network peak remains system-specific because it depends on the installed NIC and fabric rather than the GPU alone.

Adapting to New ROCm Versions

ROCm assumptions live under configs/profiles/rocm/. Keep runtime-specific items there:

  • expected ROCm version
  • library versions or notes
  • RCCL-specific environment overrides when the target stack needs them

This keeps runtime drift separate from hardware drift.

Adding a New Benchmark or Metric

Benchmarks and metrics register themselves at import time.

New benchmark:

  1. Add a module under src/teamredbench/benchmarks/.
  2. Decorate the class with @register_benchmark("name").
  3. Return BenchmarkRecord objects with raw counters.

New metric:

  1. Add a function under src/teamredbench/metrics/.
  2. Decorate it with @register_metric("metric_name", "...").
  3. Compute from the benchmark's raw counters.

External modules can also be loaded via the suite plugins: field.

More detail is in docs/extending.md.
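
For readers new to the pattern, a decorator-based registry can be sketched as follows. This illustrates the general technique only and is not the repo's implementation. The registry dictionaries, the meaning of the second register_metric argument (treated here as a free-form description), and the plain dict standing in for BenchmarkRecord are all assumptions.

```python
BENCHMARKS = {}
METRICS = {}


def register_benchmark(name):
    """Class decorator that records a benchmark under a unique name."""
    def decorator(cls):
        if name in BENCHMARKS:
            raise ValueError(f"benchmark already registered: {name}")
        BENCHMARKS[name] = cls
        return cls
    return decorator


def register_metric(name, description):
    """Function decorator that records a metric and its description."""
    def decorator(fn):
        METRICS[name] = (description, fn)
        return fn
    return decorator


@register_benchmark("example")
class ExampleBenchmark:
    def run(self):
        # A plain dict stands in for a BenchmarkRecord's raw counters.
        return [{"bytes_moved": 8_000_000_000, "seconds": 0.5}]


@register_metric("example_gbps", "bytes moved per second, in GB/s")
def example_gbps(counters):
    return counters["bytes_moved"] / counters["seconds"] / 1e9


if __name__ == "__main__":
    # The runner only needs the registries, never the concrete modules.
    for record in BENCHMARKS["example"]().run():
        for metric_name, (_, fn) in METRICS.items():
            print(metric_name, fn(record))
```

Because registration happens as a side effect of import, adding a module is enough to make a new benchmark or metric visible to the runner.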

Native Backends

Native kernels are optional. TeamRedBench will compile them with hipcc when you select backend: native. The compiler can come from:

  • params.native.compiler
  • TEAMREDBENCH_HIPCC
  • hipcc in PATH
  • /opt/rocm/bin/hipcc
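
A compiler lookup over these four sources might look like the sketch below. It assumes the sources are tried in the order listed, which this README does not state explicitly, and the function name is hypothetical.

```python
import os
import shutil


def resolve_hipcc(params_compiler=None, environ=None):
    """Return the first usable hipcc path, or None if none is found.

    Assumed order: explicit param, TEAMREDBENCH_HIPCC, PATH, /opt/rocm.
    """
    environ = os.environ if environ is None else environ
    if params_compiler:
        return params_compiler
    from_env = environ.get("TEAMREDBENCH_HIPCC")
    if from_env:
        return from_env
    on_path = shutil.which("hipcc")
    if on_path:
        return on_path
    fallback = "/opt/rocm/bin/hipcc"
    return fallback if os.path.isfile(fallback) else None


if __name__ == "__main__":
    print(resolve_hipcc())
```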

You can also point directly at a prebuilt executable with params.native.binary.

Example:

benchmarks:
  - benchmark: hbm
    params:
      backend: native
      dtypes: [float32]
      modes: [copy, scale, triad]
      size_mib: 4096
      native:
        kernel: hbm_hip

  - benchmark: mfu
    params:
      backend: native
      dtypes: [float16, bfloat16, float32, float64]
      shapes:
        - [4096, 4096, 4096]
      native:
        kernel: mfu_hipblas

teamredbench list-native-kernels shows the registered native kernels.

rocprof Example

See configs/suites/rocprof_hbm_smoke.yaml for a minimal HBM run wrapped by rocprof.

Example Commands

List built-ins:

teamredbench list-benchmarks
teamredbench list-metrics
teamredbench list-native-kernels
teamredbench list-profile-engines
teamredbench list-dtypes

Run the example rocprof suite:

teamredbench run configs/suites/rocprof_hbm_smoke.yaml

Run the full suite:

teamredbench run configs/suites/full.yaml

Multi-node collective example:

sbatch examples/slurm/multi_node_collective.sh
