Agent instructions for TransformerEngine (ROCm fork)

Docker containers

We work in Docker containers for reproducibility.
Run build/test commands only inside the designated container (not on host).
If the container is unspecified, ask for the exact image/tag and launch command before running anything expensive.
Prefer editable installs (pip install -e .).
Before debugging, record: container image/tag, ROCm version, GPU arch, TE commit, submodule state.
If results are suspicious, first verify you are in the expected container and that GPU devices/libs are exposed correctly.
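
A minimal sketch of collecting the git-derived part of that record, assuming it runs from the repository root with git on PATH. The helper names `run_or_unknown` and `collect_git_state` and the dictionary keys are illustrative, not part of the repo. The container image/tag, ROCm version, and GPU arch are deliberately left out because where they come from depends on your container setup.

```python
import subprocess


def run_or_unknown(cmd):
    """Return the command's stdout, or 'unknown' if it cannot be run."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    # rstrip only: the leading character of `git submodule status` lines is meaningful.
    return result.stdout.rstrip()


def collect_git_state():
    """Gather the TE commit and submodule state for a debugging record."""
    return {
        "te_commit": run_or_unknown(["git", "rev-parse", "HEAD"]),
        "submodules": run_or_unknown(["git", "submodule", "status", "--recursive"]),
    }


if __name__ == "__main__":
    for key, value in collect_git_state().items():
        print(f"{key}:\n{value}\n")
```

Pasting this output next to the manually recorded container details gives a record that can be compared across runs.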

Architecture

Shape: one core C++/HIP library (libtransformer_engine.so, sourced from transformer_engine/common/) plus optional per-framework bindings under transformer_engine/{pytorch,jax}/, each with its C++ extension code in a csrc/ subdirectory.
Python import flow: framework selection happens in transformer_engine/__init__.py (driven by NVTE_FRAMEWORK); native .so resolution happens in transformer_engine/common/__init__.py. Start tracing from these two files for any import or framework-selection issue — function names inside may change, but the entry points are stable.
Build orchestration: setup.py + helpers in build_tools/ + CMake. ROCm vs CUDA backend detection lives in build_tools/utils.py — grep there (and for NVTE_USE_ROCM) for current behavior.
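3rdparty submodules: read .gitmodules for the current set of submodules (paths and URLs). The pinned commits are recorded in the superproject tree rather than in .gitmodules; run git submodule status to see them.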
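
A minimal sketch of listing what .gitmodules declares, assuming the standard git config layout (`[submodule "name"]` sections with `path` and `url` keys). The function name `parse_gitmodules` is illustrative. Pinned commits are not stored in this file, so they are out of scope here.

```python
import re

_SECTION_RE = re.compile(r'^\[submodule\s+"(?P<name>[^"]+)"\]\s*$')


def parse_gitmodules(text):
    """Return {submodule name: {key: value}} from .gitmodules content."""
    modules = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        section = _SECTION_RE.match(line)
        if section:
            current = modules.setdefault(section.group("name"), {})
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return modules


if __name__ == "__main__":
    with open(".gitmodules", encoding="utf-8") as handle:
        for name, fields in parse_gitmodules(handle.read()).items():
            print(name, fields.get("path"), fields.get("url"))
```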

Hipify convention

The build auto-generates HIP files from CUDA sources via hipify_torch. Generated files are marked with // !!! This is a file automatically generated by hipify!!! at line 1. Never edit generated files directly — edit the CUDA source instead.
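
Before editing a file, the marker can be checked programmatically. The sketch below is illustrative only: `is_hipify_generated` is a hypothetical helper, not a function in build_tools, and it assumes the marker text appears exactly as quoted above at the start of line 1.

```python
from pathlib import Path

HIPIFY_MARKER = "// !!! This is a file automatically generated by hipify!!!"


def is_hipify_generated(path):
    """Return True if the first line of the file carries the hipify marker."""
    with Path(path).open("r", encoding="utf-8", errors="replace") as handle:
        first_line = handle.readline()
    return first_line.strip().startswith(HIPIFY_MARKER)
```

If this returns True, locate the CUDA source the file was generated from and edit that instead.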

File extension mapping:

| CUDA source | Generated HIP file |
| --- | --- |
| `.cu` | `.hip` |
| `.cuh` | `_hip.cuh` |
| `.cpp` | `_hip.cpp` |
| `.h` | `_hip.h` |
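
The mapping in the table above can be expressed as a small lookup, which is handy when going from a CUDA source to the generated file you expect to see. This is a sketch of the naming rule only: `hipified_name` is a hypothetical helper, and it does not model whether hipify actually emits a file for a given source.

```python
from pathlib import PurePosixPath

# CUDA suffix -> (text appended to the stem, new extension), per the table.
_HIP_SUFFIXES = {
    ".cu": ("", ".hip"),
    ".cuh": ("_hip", ".cuh"),
    ".cpp": ("_hip", ".cpp"),
    ".h": ("_hip", ".h"),
}


def hipified_name(cuda_path):
    """Map a CUDA source path to the generated HIP file path."""
    path = PurePosixPath(cuda_path)
    try:
        stem_suffix, new_ext = _HIP_SUFFIXES[path.suffix]
    except KeyError:
        raise ValueError(f"no hipify mapping for suffix {path.suffix!r}") from None
    return str(path.with_name(path.stem + stem_suffix + new_ext))
```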

The following directories are excluded from hipify (native ROCm code — edit directly):

transformer_engine/common/ck_fused_attn/ — CK kernel wrappers
transformer_engine/common/amd_detail/ — AMD-specific utilities
transformer_engine/common/rocshmem_api/ — ROCshmem wrappers

Framework bindings (pytorch/csrc, jax/csrc) are hipified separately via build_tools/pytorch.py and build_tools/jax.py.

A file that contains only HIP code (no CUDA, no transitively-included CUDA headers) can be skipped by hipify in one of two ways:

Explicit exclusion: add it to the ignores list in do_hipify() (build_tools/hipify/hipify.py). Best for subdirectories that are entirely HIP-only.
Auto-detection: add the line #include "hip/hip_runtime.h" to the file, either as a real include or commented out if the header is not actually needed, and hipify will detect that no rewrite is required.

Fused attention backends

Backends are gated by env vars (set to 0 to disable, unset or 1 to enable):

| Env var | Controls | Default |
| --- | --- | --- |
| `NVTE_FUSED_ATTN` | Master toggle for all fused attention | `1` |
| `NVTE_FUSED_ATTN_CK` | CK backend | inherits `NVTE_FUSED_ATTN` |
| `NVTE_FUSED_ATTN_AOTRITON` | AOTriton backend | inherits `NVTE_FUSED_ATTN` |
| `NVTE_FLASH_ATTN` | Flash attention | `1` |
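
To reason about which toggles are effectively on for a given environment, the defaults in the table can be modeled as below. This is a sketch under two assumptions that should be checked against fused_attn.cpp: that "inherits" means the master toggle supplies the default when the backend variable is unset, and that any value other than 0 counts as enabled. The names `_flag` and `resolve_attention_flags` are hypothetical.

```python
import os


def _flag(env, name, default):
    """'0' disables, unset falls back to default, anything else enables (assumed)."""
    value = env.get(name)
    if value is None:
        return default
    return value.strip() != "0"


def resolve_attention_flags(env=None):
    """Return the modeled on/off state of each attention toggle."""
    env = os.environ if env is None else env
    fused = _flag(env, "NVTE_FUSED_ATTN", True)
    return {
        "fused": fused,
        # Assumption: the master toggle is only the default for these two.
        "ck": _flag(env, "NVTE_FUSED_ATTN_CK", fused),
        "aotriton": _flag(env, "NVTE_FUSED_ATTN_AOTRITON", fused),
        "flash": _flag(env, "NVTE_FLASH_ATTN", True),
    }
```

Whether an explicit NVTE_FUSED_ATTN_CK=1 overrides NVTE_FUSED_ATTN=0 is not stated here; the sketch lets the explicit value win, which may differ from the real dispatch logic.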

CI backend configs (ci/_utils.sh::configure_fused_attn_env): auto, ck, aotriton, flash, unfused.

ROCm fused-attn file layout

Runtime backend selection/dispatch: transformer_engine/common/fused_attn_rocm/fused_attn.cpp (hipified)
CK dispatch glue: transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp (hipified)
AOTriton dispatch glue: transformer_engine/common/fused_attn_rocm/fused_attn_aotriton.cpp (hipified)
CK kernel wrappers (native, not hipified):
- transformer_engine/common/ck_fused_attn/src/ck_fused_attn_{fwd,bwd,utils}.cpp
- transformer_engine/common/ck_fused_attn/include/ck_fused_attn/ck_fused_attn.hpp

Debug logging env vars

NVTE_DEBUG=1 + NVTE_DEBUG_LEVEL={0,1,2} — Python-level attention debug output
NVTE_LOG_FUSED_ATTN_CONFIG=1 — C++ backend selection logging
NVTE_LOG_CK_CONFIG=1 — CK-specific config logging
NVTE_LOG_AOTRITON_CONFIG=1 — AOTriton-specific config logging
CK_FUSED_ATTN_LOG_CONFIG=1 — CK kernel wrapper logging
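
When launching a repro from a script, these variables can be set in one place. The sketch below only builds an environment dictionary from the variables listed above; `debug_env` and its parameters are hypothetical names chosen for illustration.

```python
import os


def debug_env(base=None, level=2, ck=True, aotriton=True):
    """Return a copy of the environment with the attention debug flags set."""
    if level not in (0, 1, 2):
        raise ValueError("NVTE_DEBUG_LEVEL must be 0, 1, or 2")
    env = dict(os.environ if base is None else base)
    env.update({
        "NVTE_DEBUG": "1",
        "NVTE_DEBUG_LEVEL": str(level),
        "NVTE_LOG_FUSED_ATTN_CONFIG": "1",
    })
    if ck:
        env["NVTE_LOG_CK_CONFIG"] = "1"
        env["CK_FUSED_ATTN_LOG_CONFIG"] = "1"
    if aotriton:
        env["NVTE_LOG_AOTRITON_CONFIG"] = "1"
    return env
```

The result can be passed as the `env` argument of `subprocess.run` when invoking a test script inside the container.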

Developer workflows

Always init submodules first: git submodule update --init --recursive.
Source install: pip install . --no-build-isolation.
C++ tests: ci/core.sh.
Framework CI tests (shell scripts, not bare pytest):
- PyTorch: ci/pytorch.sh | JAX: ci/jax.sh
- Control via TEST_LEVEL, TEST_SGPU, TEST_MGPU, TEST_FILTER (from ci/_utils.sh).

Code conventions

Edit transformer_engine/*, build_tools/*, tests/*, ci/*; avoid 3rdparty/* unless explicitly required.
Keep env-var behavior stable; tests toggle flags intentionally.
Python: Black, line length 100, lint via .pylintrc. C/C++: cpplint + .clang-format.
Preserve the existing style of each file you edit. Much of the codebase originates from upstream, and style can vary file-to-file (naming conventions, comment style, control flow patterns, etc.). Before writing new code in a file, read enough of it to understand how similar logic is already written, and follow that style. Consistency within a file matters more than imposing a uniform style across the project.

Copyright headers

When you modify a file, update its copyright header so the end-year reflects the current year.

This repo carries two copyright lines — AMD and NVIDIA. Follow these rules:

Files with an existing AMD copyright line — update the AMD end-year to the current year (e.g. 2025 → 2026). Leave the NVIDIA line untouched.
Files with only an NVIDIA copyright line — add an AMD line above the NVIDIA line:
- Python: # Copyright (c) <YEAR>, Advanced Micro Devices, Inc. All rights reserved.
- C/C++/HIP: /* Copyright (c) <YEAR>, Advanced Micro Devices, Inc. All rights reserved. */ (or use the *-block style matching the file).
- <YEAR> is the current year (single year) for newly-added lines, e.g. 2026.
New files you create — include both AMD and NVIDIA headers with the current year, followed by a blank comment line and See LICENSE for license information.
Never change the NVIDIA copyright year range — those dates are updated during IFU (integrate from upstream) merges.

AMD headers are our addition and should stay consistent with the patterns already in the codebase.
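
The first two rules can be sketched for line-comment headers as follows. This is an illustration, not a tool that exists in the repo: `update_amd_header` is a hypothetical name, it assumes AMD year ranges are written as START-END, and it only handles headers where each line carries its own comment prefix (for example `# ` in Python or ` * ` inside a star block). Single-line `/* ... */` headers need manual handling.

```python
import re

AMD_RE = re.compile(
    r"(Copyright \(c\) )(\d{4})(?:-(\d{4}))?(, Advanced Micro Devices, Inc\.)"
)


def update_amd_header(text, year, comment_prefix="# "):
    """Bump the AMD end-year, or add an AMD line above the NVIDIA line."""
    match = AMD_RE.search(text)
    if match:
        start = match.group(2)
        years = start if int(start) == year else f"{start}-{year}"
        return (text[:match.start()] + match.group(1) + years
                + match.group(4) + text[match.end():])
    lines = text.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if "Copyright" in line and "NVIDIA" in line:
            amd = (f"{comment_prefix}Copyright (c) {year}, "
                   "Advanced Micro Devices, Inc. All rights reserved.\n")
            lines.insert(index, amd)  # AMD line goes above the NVIDIA line
            return "".join(lines)
    return text  # no recognizable header; leave the file untouched
```

The NVIDIA line is never rewritten in either branch, which matches the rule that its year range changes only during IFU merges.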

Memory management

When writing or updating memories in the project memory directory, follow these guidelines:

Scope: only save information that will be useful in future conversations. Do not save ephemeral task details, debugging breadcrumbs, or things derivable from the code/git history.
Check before writing: read MEMORY.md and check for an existing memory on the same topic before creating a new file. Update the existing memory instead of duplicating.
File naming: use short, descriptive, snake_case names (e.g. aiter_build.md, container_setup.md). Group by topic, not by date.
Frontmatter: every memory file must have the standard name, description, and type frontmatter fields.
Index maintenance: after creating or removing a memory file, update MEMORY.md to keep the index in sync. Each entry should be a single line under 150 characters.
Staleness: memories are point-in-time observations. When recalling a memory, verify it against current code/state before acting on it. Update or delete memories that are no longer accurate.
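
The frontmatter and index rules above lend themselves to a quick mechanical check. The sketch below assumes memory files open with a frontmatter block delimited by `---` lines containing `key: value` pairs; that delimiter format is an assumption, and both function names are hypothetical.

```python
REQUIRED_FIELDS = ("name", "description", "type")


def missing_frontmatter_fields(text):
    """Return the required frontmatter fields that a memory file lacks."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return list(REQUIRED_FIELDS)  # no frontmatter block at all
    present = set()
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        key, sep, _ = line.partition(":")
        if sep:
            present.add(key.strip())
    return [field for field in REQUIRED_FIELDS if field not in present]


def index_entry_ok(entry):
    """Check the MEMORY.md rule: a single line under 150 characters."""
    return "\n" not in entry and len(entry) < 150
```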

Troubleshooting pointers

Missing .so on import: check path resolution in transformer_engine/common/__init__.py.
Framework extension won't build on ROCm: check build_tools/utils.py::get_frameworks().
Fused-attn regression: reproduce under multiple backend configs (auto, ck, aotriton, unfused).
CK/AITER kernel failures: use the ck-debugging skill for structured triage and isolation.
