- We work in Docker containers for reproducibility.
- Run build/test commands only inside the designated container (not on host).
- If container is unspecified, ask for the exact image/tag and launch command before running anything expensive.
- Prefer editable installs (`pip install -e .`).
- Before debugging, record: container image/tag, ROCm version, GPU arch, TE commit, submodule state.
- If results are suspicious, first verify you are in the expected container and that GPU devices/libs are exposed correctly.
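The pre-debugging record above can be captured in one step. The following is a minimal sketch, not a project tool: the helper names are hypothetical, the ROCm version file location (`/opt/rocm/.info/version`) and the availability of `rocminfo` on `PATH` are assumptions, and the image/tag is passed in by the caller. Anything that cannot be read is reported as `unavailable` rather than raising.

```python
import subprocess
from pathlib import Path


def run(cmd):
    """Return stripped stdout of cmd, or 'unavailable' if it fails or is missing."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=30, check=True)
    except (OSError, subprocess.SubprocessError):
        return "unavailable"
    return out.stdout.strip() or "unavailable"


def read_text(path):
    """Return stripped file contents, or 'unavailable' if the file cannot be read."""
    try:
        return Path(path).read_text().strip() or "unavailable"
    except OSError:
        return "unavailable"


def collect_debug_context(image_tag="unknown"):
    """Gather the five items to record before debugging."""
    return {
        # Assumption: the caller supplies the image/tag used to launch the container.
        "container_image": image_tag,
        # Assumption: ROCm stores its version string in this file.
        "rocm_version": read_text("/opt/rocm/.info/version"),
        # Assumption: rocminfo is on PATH; take the first gfx name it prints.
        "gpu_arch": run(["sh", "-c", "rocminfo | grep -o -m1 'gfx[0-9a-f]*'"]),
        "te_commit": run(["git", "rev-parse", "HEAD"]),
        "submodule_state": run(["git", "submodule", "status", "--recursive"]),
    }
```

Run it from the repository root inside the container and paste the resulting dictionary at the top of the debugging notes.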
- Shape: one core C++/HIP library (`libtransformer_engine.so`, sourced from `transformer_engine/common/`) plus optional per-framework bindings under `transformer_engine/{pytorch,jax}/`, with C++ extension code in a `csrc/` subdir.
- Python import flow: framework selection happens in `transformer_engine/__init__.py` (driven by `NVTE_FRAMEWORK`); native `.so` resolution happens in `transformer_engine/common/__init__.py`. Start tracing from these two files for any import or framework-selection issue; function names inside may change, but the entry points are stable.
- Build orchestration: `setup.py` + helpers in `build_tools/` + CMake. ROCm vs CUDA backend detection lives in `build_tools/utils.py`; grep there (and for `NVTE_USE_ROCM`) for current behavior.
- 3rdparty submodules: read `.gitmodules` for the current set and commit pins.
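As an illustration of reading the submodule set programmatically, here is a small sketch that parses `.gitmodules` text with the standard library. The function name is hypothetical. `.gitmodules` uses git's INI-like config syntax, so `configparser` handles it once leading indentation is stripped. The commit each submodule is checked out at can be printed separately with `git submodule status`.

```python
import configparser


def list_submodules(gitmodules_text):
    """Return {submodule_name: {"path": ..., "url": ...}} parsed from .gitmodules text."""
    # Keys in .gitmodules are indented; strip each line so configparser never
    # treats an indented key as a continuation of the previous value.
    cleaned = "\n".join(line.strip() for line in gitmodules_text.splitlines())
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cleaned)

    result = {}
    for section in parser.sections():
        # Section headers look like: submodule "some/path"
        name = section.partition(" ")[2].strip('"') or section
        result[name] = {
            "path": parser.get(section, "path", fallback=""),
            "url": parser.get(section, "url", fallback=""),
        }
    return result
```

Typical use is `list_submodules(open(".gitmodules").read())` from the repository root.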
The build auto-generates HIP files from CUDA sources via `hipify_torch`. Generated files are marked with `// !!! This is a file automatically generated by hipify!!!` at line 1. Never edit generated files directly; edit the CUDA source instead.
File extension mapping:
| CUDA source | Generated HIP file |
|---|---|
| `.cu` | `.hip` |
| `.cuh` | `_hip.cuh` |
| `.cpp` | `_hip.cpp` |
| `.h` | `_hip.h` |
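The extension mapping can be expressed as a small lookup, which helps when locating the generated counterpart of a CUDA source (or the reverse). This is a sketch of the naming rule in the table only; the function name and the example paths are hypothetical, and it does not call hipify itself.

```python
from pathlib import PurePosixPath

# Suffix of the CUDA source -> how the generated HIP file name is formed.
HIP_NAME_RULES = {
    ".cu": lambda stem: stem + ".hip",
    ".cuh": lambda stem: stem + "_hip.cuh",
    ".cpp": lambda stem: stem + "_hip.cpp",
    ".h": lambda stem: stem + "_hip.h",
}


def hipified_name(cuda_path):
    """Return the expected generated HIP path, or None if the suffix is not mapped."""
    path = PurePosixPath(cuda_path)
    rule = HIP_NAME_RULES.get(path.suffix)
    if rule is None:
        return None
    return str(path.with_name(rule(path.stem)))
```

For example, a hypothetical `common/foo/kernel.cu` maps to `common/foo/kernel.hip`, and `common/foo/util.cuh` maps to `common/foo/util_hip.cuh`.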
The following directories are excluded from hipify (native ROCm code; edit directly):
- `transformer_engine/common/ck_fused_attn/`: CK kernel wrappers
- `transformer_engine/common/amd_detail/`: AMD-specific utilities
- `transformer_engine/common/rocshmem_api/`: ROCshmem wrappers
Framework bindings (`pytorch/csrc`, `jax/csrc`) are hipified separately via `build_tools/pytorch.py` and `build_tools/jax.py`.
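Before editing a C++/HIP file, the two rules above (generated-file marker on line 1, and the native ROCm directories) can be checked mechanically. The sketch below is illustrative only; the function name and return labels are hypothetical, and paths are assumed to be repository-relative with forward slashes.

```python
HIPIFY_MARKER = "// !!! This is a file automatically generated by hipify!!!"

# Directories excluded from hipify: native ROCm code, edited directly.
NATIVE_ROCM_DIRS = (
    "transformer_engine/common/ck_fused_attn/",
    "transformer_engine/common/amd_detail/",
    "transformer_engine/common/rocshmem_api/",
)


def classify_for_editing(repo_relative_path, first_line):
    """Return 'native', 'generated', or 'source' for a file about to be edited."""
    if repo_relative_path.startswith(NATIVE_ROCM_DIRS):
        return "native"  # never hipified: edit this file directly
    if first_line.strip().startswith(HIPIFY_MARKER):
        return "generated"  # do not edit: change the CUDA source instead
    return "source"  # ordinary source file: edit it, hipify regenerates the rest
```

A pre-commit hook or review script could refuse changes to any file classified as `generated`.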
A file that contains only HIP code (no CUDA, no transitively-included CUDA headers) can be skipped by hipify in one of two ways:
- Explicit exclusion: add it to the ignores list in `do_hipify()` (`build_tools/hipify/hipify.py`). Best for subdirectories that are entirely HIP-only.
- Auto-detection: include `#include "hip/hip_runtime.h"` in the file (real, or commented out if not actually needed) and hipify will detect that no rewrite is required.
Backends are gated by env vars (set to 0 to disable, unset or 1 to enable):
| Env var | Controls | Default |
|---|---|---|
| `NVTE_FUSED_ATTN` | Master toggle for all fused attention | 1 |
| `NVTE_FUSED_ATTN_CK` | CK backend | inherits `NVTE_FUSED_ATTN` |
| `NVTE_FUSED_ATTN_AOTRITON` | AOTriton backend | inherits `NVTE_FUSED_ATTN` |
| `NVTE_FLASH_ATTN` | Flash attention | 1 |
CI backend configs (`ci/_utils.sh::configure_fused_attn_env`): `auto`, `ck`, `aotriton`, `flash`, `unfused`.
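To make the inheritance rule in the table concrete, the sketch below resolves which backends are enabled for a given environment. It is a reading of the table, not the library's dispatch code: the function names are hypothetical, and two points are assumptions, namely that the master toggle overrides a sub-backend explicitly set to 1, and that any value other than `0` counts as enabled.

```python
import os


def _flag(env, name, default):
    """Unset -> default; '0' -> disabled; anything else -> enabled (assumption)."""
    value = env.get(name)
    if value is None:
        return default
    return value != "0"


def enabled_backends(env=None):
    """Return which attention backends the env vars leave enabled."""
    env = os.environ if env is None else env
    fused = _flag(env, "NVTE_FUSED_ATTN", True)
    return {
        # Assumption: a disabled master toggle wins over the per-backend vars.
        "ck": fused and _flag(env, "NVTE_FUSED_ATTN_CK", fused),
        "aotriton": fused and _flag(env, "NVTE_FUSED_ATTN_AOTRITON", fused),
        "flash": _flag(env, "NVTE_FLASH_ATTN", True),
    }
```

Passing an explicit dictionary instead of the process environment makes it easy to reason about a configuration without exporting anything.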
- Runtime backend selection/dispatch: `transformer_engine/common/fused_attn_rocm/fused_attn.cpp` (hipified)
- CK dispatch glue: `transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp` (hipified)
- AOTriton dispatch glue: `transformer_engine/common/fused_attn_rocm/fused_attn_aotriton.cpp` (hipified)
- CK kernel wrappers (native, not hipified):
  - `transformer_engine/common/ck_fused_attn/src/ck_fused_attn_{fwd,bwd,utils}.cpp`
  - `transformer_engine/common/ck_fused_attn/include/ck_fused_attn/ck_fused_attn.hpp`
- `NVTE_DEBUG=1` + `NVTE_DEBUG_LEVEL={0,1,2}`: Python-level attention debug output
- `NVTE_LOG_FUSED_ATTN_CONFIG=1`: C++ backend selection logging
- `NVTE_LOG_CK_CONFIG=1`: CK-specific config logging
- `NVTE_LOG_AOTRITON_CONFIG=1`: AOTriton-specific config logging
- `CK_FUSED_ATTN_LOG_CONFIG=1`: CK kernel wrapper logging
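When chasing an attention issue it is often convenient to switch on every logging variable listed above at once. A minimal sketch follows; the helper name is hypothetical and it only builds an environment dictionary, leaving it to the caller to pass that to `subprocess.run(..., env=...)` or similar.

```python
import os


def attention_debug_env(level=2, base=None):
    """Return a copy of the environment with all attention logging enabled."""
    if level not in (0, 1, 2):
        raise ValueError("NVTE_DEBUG_LEVEL must be 0, 1, or 2")
    env = dict(os.environ if base is None else base)
    env.update(
        {
            "NVTE_DEBUG": "1",
            "NVTE_DEBUG_LEVEL": str(level),
            "NVTE_LOG_FUSED_ATTN_CONFIG": "1",
            "NVTE_LOG_CK_CONFIG": "1",
            "NVTE_LOG_AOTRITON_CONFIG": "1",
            "CK_FUSED_ATTN_LOG_CONFIG": "1",
        }
    )
    return env
```

Because the function copies rather than mutates the environment, the verbose settings stay scoped to the one command being investigated.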
- Always init submodules first: `git submodule update --init --recursive`.
- Source install: `pip install . --no-build-isolation`.
- C++ tests: `ci/core.sh`.
- Framework CI tests (shell scripts, not bare pytest):
  - PyTorch: `ci/pytorch.sh` | JAX: `ci/jax.sh`
  - Control via `TEST_LEVEL`, `TEST_SGPU`, `TEST_MGPU`, `TEST_FILTER` (from `ci/_utils.sh`).
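The build-and-test sequence above can be laid out as an ordered plan. This sketch only constructs the commands and does not run them; the function name is hypothetical, invoking the scripts through `bash` from the repository root is an assumption, and the `TEST_*` values are passed through unchanged because their accepted values are defined in `ci/_utils.sh`, not here. Applying the knobs only to the framework script is also an assumption.

```python
CI_SCRIPTS = {"pytorch": "ci/pytorch.sh", "jax": "ci/jax.sh"}
TEST_KNOBS = ("TEST_LEVEL", "TEST_SGPU", "TEST_MGPU", "TEST_FILTER")


def plan_commands(framework, **knobs):
    """Return [(env_overrides, argv), ...] in the order the steps should run."""
    unknown = set(knobs) - set(TEST_KNOBS)
    if unknown:
        raise ValueError(f"unknown test knobs: {sorted(unknown)}")
    if framework not in CI_SCRIPTS:
        raise ValueError(f"unknown framework: {framework}")
    knob_env = {name: str(value) for name, value in knobs.items()}
    return [
        ({}, ["git", "submodule", "update", "--init", "--recursive"]),
        ({}, ["pip", "install", ".", "--no-build-isolation"]),
        ({}, ["bash", "ci/core.sh"]),
        # Assumption: the TEST_* knobs are applied to the framework script only.
        (knob_env, ["bash", CI_SCRIPTS[framework]]),
    ]
```

Each step can then be executed inside the container with `subprocess.run(argv, env={**os.environ, **env_overrides}, check=True)`.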
- Edit `transformer_engine/*`, `build_tools/*`, `tests/*`, `ci/*`; avoid `3rdparty/*` unless explicitly required.
- Keep env-var behavior stable; tests toggle flags intentionally.
- Python: Black, line length 100, lint via `.pylintrc`. C/C++: cpplint + `.clang-format`.
- Preserve the existing style of each file you edit. Much of the codebase originates from upstream, and style can vary file-to-file (naming conventions, comment style, control flow patterns, etc.). Before writing new code in a file, read enough of it to understand how similar logic is already written, and follow that style. Consistency within a file matters more than imposing a uniform style across the project.
When you modify a file, update its copyright header so the end-year reflects the current year.
This repo carries two copyright lines — AMD and NVIDIA. Follow these rules:
- Files with an existing AMD copyright line: update the AMD end-year to the current year (e.g. `2025` → `2026`). Leave the NVIDIA line untouched.
- Files with only an NVIDIA copyright line: add an AMD line above the NVIDIA line:
  - Python: `# Copyright (c) <YEAR>, Advanced Micro Devices, Inc. All rights reserved.`
  - C/C++/HIP: `/* Copyright (c) <YEAR>, Advanced Micro Devices, Inc. All rights reserved. */` (or use the `*`-block style matching the file).
  - `<YEAR>` is the current year (single year) for newly-added lines, e.g. `2026`.
- New files you create: include both AMD and NVIDIA headers with the current year, followed by a blank comment line and `See LICENSE for license information.`
- Never change the NVIDIA copyright year range; those dates are updated during IFU (integrate from upstream) merges.
AMD headers are our addition and should stay consistent with the patterns already in the codebase.
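The first two header rules can be sketched for Python files as follows. This is an illustration under stated assumptions, not a project script: the function name is hypothetical, existing AMD lines are assumed to use a `YYYY` or `YYYY-YYYY` year field, a single-year line is assumed to become a `start-current` range when bumped, and only `#`-style comments are handled (C/C++/HIP headers would need the matching comment style).

```python
import re

# Matches the year field of an AMD copyright line: "2025" or "2022-2025".
AMD_LINE = re.compile(r"(Copyright \(c\) )(\d{4})(?:-\d{4})?(, Advanced Micro Devices, Inc\.)")


def update_amd_header(text, year):
    """Bump the AMD end-year, or add an AMD line above an NVIDIA-only header."""

    def bump(match):
        start = match.group(2)
        # Assumption: a single-year line becomes a range ending in `year`.
        years = start if int(start) >= year else f"{start}-{year}"
        return match.group(1) + years + match.group(3)

    updated, count = AMD_LINE.subn(bump, text, count=1)
    if count:
        return updated  # the NVIDIA line is never touched

    lines = text.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if "Copyright" in line and "NVIDIA" in line:
            amd = f"# Copyright (c) {year}, Advanced Micro Devices, Inc. All rights reserved.\n"
            lines.insert(index, amd)
            return "".join(lines)
    return text  # no recognizable header: leave the file for manual handling
```

In practice `year` would come from `datetime.date.today().year`; it is a parameter here so the behavior is deterministic.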
When writing or updating memories in the project memory directory, follow these guidelines:
- Scope: only save information that will be useful in future conversations. Do not save ephemeral task details, debugging breadcrumbs, or things derivable from the code/git history.
- Check before writing: read `MEMORY.md` and check for an existing memory on the same topic before creating a new file. Update the existing memory instead of duplicating.
- File naming: use short, descriptive, snake_case names (e.g. `aiter_build.md`, `container_setup.md`). Group by topic, not by date.
- Frontmatter: every memory file must have the standard `name`, `description`, and `type` frontmatter fields.
- Index maintenance: after creating or removing a memory file, update `MEMORY.md` to keep the index in sync. Each entry should be a single line under 150 characters.
- Staleness: memories are point-in-time observations. When recalling a memory, verify it against current code/state before acting on it. Update or delete memories that are no longer accurate.
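Two of the guidelines above are mechanically checkable: the required frontmatter fields and the index line length. The sketch below shows one way to check them. The function names are hypothetical, and the frontmatter layout (a leading block delimited by `---` lines containing `key: value` pairs) is an assumption about the memory file format.

```python
REQUIRED_FIELDS = ("name", "description", "type")
MAX_INDEX_LINE = 150  # index entries must be shorter than this


def parse_frontmatter(text):
    """Parse a leading '---' delimited block of 'key: value' lines (assumed layout)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat as missing frontmatter


def missing_fields(memory_text):
    """Return the required frontmatter fields that are absent or empty."""
    fields = parse_frontmatter(memory_text)
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]


def overlong_index_lines(index_text):
    """Return 1-based line numbers of index entries that are 150+ characters."""
    return [
        number
        for number, line in enumerate(index_text.splitlines(), start=1)
        if len(line) >= MAX_INDEX_LINE
    ]
```

Both checks return empty lists when a file complies, so they compose easily into a single pass over the memory directory.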
- Missing `.so` on import: check path resolution in `transformer_engine/common/__init__.py`.
- Framework extension won't build on ROCm: check `build_tools/utils.py::get_frameworks()`.
- Fused-attn regression: reproduce under multiple backend configs (`auto`, `ck`, `aotriton`, `unfused`).
- CK/AITER kernel failures: use the `ck-debugging` skill for structured triage and isolation.
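Reproducing a fused-attn regression under several backend configs can be scripted along these lines. Treat the mapping from config name to env vars as an assumption inferred from the env-var table; `ci/_utils.sh::configure_fused_attn_env` remains the authoritative definition. The function names are hypothetical, and `run_test` is any caller-supplied callable that runs the failing test with the given environment and returns its outcome.

```python
import os

ATTN_TOGGLES = (
    "NVTE_FUSED_ATTN",
    "NVTE_FUSED_ATTN_CK",
    "NVTE_FUSED_ATTN_AOTRITON",
    "NVTE_FLASH_ATTN",
)

# Assumed mapping; check ci/_utils.sh for the real definitions.
BACKEND_CONFIGS = {
    "auto": {},
    "ck": {"NVTE_FUSED_ATTN_AOTRITON": "0", "NVTE_FLASH_ATTN": "0"},
    "aotriton": {"NVTE_FUSED_ATTN_CK": "0", "NVTE_FLASH_ATTN": "0"},
    "unfused": {"NVTE_FUSED_ATTN": "0", "NVTE_FLASH_ATTN": "0"},
}


def env_for(config, base=None):
    """Return an environment with only the toggles for `config` applied."""
    env = dict(os.environ if base is None else base)
    # Drop inherited toggles so every config starts from the documented defaults.
    for name in ATTN_TOGGLES:
        env.pop(name, None)
    env.update(BACKEND_CONFIGS[config])
    return env


def run_under_configs(run_test, configs=("auto", "ck", "aotriton", "unfused")):
    """Call run_test(env) once per config and return {config: outcome}."""
    return {config: run_test(env_for(config)) for config in configs}
```

If the failure appears under `ck` but not `aotriton` or `unfused`, for instance, that points the investigation at the CK path rather than at shared dispatch logic.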