Agent instructions for TransformerEngine (ROCm fork)

Docker containers

  • We work in Docker containers for reproducibility.
  • Run build/test commands only inside the designated container (not on host).
  • If container is unspecified, ask for the exact image/tag and launch command before running anything expensive.
  • Prefer editable installs (pip install -e .).
  • Before debugging, record: container image/tag, ROCm version, GPU arch, TE commit, submodule state.
  • If results are suspicious, first verify you are in the expected container and that GPU devices/libs are exposed correctly.
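
The pre-debugging record described above could be captured with a small helper like the following. This is only a sketch: the function names are hypothetical, the /opt/rocm/.info/version path and the rocm_agent_enumerator tool are assumptions about a typical ROCm install, and the container image/tag has to be supplied by the caller because it is not reliably visible from inside the container.

```python
import subprocess
from pathlib import Path


def run(cmd):
    """Return the stripped stdout of cmd, or 'unavailable' if it cannot be run."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unavailable"


def record_environment(image="unspecified", repo="."):
    """Collect the facts worth writing down before a debugging session."""
    # Assumption: ROCm lives under /opt/rocm and ships a version file there.
    version_file = Path("/opt/rocm/.info/version")
    rocm = version_file.read_text().strip() if version_file.is_file() else "unavailable"
    return {
        "container_image": image,  # supplied by the caller, not detected
        "rocm_version": rocm,
        # Assumption: rocm_agent_enumerator is on PATH and prints gfx arch names.
        "gpu_arch": run(["rocm_agent_enumerator"]),
        "te_commit": run(["git", "-C", repo, "rev-parse", "HEAD"]),
        "submodules": run(["git", "-C", repo, "submodule", "status", "--recursive"]),
    }


if __name__ == "__main__":
    for key, value in record_environment().items():
        print(f"{key}: {value}")
```

Every field degrades to "unavailable" rather than raising, so the record can still be written when a tool is missing; a missing value is itself a hint that the container is not the expected one.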

Architecture

  • Shape: one core C++/HIP library (libtransformer_engine.so, sourced from transformer_engine/common/) plus optional per-framework bindings under transformer_engine/{pytorch,jax}/, with C++ extension code in a csrc/ subdir.
  • Python import flow: framework selection happens in transformer_engine/__init__.py (driven by NVTE_FRAMEWORK); native .so resolution happens in transformer_engine/common/__init__.py. Start tracing from these two files for any import or framework-selection issue — function names inside may change, but the entry points are stable.
  • Build orchestration: setup.py + helpers in build_tools/ + CMake. ROCm vs CUDA backend detection lives in build_tools/utils.py — grep there (and for NVTE_USE_ROCM) for current behavior.
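  • 3rdparty submodules: read .gitmodules for the current set of submodules and their URLs. The pinned commits are recorded in the repository tree rather than in .gitmodules, so use git submodule status to see them.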

Hipify convention

The build auto-generates HIP files from CUDA sources via hipify_torch. Generated files are marked with // !!! This is a file automatically generated by hipify!!! at line 1. Never edit generated files directly — edit the CUDA source instead.

File extension mapping:

| CUDA source | Generated HIP file |
|-------------|--------------------|
| .cu         | .hip               |
| .cuh        | _hip.cuh           |
| .cpp        | _hip.cpp           |
| .h          | _hip.h             |
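
To make the mapping and the generated-file marker concrete, here is a minimal sketch. The helper names hip_name and is_generated are hypothetical and are not part of build_tools; the sketch only restates the table and the marker comment, and it does not model cases where hipify leaves a file in place.

```python
import os

# Marker comment that hipify writes on line 1 of generated files.
HIPIFY_MARKER = "// !!! This is a file automatically generated by hipify!!!"

# CUDA source extension -> suffix of the generated HIP file (from the table).
SUFFIX_MAP = {
    ".cu": ".hip",
    ".cuh": "_hip.cuh",
    ".cpp": "_hip.cpp",
    ".h": "_hip.h",
}


def hip_name(cuda_path):
    """Return the generated HIP file name for a CUDA source path."""
    stem, ext = os.path.splitext(cuda_path)
    if ext not in SUFFIX_MAP:
        raise ValueError(f"no hipify mapping for extension {ext!r}")
    return stem + SUFFIX_MAP[ext]


def is_generated(first_line):
    """True if the first line of a file carries the hipify marker."""
    return first_line.strip().startswith(HIPIFY_MARKER)
```

A check like is_generated can guard an editing tool: if it returns True for the first line of a file, the edit belongs in the corresponding CUDA source instead.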

The following directories are excluded from hipify (native ROCm code — edit directly):

  • transformer_engine/common/ck_fused_attn/ — CK kernel wrappers
  • transformer_engine/common/amd_detail/ — AMD-specific utilities
  • transformer_engine/common/rocshmem_api/ — ROCshmem wrappers

Framework bindings (pytorch/csrc, jax/csrc) are hipified separately via build_tools/pytorch.py and build_tools/jax.py.

A file that contains only HIP code (no CUDA, no transitively-included CUDA headers) can be skipped by hipify in one of two ways:

  • Explicit exclusion: add it to the ignores list in do_hipify() (build_tools/hipify/hipify.py). Best for subdirectories that are entirely HIP-only.
  • Auto-detection: put #include "hip/hip_runtime.h" in the file, either as a real include or as a commented-out line if the header is not actually needed, and hipify will detect that no rewrite is required.

Fused attention backends

Backends are gated by env vars (set to 0 to disable, unset or 1 to enable):

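| Env var                  | Controls                              | Default                  |
|--------------------------|---------------------------------------|--------------------------|
| NVTE_FUSED_ATTN          | Master toggle for all fused attention | 1                        |
| NVTE_FUSED_ATTN_CK       | CK backend                            | inherits NVTE_FUSED_ATTN |
| NVTE_FUSED_ATTN_AOTRITON | AOTriton backend                      | inherits NVTE_FUSED_ATTN |
| NVTE_FLASH_ATTN          | Flash attention                       | 1                        |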
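
The inheritance rule in the table can be read as the following resolution logic. This is an illustrative Python sketch of the documented semantics, not the actual C++ selection code; enabled_backends is a hypothetical name, and two behaviors are assumptions: any value other than "0" counts as enabled, and the master toggle wins when a sub-backend is explicitly set to 1 while NVTE_FUSED_ATTN=0.

```python
def _flag(env, name, default):
    """Unset -> default; "0" -> disabled; anything else -> enabled (assumption)."""
    value = env.get(name)
    return default if value is None else value != "0"


def enabled_backends(env):
    """Resolve which attention backends the env vars leave enabled."""
    fused = _flag(env, "NVTE_FUSED_ATTN", True)
    return {
        "fused": fused,
        # Sub-backends default to the master toggle's value.
        # Assumption: the master toggle also overrides an explicit "1".
        "ck": fused and _flag(env, "NVTE_FUSED_ATTN_CK", fused),
        "aotriton": fused and _flag(env, "NVTE_FUSED_ATTN_AOTRITON", fused),
        "flash": _flag(env, "NVTE_FLASH_ATTN", True),
    }
```

Passing a plain dict (or os.environ) keeps the function easy to check by hand: an empty dict leaves every backend enabled, and {"NVTE_FUSED_ATTN": "0"} disables both CK and AOTriton while leaving flash attention on.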

CI backend configs (ci/_utils.sh::configure_fused_attn_env): auto, ck, aotriton, flash, unfused.
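
One plausible way to express those five configs in terms of the env vars above is sketched below. The mapping is an assumption derived from the env-var table, not copied from ci/_utils.sh, so check configure_fused_attn_env for the authoritative behavior; env_for_config and CONFIG_OVERRIDES are hypothetical names.

```python
ATTN_VARS = (
    "NVTE_FUSED_ATTN",
    "NVTE_FUSED_ATTN_CK",
    "NVTE_FUSED_ATTN_AOTRITON",
    "NVTE_FLASH_ATTN",
)

# Assumed mapping: each config disables everything except the named backend.
CONFIG_OVERRIDES = {
    "auto": {},
    "ck": {"NVTE_FLASH_ATTN": "0", "NVTE_FUSED_ATTN_AOTRITON": "0"},
    "aotriton": {"NVTE_FLASH_ATTN": "0", "NVTE_FUSED_ATTN_CK": "0"},
    "flash": {"NVTE_FUSED_ATTN": "0"},
    "unfused": {"NVTE_FLASH_ATTN": "0", "NVTE_FUSED_ATTN": "0"},
}


def env_for_config(name, base):
    """Return a copy of base with attention vars reset, then overridden."""
    if name not in CONFIG_OVERRIDES:
        raise ValueError(f"unknown backend config {name!r}")
    # Drop any inherited attention toggles so each config starts clean.
    env = {k: v for k, v in base.items() if k not in ATTN_VARS}
    env.update(CONFIG_OVERRIDES[name])
    return env
```

Clearing the inherited toggles first matters when looping over several configs in one shell session, since a leftover NVTE_FLASH_ATTN=0 from a previous run would otherwise leak into "auto".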

ROCm fused-attn file layout

  • Runtime backend selection/dispatch: transformer_engine/common/fused_attn_rocm/fused_attn.cpp (hipified)
  • CK dispatch glue: transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp (hipified)
  • AOTriton dispatch glue: transformer_engine/common/fused_attn_rocm/fused_attn_aotriton.cpp (hipified)
  • CK kernel wrappers (native, not hipified):
    • transformer_engine/common/ck_fused_attn/src/ck_fused_attn_{fwd,bwd,utils}.cpp
    • transformer_engine/common/ck_fused_attn/include/ck_fused_attn/ck_fused_attn.hpp

Debug logging env vars

  • NVTE_DEBUG=1 + NVTE_DEBUG_LEVEL={0,1,2} — Python-level attention debug output
  • NVTE_LOG_FUSED_ATTN_CONFIG=1 — C++ backend selection logging
  • NVTE_LOG_CK_CONFIG=1 — CK-specific config logging
  • NVTE_LOG_AOTRITON_CONFIG=1 — AOTriton-specific config logging
  • CK_FUSED_ATTN_LOG_CONFIG=1 — CK kernel wrapper logging
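
When chasing a backend-selection problem it can be convenient to switch all of these on at once. The helper below is a minimal sketch with a hypothetical name (debug_env); it only sets the variables listed above and makes no claim about what each one prints.

```python
def debug_env(base, level=2):
    """Return a copy of base with every attention debug/log variable enabled."""
    if level not in (0, 1, 2):
        raise ValueError("NVTE_DEBUG_LEVEL must be 0, 1 or 2")
    env = dict(base)  # never mutate the caller's environment
    env.update(
        {
            "NVTE_DEBUG": "1",
            "NVTE_DEBUG_LEVEL": str(level),
            "NVTE_LOG_FUSED_ATTN_CONFIG": "1",
            "NVTE_LOG_CK_CONFIG": "1",
            "NVTE_LOG_AOTRITON_CONFIG": "1",
            "CK_FUSED_ATTN_LOG_CONFIG": "1",
        }
    )
    return env
```

The result can be passed as the env argument of subprocess.run, for example subprocess.run(cmd, env=debug_env(os.environ)), so the extra logging applies to a single reproduction run instead of the whole shell.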

Developer workflows

  • Always init submodules first: git submodule update --init --recursive.
  • Source install: pip install . --no-build-isolation.
  • C++ tests: ci/core.sh.
  • Framework CI tests (shell scripts, not bare pytest):
    • PyTorch: ci/pytorch.sh | JAX: ci/jax.sh
    • Control via TEST_LEVEL, TEST_SGPU, TEST_MGPU, TEST_FILTER (from ci/_utils.sh).
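
A thin wrapper can keep CI invocations consistent and reject misspelled control variables. This is a sketch under assumptions: ci_command is a hypothetical helper, invoking the scripts through bash is assumed, and the accepted values for each TEST_* variable are defined in ci/_utils.sh and are not validated here.

```python
import os

CI_SCRIPTS = {"pytorch": "ci/pytorch.sh", "jax": "ci/jax.sh"}
CONTROL_VARS = {"TEST_LEVEL", "TEST_SGPU", "TEST_MGPU", "TEST_FILTER"}


def ci_command(framework, base_env=None, **controls):
    """Return (argv, env) for running a framework CI script."""
    if framework not in CI_SCRIPTS:
        raise ValueError(f"unknown framework {framework!r}")
    unknown = set(controls) - CONTROL_VARS
    if unknown:
        raise ValueError(f"unknown control variables: {sorted(unknown)}")
    env = dict(os.environ if base_env is None else base_env)
    # Values are passed through as strings; their meaning lives in ci/_utils.sh.
    env.update({name: str(value) for name, value in controls.items()})
    return ["bash", CI_SCRIPTS[framework]], env
```

The returned pair can be handed to subprocess.run(argv, env=env, check=True) from the repository root, inside the designated container.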

Code conventions

  • Edit transformer_engine/*, build_tools/*, tests/*, ci/*; avoid 3rdparty/* unless explicitly required.
  • Keep env-var behavior stable; tests toggle flags intentionally.
  • Python: Black, line length 100, lint via .pylintrc. C/C++: cpplint + .clang-format.
  • Preserve the existing style of each file you edit. Much of the codebase originates from upstream, and style can vary file-to-file (naming conventions, comment style, control flow patterns, etc.). Before writing new code in a file, read enough of it to understand how similar logic is already written, and follow that style. Consistency within a file matters more than imposing a uniform style across the project.

Copyright headers

When you modify a file, update its copyright header so the end-year reflects the current year.

This repo carries two copyright lines — AMD and NVIDIA. Follow these rules:

  1. Files with an existing AMD copyright line: update the AMD end-year to the current year (e.g. 2025 → 2026). Leave the NVIDIA line untouched.
  2. Files with only an NVIDIA copyright line — add an AMD line above the NVIDIA line:
    • Python: # Copyright (c) <YEAR>, Advanced Micro Devices, Inc. All rights reserved.
    • C/C++/HIP: /* Copyright (c) <YEAR>, Advanced Micro Devices, Inc. All rights reserved. */ (or use the *-block style matching the file).
    • <YEAR> is the current year (single year) for newly-added lines, e.g. 2026.
  3. New files you create — include both AMD and NVIDIA headers with the current year, followed by a blank comment line and See LICENSE for license information.
  4. Never change the NVIDIA copyright year range — those dates are updated during IFU (integrate from upstream) merges.

AMD headers are our addition and should stay consistent with the patterns already in the codebase.
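
Rules 1, 2 and 4 can be sketched as a small text transform. The function name update_copyright is hypothetical, and two details are assumptions rather than documented conventions: year ranges are written START-END with a hyphen, and a newly added AMD line reuses the comment prefix of the NVIDIA line it sits above. Rule 3 (new files) is not covered.

```python
import re

AMD_RE = re.compile(
    r"(Copyright \(c\) )(\d{4})(?:-(\d{4}))?(, Advanced Micro Devices)"
)


def update_copyright(text, year):
    """Apply the AMD header rules to a file's text; NVIDIA lines are untouched."""
    lines = text.splitlines(keepends=True)

    # Rule 1: an AMD line exists, so only its end-year changes.
    for i, line in enumerate(lines):
        m = AMD_RE.search(line)
        if m:
            start = m.group(2)
            # Assumption: ranges use a hyphen, and a single year grows into a range.
            years = start if int(start) == year else f"{start}-{year}"
            lines[i] = line[: m.start()] + m.group(1) + years + m.group(4) + line[m.end():]
            return "".join(lines)

    # Rule 2: only an NVIDIA line exists, so add an AMD line above it.
    for i, line in enumerate(lines):
        if "Copyright" in line and "NVIDIA" in line:
            prefix = line[: line.index("Copyright")]  # e.g. "# " or " * "
            suffix = " */" if line.rstrip().endswith("*/") else ""
            amd = f"Copyright (c) {year}, Advanced Micro Devices, Inc. All rights reserved."
            lines.insert(i, f"{prefix}{amd}{suffix}\n")
            return "".join(lines)

    return text  # no recognizable header; leave the file alone
```

Reusing the NVIDIA line's prefix is a simplification that keeps the new line valid in both Python comments and *-block C headers; files with unusual header layouts still need a manual check against neighboring files.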

Memory management

When writing or updating memories in the project memory directory, follow these guidelines:

  • Scope: only save information that will be useful in future conversations. Do not save ephemeral task details, debugging breadcrumbs, or things derivable from the code/git history.
  • Check before writing: read MEMORY.md and check for an existing memory on the same topic before creating a new file. Update the existing memory instead of duplicating.
  • File naming: use short, descriptive, snake_case names (e.g. aiter_build.md, container_setup.md). Group by topic, not by date.
  • Frontmatter: every memory file must have the standard name, description, and type frontmatter fields.
  • Index maintenance: after creating or removing a memory file, update MEMORY.md to keep the index in sync. Each entry should be a single line under 150 characters.
  • Staleness: memories are point-in-time observations. When recalling a memory, verify it against current code/state before acting on it. Update or delete memories that are no longer accurate.

Troubleshooting pointers

  • Missing .so on import: check path resolution in transformer_engine/common/__init__.py.
  • Framework extension won't build on ROCm: check build_tools/utils.py::get_frameworks().
  • Fused-attn regression: reproduce under multiple backend configs (auto, ck, aotriton, unfused).
  • CK/AITER kernel failures: use the ck-debugging skill for structured triage and isolation.