Architecture
This document describes FlagTree's architecture, how it extends Triton, and the code organization.
FlagTree follows Triton's existing compilation architecture and extends it without modifying the core Triton codebase. Triton's compilation system consists of three main modules that work together to transform Python kernel code into executable GPU code. These modules handle different aspects of the compilation process: understanding your code (AST Processing), compiling it for the target hardware (Backend Compilation), and managing when and how it runs (Runtime System).
- **AST Processing** (`python/triton/compiler/`, `python/triton/language/`): Converts Python kernel code to Triton IR (TTIR) in MLIR format. This module consists of:
  - **Language Definition** (`python/triton/language/`): Defines Triton's language constructs, including core types (`core.py`), standard operations (`standard.py`), math functions (`math.py`), and semantic rules (`semantic.py`). These provide the building blocks that kernels use.
  - **Code Generation** (`python/triton/compiler/code_generator.py`): Transforms the Python AST into TTIR operations by recognizing Triton language constructs (defined in `language/`) and generating the initial intermediate representation.
  - **Compiler Coordination** (`python/triton/compiler/compiler.py`): Orchestrates the compilation process, managing AST sources and IR sources and coordinating with backends.
- **Backend Compilation** (`third_party/[backend]/backend/compiler.py`): Each backend defines its compilation pipeline through the `add_stages()` method, which specifies how to transform TTIR into executable code. The typical flow is: TTIR → TTGPU IR / Linalg IR → LLVM IR → target assembly → binary. Each backend implements these stages with hardware-specific optimizations and code generation.
- **Runtime System** (`python/triton/runtime/`): Handles JIT compilation, kernel caching, and kernel launch. The runtime compiles kernels on demand when first called, caches compiled results to avoid recompilation, and manages kernel execution on the GPU through backend drivers.
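To make the staged pipeline concrete, here is a minimal sketch of how an `add_stages()`-style backend could populate an ordered stage table that the compiler then runs in sequence. The `ToyBackend` class, stage names, and toy string transforms are illustrative assumptions, not FlagTree's actual code:

```python
# Hypothetical sketch of a backend compilation pipeline in the style of
# Triton's add_stages(); the stage names and transforms are toys.

class ToyBackend:
    def add_stages(self, stages, options):
        # Each stage maps the previous representation to the next one.
        stages["ttir"] = lambda src: f"ttir({src})"
        stages["llir"] = lambda src: f"llir({src})"
        stages["asm"] = lambda src: f"asm({src})"

def compile_kernel(src, backend, options=None):
    """Run the source through every registered stage in order."""
    stages = {}
    backend.add_stages(stages, options)
    for name, stage in stages.items():  # dicts preserve insertion order
        src = stage(src)
    return src

print(compile_kernel("kernel", ToyBackend()))  # asm(llir(ttir(kernel)))
```

The point of the dictionary-of-stages design is that each backend decides both which stages exist and in what order they run, without the core compiler knowing anything hardware-specific.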
FlagTree mainly extends Triton in two areas: backend support and compiler optimizations.
Backend Extensions: FlagTree follows a plugin-based architecture where each backend is self-contained in third_party/[backend_name]/. Each backend implements the BaseBackend interface, defines its compilation pipeline through add_stages(), and provides backend-specific optimizations and code generation. This design allows adding new backends without modifying core Triton code.
Language & Compiler Optimization Extensions: For existing Triton code, FlagTree uses incremental extensions for full compatibility with native Triton. Key modifications include:
- Adding a `flagtree_hints` parameter to `load()` in `core.py` and `semantic.py`
- Parsing `#@hint:` comments in `jit.py` and extracting hints in `code_generator.py`
- Dispatching to TLE modules via the `module_map` mechanism in `code_generator.py`
- Integrating TLE and HINTS passes in backend compilation pipelines
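The `module_map` dispatch idea can be illustrated with a toy resolver; the real mechanism lives in `code_generator.py`, and the resolver, `FakeTLE` module, and function names below are assumptions made for the sketch:

```python
# Toy sketch of module_map-style dispatch: the code generator looks up
# which implementation module handles a given language namespace.
# Names and structure are illustrative, not FlagTree's actual code.

def make_resolver(module_map):
    def resolve(qualified_name):
        """Map e.g. 'tle.load' to the handler registered for 'tle'."""
        namespace, _, op = qualified_name.partition(".")
        module = module_map.get(namespace)
        if module is None:
            raise KeyError(f"no module registered for {namespace!r}")
        return getattr(module, op)

    return resolve

class FakeTLE:
    @staticmethod
    def load(ptr):
        return f"tle_load({ptr})"

resolve = make_resolver({"tle": FakeTLE})
print(resolve("tle.load")("p0"))  # tle_load(p0)
```

Routing through a registration table like this keeps the core code generator unaware of which extension modules exist; new namespaces only register themselves in the map.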
FlagTree maximizes separation between TLE language extensions (python/triton/experimental/tle/) and TLE MLIR dialect (third_party/tle/). TLE language constructs are defined in Python and integrated through the AST processing pipeline, while TLE dialect operations are implemented in C++/MLIR and integrated through the compilation pipeline. This separation allows language features and IR transformations to evolve independently, improving maintainability and enabling different backends to adopt TLE features at different stages of the compilation pipeline.
The following code structure shows how FlagTree organizes its extensions:
```
flagtree/
├── python/
│   ├── triton/                      # Triton core (existing)
│   │   ├── compiler/
│   │   │   ├── code_generator.py    # [EXTENDED] TLE module dispatch, HINTS extraction
│   │   │   └── compiler.py
│   │   ├── language/
│   │   │   ├── core.py              # [EXTENDED] HINTS
│   │   │   └── semantic.py          # [EXTENDED] HINTS
│   │   ├── runtime/
│   │   │   └── jit.py               # [EXTENDED] HINTS
│   │   └── experimental/            # Language extensions
│   │       └── tle/                 # TLE (Triton Language Extensions)
│   │           ├── language/        # TLE language definition (extends AST Processing)
│   │           │   ├── core.py      # TLE-Lite: Core TLE language features (e.g., tle.load)
│   │           │   ├── [gpu/npu]/   # TLE-Struct: GPU-specific and NPU-specific constructs
│   │           │   │   ├── core.py
│   │           │   │   ├── semantic.py
│   │           │   │   └── types.py
│   │           │   └── raw/         # TLE-Raw: Raw MLIR programming interface
│   │           │       ├── core.py
│   │           │       └── semantic.py
│   │           └── raw/             # TLE-Raw implementation (extends AST Processing)
│   │               ├── mlir/        # MLIR code generation for TLE-Raw
│   │               │   ├── codegen.py
│   │               │   └── runtime.py
│   │               └── runtime.py   # Runtime support for TLE-Raw
│   ├── tutorials/                   # Tutorials and examples (not in code path)
│   │   └── tle/                     # TLE examples
│   │       ├── 01-sparse-mla.py
│   │       └── raw/                 # TLE-Raw examples
│   └── test/                        # Test code (not in code path)
│       └── tle/                     # TLE tests
│           ├── integration/         # Integration tests
│           ├── unit/                # Unit tests
│           └── run_tests.py
│
└── third_party/
    ├── [backend_name]/              # Backend-specific extensions
    │   ├── backend/
    │   │   └── compiler.py          # [EXTENDED] TLE and HINTS pass dispatch
    │   ├── include/                 # TTIR dialect definitions (may extend tt.load with attributes)
    │   └── lib/
    ├── tle/                         # TLE MLIR extensions
    │   └── dialect/                 # TLE dialect implementation
    │       ├── include/IR/
    │       ├── lib/IR/
    │       ├── lib/Conversion/
    │       └── lib/Transforms/
    └── flir/                        # FLIR: FlagTree-maintained common Linalg, including HINTS pass
```
Note: The code structure above shows only the key files and directories relevant to FlagTree's extensions. Many other files and subdirectories in the codebase are omitted for clarity.
Please follow the step-by-step tutorial for detailed instructions on adding a new backend.
TLE Architecture
- **Purpose & Scope**
  - Extends Triton with explicit shared/tensor memory management, async/TMA data movement, and pipeline control, optimized for NVIDIA Hopper-class GPUs for now (README.md).
  - Frontend APIs live under `tle` and lower into a custom MLIR dialect plus passes under `tle`.
- **Frontend DSL Layer (Python)**
  - `tle.language.core` overrides key `tl` builtins such as `load`, `alloc`, `copy`, `local_load`, `local_store`, and loop helpers to attach extra attributes (e.g., `"tt.load.async"`) and create `buffered_tensor` handles representing shared/tensor memory allocations (core.py).
  - GPU-specific helpers in `gpu` define layouts (`swizzled_shared_layout`, `nv_mma_shared_layout`, etc.), scopes (`smem`, `tmem`), and `buffered_tensor` semantics that wrap IR memdesc types while keeping Triton-style type checking.
  - Users import these symbols (e.g., `tle.alloc`, `tle.copy`, `tle.pipeline`) inside `@triton.jit` kernels to allocate SMEM tiles, launch async copies, or orchestrate staged loops.
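The attribute-attaching pattern described above can be sketched in plain Python; the `Op` record and wrapper below are assumptions standing in for the real IR builder, not FlagTree's implementation:

```python
# Hypothetical sketch: a DSL-level load() that tags the emitted op with an
# extra attribute, mimicking how a tle.load-style builtin marks loads for
# later lowering passes. The Op record and defaults are assumptions.

from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    operands: tuple
    attrs: dict = field(default_factory=dict)

def tle_load(ptr, *, is_async=True):
    """Emit a load op, optionally tagged for async lowering."""
    op = Op("tt.load", (ptr,))
    if is_async:
        # A later pass (cf. Lower Async Load below) looks for this tag.
        op.attrs["tt.load.async"] = True
    return op

print(tle_load("ptr0").attrs)  # {'tt.load.async': True}
```

Keeping the extension as an attribute on a standard op (rather than a new op) is what lets unmodified Triton passes ignore it until a TLE-aware pass consumes it.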
- **Semantic Validation**
  - `TLESemantic` in semantic.py runs alongside Triton's semantic layer. It validates shapes, dtypes, and copy compatibility before lowering, providing early error messages and adapting constexpr inputs.
  - Semantic helpers call into custom builder hooks (exposed via the C++ bridge) to emit `LocalAllocOp`, `TMACopyOp`, etc., ensuring Python APIs map 1:1 to TTIR constructs.
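The kind of early check `TLESemantic` performs can be sketched with a toy validator; the function name and error wording are assumptions, not the real semantic layer:

```python
# Illustrative sketch of TLESemantic-style early validation: verify that a
# copy's source and destination agree on shape and dtype *before* lowering,
# so the user sees a clear message instead of a deep MLIR failure.

def validate_copy(src_shape, src_dtype, dst_shape, dst_dtype):
    """Raise a descriptive error for incompatible copies."""
    if src_shape != dst_shape:
        raise ValueError(f"copy shape mismatch: {src_shape} vs {dst_shape}")
    if src_dtype != dst_dtype:
        raise ValueError(f"copy dtype mismatch: {src_dtype} vs {dst_dtype}")

validate_copy((128, 64), "fp16", (128, 64), "fp16")  # ok, no exception
try:
    validate_copy((128, 64), "fp16", (64, 128), "fp16")
except ValueError as e:
    print(e)  # copy shape mismatch: (128, 64) vs (64, 128)
```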
- **Raw/EDSL Layer**
  - raw exposes a lightweight MLIR-based eDSL for writing dialect-specific intrinsics directly. Decorators like `@dialect(name="mlir")` build LLVM IR from Python ASTs via `EdslMLIRJITFunction`, enabling backend authors to prototype kernels or helper ops outside the high-level Triton syntax.
  - The raw runtime (`call()` helper) materializes `tle::DSLRegionOp` nodes whose bodies are later inlined by passes.
- **C++ Bridge & Dialect**
  - triton_tle.cc registers additional builder methods (creating encoding attributes, memdesc types, TMACopy ops, DSL regions) onto Triton's `TritonOpBuilder`, and wires new passes plus raw IR helpers into Python via pybind11.
  - The MLIR dialect lives under dialect, with IR definitions plus Analysis/Conversion/Transforms infrastructure mirroring upstream Triton conventions.
- **Pass & Lowering Pipeline**
  - Pass registrations are defined in Passes.td and surfaced to Python (`add_early_assign_memory_space`, `add_lower_async_load`, `add_lower_tma_copy`, `add_tle_convert_arg_to_memdesc`, `add_tle_dsl_region_inline`).
  - Key transformations:
    - **Early Assign Memory Space** rewrites tensors tagged with `tt.memory_space="shared_memory"` into explicit local alloc/store sequences and removes the attribute so later passes see concrete SMEM ops (TleEarlyAssignMemorySpace.cpp).
    - **Lower Async Load** looks for loads marked with `"tt.load.async"` (set by `tle.load`) and converts them into Hopper-style async copy + commit/wait chains feeding `LocalLoadOp`s, deduplicating redundant allocs (TleLowerAsyncLoad.cpp).
    - **Lower TMA Copy** lowers the high-level `TMACopyOp` (emitted by `tle.copy` with tensor descriptors) into NVIDIA TMA intrinsics, handling both GM→SMEM and SMEM→GM directions with barrier management (TleLowerTmaCopy.cpp).
    - **Convert Arg To MemDesc** materializes memdesc-compatible operands/results inside DSL regions, inserting temporary local alloc/load sequences so generic Triton passes can reason about them (ConvertArgToMemDesc.cpp).
    - **DSL Region Inline** splices `tle::DSLRegionOp` bodies back into surrounding CFG blocks, replacing yields with branches once raw kernels are lowered (DSLRegionInline.cpp).
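An attribute-driven rewrite of this kind can be modeled with a toy pass over a list-of-dicts "IR". This is only a sketch of the pattern the async-load lowering follows; the replacement op names and single-buffer expansion are assumptions, and the real C++ pass is far more involved (allocation dedup, barrier placement, layouts):

```python
# Toy model of an attribute-driven rewrite pass: replace each load tagged
# "tt.load.async" with an async-copy/commit/wait/local-load sequence.
# The dict-based IR and op names here are illustrative assumptions.

def lower_async_loads(ops):
    lowered = []
    for op in ops:
        if op["name"] == "tt.load" and op.get("attrs", {}).get("tt.load.async"):
            ptr = op["operands"][0]
            # Expand the tagged load into an async chain.
            lowered += [
                {"name": "ttg.async_copy_global_to_local", "operands": [ptr]},
                {"name": "ttg.async_commit_group", "operands": []},
                {"name": "ttg.async_wait", "operands": []},
                {"name": "ttg.local_load", "operands": [ptr]},
            ]
        else:
            lowered.append(op)  # untagged ops pass through unchanged
    return lowered

ir = [{"name": "tt.load", "operands": ["p0"], "attrs": {"tt.load.async": True}}]
print([op["name"] for op in lower_async_loads(ir)])
```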
- **Backend Distribution**
  - Backend-specific logic currently targets NVIDIA (see nvidia and the use of `triton::nvidia_gpu` intrinsics inside passes). Other hardware backends can plug in by reusing the raw DSL + pass hooks and implementing their own lowering passes/encodings under `third_party/<backend>/backend/compiler.py`, similar to how HINTS are dispatched.
  - Pass wrappers exported from triton_tle.cc let each backend opt into only the passes it supports when assembling its pipeline (e.g., NVIDIA enabling TMA lowering while another backend might stop after memory-space tagging).
- **Testing & Examples**
  - Integration tests under tle (mentioned in the README) cover end-to-end kernels for pipeline loops, GEMM, and TMA copies, ensuring the Python APIs, semantic checks, and passes stay aligned.
  - Developers can run `python python/test/tle/run_tests.py` after modifying either the Python DSL or the MLIR passes to catch regressions quickly.
- **Extending TLE**
  - New APIs should mirror the established pattern: add Python surface ops (with semantic validation) → expose the necessary builder hooks → create/extend dialect ops → add lowering passes and register them for backends.
  - Keep layout/scope abstractions centralized in types.py so future hardware (e.g., tensor memory) can be toggled without touching user code, and document any new passes in Passes.td to keep the wiki aligned.
HINTS extends TTIR operations with attributes to enable hardware-aware optimizations. The implementation involves AST processing, TTIR attribute encoding, and backend pass distribution.
AST Processing: HINTS are processed in two stages:
- **Parsing** (`python/triton/runtime/jit.py`): The `parse()` method uses Python's `tokenize` module to scan source code for `#@hint:` comments, extracts hint names, and maps them to line numbers. These hints are stored in a `line_flagtree_hints` dictionary and attached to the AST function definition node.
- **Create Op** (`python/triton/compiler/code_generator.py`, `python/triton/language/core.py`, `python/triton/language/semantic.py`): During code generation, when encountering `tl.load` calls, the code generator retrieves hints from the line-number mapping and passes them as the `flagtree_hints` parameter to `load()`. The semantic layer then forwards this parameter to the builder's `create_load()` method, which encodes hints as TTIR operation attributes.
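The parsing stage can be sketched with Python's `tokenize` module. This toy extractor maps `#@hint:` comments to their line numbers; the exact behavior of the real `parse()` in `jit.py` may differ:

```python
# Minimal sketch of #@hint: comment extraction with the tokenize module,
# in the spirit of jit.py's parse(); details of FlagTree's actual
# implementation may differ.

import io
import tokenize

def extract_hints(source):
    """Return {line_number: hint_name} for every #@hint: comment."""
    hints = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and tok.string.startswith("#@hint:"):
            hints[tok.start[0]] = tok.string[len("#@hint:"):].strip()
    return hints

src = """\
def kernel(ptr):
    x = tl.load(ptr)  #@hint: shared_memory
    return x
"""
print(extract_hints(src))  # {2: 'shared_memory'}
```

Using the tokenizer rather than a regex means hints inside strings are never misread as comments, and each hint comes with an exact line number to match against the later `tl.load` call sites.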
TTIR Attribute Extension: HINTS are encoded as attributes on TTIR operations (e.g., tt.load operations carry hint attributes), enabling mid-end and backend passes to access and process them.
Backend Pass Distribution: HINTS processing passes are dispatched in backend compilers (e.g., third_party/[backend_name]/backend/compiler.py). Each backend registers appropriate passes based on the hints it supports (e.g., add_process_shared_memory_hint() for NVIDIA backend).
Pass Implementation Locations: HINTS processing passes are implemented in:
- **Backend-specific folders**: Each backend may implement hint-specific passes in its own directory (e.g., `third_party/nvidia/`)
- **Linalg/FLIR folders**: Common Linalg passes that process hints during structured-to-memref conversions
- **TLE folders**: TLE-related passes that may interact with hints during transformations
For more information, please refer to the HINTS wiki.
TBD