Architecture

This document describes FlagTree's architecture, how it extends Triton, and the code organization.

Brief overview of Triton

FlagTree follows Triton's existing compilation architecture and extends it without modifying the core Triton codebase. Triton's compilation system consists of three main modules that work together to transform Python kernel code into executable GPU code. These modules handle different aspects of the compilation process: understanding your code (AST Processing), compiling it for the target hardware (Backend Compilation), and managing when and how it runs (Runtime System).

  • AST Processing (python/triton/compiler/, python/triton/language/): Converts Python kernel code to Triton IR (TTIR) in MLIR format. This module consists of:

    • Language Definition (python/triton/language/): Defines Triton's language constructs including core types (core.py), standard operations (standard.py), math functions (math.py), and semantic rules (semantic.py). These provide the building blocks that kernels use.
    • Code Generation (python/triton/compiler/code_generator.py): Transforms Python AST into TTIR operations by recognizing Triton language constructs (defined in language/) and generating the initial intermediate representation.
    • Compiler Coordination (python/triton/compiler/compiler.py): Orchestrates the compilation process, managing AST sources, IR sources, and coordinating with backends.
  • Backend Compilation (third_party/[backend]/backend/compiler.py): Each backend defines its compilation pipeline through the add_stages() method, which specifies how to transform TTIR into executable code. The typical flow is: TTIR → TTGPU IR / Linalg IR → LLVM IR → target assembly → binary. Each backend implements these stages with hardware-specific optimizations and code generation.

  • Runtime System (python/triton/runtime/): Handles JIT compilation, kernel caching, and kernel launch. The runtime compiles kernels on-demand when first called, caches compiled results to avoid recompilation, and manages kernel execution on the GPU through backend drivers.
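
To make this pipeline concrete, the minimal (standard, upstream-Triton) vector-add kernel below touches all three modules: the first launch runs AST processing and the backend's add_stages() pipeline, and the runtime caches the resulting binary so later launches skip recompilation.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # AST Processing converts this Python function into TTIR on first launch.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")
out = torch.empty_like(x)
# First launch: JIT compilation through the backend pipeline, then caching.
# Subsequent launches with the same specialization reuse the cached binary.
add_kernel[(triton.cdiv(1024, 256),)](x, y, out, 1024, BLOCK=256)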

FlagTree Architecture

FlagTree extends Triton mainly in two areas: backend support and compiler optimizations.

Backend Extensions: FlagTree follows a plugin-based architecture where each backend is self-contained in third_party/[backend_name]/. Each backend implements the BaseBackend interface, defines its compilation pipeline through add_stages(), and provides backend-specific optimizations and code generation. This design allows adding new backends without modifying core Triton code.
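
As a rough sketch of that pattern, a new backend's compiler.py registers its stages along the lines below. This is a simplified illustration, not a complete backend: only BaseBackend and add_stages() come from the interface described here, and the make_* helpers are hypothetical placeholders for backend-specific lowering.

from triton.backends.compiler import BaseBackend

class MyBackend(BaseBackend):   # illustrative skeleton, not a working backend
    def add_stages(self, stages, options):
        # Each entry lowers the previous stage's output one step further,
        # following the TTIR -> LLVM IR -> binary flow described above.
        stages["ttir"] = lambda src, metadata: self.make_ttir(src, metadata, options)
        stages["llir"] = lambda src, metadata: self.make_llir(src, metadata, options)
        stages["bin"] = lambda src, metadata: self.make_binary(src, metadata, options)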

Language & Compiler Optimization Extensions: FlagTree makes incremental extensions to existing Triton code, preserving full compatibility with native Triton. Key modifications, illustrated by the kernel sketch after this list, include:

  • Adding flagtree_hints parameter to load() in core.py and semantic.py
  • Parsing #@hint: comments in jit.py and extracting hints in code_generator.py
  • Dispatching to TLE modules via module_map mechanism in code_generator.py
  • Integrating TLE and HINTS passes in backend compilation pipelines
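
The kernel-level sketch below shows how these pieces line up. The #@hint: comment syntax and the flagtree_hints plumbing are as described above; the hint name shared_memory is only an illustration (inferred from the NVIDIA add_process_shared_memory_hint() pass mentioned later), and the exact hint names and placement rules are documented in the HINTS wiki.

import triton
import triton.language as tl

@triton.jit
def hinted_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    # jit.py records "#@hint:" comments with their line numbers; code_generator.py
    # forwards the matching hints to tl.load() as flagtree_hints, and create_load()
    # encodes them as attributes on the resulting tt.load operation.
    #@hint: shared_memory
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * 2, mask=mask)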

FlagTree maximizes separation between TLE language extensions (python/triton/experimental/tle/) and TLE MLIR dialect (third_party/tle/). TLE language constructs are defined in Python and integrated through the AST processing pipeline, while TLE dialect operations are implemented in C++/MLIR and integrated through the compilation pipeline. This separation allows language features and IR transformations to evolve independently, improving maintainability and enabling different backends to adopt TLE features at different stages of the compilation pipeline.

The following code structure shows how FlagTree organizes its extensions:

flagtree/
├── python/
│   ├── triton/                # Triton core (existing)
│   │   ├── compiler/              
│   │   │   ├── code_generator.py  # [EXTENDED] TLE module dispatch, HINTS extraction
│   │   │   └── compiler.py       
│   │   ├── language/              
│   │   │   ├── core.py            # [EXTENDED] HINTS
│   │   │   └── semantic.py        # [EXTENDED] HINTS
│   │   ├── runtime/               
│   │   │   └── jit.py             # [EXTENDED] HINTS
│   │   └── experimental/          # Language extensions
│   │       └── tle/               # TLE (Triton Language Extensions)
│   │           ├── language/      # TLE language definition (extends AST Processing)
│   │           │   ├── core.py    # TLE-Lite: Core TLE language features (e.g., tle.load)
│   │           │   ├── [gpu/npu]/ # TLE-Struct: GPU-specific and NPU-specific constructs
│   │           │   │   ├── core.py    
│   │           │   │   ├── semantic.py 
│   │           │   │   └── types.py   
│   │           │   └── raw/       # TLE-Raw: Raw MLIR programming interface
│   │           │       ├── core.py    
│   │           │       └── semantic.py
│   │           └── raw/           # TLE-Raw implementation (extends AST Processing)
│   │               ├── mlir/      # MLIR code generation for TLE-Raw
│   │               │   ├── codegen.py  
│   │               │   └── runtime.py 
│   │               └── runtime.py # Runtime support for TLE-Raw
│   ├── tutorials/             # Tutorials and examples (not in code path)
│   │   └── tle/                # TLE examples
│   │       ├── 01-sparse-mla.py
│   │       └── raw/            # TLE-Raw examples
│   └── test/                   # Test code (not in code path)
│       └── tle/                # TLE tests
│           ├── integration/    # Integration tests
│           ├── unit/           # Unit tests
│           └── run_tests.py
│
└── third_party/              
    ├── [backend_name]/       # Backend-specific extensions
    │   ├── backend/           
    │   │   └── compiler.py    # [EXTENDED] TLE and HINTS pass dispatch
    │   ├── include/           # TTIR dialect definitions (may extend tt.load with attributes)
    │   └── lib/                            
    ├── tle/                  # TLE MLIR extensions
    │   └── dialect/           # TLE dialect implementation
    │       ├── include/IR/    
    │       ├── lib/IR/       
    │       ├── lib/Conversion/  
    │       └── lib/Transforms/
    └── flir/                 # FLIR: FlagTree-maintained common Linalg, including HINTS pass

Note: The code structure above shows only the key files and directories relevant to FlagTree's extensions. Many other files and subdirectories in the codebase are omitted for clarity.

Backend Extensions

Please follow the step-by-step tutorial for detailed instructions on adding a new backend.

Language & Compiler Optimization Extensions

TLE (Triton Language Extensions)

TLE Architecture

  • Purpose & Scope

    • Extends Triton with explicit shared/tensor memory management, async/TMA data movement, and pipeline control, currently optimized for NVIDIA Hopper-class GPUs (see README.md).
    • Frontend APIs live under python/triton/experimental/tle/ and lower into a custom MLIR dialect and passes under third_party/tle/.
  • Frontend DSL Layer (Python)

    • tle.language.core overrides key tl builtins such as load, alloc, copy, local_load, local_store, and loop helpers to attach extra attributes (e.g., "tt.load.async") and create buffered_tensor handles representing shared/tensor memory allocations (core.py).
    • GPU-specific helpers in gpu define layouts (swizzled_shared_layout, nv_mma_shared_layout, etc.), scopes (smem, tmem), and buffered_tensor semantics that wrap IR memdesc types while keeping Triton-style type checking.
    • Users import these symbols (e.g., tle.alloc, tle.copy, tle.pipeline) inside @triton.jit kernels to allocate SMEM tiles, launch async copies, or orchestrate staged loops.
  • Semantic Validation

    • TLESemantic in semantic.py runs alongside Triton’s semantic layer. It validates shapes, dtypes, and copy compatibility before lowering, providing early error messages and adapting constexpr inputs.
    • Semantic helpers call into custom builder hooks (exposed via the C++ bridge) to emit LocalAllocOp, TMACopyOp, etc., ensuring Python APIs map 1:1 to TTIR constructs.
  • Raw/EDSL Layer

    • raw exposes a lightweight MLIR-based eDSL for writing dialect-specific intrinsics directly. Decorators like @dialect(name="mlir") build LLVM IR from Python ASTs via EdslMLIRJITFunction, enabling backend authors to prototype kernels or helper ops outside the high-level Triton syntax.
    • The raw runtime (call() helper) materializes tle::DSLRegionOp nodes whose bodies are later inlined by passes.
  • C++ Bridge & Dialect

    • triton_tle.cc registers additional builder methods (creating encoding attributes, memdesc types, TMACopy ops, DSL regions) onto Triton’s TritonOpBuilder, and wires new passes plus raw IR helpers into Python via pybind11.
    • The MLIR dialect lives under third_party/tle/dialect/ with IR definitions plus Analysis/Conversion/Transforms infrastructure mirroring upstream Triton conventions.
  • Pass & Lowering Pipeline

    • Pass registrations are defined in Passes.td and surfaced to Python (add_early_assign_memory_space, add_lower_async_load, add_lower_tma_copy, add_tle_convert_arg_to_memdesc, add_tle_dsl_region_inline).
    • Key transformations:
      • Early Assign Memory Space rewrites tensors tagged with tt.memory_space="shared_memory" into explicit local alloc/store sequences and removes the attribute so later passes see concrete SMEM ops (TleEarlyAssignMemorySpace.cpp).
      • Lower Async Load looks for loads marked with "tt.load.async" (set by tle.load) and converts them into Hopper-style async copy + commit/wait chains feeding LocalLoadOps, deduplicating redundant allocs (TleLowerAsyncLoad.cpp).
      • Lower TMA Copy lowers high-level TMACopyOp (emitted by tle.copy with tensor descriptors) into NVIDIA TMA intrinsics, handling both GM→SMEM and SMEM→GM directions with barrier management (TleLowerTmaCopy.cpp).
      • Convert Arg To MemDesc materializes memdesc-compatible operands/results inside DSL regions, inserting temporary local alloc/load sequences so generic Triton passes can reason about them (ConvertArgToMemDesc.cpp).
      • DSL Region Inline splices tle::DSLRegionOp bodies back into surrounding CFG blocks, replacing yields with branches once raw kernels are lowered (DSLRegionInline.cpp).
  • Backend Distribution

    • Backend-specific logic currently targets NVIDIA (see the nvidia backend and the use of triton::nvidia_gpu intrinsics inside passes). Other hardware backends can plug in by reusing the raw DSL + pass hooks and implementing their own lowering passes/encodings under third_party/<backend>/backend/compiler.py, similar to how HINTS are dispatched.
    • Pass wrappers exported from triton_tle.cc let each backend opt into only the passes it supports when assembling its pipeline (e.g., NVIDIA enabling TMA lowering while another backend might stop after memory-space tagging); see the pipeline sketch after this list.
  • Testing & Examples

    • Integration tests under python/test/tle/ (mentioned in the README) cover end-to-end kernels for pipeline loops, GEMM, and TMA copies to ensure Python APIs, semantic checks, and passes stay aligned.
    • Developers can run python python/test/tle/run_tests.py after modifying either the Python DSL or MLIR passes to catch regressions quickly.
  • Extending TLE

    • New APIs should mirror the established pattern: add Python surface ops (with semantic validation) → expose necessary builder hooks → create/extend dialect ops → add lowering passes and register them for backends.
    • Keep layout/scope abstractions centralized in types.py so future hardware (e.g., tensor memory) can be toggled without touching user code, and document any new passes in Passes.td to keep the wiki aligned.
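
The sketch below illustrates the Pass & Lowering Pipeline and Backend Distribution points above: a backend stage that opts into TLE lowering when building its TTIR. Only the pass names come from Passes.td; the tle binding import path and the surrounding stage function are assumptions, so the TLE calls are shown commented out.

from triton._C.libtriton import ir          # upstream Triton MLIR bindings
# Hypothetical import for the pass wrappers registered by triton_tle.cc:
# from triton._C.libtriton import tle

def make_ttir_with_tle(mod, metadata, options):
    # Illustrative backend stage that opts into TLE lowering.
    pm = ir.pass_manager(mod.context)
    pm.enable_debug()
    # ... standard Triton TTIR canonicalization passes go here ...
    # tle.add_early_assign_memory_space(pm)  # shared-memory tags -> local allocs
    # tle.add_lower_async_load(pm)           # "tt.load.async" -> async copy chains
    # tle.add_lower_tma_copy(pm)             # TMACopyOp -> NVIDIA TMA intrinsics
    pm.run(mod)
    return mod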

HINTS

HINTS extends TTIR operations with attributes to enable hardware-aware optimizations. The implementation involves AST processing, TTIR attribute encoding, and backend pass distribution.

AST Processing: HINTS are processed in two stages:

  • Parsing (python/triton/runtime/jit.py): The parse() method uses Python's tokenize module to scan source code for #@hint: comments, extracts hint names, and maps them to line numbers. These hints are stored in a line_flagtree_hints dictionary and attached to the AST function definition node (a simplified sketch follows this list).
  • Create Op (python/triton/compiler/code_generator.py, python/triton/language/core.py, python/triton/language/semantic.py): During code generation, when encountering tl.load calls, the code generator retrieves hints from the line number mapping and passes them as the flagtree_hints parameter to load(). The semantic layer then forwards this parameter to the builder's create_load() method, which encodes hints as TTIR operation attributes.
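
The parsing stage can be pictured with the simplified sketch below; this is not the actual jit.py code, only an illustration of how Python's tokenize module can map #@hint: comments to the line numbers stored in line_flagtree_hints.

import io
import tokenize

def scan_flagtree_hints(source: str) -> dict:
    # Simplified illustration of the parse() step: return a mapping from
    # source line number to the hint names found in "#@hint:" comments.
    line_flagtree_hints = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and tok.string.startswith("#@hint:"):
            names = [n.strip() for n in tok.string[len("#@hint:"):].split(",")]
            line_flagtree_hints.setdefault(tok.start[0], []).extend(names)
    return line_flagtree_hints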

TTIR Attribute Extension: HINTS are encoded as attributes on TTIR operations (e.g., tt.load operations carry hint attributes), enabling mid-end and backend passes to access and process them.

Backend Pass Distribution: HINTS processing passes are dispatched in backend compilers (e.g., third_party/[backend_name]/backend/compiler.py). Each backend registers appropriate passes based on the hints it supports (e.g., add_process_shared_memory_hint() for NVIDIA backend).

Pass Implementation Locations: HINTS processing passes are implemented in:

  • Backend-specific folders: Each backend may implement hint-specific passes in its own directory (e.g., third_party/nvidia/)
  • Linalg/FLIR folders: Common Linalg passes that process hints during structured-to-memref conversions
  • TLE folders: TLE-related passes that may interact with hints during transformations

For more information, please refer to the HINTS wiki.

EDSL

TBD
