|
| 1 | +<!--- SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. ---> |
| 2 | + |
| 3 | +<!--- SPDX-License-Identifier: MIT ---> |
| 4 | + |
| 5 | + |
| 6 | + |
| 7 | + |
| 8 | +# CUDA Tile C++ Backend |
| 9 | + |
| 10 | +The CUDA Tile C++ backend provides CUDA Tile C++ kernel implementations for TileGym operations. |
| 11 | + |
| 12 | +## Set up |
| 13 | + |
| 14 | +CUDA Tile C++ requires CUDA Toolkit 13.3 or newer. Install the latest CUDA Toolkit |
| 15 | +available for your platform, and make sure `nvcc` from that toolkit is on |
| 16 | +your `PATH`. |
| 17 | + |
| 18 | +``` |
| 19 | +# Example: use a CUDA 13.3+ toolkit installed under /usr/local. |
| 20 | +export PATH=/usr/local/cuda-13.3/bin:$PATH |
| 21 | +export TILECPP_NVCC_PATH=/usr/local/cuda-13.3/bin/nvcc |
| 22 | +
|
| 23 | +# Verify nvcc is visible. |
| 24 | +nvcc --version |
| 25 | +
|
| 26 | +# Run a test, you should see a CUDA Tile C++ (TileCpp) column in the report table |
| 27 | +python tests/benchmark/bench_swiglu.py |
| 28 | +``` |
| 29 | + |
| 30 | +## Environment Variables |
| 31 | + |
| 32 | +### Cache Configuration |
| 33 | + |
| 34 | + |
| 35 | +| Variable | Default | Description | |
| 36 | +| ----------------------- | ------------------ | --------------------------------------------------------------------------------------------------------------------------- | |
| 37 | +| `TILECPP_CACHE_DIR` | `~/.cache/tilecpp` | Directory for caching compiled cubin files. If not set, uses `$XDG_CACHE_HOME/tilecpp` or falls back to `~/.cache/tilecpp`. | |
| 38 | +| `TILECPP_DISABLE_CACHE` | `0` | Set to `1` to disable cubin caching and force recompilation on every run. Useful for development/debugging. | |
| 39 | + |
| 40 | + |
| 41 | +### Compiler Configuration |
| 42 | + |
| 43 | + |
| 44 | +| Variable | Default | Description | |
| 45 | +| ------------------- | ------- | ------------------------------------------------------------------------------------------------------------------ | |
| 46 | +| `TILECPP_NVCC_PATH` | `nvcc` | Path to the nvcc compiler. Override if nvcc is not in your PATH or you want to use a specific version. | |
| 47 | +| `TILECPP_SAVE_SRC` | `0` | Set to `1` to save generated CUDA source files alongside compiled cubins. Useful for debugging compilation issues. | |
| 48 | + |
| 49 | + |
| 50 | +### Autotuning |
| 51 | + |
| 52 | + |
| 53 | +| Variable | Default | Description | |
| 54 | +| -------------------------- | ------- | ------------------------------------------------------------------------------------------------------- | |
| 55 | +| `TILECPP_AUTOTUNE` | `0` | Set to `1` to enable autotuning for kernel configurations. When disabled, uses default configurations. | |
| 56 | +| `TILECPP_VERBOSE_AUTOTUNE` | `0` | Set to `1` to enable verbose output during autotuning, showing configuration trials and timing results. | |
| 57 | + |
| 58 | + |
| 59 | +## Adding a New CUDA Tile C++ Kernel to TileGym |
| 60 | + |
| 61 | +This section is only about integrating a CUDA Tile C++ kernel into TileGym. |
| 62 | + |
| 63 | +CUDA Tile C++ operators normally have two pieces: |
| 64 | + |
| 65 | +1. A CUDA Tile C++ kernel in `src/tilegym/ops/tilecpp/<op>.cuh`. |
| 66 | +2. A Python binding in `src/tilegym/ops/tilecpp/<op>.py` that compiles, launches, |
| 67 | + and registers the kernel with TileGym. |
| 68 | + |
| 69 | +The `.cuh` file contains the `__tile_global__` kernel and any helper tile code. |
| 70 | +Prefer making compile-time constants template parameters when they affect tile |
| 71 | +shapes or loop structure. Keep the kernel signature limited to runtime pointers |
| 72 | +and scalar values that must be passed at launch time. |
| 73 | + |
| 74 | +```cpp |
| 75 | +#pragma once |
| 76 | + |
| 77 | +#include <cuda_tile.h> |
| 78 | + |
| 79 | +template<typename T, int BLOCK_M, int BLOCK_N> |
| 80 | +__tile_global__ void my_kernel(const T* __restrict__ x, T* __restrict__ y, int n) { |
| 81 | + namespace ct = cuda::tiles; |
| 82 | + // Tile code goes here. |
| 83 | +} |
| 84 | +``` |
| 85 | +
|
| 86 | +The Python file creates a `TileCppKernel`, requests a specialized kernel with |
| 87 | +`get_kernel(...)`, launches it with device pointers/scalars, and registers the |
| 88 | +public TileGym op for the `tilecpp` backend. |
| 89 | +
|
| 90 | +```python |
| 91 | +from pathlib import Path |
| 92 | +
|
| 93 | +import numpy as np |
| 94 | +import torch |
| 95 | +
|
| 96 | +from tilegym.backend import register_impl |
| 97 | +from tilegym.ops.tilecpp.utils._cuda_utils import TileCppKernel |
| 98 | +
|
| 99 | +_my_kernel = TileCppKernel( |
| 100 | + source_path=Path(__file__).parent / "my_op.cuh", |
| 101 | + kernel_name="my_kernel", |
| 102 | +) |
| 103 | +
|
| 104 | +
|
| 105 | +def _launch_my_kernel(x: torch.Tensor, y: torch.Tensor, block_m: int, block_n: int): |
| 106 | + kernel, _, _ = _my_kernel.get_kernel( |
| 107 | + dtype=x.dtype, |
| 108 | + template_params=[block_m, block_n], |
| 109 | + signature="const {T}*, {T}*, int", |
| 110 | + ) |
| 111 | + _my_kernel.launch( |
| 112 | + grid=(1, 1, 1), |
| 113 | + kernel=kernel, |
| 114 | + args=[ |
| 115 | + np.uint64(x.data_ptr()), |
| 116 | + np.uint64(y.data_ptr()), |
| 117 | + np.int32(x.numel()), |
| 118 | + ], |
| 119 | + ) |
| 120 | +
|
| 121 | +
|
| 122 | +@register_impl("my_op", backend="tilecpp") |
| 123 | +def my_op(x: torch.Tensor, **kwargs): |
| 124 | + y = torch.empty_like(x) |
| 125 | + _launch_my_kernel(x, y, block_m=128, block_n=128) |
| 126 | + return y |
| 127 | +``` |
| 128 | + |
| 129 | +Make sure `src/tilegym/ops/tilecpp/__init__.py` imports the new Python module |
| 130 | +when the backend is available. Add or extend tests under `tests/ops/` so the |
| 131 | +same operation can run with `backend="tilecpp"`, and add benchmark coverage |
| 132 | +under `tests/benchmark/` when there is a corresponding CuTile benchmark. |
| 133 | + |
| 134 | +## Compiling a `.cuh` Kernel Standalone with nvcc 13.3+ |
| 135 | + |
| 136 | +You can compile a CUDA Tile C++ `.cuh` kernel directly with the CUDA 13.3+ toolkit |
| 137 | +without going through TileGym. This is useful for verifying a kernel builds |
| 138 | +cleanly outside the framework or sharing a self-contained reproducer. |
| 139 | + |
| 140 | +You need one extra `.cu` driver file that: |
| 141 | + |
| 142 | +1. Includes the `.cuh` so the template is in scope. |
| 143 | +2. Adds at least one **explicit template instantiation**. |
| 144 | +3. Provides host-side setup: device buffers, `cudaMemcpy`, the kernel |
| 145 | + launch, and copy-back/cleanup. |
| 146 | + |
| 147 | +Example driver (`my_op_main.cu`) for the `my_kernel` template shown earlier: |
| 148 | + |
| 149 | +```cpp |
| 150 | +#include <cstdio> |
| 151 | +#include <vector> |
| 152 | +#include <cuda_runtime.h> |
| 153 | + |
| 154 | +#include "my_op.cuh" |
| 155 | + |
| 156 | +template __tile_global__ void my_kernel<float, 128, 128>( |
| 157 | + const float* __restrict__, float* __restrict__, int); |
| 158 | + |
| 159 | +int main() { |
| 160 | + constexpr int N = 1 << 20; |
| 161 | + std::vector<float> h_x(N, 1.0f), h_y(N); |
| 162 | + |
| 163 | + float *d_x = nullptr, *d_y = nullptr; |
| 164 | + cudaMalloc(&d_x, N * sizeof(float)); |
| 165 | + cudaMalloc(&d_y, N * sizeof(float)); |
| 166 | + cudaMemcpy(d_x, h_x.data(), N * sizeof(float), cudaMemcpyHostToDevice); |
| 167 | + |
| 168 | + /* Tile C++ kernels are tile-centric: the launch always uses |
| 169 | + * block=1, and the kernel uses ct::bid() for parallelism. The |
| 170 | + * grid covers ceil(N / BLOCK_SIZE) tiles. */ |
| 171 | + dim3 grid((N + 127) / 128), block(1); |
| 172 | + my_kernel<float, 128, 128><<<grid, block>>>(d_x, d_y, N); |
| 173 | + cudaDeviceSynchronize(); |
| 174 | + |
| 175 | + cudaMemcpy(h_y.data(), d_y, N * sizeof(float), cudaMemcpyDeviceToHost); |
| 176 | + printf("y[0] = %f\n", h_y[0]); |
| 177 | + |
| 178 | + cudaFree(d_x); cudaFree(d_y); |
| 179 | + return 0; |
| 180 | +} |
| 181 | +``` |
| 182 | + |
| 183 | +Compile with nvcc 13.3 or newer. Set `-arch` to match your target GPU |
| 184 | +(`sm_80` and newer architectures are supported): |
| 185 | + |
| 186 | +```bash |
| 187 | +/usr/local/cuda-13.3/bin/nvcc \ |
| 188 | + -enable-tile \ |
| 189 | + -std=c++20 \ |
| 190 | + -arch=sm_100 \ |
| 191 | + -I src/tilegym/ops/tilecpp \ |
| 192 | + my_op_main.cu \ |
| 193 | + -o my_op_main |
| 194 | + |
| 195 | +./my_op_main |
| 196 | +``` |
| 197 | + |
| 198 | +The `-enable-tile` flag turns on the Tile C++ extensions (`__tile_global__`, |
| 199 | +the `cuda::tiles` namespace, etc.); without it nvcc treats the `.cuh` as |
| 200 | +plain CUDA and rejects the tile syntax. |
| 201 | + |
| 202 | +The same toolchain can produce a cubin-only artifact (the form TileGym caches |
| 203 | +internally) by adding `-tilecubin --tile-only` and dropping the host driver |
| 204 | +code from the `.cu` file. |
| 205 | + |
| 206 | +## Cache Management |
| 207 | + |
| 208 | +The CUDA Tile C++ cache stores compiled cubin files to avoid recompilation. Cache files are named using a hash of the source code and template parameters. |
| 209 | + |
| 210 | +To clear the cache: |
| 211 | + |
| 212 | +```bash |
| 213 | +rm -rf ~/.cache/tilecpp/* |
| 214 | +``` |
0 commit comments