
Commit cd28c8d

tilecpp: remove internal markers and publish tilecpp backend
1 parent be9c768 commit cd28c8d

105 files changed

Lines changed: 15740 additions & 1 deletion


README.tilecpp.md

Lines changed: 214 additions & 0 deletions
@@ -0,0 +1,214 @@
<!--- SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. --->

<!--- SPDX-License-Identifier: MIT --->

# CUDA Tile C++ Backend

The CUDA Tile C++ backend provides CUDA Tile C++ kernel implementations for TileGym operations.

## Set up

CUDA Tile C++ requires CUDA Toolkit 13.3 or newer. Install the latest CUDA Toolkit
available for your platform, and make sure `nvcc` from that toolkit is on
your `PATH`.

```
# Example: use a CUDA 13.3+ toolkit installed under /usr/local.
export PATH=/usr/local/cuda-13.3/bin:$PATH
export TILECPP_NVCC_PATH=/usr/local/cuda-13.3/bin/nvcc

# Verify nvcc is visible.
nvcc --version

# Run a benchmark; the report table should include a CUDA Tile C++ (TileCpp) column.
python tests/benchmark/bench_swiglu.py
```

## Environment Variables

### Cache Configuration

| Variable                | Default            | Description                                                                                                                 |
| ----------------------- | ------------------ | --------------------------------------------------------------------------------------------------------------------------- |
| `TILECPP_CACHE_DIR`     | `~/.cache/tilecpp` | Directory for caching compiled cubin files. If not set, uses `$XDG_CACHE_HOME/tilecpp` or falls back to `~/.cache/tilecpp`. |
| `TILECPP_DISABLE_CACHE` | `0`                | Set to `1` to disable cubin caching and force recompilation on every run. Useful for development/debugging.                 |

### Compiler Configuration

| Variable            | Default | Description                                                                                                        |
| ------------------- | ------- | ------------------------------------------------------------------------------------------------------------------ |
| `TILECPP_NVCC_PATH` | `nvcc`  | Path to the nvcc compiler. Override if nvcc is not in your PATH or you want to use a specific version.             |
| `TILECPP_SAVE_SRC`  | `0`     | Set to `1` to save generated CUDA source files alongside compiled cubins. Useful for debugging compilation issues. |

### Autotuning

| Variable                   | Default | Description                                                                                             |
| -------------------------- | ------- | ------------------------------------------------------------------------------------------------------- |
| `TILECPP_AUTOTUNE`         | `0`     | Set to `1` to enable autotuning for kernel configurations. When disabled, uses default configurations.  |
| `TILECPP_VERBOSE_AUTOTUNE` | `0`     | Set to `1` to enable verbose output during autotuning, showing configuration trials and timing results. |

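The cache-related variables above imply a lookup order: an explicit `TILECPP_CACHE_DIR`, then `$XDG_CACHE_HOME/tilecpp`, then `~/.cache/tilecpp`. The sketch below illustrates that order only; it is not TileGym's actual implementation. The helper names `resolve_cache_dir` and `cache_disabled` are hypothetical, and the environment is passed in as a mapping so the behaviour is easy to check.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Hypothetical helper mirroring the documented lookup order."""
    env = os.environ if env is None else env
    explicit = env.get("TILECPP_CACHE_DIR")
    if explicit:
        # An explicit setting takes priority over everything else.
        return Path(explicit).expanduser()
    xdg = env.get("XDG_CACHE_HOME")
    if xdg:
        return Path(xdg) / "tilecpp"
    # Documented fallback location.
    return Path.home() / ".cache" / "tilecpp"


def cache_disabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Assumes only the literal value "1" disables the cache, as the table states."""
    env = os.environ if env is None else env
    return env.get("TILECPP_DISABLE_CACHE", "0") == "1"
```

Calling `resolve_cache_dir()` with no argument reads the real process environment; passing a plain dictionary keeps the function deterministic in tests.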
## Adding a New CUDA Tile C++ Kernel to TileGym

This section is only about integrating a CUDA Tile C++ kernel into TileGym.

CUDA Tile C++ operators normally have two pieces:

1. A CUDA Tile C++ kernel in `src/tilegym/ops/tilecpp/<op>.cuh`.
2. A Python binding in `src/tilegym/ops/tilecpp/<op>.py` that compiles, launches,
   and registers the kernel with TileGym.

The `.cuh` file contains the `__tile_global__` kernel and any helper tile code.
Prefer making compile-time constants template parameters when they affect tile
shapes or loop structure. Keep the kernel signature limited to runtime pointers
and scalar values that must be passed at launch time.

```cpp
#pragma once

#include <cuda_tile.h>

template<typename T, int BLOCK_M, int BLOCK_N>
__tile_global__ void my_kernel(const T* __restrict__ x, T* __restrict__ y, int n) {
    namespace ct = cuda::tiles;
    // Tile code goes here.
}
```

The Python file creates a `TileCppKernel`, requests a specialized kernel with
`get_kernel(...)`, launches it with device pointers/scalars, and registers the
public TileGym op for the `tilecpp` backend.

```python
from pathlib import Path

import numpy as np
import torch

from tilegym.backend import register_impl
from tilegym.ops.tilecpp.utils._cuda_utils import TileCppKernel

_my_kernel = TileCppKernel(
    source_path=Path(__file__).parent / "my_op.cuh",
    kernel_name="my_kernel",
)


def _launch_my_kernel(x: torch.Tensor, y: torch.Tensor, block_m: int, block_n: int):
    kernel, _, _ = _my_kernel.get_kernel(
        dtype=x.dtype,
        template_params=[block_m, block_n],
        signature="const {T}*, {T}*, int",
    )
    _my_kernel.launch(
        grid=(1, 1, 1),
        kernel=kernel,
        args=[
            np.uint64(x.data_ptr()),
            np.uint64(y.data_ptr()),
            np.int32(x.numel()),
        ],
    )


@register_impl("my_op", backend="tilecpp")
def my_op(x: torch.Tensor, **kwargs):
    y = torch.empty_like(x)
    _launch_my_kernel(x, y, block_m=128, block_n=128)
    return y
```

Make sure `src/tilegym/ops/tilecpp/__init__.py` imports the new Python module
when the backend is available. Add or extend tests under `tests/ops/` so the
same operation can run with `backend="tilecpp"`, and add benchmark coverage
under `tests/benchmark/` when there is a corresponding CuTile benchmark.

## Compiling a `.cuh` Kernel Standalone with nvcc 13.3+

You can compile a CUDA Tile C++ `.cuh` kernel directly with the CUDA 13.3+ toolkit
without going through TileGym. This is useful for verifying a kernel builds
cleanly outside the framework or sharing a self-contained reproducer.

You need one extra `.cu` driver file that:

1. Includes the `.cuh` so the template is in scope.
2. Adds at least one **explicit template instantiation**.
3. Provides host-side setup: device buffers, `cudaMemcpy`, the kernel
   launch, and copy-back/cleanup.

Example driver (`my_op_main.cu`) for the `my_kernel` template shown earlier:

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#include "my_op.cuh"

template __tile_global__ void my_kernel<float, 128, 128>(
    const float* __restrict__, float* __restrict__, int);

int main() {
    constexpr int N = 1 << 20;
    std::vector<float> h_x(N, 1.0f), h_y(N);

    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));
    cudaMemcpy(d_x, h_x.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    /* Tile C++ kernels are tile-centric: the launch always uses
     * block=1, and the kernel uses ct::bid() for parallelism. The
     * grid covers ceil(N / 128) tiles, where 128 matches the block
     * size instantiated above. */
    dim3 grid((N + 127) / 128), block(1);
    my_kernel<float, 128, 128><<<grid, block>>>(d_x, d_y, N);
    cudaDeviceSynchronize();

    cudaMemcpy(h_y.data(), d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h_y[0]);

    cudaFree(d_x); cudaFree(d_y);
    return 0;
}
```

Compile with nvcc 13.3 or newer. Set `-arch` to match your target GPU
(`sm_80` and newer architectures are supported):

```bash
/usr/local/cuda-13.3/bin/nvcc \
    -enable-tile \
    -std=c++20 \
    -arch=sm_100 \
    -I src/tilegym/ops/tilecpp \
    my_op_main.cu \
    -o my_op_main

./my_op_main
```

The `-enable-tile` flag turns on the Tile C++ extensions (`__tile_global__`,
the `cuda::tiles` namespace, etc.); without it nvcc treats the `.cuh` as
plain CUDA and rejects the tile syntax.

The same toolchain can produce a cubin-only artifact (the form TileGym caches
internally) by adding `-tilecubin --tile-only` and dropping the host driver
code from the `.cu` file.

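When the standalone build needs to be scripted, for example in a reproducer, the command line shown above can be assembled programmatically. The following is a minimal sketch that uses only the flags quoted in this README. The function name `build_nvcc_command` and its parameters are hypothetical and are not part of TileGym.

```python
from typing import List


def build_nvcc_command(
    source: str,
    output: str,
    arch: str = "sm_100",
    include_dir: str = "src/tilegym/ops/tilecpp",
    nvcc: str = "nvcc",
) -> List[str]:
    """Assemble the standalone compile command from the flags shown in the README."""
    return [
        nvcc,
        "-enable-tile",       # turns on the Tile C++ extensions
        "-std=c++20",
        f"-arch={arch}",      # must match the target GPU
        "-I", include_dir,    # directory containing the .cuh file
        source,
        "-o", output,
    ]


cmd = build_nvcc_command(
    "my_op_main.cu",
    "my_op_main",
    nvcc="/usr/local/cuda-13.3/bin/nvcc",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids shell-quoting problems with paths that contain spaces.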
## Cache Management

The CUDA Tile C++ cache stores compiled cubin files to avoid recompilation. Cache files are named using a hash of the source code and template parameters.

To clear the cache at its default location (substitute your own path if `TILECPP_CACHE_DIR` or `XDG_CACHE_HOME` is set):

```bash
rm -rf ~/.cache/tilecpp/*
```
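The README says only that cache files are named from a hash of the source code and template parameters; it does not give the exact scheme. The sketch below shows one way such a key could be derived. The choice of SHA-256, the separator byte, the 16-character truncation, and the `.cubin` suffix are all assumptions made for illustration, and `cache_file_name` is a hypothetical helper.

```python
import hashlib
from typing import Sequence


def cache_file_name(source: str, template_params: Sequence[object]) -> str:
    """Hypothetical cache key: SHA-256 over the source text and template parameters."""
    digest = hashlib.sha256()
    digest.update(source.encode("utf-8"))
    digest.update(b"\0")  # separator so source and parameters cannot run together
    digest.update(",".join(repr(p) for p in template_params).encode("utf-8"))
    # Truncate the hex digest to keep file names short (an assumption).
    return digest.hexdigest()[:16] + ".cubin"


name_a = cache_file_name("// kernel source", [128, 128])
name_b = cache_file_name("// kernel source", [64, 128])
```

Because the key depends on both inputs, editing the `.cuh` file or requesting a different template specialization yields a different file name, so a stale cubin is never reused for changed inputs.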

requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -13,4 +13,6 @@ numpy
 cuda-tile>=1.3.0 # Or use: pip install cuda-tile[tileiras] for bundled tileiras compiler
 filelock>=3.20.3 # CVE fix: GHSA-w853-jp5j-5j7f, GHSA-qmgc-5h2g-mvrw
 pillow>=12.1.1 # CVE fix: GHSA-cfh3-3jmp-rvhc
+cuda-bindings>=13.2.0
+cuda-core>=0.7.0
 # nvidia-ml-py # optional

src/tilegym/backend/dispatcher.py

Lines changed: 10 additions & 0 deletions
@@ -17,6 +17,7 @@
 from tilegym.logger import get_logger

 from .selector import get_current_backend
+from .selector import is_tilecpp_available


 def _is_fallback_disabled() -> bool:
@@ -81,6 +82,15 @@ def wrapper(*args, **kwargs):

             logger.debug(f"[Backend Dispatch] Function: '{name}', Current backend: '{current_backend}'")

+            # Defer the tilecpp nvcc-version probe until the first actual
+            # dispatch to tilecpp. is_tilecpp_available() is cached, so the
+            # subprocess runs at most once per process. If unavailable, fall
+            # through to the registered fallback so the user gets a useful
+            # result (or a clear DISABLE_FALLBACK error below) instead of a
+            # tilecpp launch failure.
+            if current_backend == "tilecpp" and not is_tilecpp_available():
+                current_backend = fallback_backend
+
             # Try implementation from current backend
             if name in _REGISTRY and current_backend in _REGISTRY[name]:
                 logger.debug(f"[Backend Dispatch] Using '{current_backend}' implementation for '{name}'")
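
The dispatcher change above combines a cached availability probe with a fallback to another backend. The toy sketch below reproduces that pattern in isolation. It is not TileGym's dispatcher: `register`, `backend_usable`, `dispatch`, and the `probe_calls` counter are hypothetical names, and the probe is hard-coded to fail so that the fallback path is exercised.

```python
import functools
from typing import Callable, Dict

_REGISTRY: Dict[str, Dict[str, Callable]] = {}
probe_calls = 0  # counts how often the "expensive" probe actually runs


def register(name: str, backend: str):
    """Record an implementation of `name` for `backend`."""
    def decorator(fn):
        _REGISTRY.setdefault(name, {})[backend] = fn
        return fn
    return decorator


@functools.cache
def backend_usable() -> bool:
    """Stand-in for an expensive probe; functools.cache makes it run once."""
    global probe_calls
    probe_calls += 1
    return False  # pretend the required toolchain is missing


def dispatch(name: str, current_backend: str, fallback_backend: str, *args):
    # Probe lazily, only when the tilecpp backend is actually requested.
    if current_backend == "tilecpp" and not backend_usable():
        current_backend = fallback_backend
    return _REGISTRY[name][current_backend](*args)


@register("double", "tilecpp")
def _double_tilecpp(x):
    return ("tilecpp", 2 * x)


@register("double", "cutile")
def _double_cutile(x):
    return ("cutile", 2 * x)


first = dispatch("double", "tilecpp", "cutile", 21)
second = dispatch("double", "tilecpp", "cutile", 4)
```

Both calls are served by the fallback implementation, and the probe body executes only once because its result is cached for the lifetime of the process.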

src/tilegym/backend/selector.py

Lines changed: 116 additions & 1 deletion
@@ -7,6 +7,7 @@
 Used to manage backend implementations of various operations in TileGym library
 """

+import functools
 import os
 from typing import Dict
 from typing import Set
@@ -31,13 +32,109 @@ def is_cutile_available():
     return CUTILE_AVAILABLE


+_TILECPP_MIN_NVCC = (13, 3)
+
+
+def _nvcc_version_supported() -> bool:
+    """Return True iff a usable nvcc with a supported CUDA version is found.
+
+    Resolution order: ``$TILECPP_NVCC_PATH`` first, then ``nvcc`` on PATH.
+    The release version reported by ``nvcc --version`` must be at least
+    ``_TILECPP_MIN_NVCC`` (currently 13.3).
+    """
+    import re
+    import shutil
+    import subprocess
+
+    nvcc = os.environ.get("TILECPP_NVCC_PATH", "nvcc")
+    if not os.path.isabs(nvcc):
+        resolved = shutil.which(nvcc)
+        if resolved is None:
+            return False
+        nvcc = resolved
+    elif not os.path.exists(nvcc):
+        return False
+
+    try:
+        result = subprocess.run([nvcc, "--version"], capture_output=True, text=True, timeout=10)
+    except (OSError, subprocess.SubprocessError):
+        return False
+    if result.returncode != 0:
+        return False
+    m = re.search(r"release\s+(\d+)\.(\d+)", result.stdout)
+    if not m:
+        return False
+    return (int(m.group(1)), int(m.group(2))) >= _TILECPP_MIN_NVCC
+
+
+def _check_tilecpp_module_importable():
+    """Cheap eager check: can we locate and import the TileCpp _cuda_utils module?
+
+    Does NOT spawn any subprocess, so it is safe to call at module load time
+    even on hosts without nvcc / without CUDA. Returns ``(ok, err)`` where
+    ``err`` is the captured exception when ``ok`` is False.
+    """
+    try:
+        from importlib import util as importlib_util
+        from pathlib import Path
+
+        _tilecpp_cuda_utils_path = Path(__file__).resolve().parents[1] / "ops" / "tilecpp" / "utils" / "_cuda_utils.py"
+        _tilecpp_cuda_utils_spec = importlib_util.spec_from_file_location(
+            "_tilegym_tilecpp_cuda_utils_availability",
+            _tilecpp_cuda_utils_path,
+        )
+        if _tilecpp_cuda_utils_spec is None or _tilecpp_cuda_utils_spec.loader is None:
+            raise ImportError("Failed to locate TileCpp _cuda_utils module")
+        _tilecpp_cuda_utils = importlib_util.module_from_spec(_tilecpp_cuda_utils_spec)
+        _tilecpp_cuda_utils_spec.loader.exec_module(_tilecpp_cuda_utils)
+        if not hasattr(_tilecpp_cuda_utils, "TileCppKernel"):
+            raise ImportError("TileCppKernel is not available")
+    except (ImportError, FileNotFoundError) as err:
+        return False, err
+    return True, None
+
+
+_TILECPP_MODULE_IMPORTABLE, _tilecpp_unavailable_err = _check_tilecpp_module_importable()
+
+
+@functools.cache
+def is_tilecpp_available() -> bool:
+    """Check if the CUDA Tile C++ backend is available.
+
+    The expensive ``nvcc --version`` subprocess is deferred to the first call
+    of this function (cached thereafter), so ``import tilegym`` on a non-CUDA
+    host has no subprocess overhead. The check is invoked by the dispatcher
+    on the first actual tilecpp dispatch. When the check fails, a
+    ``UserWarning`` is emitted at the caller's frame (``stacklevel=2``) and
+    suppressed for subsequent calls.
+    """
+    import warnings
+
+    if not _TILECPP_MODULE_IMPORTABLE:
+        warnings.warn(
+            f"TileCpp backend is not available: {_tilecpp_unavailable_err}",
+            stacklevel=2,
+        )
+        return False
+    if not _nvcc_version_supported():
+        warnings.warn(
+            f"TileCpp backend is not available: nvcc >= {_TILECPP_MIN_NVCC[0]}.{_TILECPP_MIN_NVCC[1]} "
+            "is required (set TILECPP_NVCC_PATH or install CUDA "
+            f"{_TILECPP_MIN_NVCC[0]}.{_TILECPP_MIN_NVCC[1]} or newer on PATH)",
+            stacklevel=2,
+        )
+        return False
+    return True
+
+
 _AVAILABLE_BACKENDS: Set[str] = set()
 _CURRENT_BACKENDS: str = "cutile"


 def _check_backends_availability() -> Dict[str, bool]:
     availability = {
         "cutile": is_cutile_available(),
+        "tilecpp": _TILECPP_MODULE_IMPORTABLE,
     }
     return availability

@@ -75,13 +172,31 @@ def set_backend(backend: str) -> None:
     global _CURRENT_BACKENDS
     if backend not in _AVAILABLE_BACKENDS:
         raise ValueError(f"Unknown backend: {backend}, available backends: {_AVAILABLE_BACKENDS}")
+    # tilecpp is in _AVAILABLE_BACKENDS based on a cheap module-importability
+    # check; verify the full runtime requirement (nvcc >= 13.3) here so callers
+    # opting in to tilecpp fail fast instead of silently falling back at dispatch.
+    if backend == "tilecpp" and not is_tilecpp_available():
+        raise ValueError(
+            f"Backend 'tilecpp' is not available on this system: nvcc >= "
+            f"{_TILECPP_MIN_NVCC[0]}.{_TILECPP_MIN_NVCC[1]} is required "
+            "(set TILECPP_NVCC_PATH or install CUDA "
+            f"{_TILECPP_MIN_NVCC[0]}.{_TILECPP_MIN_NVCC[1]} or newer on PATH)"
+        )
     _CURRENT_BACKENDS = backend
     logger.info(f"Set backend to {backend}")


 def is_backend_available(backend: str) -> bool:
     """check if the backend is available"""
-    return backend in _AVAILABLE_BACKENDS
+    if backend not in _AVAILABLE_BACKENDS:
+        return False
+    # tilecpp's entry in _AVAILABLE_BACKENDS reflects only the cheap module-
+    # importability check; the runtime nvcc>=13.3 requirement is verified
+    # lazily here (cached) so test gates like
+    # ``if is_backend_available("tilecpp"):`` skip on hosts without nvcc.
+    if backend == "tilecpp":
+        return is_tilecpp_available()
+    return True


 def assert_backend_available(backend: str) -> None:
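
The version gate in `_nvcc_version_supported()` relies on parsing the `release X.Y` text and comparing integer tuples. The sketch below isolates that step so it can be run without nvcc installed. The helper names `parse_nvcc_release` and `is_supported` are hypothetical, and the sample strings are illustrative stand-ins for `nvcc --version` output rather than captured logs.

```python
import re
from typing import Optional, Tuple

MIN_NVCC = (13, 3)  # mirrors _TILECPP_MIN_NVCC from the diff above


def parse_nvcc_release(stdout: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) using the same regex as the diff; None if absent."""
    m = re.search(r"release\s+(\d+)\.(\d+)", stdout)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2))


def is_supported(stdout: str) -> bool:
    version = parse_nvcc_release(stdout)
    # Tuples compare element by element, so (13, 10) ranks above (13, 3).
    return version is not None and version >= MIN_NVCC


# Illustrative sample strings (assumed format, not real captured output).
sample_old = "Cuda compilation tools, release 12.4, V12.4.131"
sample_new = "Cuda compilation tools, release 13.10, V13.10.0"

old_ok = is_supported(sample_old)
new_ok = is_supported(sample_new)
```

Comparing integer tuples rather than strings matters for two-digit minor versions: as strings, `"13.10"` sorts before `"13.3"`, whereas `(13, 10) >= (13, 3)` evaluates as intended.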
