Summary
The CuTe DSL `ArgmaxKernel` raises `CUDA_ERROR_INVALID_CLUSTER_SIZE` (error 912) on NVIDIA H20 GPUs when the vocab size (N) is large (e.g., 151936 for Qwen3). The H20 is sm_90a (Hopper architecture) but lacks TMA cluster launch hardware support, causing any kernel configuration with `cluster_dims > (1,1,1)` to fail.
Environment
Python: 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20-3e
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 13.0, V13.0.48
CUDA Driver Version: 570.133.20
PyTorch: 2.11.0+cu130
tokenspeed: 0.1.0
aiohttp: 3.13.5
compressed-tensors: 0.15.0.1
dill: 0.4.1
einops: 0.8.2
fastapi: 0.136.1
hf_transfer: 0.1.9
huggingface_hub: 1.14.0
modelscope: 1.36.3
msgspec: 0.21.1
ninja: 1.13.0
numpy: 1.26.4
nvidia-cutlass-dsl: 4.5.2
nvtx: 0.2.15
openai: 2.33.0
openai-harmony: 0.0.4
orjson: 3.11.9
packaging: 26.2
partial-json-parser: 0.2.1.1.post7
peft: 0.19.1
pillow: 12.2.0
prometheus-client: 0.25.0
psutil: 7.2.2
pybase64: 1.4.3
pybind11: 3.0.4
pydantic: 2.13.4
py-spy: 0.4.2
python-multipart: 0.0.28
pyzmq: 27.1.0
requests: 2.34.0
setproctitle: 1.3.7
tiktoken: 0.12.0
tokenspeed-smg: 1.4.1.post20260527
tokenspeed-smg-grpc-proto: 0.4.8.post20260527
tokenspeed-smg-grpc-servicer: 0.5.3.post20260527
torch: 2.11.0+cu130
torchvision: 0.26.0+cu130
tqdm: 4.67.3
transformers: 5.6.2
uv: 0.11.14
uvicorn: 0.46.0
uvloop: 0.22.1
viztracer: 1.1.1
xgrammar: 0.1.33
Reproducer
import torch
import cuda.bindings.driver as cuda
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack
from tokenspeed_kernel.thirdparty.cute_dsl.argmax import (
    ArgmaxKernel, CUDAGraphCompatibleWrapper, torch2cute_dtype_map
)
M, N = 1, 151936 # Qwen3 vocab size triggers it; N=32768 does NOT
dtype_torch = torch.bfloat16
dtype = torch2cute_dtype_map[dtype_torch]
logits = torch.randn(M, N, dtype=dtype_torch, device='cuda')
out_max = torch.empty(M, dtype=torch.float32, device='cuda')
out_idx = torch.empty(M, dtype=torch.int64, device='cuda')
x = from_dlpack(CUDAGraphCompatibleWrapper(logits.detach()), assumed_align=16) \
    .mark_compact_shape_dynamic(mode=0, stride_order=(0, 1))
m = from_dlpack(CUDAGraphCompatibleWrapper(out_max.detach()), assumed_align=16) \
    .mark_compact_shape_dynamic(mode=0, stride_order=(0,))
i = from_dlpack(CUDAGraphCompatibleWrapper(out_idx.detach()), assumed_align=16) \
    .mark_compact_shape_dynamic(mode=0, stride_order=(0,))
stream = cuda.CUstream(torch.cuda.current_stream().cuda_stream)
kernel = ArgmaxKernel(dtype, N, use_redux=False)
compiled = cute.compile(kernel, x, m, i, stream)
compiled(x, m, i, stream) # CRASHES HERE
torch.cuda.synchronize()
Error Output
Traceback (most recent call last):
File "/tmp/test.py", line 27, in <module>
compiled(x, m, i, stream) # CRASHES HERE
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvidia_cutlass_dsl/python_packages/cutlass/base_dsl/jit_executor.py", line 694, in run_compiled_program
raise DSLCudaRuntimeError(error_code, error_name)
cutlass.base_dsl.common.DSLCudaRuntimeError: DSLCudaRuntimeError: CUDA_ERROR_INVALID_CLUSTER_SIZE (error code: 912)
❌ Invalid cluster size.
Error Code: 912
🔍 Additional Context:
- Error name: CUDA_ERROR_INVALID_CLUSTER_SIZE
- Error code: 912
- CUDA_TOOLKIT_PATH: not set
- Target SM ARCH: not set
📊 GPU Information:
- CUDA devices available: 8 (current: <CUdevice 0>)
- Architecture: Hopper (sm_90a)
- Compatible SM archs: sm_90a
- Total Memory: 139.81 GB
Compatibility Check:
❌ Error: Target SM ARCH unknown is not compatible
💡 Please use one of SM ARCHs: sm_90a
Root Cause
- The `ArgmaxKernel` (from CUTLASS DSL / TRT-LLM) uses TMA cluster launches with `cluster_dims > (1,1,1)` for large N values.
- NVIDIA H20 reports `sm_90a` (Hopper) but does not have cluster launch hardware; it's a cut-down SKU without this capability.
- The current gate `_ts_supported_arch()` only checks `90 <= sm < 120`, which passes for H20.
- Small N (e.g., 32768) does NOT trigger the bug because the kernel selects `cluster_dims = (1,1,1)`.
- Real LLM vocab sizes (≥100K) trigger multi-cluster configurations that fail on H20.
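To illustrate why the H20 slips through, here is a minimal sketch of a range-only capability gate of the kind described above. The function name `arch_gate_sketch` and the `major * 10 + minor` encoding are assumptions for illustration; this is not the actual `_ts_supported_arch()` source.

```python
def arch_gate_sketch(major: int, minor: int) -> bool:
    """Hypothetical stand-in for a gate that only checks 90 <= sm < 120."""
    sm = major * 10 + minor  # assumed encoding: (9, 0) -> 90
    return 90 <= sm < 120


# H20 reports compute capability 9.0, so a range-only gate accepts it,
# even though the cluster launch itself later fails at runtime.
print(arch_gate_sketch(9, 0))  # same answer for every SKU reporting 9.0
print(arch_gate_sketch(8, 0))  # below the range, rejected
```

A gate keyed only on compute capability cannot tell SKUs with the same capability apart, so it needs a second signal.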
Trigger Conditions
| N | Result on H20 |
| --- | --- |
| 32768 | OK (cluster_dims=1) |
| 151936 | CRASH (cluster_dims>1) |
All M values (1–256) fail identically when N=151936.
Suggested Fix
Add an H20 detection gate before enabling `_CUTE_AVAILABLE`.
def _has_cluster_launch_support() -> bool:
    # ... platform checks ...
    # H20 is the only sm_90 SKU known to lack cluster launch support
    device_name = current_platform().device_name.upper()
    if "H20" in device_name:
        return False
    return True
Alternatively, query `CU_DEVICE_ATTRIBUTE_CLUSTER_LAUNCH` (attribute 120 in the driver API enum) with `cuDeviceGetAttribute` for a hardware-accurate check that does not depend on device-name matching.
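A minimal sketch of the driver-based check, assuming the `cuda.bindings.driver` module from cuda-python (already imported in the reproducer) is available. The helper names `query_cluster_launch` and `cute_argmax_allowed` are hypothetical, and whether the H20 driver actually reports 0 for this attribute has not been verified here.

```python
from typing import Optional


def query_cluster_launch(device_index: int = 0) -> Optional[bool]:
    """Return the driver's answer, or None if it cannot be queried."""
    try:
        from cuda.bindings import driver as cuda  # cuda-python bindings
    except ImportError:
        return None

    try:
        (err,) = cuda.cuInit(0)
        if err != cuda.CUresult.CUDA_SUCCESS:
            return None
        err, dev = cuda.cuDeviceGet(device_index)
        if err != cuda.CUresult.CUDA_SUCCESS:
            return None
        err, value = cuda.cuDeviceGetAttribute(
            cuda.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_CLUSTER_LAUNCH, dev
        )
        if err != cuda.CUresult.CUDA_SUCCESS:
            return None
    except RuntimeError:
        # For example, the driver library could not be loaded.
        return None
    return bool(value)


def cute_argmax_allowed(sm: int, cluster_launch: Optional[bool]) -> bool:
    """Hypothetical gate: arch range AND a positive driver answer."""
    if not (90 <= sm < 120):
        return False
    # Treat "unknown" (None) conservatively and take the fallback path.
    return cluster_launch is True
```

Usage would look like `cute_argmax_allowed(90, query_cluster_launch())`. Treating an unknown answer as unsupported is a design choice here: it trades a possible missed fast path for never hitting error 912.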
Workaround
Disable CuTe DSL on H20 by patching `cute_dsl.py`:
_CUTE_AVAILABLE = False
import torch as _torch_check
if _ARCH_SUPPORTED and "H20" in _torch_check.cuda.get_device_name(0).upper():
    _ARCH_SUPPORTED = False
With this patch, argmax is routed through the `torch.argmax` fallback, with no functional impact.
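For setups where patching the gate is not practical, a runtime fallback is another option. The sketch below tries the fast path once and permanently switches to the fallback after the first caught failure. The name `make_argmax_with_fallback` and the `catch` parameter are hypothetical. It also assumes the launch error leaves the CUDA context usable, which has not been verified here.

```python
from typing import Any, Callable, Tuple, Type


def make_argmax_with_fallback(
    fast_impl: Callable[[Any], Any],
    fallback_impl: Callable[[Any], Any],
    catch: Tuple[Type[BaseException], ...] = (RuntimeError,),
) -> Callable[[Any], Any]:
    """Try fast_impl; after the first caught failure, always use fallback_impl."""
    state = {"fast_ok": True}

    def argmax(x: Any) -> Any:
        if state["fast_ok"]:
            try:
                return fast_impl(x)
            except catch:
                # Disable the fast path so the failing launch is not retried.
                state["fast_ok"] = False
        return fallback_impl(x)

    return argmax
```

In practice `fallback_impl` would be `torch.argmax`, and `catch` would name the exception seen in the traceback (`DSLCudaRuntimeError` from `cutlass.base_dsl.common`), since it is not assumed here to derive from `RuntimeError`.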