CuTe DSL ArgmaxKernel crashes with CUDA_ERROR_INVALID_CLUSTER_SIZE on NVIDIA H20

## Summary

The CuTe DSL `ArgmaxKernel` raises `CUDA_ERROR_INVALID_CLUSTER_SIZE` (error 912) on NVIDIA H20 GPUs when the vocab size (N) is large (e.g., 151936 for Qwen3). The H20 is sm_90a (Hopper architecture) but lacks TMA cluster launch hardware support, causing any kernel configuration with `cluster_dims > (1,1,1)` to fail.

## Environment

Python: 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20-3e
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 13.0, V13.0.48
CUDA Driver Version: 570.133.20
PyTorch: 2.11.0+cu130
tokenspeed: 0.1.0
aiohttp: 3.13.5
compressed-tensors: 0.15.0.1
dill: 0.4.1
einops: 0.8.2
fastapi: 0.136.1
hf_transfer: 0.1.9
huggingface_hub: 1.14.0
modelscope: 1.36.3
msgspec: 0.21.1
ninja: 1.13.0
numpy: 1.26.4
nvidia-cutlass-dsl: 4.5.2
nvtx: 0.2.15
openai: 2.33.0
openai-harmony: 0.0.4
orjson: 3.11.9
packaging: 26.2
partial-json-parser: 0.2.1.1.post7
peft: 0.19.1
pillow: 12.2.0
prometheus-client: 0.25.0
psutil: 7.2.2
pybase64: 1.4.3
pybind11: 3.0.4
pydantic: 2.13.4
py-spy: 0.4.2
python-multipart: 0.0.28
pyzmq: 27.1.0
requests: 2.34.0
setproctitle: 1.3.7
tiktoken: 0.12.0
tokenspeed-smg: 1.4.1.post20260527
tokenspeed-smg-grpc-proto: 0.4.8.post20260527
tokenspeed-smg-grpc-servicer: 0.5.3.post20260527
torch: 2.11.0+cu130
torchvision: 0.26.0+cu130
tqdm: 4.67.3
transformers: 5.6.2
uv: 0.11.14
uvicorn: 0.46.0
uvloop: 0.22.1
viztracer: 1.1.1
xgrammar: 0.1.33

## Reproducer

```python
import torch
import cuda.bindings.driver as cuda
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack
from tokenspeed_kernel.thirdparty.cute_dsl.argmax import (
    ArgmaxKernel, CUDAGraphCompatibleWrapper, torch2cute_dtype_map
)

M, N = 1, 151936  # Qwen3 vocab size triggers it; N=32768 does NOT
dtype_torch = torch.bfloat16
dtype = torch2cute_dtype_map[dtype_torch]

logits = torch.randn(M, N, dtype=dtype_torch, device='cuda')
out_max = torch.empty(M, dtype=torch.float32, device='cuda')
out_idx = torch.empty(M, dtype=torch.int64, device='cuda')

x = from_dlpack(CUDAGraphCompatibleWrapper(logits.detach()), assumed_align=16) \
    .mark_compact_shape_dynamic(mode=0, stride_order=(0, 1))
m = from_dlpack(CUDAGraphCompatibleWrapper(out_max.detach()), assumed_align=16) \
    .mark_compact_shape_dynamic(mode=0, stride_order=(0,))
i = from_dlpack(CUDAGraphCompatibleWrapper(out_idx.detach()), assumed_align=16) \
    .mark_compact_shape_dynamic(mode=0, stride_order=(0,))
stream = cuda.CUstream(torch.cuda.current_stream().cuda_stream)

kernel = ArgmaxKernel(dtype, N, use_redux=False)
compiled = cute.compile(kernel, x, m, i, stream)
compiled(x, m, i, stream)  # CRASHES HERE
torch.cuda.synchronize()
```

## Error Output

```
Traceback (most recent call last):
  File "/tmp/test.py", line 27, in <module>
    compiled(x, m, i, stream)  # CRASHES HERE
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/nvidia_cutlass_dsl/python_packages/cutlass/base_dsl/jit_executor.py", line 694, in run_compiled_program
    raise DSLCudaRuntimeError(error_code, error_name)
cutlass.base_dsl.common.DSLCudaRuntimeError: DSLCudaRuntimeError: CUDA_ERROR_INVALID_CLUSTER_SIZE (error code: 912) 
❌ Invalid cluster size.


Error Code: 912

🔍 Additional Context: 
- Error name: CUDA_ERROR_INVALID_CLUSTER_SIZE
- Error code: 912
- CUDA_TOOLKIT_PATH: not set
- Target SM ARCH: not set

📊 GPU Information:
- CUDA devices available: 8 (current: <CUdevice 0>)
- Architecture: Hopper (sm_90a)
- Compatible SM archs: sm_90a
- Total Memory: 139.81 GB

Compatibility Check:
❌ Error: Target SM ARCH unknown is not compatible
💡 Please use one of SM ARCHs: sm_90a
```

## Root Cause

1. The `ArgmaxKernel` (from CUTLASS DSL / TRT-LLM) uses TMA cluster launches with `cluster_dims` larger than `(1,1,1)` when N is large.
2. NVIDIA H20 reports `sm_90a` (Hopper) but **does not have cluster launch hardware**; it is a cut-down SKU without this capability.
3. The current gate `_ts_supported_arch()` only checks `90 <= sm < 120`, which H20 passes.
4. Small N (e.g., 32768) does not trigger the bug because the kernel selects `cluster_dims = (1,1,1)`.
5. Real LLM vocab sizes (≥100K) select cluster configurations with more than one CTA per cluster, and these fail on H20.
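
To make point 3 concrete, here is a minimal sketch of why a range check on the SM version alone cannot exclude H20. The body of `_ts_supported_arch()` is not shown in this report, so `arch_gate` is a hypothetical stand-in built only from the `90 <= sm < 120` range quoted above.

```python
def arch_gate(major: int, minor: int) -> bool:
    """Hypothetical stand-in for the range check described in point 3."""
    sm = major * 10 + minor
    return 90 <= sm < 120


# H20 reports compute capability 9.0 (see Environment), so the gate passes
# even though the cluster launch later fails.
h20_passes = arch_gate(9, 0)

# A compute capability 8.0 device is rejected by the same check.
sm80_passes = arch_gate(8, 0)

print(f"H20 (9.0) passes gate: {h20_passes}")
print(f"sm_80 (8.0) passes gate: {sm80_passes}")
```

Because the check sees only the compute capability, it cannot tell H20 apart from other sm_90 parts; any fix has to consult something else, such as the device name or a driver attribute.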

## Trigger Conditions

| N | Result on H20 |
|---|---|
| 32768 | OK (`cluster_dims = (1,1,1)`) |
| 151936 | CRASH (`cluster_dims > (1,1,1)`) |

All M values (1–256) fail identically when N=151936.
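
To extend the table to more shapes, a small sweep harness along these lines may help. This is a sketch only: `sweep` and `fake_run` are hypothetical names, and `fake_run` merely mimics the two rows above so the example runs without a GPU. In practice `run_once` would wrap the compile-and-launch steps from the Reproducer.

```python
from typing import Callable, Dict, Iterable, Tuple


def sweep(
    run_once: Callable[[int, int], None],
    ms: Iterable[int],
    ns: Iterable[int],
) -> Dict[Tuple[int, int], str]:
    """Call run_once(M, N) for every pair and record 'OK' or the error text."""
    results: Dict[Tuple[int, int], str] = {}
    ns = list(ns)  # allow reuse across the outer loop
    for m in ms:
        for n in ns:
            try:
                run_once(m, n)
                results[(m, n)] = "OK"
            except Exception as exc:  # record the failure and keep sweeping
                results[(m, n)] = f"{type(exc).__name__}: {exc}"
    return results


def fake_run(m: int, n: int) -> None:
    """Stand-in runner that reproduces only the two table rows above."""
    if n == 151936:
        raise RuntimeError("CUDA_ERROR_INVALID_CLUSTER_SIZE")


results = sweep(fake_run, ms=[1, 256], ns=[32768, 151936])
for (m, n), outcome in sorted(results.items()):
    print(f"M={m:<4} N={n:<7} {outcome}")
```

On real hardware it may be safer to run each `(M, N)` pair in a fresh process, in case a failed launch affects later runs.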

## Suggested Fix

Add an H20 detection gate before enabling `_CUTE_AVAILABLE`.

```python
def _has_cluster_launch_support() -> bool:
    # ... platform checks ...
    # H20 is the only sm_90 SKU known to lack cluster launch support
    device_name = current_platform().device_name.upper()
    if "H20" in device_name:
        return False
    return True
```

Alternatively, query `CU_DEVICE_ATTRIBUTE_CLUSTER_LAUNCH` from the CUDA driver (enum value 120; the runtime API equivalent is `cudaDevAttrClusterLaunch`) for a hardware-accurate check.
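
A minimal sketch of that driver query, using the same `cuda.bindings.driver` module as the Reproducer. The function name `cluster_launch_supported` is hypothetical, and this report does not confirm what value H20 actually returns for the attribute, so it should be checked on the affected hardware before the gate relies on it.

```python
from typing import Optional


def cluster_launch_supported(ordinal: int = 0) -> Optional[bool]:
    """Ask the driver whether the device supports cluster launch.

    Returns True or False when the query succeeds, and None when the
    bindings, the driver, or the device are unavailable.
    """
    try:
        import cuda.bindings.driver as cuda

        (err,) = cuda.cuInit(0)
        if err != cuda.CUresult.CUDA_SUCCESS:
            return None

        err, dev = cuda.cuDeviceGet(ordinal)
        if err != cuda.CUresult.CUDA_SUCCESS:
            return None

        err, value = cuda.cuDeviceGetAttribute(
            cuda.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_CLUSTER_LAUNCH, dev
        )
        if err != cuda.CUresult.CUDA_SUCCESS:
            return None
        return bool(value)
    except Exception:
        # Missing bindings or driver library: report "unknown", not a crash.
        return None


supported = cluster_launch_supported(0)
print(f"cluster launch supported on device 0: {supported}")
```

A gate built on this could treat `None` conservatively, disabling the CuTe DSL path whenever the answer is unknown. Note that a kernel can also be rejected for requesting a cluster size the device cannot schedule, which this attribute alone would not detect.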

## Workaround

Disable CuTe DSL on H20 by patching `cute_dsl.py`:

```python
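_CUTE_AVAILABLE = False
import torch as _torch_check
# Treat H20 as an unsupported arch so the CuTe DSL path is never enabled.
# Only device 0 is inspected; this assumes all GPUs in the node are the same model.
if _ARCH_SUPPORTED and "H20" in _torch_check.cuda.get_device_name(0).upper():
    _ARCH_SUPPORTED = False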
```

This routes argmax through the `torch.argmax` fallback, with no functional impact.
