fix: flash attention CUDA kernel truncating head dimensions > 32 #39

Merged
MankyDanky merged 2 commits into main on Apr 13, 2026
Conversation
The CUDA launch config capped blockDim.x at min(D, 32), but the kernel assigns threadIdx.x as the head-dimension index. For headDim=64 (the common case with nEmbd=384, nHead=6), only the first 32 of 64 elements were computed — the rest stayed zero from store.zeros(). This broke both forward (half the attention output was zeros) and backward (half the Q/K/V gradients were zero, so those weights never updated).

Fix: use the full head dimension as blockDim.x and adjust blockDim.y to stay within CUDA's 1024 threads-per-block limit. Replace the hardcoded BLOCK_SIZE macro with blockDim.y so the kernel adapts to the actual launch configuration.

Made-with: Cursor
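A minimal standalone sketch of the launch-config pattern described above. The kernel name, grid shape, and variable names here are illustrative assumptions, not the repository's actual identifiers; the point is that threadIdx.x spans the full head dimension, blockDim.y shrinks to respect the 1024 threads-per-block limit, and the kernel reads blockDim.y instead of a hardcoded BLOCK_SIZE macro.

```cuda
// Hypothetical sketch; names are assumptions, not the repo's code.
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: threadIdx.x indexes the head dimension, threadIdx.y strides
// over query rows. Reading blockDim.y (rather than a BLOCK_SIZE macro)
// lets the kernel adapt to whatever launch configuration it is given.
__global__ void headDimKernel(float* out, int seqLen, int headDim) {
    int d = threadIdx.x;                        // head-dimension index
    for (int row = threadIdx.y; row < seqLen; row += blockDim.y) {
        out[row * headDim + d] = 1.0f;          // touch every element
    }
}

int main() {
    const int seqLen = 128, headDim = 64;       // e.g. nEmbd=384, nHead=6

    float* out;
    cudaMalloc(&out, seqLen * headDim * sizeof(float));
    cudaMemset(out, 0, seqLen * headDim * sizeof(float));

    // Buggy config was dim3 block(min(headDim, 32), BLOCK_SIZE), which
    // left elements 32..headDim-1 of every row at zero. Fixed config:
    // full head dimension in x, y sized so x * y <= 1024 threads.
    int blockY = 1024 / headDim;                // 16 rows per block here
    dim3 block(headDim, blockY);
    headDimKernel<<<1, block>>>(out, seqLen, headDim);
    cudaDeviceSynchronize();

    float h[headDim];
    cudaMemcpy(h, out, headDim * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0][32] = %.1f (was 0.0 under the 32-thread cap)\n", h[32]);

    cudaFree(out);
    return 0;
}
```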
r-chong approved these changes on Apr 12, 2026