
fix: flash attention CUDA kernel truncating head dimensions > 32#39

Merged
MankyDanky merged 2 commits into main from staging
Apr 13, 2026
Conversation

@MankyDanky
Collaborator

The CUDA launch config capped blockDim.x at min(D, 32), but the kernel assigns threadIdx.x as the head-dimension index. For headDim=64 (the common case with nEmbd=384, nHead=6), only the first 32 of 64 elements were computed — the rest stayed zero from store.zeros(). This broke both forward (half the attention output was zeros) and backward (half the Q/K/V gradients were zero, so those weights never updated).

Fix: use the full head dimension as blockDim.x and adjust blockDim.y to stay within CUDA's 1024 threads-per-block limit. Replace the hardcoded BLOCK_SIZE macro with blockDim.y so the kernel adapts to the actual launch configuration.

MankyDanky and others added 2 commits April 12, 2026 16:33
fix: flash attention CUDA kernel truncating head dimensions > 32
@MankyDanky MankyDanky merged commit dd49235 into main Apr 13, 2026
28 checks passed

