cuDNN Forward Attention + FP16 non-cuDNN version in /dev/cuda/ #215
Previous Kernel 4: 1.74ms
Kernel 4 with TF32: 1.70ms
Kernel 5 (4 with BF16 I/O): 0.91ms
Kernel 6 (5 without permute, not realistic): 0.76ms
Kernel 10 (cuDNN BF16, with FP32 conversion): 0.33ms
Kernel 11 (cuDNN BF16 with direct BF16 inputs): 0.13ms
This has been a mess to get working. For example, I wasted 3+ hours before realising that even for cuBLASLt calls with explicit type parameters, alpha/beta need to be FP16 with CUBLAS_COMPUTE_16F and FP32 with CUBLAS_COMPUTE_32F, but there are zero warnings if you get it wrong, it just returns garbage :(
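To make the gotcha concrete, here is a minimal sketch (not the PR's code) of a cuBLASLt FP16 matmul where alpha/beta are passed as `__half` to match `CUBLAS_COMPUTE_16F` / `CUDA_R_16F`; with `CUBLAS_COMPUTE_32F` and `CUDA_R_32F` they would have to be `float` instead, and mixing them up fails silently:

```cuda
// Hedged sketch of the alpha/beta type trap in cuBLASLt (illustrative, not from the PR).
#include <cstdio>
#include <cstdlib>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cublasLt.h>

#define CHECK_CUBLAS(call) do { \
    cublasStatus_t s_ = (call); \
    if (s_ != CUBLAS_STATUS_SUCCESS) { printf("cuBLAS error %d at line %d\n", (int)s_, __LINE__); exit(1); } \
} while (0)

int main() {
    const int M = 64, N = 64, K = 64;
    __half *A, *B, *C;
    cudaMalloc(&A, M * K * sizeof(__half));
    cudaMalloc(&B, K * N * sizeof(__half));
    cudaMalloc(&C, M * N * sizeof(__half));
    cudaMemset(A, 0, M * K * sizeof(__half));
    cudaMemset(B, 0, K * N * sizeof(__half));

    cublasLtHandle_t lt;
    CHECK_CUBLAS(cublasLtCreate(&lt));

    // Compute type CUBLAS_COMPUTE_16F implies scale type CUDA_R_16F,
    // so alpha/beta must be __half. With CUBLAS_COMPUTE_32F the scale type
    // is CUDA_R_32F and alpha/beta must be float.
    cublasLtMatmulDesc_t op;
    CHECK_CUBLAS(cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_16F, CUDA_R_16F));

    cublasLtMatrixLayout_t aL, bL, cL;
    CHECK_CUBLAS(cublasLtMatrixLayoutCreate(&aL, CUDA_R_16F, M, K, M));
    CHECK_CUBLAS(cublasLtMatrixLayoutCreate(&bL, CUDA_R_16F, K, N, K));
    CHECK_CUBLAS(cublasLtMatrixLayoutCreate(&cL, CUDA_R_16F, M, N, M));

    // Correct for CUBLAS_COMPUTE_16F: __half scalars.
    // Passing float values here would not raise an error, only wrong results.
    __half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    CHECK_CUBLAS(cublasLtMatmul(lt, op, &alpha, A, aL, B, bL, &beta,
                                C, cL, C, cL, nullptr, nullptr, 0, 0));
    cudaDeviceSynchronize();

    cublasLtMatrixLayoutDestroy(aL);
    cublasLtMatrixLayoutDestroy(bL);
    cublasLtMatrixLayoutDestroy(cL);
    cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(lt);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```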
I still haven't managed to get the cuDNN backward pass to give correct results, which means I can't integrate the forward pass as an option for the full training run in train_gpt2.cu: our current backward pass requires the "att" tensor, which cuDNN doesn't provide (it returns its own stats tensor instead), unfortunately.
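For reference, the flash-attention-style forward in cuDNN keeps only a per-query-row softmax statistic (effectively the log-sum-exp of the scaled scores) rather than materialising the full "att" matrix. A hedged sketch of how the two representations relate, with illustrative names not taken from the PR (single head, causal masking, scale = 1/sqrt(hs) assumed):

```cuda
// Illustrative sketch only: recompute one att entry from q, k and the per-row
// softmax statistic. stats[t] is assumed to hold logsumexp over the scaled
// scores of the valid (causal) keys for query row t.
__global__ void recompute_att_row(const float* q, const float* k,
                                  const float* stats, float* att,
                                  int T, int hs, float scale) {
    int t  = blockIdx.x;                                  // query position
    int t2 = blockIdx.y * blockDim.x + threadIdx.x;       // key position
    if (t2 >= T || t2 > t) return;                        // causal mask

    float score = 0.0f;
    for (int i = 0; i < hs; i++) {
        score += q[t * hs + i] * k[t2 * hs + i];
    }
    score *= scale;

    // softmax value without ever storing the full pre-softmax row
    att[t * T + t2] = expf(score - stats[t]);
}
```

This only illustrates why the stats tensor and "att" are not interchangeable as-is; an actual integration would presumably rewrite the backward pass to consume the stats tensor directly rather than reconstructing "att".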