dequantize_fp8_cache_kernel: Move D=128 device-side-assertion check to host #4869
Conversation
@ColinPeppler has exported this pull request. If you are a Meta employee, you can view the originating diff in D82320518.
…o host (pytorch#4869)

Summary:
X-link: facebookresearch/FBGEMM#1891

## What
Move the device-side assertions to the host, since all the kernels share the same assertion.

## Why
When running evals with symmetric quantization, I ran into the following error:

> CUDA error: too many resources requested for launch

It failed with this launch configuration: blockDim = (32, 32), i.e. 1024 threads per block.
- `$ cuobjdump --dump-resource-usage kv_cache.cu.pic.o.sm_90.cubin | c++filt | grep -A 1 'dequantize_fp8_cache_kernel'` reports:
  - `void fbgemm_gpu::dequantize_fp8_cache_kernel<true, true>... REG:66` (P1908720668)
- That means one threadblock needs 66 × 1024 = 67584 registers, which exceeds the per-block limit of 65536.

Differential Revision: D82320518
…rch#4868)

Summary:
X-link: facebookresearch/FBGEMM#1890

Differential Revision: D82320500