dequantize_fp8_cache_kernel: Move D=128 device-side-assertion check to host #4869
Conversation
@ColinPeppler has exported this pull request. If you are a Meta employee, you can view the originating diff in D82320518.
…o host (pytorch#4869)

Summary:
X-link: facebookresearch/FBGEMM#1891

## What
Move the device-side assertions to the host, since all the kernels share the same assertion.

## Why
When running evals with symmetric quantization, I ran into the following error:

> CUDA error: too many resources requested for launch

It failed with this launch configuration: blockDim = (32, 32), i.e. 1024 threads per block.
- `$ cuobjdump --dump-resource-usage kv_cache.cu.pic.o.sm_90.cubin | c++filt | grep -A 1 'dequantize_fp8_cache_kernel'` reports:
  - `void fbgemm_gpu::dequantize_fp8_cache_kernel<true, true>... REG:66` (P1908720668)
- That means one threadblock needs 66 × 1024 = 67584 registers, which exceeds the per-block limit of 65536.

Differential Revision: D82320518
…rch#4868)

Summary:
X-link: facebookresearch/FBGEMM#1890

Differential Revision: D82320500