Conversation

ColinPeppler
Contributor

Summary:

What

Move the device-side assertions to the host since all the kernels share the same assertion.
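
As a minimal sketch of the idea (the invariant, names, and signatures below are illustrative assumptions, not the actual kv_cache.cu code), the shared device-side assert is hoisted into a single host-side check before launch:

```cpp
#include <ATen/ATen.h>
#include <c10/macros/Macros.h>  // CUDA_KERNEL_ASSERT
#include <c10/util/Exception.h> // TORCH_CHECK

// Before: each kernel instantiation re-checks the same invariant on device,
// paying for it in every compiled variant.
__global__ void dequantize_kernel(const uint8_t* cache, int head_dim) {
  CUDA_KERNEL_ASSERT(head_dim % 8 == 0); // hypothetical shared invariant
  // ... dequantize ...
}

// After: check once on the host, then launch the assert-free kernels.
void dequantize_fp8_cache(const at::Tensor& cache) {
  TORCH_CHECK(cache.size(-1) % 8 == 0, "head_dim must be a multiple of 8");
  // ... pick a dequantize_fp8_cache_kernel<...> instantiation and launch ...
}
```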

Why

When running evals with symmetric quantization, I ran into the following error.

CUDA error: too many resources requested for launch

It failed with this launch configuration: blockDim = (32, 32), i.e. 1024 threads per block.

  • $ cuobjdump --dump-resource-usage kv_cache.cu.pic.o.sm_90.cubin | c++filt | grep -A 1 'dequantize_fp8_cache_kernel' gives me:
    • void fbgemm_gpu::dequantize_fp8_cache_kernel<true, true>... REG:66
    • P1908720668
  • That means one threadblock needs 66 registers/thread × 1024 threads = 67,584 registers, which exceeds the 65,536-register per-block limit (see the sketch below).
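
The same budget check can be reproduced with the CUDA runtime API; this is a generic sketch (the kernel here is a placeholder, not the FBGEMM one):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void some_kernel() {} // placeholder for dequantize_fp8_cache_kernel

int main() {
  cudaFuncAttributes attr;
  cudaFuncGetAttributes(&attr, reinterpret_cast<const void*>(some_kernel));

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);

  // blockDim = (32, 32) -> 1024 threads per block.
  const long threads_per_block = 32 * 32;
  const long regs_per_block = attr.numRegs * threads_per_block;

  // On SM90, prop.regsPerBlock is 65,536. At 66 regs/thread this kernel
  // needs 66 * 1024 = 67,584 registers, so the launch is rejected.
  std::printf("need %ld regs, limit %d: %s\n", regs_per_block,
              prop.regsPerBlock,
              regs_per_block > prop.regsPerBlock ? "too many resources" : "ok");
  return 0;
}
```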

Differential Revision: D82320518


netlify bot commented Sep 12, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 7cf05c6 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/68cc3b4880441e00087be8d2 |
| 😎 Deploy Preview | https://deploy-preview-4869--pytorch-fbgemm-docs.netlify.app |

@meta-cla meta-cla bot added the cla signed label Sep 12, 2025
@facebook-github-bot
Contributor

@ColinPeppler has exported this pull request. If you are a Meta employee, you can view the originating diff in D82320518.

ColinPeppler added a commit to ColinPeppler/FBGEMM that referenced this pull request Sep 18, 2025
…o host (pytorch#4869)

Summary:
X-link: facebookresearch/FBGEMM#1891


## What
Move the device-side assertions to the host since all the kernels share the same assertion.

## Why
When running evals with symmetric quantization, I ran into the following error.

> CUDA error: too many resources requested for launch

It failed with this launch configuration: blockDim = (32, 32), i.e. 1024 threads per block.
- `$ cuobjdump --dump-resource-usage kv_cache.cu.pic.o.sm_90.cubin | c++filt | grep -A 1 'dequantize_fp8_cache_kernel'` gives me
  - `void fbgemm_gpu::dequantize_fp8_cache_kernel<true, true>... REG:66`
  - P1908720668
- That means one threadblock needs 66 registers/thread × 1024 threads = 67,584 registers, which exceeds the 65,536-register per-block limit.

Differential Revision: D82320518
@facebook-github-bot
Contributor

@ColinPeppler has exported this pull request. If you are a Meta employee, you can view the originating diff in D82320518.

