
BF16I4 Preshuffled Grouped Gemm #3917

Closed · wants to merge 2 commits

Conversation

@jwfromm (Contributor) commented Apr 2, 2025

Summary:
This diff adds a preshuffled variant of BF16I4 Grouped GEMM. Notably, CUTLASS does not currently support zero points for grouped GEMM, so this kernel must be used without them (i.e., with symmetric, scales-only quantization). That said, the kernel's accuracy appears reasonable and its performance is very compelling.
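
Because zero points are unavailable, weights consumed by this kernel must be quantized symmetrically, with per-group scales only. A minimal sketch of that scheme, assuming nothing about the actual FBGEMM quantization helpers (the function name and group size below are illustrative):

```python
import torch

def quantize_int4_symmetric(w: torch.Tensor, group_size: int = 128):
    """Group-wise symmetric int4 quantization: per-group scales, no zero points.

    Sketch only: illustrates the scales-only weight format; not the FBGEMM API.
    """
    n, k = w.shape
    assert k % group_size == 0
    groups = w.float().view(n, k // group_size, group_size)
    # Symmetric quantization maps each group's max magnitude to the int4
    # extreme 7, so zero always maps to zero and no zero point is needed.
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = (groups / scales).round().clamp(-8, 7).to(torch.int8)
    return q.view(n, k), scales.view(n, k // group_size)
```

Dequantization is then a single per-group multiply by the scale, which is exactly what lets the kernel skip the zero-point subtraction.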

{F1976716898}

Differential Revision: D72337760

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D72337760


netlify bot commented Apr 2, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: f0dfb6f
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67edbca04c4f4e00086f28dc
😎 Deploy Preview: https://deploy-preview-3917--pytorch-fbgemm-docs.netlify.app

jwfromm added 2 commits April 2, 2025 15:39
Summary:
X-link: facebookresearch/FBGEMM#1003


This diff adds a preshuffled BF16I4 mixed-dtype kernel using CUTLASS. Performance is quite compelling, with substantial speedups for some shapes compared to a bf16 x bf16 GEMM backed by cuBLAS. Notably, the preshuffle approach is 1.5-2X faster than the standard bf16i4 GEMM for most shapes.

Compared to other mixed-dtype kernels such as Marlin and Machete, this new kernel is likely the best average performer.
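
For context, a comparison like this can be reproduced with a simple CUDA-event timer. In the sketch below, `preshuffled_op` is a hypothetical stand-in for the new kernel's entry point (not named in this summary), and the baseline is the cuBLAS-backed bf16 matmul:

```python
import torch

def bench_ms(fn, warmup: int = 10, iters: int = 100) -> float:
    """Mean milliseconds per call, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

m = n = k = 8192
x = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
w = torch.randn(n, k, dtype=torch.bfloat16, device="cuda")
# Baseline: bf16 x bf16 GEMM backed by cuBLAS.
baseline_ms = bench_ms(lambda: x @ w.t())
# speedup = baseline_ms / bench_ms(lambda: preshuffled_op(x, q, scales))
```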

{F1976677491}

Reviewed By: jianyuh, jiawenliu64

Differential Revision: D72270467
Summary:
X-link: facebookresearch/FBGEMM#1006


This diff adds a preshuffled variant of BF16I4 Grouped GEMM. Notably, CUTLASS does not currently support zero points for grouped GEMM, so this kernel must be used without them (i.e., with symmetric, scales-only quantization). That said, the kernel's accuracy appears reasonable and its performance is very compelling.

 {F1976716898}

Reviewed By: jiawenliu64

Differential Revision: D72337760
@jwfromm force-pushed the export-D72337760 branch from 61b5a99 to f0dfb6f on April 2, 2025 at 22:39

@facebook-github-bot (Contributor)

This pull request has been merged in c407f65.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025
Summary:
Pull Request resolved: facebookresearch/FBGEMM#1006

X-link: pytorch#3917

This diff adds a preshuffled variant of BF16I4 Grouped GEMM. Notably, CUTLASS does not currently support zero points for grouped GEMM, so this kernel must be used without them (i.e., with symmetric, scales-only quantization). That said, the kernel's accuracy appears reasonable and its performance is very compelling.

 {F1976716898}

Reviewed By: jiawenliu64

Differential Revision: D72337760

fbshipit-source-id: a2cf9e913d095da42f1cf88a5c08dbbe1f2794c9