Preshuffled BF16I4 Gemm Kernel #3913

Closed

@jwfromm (Contributor) commented Apr 2, 2025

Summary: This diff adds a preshuffled BF16I4 mixed-dtype GEMM kernel built on CUTLASS. It shows substantial speedups for some shapes compared to a BF16 x BF16 GEMM backed by cuBLAS.

Differential Revision: D72270467
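
For readers unfamiliar with the term, a BF16 x INT4 mixed-dtype GEMM multiplies higher-precision activations by weights stored as packed 4-bit integers plus scales. The pure-Python reference below is only a sketch of that arithmetic, not the kernel in this diff. All function names are hypothetical, Python floats stand in for BF16, and the symmetric groupwise scaling and low-nibble-first packing are assumptions. The preshuffle step (presumably an offline reordering of the packed weights into the layout the kernel reads) is omitted because this PR does not describe the layout.

```python
# Illustrative reference only; none of these names come from FBGEMM or CUTLASS.

def pack_int4(values):
    """Pack signed int4 values (-8..7) two per byte, low nibble first (assumed order)."""
    assert len(values) % 2 == 0
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        assert -8 <= lo <= 7 and -8 <= hi <= 7
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed):
    """Inverse of pack_int4: recover signed int4 values from packed bytes."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


def quantize_row(row, group_size):
    """Symmetric groupwise quantization (assumed scheme): one scale per group."""
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return pack_int4(q), scales


def mixed_gemm(x, packed_w, scales, group_size):
    """Compute y[m][n] = sum_k x[m][k] * dequantized w[n][k]."""
    out = []
    for x_row in x:
        out_row = []
        for packed_row, row_scales in zip(packed_w, scales):
            q = unpack_int4(packed_row)
            acc = 0.0
            for k, (a, w) in enumerate(zip(x_row, q)):
                acc += a * w * row_scales[k // group_size]
            out_row.append(acc)
        out.append(out_row)
    return out
```

A real kernel would fuse the unpack and scale steps into the matrix-multiply main loop rather than materializing dequantized weights; the sketch separates them only for readability.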

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D72270467

netlify bot commented Apr 2, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit 2ac7e7d
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67edaee3820297000850e6f7
😎 Deploy Preview https://deploy-preview-3913--pytorch-fbgemm-docs.netlify.app

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Apr 2, 2025
Summary:
X-link: facebookresearch/FBGEMM#1003


This diff adds a preshuffled BF16I4 mixed-dtype GEMM kernel built on CUTLASS. It shows substantial speedups for some shapes compared to a BF16 x BF16 GEMM backed by cuBLAS.

Differential Revision: D72270467
@jwfromm force-pushed the export-D72270467 branch from 3260bd3 to f4c60d4 on April 2, 2025 00:59
@jwfromm force-pushed the export-D72270467 branch from f4c60d4 to fc2e0a8 on April 2, 2025 02:27
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Apr 2, 2025
Summary:
X-link: facebookresearch/FBGEMM#1003


This diff adds a preshuffled BF16I4 mixed-dtype GEMM kernel built on CUTLASS. It shows substantial speedups for some shapes compared to a BF16 x BF16 GEMM backed by cuBLAS. Notably, the preshuffled kernel is 1.5-2x faster than the standard BF16I4 GEMM for most shapes.

Differential Revision: D72270467
@jwfromm force-pushed the export-D72270467 branch from fc2e0a8 to 38fb7bb on April 2, 2025 02:32
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Apr 2, 2025
Summary:
X-link: facebookresearch/FBGEMM#1003


This diff adds a preshuffled BF16I4 mixed-dtype GEMM kernel built on CUTLASS. It shows substantial speedups for some shapes compared to a BF16 x BF16 GEMM backed by cuBLAS. Notably, the preshuffled kernel is 1.5-2x faster than the standard BF16I4 GEMM for most shapes.

Compared to other mixed-dtype kernels such as Marlin and Machete, this new kernel appears to be the best performer on average.

{F1976677491}

Reviewed By: jianyuh

Differential Revision: D72270467
@jwfromm force-pushed the export-D72270467 branch from 38fb7bb to 2ac7e7d on April 2, 2025 21:40
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Apr 2, 2025
Summary:
X-link: facebookresearch/FBGEMM#1003


This diff adds a preshuffled BF16I4 mixed-dtype GEMM kernel built on CUTLASS. It shows substantial speedups for some shapes compared to a BF16 x BF16 GEMM backed by cuBLAS. Notably, the preshuffled kernel is 1.5-2x faster than the standard BF16I4 GEMM for most shapes.

Compared to other mixed-dtype kernels such as Marlin and Machete, this new kernel appears to be the best performer on average.

{F1976677491}

Reviewed By: jianyuh, jiawenliu64

Differential Revision: D72270467
@facebook-github-bot (Contributor)

This pull request has been merged in 8cbb32c.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025
Summary:
Pull Request resolved: facebookresearch/FBGEMM#1003

X-link: pytorch#3913

This diff adds a preshuffled BF16I4 mixed-dtype GEMM kernel built on CUTLASS. It shows substantial speedups for some shapes compared to a BF16 x BF16 GEMM backed by cuBLAS. Notably, the preshuffled kernel is 1.5-2x faster than the standard BF16I4 GEMM for most shapes.

Compared to other mixed-dtype kernels such as Marlin and Machete, this new kernel appears to be the best performer on average.

{F1976677491}

Reviewed By: jianyuh, jiawenliu64

Differential Revision: D72270467

fbshipit-source-id: 8426afd6587547083b8307f515cda49145939554