Preshuffled BF16I4 Gemm Kernel #3913

jwfromm · 2025-04-02T00:55:22Z

Summary: This diff adds a preshuffled BF16I4 mixed-dtype kernel using CUTLASS. Performance is compelling, with substantial speedups for some shapes compared to BF16 x BF16 GEMM backed by cuBLAS.

Differential Revision: D72270467
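For readers unfamiliar with the term, a BF16 x INT4 mixed-dtype GEMM multiplies floating-point activations by weights stored as packed 4-bit integers that are dequantized on the fly. The pure-Python sketch below illustrates only that reference math. It is not the kernel from this diff: the nibble order, the per-group scale layout, and the names `pack_int4`, `unpack_int4`, and `mixed_gemm_reference` are all assumptions made for illustration, and the preshuffled weight layout itself is not described in this thread, so it is not modeled here.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first.

    The nibble order is an assumption for illustration only.
    """
    if len(values) % 2:
        raise ValueError("need an even number of values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (-8 <= lo <= 7 and -8 <= hi <= 7):
            raise ValueError("value out of int4 range")
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed):
    """Inverse of pack_int4: recover signed 4-bit values from bytes."""
    vals = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            vals.append(nibble - 16 if nibble >= 8 else nibble)
    return vals


def mixed_gemm_reference(x, w_packed, scales, n, k, group_size):
    """Reference y = x @ dequant(w).T for row-major weights of shape [n, k].

    x is a list of rows of floats (standing in for BF16 activations).
    scales[j][g] is a hypothetical per-row, per-group scale, with
    group g covering columns g * group_size .. (g + 1) * group_size - 1.
    """
    w = unpack_int4(w_packed)
    y = []
    for row in x:
        out_row = []
        for j in range(n):
            acc = 0.0
            for i in range(k):
                acc += row[i] * scales[j][i // group_size] * w[j * k + i]
            out_row.append(acc)
        y.append(out_row)
    return y
```

For example, `mixed_gemm_reference([[1.0, 2.0]], pack_int4([3, -4]), [[0.5]], n=1, k=2, group_size=2)` multiplies one activation row by a single quantized weight row. A production kernel would fuse the unpack and scale steps into the matrix-multiply main loop rather than materializing the dequantized weights.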

facebook-github-bot · 2025-04-02T00:55:31Z

This pull request was exported from Phabricator. Differential Revision: D72270467

netlify · 2025-04-02T00:55:44Z

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`2ac7e7d`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67edaee3820297000850e6f7
😎 Deploy Preview	https://deploy-preview-3913--pytorch-fbgemm-docs.netlify.app

Summary:
X-link: facebookresearch/FBGEMM#1003

This diff adds a preshuffled BF16I4 mixed-dtype kernel using CUTLASS. Performance is compelling, with substantial speedups for some shapes compared to BF16 x BF16 GEMM backed by cuBLAS.

Differential Revision: D72270467

facebook-github-bot · 2025-04-02T00:59:14Z

This pull request was exported from Phabricator. Differential Revision: D72270467

Summary:
X-link: facebookresearch/FBGEMM#1003

This diff adds a preshuffled BF16I4 mixed-dtype kernel using CUTLASS. Performance is compelling, with substantial speedups for some shapes compared to BF16 x BF16 GEMM backed by cuBLAS. Notably, this preshuffle approach is 1.5-2X faster than the standard BF16I4 GEMM for most shapes.

Differential Revision: D72270467

facebook-github-bot · 2025-04-02T02:27:31Z

This pull request was exported from Phabricator. Differential Revision: D72270467

facebook-github-bot · 2025-04-02T02:32:41Z

This pull request was exported from Phabricator. Differential Revision: D72270467

Summary:
X-link: facebookresearch/FBGEMM#1003

This diff adds a preshuffled BF16I4 mixed-dtype kernel using CUTLASS. Performance is compelling, with substantial speedups for some shapes compared to BF16 x BF16 GEMM backed by cuBLAS. Notably, this preshuffle approach is 1.5-2X faster than the standard BF16I4 GEMM for most shapes.

Compared to other mixed-dtype kernels such as Marlin and Machete, this new kernel appears to be the best performer on average.

{F1976677491}

Reviewed By: jianyuh

Differential Revision: D72270467

facebook-github-bot · 2025-04-02T21:40:58Z

This pull request was exported from Phabricator. Differential Revision: D72270467

Summary:
X-link: facebookresearch/FBGEMM#1003

This diff adds a preshuffled BF16I4 mixed-dtype kernel using CUTLASS. Performance is compelling, with substantial speedups for some shapes compared to BF16 x BF16 GEMM backed by cuBLAS. Notably, this preshuffle approach is 1.5-2X faster than the standard BF16I4 GEMM for most shapes.

Compared to other mixed-dtype kernels such as Marlin and Machete, this new kernel appears to be the best performer on average.

{F1976677491}

Reviewed By: jianyuh, jiawenliu64

Differential Revision: D72270467

facebook-github-bot · 2025-04-04T19:05:47Z

This pull request has been merged in 8cbb32c.

Summary:
Pull Request resolved: facebookresearch/FBGEMM#1003

X-link: pytorch#3913

This diff adds a preshuffled BF16I4 mixed-dtype kernel using CUTLASS. Performance is compelling, with substantial speedups for some shapes compared to BF16 x BF16 GEMM backed by cuBLAS. Notably, this preshuffle approach is 1.5-2X faster than the standard BF16I4 GEMM for most shapes.

Compared to other mixed-dtype kernels such as Marlin and Machete, this new kernel appears to be the best performer on average.

{F1976677491}

Reviewed By: jianyuh, jiawenliu64

Differential Revision: D72270467

fbshipit-source-id: 8426afd6587547083b8307f515cda49145939554

facebook-github-bot added the cla signed label Apr 2, 2025

facebook-github-bot added the fb-exported label Apr 2, 2025

jwfromm force-pushed the export-D72270467 branch from 3260bd3 to f4c60d4 Compare April 2, 2025 00:59

jwfromm force-pushed the export-D72270467 branch from f4c60d4 to fc2e0a8 Compare April 2, 2025 02:27

jwfromm force-pushed the export-D72270467 branch from fc2e0a8 to 38fb7bb Compare April 2, 2025 02:32

jwfromm force-pushed the export-D72270467 branch from 38fb7bb to 2ac7e7d Compare April 2, 2025 21:40

facebook-github-bot closed this in 8cbb32c Apr 4, 2025

facebook-github-bot added the Merged label Apr 4, 2025
