[ET-VK][qconv] Add flexible layout impl for quantized pointwise conv #17221
This commit adds a flexible memory layout implementation for quantized pointwise
(1x1) convolution in the ExecuTorch Vulkan backend. The key changes introduce a
new operator (etvk.q8ta_conv2d_pw) that can handle multiple int8 tensor memory
layouts, rather than being restricted to a single fixed layout.
Key Components Added
1. Two New GLSL Compute Shaders
- q8ta_conv2d_pw.glsl: The primary flexible-layout shader. It uses
BufferMetadata UBOs and layout specialization constants to support multiple
memory layouts (kPackedInt8_4C1W, kPackedInt8_4W4C, kPackedInt8_4C), and
scalar array indexing for output writes to handle the different stride
patterns.
- q8ta_conv2d_pw_4w4c_ref.glsl: A reference implementation specific to the
4W4C layout that uses simpler ivec4 indexing. It is currently not enabled in
production (gated behind if (false) in the C++ dispatch code).
Both shaders use:
- 4×8 output tiling (TILE_M=4 width positions × TILE_N=8 output channels per
thread)
- dotPacked4x8AccSatEXT for efficient int8 dot products (modeled by the scalar
sketch after this section)
- A texture2D for weight storage and buffers for input/output
- Per-channel weight quantization with symmetric int8 weights
2. C++ Operator Implementation (Q8taConv2dPW.cpp)
- prepack_quantized_conv2d_pw_weight(): Prepacks int8 weights into a texture2D
layout optimized for the shader's access pattern (see the packing sketch after
this section)
- add_q8ta_conv2d_pw_node(): Dispatches the flexible-layout shader with buffer
metadata UBOs
- add_q8ta_conv2d_pw_4w4c_node(): Dispatches the 4W4C-specific reference shader
- q8ta_conv2d_pw(): High-level operator that handles argument parsing, weight
prepacking, and kernel selection
3. Test Infrastructure Updates
- TestQ8taConv2d.cpp: Added test_q8ta_conv2d_pw() test operator that wraps
quantize → conv2d_pw → dequantize for end-to-end testing
- test_q8ta_conv2d_pw.cpp: Comprehensive test suite with:
- Multiple channel configurations (3→32, 32→64, 64→96, 7→13, 40→80, etc.)
- Performance test cases (480→160, 48→22, 128→128, 576→64 channels)
- Tests across all three memory layouts: kPackedInt8_4C1W, kPackedInt8_4W4C,
and kPackedInt8_4C
- Both texture and buffer storage types for floating-point tensors
- Reference implementation comparison for correctness validation
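The per-thread inner loop is easiest to see in scalar form. The C++ sketch
below emulates the semantics of dotPacked4x8AccSatEXT (four int8 lanes per
32-bit word, multiplied pairwise, summed, and added to the accumulator with
signed saturation) and the 4×8 tile accumulation it feeds. Function and
variable names are illustrative, not taken from the shaders.

```cpp
#include <cstdint>
#include <limits>

// Scalar model of the packed int8 dot product used by the shaders: each
// 32-bit word packs four int8 lanes; the four lane products are summed and
// added to the accumulator with signed saturation. This is an illustrative
// emulation of the GLSL built-in, not the built-in itself.
int32_t dot_packed_4x8_acc_sat(uint32_t a, uint32_t b, int32_t acc) {
  int64_t sum = acc;
  for (int lane = 0; lane < 4; ++lane) {
    const int8_t av = static_cast<int8_t>((a >> (8 * lane)) & 0xFFu);
    const int8_t bv = static_cast<int8_t>((b >> (8 * lane)) & 0xFFu);
    sum += static_cast<int64_t>(av) * bv;
  }
  // Saturate to the int32 range, mirroring the AccSat semantics.
  if (sum > std::numeric_limits<int32_t>::max())
    return std::numeric_limits<int32_t>::max();
  if (sum < std::numeric_limits<int32_t>::min())
    return std::numeric_limits<int32_t>::min();
  return static_cast<int32_t>(sum);
}

// Per-thread accumulation over a 4 (width) x 8 (out-channel) output tile.
// in_packed[m] holds one block of 4 int8 input channels at width position m;
// w_packed[n] holds the matching 4 int8 weights for output channel n.
void accumulate_tile(int32_t acc[4][8],
                     const uint32_t in_packed[4],
                     const uint32_t w_packed[8]) {
  for (int m = 0; m < 4; ++m)    // TILE_M: 4 width positions
    for (int n = 0; n < 8; ++n)  // TILE_N: 8 output channels
      acc[m][n] = dot_packed_4x8_acc_sat(in_packed[m], w_packed[n], acc[m][n]);
}
```

In the shaders, one such tile covers an input-channel block of 4; the full
result accumulates these calls across all input-channel blocks.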
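For the weight prepacking, the core idea is that four consecutive int8 input
channels are packed into one 32-bit word so a single fetch feeds one packed
dot product. The sketch below shows that 4-per-word packing for a [OC][IC]
1x1 weight; the actual texture2D arrangement produced by
prepack_quantized_conv2d_pw_weight() may order blocks differently, so treat
this as a hypothetical illustration of the packing step only.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical packing for a 1x1 conv weight of shape [OC][IC]: four
// consecutive int8 input channels go into one uint32, with IC padded up to a
// multiple of 4. The real prepack also arranges these words into a texture2D
// matched to the shader's access pattern, which is not shown here.
std::vector<uint32_t> pack_weights(const std::vector<int8_t>& w, int OC, int IC) {
  const int ic_blocks = (IC + 3) / 4;  // IC rounded up to a multiple of 4
  std::vector<uint32_t> packed(static_cast<size_t>(OC) * ic_blocks, 0);
  for (int oc = 0; oc < OC; ++oc) {
    for (int ic = 0; ic < IC; ++ic) {
      const uint32_t byte = static_cast<uint8_t>(w[oc * IC + ic]);
      packed[oc * ic_blocks + ic / 4] |= byte << (8 * (ic % 4));
    }
  }
  return packed;
}
```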
Architecture
The shader handles layout flexibility via:
1. Layout specialization constants (outp_layout, inp_layout) passed from C++
2. BufferMetadata UBOs providing runtime strides for input/output tensors
3. A compute_outp_buffer_idx() function that computes the correct buffer
indices based on the layout (modeled in the sketch after this list)
4. get_outer_packed_dim_block_size() from block_indexing.glslh to determine
stride patterns
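As a rough model of steps 2 and 3, the sketch below shows how a flat buffer
index can be derived from logical coordinates using runtime strides supplied
in a UBO. The struct layout and function name are simplifications; the real
compute_outp_buffer_idx() also has to account for the 4-element int8 packing
along the packed dimension.

```cpp
#include <cstdint>

// Simplified model of layout-flexible indexing: the shader reads the
// tensor's runtime strides from a BufferMetadata UBO and computes a flat
// buffer index from logical (N, C, H, W) coordinates, so one kernel can
// address any of the supported layouts. Names here are illustrative.
struct BufferMetadata {
  uint32_t strides[4];  // strides for (N, C, H, W), in packed-block units
};

uint32_t compute_buffer_idx(const BufferMetadata& meta,
                            uint32_t n, uint32_t c, uint32_t h, uint32_t w) {
  return n * meta.strides[0] + c * meta.strides[1] +
         h * meta.strides[2] + w * meta.strides[3];
}
```

Because the strides come in at runtime and the layout enum comes in as a
specialization constant, the same compiled kernel family covers all three
int8 layouts without per-layout shader variants for the indexing logic.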
Differential Revision: [D92307253](https://our.internmc.facebook.com/intern/diff/D92307253/)