[ET-VK][qconv] Add flexible layout impl for quantized pointwise conv #17221
This commit adds a flexible memory layout implementation for quantized pointwise
(1x1) convolution in the ExecuTorch Vulkan backend. The key changes introduce a
new operator (etvk.q8ta_conv2d_pw) that can handle multiple int8 tensor memory
layouts, rather than being restricted to a single fixed layout.
Key Components Added
1. Two New GLSL Compute Shaders
- q8ta_conv2d_pw.glsl: The primary flexible-layout shader. It uses
BufferMetadata UBOs and layout specialization constants to support multiple
memory layouts (kPackedInt8_4C1W, kPackedInt8_4W4C, kPackedInt8_4C), and
scalar array indexing for output writes to handle the different stride
patterns.
- q8ta_conv2d_pw_4w4c_ref.glsl: A reference implementation specific to the
4W4C layout that uses simpler ivec4 indexing. It is currently not enabled in
production (gated behind if (false) in the C++ dispatch code).
Both shaders use:
- 4×8 output tiling (TILE_M=4 width positions × TILE_N=8 output channels per
thread)
- dotPacked4x8AccSatEXT for efficient int8 dot products (modeled by the scalar
sketch after this section)
- A texture2D for weight storage and buffers for input/output
- Per-channel weight quantization with symmetric int8 weights
2. C++ Operator Implementation (Q8taConv2dPW.cpp)
- prepack_quantized_conv2d_pw_weight(): Prepacks int8 weights into a texture2D
layout optimized for the shader's access pattern (see the packing sketch after
this section)
- add_q8ta_conv2d_pw_node(): Dispatches the flexible-layout shader with buffer
metadata UBOs
- add_q8ta_conv2d_pw_4w4c_node(): Dispatches the 4W4C-specific reference shader
- q8ta_conv2d_pw(): High-level operator that handles argument parsing, weight
prepacking, and kernel selection
3. Test Infrastructure Updates
- TestQ8taConv2d.cpp: Added test_q8ta_conv2d_pw() test operator that wraps
quantize → conv2d_pw → dequantize for end-to-end testing
- test_q8ta_conv2d_pw.cpp: Comprehensive test suite with:
- Multiple channel configurations (3→32, 32→64, 64→96, 7→13, 40→80, etc.)
- Performance test cases (480→160, 48→22, 128→128, 576→64 channels)
- Tests across all three memory layouts: kPackedInt8_4C1W, kPackedInt8_4W4C,
and kPackedInt8_4C
- Both texture and buffer storage types for floating-point tensors
- Reference implementation comparison for correctness validation
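The per-thread inner loop is easiest to see in scalar form. The C++ sketch
below emulates the semantics of dotPacked4x8AccSatEXT (four int8 lanes per
32-bit word, multiplied pairwise, summed, and added to the accumulator with
signed saturation) and the 4×8 tile accumulation it feeds. Function and
variable names are illustrative, not taken from the shaders.

```cpp
#include <cstdint>
#include <limits>

// Scalar model of the packed int8 dot product used by the shaders: each
// 32-bit word packs four int8 lanes; the four lane products are summed and
// added to the accumulator with signed saturation. This is an illustrative
// emulation of the GLSL built-in, not the built-in itself.
int32_t dot_packed_4x8_acc_sat(uint32_t a, uint32_t b, int32_t acc) {
  int64_t sum = acc;
  for (int lane = 0; lane < 4; ++lane) {
    const int8_t av = static_cast<int8_t>((a >> (8 * lane)) & 0xFFu);
    const int8_t bv = static_cast<int8_t>((b >> (8 * lane)) & 0xFFu);
    sum += static_cast<int64_t>(av) * bv;
  }
  // Saturate to the int32 range, mirroring the AccSat semantics.
  if (sum > std::numeric_limits<int32_t>::max())
    return std::numeric_limits<int32_t>::max();
  if (sum < std::numeric_limits<int32_t>::min())
    return std::numeric_limits<int32_t>::min();
  return static_cast<int32_t>(sum);
}

// Per-thread accumulation over a 4 (width) x 8 (out-channel) output tile.
// in_packed[m] holds one block of 4 int8 input channels at width position m;
// w_packed[n] holds the matching 4 int8 weights for output channel n.
void accumulate_tile(int32_t acc[4][8],
                     const uint32_t in_packed[4],
                     const uint32_t w_packed[8]) {
  for (int m = 0; m < 4; ++m)    // TILE_M: 4 width positions
    for (int n = 0; n < 8; ++n)  // TILE_N: 8 output channels
      acc[m][n] = dot_packed_4x8_acc_sat(in_packed[m], w_packed[n], acc[m][n]);
}
```

In the shaders, one such tile covers an input-channel block of 4; the full
result accumulates these calls across all input-channel blocks.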
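For the weight prepacking, the core idea is that four consecutive int8 input
channels are packed into one 32-bit word so a single fetch feeds one packed
dot product. The sketch below shows that 4-per-word packing for a [OC][IC]
1x1 weight; the actual texture2D arrangement produced by
prepack_quantized_conv2d_pw_weight() may order blocks differently, so treat
this as a hypothetical illustration of the packing step only.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical packing for a 1x1 conv weight of shape [OC][IC]: four
// consecutive int8 input channels go into one uint32, with IC padded up to a
// multiple of 4. The real prepack also arranges these words into a texture2D
// matched to the shader's access pattern, which is not shown here.
std::vector<uint32_t> pack_weights(const std::vector<int8_t>& w, int OC, int IC) {
  const int ic_blocks = (IC + 3) / 4;  // IC rounded up to a multiple of 4
  std::vector<uint32_t> packed(static_cast<size_t>(OC) * ic_blocks, 0);
  for (int oc = 0; oc < OC; ++oc) {
    for (int ic = 0; ic < IC; ++ic) {
      const uint32_t byte = static_cast<uint8_t>(w[oc * IC + ic]);
      packed[oc * ic_blocks + ic / 4] |= byte << (8 * (ic % 4));
    }
  }
  return packed;
}
```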
Architecture
The shader handles layout flexibility via:
1. Layout specialization constants (outp_layout, inp_layout) passed from C++
2. BufferMetadata UBOs providing runtime strides for input/output tensors
3. A compute_outp_buffer_idx() function that computes the correct buffer
indices based on the layout (modeled in the sketch after this list)
4. get_outer_packed_dim_block_size() from block_indexing.glslh to determine
stride patterns
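As a rough model of steps 2 and 3, the sketch below shows how a flat buffer
index can be derived from logical coordinates using runtime strides supplied
in a UBO. The struct layout and function name are simplifications; the real
compute_outp_buffer_idx() also has to account for the 4-element int8 packing
along the packed dimension.

```cpp
#include <cstdint>

// Simplified model of layout-flexible indexing: the shader reads the
// tensor's runtime strides from a BufferMetadata UBO and computes a flat
// buffer index from logical (N, C, H, W) coordinates, so one kernel can
// address any of the supported layouts. Names here are illustrative.
struct BufferMetadata {
  uint32_t strides[4];  // strides for (N, C, H, W), in packed-block units
};

uint32_t compute_buffer_idx(const BufferMetadata& meta,
                            uint32_t n, uint32_t c, uint32_t h, uint32_t w) {
  return n * meta.strides[0] + c * meta.strides[1] +
         h * meta.strides[2] + w * meta.strides[3];
}
```

Because the strides come in at runtime and the layout enum comes in as a
specialization constant, the same compiled kernel family covers all three
int8 layouts without per-layout shader variants for the indexing logic.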
Differential Revision: [D92307253](https://our.internmc.facebook.com/intern/diff/D92307253/)