[ET-VK] Add alignment fields to PackedDimInfo for padded size calculation #17260
Conversation
… per-shader timing

Pull Request resolved: #17105

This change improves the benchmark test harness in three ways:

1. **Reference computation caching**: Test cases are now grouped by a `ReferenceKey` that captures the inputs affecting the reference output (sizes, dtype, data generation type). The reference computation runs once per group and the results are reused, significantly speeding up test suites with many storage/layout variations of the same logical test case.
2. **Per-shader timing breakdown**: Benchmark output now shows individual shader execution times with global and local workgroup sizes, making it easier to identify performance bottlenecks when multiple shaders participate in an operator.
3. **Deferred data generation**: Tensor data is now generated lazily with explicit seeding, enabling deterministic data sharing across grouped test cases. This ensures identical inputs produce identical reference outputs, which is required for caching correctness.

Also adds string input support (`ValueSpec::make_string()`) and helper functions for concise test case naming (`layout_abbrev`, `repr_str`, `shape_string`).

ghstack-source-id: 338638546
@exported-using-ghexport

Differential Revision: [D91945038](https://our.internmc.facebook.com/intern/diff/D91945038/)
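A minimal sketch of the caching idea described above, assuming a `std::map`-based cache; the `ReferenceKey` fields and `get_reference` helper here are illustrative, not the actual harness code:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical key: two test cases with the same key must produce the
// same reference output, so the reference computation can be shared.
struct ReferenceKey {
  std::vector<int64_t> sizes;
  std::string dtype;
  std::string data_gen;  // data generation type (e.g. "random")

  bool operator<(const ReferenceKey& other) const {
    return std::tie(sizes, dtype, data_gen) <
           std::tie(other.sizes, other.dtype, other.data_gen);
  }
};

// Compute the reference output once per key; storage/layout variants
// of the same logical test case hit the cache instead of recomputing.
template <typename Output, typename ComputeFn>
const Output& get_reference(
    std::map<ReferenceKey, Output>& cache,
    const ReferenceKey& key,
    ComputeFn compute) {
  auto it = cache.find(key);
  if (it == cache.end()) {
    it = cache.emplace(key, compute()).first;
  }
  return it->second;
}
```

The deterministic seeding mentioned in point 3 is what makes this safe: grouped test cases regenerate identical input data, so a cached reference output remains valid for every member of the group.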
…tion

Pull Request resolved: #17170

This change introduces separate alignment fields to PackedDimInfo, decoupling the alignment used for padding tensor dimensions from the block size used for packing.

Previously, `calculate_padded_sizes` used `packed_dim_block_size` and `outer_packed_dim_block_size` directly to determine how much to pad tensor dimensions. This works, but it limits flexibility: there are scenarios where we want to pad dimensions to a larger alignment than the block size for performance reasons, such as ensuring loads are aligned to cache lines or removing the need for bounds checking in shaders.

The new fields `packed_dim_align` and `outer_packed_dim_align` allow the alignment to be specified independently. For now, these are initialized to match the corresponding block sizes, preserving existing behavior. Future changes can set larger alignment values when beneficial for performance.

Authored with Claude.

ghstack-source-id: 338638551
@exported-using-ghexport

Differential Revision: [D92196649](https://our.internmc.facebook.com/intern/diff/D92196649/)
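To make the decoupling concrete, here is a minimal sketch of a padded-size calculation using a separate alignment. The field names mirror the description above, but the struct layout, `align_up` helper, and `calculate_padded_sizes` signature are assumptions, not the actual ExecuTorch code:

```cpp
#include <cstdint>
#include <vector>

// Assumed struct shape; only the fields named in the description are shown.
struct PackedDimInfo {
  int64_t packed_dim_block_size;
  int64_t outer_packed_dim_block_size;
  // New: padding alignment, decoupled from the packing block size.
  // Initialized to the block sizes to preserve existing behavior.
  int64_t packed_dim_align;
  int64_t outer_packed_dim_align;
};

// Round `size` up to the next multiple of `align`.
inline int64_t align_up(int64_t size, int64_t align) {
  return (size + align - 1) / align * align;
}

// Sketch: pad the packed dims using the alignment fields rather than the
// block sizes, so a future change can set e.g. a cache-line-sized
// alignment without altering how elements are packed into blocks.
std::vector<int64_t> calculate_padded_sizes(
    std::vector<int64_t> sizes,
    const PackedDimInfo& info,
    size_t packed_dim,
    size_t outer_packed_dim) {
  sizes[packed_dim] = align_up(sizes[packed_dim], info.packed_dim_align);
  sizes[outer_packed_dim] =
      align_up(sizes[outer_packed_dim], info.outer_packed_dim_align);
  return sizes;
}
```

With the aligns initialized to the block sizes, this computes exactly the same padding as before; only when a larger alignment is set does the padded size grow beyond what packing alone requires.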
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17260
Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 119 Pending
As of commit 694f9b8 with merge base 1cffd23.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
…perators (#17261)

Implemented quantize_per_tensor and dequantize_per_tensor GLSL shaders and C++ dispatch logic to support the new single-dimension packed INT8 layouts (kPackedInt8_4W, kPackedInt8_4C, kPackedInt8_4H). These operators enable conversion between floating-point tensors and packed int8 representations with per-tensor scale and zero-point parameters.

The implementation includes:

- GLSL shaders: quantize_per_tensor and dequantize_per_tensor with support for both texture->buffer and buffer->buffer data flows, including GL_EXT_debug_printf statements for debugging
- QuantizeDequantize.cpp: Added dispatch functions for the new layouts and registered the etvk.q_dq_8bit_per_tensor.default operator
- Test infrastructure: Created a q_dq_8bit_per_tensor test binary with DEBUG_MODE support and a reference CPU implementation for validation

The shaders implement the quantization formula `Q = clamp(round(x / scale) + zp, -128, 127)` and the dequantization formula `x' = (Q - zp) * scale`, with proper int8 packing/unpacking using little-endian byte ordering and sign extension.

Differential Revision: [D92061370](https://our.internmc.facebook.com/intern/diff/D92061370/)

[ghstack-poisoned]
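For reference, a minimal CPU-side sketch of the quantize/dequantize math and byte handling described above. This mirrors the stated formulas only; it is not the GLSL shader code, and the function names are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Q = clamp(round(x / scale) + zp, -128, 127)
int8_t quantize_per_tensor(float x, float scale, int32_t zp) {
  int32_t q = static_cast<int32_t>(std::lround(x / scale)) + zp;
  return static_cast<int8_t>(std::clamp(q, -128, 127));
}

// x' = (Q - zp) * scale
float dequantize_per_tensor(int8_t q, float scale, int32_t zp) {
  return static_cast<float>(static_cast<int32_t>(q) - zp) * scale;
}

// Pack four int8 values into one 32-bit word, little-endian:
// q[0] lands in the lowest byte.
uint32_t pack_int8x4(const int8_t q[4]) {
  return (static_cast<uint32_t>(static_cast<uint8_t>(q[0])) << 0) |
         (static_cast<uint32_t>(static_cast<uint8_t>(q[1])) << 8) |
         (static_cast<uint32_t>(static_cast<uint8_t>(q[2])) << 16) |
         (static_cast<uint32_t>(static_cast<uint8_t>(q[3])) << 24);
}

// Unpack one lane with sign extension: shift the target byte to the
// top, then arithmetic-right-shift it back down.
int8_t unpack_int8(uint32_t word, int lane /* 0..3 */) {
  int32_t shifted = static_cast<int32_t>(word << (24 - 8 * lane));
  return static_cast<int8_t>(shifted >> 24);
}
```

The sign-extension step matters because the packed bytes are stored unsigned inside the 32-bit word; extracting a byte with a plain mask would lose the sign of negative quantized values.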
This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #17170 by @SS-JIA
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/405/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/405/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/398/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/405/orig
Differential Revision: D92196649
@diff-train-skip-merge