64-bit offset support in hipBLASLt #7585
Codecov Report
❌ Patch coverage is
❌ Your project status has failed because the head coverage (77.89%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## develop #7585 +/- ##
===========================================
- Coverage 71.45% 71.45% -0.00%
===========================================
Files 2612 2612
Lines 407793 407925 +132
Branches 60982 61006 +24
===========================================
+ Hits 291377 291467 +90
- Misses 95068 95090 +22
- Partials 21348 21368 +20
@shbae, we need to account for the post-GSU scenario, and KernelOutputConversion.py should be updated to add the offsets for the General Batched GEMM scenario.
@jichangjichang, this will greatly increase SGPR usage and affect the preloaded data.
Hi @KKyang and @jichangjichang, this PR is ready for review. I've implemented it with minimal SGPR usage: only 2 temporary SGPRs are needed while adding the corresponding offset to each matrix pointer. Please let me know if you have any comments or questions about this PR. Thank you!
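As a rough illustration of why two temporaries can be enough, the sketch below emulates a 64-bit pointer update done as two 32-bit adds with a carry, which is the usual pattern on scalar units that operate on 32-bit registers. It assumes the two temporaries hold the low and high halves of the offset; that assumption, and all function names, are hypothetical and not taken from the PR.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def split_u64(value):
    """Split a 64-bit value into (lo, hi) 32-bit halves.

    Negative offsets are first mapped to their two's complement form.
    """
    value &= MASK64
    return value & MASK32, value >> 32


def join_u64(lo, hi):
    """Rebuild a 64-bit value from its 32-bit halves."""
    return (hi << 32) | lo


def add_offset_64(ptr_lo, ptr_hi, off_lo, off_hi):
    """Add a 64-bit offset to a 64-bit pointer using two 32-bit adds.

    off_lo and off_hi play the role of the two temporary registers
    (an assumption made for this sketch).
    """
    lo_sum = ptr_lo + off_lo
    carry = lo_sum >> 32                         # carry out of the low half
    new_lo = lo_sum & MASK32
    new_hi = (ptr_hi + off_hi + carry) & MASK32  # wraps like 64-bit hardware
    return new_lo, new_hi


base = 0x00007F00FFFFFF00
for offset in (0x200, -0x100):
    lo, hi = add_offset_64(*split_u64(base), *split_u64(offset))
    print(hex(join_u64(lo, hi)))
```

The first offset forces a carry from the low half into the high half; the second is negative and relies on two's complement wraparound, so both paths of the add are exercised.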
Could you add a batch-offset test that verifies it against all solutions for some small sizes?
Pull request overview
This PR adds end-to-end 64-bit batch offset support for hipBLASLt general-batched (pointer-array) GEMM by plumbing new matrix-layout offset attributes through the rocblaslt/hipblaslt API layers, TensileLite host argument packing, and TensileLite kernel generation/assembly address calculations, plus introducing a dedicated test suite.
Changes:
- Extend matrix layout descriptors and validation to carry per-matrix 64-bit batch offsets (A/B/C/D) and pass them through rocblaslt → TensileLite inputs/args.
- Update TensileLite kernel signature generation and assembly/kernel writers to load/apply 64-bit offsets when computing per-batch base addresses in pointer-array mode.
- Add new matmul_batch_offset gtest entry + YAML coverage and a dedicated client-side test implementation.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| projects/hipblaslt/tensilelite/Tensile/KernelWriterConversion.py | Adds offset args to conversion kernel signature and applies offsets when indexing pointer arrays for C/D. |
| projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py | Loads batch offsets from kernargs and applies 64-bit address arithmetic for A/B loads and C/D stores in pointer-array mode. |
| projects/hipblaslt/tensilelite/Tensile/KernelWriter.py | Tracks kernarg byte offsets for batchOffset* fields in writer state. |
| projects/hipblaslt/tensilelite/Tensile/Components/Signature.py | Appends batchOffsetA/B/C/D u64 args to kernarg tail and records their byte offsets for assembly loaders. |
| projects/hipblaslt/tensilelite/src/ContractionSolution.cpp | Appends batchOffset* args to kernel invocations (SupportUserArgs and conversion paths). |
| projects/hipblaslt/tensilelite/rocisa/rocisa/src/code.cpp | Exposes signature offset metadata to Python bindings. |
| projects/hipblaslt/tensilelite/include/Tensile/ContractionProblem.hpp | Extends ContractionInputs with batchOffsetA/B/C/D. |
| projects/hipblaslt/library/src/amd_detail/rocblaslt/src/utility.cpp | Updates layout-attribute stringification and adds OFFSET attribute name. |
| projects/hipblaslt/library/src/amd_detail/rocblaslt/src/tensile_host.cpp | Converts user offsets (elements) to byte offsets for kernel consumption. |
| projects/hipblaslt/library/src/amd_detail/rocblaslt/src/rocblaslt_mat.cpp | Plumbs batch_offset_* through problem construction and kernel selection paths. |
| projects/hipblaslt/library/src/amd_detail/rocblaslt/src/rocblaslt_auxiliary.cpp | Implements matrix layout OFFSET attribute and fixes destroy API typo in implementation. |
| projects/hipblaslt/library/src/amd_detail/rocblaslt/src/include/rocblaslt_mat_utils.hpp | Adds offset validation rules (incl. MX-type restriction) and plumbs offsets through arg validation. |
| projects/hipblaslt/library/src/amd_detail/rocblaslt/src/include/handle.h | Adds batch_offset field to matrix layout descriptor. |
| projects/hipblaslt/library/src/amd_detail/rocblaslt/include/rocblaslt-types.h | Adds ROCBLASLT_MATRIX_LAYOUT_OFFSET enum and batch_offset_* fields to RocblasltContractionProblem. |
| projects/hipblaslt/library/src/amd_detail/rocblaslt/include/rocblaslt-auxiliary.h | Renames rocblaslt_matrix_layout_destory → rocblaslt_matrix_layout_destroy in public header. |
| projects/hipblaslt/library/src/amd_detail/include/auxiliary.hpp | Adds hip_datatype_is_mxtype helper for sub-byte datatype checks. |
| projects/hipblaslt/library/src/amd_detail/hipblaslt.cpp | Updates hipblasLt wrapper to call the corrected destroy function name. |
| projects/hipblaslt/library/include/hipblaslt/hipblaslt.h | Adds HIPBLASLT_MATRIX_LAYOUT_OFFSET attribute to public hipblasLt API. |
| projects/hipblaslt/clients/tests/src/matmul_gtest.cpp | Wires new matmul_batch_offset test function into gtest dispatch/filter. |
| projects/hipblaslt/clients/tests/data/matmul_gtest.yaml | Adds quick/pre_checkin/nightly batch-offset test cases (including very large offsets). |
| projects/hipblaslt/clients/tests/data/hipblaslt_common.yaml | Adds CLI/YAML argument definitions and defaults for batch_offset_{a,b,c,d}. |
| projects/hipblaslt/clients/common/include/testing_matmul_batch_offset.hpp | New test implementation validating offset behavior vs CPU reference. |
| projects/hipblaslt/clients/common/include/hipblaslt_arguments.hpp | Adds batch_offset_{a,b,c,d} fields to Arguments struct and serialization macros. |
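
The tensile_host.cpp and KernelWriterAssembly.py rows above describe two steps: converting a user offset given in elements into bytes, then adding that byte offset to each per-batch base pointer in pointer-array mode. A minimal sketch of that arithmetic follows. The function names, the element-size table, and the signed 64-bit range check are assumptions for illustration, not the library's actual code.

```python
# Illustrative element sizes in bytes (assumed, not taken from the PR).
ELEMENT_SIZE_BYTES = {"f32": 4, "f16": 2, "bf16": 2, "i8": 1}

MASK64 = 0xFFFFFFFFFFFFFFFF


def to_byte_offset(element_offset, dtype):
    """Convert an offset in elements to an offset in bytes.

    Rejects results outside the signed 64-bit range (an assumed policy).
    """
    byte_offset = element_offset * ELEMENT_SIZE_BYTES[dtype]
    if not -(1 << 63) <= byte_offset < (1 << 63):
        raise OverflowError("byte offset does not fit in 64 bits")
    return byte_offset


def apply_batch_offset(batch_pointers, element_offset, dtype):
    """Add the same byte offset to every per-batch base address."""
    byte_offset = to_byte_offset(element_offset, dtype)
    return [(ptr + byte_offset) & MASK64 for ptr in batch_pointers]


# An element offset whose byte form no longer fits in 32 bits.
print(to_byte_offset(3_000_000_000, "f32"))
print([hex(p) for p in apply_batch_offset([0x1000, 0x2000], 4, "f16")])
```

The first call shows why 64-bit arithmetic matters: three billion f32 elements fit in an unsigned 32-bit element count, but the byte offset does not fit in 32 bits.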
# signature.offset counts from the very first arg including the common header.
# The assembly loads these args with KernArgAddress already advanced past
# that header by commonArgsSize, so subtract it.
if not kernel["ProblemType"]["GroupedGemm"]:
Do we need this for sparse kernel?
Honestly, I don't know whether this feature is necessary for the sparse kernel, since I haven't heard of any need for it from the sparse kernel side yet.
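For readers following the code comment quoted at the top of this thread, the adjustment it describes can be sketched as below. The sketch assumes the signature records absolute byte offsets per argument name and that the loader needs offsets relative to an address already advanced by commonArgsSize. The function name and the numbers are hypothetical, and the GroupedGemm condition from the excerpt is not modelled because its other branch is not shown.

```python
def relative_kernarg_offsets(signature_offsets, common_args_size):
    """Rebase absolute kernarg byte offsets onto an advanced base address.

    signature_offsets maps argument names to byte offsets counted from the
    very first kernel argument, including the common header.
    """
    relative = {}
    for name, absolute in signature_offsets.items():
        if absolute < common_args_size:
            raise ValueError(f"{name} lies inside the common header")
        relative[name] = absolute - common_args_size
    return relative


# Hypothetical layout: a 20-byte common header, offsets appended at the tail.
absolute = {"batchOffsetA": 120, "batchOffsetB": 128}
print(relative_kernarg_offsets(absolute, 20))
```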
Hi @jichangjichang, I've added a test that verifies it with all solutions in 74324ee. Thanks!
unit_check: 1

# Test with negative batch offsets to verify proper memory layout handling
- name: matmul_batch_offset_negative
@TorreZuk I've added the matmul_batch_offset_negative and matmul_batch_offset_mixed tests here, and they all pass locally.
… to add arguments appropriately and to use only two extra SGPRs
…d excludes custom kernels solutions for General batched GEMM.
Motivation
hipBLASLt currently lacks support for 64-bit batch offsets in matrix operations. This feature lets batched GEMM operations specify element-level offsets for input/output matrices, so computation can target specific regions within larger buffers without data copies. This is critical for applications that manage large pre-allocated memory pools or need to operate on sub-matrices within batched operations, and it is directly related to the rocblas backend unification effort. The feature is already supported in rocblas and used by rocsolver; with the hipblaslt backend it currently requires extra hipMemcpy calls, an overhead this new feature avoids.
Technical Details
This PR implements end-to-end 64-bit batch offset support across the hipBLASLt stack:
API Layer:
Host Implementation:
Kernel Generation (TensileLite):
Test Infrastructure:
testing_matmul_batch_offset.hpp with dual-validation approach:
a. Offset API results vs manual pointer adjustment (validates implementation correctness)
b. GPU results vs CPU reference (validates numerical accuracy)
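The dual-validation idea can be illustrated on the CPU with flat buffers standing in for device memory and start indices standing in for the pointer array. This is a simplified sketch, not the test's actual code: all names are hypothetical, matrices are row-major, and an independently written triple loop plays the role of the CPU reference that GPU results would be compared against.

```python
def gemm(a, b, m, n, k):
    """Row-major C[m x n] = A[m x k] * B[k x n] on flat lists."""
    return [sum(a[i * k + p] * b[p * n + j] for p in range(k))
            for i in range(m) for j in range(n)]


def reference_gemm(a, b, m, n, k):
    """Independent triple-loop implementation used as the reference."""
    c = [0] * (m * n)
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i * n + j] += a[i * k + p] * b[p * n + j]
    return c


def batched_gemm(buf_a, buf_b, starts_a, starts_b, off_a, off_b, m, n, k):
    """Pointer-array GEMM: each batch reads at its start plus a shared offset."""
    out = []
    for sa, sb in zip(starts_a, starts_b):
        a = buf_a[sa + off_a: sa + off_a + m * k]
        b = buf_b[sb + off_b: sb + off_b + k * n]
        out.append(gemm(a, b, m, n, k))
    return out


def dual_validate(buf_a, buf_b, starts_a, starts_b, off_a, off_b, m, n, k):
    """Return (offset path == manual adjustment, offset path == reference)."""
    via_offset = batched_gemm(buf_a, buf_b, starts_a, starts_b,
                              off_a, off_b, m, n, k)
    # Check (a): fold the offset into the start indices and pass offset 0.
    via_manual = batched_gemm(buf_a, buf_b,
                              [s + off_a for s in starts_a],
                              [s + off_b for s in starts_b],
                              0, 0, m, n, k)
    # Check (b): compare against the independent reference.
    reference = [reference_gemm(buf_a[sa + off_a: sa + off_a + m * k],
                                buf_b[sb + off_b: sb + off_b + k * n],
                                m, n, k)
                 for sa, sb in zip(starts_a, starts_b)]
    return via_offset == via_manual, via_offset == reference


buf_a = list(range(20))
buf_b = list(range(20, 40))
print(dual_validate(buf_a, buf_b, [0, 8], [0, 8], 4, 2, 2, 2, 2))
```

Check (a) catches mistakes in how the offset is applied, while check (b) catches mistakes in the arithmetic itself, which is why the two are kept separate.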
Misc.
destory() --> rocblaslt_matrix_layout_destroy().
Test Plan
Test Result
matmul_batch_offset tests passing across all categories
Risk level
Low
Submission Checklist
Associated ticket: AIHPBLAS-1456