Skip to content

64-bit offset support in hipBLASLt#7585

Open
shbae wants to merge 25 commits into
developfrom
users/sbae/64bit_offset_support
Open

64-bit offset support in hipBLASLt#7585
shbae wants to merge 25 commits into
developfrom
users/sbae/64bit_offset_support

Conversation

@shbae

@shbae shbae commented May 19, 2026

Copy link
Copy Markdown
Contributor

Motivation

hipBLASLt currently lacks support for 64-bit batch offsets in matrix operations. This feature enables batched GEMM operations to specify element-level offsets for input/output matrices, allowing computation on specific regions within larger buffers without requiring data copies. This is critical for applications that manage large pre-allocated memory pools or need to operate on sub-matrices within batched operations, which is directly related to rocblas backend unification efforts. The feature is currently supported in rocblas and used by rocsolver, and it requires hipMemcpy overhead with hipblaslt backend, and this new feature would avoid that unnecessary hipMemcpy overhead.

Technical Details

This PR implements end-to-end 64-bit batch offset support across the hipBLASLt stack:

API Layer:

  • Extended HIPBLASLT_MATRIX_LAYOUT_OFFSET attribute to accept 64-bit offset values for A, B, C, D matrices
  • Batch offsets are specified in elements (not bytes) for consistency with matrix dimensions
  • Offsets are applied per-batch in pointer array mode (batch_mode=1)

Host Implementation:

  • Modified tensile_host.cpp to pass offset values as kernel arguments
  • Placed batch offset arguments at the tail of the kernarg buffer for backward compatibility
  • Updated kernel dispatch logic to handle 64-bit offset arithmetic

Kernel Generation (TensileLite):

  • Updated kernel signature generation to include offset parameters in kernarg buffer
  • Modified KernelWriterAssembly.py to:
    • Use only 2 additional temporary SGPRs for 64-bit offset handling (minimal register pressure)
    • Generate s_load_b64 instructions to load 64-bit offset values
    • Insert proper s_waitcnt synchronization after scalar loads
    • Apply 64-bit address arithmetic when computing buffer pointers
  • Extended KernelWriterConversion.py for Conversion kernel types
  • Updated computeStoreSrd() to properly handle offset calculations

Test Infrastructure:

  • Created dedicated test suite testing_matmul_batch_offset.hpp with dual-validation approach:
    a. Offset API results vs manual pointer adjustment (validates implementation correctness)
    b. GPU results vs CPU reference (validates numerical accuracy)
  • Added 5 test categories in matmul_gtest.yaml:
    • matmul_batch_offset_quick: smoke test (category: quick)
    • matmul_batch_offset_values: various offset values 0-512 (category: pre_checkin)
    • matmul_batch_offset_transpose: transposed matrix combinations (category: pre_checkin)
    • matmul_batch_offset_alpha_beta: various alpha/beta combinations (category: pre_checkin)
    • matmul_batch_offset_large: matrices with very large offsets which requires 64-bit integer type (category: nightly)
  • Scoped large tests to tested GPU architectures to avoid CI failures due to limited device memory resources.

Misc.

  • Fixed minor typo of an internal function name:
    • rocblaslt_matrix_layout_destory() --> rocblaslt_matrix_layout_destroy().

Test Plan

  1. Unit tests: Run new matmul_batch_offset test suite across quick/pre_checkin/nightly categories
  2. Precision coverage: All tests execute across f32, f16, bf16 data types
  3. Transpose modes: Validated with NN, NT, TN, TT matrix configurations
  4. Alpha/Beta combinations: Tested all GEMM modes (alpha-only, beta-only, alpha+beta)
  5. Offset values: Validated with offsets of various number of elements, including very large offset values, which actually requires 64-bit integer type.
  6. Batch counts: Tested with 1-4 batches
  7. Locally build and run relevant tests as well as look at the CI test results.

Test Result

  • All matmul_batch_offset tests passing across all categories
  • No regressions in existing test suites
  • Successful builds and test execution on gfx942 / gfx950 locally
  • CI tests PASSED

Risk level

Low

  • Changes are feature-additive (no modification to existing behavior when offset=0)
  • Kernel changes are scoped to new offset parameter handling
  • Minimal register pressure impact (only 2 extra temporary SGPRs)
  • Offset arguments placed at kernarg buffer tail to avoid breaking existing kernel binaries

Submission Checklist

Associated ticket: AIHPBLAS-1456

@codecov-commenter

codecov-commenter commented May 19, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 59.09091% with 63 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...tail/rocblaslt/src/include/rocblaslt_mat_utils.hpp 30.95% 21 Missing and 8 partials ⚠️
...c/amd_detail/rocblaslt/src/rocblaslt_auxiliary.cpp 46.88% 15 Missing and 2 partials ⚠️
...rary/src/amd_detail/rocblaslt/src/tensile_host.cpp 60.00% 11 Missing and 1 partial ⚠️
...laslt/library/src/amd_detail/include/auxiliary.hpp 54.55% 1 Missing and 4 partials ⚠️

❌ Your project status has failed because the head coverage (77.89%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7585      +/-   ##
===========================================
- Coverage    71.45%   71.45%   -0.00%     
===========================================
  Files         2612     2612              
  Lines       407793   407925     +132     
  Branches     60982    61006      +24     
===========================================
+ Hits        291377   291467      +90     
- Misses       95068    95090      +22     
- Partials     21348    21368      +20     
Flag Coverage Δ *Carryforward flag
TensileLite 76.92% <ø> (-<0.01%) ⬇️ Carriedforward from 9b6f8c2
hipBLAS 90.81% <ø> (ø) Carriedforward from 9b6f8c2
hipBLASLt 41.63% <59.09%> (+0.27%) ⬆️
hipCUB 82.68% <ø> (ø) Carriedforward from 9b6f8c2
hipDNN 86.75% <ø> (ø) Carriedforward from 9b6f8c2
hipFFT 50.17% <ø> (ø) Carriedforward from 9b6f8c2
hipRAND 76.12% <ø> (ø) Carriedforward from 9b6f8c2
hipSOLVER 69.18% <ø> (ø) Carriedforward from 9b6f8c2
hipSPARSE 86.55% <ø> (ø) Carriedforward from 9b6f8c2
rocBLAS 48.08% <ø> (ø) Carriedforward from 9b6f8c2
rocFFT 47.16% <ø> (ø) Carriedforward from 9b6f8c2
rocRAND 57.07% <ø> (ø) Carriedforward from 9b6f8c2
rocSOLVER 77.89% <ø> (ø) Carriedforward from 9b6f8c2
rocSPARSE 72.37% <ø> (ø) Carriedforward from 9b6f8c2
rocThrust 91.34% <ø> (ø) Carriedforward from 9b6f8c2

*This pull request uses carry forward flags. Click here to find out more.

Files with missing lines Coverage Δ
...ts/hipblaslt/library/include/hipblaslt/hipblaslt.h 75.00% <ø> (ø)
...cts/hipblaslt/library/src/amd_detail/hipblaslt.cpp 47.54% <100.00%> (ø)
...rary/src/amd_detail/rocblaslt/src/include/handle.h 84.44% <ø> (ø)
...ary/src/amd_detail/rocblaslt/src/rocblaslt_mat.cpp 83.59% <100.00%> (+0.43%) ⬆️
...t/library/src/amd_detail/rocblaslt/src/utility.cpp 28.17% <100.00%> (+0.81%) ⬆️
...blaslt/tensilelite/Tensile/Components/Signature.py 91.74% <ø> (ø)
...ects/hipblaslt/tensilelite/Tensile/KernelWriter.py 70.72% <ø> (ø)
...blaslt/tensilelite/Tensile/KernelWriterAssembly.py 69.51% <ø> (-<0.01%) ⬇️
...aslt/tensilelite/Tensile/KernelWriterConversion.py 83.58% <ø> (-0.06%) ⬇️
...laslt/library/src/amd_detail/include/auxiliary.hpp 2.36% <54.55%> (+2.36%) ⬆️
... and 3 more

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated
@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from 6312fcd to 8009839 Compare May 20, 2026 17:59
@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from a5d7f24 to 6175079 Compare May 29, 2026 02:14
@mpanoop

mpanoop commented May 29, 2026

Copy link
Copy Markdown
Contributor

@shbae, we need account for post GSU scenario and KernelOutputConversion.py should be updated to add the offsets for General Batched GEMM scenario.

@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from 90e3d97 to a4a15d7 Compare May 29, 2026 23:18
@KKyang KKyang requested a review from jichangjichang May 30, 2026 09:43
@KKyang

KKyang commented May 30, 2026

Copy link
Copy Markdown
Contributor

@jichangjichang this will greatly increase the sgpr usage and affect the preload data.

Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated

@randyh62 randyh62 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from 1e8f008 to a68b688 Compare June 3, 2026 00:09
@shbae shbae force-pushed the users/sbae/64bit_offset_support branch 2 times, most recently from 59fabe4 to 0da219f Compare June 12, 2026 02:57
@shbae shbae changed the title [Draft] 64-bit offset support in hipBLASLt 64-bit offset support in hipBLASLt Jun 12, 2026
@shbae shbae marked this pull request as ready for review June 12, 2026 21:39
@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from d8b4b88 to 3bb065b Compare June 15, 2026 21:46
@mpanoop mpanoop mentioned this pull request Jun 16, 2026
1 task
@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from 58f2d93 to 669e2d1 Compare June 17, 2026 16:58
@shbae

shbae commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

@jichangjichang this will greatly increase the sgpr usage and affect the preload data.

Hi @KKyang and @jichangjichang, this PR is ready to be reviewed, and I've implemented it with minimum usage of SGPR, which requires only 2 temporary SGPRs during updating corresponding offset to each matrix pointer. Please, let me know if you have any comments or questions for this PR. Thank you!

@jichangjichang

Copy link
Copy Markdown
Contributor

Could you add test to verify it with all solution for some small sizes for batch offset test?
You can refer to "matmul_heuristic_all_solutions"

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds end-to-end 64-bit batch offset support for hipBLASLt general-batched (pointer-array) GEMM by plumbing new matrix-layout offset attributes through the rocblaslt/hipblaslt API layers, TensileLite host argument packing, and TensileLite kernel generation/assembly address calculations, plus introducing a dedicated test suite.

Changes:

  • Extend matrix layout descriptors and validation to carry per-matrix 64-bit batch offsets (A/B/C/D) and pass them through rocblaslt → TensileLite inputs/args.
  • Update TensileLite kernel signature generation and assembly/kernel writers to load/apply 64-bit offsets when computing per-batch base addresses in pointer-array mode.
  • Add new matmul_batch_offset gtest entry + YAML coverage and a dedicated client-side test implementation.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
projects/hipblaslt/tensilelite/Tensile/KernelWriterConversion.py Adds offset args to conversion kernel signature and applies offsets when indexing pointer arrays for C/D.
projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Loads batch offsets from kernargs and applies 64-bit address arithmetic for A/B loads and C/D stores in pointer-array mode.
projects/hipblaslt/tensilelite/Tensile/KernelWriter.py Tracks kernarg byte offsets for batchOffset* fields in writer state.
projects/hipblaslt/tensilelite/Tensile/Components/Signature.py Appends batchOffsetA/B/C/D u64 args to kernarg tail and records their byte offsets for assembly loaders.
projects/hipblaslt/tensilelite/src/ContractionSolution.cpp Appends batchOffset* args to kernel invocations (SupportUserArgs and conversion paths).
projects/hipblaslt/tensilelite/rocisa/rocisa/src/code.cpp Exposes signature offset metadata to Python bindings.
projects/hipblaslt/tensilelite/include/Tensile/ContractionProblem.hpp Extends ContractionInputs with batchOffsetA/B/C/D.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/utility.cpp Updates layout-attribute stringification and adds OFFSET attribute name.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/tensile_host.cpp Converts user offsets (elements) to byte offsets for kernel consumption.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/rocblaslt_mat.cpp Plumbs batch_offset_* through problem construction and kernel selection paths.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/rocblaslt_auxiliary.cpp Implements matrix layout OFFSET attribute and fixes destroy API typo in implementation.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/include/rocblaslt_mat_utils.hpp Adds offset validation rules (incl. MX-type restriction) and plumbs offsets through arg validation.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/include/handle.h Adds batch_offset field to matrix layout descriptor.
projects/hipblaslt/library/src/amd_detail/rocblaslt/include/rocblaslt-types.h Adds ROCBLASLT_MATRIX_LAYOUT_OFFSET enum and batch_offset_* fields to RocblasltContractionProblem.
projects/hipblaslt/library/src/amd_detail/rocblaslt/include/rocblaslt-auxiliary.h Renames rocblaslt_matrix_layout_destoryrocblaslt_matrix_layout_destroy in public header.
projects/hipblaslt/library/src/amd_detail/include/auxiliary.hpp Adds hip_datatype_is_mxtype helper for sub-byte datatype checks.
projects/hipblaslt/library/src/amd_detail/hipblaslt.cpp Updates hipblasLt wrapper to call the corrected destroy function name.
projects/hipblaslt/library/include/hipblaslt/hipblaslt.h Adds HIPBLASLT_MATRIX_LAYOUT_OFFSET attribute to public hipblasLt API.
projects/hipblaslt/clients/tests/src/matmul_gtest.cpp Wires new matmul_batch_offset test function into gtest dispatch/filter.
projects/hipblaslt/clients/tests/data/matmul_gtest.yaml Adds quick/pre_checkin/nightly batch-offset test cases (including very large offsets).
projects/hipblaslt/clients/tests/data/hipblaslt_common.yaml Adds CLI/YAML argument definitions and defaults for batch_offset_{a,b,c,d}.
projects/hipblaslt/clients/common/include/testing_matmul_batch_offset.hpp New test implementation validating offset behavior vs CPU reference.
projects/hipblaslt/clients/common/include/hipblaslt_arguments.hpp Adds batch_offset_{a,b,c,d} fields to Arguments struct and serialization macros.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread projects/hipblaslt/clients/common/include/testing_matmul_batch_offset.hpp Outdated
# signature.offset counts from the very first arg including the common header.
# The assembly loads these args with KernArgAddress already advanced past
# that header by commonArgsSize, so subtract it.
if not kernel["ProblemType"]["GroupedGemm"]:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this for sparse kernel?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I don't know this feature is necessary for the sparse kernel, since I haven't heard about the need from sparse kernel yet.

@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from 669e2d1 to fb27276 Compare June 24, 2026 23:01
@shbae

shbae commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Could you add test to verify it with all solution for some small sizes for batch offset test? You can refer to "matmul_heuristic_all_solutions"

Hi @jichangjichang, I've added test to verify it with all solutions by 74324ee. Thanks!

unit_check: 1

# Test with negative batch offsets to verify proper memory layout handling
- name: matmul_batch_offset_negative

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TorreZuk I've added matmul_batch_offset_negative and matmul_batch_offset_mixed tests here and they are all PASSED locally.

shbae added 25 commits June 25, 2026 00:34
… to add arguments appropriately and to use only two extra SGPRs
…d excludes custom kernels solutions for General batched GEMM.
@shbae shbae force-pushed the users/sbae/64bit_offset_support branch from 64434c9 to 24de4c7 Compare June 25, 2026 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants