64-bit offset support in hipBLASLt by shbae · Pull Request #7585 · ROCm/rocm-libraries

shbae · 2026-05-19T19:12:51Z

Motivation

hipBLASLt currently lacks support for 64-bit batch offsets in matrix operations. This feature lets batched GEMM operations specify element-level offsets for the input and output matrices, so computation can target specific regions within larger buffers without copying data. This matters for applications that manage large pre-allocated memory pools or operate on sub-matrices within batched operations, and it is directly related to the rocBLAS backend unification effort. rocBLAS already supports the feature and rocSOLVER uses it; with the hipBLASLt backend the same use case currently requires extra hipMemcpy calls, which this change makes unnecessary.

Technical Details

This PR implements end-to-end 64-bit batch offset support across the hipBLASLt stack:

API Layer:

Extended HIPBLASLT_MATRIX_LAYOUT_OFFSET attribute to accept 64-bit offset values for A, B, C, D matrices
Batch offsets are specified in elements (not bytes) for consistency with matrix dimensions
Offsets are applied per-batch in pointer array mode (batch_mode=1)
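
As a rough illustration of the semantics listed above, the sketch below computes where each batch would effectively start in pointer-array mode. It assumes a single element offset per matrix that is added to every batch pointer, which is one reading of the description. The function name and the use of plain integers in place of device pointers are hypothetical and not part of the hipBLASLt API.

```python
def effective_batch_addresses(batch_ptrs, offset_elems, elem_size_bytes):
    """Return the byte address each batch starts at after applying the offset.

    batch_ptrs      -- per-batch base addresses (plain ints standing in for device pointers)
    offset_elems    -- offset in elements (not bytes), applied to every batch
    elem_size_bytes -- size of one matrix element (e.g. 2 for f16, 4 for f32)
    """
    offset_bytes = offset_elems * elem_size_bytes
    return [ptr + offset_bytes for ptr in batch_ptrs]


# Two f16 batches (2 bytes per element), each shifted by 512 elements.
addrs = effective_batch_addresses([0x1000, 0x9000], 512, 2)
```

Expressing the offset in elements keeps it in the same units as the matrix dimensions and leading dimensions, so the caller does not need to know the element size when computing it.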

Host Implementation:

Modified tensile_host.cpp to pass offset values as kernel arguments
Placed batch offset arguments at the tail of the kernarg buffer for backward compatibility
Updated kernel dispatch logic to handle 64-bit offset arithmetic
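
The host-side steps above can be sketched as follows, under stated assumptions: the byte offset is treated as a signed 64-bit value, and the "existing" kernarg fields are a made-up stand-in rather than the real TensileLite layout. The helper names are hypothetical. The point is only that appending new fields at the tail leaves the byte positions of all earlier arguments unchanged.

```python
import struct

INT32_MAX = 2**31 - 1
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def element_offset_to_bytes(offset_elems, elem_size_bytes):
    """Convert an element offset to a byte offset, checking the signed 64-bit range."""
    offset_bytes = offset_elems * elem_size_bytes
    if not INT64_MIN <= offset_bytes <= INT64_MAX:
        raise OverflowError("byte offset does not fit in a signed 64-bit integer")
    return offset_bytes


def append_offsets_to_kernarg(existing_args, byte_offsets):
    """Append four 64-bit offsets (A, B, C, D) after the existing argument bytes.

    Because the new fields go at the tail, the byte positions of all
    pre-existing arguments are unchanged.
    """
    return existing_args + struct.pack("<4q", *byte_offsets)


# 3 * 2**30 f32 elements is 12 GiB: past the 32-bit range, well within 64-bit.
big = element_offset_to_bytes(3 * 2**30, 4)
legacy = struct.pack("<IIf", 128, 256, 1.0)  # made-up stand-in for existing args
kernarg = append_offsets_to_kernarg(legacy, [big, 0, 0, -64])
```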

Kernel Generation (TensileLite):

Updated kernel signature generation to include offset parameters in kernarg buffer
Modified KernelWriterAssembly.py to:
- Use only 2 additional temporary SGPRs for 64-bit offset handling (minimal register pressure)
- Generate s_load_b64 instructions to load 64-bit offset values
- Insert proper s_waitcnt synchronization after scalar loads
- Apply 64-bit address arithmetic when computing buffer pointers
Extended KernelWriterConversion.py for Conversion kernel types
Updated computeStoreSrd() to properly handle offset calculations
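
To make the 64-bit address arithmetic concrete, here is a small model of adding a 64-bit offset to a 64-bit base address when each value lives in a pair of 32-bit scalar registers. It mirrors the add / add-with-carry pattern (for example `s_add_u32` followed by `s_addc_u32` on AMD GPUs); whether the generated kernels use exactly these instructions is an assumption, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def split64(value):
    """Split a (possibly negative) integer into two's-complement 32-bit halves (lo, hi)."""
    value &= MASK64
    return value & MASK32, value >> 32


def join64(lo, hi):
    """Recombine 32-bit halves into one unsigned 64-bit value."""
    return (hi << 32) | lo


def add_u64_via_32bit_halves(base_lo, base_hi, off_lo, off_hi):
    """Add two 64-bit values held as 32-bit halves.

    The low halves are added first; the carry out of that add feeds the
    high-half add, and both results are truncated to 32 bits.
    """
    lo_sum = base_lo + off_lo
    carry = lo_sum >> 32  # 1 if the low add overflowed 32 bits
    hi_sum = base_hi + off_hi + carry
    return lo_sum & MASK32, hi_sum & MASK32


base = 0x0000_7F00_FFFF_FF00
lo, hi = add_u64_via_32bit_halves(*split64(base), *split64(0x200))
```

Because the halves are plain two's-complement bit patterns, the same sequence also handles negative offsets without a separate subtract path.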

Test Infrastructure:

Created dedicated test suite testing_matmul_batch_offset.hpp with dual-validation approach:
a. Offset API results vs manual pointer adjustment (validates implementation correctness)
b. GPU results vs CPU reference (validates numerical accuracy)
Added 5 test categories in matmul_gtest.yaml:
- matmul_batch_offset_quick: smoke test (category: quick)
- matmul_batch_offset_values: various offset values 0-512 (category: pre_checkin)
- matmul_batch_offset_transpose: transposed matrix combinations (category: pre_checkin)
- matmul_batch_offset_alpha_beta: various alpha/beta combinations (category: pre_checkin)
- matmul_batch_offset_large: matrices with very large offsets that require a 64-bit integer type (category: nightly)
Scoped large tests to tested GPU architectures to avoid CI failures due to limited device memory resources.
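
The dual-validation idea can be illustrated on the CPU with a toy example. This is only a sketch of the comparison logic, not the code in testing_matmul_batch_offset.hpp: the naive `gemm` helper, the column-major layout, and the padding values are all assumptions made for illustration.

```python
def gemm(a, b, m, n, k, a_off=0, b_off=0):
    """Naive column-major GEMM: C (m x n) = A (m x k) * B (k x n).

    A and B are read from flat buffers starting at the given element offsets.
    """
    c = [0.0] * (m * n)
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[a_off + i + p * m] * b[b_off + p + j * k]
            c[i + j * m] = acc
    return c


# The matrices sit inside larger buffers, behind some unrelated padding.
buf_a = [9.0] * 3 + [1.0, 3.0, 2.0, 4.0]  # A = [[1, 2], [3, 4]] at offset 3
buf_b = [9.0] * 5 + [5.0, 7.0, 6.0, 8.0]  # B = [[5, 6], [7, 8]] at offset 5

# (a) offset path vs. manually adjusted "pointers" (here: slices)
via_offset = gemm(buf_a, buf_b, 2, 2, 2, a_off=3, b_off=5)
via_slice = gemm(buf_a[3:], buf_b[5:], 2, 2, 2)

# (b) both compared against an independently known result, A * B
reference = [19.0, 43.0, 22.0, 50.0]
```

Check (a) catches a wrong offset even if the arithmetic is right, while check (b) catches the case where both paths are wrong in the same way.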

Misc.

Fixed minor typo of an internal function name:
- rocblaslt_matrix_layout_destory() --> rocblaslt_matrix_layout_destroy().

Test Plan

Unit tests: Run new matmul_batch_offset test suite across quick/pre_checkin/nightly categories
Precision coverage: All tests execute across f32, f16, bf16 data types
Transpose modes: Validated with NN, NT, TN, TT matrix configurations
Alpha/Beta combinations: Tested all GEMM modes (alpha-only, beta-only, alpha+beta)
Offset values: Validated with offsets spanning a range of element counts, including very large values that require a 64-bit integer type.
Batch counts: Tested with 1-4 batches
Build and run the relevant tests locally, and check the CI test results.

Test Result

All matmul_batch_offset tests passing across all categories
No regressions in existing test suites
Successful builds and test execution on gfx942 / gfx950 locally
CI tests PASSED

Risk level

Low

Changes are feature-additive (no modification to existing behavior when offset=0)
Kernel changes are scoped to new offset parameter handling
Minimal register pressure impact (only 2 extra temporary SGPRs)
Offset arguments placed at kernarg buffer tail to avoid breaking existing kernel binaries

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Associated ticket: AIHPBLAS-1456

codecov-commenter · 2026-05-19T22:15:20Z

Codecov Report

❌ Patch coverage is 59.09091% with 63 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...tail/rocblaslt/src/include/rocblaslt_mat_utils.hpp	30.95%	21 Missing and 8 partials ⚠️
...c/amd_detail/rocblaslt/src/rocblaslt_auxiliary.cpp	46.88%	15 Missing and 2 partials ⚠️
...rary/src/amd_detail/rocblaslt/src/tensile_host.cpp	60.00%	11 Missing and 1 partial ⚠️
...laslt/library/src/amd_detail/include/auxiliary.hpp	54.55%	1 Missing and 4 partials ⚠️

❌ Your project status has failed because the head coverage (77.89%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #7585      +/-   ##
===========================================
- Coverage    71.45%   71.45%   -0.00%     
===========================================
  Files         2612     2612              
  Lines       407793   407925     +132     
  Branches     60982    61006      +24     
===========================================
+ Hits        291377   291467      +90     
- Misses       95068    95090      +22     
- Partials     21348    21368      +20

Flag	Coverage Δ		*Carryforward flag
TensileLite	`76.92% <ø> (-<0.01%)`	⬇️	Carriedforward from 9b6f8c2
hipBLAS	`90.81% <ø> (ø)`		Carriedforward from 9b6f8c2
hipBLASLt	`41.63% <59.09%> (+0.27%)`	⬆️
hipCUB	`82.68% <ø> (ø)`		Carriedforward from 9b6f8c2
hipDNN	`86.75% <ø> (ø)`		Carriedforward from 9b6f8c2
hipFFT	`50.17% <ø> (ø)`		Carriedforward from 9b6f8c2
hipRAND	`76.12% <ø> (ø)`		Carriedforward from 9b6f8c2
hipSOLVER	`69.18% <ø> (ø)`		Carriedforward from 9b6f8c2
hipSPARSE	`86.55% <ø> (ø)`		Carriedforward from 9b6f8c2
rocBLAS	`48.08% <ø> (ø)`		Carriedforward from 9b6f8c2
rocFFT	`47.16% <ø> (ø)`		Carriedforward from 9b6f8c2
rocRAND	`57.07% <ø> (ø)`		Carriedforward from 9b6f8c2
rocSOLVER	`77.89% <ø> (ø)`		Carriedforward from 9b6f8c2
rocSPARSE	`72.37% <ø> (ø)`		Carriedforward from 9b6f8c2
rocThrust	`91.34% <ø> (ø)`		Carriedforward from 9b6f8c2

*This pull request uses carry forward flags.

Files with missing lines	Coverage Δ
...ts/hipblaslt/library/include/hipblaslt/hipblaslt.h	`75.00% <ø> (ø)`
...cts/hipblaslt/library/src/amd_detail/hipblaslt.cpp	`47.54% <100.00%> (ø)`
...rary/src/amd_detail/rocblaslt/src/include/handle.h	`84.44% <ø> (ø)`
...ary/src/amd_detail/rocblaslt/src/rocblaslt_mat.cpp	`83.59% <100.00%> (+0.43%)`	⬆️
...t/library/src/amd_detail/rocblaslt/src/utility.cpp	`28.17% <100.00%> (+0.81%)`	⬆️
...blaslt/tensilelite/Tensile/Components/Signature.py	`91.74% <ø> (ø)`
...ects/hipblaslt/tensilelite/Tensile/KernelWriter.py	`70.72% <ø> (ø)`
...blaslt/tensilelite/Tensile/KernelWriterAssembly.py	`69.51% <ø> (-<0.01%)`	⬇️
...aslt/tensilelite/Tensile/KernelWriterConversion.py	`83.58% <ø> (-0.06%)`	⬇️
...laslt/library/src/amd_detail/include/auxiliary.hpp	`2.36% <54.55%> (+2.36%)`	⬆️
... and 3 more

... and 1 file with indirect coverage changes

mpanoop · 2026-05-29T02:31:55Z

@shbae, we need to account for the post-GSU scenario, and KernelOutputConversion.py should be updated to add the offsets for the General Batched GEMM scenario.

KKyang · 2026-05-30T09:44:19Z

@jichangjichang this will greatly increase the sgpr usage and affect the preload data.

randyh62

looks good to me

shbae · 2026-06-18T17:04:38Z

@jichangjichang this will greatly increase the sgpr usage and affect the preload data.

Hi @KKyang and @jichangjichang, this PR is ready for review. I've implemented it with minimal SGPR usage: only 2 temporary SGPRs are needed while applying the corresponding offset to each matrix pointer. Please let me know if you have any comments or questions about this PR. Thank you!

jichangjichang · 2026-06-22T07:52:37Z

Could you add a test that verifies this with all solutions for some small sizes in the batch offset tests?
You can refer to "matmul_heuristic_all_solutions".

Copilot

Pull request overview

This PR adds end-to-end 64-bit batch offset support for hipBLASLt general-batched (pointer-array) GEMM by plumbing new matrix-layout offset attributes through the rocblaslt/hipblaslt API layers, TensileLite host argument packing, and TensileLite kernel generation/assembly address calculations, plus introducing a dedicated test suite.

Changes:

Extend matrix layout descriptors and validation to carry per-matrix 64-bit batch offsets (A/B/C/D) and pass them through rocblaslt → TensileLite inputs/args.
Update TensileLite kernel signature generation and assembly/kernel writers to load/apply 64-bit offsets when computing per-batch base addresses in pointer-array mode.
Add new matmul_batch_offset gtest entry + YAML coverage and a dedicated client-side test implementation.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.

File	Description
projects/hipblaslt/tensilelite/Tensile/KernelWriterConversion.py	Adds offset args to conversion kernel signature and applies offsets when indexing pointer arrays for C/D.
projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py	Loads batch offsets from kernargs and applies 64-bit address arithmetic for A/B loads and C/D stores in pointer-array mode.
projects/hipblaslt/tensilelite/Tensile/KernelWriter.py	Tracks kernarg byte offsets for batchOffset* fields in writer state.
projects/hipblaslt/tensilelite/Tensile/Components/Signature.py	Appends batchOffsetA/B/C/D u64 args to kernarg tail and records their byte offsets for assembly loaders.
projects/hipblaslt/tensilelite/src/ContractionSolution.cpp	Appends batchOffset* args to kernel invocations (SupportUserArgs and conversion paths).
projects/hipblaslt/tensilelite/rocisa/rocisa/src/code.cpp	Exposes signature offset metadata to Python bindings.
projects/hipblaslt/tensilelite/include/Tensile/ContractionProblem.hpp	Extends ContractionInputs with batchOffsetA/B/C/D.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/utility.cpp	Updates layout-attribute stringification and adds OFFSET attribute name.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/tensile_host.cpp	Converts user offsets (elements) to byte offsets for kernel consumption.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/rocblaslt_mat.cpp	Plumbs batch_offset_* through problem construction and kernel selection paths.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/rocblaslt_auxiliary.cpp	Implements matrix layout OFFSET attribute and fixes destroy API typo in implementation.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/include/rocblaslt_mat_utils.hpp	Adds offset validation rules (incl. MX-type restriction) and plumbs offsets through arg validation.
projects/hipblaslt/library/src/amd_detail/rocblaslt/src/include/handle.h	Adds `batch_offset` field to matrix layout descriptor.
projects/hipblaslt/library/src/amd_detail/rocblaslt/include/rocblaslt-types.h	Adds ROCBLASLT_MATRIX_LAYOUT_OFFSET enum and batch_offset_* fields to RocblasltContractionProblem.
projects/hipblaslt/library/src/amd_detail/rocblaslt/include/rocblaslt-auxiliary.h	Renames `rocblaslt_matrix_layout_destory` → `rocblaslt_matrix_layout_destroy` in public header.
projects/hipblaslt/library/src/amd_detail/include/auxiliary.hpp	Adds `hip_datatype_is_mxtype` helper for sub-byte datatype checks.
projects/hipblaslt/library/src/amd_detail/hipblaslt.cpp	Updates hipblasLt wrapper to call the corrected destroy function name.
projects/hipblaslt/library/include/hipblaslt/hipblaslt.h	Adds HIPBLASLT_MATRIX_LAYOUT_OFFSET attribute to public hipblasLt API.
projects/hipblaslt/clients/tests/src/matmul_gtest.cpp	Wires new `matmul_batch_offset` test function into gtest dispatch/filter.
projects/hipblaslt/clients/tests/data/matmul_gtest.yaml	Adds quick/pre_checkin/nightly batch-offset test cases (including very large offsets).
projects/hipblaslt/clients/tests/data/hipblaslt_common.yaml	Adds CLI/YAML argument definitions and defaults for batch_offset_{a,b,c,d}.
projects/hipblaslt/clients/common/include/testing_matmul_batch_offset.hpp	New test implementation validating offset behavior vs CPU reference.
projects/hipblaslt/clients/common/include/hipblaslt_arguments.hpp	Adds batch_offset_{a,b,c,d} fields to Arguments struct and serialization macros.

jichangjichang · 2026-06-22T11:38:16Z

+        # signature.offset counts from the very first arg including the common header.
+        # The assembly loads these args with KernArgAddress already advanced past
+        # that header by commonArgsSize, so subtract it.
+        if not kernel["ProblemType"]["GroupedGemm"]:


Do we need this for sparse kernel?

shbae · 2026-06-23

Honestly, I don't know whether this feature is necessary for sparse kernels, since I haven't heard of a need for it on the sparse kernel side yet.
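
The adjustment described in the diff comment above can be sketched like this. The `batchOffsetA`..`batchOffsetD` names come from the review summary; the helper name, the 16-byte header size, and the byte positions are made-up values for illustration, not the actual TensileLite layout.

```python
def relative_kernarg_offsets(signature_offsets, common_args_size):
    """Rebase absolute kernarg byte offsets to be relative to the end of the common header.

    signature_offsets -- mapping of argument name to byte offset counted from
                         the very first kernel argument (header included)
    common_args_size  -- size in bytes of the common header that the kernarg
                         address has already been advanced past
    """
    rebased = {}
    for name, absolute in signature_offsets.items():
        if absolute < common_args_size:
            raise ValueError(f"{name} lies inside the common header")
        rebased[name] = absolute - common_args_size
    return rebased


# Made-up numbers: a 16-byte common header and four u64 offsets at the tail.
absolute = {"batchOffsetA": 144, "batchOffsetB": 152, "batchOffsetC": 160, "batchOffsetD": 168}
relative = relative_kernarg_offsets(absolute, common_args_size=16)
```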

shbae · 2026-06-24T23:06:30Z

Could you add a test that verifies this with all solutions for some small sizes in the batch offset tests? You can refer to "matmul_heuristic_all_solutions".

Hi @jichangjichang, I've added a test that verifies it with all solutions in 74324ee. Thanks!

shbae · 2026-06-24T23:17:28Z

+  unit_check: 1
+
+# Test with negative batch offsets to verify proper memory layout handling
+- name: matmul_batch_offset_negative


@TorreZuk I've added the matmul_batch_offset_negative and matmul_batch_offset_mixed tests here, and they all pass locally.

shbae requested review from a team as code owners May 19, 2026 19:12

shbae added the project: hipblaslt label May 19, 2026

shbae marked this pull request as draft May 19, 2026 19:13

github-actions Bot added project: hipsparselt ci:hipsparselt-fast labels May 19, 2026

assistant-librarian Bot added the organization: ROCm label May 19, 2026

nakajee reviewed May 20, 2026

Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated

shbae force-pushed the users/sbae/64bit_offset_support branch from 6312fcd to 8009839 Compare May 20, 2026 17:59

shbae force-pushed the users/sbae/64bit_offset_support branch from a5d7f24 to 6175079 Compare May 29, 2026 02:14

shbae force-pushed the users/sbae/64bit_offset_support branch from 90e3d97 to a4a15d7 Compare May 29, 2026 23:18

KKyang requested a review from jichangjichang May 30, 2026 09:43

mpanoop reviewed Jun 1, 2026

Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated

Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated

Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriterAssembly.py Outdated

randyh62 approved these changes Jun 2, 2026

shbae force-pushed the users/sbae/64bit_offset_support branch from 1e8f008 to a68b688 Compare June 3, 2026 00:09

shbae force-pushed the users/sbae/64bit_offset_support branch 2 times, most recently from 59fabe4 to 0da219f Compare June 12, 2026 02:57

shbae changed the title from "[Draft] 64-bit offset support in hipBLASLt" to "64-bit offset support in hipBLASLt" Jun 12, 2026

shbae marked this pull request as ready for review June 12, 2026 21:39

shbae force-pushed the users/sbae/64bit_offset_support branch from d8b4b88 to 3bb065b Compare June 15, 2026 21:46

mpanoop mentioned this pull request Jun 16, 2026

New rocblas hipblaslt integration #8082

Merged

1 task

shbae force-pushed the users/sbae/64bit_offset_support branch from 58f2d93 to 669e2d1 Compare June 17, 2026 16:58

jichangjichang requested a review from Copilot June 22, 2026 07:38

Copilot started reviewing on behalf of jichangjichang June 22, 2026 07:39

Copilot AI reviewed Jun 22, 2026

jichangjichang reviewed Jun 22, 2026

shbae force-pushed the users/sbae/64bit_offset_support branch from 669e2d1 to fb27276 Compare June 24, 2026 23:01

shbae commented Jun 24, 2026

shbae added 25 commits June 25, 2026 00:34

68bb8e4 minor fix of the type of an internal function name.
5ae84f2 [hipblaslt] implement host-side code for the 64bit offset support in hipblaslt.
299f898 [hipblaslt] pass batch offset values as kernel arguments.
122c1cc [hipblaslt] add offset parameters into the kernel signature.
5a8e6ae [hipblaslt] update kernel generation to use offsets in general batch mode.
ef58b1d [hipblaslt] add waitcnt instruction after SLoadB64 before updating the offset.
f86ea62 [hipblaslt] implement tests for 64-bit offset support
c4767cc [hipblaslt] update tensilelite code-gen part for 64bit offset support to add arguments appropriately and to use only two extra SGPRs
95f65e7 [hipblaslt] fix bugs in computeStoreSrd() and remove unnecessary change.
bc1e76a [hipblaslt] update 64-bit offset support for BetaOnly and Conversion.
75c41a7 [hipblaslt] fix bugs in the changes of KernelWriterConversion.py
9687d09 [hipblaslt] remove unnecessary if-condition and minor update.
a1e6023 [hipblaslt] fix CI test failures.
a5c0c37 remove temporary debug implementation.
6a04c1c [hipblaslt] fix the bug related to CI failures.
dfcb804 [hipblaslt] place the batch offset argument at the tail of the kernarg buffer
f506496 [hipblaslt] modify the batch offset value as in elements.
314abe7 remove temporary debug test and update matmul_batch_offset_large test.
a1aa531 [hipblaslt] modify matmul_batch_offset_large test with actual 64-bit offset inputs.
19ad018 limit the tested gpu_arch for matmul_batch_offset_large tests.
536d137 [hipblaslt] remove checking the positive offset value
74324ee [hipblaslt] add matmul_batch_offset_all_solutions test.
905a59e [hipblaslt] don't add batchOffset arguments to the Custom Kernels, and excludes custom kernels solutions for General batched GEMM.
d36670a [hipblaslt] add tests for negative batch offset values.
24de4c7 [hipblaslt] check validity with non-zero offsets with POINTER_ARRAY mode only.

shbae force-pushed the users/sbae/64bit_offset_support branch from 64434c9 to 24de4c7 Compare June 25, 2026 00:34
