Skip to content

GEMM reference computation offload#392

Merged
matthiasdiener merged 17 commits into
devfrom
compute-ref-offload
Jan 27, 2026
Merged

GEMM reference computation offload#392
matthiasdiener merged 17 commits into
devfrom
compute-ref-offload

Conversation

@matthiasdiener

@matthiasdiener matthiasdiener commented Dec 4, 2025

Copy link
Copy Markdown
Contributor

Description

Introduce a HIP implementation of the GEMM reference computation to speed up these computations.

Partly addresses https://github.com/ROCm/frameworks-internal/issues/14746

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Introduce a HIP implementation of the GEMM reference computation

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@matthiasdiener matthiasdiener force-pushed the compute-ref-offload branch 7 times, most recently from 44df11e to e60b912 Compare December 9, 2025 20:53
@matthiasdiener matthiasdiener changed the title [WIP] GEMM reference compute offload GEMM reference computation offload Dec 9, 2025
@matthiasdiener matthiasdiener self-assigned this Dec 9, 2025
@matthiasdiener matthiasdiener requested review from alextmagro and Copilot and removed request for alextmagro December 10, 2025 00:19
@matthiasdiener matthiasdiener marked this pull request as ready for review December 10, 2025 00:19

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a GPU-accelerated implementation of the GEMM reference computation using HIP/CUDA kernels to improve performance over the previous CPU-based implementation. The reference computation is critical for validating GEMM operations, and offloading it to the GPU significantly speeds up testing.

Key Changes

  • Replaced CPU OpenMP-based reference GEMM computation with GPU kernel implementation
  • Introduced compute_ref_kernel to perform matrix multiplication, bias addition, GELU activation, and scaling on GPU
  • Refactored both tensor-wise and MXFP8 code paths to use a common compute_ref_impl function that manages device memory and kernel execution

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/cpp/operator/test_cublaslt_gemm.cu Outdated
Comment thread tests/cpp/operator/test_cublaslt_gemm.cu Outdated
Comment thread tests/cpp/operator/test_cublaslt_gemm.cu Outdated
Comment thread tests/cpp/operator/test_cublaslt_gemm.cu Outdated

@alextmagro alextmagro left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Matthias! Looks great, just have a couple performance questions

Comment thread tests/cpp/operator/test_cublaslt_gemm.cu
Comment thread tests/cpp/operator/test_cublaslt_gemm.cu
Comment thread tests/cpp/operator/test_cublaslt_gemm.cu
Comment thread tests/cpp/operator/test_cublaslt_gemm.cu Outdated
fp8e8m0* dB_scale = nullptr;

// Allocations and H2D transfers
NVTE_CHECK_CUDA(cudaMalloc(&dA, lenA * sizeof(A_Type)));

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can adapt existing test tensor classes (

Tensor::Tensor(const std::string& name,
) and their space allocation functions (
Tensor output_c("output_c", shape, otype, rowwise, colwise, NVTE_MXFP8_1D_SCALING);
) defined in tests/cpp/test_common.cu instead of reinventing.

In fact, we can change the api of reference computing by taking directly const tensor& therefore we don't need to re-allocate the input and do one extra copy

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of 3ecea7f? This also merges the mxfp8/non-mxfp8 paths.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for consolidating with existing apis in test_common.cu.

In fact, I still see some cudaMalloc and cudaFree, which can be replaced by using existing test tensor class apis.
For example, the device pointer for scale (

NVTE_CHECK_CUDA(cudaMalloc(&d_a_scale_packed, a_scale_packed.size() * sizeof(fp8e8m0)));
), its corresponding test tensor allocation can be found here:
if (rowwise) {
(void)cudaMalloc((void**)&rowwise_scale_inv, rowwise_scale_size); // NOLINT(*)
(void)cudaMemset(rowwise_scale_inv, 0, rowwise_scale_size);
rowwise_scale_inv_cpu_data_ = std::make_unique<unsigned char[]>(rowwise_scale_size);
std::fill_n(rowwise_scale_inv_cpu_data_.get(), rowwise_scale_size, 0);
auto scale_dtype = rowwise_scale_meta.type;
tensor_.set_rowwise_scale_inv(rowwise_scale_inv, scale_dtype, scale_shape);
}
if (columnwise) {
(void)cudaMalloc((void**)&columnwise_scale_inv, columnwise_scale_size); // NOLINT(*)
(void)cudaMemset(columnwise_scale_inv, 0, columnwise_scale_size);
columnwise_scale_inv_cpu_data_ = std::make_unique<unsigned char[]>(columnwise_scale_size);
std::fill_n(columnwise_scale_inv_cpu_data_.get(), columnwise_scale_size, 0);
auto scale_dtype = colwise_scale_meta.type;
tensor_.set_columnwise_scale_inv(columnwise_scale_inv, scale_dtype, columnwise_scale_shape);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced the remaining raw allocations in the reference path with test::Tensor for the temporary device buffers (RefD/RefGelu/RefAmax) in e11e400.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Yeah, it indeed saved some cudaMalloc/cudaFrees.

How about we put the RefD instantiation inside PerformTest, and pass the Tensor RefD (including its RefAmax D) and RefPreGeluOut to run_reference directly (instead of std::unique_ptr<D_Type[]>& ref_D, float* ref_amax_d, std::unique_ptr<Gelu_Type[]>& ref_pre_gelu_out). Then this can save some ref cpu ptr allocation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of 325ece6?

Comment thread tests/cpp/operator/test_cublaslt_gemm.cu Outdated
Comment thread tests/cpp/operator/test_cublaslt_gemm.cu Outdated
Comment thread tests/cpp/operator/test_cublaslt_gemm.cu Outdated

@wangye805 wangye805 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@matthiasdiener matthiasdiener merged commit 2652462 into dev Jan 27, 2026
2 checks passed
@matthiasdiener matthiasdiener deleted the compute-ref-offload branch January 27, 2026 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants