
rocm-jaxlib-v0.9.2 build failure: stale xla_gpu_cublaslt_algorithm3.patch after XLA pin bump #777

Description

@srinivamd

Summary

The rocm-jaxlib-v0.9.2 branch fails to build because xla_gpu_cublaslt_algorithm3.patch is stale — it tries to add code that already exists in the pinned XLA commit.

Build log: https://github.com/ROCm/aisw-hud/actions/runs/25841974606

Error

ERROR: Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/.../external/xla/temp15380457254727495498/fix_xla_gpu_cublaslt_algorithm3.patch",
  line 5, column 30, in patch_file
    error applying patch /root/.cache/bazel/...fix_xla_gpu_cublaslt_algorithm3.patch:
    --- a/xla/backends/gpu/autotuner/cublaslt.cc
    +++ b/xla/backends/gpu/autotuner/cublaslt.cc
    @@ -124,6 +124,11 @@ CublasLtBackend::GetSupportedConfigs(const HloInstruction& instr) {
    Near line 124, CONTENT_DOES_NOT_MATCH_TARGET

Root Cause

  1. Commit e0bb88af (2026-03-17) added 4 Bazel patches to third_party/xla/workspace.bzl. These apply XLA/Shardy fixes on top of the existing XLA pin without bumping it.

  2. Commit aae3e281 (2026-04-30) bumped the XLA pin to d8b2a5f5ece8af6b45e0fbfd6d3dbea1958d2f7e to pick up a build fix for rocprofiler-sdk v3 and roctracer v1. It did not remove the patch that the new pin made stale.

  3. The new XLA commit already contains the algorithm-3 skip fix from openxla/xla#39277 ("[CublasLt] Disable Cublaslt algorithm 3 which is numerical unstable for complex numbers"). As a result, xla_gpu_cublaslt_algorithm3.patch tries to insert lines that are already present. The target file near line 124 no longer matches the pre-patch content that the hunk expects, so Bazel rejects the patch with CONTENT_DOES_NOT_MATCH_TARGET.
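A stale patch like this can be detected before a pin bump lands. If a hunk's "new side" (its context lines plus its added lines) already appears verbatim in the target file, that hunk has in effect already been applied. The following is a minimal Python sketch of that check. It assumes a single-file unified diff and exact textual matches, with no line offsets or fuzz. The function names are hypothetical, and this is not how Bazel's patch_file handling works internally.

```python
"""Classify each hunk of a unified diff against a target file.

Hypothetical helper sketch: assumes a single-file unified diff and exact
line matches (no offsets, no fuzz). Not Bazel's patch implementation.
"""
from typing import List, Tuple


def parse_hunks(diff_text: str) -> List[Tuple[List[str], List[str]]]:
    """Return (old_side, new_side) line lists for every '@@' hunk."""
    hunks: List[Tuple[List[str], List[str]]] = []
    old = None
    new = None
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            if old is not None:
                hunks.append((old, new))
            old, new = [], []
        elif old is None:
            continue  # '---'/'+++' file headers before the first hunk
        elif line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        elif line.startswith("+"):
            new.append(line[1:])  # added line: only on the new side
        elif line.startswith("-"):
            old.append(line[1:])  # removed line: only on the old side
        else:
            old.append(line[1:])  # context line: present on both sides
            new.append(line[1:])
    if old is not None:
        hunks.append((old, new))
    return hunks


def contains_block(haystack: List[str], block: List[str]) -> bool:
    """True if `block` appears as a contiguous run of lines in `haystack`."""
    n = len(block)
    return any(haystack[i:i + n] == block for i in range(len(haystack) - n + 1))


def classify_hunks(diff_text: str, target_text: str) -> List[str]:
    """Label each hunk 'applies', 'already-applied', or 'conflict'."""
    target = target_text.splitlines()
    labels = []
    for old, new in parse_hunks(diff_text):
        if contains_block(target, new) and not contains_block(target, old):
            labels.append("already-applied")  # the patch's result is already there
        elif contains_block(target, old):
            labels.append("applies")  # pre-patch content found as expected
        else:
            labels.append("conflict")  # neither side matches the target
    return labels


if __name__ == "__main__":
    # Hypothetical toy data, not the real cublaslt.cc contents.
    patch = (
        "--- a/f.cc\n"
        "+++ b/f.cc\n"
        "@@ -1,2 +1,3 @@\n"
        " int a = 1;\n"
        "+int b = 2;\n"
        " int c = 3;\n"
    )
    print(classify_hunks(patch, "int a = 1;\nint c = 3;\n"))  # ['applies']
    print(classify_hunks(patch, "int a = 1;\nint b = 2;\nint c = 3;\n"))  # ['already-applied']
```

To use the check, run it against each patch in workspace.bzl and the matching file at the new pin (here, xla/backends/gpu/autotuner/cublaslt.cc). An "already-applied" result marks the patch as a removal candidate. If upstream merged the fix with different formatting, the check reports "conflict" instead, which still flags the patch for review. With standard tooling, a successful `git apply --check --reverse` of the patch against the XLA checkout gives a similar signal.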

Fix

Remove the stale patch from third_party/xla/workspace.bzl:

# Before (4 patches):
patch_file = [
    "//third_party/xla:shardy_temporary.patch",
    "//third_party/xla:xla_gpu_cublaslt_default.patch",
    "//third_party/xla:xla_gpu_cublaslt_algorithm3.patch",   # ← REMOVE
    "//third_party/xla:xla_nccl_comm_split_deadlock.patch",
],

# After (3 patches):
patch_file = [
    "//third_party/xla:shardy_temporary.patch",
    "//third_party/xla:xla_gpu_cublaslt_default.patch",
    "//third_party/xla:xla_nccl_comm_split_deadlock.patch",
],

Optionally also delete the patch file: third_party/xla/xla_gpu_cublaslt_algorithm3.patch.

Other Patches

The remaining 3 patches are not stale and should be kept:

| Patch | Status | Reason |
|---|---|---|
| xla_gpu_cublaslt_default.patch | Keep | ROCm-specific behavioral change (disables cublaslt by default); not an upstream cherry-pick |
| xla_nccl_comm_split_deadlock.patch | Keep | Fix not yet upstreamed (CheckCliqueIsNotStale and IsParentSupersetOf do not exist in upstream XLA) |
| shardy_temporary.patch | Keep | Targets the Shardy dependency; not affected by the XLA pin bump |

Affected Branch

rocm-jaxlib-v0.9.2

Context

  • v0.9.1 builds successfully (no patches in its workspace.bzl)
  • Both v0.9.1 and v0.9.2 builds use ROCm 7.13.0 from TheRock
