PGLE/FDO profiling broken on all ROCm GPUs — needs XLA pin bump for rocprofiler-sdk v3 #791

Description

@srinivamd

Summary

JAX PGLE (Profile-Guided Latency Estimation) / FDO profiling does not work on any ROCm GPU. All 5 tests in tests/pgle_test.py fail. This is a tracking issue for the long-term fix.

Immediate workaround PRs (deselect tests):

  • #788 (rocm-jaxlib-v0.9.1)
  • #790 (rocm-jaxlib-v0.9.2)

Jira: ROCM-25581 (gfx120X), ROCM-24266 (gfx94X/gfx950)

Root Cause

Two layers:

1. XLA ROCm profiler uses legacy roctracer v2 API

The current XLA pins for both rocm-jaxlib-v0.9.1 (3cc8846c) and rocm-jaxlib-v0.9.2 (d8b2a5f5) use rocm_tracer_v1.cc / device_tracer_rocm.cc with the legacy roctracer/rocprofiler v2 API.

  • On gfx120X (RDNA4): rocprofiler v2 support was deliberately removed (merged 2025-09-22). Every HIP API callback is dropped with "No capturing function was found" — 28,746 dropped callbacks in CI logs, producing empty FDO profiles.
  • On gfx94X/gfx950 (CDNA): v2 still works for basic tracing, but the ROCm collector doesn't produce CUPTI-equivalent XPlane annotations needed by ConvertXplaneToProfiledInstructionsProto.
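
To check whether a given CI run hits the gfx120X failure mode, one option is to count the dropped-callback message in the job log. The following is a minimal sketch, not the tooling used to get the number above: the message string is taken from this issue, while the surrounding log line format and the `sample_log` contents are hypothetical.

```python
# Message quoted in this issue; the rest of each log line is an assumption.
DROPPED_MARKER = "No capturing function was found"


def count_dropped_callbacks(log_lines):
    """Return how many lines contain the dropped-callback message."""
    return sum(1 for line in log_lines if DROPPED_MARKER in line)


# Hypothetical log excerpt, for illustration only.
sample_log = [
    "W0000 rocm_tracer: No capturing function was found",
    "I0000 some unrelated line",
    "W0000 rocm_tracer: No capturing function was found",
]

dropped = count_dropped_callbacks(sample_log)
print(f"dropped callbacks: {dropped}")
```

For a real log, passing an open file handle (for example `open(path, errors="replace")`) in place of `sample_log` should work, since the function only iterates over lines. A non-zero count on gfx120X would be consistent with the empty FDO profiles described above.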

2. Upstream migration exists but is not in the XLA pins

openxla/xla#29769 by cj401-amd ported XLA from roctracer v2 to rocprofiler-sdk v3. The PR was closed (copybara), but the code appears to have partially landed in upstream XLA main (files renamed from rocm_tracer_v1.cc to rocm_tracer.cc with rocprofiler-sdk references).

Four follow-up fix PRs for the v3 code were also closed without merging:

What needs to happen

  1. Bump the XLA pin on rocm-jaxlib-v0.9.2 (and future branches) to a commit that includes the rocprofiler-sdk v3 profiling backend
  2. Cherry-pick or resubmit the follow-up fix PRs if they haven't landed upstream
  3. Verify FDO profile generation produces valid custom kernel cost entries on both gfx94X and gfx120X
  4. Address rocprofiler-sdk gfx120X gaps: rocm-systems PR #5271 documented a 75% dispatch miss rate on the RDNA4 DefaultSignal path, but that PR was closed without merging
  5. Re-enable PGLE tests by removing --deselect entries from ci/run_pytest_rocm.sh
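
For step 5, a small helper can list which `--deselect` entries still reference `tests/pgle_test.py`, so it is clear what remains to be removed. This is a sketch under assumptions: the actual layout of `ci/run_pytest_rocm.sh` is not shown in this issue, and the test names in `sample_script` are hypothetical placeholders.

```python
import re

# Matches `--deselect <node-id>` or `--deselect=<node-id>` where the
# pytest node ID points into pgle_test.py.
DESELECT_RE = re.compile(r"--deselect[=\s]+(\S*pgle_test\.py\S*)")


def find_pgle_deselects(script_text):
    """Return the pgle_test.py node IDs that the script deselects."""
    return DESELECT_RE.findall(script_text)


# Hypothetical script contents; the real file and test names may differ.
sample_script = """\
pytest -n 4 \\
  --deselect tests/pgle_test.py::PgleTest::testHypotheticalA \\
  --deselect=tests/pgle_test.py::PgleTest::testHypotheticalB \\
  --deselect tests/other_test.py::OtherTest::testKeep \\
  tests/
"""

for node_id in find_pgle_deselects(sample_script):
    print(node_id)
```

Running it against the real script would mean reading the file contents first, for example with `pathlib.Path("ci/run_pytest_rocm.sh").read_text()`. An empty result would indicate that the PGLE tests are no longer deselected.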

AMD contributors on the upstream migration
