Summary
JAX PGLE (Profile-Guided Latency Estimation) / FDO profiling does not work on any ROCm GPU. All 5 tests in tests/pgle_test.py fail. This is a tracking issue for the long-term fix.
Immediate workaround PRs (deselect tests):
- #788: rocm-jaxlib-v0.9.1
- #790: rocm-jaxlib-v0.9.2
Jira: ROCM-25581 (gfx120X), ROCM-24266 (gfx94X/gfx950)
Root Cause
Two layers:
1. XLA ROCm profiler uses legacy roctracer v2 API
The current XLA pins for both rocm-jaxlib-v0.9.1 (3cc8846c) and rocm-jaxlib-v0.9.2 (d8b2a5f5) use rocm_tracer_v1.cc / device_tracer_rocm.cc with the legacy roctracer/rocprofiler v2 API.
- On gfx120X (RDNA4): rocprofiler v2 support was deliberately removed (merged 2025-09-22). Every HIP API callback is dropped with "No capturing function was found"; CI logs show 28,746 dropped callbacks, and the resulting FDO profiles are empty.
- On gfx94X/gfx950 (CDNA): v2 still works for basic tracing, but the ROCm collector doesn't produce the CUPTI-equivalent XPlane annotations needed by ConvertXplaneToProfiledInstructionsProto.
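To confirm the gfx120X symptom on a given CI run, the dropped-callback message can be counted directly in the captured log. The snippet below is a minimal sketch: the marker string is the message quoted above, but the helper name and the sample log lines are hypothetical, and the real log line format may differ.

```python
from typing import Iterable

# Message quoted in this issue; the surrounding log line layout is an assumption.
DROPPED_CALLBACK_MARKER = "No capturing function was found"


def count_dropped_callbacks(log_lines: Iterable[str]) -> int:
    """Count log lines that contain the dropped-callback message."""
    return sum(1 for line in log_lines if DROPPED_CALLBACK_MARKER in line)


# Hypothetical log excerpt, for illustration only.
sample_log = [
    "rocm_tracer: No capturing function was found",
    "rocm_tracer: tracing started",
    "rocm_tracer: No capturing function was found",
]
print(count_dropped_callbacks(sample_log))
```

Because the function accepts any iterable of strings, an open file handle for a CI log can be passed in directly. A non-zero count after the pin bump would suggest the legacy tracer path is still in use.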
2. Upstream migration exists but is not in the XLA pins
openxla/xla#29769 by cj401-amd ported XLA from roctracer v2 to rocprofiler-sdk v3. The PR was closed (copybara), but the code appears to have partially landed in upstream XLA main (files renamed from rocm_tracer_v1.cc to rocm_tracer.cc with rocprofiler-sdk references).
Four follow-up fix PRs for the v3 code were also closed without merging.
What needs to happen
- Bump the XLA pin on rocm-jaxlib-v0.9.2 (and future branches) to a commit that includes the rocprofiler-sdk v3 profiling backend
- Cherry-pick or resubmit the follow-up fix PRs if they haven't landed upstream
- Verify FDO profile generation produces valid custom kernel cost entries on both gfx94X and gfx120X
- Address rocprofiler-sdk gfx120X gaps: rocm-systems PR #5271, which documented a 75% dispatch miss rate on the RDNA4 DefaultSignal path, was closed without merge
- Re-enable PGLE tests by removing the --deselect entries from ci/run_pytest_rocm.sh
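The verification step could be scripted roughly as follows. This is a sketch under assumptions, not the test suite's own code: both helper names are hypothetical, the byte-substring check for "custom" is only a rough proxy for "valid custom kernel cost entries", and the use of jax.experimental.profiler.get_profiled_instructions_proto and the plugins/profile directory layout should be confirmed against the pinned JAX version.

```python
import glob
import os
import tempfile


def has_custom_cost_entries(fdo_profile: bytes) -> bool:
    """Return True if the serialized profile is non-empty and mentions 'custom'."""
    return len(fdo_profile) > 0 and b"custom" in fdo_profile


def collect_fdo_profile(fn, *args) -> bytes:
    """Trace one call of `fn` and return the serialized FDO profile bytes."""
    # Imported lazily so has_custom_cost_entries works without JAX installed.
    import jax
    from jax.experimental import profiler as exp_profiler

    with tempfile.TemporaryDirectory() as tmpdir:
        jax.profiler.start_trace(tmpdir)
        jax.block_until_ready(fn(*args))
        jax.profiler.stop_trace()
        # Assumed layout: one run directory per trace under plugins/profile/.
        pattern = os.path.join(tmpdir, "plugins/profile/**/")
        run_dirs = sorted(d for d in glob.glob(pattern) if os.path.isdir(d))
        if not run_dirs:
            return b""
        return exp_profiler.get_profiled_instructions_proto(run_dirs[-1])
```

On a ROCm machine, passing a jitted function and its arguments to collect_fdo_profile and then checking has_custom_cost_entries on the result would give a quick pass/fail signal for each of gfx94X and gfx120X before the deselected tests are re-enabled.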
AMD contributors on the upstream migration
- cj401-amd (chunyjin@amd.com): original migration PR
- magaonka-amd (magaonka@amd.com): follow-up fix PRs