gfx1201 gpu_navi4x runners have broken ROCm environments: rocminfo segfaults #53

Description

@benvanik

Summary

Several recent CI CMake Linux gfx1201 package-test jobs on the gpu_navi4x
runner pool show the same runner-health failure: packaged rocminfo resolves
from the installed public dependency package, then immediately segfaults.

This is not a normal test failure. When the workflow temporarily treated
rocminfo as a warning and continued, the installed GPU suite produced a large
cascade of process segfaults, usually:

42% tests passed, 38 tests failed out of 66
Errors while running CTest

PR #52 changes the workflow to use rocminfo as a hard ROCm environment
canary again. While gfx1201 is marked experimental, the job can report the
broken runner and skip the installed suite instead of generating the cascade.
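A minimal sketch of what such a canary could look like, for discussion only. This is not the actual .github/scripts/check_rocm_environment.sh from PR #52; the function name check_rocm_canary and its messages are hypothetical. It relies only on the shell convention that a process killed by a signal reports exit status 128 plus the signal number, so a segfault (SIGSEGV, signal 11) shows up as 139.

```shell
# Hypothetical canary helper; not the script shipped in PR #52.
# Runs the probe command given as arguments and classifies its exit status.
# Returns 0 if the probe exits cleanly, 1 otherwise.
check_rocm_canary() {
  local status=0
  "$@" >/dev/null 2>&1 || status=$?
  if [ "$status" -eq 0 ]; then
    echo "ROCm canary passed"
    return 0
  fi
  if [ "$status" -gt 128 ]; then
    # Shells report death-by-signal as 128 + signal number.
    echo "Error: ROCm canary killed by signal $((status - 128)) (exit status $status)"
  else
    echo "Error: ROCm canary exited with status $status"
  fi
  return 1
}
```

A workflow step could call check_rocm_canary rocminfo and run the installed GPU suite only when it returns 0. Whether a failure then fails the job or only skips the suite is a lane-policy decision left to the workflow.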

Impacted Runners

Sampled the last 80 CI CMake Linux workflow runs on 2026-06-09.

Matching rocminfo segfaults were observed on:

Runner label  Runner name      Matching jobs
gpu_navi4x    CS-RORDMZ-DT145  7
gpu_navi4x    CS-RORDMZ-DT147  4

No matching rocminfo segfault was found on CS-RORDMZ-DT143 in this sampled
window. The latest PR #52 gfx1201 job on CS-RORDMZ-DT143 succeeded.

Failure Signature

Pre-PR #52 warning-mode example:

rocminfo path: /home/ubuntu/actions-runner/_work/hrx-system/hrx-system/build/linux-gpu/install/public-deps/bin/rocminfo
/home/ubuntu/actions-runner/_work/_temp/9ead8cff-499a-488f-a48b-fa504e5798ed.sh: line 4: 382707 Segmentation fault      (core dumped) rocminfo
Warning: rocminfo failed; continuing to installed GPU tests
42% tests passed, 38 tests failed out of 66
Errors while running CTest

PR #52 canary-mode example:

rocminfo path: /home/ubuntu/actions-runner/_work/hrx-system/hrx-system/build/linux-gpu/install/public-deps/bin/rocminfo
.github/scripts/check_rocm_environment.sh: line 19: 347376 Segmentation fault      (core dumped) rocminfo
Error: ROCm environment on this runner is broken; rocminfo failed before GPU tests.

Matching Jobs

Started UTC          Workflow run  Job          Result                            Runner           Symptom
2026-06-09 06:36:42  27188339883   80262712962  failure                           CS-RORDMZ-DT145  rocminfo segfaulted before installed tests
2026-06-09 06:42:45  27188655682   80263583859  failure                           CS-RORDMZ-DT147  rocminfo segfaulted before installed tests
2026-06-09 15:43:43  27217788198   80364536623  failure                           CS-RORDMZ-DT145  rocminfo segfaulted; workflow continued; 38/66 installed tests failed
2026-06-09 16:39:42  27220546268   80376180903  failure                           CS-RORDMZ-DT145  rocminfo segfaulted; workflow continued; 38/66 installed tests failed
2026-06-09 16:54:59  27220546396   80379157442  failure                           CS-RORDMZ-DT145  rocminfo segfaulted; workflow continued; 38/66 installed tests failed
2026-06-09 17:00:14  27222201440   80380071263  failure                           CS-RORDMZ-DT145  rocminfo segfaulted; workflow continued; 38/66 installed tests failed
2026-06-09 17:04:45  27222459403   80381012028  failure                           CS-RORDMZ-DT147  rocminfo segfaulted; workflow continued; 38/66 installed tests failed
2026-06-09 17:07:41  27222637646   80381605883  failure                           CS-RORDMZ-DT145  rocminfo segfaulted; workflow continued; 38/66 installed tests failed
2026-06-09 17:18:41  27223228760   80383743191  failure                           CS-RORDMZ-DT145  rocminfo segfaulted; workflow continued; 38/66 installed tests failed
2026-06-09 17:27:34  27223713966   80385468837  failure                           CS-RORDMZ-DT147  rocminfo segfaulted; workflow continued; 38/66 installed tests failed
2026-06-09 17:29:49  27223853693   80385861126  success due to experimental lane  CS-RORDMZ-DT147  PR #52 canary detected rocminfo segfault and skipped installed tests
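For anyone re-checking the per-runner counts under Impacted Runners, here is a small sketch that tallies the job rows above by runner. The job IDs and runner names are copied from the table; the helper name tally_by_runner is hypothetical, and it assumes input lines of the form "job-id runner-name".

```shell
# Hypothetical helper: count jobs per runner from "job-id runner-name" lines.
tally_by_runner() {
  awk '{ count[$2]++ } END { for (r in count) print r, count[r] }' | sort
}

# Job ID and runner pairs copied from the Matching Jobs table.
tally_by_runner <<'EOF'
80262712962 CS-RORDMZ-DT145
80263583859 CS-RORDMZ-DT147
80364536623 CS-RORDMZ-DT145
80376180903 CS-RORDMZ-DT145
80379157442 CS-RORDMZ-DT145
80380071263 CS-RORDMZ-DT145
80381012028 CS-RORDMZ-DT147
80381605883 CS-RORDMZ-DT145
80383743191 CS-RORDMZ-DT145
80385468837 CS-RORDMZ-DT147
80385861126 CS-RORDMZ-DT147
EOF
# Prints:
# CS-RORDMZ-DT145 7
# CS-RORDMZ-DT147 4
```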

Nearby Non-Matches

These failed gfx1201 jobs in the sampled window did not match the rocminfo
segfault pattern and look like real test/application failures instead of this
runner-health issue:

Request

Please inspect or recycle the affected gpu_navi4x runners:

  • CS-RORDMZ-DT145
  • CS-RORDMZ-DT147

The first health check should be running the packaged/public-deps rocminfo
path used by CI:

/home/ubuntu/actions-runner/_work/hrx-system/hrx-system/build/linux-gpu/install/public-deps/bin/rocminfo
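A rough triage sketch for running that check by hand on a runner. The helper name probe_binary is hypothetical and not part of the CI scripts. It confirms the binary exists and is executable, lists the shared libraries it resolves when ldd is available, and then reports the exit status, where 139 would indicate the same SIGSEGV seen in CI.

```shell
# Hypothetical triage helper for a suspect binary; not part of the CI scripts.
# Returns 2 if the binary is missing or not executable, 0 if it runs cleanly, 1 otherwise.
probe_binary() {
  local bin="$1"
  local status=0
  if [ ! -x "$bin" ]; then
    echo "not found or not executable: $bin"
    return 2
  fi
  # Show which shared libraries the binary resolves, when ldd is available.
  if command -v ldd >/dev/null 2>&1; then
    ldd "$bin" || true
  fi
  "$bin" >/dev/null 2>&1 || status=$?
  echo "exit status: $status"
  [ "$status" -eq 0 ]
}
```

On an affected runner this would be invoked as probe_binary with the public-deps rocminfo path above as its argument. Comparing the ldd output between CS-RORDMZ-DT143 and the two failing machines may show whether they resolve different libraries, though that is only a guess at where to look first.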
