rocm-ci: scope test container to pod-allocated GPUs by okakarpa · Pull Request #611 · ROCm/TransformerEngine

okakarpa · 2026-06-04T16:46:56Z

The sGPU/mGPU jobs launched the test container with '--device=/dev/dri --device=/dev/kfd', exposing ALL host GPUs to the nested (privileged-dind) container. Combined with the hard-coded absolute HIP_VISIBLE_DEVICES=0..3, two jobs co-scheduled on the same node both pinned physical GPUs 0-3 and collided (OOM/hangs/test failures) while GPUs 4-7 sat idle. Jobs only passed when the node was otherwise idle -- arch-independent (seen on mi300x and mi35x).

Forward only the GPUs the AMD device plugin allocated to this pod: it restricts /dev/dri/renderD*+card* in the runner container to the pod's GPUs, so enumerate those and pass only them. /dev/kfd (single system-wide compute node) is always passed. Fail fast if zero GPUs are allocated.

The container now sees only its allocated GPUs as 0..N-1, so the existing per-suite HIP_VISIBLE_DEVICES=0/1/2/3 split is correct and collision-free across co-scheduled pods.
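The enumeration described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `build_device_flags` and its directory parameter are hypothetical, and the sketch assumes the device plugin has already restricted `/dev/dri` in the runner container to the pod's GPUs.

```shell
# Sketch only: build_device_flags and its dri_dir argument are illustrative
# names, not taken from the PR.
build_device_flags() {
  local dri_dir="${1:-/dev/dri}"
  local flags="" node
  for node in "$dri_dir"/renderD* "$dri_dir"/card*; do
    # An unmatched glob stays literal, so skip anything that does not exist.
    [ -e "$node" ] || continue
    flags="${flags:+$flags }--device=$node"
  done
  if [ -z "$flags" ]; then
    # Fail fast when the pod has no GPUs allocated.
    echo "ERROR: no GPU device nodes found under $dri_dir" >&2
    return 1
  fi
  printf '%s\n' "$flags"
}
```

A caller would then prepend the always-required compute node, for example `GPU_FLAG="--device=/dev/kfd $(build_device_flags)"`, and abort the job if the function returns non-zero.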

The sGPU/mGPU jobs launched the test container with '--device=/dev/dri --device=/dev/kfd', exposing ALL host GPUs to the nested (privileged-dind) container regardless of the GPUs Kubernetes allocated to the pod. Combined with the hard-coded absolute HIP_VISIBLE_DEVICES=0..3, two jobs co-scheduled on the same node both pinned physical GPUs 0-3 and collided (OOM/hangs/test failures) while GPUs 4-7 sat idle. Jobs only passed when the node was otherwise idle -- arch-independent (mi300x and mi35x).

Build GPU_FLAG from /etc/podinfo/gha-render-devices, which the runner populates with this pod's allocated '--device /dev/dri/renderD*' flags (falls back to all GPUs on bare metal). /dev/kfd is always passed.

The container now sees only its allocated GPUs as 0..N-1, so the per-suite HIP_VISIBLE_DEVICES=0/1/2/3 split is correct and collision-free across co-scheduled pods.

Requires the runner ScaleSet to populate /etc/podinfo/gha-render-devices (see companion rocOps change).
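A rough sketch of the podinfo-based approach is shown below. The helper name `podinfo_device_flags` is hypothetical, and the file format is an assumption: whitespace- or newline-separated `--device /dev/dri/renderD<N>` flags. The sketch also assumes that a present but empty file should be treated as an error rather than a reason to fall back; the PR text does not say how that case is handled.

```shell
# Sketch only: podinfo_device_flags is an illustrative name, and the file
# format (whitespace-separated '--device <path>' flags) is an assumption.
podinfo_device_flags() {
  local podinfo_file="${1:-/etc/podinfo/gha-render-devices}"
  local flags
  if [ ! -e "$podinfo_file" ]; then
    # No podinfo file (e.g. bare metal): fall back to exposing all GPUs.
    printf '%s\n' "--device=/dev/dri"
    return 0
  fi
  # Collapse newlines and repeated whitespace into single spaces.
  flags=$(tr -s '[:space:]' ' ' < "$podinfo_file" | sed 's/^ //; s/ $//')
  if [ -z "$flags" ]; then
    # Assumed policy: an empty file means zero GPUs allocated, so fail.
    echo "ERROR: $podinfo_file lists no GPUs" >&2
    return 1
  fi
  printf '%s\n' "$flags"
}
```

The result would be combined with the compute node as in `GPU_FLAG="--device=/dev/kfd $(podinfo_device_flags)"`.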

ipanfilo · 2026-06-04T19:09:26Z

+          # --group-add daemon/bin cover the video-group GID -> subgid 1 mapping
+          # across Ubuntu 24.04 / AlmaLinux base images. /dev/kfd is the single
+          # system-wide compute node and is always required.
+          GPU_FLAG="--device=/dev/mem --device=/dev/kfd $DEVICE_FLAG --group-add video --group-add $render_gid --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined"


Because GIDs are not guaranteed to be consistent between host and container, please avoid using group names, at least for video. It is also better to use getent rather than 'cat /etc/group'.
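One way to act on this suggestion is sketched below: resolve each group to its numeric GID with `getent` and pass the number to `--group-add`. The helper names `group_gid` and `numeric_group_flags` are hypothetical and not part of the PR; `getent group` and `cut` are standard tools, but whether the workflow should silently skip missing groups is an assumption.

```shell
# Sketch only: group_gid and numeric_group_flags are illustrative names.
# Print the numeric GID of a group, or return non-zero if it does not exist.
group_gid() {
  local entry
  entry=$(getent group "$1") || return 1
  # getent prints name:password:gid:members
  printf '%s\n' "$entry" | cut -d: -f3
}

# Build '--group-add <gid>' flags for the named groups, skipping missing ones.
numeric_group_flags() {
  local flags="" name gid
  for name in "$@"; do
    if gid=$(group_gid "$name"); then
      flags="${flags:+$flags }--group-add $gid"
    fi
  done
  printf '%s\n' "$flags"
}
```

For example, `numeric_group_flags video render` yields flags based on the GIDs where the script runs, so the container is given the numeric IDs that own the device nodes rather than whatever GID a group name happens to map to inside the image.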

okakarpa requested review from ipanfilo, wangye805 and wenchenvincent as code owners June 4, 2026 16:46

alextmagro added the ci-level 3 CI test level 3 label Jun 4, 2026

okakarpa force-pushed the rocm-ci-gpu-isolation-podinfo branch from fe29122 to 6c43fb4 Compare June 4, 2026 17:03

ipanfilo reviewed Jun 4, 2026

okakarpa wants to merge 1 commit into dev from rocm-ci-gpu-isolation-podinfo
