rocm-ci: scope test container to pod-allocated GPUs #611
Open
okakarpa wants to merge 1 commit into
The sGPU/mGPU jobs launched the test container with '--device=/dev/dri --device=/dev/kfd', exposing ALL host GPUs to the nested (privileged-dind) container regardless of the GPUs Kubernetes allocated to the pod. Combined with the hard-coded absolute HIP_VISIBLE_DEVICES=0..3, two jobs co-scheduled on the same node both pinned physical GPUs 0-3 and collided (OOM/hangs/test failures) while GPUs 4-7 sat idle. Jobs only passed when the node was otherwise idle -- arch-independent (mi300x and mi35x).

Build GPU_FLAG from /etc/podinfo/gha-render-devices, which the runner populates with this pod's allocated '--device /dev/dri/renderD*' flags (falls back to all GPUs on bare metal). /dev/kfd is always passed.

The container now sees only its allocated GPUs as 0..N-1, so the per-suite HIP_VISIBLE_DEVICES=0/1/2/3 split is correct and collision-free across co-scheduled pods.

Requires the runner ScaleSet to populate /etc/podinfo/gha-render-devices (see companion rocOps change).
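The fallback logic described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the function name build_device_flag is hypothetical, and it assumes the podinfo file holds nothing but ready-to-use '--device ...' flags, as the description states.

```shell
#!/usr/bin/env bash
# Sketch only: build_device_flag is a hypothetical helper name. The file
# content format ('--device /dev/dri/renderD*' flags) is taken from the
# PR description and not verified against the runner.

build_device_flag() {
    local podinfo="${1:-/etc/podinfo/gha-render-devices}"
    if [ -s "$podinfo" ]; then
        # Collapse newlines so the flags can be spliced into one command line.
        tr '\n' ' ' < "$podinfo" | sed 's/[[:space:]]*$//'
    else
        # Bare metal or no allocation info: fall back to exposing all GPUs.
        printf '%s' '--device=/dev/dri'
    fi
}

DEVICE_FLAG="$(build_device_flag)"
# /dev/kfd is always passed, independent of which render nodes are allocated.
GPU_FLAG="--device=/dev/kfd $DEVICE_FLAG"
echo "$GPU_FLAG"
```

Taking the path as an optional argument is only a convenience so the fallback branch can be exercised on a machine without the podinfo mount.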
fe29122 to
6c43fb4
ipanfilo
reviewed
Jun 4, 2026
# --group-add daemon/bin cover the video-group GID -> subgid 1 mapping
# across Ubuntu 24.04 / AlmaLinux base images. /dev/kfd is the single
# system-wide compute node and is always required.
GPU_FLAG="--device=/dev/mem --device=/dev/kfd $DEVICE_FLAG --group-add video --group-add $render_gid --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined"
Collaborator
Because GIDs are not guaranteed to be consistent between host and container, please avoid using group names, at least for video. It would also be better to use getent rather than 'cat /etc/group'.
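One way to act on this comment is to resolve each group to its numeric GID with getent and pass that number to --group-add. The sketch below is an illustration only: group_gid and group_add_flags are hypothetical helper names, and it assumes the lookup runs on the side whose GIDs own the device nodes.

```shell
#!/usr/bin/env bash
# Sketch only: group_gid and group_add_flags are hypothetical helper names.

# Print the numeric GID of a group, or fail if the group is unknown.
group_gid() {
    local entry
    entry="$(getent group "$1")" || return 1
    # getent prints name:password:gid:members
    printf '%s\n' "$entry" | cut -d: -f3
}

# Emit one '--group-add <gid>' per group that exists on this system.
group_add_flags() {
    local name gid flags=""
    for name in "$@"; do
        if gid="$(group_gid "$name")"; then
            flags="$flags --group-add $gid"
        fi
    done
    printf '%s\n' "${flags# }"
}

group_add_flags video render
```

Unlike reading /etc/group directly, getent goes through the system's configured name-service sources, so it also finds groups that are not defined in the local file.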
The sGPU/mGPU jobs launched the test container with '--device=/dev/dri --device=/dev/kfd', exposing ALL host GPUs to the nested (privileged-dind) container. Combined with the hard-coded absolute HIP_VISIBLE_DEVICES=0..3, two jobs co-scheduled on the same node both pinned physical GPUs 0-3 and collided (OOM/hangs/test failures) while GPUs 4-7 sat idle. Jobs only passed when the node was otherwise idle -- arch-independent (seen on mi300x and mi35x).
Forward only the GPUs the AMD device plugin allocated to this pod: it restricts /dev/dri/renderD*+card* in the runner container to the pod's GPUs, so enumerate those and pass only them. /dev/kfd (single system-wide compute node) is always passed. Fail fast if zero GPUs are allocated.
The container now sees only its allocated GPUs as 0..N-1, so the existing per-suite HIP_VISIBLE_DEVICES=0/1/2/3 split is correct and collision-free across co-scheduled pods.
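The enumerate-and-fail-fast step from the commit message can be sketched as follows. This is a hedged illustration, not the script from the PR: enumerate_gpu_devices is a hypothetical name, and the directory argument exists only so the logic can be tried without real GPUs.

```shell
#!/usr/bin/env bash
# Sketch only: enumerate_gpu_devices is a hypothetical helper name.

enumerate_gpu_devices() {
    local dri_dir="${1:-/dev/dri}" node flags="" count=0
    for node in "$dri_dir"/renderD* "$dri_dir"/card*; do
        # An unmatched glob stays literal, so skip paths that do not exist.
        [ -e "$node" ] || continue
        flags="$flags --device=$node"
        # Count render nodes only; each allocated GPU exposes one.
        case "$node" in */renderD*) count=$((count + 1)) ;; esac
    done
    if [ "$count" -eq 0 ]; then
        echo "no GPUs allocated to this pod (no render nodes in $dri_dir)" >&2
        return 1
    fi
    printf '%s\n' "${flags# }"
}

if DEVICE_FLAG="$(enumerate_gpu_devices)"; then
    # /dev/kfd is always passed alongside the per-GPU nodes.
    echo "--device=/dev/kfd $DEVICE_FLAG"
fi
```

In a CI job the failure branch would normally abort the step (for example with `|| exit 1`) rather than continue without GPUs.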