.github/workflows/rocm-ci.yml: 38 changes (32 additions, 6 deletions)
@@ -144,15 +144,28 @@ jobs:

  - name: Run Container
  run: |
+ # All GPUs are visible to the runner; per-suite visibility is set later
+ # via HIP_VISIBLE_DEVICES in the test step. Add render group for the
+ # container. Under kubernetes, GPU isolation comes from DEVICE_FLAG:
+ # the runner writes this pod's allocated render devices into
+ # /etc/podinfo/gha-render-devices; fall back to all GPUs on bare metal.
+ render_gid=$(cat /etc/group | grep render | cut -d: -f3)
+ if [ -f "/etc/podinfo/gha-render-devices" ]; then
+ DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
+ else
+ DEVICE_FLAG="--device /dev/dri"
+ fi
+ # --group-add daemon/bin cover the video-group GID -> subgid 1 mapping
+ # across Ubuntu 24.04 / AlmaLinux base images. /dev/kfd is the single
+ # system-wide compute node and is always required.
+ GPU_FLAG="--device=/dev/mem --device=/dev/kfd $DEVICE_FLAG --group-add video --group-add $render_gid --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined"

Collaborator review comment on the line above:

Because GIDs are not guaranteed to be consistent between host and container, please avoid using group names, at least for video. Also, prefer getent over 'cat /etc/group'.

  docker run -dt \
  --rm \
  --name te-runner \
  --network=host \
- --device=/dev/dri --device=/dev/kfd \
+ $GPU_FLAG \
  --shm-size=16G \
  --pid=host \
- --group-add $(getent group render | cut -d: -f3) \
- --group-add $(getent group video | cut -d: -f3) \
  -v "${{ github.workspace }}:/workspace" \
  -w /workspace \
  ${{ needs.select_image.outputs.image-tag }}
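The reviewer's request could be implemented roughly as follows. This is a minimal sketch that assumes a bash step running on the host. It resolves each group to its numeric GID with getent and passes only numbers to --group-add, so Docker never looks a name up in the container's /etc/group. The helper names gid_of and group_add_flags are hypothetical and not part of the workflow. The sketch also assumes that host GID numbers are the right ones to pass. The PR's own note about a video-group subgid mapping suggests this may not hold on every runner.

```shell
#!/usr/bin/env bash
# Sketch of the review suggestion, not the PR's code.
# gid_of and group_add_flags are hypothetical helper names.

# Print the host's numeric GID for group "$1"; print nothing if it is absent.
gid_of() {
  # getent exits non-zero for an unknown group; "|| true" keeps that from
  # failing the step when "set -o pipefail" is in effect.
  getent group "$1" | cut -d: -f3 || true
}

# Emit "--group-add <gid> " for each named group that exists on the host.
group_add_flags() {
  local name gid
  for name in "$@"; do
    gid=$(gid_of "$name")
    if [ -n "$gid" ]; then
      printf '%s %s ' --group-add "$gid"
    else
      echo "warning: group '$name' not found on host, skipping" >&2
    fi
  done
}

# Usage: pass $GROUP_FLAGS unquoted to docker run in place of
# name-based flags such as "--group-add video".
GROUP_FLAGS=$(group_add_flags render video)
```

The sketch uses getent group render instead of grep render. getent matches only the group named exactly render and also consults NSS sources such as LDAP. By contrast, grep render matches any /etc/group line that contains the substring, including member lists.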
@@ -341,15 +354,28 @@ jobs:

  - name: Run Container
  run: |
+ # All GPUs are visible to the runner; per-suite visibility is set later
+ # via HIP_VISIBLE_DEVICES in the test step. Add render group for the
+ # container. Under kubernetes, GPU isolation comes from DEVICE_FLAG:
+ # the runner writes this pod's allocated render devices into
+ # /etc/podinfo/gha-render-devices; fall back to all GPUs on bare metal.
+ render_gid=$(cat /etc/group | grep render | cut -d: -f3)
+ if [ -f "/etc/podinfo/gha-render-devices" ]; then
+ DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
+ else
+ DEVICE_FLAG="--device /dev/dri"
+ fi
+ # --group-add daemon/bin cover the video-group GID -> subgid 1 mapping
+ # across Ubuntu 24.04 / AlmaLinux base images. /dev/kfd is the single
+ # system-wide compute node and is always required.
+ GPU_FLAG="--device=/dev/mem --device=/dev/kfd $DEVICE_FLAG --group-add video --group-add $render_gid --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined"
  docker run -dt \
  --rm \
  --name te-runner \
  --network=host \
- --device=/dev/dri --device=/dev/kfd \
+ $GPU_FLAG \
  --shm-size=16G \
  --pid=host \
- --group-add $(getent group render | cut -d: -f3) \
- --group-add $(getent group video | cut -d: -f3) \
  -v "${{ github.workspace }}:/workspace" \
  -w /workspace \
  ${{ needs.select_image.outputs.image-tag }}
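Both hunks expand DEVICE_FLAG and GPU_FLAG unquoted, so every flag depends on shell word splitting. The sketch below shows the same Kubernetes / bare-metal device selection collected into a bash array instead. It assumes that /etc/podinfo/gha-render-devices holds whitespace-separated docker flags for the pod's render devices. The exact contents, for example --device /dev/dri/renderD128, are an assumption. The names select_device_args and DEVICE_ARGS, the PODINFO_FILE override, and $IMAGE are hypothetical. Only the file path comes from the workflow. The /dev/mem, group, and capability flags are left out.

```shell
#!/usr/bin/env bash
# Sketch only: the workflow's Kubernetes / bare-metal device selection,
# collected into a bash array instead of an unquoted string.
# select_device_args, DEVICE_ARGS and the PODINFO_FILE override are
# hypothetical names; the default path is the one the workflow reads.
PODINFO_FILE=${PODINFO_FILE:-/etc/podinfo/gha-render-devices}

# Fill the global array DEVICE_ARGS with the device flags for docker run.
select_device_args() {
  DEVICE_ARGS=()
  if [ -f "$PODINFO_FILE" ]; then
    # Kubernetes: assume the file holds whitespace-separated docker flags
    # for this pod's render devices. -d '' reads the whole file; read then
    # returns 1 at end of input, hence "|| true".
    read -r -d '' -a DEVICE_ARGS < "$PODINFO_FILE" || true
  else
    # Bare metal: expose all GPUs.
    DEVICE_ARGS=(--device /dev/dri)
  fi
  # /dev/kfd is the system-wide compute node and is always required.
  DEVICE_ARGS+=(--device /dev/kfd)
}

# Usage (other flags omitted; $IMAGE is a placeholder):
#   select_device_args
#   docker run -dt --rm "${DEVICE_ARGS[@]}" "$IMAGE"
```

Quoting "${DEVICE_ARGS[@]}" passes each flag as a separate argument. A device path is therefore never re-split or glob-expanded, which an unquoted $DEVICE_FLAG can be.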