entitlement free system documentation out of date to work with RHCOS on OCP 4.11 (need to do manual tag) #428

relyt0925 · 2022-10-27T01:33:09Z

When following the steps defined at:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/install-gpu-ocp.html#create-the-clusterpolicy-instance

I don't end up with an entitlement-free build system. I do see the driver-toolkit imagestream:

Tylers-MacBook-Pro:~ tylerlisowski$ oc get imagestream -n openshift driver-toolkit
NAME             IMAGE REPOSITORY                                                            TAGS                           UPDATED
driver-toolkit   image-registry.openshift-image-registry.svc:5000/openshift/driver-toolkit   411.86.202210032349-0,latest   55 minutes ago

I use NVIDIA GPU Operator 22.9.

I then follow the ClusterPolicy steps for 4.11:

oc get csv -n nvidia-gpu-operator gpu-operator-certified.v22.9.0 -ojsonpath={.metadata.annotations.alm-examples} | jq .[0] > clusterpolicy.json
oc apply -f clusterpolicy.json 
clusterpolicy.nvidia.com/gpu-cluster-policy configured

However, the driver container still tries to fall back and pull from yum repos:

error: a container name must be specified for pod nvidia-driver-daemonset-411.86.202210072320-0-lvwjl, choose one of: [nvidia-driver-ctr openshift-driver-toolkit-ctr] or one of the init containers: [k8s-driver-manager]
Tylers-MacBook-Pro:~ tylerlisowski$ oc logs -n nvidia-gpu-operator nvidia-driver-daemonset-411.86.202210072320-0-lvwjl -c nvidia-driver-ctr -f
Running nv-ctr-run-with-dtk
+ [[ true == \t\r\u\e ]]
+ echo 'WARNING: RHCOS '\''411.86.202210072320-0'\'' imagetag missing, using entitlement-based fallback'
WARNING: RHCOS '411.86.202210072320-0' imagetag missing, using entitlement-based fallback
+ exec bash -x nvidia-driver init
+ set -eu

Does a manual image tag need to be created to use this path?


relyt0925 · 2022-10-27T01:52:07Z

I fixed it by manually creating the tag:

oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d driver-toolkit:411.86.202210072320-0

I got the image digest by looking at the existing driver-toolkit image tags with:

Tylers-MacBook-Pro:~ tylerlisowski$ oc -n openshift get imagetag | grep driver
driver-toolkit:411.86.202210032349-0                        Scheduled   image/sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d   1         About an hour ago
driver-toolkit:411.86.202210072320-0                        Tag         image/sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d   1         3 minutes ago
driver-toolkit:latest                                       Scheduled   image/sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d   1         About an hour ago
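
A minimal sketch of how the digest lookup above could be scripted. The helper name `imagetag_digest` is hypothetical; it only parses the text printed by `oc -n openshift get imagetag` and assumes the leading columns shown in the output above (tag name, type, image reference).

```shell
# imagetag_digest NAME
# Reads `oc -n openshift get imagetag` output on stdin and prints the
# digest (sha256:...) recorded for the tag called NAME.
# Hypothetical helper: it assumes the image reference is the third column
# and is prefixed with "image/", as in the listing above.
imagetag_digest() {
  awk -v name="$1" '$1 == name { sub(/^image\//, "", $3); print $3 }'
}
```

For example, `oc -n openshift get imagetag | imagetag_digest driver-toolkit:latest` would print the digest to pass to `oc -n openshift tag`.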

shivamerla · 2022-10-27T16:44:48Z

Thanks for reporting this @relyt0925. Working with RH to understand why the tag was missing from the imagestream. This version is picked from the NFD label feature.node.kubernetes.io/system-os_release.OSTREE_VERSION and we expect this tag to be present to match the current running RHCOS version.
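
The match described above can be checked by hand. A minimal sketch, assuming the TAGS column of `oc get imagestream -n openshift driver-toolkit` is a plain comma-separated list as in the output earlier in this thread; the function name `has_dtk_tag` is hypothetical and is not part of the operator.

```shell
# has_dtk_tag TAGS VERSION
# TAGS:    comma-separated tag list, e.g. "411.86.202210032349-0,latest"
# VERSION: value of the NFD label
#          feature.node.kubernetes.io/system-os_release.OSTREE_VERSION
# Returns 0 when VERSION is one of the tags, 1 otherwise.
has_dtk_tag() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

The label value for each node can be listed with `oc get nodes -L feature.node.kubernetes.io/system-os_release.OSTREE_VERSION`. With the values from the original report, `has_dtk_tag "411.86.202210032349-0,latest" "411.86.202210072320-0"` returns non-zero, which corresponds to the "imagetag missing" warning in the driver log.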

mikehollinger · 2022-10-27T19:13:40Z

Thanks @shivamerla . Yeah we figured as much from a past conversation, but couldn't tell if we'd managed to hit some degenerate case bc of a particular version of either the operator or OpenShift itself. For our mutual clients, what would we want to offer as guidance, ie "if you see ^^, then ___." Would @relyt0925 's workaround be a reasonable interim fix?

shivamerla · 2022-10-27T21:48:45Z

@mikehollinger yes, the workaround to tag manually seems reasonable until the root cause is identified. @fabiendupont Can you help identify why there is a version mismatch here? Maybe NFD didn't label the version correctly?

shivamerla · 2022-11-01T23:35:52Z

@relyt0925 @mikehollinger we have tried to reproduce this and see that tag 411.86.202210072320-0 was present by default.

$  oc get imagestream -n openshift driver-toolkit
NAME             IMAGE REPOSITORY   TAGS                           UPDATED
driver-toolkit                      411.86.202210072320-0,latest   3 days ago

also, label:

feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202210072320-0

We need to understand why the DTK tag is missing in your case. Do you see this happening with other clusters too?

mikehollinger · 2022-11-02T00:05:59Z

@relyt0925 ^^ ?

rhocheck · 2023-03-30T15:38:51Z

I hit the same issue while installing the GPU Operator on IBM RH OpenShift 4.10:

Running nv-ctr-run-with-dtk
+ [[ true == \t\r\u\e ]]
+ echo 'WARNING: RHCOS '\''410.84.202303060052-0'\'' imagetag missing, using entitlement-based fallback'
+ exec bash -x nvidia-driver init
WARNING: RHCOS '410.84.202303060052-0' imagetag missing, using entitlement-based fallback

de213022@Rainers-MBP ~ % oc get imagestream -n openshift driver-toolkit
NAME             IMAGE REPOSITORY                                                            TAGS                           UPDATED
driver-toolkit   image-registry.openshift-image-registry.svc:5000/openshift/driver-toolkit   410.84.202302090253-0,latest   2 hours ago

Creating the tag manually fixed it.

shivamerla · 2023-03-30T19:33:41Z

@fabiendupont can you please check why the DTK image tags can be missing in this case?

relyt0925 · 2023-04-26T14:08:18Z

@shivamerla this is expected in the HyperShift flavor of OpenShift:
https://github.com/openshift/hypershift/tree/main/hypershift-operator/controllers

There, masters and workers can be at different patch versions, hence the difference in tags and the current need to do the additional tagging. We do have clusters that can be looked at to see this.

Ultimately, I believe the operator needs a bit more logic to choose the appropriate tags in these environments (and in the interim this workaround can be used).

relyt0925 · 2024-02-13T03:28:39Z

Note: the following algorithm is something that I believe should be handled long term by the NVIDIA GPU Operator in HyperShift environments.

relyt0925 · 2024-02-13T03:30:24Z

With NFD deployed, each node is labeled with the appropriate OS tree version:

      feature.node.kubernetes.io/system-os_release.OSTREE_VERSION: 412.86.202310141028-0

(note that this is the last part of the driver tag)

Then, if the release image associated with the RHCOS node can be looked up, the following can be run:

oc adm release info RELEASEIMAGE | grep driver-toolkit

For example:

oc adm release info us.icr.io/armada-master/ocp-release:4.12.47-x86_64 | grep driver-toolkit
     driver-toolkit                sha256:5e4c83a34f34bbb8d07891afa5090539aed9c0a6511c3be5655e18f1a32f90ab

With that data, everything needed to create the tag is available:

oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@DRIVER_TOOLKIT_SHA_VAL driver-toolkit:OSTREE_VERSION_VAL

From the example above:

oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5e4c83a34f34bbb8d07891afa5090539aed9c0a6511c3be5655e18f1a32f90ab driver-toolkit:412.86.202310141028-0
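
The steps above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `dtk_tag_command` is hypothetical, the digest is assumed to be the last whitespace-separated field of the `grep driver-toolkit` line, and the repository is the `quay.io/openshift-release-dev/ocp-v4.0-art-dev` one used in this thread. It prints the `oc tag` command instead of running it so the result can be reviewed first.

```shell
# dtk_tag_command RELEASE_INFO_LINE OSTREE_VERSION
# RELEASE_INFO_LINE: the line printed by
#   oc adm release info RELEASEIMAGE | grep driver-toolkit
# OSTREE_VERSION:    the node's NFD OSTREE_VERSION label value
# Prints the matching `oc tag` command; returns 1 if no digest is found.
dtk_tag_command() {
  # The digest is assumed to be the last field of the line.
  digest="$(printf '%s\n' "$1" | awk '{print $NF}')"
  case "$digest" in
    sha256:*) ;;
    *) echo "no sha256 digest found in: $1" >&2; return 1 ;;
  esac
  printf 'oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@%s driver-toolkit:%s\n' \
    "$digest" "$2"
}
```

Called with the release info line and label value from the example above, it prints the same `oc -n openshift tag` command shown there.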

brucejcong · 2024-08-07T18:44:38Z

Are there any updates on changing the operator to choose the appropriate tags in HyperShift environments? This mismatch happens whenever the master and worker patch versions differ: the GPU Operator looks for the driver-toolkit image tag based on the worker's OS tree version, while the cluster has the driver-toolkit tagged with the version derived from the master’s OCP release version.

relyt0925 changed the title ~~Cannot get entitlement free system to work with RHCOS on OCP 4.11~~ entitlement free system documentation out of date to work with RHCOS on OCP 4.11 (need to do manual tag) Nov 1, 2022

lohwanjing mentioned this issue Oct 1, 2023

Daemonset Unable to find RHCOS toolkit-driver image when installing on OKD 4.13 #592

Open
