
entitlement free system documentation out of date to work with RHCOS on OCP 4.11 (need to do manual tag) #428

Open
relyt0925 opened this issue Oct 27, 2022 · 12 comments


@relyt0925

relyt0925 commented Oct 27, 2022

When following the steps defined at:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/install-gpu-ocp.html#create-the-clusterpolicy-instance

I don't end up with an entitlement-free build system. I do see the driver-toolkit imagestream:

Tylers-MacBook-Pro:~ tylerlisowski$ oc get imagestream -n openshift driver-toolkit
NAME             IMAGE REPOSITORY                                                            TAGS                           UPDATED
driver-toolkit   image-registry.openshift-image-registry.svc:5000/openshift/driver-toolkit   411.86.202210032349-0,latest   55 minutes ago

I use NVIDIA GPU Operator 22.9
(screenshot attached: Screen Shot 2022-10-26 at 8 31 14 PM)

I then follow the cluster policy steps for 4.11:

oc get csv -n nvidia-gpu-operator gpu-operator-certified.v22.9.0 -ojsonpath={.metadata.annotations.alm-examples} | jq .[0] > clusterpolicy.json
oc apply -f clusterpolicy.json 
clusterpolicy.nvidia.com/gpu-cluster-policy configured

However, the driver container still falls back to the entitlement-based path and tries to pull from yum repos:

error: a container name must be specified for pod nvidia-driver-daemonset-411.86.202210072320-0-lvwjl, choose one of: [nvidia-driver-ctr openshift-driver-toolkit-ctr] or one of the init containers: [k8s-driver-manager]
Tylers-MacBook-Pro:~ tylerlisowski$ oc logs -n nvidia-gpu-operator nvidia-driver-daemonset-411.86.202210072320-0-lvwjl -c nvidia-driver-ctr -f
Running nv-ctr-run-with-dtk
+ [[ true == \t\r\u\e ]]
+ echo 'WARNING: RHCOS '\''411.86.202210072320-0'\'' imagetag missing, using entitlement-based fallback'
WARNING: RHCOS '411.86.202210072320-0' imagetag missing, using entitlement-based fallback
+ exec bash -x nvidia-driver init
+ set -eu

Does a manual image tag need to be created to use this path?

@relyt0925
Author

relyt0925 commented Oct 27, 2022

I fixed it by manually creating the tag:

oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d driver-toolkit:411.86.202210072320-0

I got the digest by looking at the existing driver-toolkit image already in place:

Tylers-MacBook-Pro:~ tylerlisowski$ oc -n openshift get imagetag | grep driver
driver-toolkit:411.86.202210032349-0                        Scheduled   image/sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d   1         About an hour ago
driver-toolkit:411.86.202210072320-0                        Tag         image/sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d   1         3 minutes ago
driver-toolkit:latest                                       Scheduled   image/sha256:1f5f1ae25db67aa82707e1b1dc96c8a53ef7094f320b7eeaef12be9a13fa251d   1         About an hour ago
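
The lookup above can be scripted. This is a minimal sketch, not part of the operator or of `oc`: the function names `digest_for_tag` and `retag_from_existing` are hypothetical, and it assumes the `oc get imagetag` listing keeps the column layout shown above, with an `image/sha256:...` field on each row.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and the column layout of
# `oc -n openshift get imagetag` is assumed to match the listing above.

# digest_for_tag TAG
# Reads an imagetag listing on stdin and prints the sha256 digest recorded
# for driver-toolkit:TAG, with the leading "image/" removed.
digest_for_tag() {
  awk -v name="driver-toolkit:$1" '
    $1 == name {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^image\/sha256:/) {
          sub(/^image\//, "", $i)
          print $i
          exit
        }
      }
    }'
}

# retag_from_existing EXISTING_TAG NEW_TAG
# Reuses the digest behind EXISTING_TAG to create driver-toolkit:NEW_TAG,
# as in the workaround above. Needs a logged-in oc client; it is only
# defined here, not called.
retag_from_existing() {
  digest=$(oc -n openshift get imagetag | digest_for_tag "$1")
  if [ -z "$digest" ]; then
    echo "no digest found for driver-toolkit:$1" >&2
    return 1
  fi
  oc -n openshift tag \
    "quay.io/openshift-release-dev/ocp-v4.0-art-dev@$digest" \
    "driver-toolkit:$2"
}
```

With the versions from this issue, the call would be `retag_from_existing 411.86.202210032349-0 411.86.202210072320-0`.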

@shivamerla
Contributor

shivamerla commented Oct 27, 2022

Thanks for reporting this @relyt0925. Working with RH to understand why the tag was missing from the imagestream. This version is picked from the NFD label feature.node.kubernetes.io/system-os_release.OSTREE_VERSION and we expect this tag to be present to match the current running RHCOS version.
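
To see whether a given node hits this, the label and the imagestream tags can be compared directly. This is a rough sketch under assumptions: `has_dtk_tag` and `check_node` are hypothetical names, and the jsonpath expressions (escaped dots in the label key, `.status.tags[*].tag` on the imagestream) are my reading of the resources rather than anything confirmed in this thread.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; jsonpath expressions are assumptions.

# has_dtk_tag VERSION TAGS
# Succeeds if VERSION is one of the space- or comma-separated TAGS.
has_dtk_tag() {
  version=$1
  for tag in $(printf '%s' "$2" | tr ',' ' '); do
    if [ "$tag" = "$version" ]; then
      return 0
    fi
  done
  return 1
}

# check_node NODE
# Reads the NFD label from NODE and the tags on the driver-toolkit imagestream,
# then reports whether the expected tag exists. Needs a logged-in oc client;
# it is only defined here, not called.
check_node() {
  node_version=$(oc get node "$1" \
    -o jsonpath='{.metadata.labels.feature\.node\.kubernetes\.io/system-os_release\.OSTREE_VERSION}')
  stream_tags=$(oc get imagestream driver-toolkit -n openshift \
    -o jsonpath='{.status.tags[*].tag}')
  if has_dtk_tag "$node_version" "$stream_tags"; then
    echo "driver-toolkit:$node_version is present"
  else
    echo "driver-toolkit:$node_version is missing"
  fi
}
```

Run as `check_node <node-name>`; a "missing" result corresponds to the `imagetag missing, using entitlement-based fallback` warning in the driver container log.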

@mikehollinger

Thanks @shivamerla. Yeah, we figured as much from a past conversation, but couldn't tell if we'd managed to hit some degenerate case because of a particular version of either the operator or OpenShift itself. For our mutual clients, what would we want to offer as guidance, i.e. "if you see ^^, then ___"? Would @relyt0925's workaround be a reasonable interim fix?

@shivamerla
Contributor

@mikehollinger yes, the workaround of tagging manually seems reasonable until the root cause is identified. @fabiendupont can you help identify why there is a version mismatch here? Maybe NFD didn't label the version correctly?

relyt0925 changed the title from "Cannot get entitlement free system to work with RHCOS on OCP 4.11" to "entitlement free system documentation out of date to work with RHCOS on OCP 4.11 (need to do manual tag)" on Nov 1, 2022
@shivamerla
Contributor

@relyt0925 @mikehollinger we have tried to reproduce this and see that tag 411.86.202210072320-0 was present by default.

$  oc get imagestream -n openshift driver-toolkit
NAME             IMAGE REPOSITORY   TAGS                           UPDATED
driver-toolkit                      411.86.202210072320-0,latest   3 days ago

Also, the node label:

feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202210072320-0

We need to understand why the DTK tag is missing in your case. Do you see this happening with other clusters too?

@mikehollinger

mikehollinger commented Nov 2, 2022

@relyt0925 ^^ ?

@rhocheck

I hit the same issue while installing the GPU Operator on IBM RH OpenShift 4.10:

Running nv-ctr-run-with-dtk
+ [[ true == \t\r\u\e ]]
+ echo 'WARNING: RHCOS '\''410.84.202303060052-0'\'' imagetag missing, using entitlement-based fallback'
+ exec bash -x nvidia-driver init
WARNING: RHCOS '410.84.202303060052-0' imagetag missing, using entitlement-based fallback
de213022@Rainers-MBP ~ % oc get imagestream -n openshift driver-toolkit
NAME             IMAGE REPOSITORY                                                            TAGS                           UPDATED
driver-toolkit   image-registry.openshift-image-registry.svc:5000/openshift/driver-toolkit   410.84.202302090253-0,latest   2 hours ago

Creating the tag manually fixed it.

@shivamerla
Contributor

@fabiendupont can you please check why the DTK image tags can be missing in this case?

@relyt0925
Author

@shivamerla this is expected in the HyperShift flavor of OpenShift:
https://github.com/openshift/hypershift/tree/main/hypershift-operator/controllers

There, masters and workers can be at different patch versions, hence the difference in tags and the current need for the additional tagging. We have clusters that can be looked at to see this.

Ultimately I believe the operator has to have a bit more logic to appropriately choose the tags in these environments (and in the interim this workaround can be done)

@relyt0925
Author

Note: the following algorithm is something that I believe should be handled long term by the NVIDIA GPU Operator in HyperShift environments.

@relyt0925
Author

relyt0925 commented Feb 13, 2024

With NFD deployed, each node is labeled with the appropriate OS tree version:

      feature.node.kubernetes.io/system-os_release.OSTREE_VERSION: 412.86.202310141028-0

(Note that this value is the last part of the driver-toolkit tag.)

Then, if the release image associated with the RHCOS node can be looked up, the following can be run:
oc adm release info RELEASEIMAGE | grep driver-toolkit
For example:

oc adm release info us.icr.io/armada-master/ocp-release:4.12.47-x86_64 | grep driver-toolkit
     driver-toolkit                sha256:5e4c83a34f34bbb8d07891afa5090539aed9c0a6511c3be5655e18f1a32f90ab

With that data, everything needed to create the tag is available:

oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@DRIVER_TOOLKIT_SHA_VAL driver-toolkit:OSTREE_VERSION_VAL

Using the values from the example above:

oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5e4c83a34f34bbb8d07891afa5090539aed9c0a6511c3be5655e18f1a32f90ab driver-toolkit:412.86.202310141028-0
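
The two steps can be chained in a small script. This is a sketch under assumptions, not operator code: `dtk_digest_from_release_info` and `dtk_tag_command` are hypothetical helper names, and the parsing assumes the `oc adm release info` output lists `driver-toolkit` followed by its digest, as in the example above. The tag command is printed rather than executed so it can be reviewed first.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; the output format of
# `oc adm release info` is assumed to match the example above.

# dtk_digest_from_release_info
# Reads `oc adm release info RELEASEIMAGE` output on stdin and prints the
# sha256 digest listed on the driver-toolkit line.
dtk_digest_from_release_info() {
  awk '$1 == "driver-toolkit" { print $2; exit }'
}

# dtk_tag_command DIGEST OSTREE_VERSION
# Prints (does not run) the oc tag command for the given digest and version.
dtk_tag_command() {
  printf 'oc -n openshift tag quay.io/openshift-release-dev/ocp-v4.0-art-dev@%s driver-toolkit:%s\n' "$1" "$2"
}
```

For example, with `RELEASE_IMAGE` set to the release image and `OSTREE_VERSION` set to the node's label value (both placeholder variable names):

    digest=$(oc adm release info "$RELEASE_IMAGE" | dtk_digest_from_release_info)
    dtk_tag_command "$digest" "$OSTREE_VERSION"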

@brucejcong

Are there any updates on getting the operator to choose the appropriate tags in HyperShift environments? This mismatch happens whenever the master and worker patch versions differ: the GPU Operator looks for the driver-toolkit image tag based on the worker's OS tree version, while the cluster has driver-toolkit tagged with the version derived from the master's OCP release version.

5 participants