entitlement free system documentation out of date to work with RHCOS on OCP 4.11 (need to do manual tag) #428
Comments
I fixed by manually running a tag:
That I generated by looking at the existing driver-toolkit image in place with
|
Thanks for reporting this @relyt0925. Working with RH to understand why the tag was missing from the imagestream. This version is picked from the NFD label |
Thanks @shivamerla . Yeah we figured as much from a past conversation, but couldn't tell if we'd managed to hit some degenerate case bc of a particular version of either the operator or OpenShift itself. For our mutual clients, what would we want to offer as guidance, ie "if you see ^^, then ___." Would @relyt0925 's workaround be a reasonable interim fix? |
@mikehollinger yes, the workaround of tagging manually seems reasonable until the root cause is identified. @fabiendupont Can you help identify why there is a version mismatch here? Maybe NFD didn't label the version correctly? |
@relyt0925 @mikehollinger we have tried to reproduce this and see that tag
also, label:
need to understand why the DTK tag is missing in your case. Do you see this happening with other clusters too? |
@relyt0925 ^^ ? |
I hit the same issue while installing the GPU Operator on IBM RH Openshift 4.10
Creating the tag manually fixed it. |
@fabiendupont can you please check why the DTK image tags can be missing in this case? |
@shivamerla this is expected in the Hypershift flavor of OpenShift: masters and workers can be at different patch versions, hence the difference in tags and the current need to do the additional tagging. We do have clusters that can be looked at to see this. Ultimately, I believe the operator needs a bit more logic to choose the appropriate tags in these environments (and in the interim this workaround can be used). |
Note: the algorithm, which I believe should be handled long term by the NVIDIA GPU Operator in Hypershift environments, is the following. |
With NFD deployed, each node is labeled with the appropriate OS tree version
(note that this is the last part of the driver tag). Then, if the release image associated with the RHCOS node can be looked up, the following can be run
With that data, everything needed to perform the tag is available
from the example above:
|
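The original commands in the comment above did not survive, so the following is only a minimal sketch of the described steps, not the commenter's exact invocation. It assumes the NFD label is `feature.node.kubernetes.io/system-os_release.OSTREE_VERSION`, that the `driver-toolkit` imagestream lives in the `openshift` namespace, and that `oc adm release info --image-for=driver-toolkit` is used to resolve the image from the release payload. The helper names and all image references and versions are hypothetical placeholders. The code only builds the command lines; it does not run them.

```python
from __future__ import annotations

import shlex

# Assumption: NFD publishes the RHCOS OS tree version under this node label.
OSTREE_LABEL = "feature.node.kubernetes.io/system-os_release.OSTREE_VERSION"


def release_info_command(release_image: str) -> list[str]:
    """Command that resolves the driver-toolkit image inside a release payload."""
    return ["oc", "adm", "release", "info", "--image-for=driver-toolkit", release_image]


def tag_command(dtk_image: str, ostree_version: str, namespace: str = "openshift") -> list[str]:
    """Command that tags the resolved image as driver-toolkit:<OS tree version>.

    Assumption: the imagestream is named `driver-toolkit` and lives in `namespace`.
    """
    return ["oc", "tag", "-n", namespace, dtk_image, f"driver-toolkit:{ostree_version}"]


if __name__ == "__main__":
    # Hypothetical placeholder values; substitute the ones from your own cluster.
    worker_ostree_version = "<OSTREE_VERSION from the worker node label>"
    worker_release_image = "<release image matching the worker nodes>"
    dtk_image = "<image printed by the first command>"

    print(shlex.join(release_info_command(worker_release_image)))
    print(shlex.join(tag_command(dtk_image, worker_ostree_version)))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere and lets the values be reviewed before anything is changed on the cluster.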
Are there any updates on changing the operator to choose the appropriate tags for Hypershift environments? This mismatch happens whenever the master and worker patch versions differ: the GPU Operator looks for the driver-toolkit image tag based on the worker's OS tree version, while the cluster has the driver-toolkit tagged with the version derived from the master's OCP release version. |
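To illustrate the mismatch described above, here is a minimal sketch that reports which nodes carry an OS tree version with no matching driver-toolkit tag. It works on plain dictionaries rather than calling the cluster API. The function name, node names, and version strings are hypothetical, and the NFD label name is an assumption.

```python
from __future__ import annotations

# Assumption: NFD publishes the RHCOS OS tree version under this node label.
OSTREE_LABEL = "feature.node.kubernetes.io/system-os_release.OSTREE_VERSION"


def missing_dtk_tags(
    node_labels: dict[str, dict[str, str]],
    existing_tags: set[str],
) -> dict[str, str]:
    """Return {node name: OS tree version} for nodes whose version has no tag.

    node_labels maps each node name to its label dictionary.
    existing_tags holds the tag names currently present on the imagestream.
    Nodes without the NFD label are skipped, since no tag can be derived for them.
    """
    missing: dict[str, str] = {}
    for node, labels in node_labels.items():
        version = labels.get(OSTREE_LABEL)
        if version is not None and version not in existing_tags:
            missing[node] = version
    return missing


if __name__ == "__main__":
    # Hypothetical data: the cluster was tagged from the master's release,
    # while one worker runs a different patch version.
    nodes = {
        "worker-a": {OSTREE_LABEL: "411.86.1"},
        "worker-b": {OSTREE_LABEL: "411.86.2"},
    }
    print(missing_dtk_tags(nodes, {"411.86.1", "latest"}))
```

Any node reported here would need the manual tag described earlier in the thread until the operator handles this case itself.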
When following the steps defined at:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/install-gpu-ocp.html#create-the-clusterpolicy-instance
I don't ultimately get an entitlement free build system. I do see the driver container
I use Nvidia GPU operator 22.9

I then follow these cluster policy steps for 4.11
However, the driver container still tries to fall back and pull from yum repos:
Does a manual image tag need to be created to use this path?