dcgmproftester pod from install docs using outdated cuda #493

Closed
benlsheets opened this issue Feb 26, 2023 · 6 comments

@benlsheets

The install docs suggest validating the install with an example pod that runs dcgmproftester. When following the install docs with the v22.9.2 operator, this pod errors out with: "Wrong version of dcgmproftester is used. Expected Cuda version is 11; Installed Cuda version is 12."

The container image also comes from a repository that is explicitly deprecated. I couldn't find a similar dcgmproftester container among the CUDA samples on NGC. It looks like the cuda validator pod essentially runs the vector add from the previous step in the docs, but I couldn't find the dcgmproftester binary in that image.

@shivamerla
Contributor

@benlsheets We need to update the dcgmproftester image in NGC to work with CUDA 12. I will work with internal teams to publish this.

@rajan123456

+1 any update on this? Thanks!

@allkusary

+1

@PrometheusComing

+1

@PrometheusComing

+1

@shivamerla
Contributor

shivamerla commented Dec 21, 2023

dcgmproftester is part of the dcgm image itself, so the test can be run using the pod spec below. There is no need for a separate sample container.


 apiVersion: v1
 kind: Pod
 metadata:
   name: dcgmproftester
 spec:
   restartPolicy: OnFailure
   containers:
    - name: dcgmproftester12
      image: nvcr.io/nvidia/cloud-native/dcgm:3.3.0-1-ubuntu22.04
      command: ["/usr/bin/dcgmproftester12"]
      args: ["--no-dcgm-validation", "-t 1004", "-d 30"]
      resources:
        limits:
          nvidia.com/gpu: 1
      securityContext:
        capabilities:
          add: ["SYS_ADMIN"]
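
For anyone who wants to generate this spec from a script rather than hand-edit YAML, here is a minimal sketch that builds the same manifest as a Python dict and prints it as JSON, which kubectl also accepts. The function name build_pod and its parameters are hypothetical, chosen only for illustration; the image, binary path, and arguments are copied from the spec above. Note that the container key must be spelled `command`, since a misspelled key is not a recognized field of the container spec.

```python
import json


def build_pod(
    image="nvcr.io/nvidia/cloud-native/dcgm:3.3.0-1-ubuntu22.04",
    binary="/usr/bin/dcgmproftester12",
    test_id=1004,
    duration_s=30,
):
    """Return the dcgmproftester pod manifest as a plain dict.

    build_pod and its parameter names are hypothetical helpers; the
    values mirror the pod spec posted in this thread.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "dcgmproftester"},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": "dcgmproftester12",
                    "image": image,
                    # The key is "command"; a typo here would not be a valid field.
                    "command": [binary],
                    # Arguments kept in the same form as the posted spec.
                    "args": [
                        "--no-dcgm-validation",
                        f"-t {test_id}",
                        f"-d {duration_s}",
                    ],
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                    "securityContext": {
                        "capabilities": {"add": ["SYS_ADMIN"]}
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Emit JSON on stdout so it can be piped to kubectl.
    print(json.dumps(build_pod(), indent=2))
```

Assuming the script is saved as make_pod.py (a hypothetical filename), it could be used with `python make_pod.py | kubectl apply -f -`, followed by `kubectl logs dcgmproftester` to see the test output once the pod has run.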
