failed to run sm read cases on l40s platform #19

Open
gongysh2004 opened this issue Feb 29, 2024 · 9 comments

Comments

@gongysh2004

I have a compute node with 8 L40S GPUs. When I run nvbandwidth, the following test cases fail or abort:

  • device_to_device_memcpy_read_sm
  • device_to_device_bidirectional_memcpy_read_sm
  • all_to_one_read_sm
  • one_to_all_read_sm

All of these cases abort with a message like:

Invalid value when checking the pattern

The following is one of them:

$ nvbandwidth -t  one_to_all_read_sm.
nvbandwidth Version: v0.4
Built from Git version: v0.4

NOTE: This tool reports current measured bandwidth on your system.
Additional system-specific tuning may be required to achieve maximal peak bandwidth.

CUDA Runtime Version: 12020
CUDA Driver Version: 12020
Driver Version: 535.154.05

Device 0: NVIDIA L40S
Device 1: NVIDIA L40S
Device 2: NVIDIA L40S
Device 3: NVIDIA L40S
Device 4: NVIDIA L40S
Device 5: NVIDIA L40S
Device 6: NVIDIA L40S
Device 7: NVIDIA L40S

Running one_to_all_read_sm.
 Invalid value when checking the pattern at <0x7fac6200e480>
 Current offset [ 58496/66306048]
Aborted (core dumped)

Please help check what the problem is. Thanks.

@deepakcu
Collaborator

deepakcu commented Mar 1, 2024

Can you provide the toolkit and driver versions installed on this system? By any chance, does the system have more than one toolkit installed?

@gongysh2004
Author

gongysh2004 commented Mar 1, 2024

Thanks for your response. Here is the NVIDIA-related version information:

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
# ls -l /usr/local/
total 36
drwxr-xr-x  2 root root 4096 Apr 19  2022 bin
lrwxrwxrwx  1 root root   21 Feb  7 11:31 cuda -> /usr/local/cuda-12.2/
drwxr-xr-x 17 root root 4096 Feb  7 11:33 cuda-12.2
drwxr-xr-x  2 root root 4096 Apr 19  2022 etc
drwxr-xr-x  2 root root 4096 Apr 19  2022 games
drwxr-xr-x  2 root root 4096 Apr 19  2022 include
drwxr-xr-x  3 root root 4096 Apr 19  2022 lib
lrwxrwxrwx  1 root root    9 Feb  6 13:45 man -> share/man
drwxr-xr-x  2 root root 4096 Apr 19  2022 sbin
drwxr-xr-x  7 root root 4096 Apr 19  2022 share
drwxr-xr-x  2 root root 4096 Apr 19  2022 src
# modinfo nvidia
filename:       /lib/modules/5.15.0-91-generic/updates/dkms/nvidia.ko
firmware:       nvidia/535.154.05/gsp_tu10x.bin
firmware:       nvidia/535.154.05/gsp_ga10x.bin
alias:          char-major-195-*
version:        535.154.05
supported:      external
license:        NVIDIA

In addition, I installed the toolkit without the GPU driver using:

wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
sudo sh cuda_12.2.0_535.54.03_linux.run
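
For readers comparing these versions with the nvbandwidth output earlier in the thread: the "CUDA Runtime Version: 12020" and "CUDA Driver Version: 12020" lines use CUDA's integer encoding, 1000 * major + 10 * minor. A minimal sketch of decoding it follows; the function names are hypothetical and are not part of nvbandwidth.

```python
def decode_cuda_version(encoded: int) -> tuple[int, int]:
    """Split a CUDA version integer (1000 * major + 10 * minor) into (major, minor)."""
    major = encoded // 1000
    minor = (encoded % 1000) // 10
    return major, minor


def driver_covers_runtime(runtime: int, driver: int) -> bool:
    """Rough check: the driver's supported CUDA version should not be older than the runtime."""
    return decode_cuda_version(driver) >= decode_cuda_version(runtime)


# Values taken from the nvbandwidth output in this thread.
print(decode_cuda_version(12020))           # (12, 2)
print(driver_covers_runtime(12020, 12020))  # True
```

Both values decode to 12.2, which matches the nvcc release shown above, so the reported runtime and driver API versions agree with each other.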

@yaxinchen666

Hi @deepakcu,

Is there any solution to this issue? I ran into a similar problem.

I have a compute node with 8 L20 GPUs. The test cases device_to_device_memcpy_read_sm and device_to_device_bidirectional_memcpy_read_sm failed with the same errors that @gongysh2004 reported.

NVIDIA driver version: 535.161.08
CUDA version: 12.2

@deepakcu
Collaborator

deepakcu commented Aug 3, 2024

Do the devices have NVLink connectivity? What does nvidia-smi nvlink -s report?

@yaxinchen666

Do the devices have NVLink connectivity? What does nvidia-smi nvlink -s report?

No, they do not have NVLink. nvidia-smi nvlink -s shows nothing.

@jodelek

jodelek commented Aug 5, 2024

What's the IOMMU configuration?

@yaxinchen666

What's the IOMMU configuration?

I think it is not enabled. There is nothing under /sys/kernel/iommu_groups/.
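
For anyone who wants to repeat this check on their own node, here is a minimal sketch that counts the entries under /sys/kernel/iommu_groups. The helper name is hypothetical. An empty or missing directory suggests that the kernel has not created any IOMMU groups, but it is not a full picture of the platform's IOMMU settings.

```python
import os


def count_iommu_groups(path: str = "/sys/kernel/iommu_groups") -> int:
    """Return the number of IOMMU group directories under `path` (0 if it is missing or empty)."""
    try:
        return sum(1 for entry in os.scandir(path) if entry.is_dir())
    except FileNotFoundError:
        return 0


if __name__ == "__main__":
    groups = count_iommu_groups()
    print(f"{groups} IOMMU group(s) found")
```

A result of 0 corresponds to the situation described here, where nothing is listed under that directory.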

@ywxc1997

When I compiled on an A100 and ran the binary on an H100, I got the same error.
After recompiling on the H100, the problem was solved.
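
One possible reading of this, an assumption that nobody in the thread has confirmed, is that the build targeted only the GPU architecture of the machine it was compiled on. The sketch below encodes the general CUDA compatibility rules: binary (SASS) code built for sm_XY runs on devices with the same major version and an equal or higher minor version, while PTX built for compute_XY can be JIT-compiled on any device with an equal or higher compute capability. The function names are hypothetical. The compute capabilities are 8.0 for A100, 8.9 for L40S, and 9.0 for H100.

```python
def sass_runs_on(built_for: tuple[int, int], device: tuple[int, int]) -> bool:
    """SASS for sm_XY runs on devices with the same major version and minor >= Y."""
    return built_for[0] == device[0] and device[1] >= built_for[1]


def ptx_runs_on(built_for: tuple[int, int], device: tuple[int, int]) -> bool:
    """PTX for compute_XY can be JIT-compiled on devices with compute capability >= X.Y."""
    return device >= built_for


A100, L40S, H100 = (8, 0), (8, 9), (9, 0)

print(sass_runs_on(A100, H100))  # False: major versions differ
print(sass_runs_on(A100, L40S))  # True: same major, higher minor
print(ptx_runs_on(A100, H100))   # True: PTX is forward compatible
```

Under these rules an A100-only binary without embedded PTX has no usable kernel image on an H100, which fits the fix of recompiling on the H100. It does not by itself explain the failures reported on L40S and L20 systems.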

@imihic

imihic commented Oct 11, 2024

I have the same error on a system running 4x L40S. Are there any updates on this?
