failed to run sm read cases on l40s platform #19

gongysh2004 · 2024-02-29T09:43:39Z

I have a compute node with 8 L40S GPUs. When I run nvbandwidth, the following test cases fail or abort:

device_to_device_memcpy_read_sm
device_to_device_bidirectional_memcpy_read_sm
all_to_one_read_sm
one_to_all_read_sm

All of these cases abort with a message like:

Invalid value when checking the pattern

The following is one of them:

$ nvbandwidth -t one_to_all_read_sm
nvbandwidth Version: v0.4
Built from Git version: v0.4

NOTE: This tool reports current measured bandwidth on your system.
Additional system-specific tuning may be required to achieve maximal peak bandwidth.

CUDA Runtime Version: 12020
CUDA Driver Version: 12020
Driver Version: 535.154.05

Device 0: NVIDIA L40S
Device 1: NVIDIA L40S
Device 2: NVIDIA L40S
Device 3: NVIDIA L40S
Device 4: NVIDIA L40S
Device 5: NVIDIA L40S
Device 6: NVIDIA L40S
Device 7: NVIDIA L40S

Running one_to_all_read_sm.
 Invalid value when checking the pattern at <0x7fac6200e480>
 Current offset [ 58496/66306048]
Aborted (core dumped)

Please help me figure out what the problem is. Thanks.
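
For readers unfamiliar with this kind of error: a message like "Invalid value when checking the pattern" generally means the destination buffer did not hold the expected data after the copy. The sketch below illustrates the idea only. The pattern formula and the helper names `fill_pattern` and `first_mismatch` are hypothetical and are not taken from nvbandwidth, whose actual check is not shown in this thread.

```python
from typing import List, Optional


def fill_pattern(n_words: int, seed: int = 0xCAFE) -> List[int]:
    """Build a deterministic 32-bit pattern (hypothetical, not nvbandwidth's)."""
    return [(seed + i * 2654435761) & 0xFFFFFFFF for i in range(n_words)]


def first_mismatch(expected: List[int], actual: List[int]) -> Optional[int]:
    """Return the index of the first differing word, or None if the buffers match."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


src = fill_pattern(1024)
dst_ok = list(src)       # stands in for a copy that worked
dst_bad = list(src)
dst_bad[200] ^= 0x1      # simulate a single corrupted word

print(first_mismatch(src, dst_ok))   # None
print(first_mismatch(src, dst_bad))  # 200
```

Read this way, the "Current offset" line in the log above would be the position of the first word that failed such a comparison, out of the total buffer size. That reading is an assumption based on the message format.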

deepakcu · 2024-03-01T00:02:51Z

Can you provide the toolkit and driver versions installed on this system? By any chance, does the system have more than one toolkit installed?

gongysh2004 · 2024-03-01T01:49:57Z

Thanks for your response. Here is the version information for the NVIDIA components:

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

# ls -l /usr/local/
total 36
drwxr-xr-x  2 root root 4096 Apr 19  2022 bin
lrwxrwxrwx  1 root root   21 Feb  7 11:31 cuda -> /usr/local/cuda-12.2/
drwxr-xr-x 17 root root 4096 Feb  7 11:33 cuda-12.2
drwxr-xr-x  2 root root 4096 Apr 19  2022 etc
drwxr-xr-x  2 root root 4096 Apr 19  2022 games
drwxr-xr-x  2 root root 4096 Apr 19  2022 include
drwxr-xr-x  3 root root 4096 Apr 19  2022 lib
lrwxrwxrwx  1 root root    9 Feb  6 13:45 man -> share/man
drwxr-xr-x  2 root root 4096 Apr 19  2022 sbin
drwxr-xr-x  7 root root 4096 Apr 19  2022 share
drwxr-xr-x  2 root root 4096 Apr 19  2022 src

# modinfo nvidia
filename:       /lib/modules/5.15.0-91-generic/updates/dkms/nvidia.ko
firmware:       nvidia/535.154.05/gsp_tu10x.bin
firmware:       nvidia/535.154.05/gsp_ga10x.bin
alias:          char-major-195-*
version:        535.154.05
supported:      external
license:        NVIDIA

In addition, I installed the toolkit without the GPU driver using:

wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
sudo sh cuda_12.2.0_535.54.03_linux.run
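
As a cross-check, the `12020` values in the nvbandwidth banner and the `release 12.2` reported by `nvcc -V` describe the same CUDA version: CUDA encodes versions as `1000 * major + 10 * minor`. A minimal sketch of decoding and comparing them follows. The helper names are illustrative, and the comparison ignores minor-version and forward-compatibility packages.

```python
from typing import Tuple


def decode_cuda_version(value: int) -> Tuple[int, int]:
    """Split a CUDA version integer (1000 * major + 10 * minor) into (major, minor)."""
    major = value // 1000
    minor = (value % 1000) // 10
    return major, minor


def driver_covers_runtime(driver_api: int, runtime: int) -> bool:
    """Simplified rule: the driver's CUDA version should be at least the runtime's."""
    return driver_api >= runtime


runtime_version = 12020  # "CUDA Runtime Version" from the banner
driver_version = 12020   # "CUDA Driver Version" from the banner

print(decode_cuda_version(runtime_version))                    # (12, 2)
print(driver_covers_runtime(driver_version, runtime_version))  # True
```

By this simplified check, the runtime and driver reported above are consistent with each other, so a plain version mismatch does not look like the cause.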

yaxinchen666 · 2024-08-02T23:24:34Z

Hi @deepakcu,

Is there any solution to this issue? I ran into a similar problem.

I have a compute node with 8 L20 GPUs. The test cases device_to_device_memcpy_read_sm and device_to_device_bidirectional_memcpy_read_sm failed with the same errors that @gongysh2004 reported.

NVIDIA driver version: 535.161.08
CUDA version: 12.2

deepakcu · 2024-08-03T00:07:36Z

Do the devices have NVLink connectivity? What does nvidia-smi nvlink -s report?

yaxinchen666 · 2024-08-05T18:26:12Z

> Do the devices have NVLink connectivity? What does nvidia-smi nvlink -s report?

No, they do not have NVLink. nvidia-smi nvlink -s shows nothing.

jodelek · 2024-08-05T18:33:08Z

What's the IOMMU configuration?

yaxinchen666 · 2024-08-05T23:51:24Z

> What's the IOMMU configuration?

I think it is not enabled. There is nothing under /sys/kernel/iommu_groups/.
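
For anyone else checking the same thing, here is a minimal sketch of that directory check. The helper name `count_iommu_groups` is illustrative. An empty or missing directory suggests, but does not prove, that no IOMMU is active; the kernel command line in /proc/cmdline and the boot log are worth checking as well.

```python
import os


def count_iommu_groups(root: str = "/sys/kernel/iommu_groups") -> int:
    """Count IOMMU group directories under root; 0 if root is missing or empty."""
    try:
        return sum(1 for entry in os.scandir(root) if entry.is_dir())
    except FileNotFoundError:
        return 0


groups = count_iommu_groups()
if groups == 0:
    print("No IOMMU groups found; the IOMMU is probably disabled.")
else:
    print(f"{groups} IOMMU groups found; the IOMMU appears to be enabled.")
```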

ywxc1997 · 2024-09-10T15:17:13Z

When I compiled on an A100 system and ran the binary on an H100, I got the same error.
After recompiling on the H100 system, the problem was solved.
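
One possible explanation for this observation is that the binary was built only for the architecture of the machine it was compiled on. That is an assumption: the thread does not show which architectures anyone's build targeted. The sketch below applies the general CUDA compatibility rules: compiled SASS for compute capability X.y runs on devices X.z with z >= y, and embedded PTX can be JIT-compiled for any device of equal or higher compute capability. The function name `binary_can_run` is illustrative.

```python
from typing import Iterable, Tuple

ComputeCapability = Tuple[int, int]


def binary_can_run(
    device_cc: ComputeCapability,
    sass_archs: Iterable[ComputeCapability],
    ptx_archs: Iterable[ComputeCapability] = (),
) -> bool:
    """Return True if a binary with the given SASS/PTX targets can run on device_cc."""
    dev_major, dev_minor = device_cc
    # SASS is binary compatible only within the same major version.
    for major, minor in sass_archs:
        if major == dev_major and minor <= dev_minor:
            return True
    # PTX can be JIT-compiled for the same or any newer compute capability.
    for cc in ptx_archs:
        if tuple(cc) <= (dev_major, dev_minor):
            return True
    return False


A100, L40S, H100 = (8, 0), (8, 9), (9, 0)

print(binary_can_run(H100, sass_archs=[A100]))                    # False
print(binary_can_run(H100, sass_archs=[A100], ptx_archs=[A100]))  # True
print(binary_can_run(L40S, sass_archs=[A100]))                    # True
```

If a kernel has no usable image for the device and the resulting launch error goes unnoticed, the copy never happens and a later data check would fail. Whether that is what occurred in any of the reports here is not confirmed.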

imihic · 2024-10-11T11:28:29Z

I have the same error on a system with 4x L40S. Are there any updates on this?
