
[BUG]: Nsight Systems does not work with the --gpu-metrics-device option #4797

SeungsuBaek opened this issue Dec 4, 2024 · 0 comments
Labels: ? - Needs Triage, bug
SeungsuBaek commented Dec 4, 2024

Version

rapidsai/base Docker image (https://hub.docker.com/r/rapidsai/base), version 24.10

Which installation method(s) does this occur on?

No response

Describe the bug.

Hi,

I want to profile cuGraph with Nsight Systems to check GPU DRAM bandwidth and PCIe bandwidth. For this, I use the nsys profile --gpu-metrics-device=0 command.

I get a profiling result, but it contains an error. Below is the output of the nsys profile --gpu-metrics-device=0 command:

Importer error status: Importation succeeded with non-fatal errors.
**** Analysis failed with:
Status: TargetProfilingFailed
Props {
  Items {
    Type: DeviceId
    Value: "Local (CLI)"
  }
}
Error {
  Type: RuntimeError
  Props {
    Items {
      Type: ErrorText
      Value: "GPU Metrics [0]: NVPA_STATUS_ERROR\n- API function: Nvpw.GPU_PeriodicSampler_DecodeCounters_V2(&params)\n- Error code: 1\n- Source function: virtual QuadDDaemon::EventSource::PwMetrics::PeriodicSampler::DecodeResult QuadDDaemon::EventSource::{anonymous}::GpuPeriodicSampler::DecodeCounters(uint8_t*, size_t) const\n- Source location: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/Daemon/EventSource/GpuMetrics.cpp:242"
    }
  }
}

(Image: Nsight Systems timeline showing the GPU Metrics row)

The image shows that GPU metrics collection suddenly stopped.
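For what it's worth, a common first mitigation for periodic-sampler failures like this is to sample GPU metrics less often. A minimal sketch of how one might wrap the profiling invocation (the helper name, frequency value, and dataset path are illustrative assumptions, not taken from the report):

```python
import shutil
import subprocess

def build_nsys_cmd(workload, device=0, frequency=5000, output="bfs_profile"):
    # Hypothetical helper (not part of nsys): assembles the profiling command.
    # Lowering --gpu-metrics-frequency below the default sampling rate is an
    # assumption-based mitigation that reduces pressure on the periodic
    # sampler whose decode step fails in the error above.
    return [
        "nsys", "profile",
        f"--gpu-metrics-device={device}",
        f"--gpu-metrics-frequency={frequency}",
        "-o", output,
    ] + workload

cmd = build_nsys_cmd(["python", "bfs.py", "--n_workers", "1",
                      "--visible_devices", "0",
                      "--dataset", "/path/to/edges.csv"])

# Only launch when nsys is actually on PATH.
if shutil.which("nsys"):
    subprocess.run(cmd, check=True)
```

If a lower frequency still reproduces the decode error, that would help narrow the problem down.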


import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import cugraph
import cugraph.dask as dask_cugraph
import cugraph.dask.comms.comms as Comms
from cugraph.generators.rmat import rmat
import time
import argparse
import rmm

def main():
    parser = argparse.ArgumentParser()
    description = '''python bfs.py --n_workers 1 --visible_devices 0,1,2,3
                    --dataset /HUVM/dataset/graph/soc-twitter-2010.csv --loop'''
    parser.add_argument('--n_workers', type=int, required=True, help='number of workers')
    parser.add_argument('--visible_devices', type=str, required=True,
                        help='comma-separated CUDA_VISIBLE_DEVICES (e.g. 0,1,2,3)')
    parser.add_argument('--dataset', type=str, required=True, help='path to graph dataset')
    parser.add_argument('--loop', default=False, action='store_true', help='run one time or in loop')
    args = parser.parse_args()

    # Initialize the CUDA cluster
    cluster = LocalCUDACluster(
               rmm_managed_memory=True,
               rmm_pool_size="50GB",
               CUDA_VISIBLE_DEVICES=args.visible_devices,
               n_workers=args.n_workers
    )
    client = Client(cluster)
    Comms.initialize(p2p=True)

    # Initialize multi-GPU communication
    # Set the reader chunk size to automatically get one partition per GPU
    chunksize = dask_cugraph.get_chunksize(args.dataset)

    # Multi-GPU CSV reader
    e_list = dask_cudf.read_csv(
        args.dataset, chunksize=chunksize, delimiter=' ',
        names=['src', 'dst'], dtype=['int32', 'int32']
    )

    # Create a directed graph from the edge list
    G = cugraph.Graph(directed=True)
    G.from_dask_cudf_edgelist(e_list, source='src', destination='dst')

    # Run BFS in loop or once based on the argument
    if args.loop:
        while True:
            t_start = time.time()
            result = dask_cugraph.bfs(G, start=1)  # Use 'start' argument
#            wait(result)  # Ensure computation finishes
            print("Execution time: ", time.time() - t_start)
    else:
        t_start = time.time()
        result = dask_cugraph.bfs(G, start=1)  # Use 'start' argument
#        wait(result)  # Ensure computation finishes
        print("Execution time: ", time.time() - t_start)

    # Clean up
    Comms.destroy()
    client.close()
    cluster.close()

if __name__ == "__main__":
    main()

This is my BFS benchmark code.

Is there a known issue with profiling cuGraph applications using GPU performance counters?
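One note on the benchmark itself: dask_cugraph.bfs returns a distributed result, and the wait(result) calls are commented out, so the printed "Execution time" may measure only task submission rather than the BFS itself. The general pitfall, sketched here with Python's stdlib futures rather than cuGraph/Dask:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_task():
    # Stand-in for a distributed BFS: the real work takes noticeable time.
    time.sleep(0.2)
    return "done"

with ThreadPoolExecutor() as pool:
    t0 = time.time()
    future = pool.submit(slow_task)   # returns immediately; work runs in background
    submit_elapsed = time.time() - t0

    t0 = time.time()
    value = future.result()           # blocks until the work actually finishes
    wait_elapsed = time.time() - t0
```

Timing only the submit step gives a near-zero number; timing through result() (or, in Dask terms, wait()/compute()) captures the real execution, which also matters for lining timestamps up with the profiler's GPU metrics.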

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report