Conversation


@fwyzard fwyzard commented Jun 23, 2021

No description provided.

@fwyzard fwyzard requested a review from makortel June 23, 2021 03:29

fwyzard commented Jun 23, 2021

Mhm, I seem to have broken something:

$ CUDA_VISIBLE_DEVICES=0 ./cudadev --validation
Found 1 devices
Processing 1000 events, of which 1 concurrently, with 1 threads.
Event 1 
 N(vertices) is 0 expected 23, difference 23 is outside tolerance 1
Event 2 
 N(vertices) is 0 expected 31, difference 31 is outside tolerance 1
...

@fwyzard fwyzard force-pushed the update_cudadev_to_CMSSW_12_0_0_pre3 branch from b923a0e to fb9033e Compare June 23, 2021 09:50

fwyzard commented Jun 23, 2021

OK, I had picked up one change too many (from cms-sw/cmssw#33889, which is post-CMSSW_12_0_0_pre3) in the vertex finder.
Everything should be fine now.


fwyzard commented Jun 23, 2021

$ CUDA_VISIBLE_DEVICES=0 ./cudadev --validation
Found 1 devices
Processing 1000 events, of which 1 concurrently, with 1 threads.
CountValidator: all 1000 events passed validation
 Average relative track difference 0.00088923 (all within tolerance)
 Average absolute vertex difference 0.001 (all within tolerance)
Processed 1000 events in 2.920028e+00 seconds, throughput 342.462 events/s.


fwyzard commented Jun 23, 2021

Note that the new code is measurably slower.

cuda

[email protected]:/data/user/fwyzard/pixeltrack-standalone$ CUDA_VISIBLE_DEVICES=0 numactl -c 0 ./run-scan.py --overwrite --minThreads 6 --maxThreads 10 --eventsPerStream 1000 --warmup --repeat 3 ./cuda
21-06-23 11:59:17 Warming up
21-06-23 11:59:26 Processed 6000 events in 6.619445e+00 seconds, throughput 906.42 events/s.

21-06-23 11:59:26 Number of streams 6 threads 6 events 6000
21-06-23 11:59:34 Processed 6000 events in 6.624209e+00 seconds, throughput 905.769 events/s.
21-06-23 11:59:42 Processed 6000 events in 6.600303e+00 seconds, throughput 909.049 events/s.
21-06-23 11:59:51 Processed 6000 events in 6.625149e+00 seconds, throughput 905.64 events/s.
21-06-23 11:59:51 Number of streams 6 threads 6, average throughput 906.819333

21-06-23 11:59:51 Number of streams 7 threads 7 events 7000
21-06-23 12:00:00 Processed 7000 events in 7.586786e+00 seconds, throughput 922.657 events/s.
21-06-23 12:00:09 Processed 7000 events in 7.587357e+00 seconds, throughput 922.587 events/s.
21-06-23 12:00:19 Processed 7000 events in 7.624599e+00 seconds, throughput 918.081 events/s.
21-06-23 12:00:19 Number of streams 7 threads 7, average throughput 921.108333

21-06-23 12:00:19 Number of streams 8 threads 8 events 8000
21-06-23 12:00:29 Processed 8000 events in 8.647812e+00 seconds, throughput 925.089 events/s.
21-06-23 12:00:39 Processed 8000 events in 8.669792e+00 seconds, throughput 922.744 events/s.
21-06-23 12:00:50 Processed 8000 events in 8.672280e+00 seconds, throughput 922.479 events/s.
21-06-23 12:00:50 Number of streams 8 threads 8, average throughput 923.437333

21-06-23 12:00:50 Number of streams 9 threads 9 events 9000
21-06-23 12:01:01 Processed 9000 events in 9.738684e+00 seconds, throughput 924.15 events/s.
21-06-23 12:01:13 Processed 9000 events in 9.747927e+00 seconds, throughput 923.273 events/s.
21-06-23 12:01:24 Processed 9000 events in 9.766345e+00 seconds, throughput 921.532 events/s.
21-06-23 12:01:24 Number of streams 9 threads 9, average throughput 922.985000

21-06-23 12:01:24 Number of streams 10 threads 10 events 10000
21-06-23 12:01:37 Processed 10000 events in 1.085532e+01 seconds, throughput 921.207 events/s.
21-06-23 12:01:50 Processed 10000 events in 1.089492e+01 seconds, throughput 917.859 events/s.
21-06-23 12:02:02 Processed 10000 events in 1.089915e+01 seconds, throughput 917.503 events/s.
21-06-23 12:02:02 Number of streams 10 threads 10, average throughput 918.856333

cudadev

[email protected]:/data/user/fwyzard/pixeltrack-standalone$ CUDA_VISIBLE_DEVICES=0 numactl -c 0 ./run-scan.py --overwrite --minThreads 6 --maxThreads 10 --eventsPerStream 1000 --warmup --repeat 3 ./cudadev
21-06-23 12:13:51 Warming up
21-06-23 12:13:59 Processed 6000 events in 6.865620e+00 seconds, throughput 873.92 events/s.

21-06-23 12:13:59 Number of streams 6 threads 6 events 6000
21-06-23 12:14:08 Processed 6000 events in 6.850016e+00 seconds, throughput 875.91 events/s.
21-06-23 12:14:16 Processed 6000 events in 6.895238e+00 seconds, throughput 870.166 events/s.
21-06-23 12:14:25 Processed 6000 events in 6.931269e+00 seconds, throughput 865.642 events/s.
21-06-23 12:14:25 Number of streams 6 threads 6, average throughput 870.572667

21-06-23 12:14:25 Number of streams 7 threads 7 events 7000
21-06-23 12:14:35 Processed 7000 events in 7.913887e+00 seconds, throughput 884.521 events/s.
21-06-23 12:14:44 Processed 7000 events in 7.942972e+00 seconds, throughput 881.282 events/s.
21-06-23 12:14:54 Processed 7000 events in 7.938204e+00 seconds, throughput 881.812 events/s.
21-06-23 12:14:54 Number of streams 7 threads 7, average throughput 882.538333

21-06-23 12:14:54 Number of streams 8 threads 8 events 8000
21-06-23 12:15:05 Processed 8000 events in 9.040098e+00 seconds, throughput 884.946 events/s.
21-06-23 12:15:15 Processed 8000 events in 9.040252e+00 seconds, throughput 884.931 events/s.
21-06-23 12:15:26 Processed 8000 events in 9.045760e+00 seconds, throughput 884.392 events/s.
21-06-23 12:15:26 Number of streams 8 threads 8, average throughput 884.756333

21-06-23 12:15:26 Number of streams 9 threads 9 events 9000
21-06-23 12:15:38 Processed 9000 events in 1.017927e+01 seconds, throughput 884.15 events/s.
21-06-23 12:15:50 Processed 9000 events in 1.018134e+01 seconds, throughput 883.97 events/s.
21-06-23 12:16:02 Processed 9000 events in 1.019767e+01 seconds, throughput 882.555 events/s.
21-06-23 12:16:02 Number of streams 9 threads 9, average throughput 883.558333

21-06-23 12:16:02 Number of streams 10 threads 10 events 10000
21-06-23 12:16:15 Processed 10000 events in 1.133347e+01 seconds, throughput 882.343 events/s.
21-06-23 12:16:28 Processed 10000 events in 1.136503e+01 seconds, throughput 879.892 events/s.
21-06-23 12:16:41 Processed 10000 events in 1.137366e+01 seconds, throughput 879.224 events/s.
21-06-23 12:16:41 Number of streams 10 threads 10, average throughput 880.486333


fwyzard commented Jun 23, 2021

Comparing the throughput of the cuda and updated cudadev applications, and enabling one additional module at a time, it looks like the slowdown comes from the CAHitNtupletCUDA:

[plot: throughput comparison of cuda vs updated cudadev, enabling one additional module at a time]


fwyzard commented Jun 23, 2021

@VinInn @felicepantaleo @makortel can you think of anything that may have slowed down the CAHitNtupletCUDA producer between CMSSW_11_2_0_pre8_Patatrack and CMSSW_12_0_0_pre3 ?


VinInn commented Jun 23, 2021

Can you run nvprof (or the equivalent nsight)?
Besides the "esthetic" changes, nothing should have been modified.


fwyzard commented Jun 23, 2021

Any metric or setting in particular?

here is the output of

ncu --print-summary per-kernel ./cuda --maxEvents 10

cuda.log

and here is the result of

ncu --print-summary per-kernel ./cudadev --maxEvents 10

cudadev.log


VinInn commented Jun 23, 2021

Too much info to digest (at least in txt form). Maybe the much more condensed
/opt/nvidia/nsight-systems/2020.3.4/bin/nsys profile --stats=true ./cuda
(equivalent to, but slower than, good old nvprof) is enough.


fwyzard commented Jun 23, 2021

ah, ok - it's also much faster to generate :-)


VinInn commented Jun 23, 2021

maybe some sort of numeric diff out of the logs
I managed to extract this

innocent@Vincenzos-MacBook-Pro-2 Downloads % grep -r gpu__time_duration.sum cudadev.log | awk '{print $6}' | sort -n | tail -n 10
206.848000
228.278400
235.241600
240.032000
250.297600
277.088000
288.256000
320.640000
350.822400
427.392000
innocent@Vincenzos-MacBook-Pro-2 Downloads % grep -r gpu__time_duration.sum cuda.log | awk '{print $6}' | sort -n | tail -n 10
193.888000
230.304000
235.145600
246.003200
264.905600
272.864000
281.760000
317.408000
351.430400


fwyzard commented Jun 23, 2021

from the nsys logs, the main offender seems to be kernel_find_ntuplets:

  Time(%)  Total Time (ns)  Instances   Average   Minimum  Maximum                                                  Name
  -------  ---------------  ---------  ---------  -------  -------  ----------------------------------------------------------------------------------------------------
-    22.9      366,054,748      1,000  366,054.7  113,823  755,099  kernel_find_ntuplets(TrackingRecHit2DSOAView const*, GPUCACell*, unsigned int const*, cms::cuda::Si…
+    25.2      415,254,529      1,000  415,254.5  123,263  874,844  kernel_find_ntuplets(TrackingRecHit2DSOAView const*, GPUCACell*, unsigned int const*, cms::cuda::Si…
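For scale, a quick calculation on the two averages above (a trivial helper, not from the repo): 415,254.5 ns vs 366,054.7 ns per call is a ~13% per-kernel slowdown.

```cpp
#include <cassert>

// Relative slowdown of a candidate timing vs a baseline, e.g.
// relativeSlowdown(366054.7, 415254.5) is roughly 0.134, i.e. ~13% slower.
inline double relativeSlowdown(double baselineNs, double candidateNs) {
  return candidateNs / baselineNs - 1.0;
}
```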


VinInn commented Jun 23, 2021

can standalone print the stat?

for producer in producers_by_type(process, 'CAHitNtupletCUDA'):
  producer.fillStatistics = True # not the default


fwyzard commented Jun 23, 2021

yes, it just needs to be recompiled:

$ ./cuda
Found 1 devices
Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 2.169266e+00 seconds, throughput 460.985 events/s.
||Counters | nEvents | nHits | nCells | nTuples | nFitTacks  |  nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells ||
Counters Raw 1000 15097326 116597755 5336129 1008402 3655540 5432428 1705671 1140395 107068019 108531397
Counters Norm 1000 ||  15097.3|  116597.8|  5336.1|  3655.5|  1008.4|  5432.4|  1705.7|  1140.4|  0.918|  0.931||

vs

$ ./cudadev
Found 1 devices
Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 2.128844e+00 seconds, throughput 469.739 events/s.
||Counters | nEvents | nHits | nCells | nTuples | nFitTacks  |  nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells ||
Counters Raw 1000 15097326 116597755 5336143 1008393 3655568 5432449 1705691 1140365 107068019 22838481
Counters Norm 1000 ||  15097.3|  116597.8|  5336.1|  3655.6|  1008.4|  5432.4|  1705.7|  1140.4|  0.918|  0.196||


VinInn commented Jun 23, 2021

identical (good)!
So some esthetic change was not so esthetic...


fwyzard commented Jun 23, 2021

Or, formatted as tables:

cuda

| info | nEvents | nHits | nCells | nTuples | nFitTacks | nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| raw | 1000 | 15097326 | 116597755 | 5336129 | 1008402 | 3655540 | 5432428 | 1705671 | 1140395 | 107068019 | 108531397 |
| norm | 1000 | 15097.3 | 116597.8 | 5336.1 | 3655.5 | 1008.4 | 5432.4 | 1705.7 | 1140.4 | 0.918 | 0.931 |

cudadev

| info | nEvents | nHits | nCells | nTuples | nFitTacks | nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| raw | 1000 | 15097326 | 116597755 | 5336143 | 1008393 | 3655568 | 5432449 | 1705691 | 1140365 | 107068019 | 22838481 |
| norm | 1000 | 15097.3 | 116597.8 | 5336.1 | 3655.6 | 1008.4 | 5432.4 | 1705.7 | 1140.4 | 0.918 | 0.196 |


fwyzard commented Jun 23, 2021

The only difference seems to be in nZeroTrackCells: 0.931 vs 0.196 (normalised).


VinInn commented Jun 23, 2021

The meaning changed...


VinInn commented Jun 23, 2021

one can compare the details in the ".log". Clearly "dev" is spending more cycles


VinInn commented Jun 23, 2021

We changed some int16 to int32, and maybe the size of some arrays; plus we made the hits dynamically allocated.
I think this is in my first PR after the merge (or did we do it before the merge?).

 public:
   static constexpr uint32_t maxHits() { return gpuClustering::MaxNumClusters; }
-  using hindex_type = uint16_t; // if above is <=2^16
+  using hindex_type = uint32_t; // if above is <=2^32
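A back-of-the-envelope footprint check for the widening above (names and buffer size are illustrative, not from the repo): every array of hit indices doubles in size when the index type grows from uint16_t to uint32_t, which plausibly increases global-memory traffic in the CA kernels.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes occupied by a buffer of nIndices hit indices of the given width.
constexpr std::size_t indexBufferBytes(std::size_t nIndices, std::size_t indexSize) {
  return nIndices * indexSize;
}

// Hypothetical buffer of 2^20 hit indices: 2 MiB with uint16_t, 4 MiB with uint32_t.
constexpr std::size_t bytes16 = indexBufferBytes(1 << 20, sizeof(std::uint16_t));
constexpr std::size_t bytes32 = indexBufferBytes(1 << 20, sizeof(std::uint32_t));
```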
@VinInn VinInn Jun 23, 2021

this is not esthetic

Contributor

try to revert this

Contributor Author

if I change this I get a failed assert:

/data/user/fwyzard/pixeltrack-standalone/src/cudadev/CUDADataFormats/TrackingRecHit2DHeterogeneous.h:111:62: error: static assertion failed
   static_assert(sizeof(TrackingRecHit2DSOAView::hindex_type) == sizeof(float));
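The assertion is purely about sizes: the SoA view evidently packs the hit-index column into float-sized storage, so only a 4-byte index type fits there. A minimal illustration (assuming the usual 4-byte float):

```cpp
#include <cstdint>

// A uint32_t index can alias float-sized storage slots, a uint16_t cannot;
// this is why reverting hindex_type alone trips the static assertion.
static_assert(sizeof(std::uint32_t) == sizeof(float), "uint32_t fits a float-sized slot");
static_assert(sizeof(std::uint16_t) != sizeof(float), "uint16_t does not");
```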

Contributor

ok. the storing area changed as well. so not so trivial

Contributor

maybe there is a not so hard way to revert the full PR....

Contributor

not obvious, was done BEFORE the merge (or at least not by me in CMSSW master)

Contributor Author

Yes, I think it was done in the Patatrack branch - before the merge in the upstream CMSSW and the following clean up.

Contributor Author

 // types
-using hindex_type = uint16_t; // FIXME from siPixelRecHitsHeterogeneousProduct
 using tindex_type = uint16_t; // for tuples
+using hindex_type = uint32_t; // FIXME from siPixelRecHitsHeterogeneousProduct
Contributor

here as well!

Contributor

all these changes and still in two places....

Contributor Author

if I change this I can run fine, but I still have a ~3.5% loss in performance

Contributor

so this is not the cause of the regression

-thisCell.theUsed |= 1;
-oc.theUsed |= 1;
+thisCell.setUsedBit(1);
+oc.setUsedBit(1);
Contributor

hope these mods are just esthetic

Contributor Author

I assume so, given that it's defined as

__device__ __forceinline__ void setUsedBit(uint16_t bit) { theUsed_ |= bit; }
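A host-side sketch (CUDA qualifiers dropped, struct name invented) confirms the accessor and the raw OR set the same bit:

```cpp
#include <cstdint>

// Simplified stand-in for the cell type: setUsedBit() is exactly the old `|=`.
struct CellSketch {
  std::uint16_t theUsed_ = 0;
  void setUsedBit(std::uint16_t bit) { theUsed_ |= bit; }
};
```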

//to implement
// endcap pixel

class TrackerTopology {
Collaborator

I wonder why TrackerTopology is needed? I did notice the added #include in SiPixelRawToClusterGPUKernel.cu, but from a quick look at the diff the reason wasn't evident. If only a subset of this header is needed, could the copy be limited to that?

Contributor Author

Good point, I'll check if it can be reduced.

Contributor Author

OK, the only part that uses the topology is

return (PixelSubdetector::PixelBarrel == ((rawId >> DetId::kSubdetOffset) & DetId::kSubdetMask));

so I can copy those three constants and avoid including the full topology.
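A sketch of the reduced check, with the constant values copied from CMSSW's DetId.h and PixelSubdetector.h (worth double-checking against the headers in the release in use):

```cpp
#include <cstdint>

// Only the subdetector field of the rawId is needed to tag barrel hits.
// Assumed values: DetId::kSubdetOffset = 25, DetId::kSubdetMask = 0x7,
// PixelSubdetector::PixelBarrel = 1 -- verify against the actual CMSSW headers.
constexpr int kSubdetOffset = 25;
constexpr std::uint32_t kSubdetMask = 0x7;
constexpr std::uint32_t kPixelBarrel = 1;

constexpr bool isBarrel(std::uint32_t rawId) {
  return kPixelBarrel == ((rawId >> kSubdetOffset) & kSubdetMask);
}
```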

Collaborator

The PixelSubdetector::PixelBarrel is defined in PixelSubdetector.h (added in this PR), and the DetId::* in DetId.h (added in this PR), so maybe the TrackerTopology could be avoided completely?

Contributor Author

Thanks for the suggestion, I've updated the PR, dropping DataFormats/TrackerTopology.h and including the other headers directly.

@makortel
Collaborator

Overall this PR looks ok for cudadev (after the TrackerTopology is resolved), but I'm curious about the motivation for the update? E.g. is there a desire to have a periodically updated copy in this standalone repo?


fwyzard commented Jun 24, 2021

Hi Matti, I've prepared this update for two reasons:

  1. I want to check if the latest version of the code (after various people's clean up) is easier to debug to understand Illegal memory access in the CUDA implementation with multiple concurrent processes #188
  2. Eric is working on a SoA model, and I think we can test it here before approaching CMSSW; having a more recent code base could make the port to CMSSW easier

I don't know if we want to do regular updates. If we do (on a regular basis or not) the next ones might be easier, now that all code is in CMSSW and we can check the upstream PRs.
In this update some things were complicated because the updates happened across forks.


VinInn commented Jun 24, 2021

If one dumps the PTX for the two versions of kernel_find_ntuplets, we can try to spot whether there is a major offender....


fwyzard commented Jun 24, 2021

Took me a bit, but here they are:


fwyzard commented Jun 24, 2021

1. I want to check if the latest version of the code (after various people's clean up) is easier to debug to understand [Illegal memory access in the CUDA implementation with multiple concurrent processes #188](https://github.com/cms-patatrack/pixeltrack-standalone/issues/188)

... and I'm happy to report that the same issue is still there after the update (but somewhat harder to trigger) :-)


fwyzard commented Jun 24, 2021

On the other hand, if I run multiple jobs concurrently, the cudadev implementation has the same throughput as the cuda implementation.

So maybe we don't need to worry too much about the single job slowdown.

@makortel
Collaborator

  1. I want to check if the latest version of the code (after various people's clean up) is easier to debug to understand Illegal memory access in the CUDA implementation with multiple concurrent processes #188
  2. Eric is working on a SoA model, and I think we can test it here before approaching CMSSW; having a more recent code base could make the port to CMSSW easier

Thanks for the clarifications, makes sense. I have also been planning to use cudadev as a prototyping ground for the "CUDA framework" side (I actually started a long time ago and then got distracted), and that aim seems compatible with what you described above.

I don't know if we want to do regular updates. If we do (on a regular basis or not) the next ones might be easier, now that all code is in CMSSW and we can check the upstream PRs.

Ok. I suppose the current "update when there is demand" can be good-enough for now.

@makortel makortel added the cuda label Jun 24, 2021
fwyzard and others added 2 commits June 25, 2021 09:29
Co-authored-by: Angela Czirkos <[email protected]>
Co-authored-by: Eric Cano <[email protected]>
Co-authored-by: Felice Pantaleo <[email protected]>
Co-authored-by: Marco Musich <[email protected]>
Co-authored-by: Matti Kortelainen <[email protected]>
Co-authored-by: Tamas Vami <[email protected]>
Co-authored-by: Vincenzo Innocente <[email protected]>
@fwyzard fwyzard force-pushed the update_cudadev_to_CMSSW_12_0_0_pre3 branch from fb9033e to fe2bc62 Compare June 25, 2021 07:29
@makortel makortel merged commit 6dded93 into cms-patatrack:master Jun 28, 2021
@fwyzard fwyzard deleted the update_cudadev_to_CMSSW_12_0_0_pre3 branch June 28, 2021 14:22