Conversation


@fwyzard fwyzard commented Jun 23, 2021

No description provided.

@fwyzard fwyzard requested a review from makortel June 23, 2021 03:29

fwyzard commented Jun 23, 2021

Mhm, I seem to have broken something:

$ CUDA_VISIBLE_DEVICES=0 ./cudadev --validation
Found 1 devices
Processing 1000 events, of which 1 concurrently, with 1 threads.
Event 1 
 N(vertices) is 0 expected 23, difference 23 is outside tolerance 1
Event 2 
 N(vertices) is 0 expected 31, difference 31 is outside tolerance 1
...

@fwyzard fwyzard force-pushed the update_cudadev_to_CMSSW_12_0_0_pre3 branch from b923a0e to fb9033e Compare June 23, 2021 09:50

fwyzard commented Jun 23, 2021

OK, I had picked up one change too many (from cms-sw/cmssw#33889, which is post-CMSSW_12_0_0_pre3) in the vertex finder.
Everything should be fine now.


fwyzard commented Jun 23, 2021

$ CUDA_VISIBLE_DEVICES=0 ./cudadev --validation
Found 1 devices
Processing 1000 events, of which 1 concurrently, with 1 threads.
CountValidator: all 1000 events passed validation
 Average relative track difference 0.00088923 (all within tolerance)
 Average absolute vertex difference 0.001 (all within tolerance)
Processed 1000 events in 2.920028e+00 seconds, throughput 342.462 events/s.


fwyzard commented Jun 23, 2021

Note that the new code is measurably slower.

cuda

[email protected]:/data/user/fwyzard/pixeltrack-standalone$ CUDA_VISIBLE_DEVICES=0 numactl -c 0 ./run-scan.py --overwrite --minThreads 6 --maxThreads 10 --eventsPerStream 1000 --warmup --repeat 3 ./cuda
21-06-23 11:59:17 Warming up
21-06-23 11:59:26 Processed 6000 events in 6.619445e+00 seconds, throughput 906.42 events/s.

21-06-23 11:59:26 Number of streams 6 threads 6 events 6000
21-06-23 11:59:34 Processed 6000 events in 6.624209e+00 seconds, throughput 905.769 events/s.
21-06-23 11:59:42 Processed 6000 events in 6.600303e+00 seconds, throughput 909.049 events/s.
21-06-23 11:59:51 Processed 6000 events in 6.625149e+00 seconds, throughput 905.64 events/s.
21-06-23 11:59:51 Number of streams 6 threads 6, average throughput 906.819333

21-06-23 11:59:51 Number of streams 7 threads 7 events 7000
21-06-23 12:00:00 Processed 7000 events in 7.586786e+00 seconds, throughput 922.657 events/s.
21-06-23 12:00:09 Processed 7000 events in 7.587357e+00 seconds, throughput 922.587 events/s.
21-06-23 12:00:19 Processed 7000 events in 7.624599e+00 seconds, throughput 918.081 events/s.
21-06-23 12:00:19 Number of streams 7 threads 7, average throughput 921.108333

21-06-23 12:00:19 Number of streams 8 threads 8 events 8000
21-06-23 12:00:29 Processed 8000 events in 8.647812e+00 seconds, throughput 925.089 events/s.
21-06-23 12:00:39 Processed 8000 events in 8.669792e+00 seconds, throughput 922.744 events/s.
21-06-23 12:00:50 Processed 8000 events in 8.672280e+00 seconds, throughput 922.479 events/s.
21-06-23 12:00:50 Number of streams 8 threads 8, average throughput 923.437333

21-06-23 12:00:50 Number of streams 9 threads 9 events 9000
21-06-23 12:01:01 Processed 9000 events in 9.738684e+00 seconds, throughput 924.15 events/s.
21-06-23 12:01:13 Processed 9000 events in 9.747927e+00 seconds, throughput 923.273 events/s.
21-06-23 12:01:24 Processed 9000 events in 9.766345e+00 seconds, throughput 921.532 events/s.
21-06-23 12:01:24 Number of streams 9 threads 9, average throughput 922.985000

21-06-23 12:01:24 Number of streams 10 threads 10 events 10000
21-06-23 12:01:37 Processed 10000 events in 1.085532e+01 seconds, throughput 921.207 events/s.
21-06-23 12:01:50 Processed 10000 events in 1.089492e+01 seconds, throughput 917.859 events/s.
21-06-23 12:02:02 Processed 10000 events in 1.089915e+01 seconds, throughput 917.503 events/s.
21-06-23 12:02:02 Number of streams 10 threads 10, average throughput 918.856333

cudadev

[email protected]:/data/user/fwyzard/pixeltrack-standalone$ CUDA_VISIBLE_DEVICES=0 numactl -c 0 ./run-scan.py --overwrite --minThreads 6 --maxThreads 10 --eventsPerStream 1000 --warmup --repeat 3 ./cudadev
21-06-23 12:13:51 Warming up
21-06-23 12:13:59 Processed 6000 events in 6.865620e+00 seconds, throughput 873.92 events/s.

21-06-23 12:13:59 Number of streams 6 threads 6 events 6000
21-06-23 12:14:08 Processed 6000 events in 6.850016e+00 seconds, throughput 875.91 events/s.
21-06-23 12:14:16 Processed 6000 events in 6.895238e+00 seconds, throughput 870.166 events/s.
21-06-23 12:14:25 Processed 6000 events in 6.931269e+00 seconds, throughput 865.642 events/s.
21-06-23 12:14:25 Number of streams 6 threads 6, average throughput 870.572667

21-06-23 12:14:25 Number of streams 7 threads 7 events 7000
21-06-23 12:14:35 Processed 7000 events in 7.913887e+00 seconds, throughput 884.521 events/s.
21-06-23 12:14:44 Processed 7000 events in 7.942972e+00 seconds, throughput 881.282 events/s.
21-06-23 12:14:54 Processed 7000 events in 7.938204e+00 seconds, throughput 881.812 events/s.
21-06-23 12:14:54 Number of streams 7 threads 7, average throughput 882.538333

21-06-23 12:14:54 Number of streams 8 threads 8 events 8000
21-06-23 12:15:05 Processed 8000 events in 9.040098e+00 seconds, throughput 884.946 events/s.
21-06-23 12:15:15 Processed 8000 events in 9.040252e+00 seconds, throughput 884.931 events/s.
21-06-23 12:15:26 Processed 8000 events in 9.045760e+00 seconds, throughput 884.392 events/s.
21-06-23 12:15:26 Number of streams 8 threads 8, average throughput 884.756333

21-06-23 12:15:26 Number of streams 9 threads 9 events 9000
21-06-23 12:15:38 Processed 9000 events in 1.017927e+01 seconds, throughput 884.15 events/s.
21-06-23 12:15:50 Processed 9000 events in 1.018134e+01 seconds, throughput 883.97 events/s.
21-06-23 12:16:02 Processed 9000 events in 1.019767e+01 seconds, throughput 882.555 events/s.
21-06-23 12:16:02 Number of streams 9 threads 9, average throughput 883.558333

21-06-23 12:16:02 Number of streams 10 threads 10 events 10000
21-06-23 12:16:15 Processed 10000 events in 1.133347e+01 seconds, throughput 882.343 events/s.
21-06-23 12:16:28 Processed 10000 events in 1.136503e+01 seconds, throughput 879.892 events/s.
21-06-23 12:16:41 Processed 10000 events in 1.137366e+01 seconds, throughput 879.224 events/s.
21-06-23 12:16:41 Number of streams 10 threads 10, average throughput 880.486333


fwyzard commented Jun 23, 2021

Comparing the throughput of the cuda and updated cudadev applications, and enabling one additional module at a time, it looks like the slowdown comes from the CAHitNtupletCUDA:

[plot: throughput comparison of cuda vs updated cudadev, enabling one additional module at a time]


fwyzard commented Jun 23, 2021

@VinInn @felicepantaleo @makortel can you think of anything that may have slowed down the CAHitNtupletCUDA producer between CMSSW_11_2_0_pre8_Patatrack and CMSSW_12_0_0_pre3 ?


VinInn commented Jun 23, 2021

Can you run nvprof (or the equivalent nsight)?
Besides the "esthetic" changes, nothing should have been modified.


fwyzard commented Jun 23, 2021

Any metric or setting in particular?

here is the output of

ncu --print-summary per-kernel ./cuda --maxEvents 10

cuda.log

and here is the result of

ncu --print-summary per-kernel ./cudadev --maxEvents 10

cudadev.log


VinInn commented Jun 23, 2021

Too much info to digest (at least in txt form). Maybe the much more condensed
/opt/nvidia/nsight-systems/2020.3.4/bin/nsys profile --stats=true ./cuda
(equivalent to, but slower than, good old nvprof) is enough.


fwyzard commented Jun 23, 2021

ah, ok - it's also much faster to generate :-)


VinInn commented Jun 23, 2021

maybe some sort of numeric diff out of the logs
I managed to extract this

innocent@Vincenzos-MacBook-Pro-2 Downloads % grep -r gpu__time_duration.sum cudadev.log | awk '{print $6}' | sort -n | tail -n 10
206.848000
228.278400
235.241600
240.032000
250.297600
277.088000
288.256000
320.640000
350.822400
427.392000
innocent@Vincenzos-MacBook-Pro-2 Downloads % grep -r gpu__time_duration.sum cuda.log | awk '{print $6}' | sort -n | tail -n 10
193.888000
230.304000
235.145600
246.003200
264.905600
272.864000
281.760000
317.408000
351.430400


fwyzard commented Jun 23, 2021

from the nsys logs, the main offender seems to be kernel_find_ntuplets:

  Time(%)  Total Time (ns)  Instances   Average   Minimum  Maximum                                                  Name
  -------  ---------------  ---------  ---------  -------  -------  ----------------------------------------------------------------------------------------------------
-    22.9      366,054,748      1,000  366,054.7  113,823  755,099  kernel_find_ntuplets(TrackingRecHit2DSOAView const*, GPUCACell*, unsigned int const*, cms::cuda::Si…
+    25.2      415,254,529      1,000  415,254.5  123,263  874,844  kernel_find_ntuplets(TrackingRecHit2DSOAView const*, GPUCACell*, unsigned int const*, cms::cuda::Si…
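For scale, a quick calculation on the two averages above (a trivial helper, not from the repo): 415,254.5 ns vs 366,054.7 ns per call is a ~13% per-kernel slowdown.

```cpp
#include <cassert>

// Relative slowdown of a candidate timing vs a baseline, e.g.
// relativeSlowdown(366054.7, 415254.5) is roughly 0.134, i.e. ~13% slower.
inline double relativeSlowdown(double baselineNs, double candidateNs) {
  return candidateNs / baselineNs - 1.0;
}
```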


VinInn commented Jun 23, 2021

can standalone print the stat?

for producer in producers_by_type(process, 'CAHitNtupletCUDA'):
  producer.fillStatistics = True # not the default


fwyzard commented Jun 23, 2021

yes, it just needs to be recompiled:

$ ./cuda
Found 1 devices
Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 2.169266e+00 seconds, throughput 460.985 events/s.
||Counters | nEvents | nHits | nCells | nTuples | nFitTacks  |  nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells ||
Counters Raw 1000 15097326 116597755 5336129 1008402 3655540 5432428 1705671 1140395 107068019 108531397
Counters Norm 1000 ||  15097.3|  116597.8|  5336.1|  3655.5|  1008.4|  5432.4|  1705.7|  1140.4|  0.918|  0.931||

vs

$ ./cudadev
Found 1 devices
Processing 1000 events, of which 1 concurrently, with 1 threads.
Processed 1000 events in 2.128844e+00 seconds, throughput 469.739 events/s.
||Counters | nEvents | nHits | nCells | nTuples | nFitTacks  |  nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells ||
Counters Raw 1000 15097326 116597755 5336143 1008393 3655568 5432449 1705691 1140365 107068019 22838481
Counters Norm 1000 ||  15097.3|  116597.8|  5336.1|  3655.6|  1008.4|  5432.4|  1705.7|  1140.4|  0.918|  0.196||


VinInn commented Jun 23, 2021

identical (good)!
So some esthetic change was not so esthetic...


fwyzard commented Jun 23, 2021

Or, formatted as tables:

cuda

| info | nEvents | nHits | nCells | nTuples | nFitTacks | nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| raw | 1000 | 15097326 | 116597755 | 5336129 | 1008402 | 3655540 | 5432428 | 1705671 | 1140395 | 107068019 | 108531397 |
| norm | 1000 | 15097.3 | 116597.8 | 5336.1 | 3655.5 | 1008.4 | 5432.4 | 1705.7 | 1140.4 | 0.918 | 0.931 |

cudadev

| info | nEvents | nHits | nCells | nTuples | nFitTacks | nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| raw | 1000 | 15097326 | 116597755 | 5336143 | 1008393 | 3655568 | 5432449 | 1705691 | 1140365 | 107068019 | 22838481 |
| norm | 1000 | 15097.3 | 116597.8 | 5336.1 | 3655.6 | 1008.4 | 5432.4 | 1705.7 | 1140.4 | 0.918 | 0.196 |


fwyzard commented Jun 23, 2021

The only difference seems to be in nZeroTrackCells: 0.931 vs 0.196 (normalised).


VinInn commented Jun 23, 2021

The meaning changed...


VinInn commented Jun 23, 2021

one can compare the details in the ".log". Clearly "dev" is spending more cycles


VinInn commented Jun 23, 2021

We changed some int16 to int32, and maybe the size of some arrays; plus we made the hits dynamically allocated.
I think this is in my first PR after the merge (or did we do it before the merge?).

 public:
   static constexpr uint32_t maxHits() { return gpuClustering::MaxNumClusters; }
-  using hindex_type = uint16_t; // if above is <=2^16
+  using hindex_type = uint32_t; // if above is <=2^32
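A back-of-the-envelope footprint check for the widening above (names and buffer size are illustrative, not from the repo): every array of hit indices doubles in size when the index type grows from uint16_t to uint32_t, which plausibly increases global-memory traffic in the CA kernels.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes occupied by a buffer of nIndices hit indices of the given width.
constexpr std::size_t indexBufferBytes(std::size_t nIndices, std::size_t indexSize) {
  return nIndices * indexSize;
}

// Hypothetical buffer of 2^20 hit indices: 2 MiB with uint16_t, 4 MiB with uint32_t.
constexpr std::size_t bytes16 = indexBufferBytes(1 << 20, sizeof(std::uint16_t));
constexpr std::size_t bytes32 = indexBufferBytes(1 << 20, sizeof(std::uint32_t));
```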
@VinInn VinInn Jun 23, 2021

this is not esthetic

Contributor

try to revert this

Contributor Author

if I change this I get a failed assert:

/data/user/fwyzard/pixeltrack-standalone/src/cudadev/CUDADataFormats/TrackingRecHit2DHeterogeneous.h:111:62: error: static assertion failed
   static_assert(sizeof(TrackingRecHit2DSOAView::hindex_type) == sizeof(float));
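The assertion is purely about sizes: the SoA view evidently packs the hit-index column into float-sized storage, so only a 4-byte index type fits there. A minimal illustration (assuming the usual 4-byte float):

```cpp
#include <cstdint>

// A uint32_t index can alias float-sized storage slots, a uint16_t cannot;
// this is why reverting hindex_type alone trips the static assertion.
static_assert(sizeof(std::uint32_t) == sizeof(float), "uint32_t fits a float-sized slot");
static_assert(sizeof(std::uint16_t) != sizeof(float), "uint16_t does not");
```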

Contributor

ok. the storing area changed as well. so not so trivial

Contributor

maybe there is a not so hard way to revert the full PR....

Contributor

not obvious, was done BEFORE the merge (or at least not by me in CMSSW master)

Contributor Author

Yes, I think it was done in the Patatrack branch - before the merge in the upstream CMSSW and the following clean up.

Contributor Author

 // types
-using hindex_type = uint16_t; // FIXME from siPixelRecHitsHeterogeneousProduct
 using tindex_type = uint16_t; // for tuples
+using hindex_type = uint32_t; // FIXME from siPixelRecHitsHeterogeneousProduct
Contributor

here as well!

Contributor

all these changes and still in two places....

Contributor Author

if I change this I can run fine, but I still have a ~3.5% loss in performance

Contributor

so this is not the cause of the regression

-thisCell.theUsed |= 1;
-oc.theUsed |= 1;
+thisCell.setUsedBit(1);
+oc.setUsedBit(1);
Contributor

hope these mods are just esthetic

Contributor Author

I assume so, given that it's defined as

__device__ __forceinline__ void setUsedBit(uint16_t bit) { theUsed_ |= bit; }
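A host-side sketch (CUDA qualifiers dropped, struct name invented) confirms the accessor and the raw OR set the same bit:

```cpp
#include <cstdint>

// Simplified stand-in for the cell type: setUsedBit() is exactly the old `|=`.
struct CellSketch {
  std::uint16_t theUsed_ = 0;
  void setUsedBit(std::uint16_t bit) { theUsed_ |= bit; }
};
```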

//to implement
// endcap pixel

class TrackerTopology {
Collaborator

I wonder why TrackerTopology is needed? I did notice the added #include in SiPixelRawToClusterGPUKernel.cu, but from a quick look at the diff the reason wasn't evident. If only a subset of this header is needed, could the copy be limited to that?

Contributor Author

Good point, I'll check if it can be reduced.

Contributor Author

OK, the only part that uses the topology is

return (PixelSubdetector::PixelBarrel == ((rawId >> DetId::kSubdetOffset) & DetId::kSubdetMask));

so I can copy those three constants and avoid including the full topology.
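A sketch of the reduced check, with the constant values copied from CMSSW's DetId.h and PixelSubdetector.h (worth double-checking against the headers in the release in use):

```cpp
#include <cstdint>

// Only the subdetector field of the rawId is needed to tag barrel hits.
// Assumed values: DetId::kSubdetOffset = 25, DetId::kSubdetMask = 0x7,
// PixelSubdetector::PixelBarrel = 1 -- verify against the actual CMSSW headers.
constexpr int kSubdetOffset = 25;
constexpr std::uint32_t kSubdetMask = 0x7;
constexpr std::uint32_t kPixelBarrel = 1;

constexpr bool isBarrel(std::uint32_t rawId) {
  return kPixelBarrel == ((rawId >> kSubdetOffset) & kSubdetMask);
}
```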

Collaborator

The PixelSubdetector::PixelBarrel is defined in PixelSubdetector.h (added in this PR), and the DetId::* in DetId.h (added in this PR), so maybe the TrackerTopology could be avoided completely?

Contributor Author

Thanks for the suggestion, I've updated the PR, dropping DataFormats/TrackerTopology.h and including the other headers directly.

@makortel
Collaborator

Overall this PR looks ok for cudadev (after the TrackerTopology is resolved), but I'm curious about the motivation for the update? E.g. is there a desire to have a periodically updated copy in this standalone repo?


fwyzard commented Jun 24, 2021

Hi Matti, I've prepared this update for two reasons:

  1. I want to check if the latest version of the code (after various people's clean up) is easier to debug to understand Illegal memory access in the CUDA implementation with multiple concurrent processes #188
  2. Eric is working on a SoA model, and I think we can test it here before approaching CMSSW; having a more recent code base could make the port to CMSSW easier

I don't know if we want to do regular updates. If we do (on a regular basis or not) the next ones might be easier, now that all code is in CMSSW and we can check the upstream PRs.
In this update some things were complicated because the updates happened across forks.


VinInn commented Jun 24, 2021

If one dumps the PTX for the two versions of kernel_find_ntuplets, we can try to spot whether there is a major offender....


fwyzard commented Jun 24, 2021

Took me a bit, but here they are:


fwyzard commented Jun 24, 2021

1. I want to check if the latest version of the code (after various people's clean up) is easier to debug to understand [Illegal memory access in the CUDA implementation with multiple concurrent processes #188](https://github.com/cms-patatrack/pixeltrack-standalone/issues/188)

... and I'm happy to report that the same issue is still there after the update (but somewhat harder to trigger) :-)


fwyzard commented Jun 24, 2021

On the other hand, if I run multiple jobs concurrently, the cudadev implementation has the same throughput as the cuda implementation.

So maybe we don't need to worry too much about the single job slowdown.

@makortel
Collaborator

  1. I want to check if the latest version of the code (after various people's clean up) is easier to debug to understand Illegal memory access in the CUDA implementation with multiple concurrent processes #188
  2. Eric is working on a SoA model, and I think we can test it here before approaching CMSSW; having a more recent code base could make the port to CMSSW easier

Thanks for the clarifications, makes sense. I have also been planning to use cudadev as a prototyping ground for the "CUDA framework" side (I actually started a long time ago and then got distracted), and that aim seems compatible with what you described above.

I don't know if we want to do regular updates. If we do (on a regular basis or not) the next ones might be easier, now that all code is in CMSSW and we can check the upstream PRs.

Ok. I suppose the current "update when there is demand" can be good-enough for now.

@makortel makortel added the cuda label Jun 24, 2021
fwyzard and others added 2 commits June 25, 2021 09:29
Co-authored-by: Angela Czirkos <[email protected]>
Co-authored-by: Eric Cano <[email protected]>
Co-authored-by: Felice Pantaleo <[email protected]>
Co-authored-by: Marco Musich <[email protected]>
Co-authored-by: Matti Kortelainen <[email protected]>
Co-authored-by: Tamas Vami <[email protected]>
Co-authored-by: Vincenzo Innocente <[email protected]>
@fwyzard fwyzard force-pushed the update_cudadev_to_CMSSW_12_0_0_pre3 branch from fb9033e to fe2bc62 Compare June 25, 2021 07:29
@makortel makortel merged commit 6dded93 into cms-patatrack:master Jun 28, 2021
@fwyzard fwyzard deleted the update_cudadev_to_CMSSW_12_0_0_pre3 branch June 28, 2021 14:22