Conversation

@GNiendorf (Member) commented Nov 3, 2025

Introduces a new kernel that merges built T5s on top of T5 and pT5 Track Candidates. Changes the LST TC data structure to accommodate longer tracks.
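For orientation, here is a conceptual, host-side sketch of the merging step (my illustration only, not the GPU kernel added in this PR; the shared-hit threshold kMinSharedHits and the hit-index containers are assumptions):

#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kMinSharedHits = 8;  // assumed overlap requirement, illustrative only

// If the T5 shares at least kMinSharedHits hits with the candidate, append its
// remaining hits to the candidate; this is what lets a TC grow past its old length.
bool mergeT5IntoCandidate(std::vector<unsigned int>& candidateHits,
                          const std::vector<unsigned int>& t5Hits) {
  std::size_t shared = 0;
  for (unsigned int hit : t5Hits)
    if (std::find(candidateHits.begin(), candidateHits.end(), hit) != candidateHits.end())
      ++shared;
  if (shared < kMinSharedHits)
    return false;  // not a duplicate of this candidate, leave it alone
  for (unsigned int hit : t5Hits)
    if (std::find(candidateHits.begin(), candidateHits.end(), hit) == candidateHits.end())
      candidateHits.push_back(hit);  // the candidate gets longer, hence the TC data-structure change
  return true;
}

The actual kernel does the equivalent on the device over the TC/T5 pairs, which is why the TC data structure has to accommodate the extra hits.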

[plot: TC_avgOTlen_etacoarse]

Zoom in:
[plot: TC_avgOTlen_etacoarsezoom]

Histogram of Lengths:
[plot]

Purity of extensions looks good.
The plots below are the same; one just has the y-scale set to log. There is only a small decrease in pMatched of the track candidates.

[plots: pMatched comparison, linear and log y-scale]

@GNiendorf (Member Author) commented:

Fixed an error where the additional hits weren't being considered in the plots; you should see at least some differences now.

/run all

@GNiendorf (Member Author) commented:

/run standalone

SegmentLinking deleted a comment from github-actions bot Nov 4, 2025
@slava77 commented Nov 4, 2025

> Histogram of track lengths with this current draft PR.

Is this ttbar? How does this look in log scale? I'm interested to see the odd nHits entries (or is this for 100% matched tracks?)

@GNiendorf (Member Author) commented Nov 4, 2025

> Histogram of track lengths with this current draft PR.
>
> Is this ttbar? How does this look in log scale? I'm interested to see the odd nHits entries (or is this for 100% matched tracks?)

Sorry, that's just poor plot formatting if I'm understanding your confusion. There are only bins at x=0 (pLSs), 6 (pT3s), 10 (pT5s and T5s), 12 (extended once), and 14 (extended twice). Each extension adds one mini-doublet (2 hits), so the entries stay even. This doesn't factor in whether the hits are real or fake, just the raw length of the TCs.

@slava77 commented Nov 4, 2025

> There are only bins at x=0 (pLSs), 6 (pT3s), 10 (pT5s and T5s), 12 (extended once), and 14 (extended twice).

Is merging allowed on the same doublet module or in the same layer?
If not, I can understand why only even entries appear.
Otherwise there should be cases where two merged MDs have one common hit.

@slava77 commented Nov 5, 2025

/run standalone

@slava77 commented Nov 5, 2025

/run gpu-standalone

@GNiendorf (Member Author) commented:

Speeding up this kernel has been difficult. Moving the code into the existing duplicate-cleaning kernel did not give much benefit, so I'm trying to make the kernel less wasteful instead.

@GNiendorf (Member Author) commented:

/run gpu-standalone

@GNiendorf (Member Author) commented:

/run gpu-standalone

@GNiendorf (Member Author) commented:

/run gpu-standalone

@GNiendorf (Member Author) commented:

Timing is finally fixed. Code is janky though, and still only works in standalone.

@GNiendorf (Member Author) commented:

/run gpu-cmssw

@github-actions bot commented:

There was a problem while building and running with CMSSW on GPU. The logs can be found here.

@GNiendorf (Member Author) commented:

/run gpu-cmssw

@GNiendorf (Member Author) commented:

/run gpu-cmssw

@GNiendorf (Member Author) commented:

Plots look good from what I see.
[screenshots of comparison plots]

GNiendorf changed the title from "Work in Progress: LST Duplicate Merging" to "LST T5-T5 Duplicate Merging" on Dec 2, 2025
GNiendorf marked this pull request as ready for review on December 2, 2025 15:34
@GNiendorf (Member Author) commented:

Marking this PR as ready for review. Will push some final cleanup soon.

@github-actions bot commented:

The PR was built and ran successfully in standalone mode on GPU. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
[target branch]
   avg     35.9      0.4      0.5      0.5      0.9      0.3      0.6      0.6      0.4      1.2      0.0      41.4       5.1+/-  2.6      41.4   explicit[s=1]
   avg      1.4      0.6      0.6      0.8      1.2      0.3      0.9      1.0      0.5      1.6      0.0       8.9       7.2+/-  3.5       4.6   explicit[s=2]
   avg      2.4      1.0      1.1      1.3      1.7      0.4      1.5      1.9      0.9      2.6      0.1      15.0      12.1+/-  5.1       3.8   explicit[s=4]
   avg      1.7      1.6      1.6      2.0      2.3      0.6      2.2      2.8      1.2      3.2      0.2      19.5      17.2+/-  8.2       4.0   explicit[s=6]
   avg      2.2      2.1      1.7      3.2      3.4      0.7      3.4      3.2      2.4      4.7      0.1      27.2      24.2+/- 11.7       4.0   explicit[s=8]
[this PR]
   avg     36.8      0.4      0.4      0.5      0.9      0.3      0.6      0.7      0.4      1.5      0.0      42.5       5.4+/-  2.6      42.5   explicit[s=1]
   avg      0.9      0.6      0.6      0.8      1.1      0.3      0.9      1.0      0.5      2.0      0.0       8.7       7.4+/-  3.2       4.7   explicit[s=2]
   avg      2.5      0.9      1.0      1.3      1.7      0.4      1.4      1.7      1.0      3.2      0.1      15.2      12.3+/-  5.2       3.9   explicit[s=4]
   avg      3.7      1.6      1.7      2.3      2.4      0.6      2.2      2.8      1.4      5.1      0.1      23.8      19.6+/-  9.4       4.0   explicit[s=6]
   avg      4.5      1.9      2.1      3.3      3.1      0.7      3.0      3.2      1.5      5.0      0.4      28.5      23.4+/-  9.0       3.7   explicit[s=8]

@github-actions bot commented:

The PR was built and ran successfully with CMSSW on GPU. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@GNiendorf (Member Author) commented Dec 10, 2025

@slava77 Let me know if there are any remaining comments. Thanks for the comments so far; I think this PR is a good starting point for me to continue exploring duplicate merging (maybe start merging T3s, relax the hit-sharing requirements, look at adding hits on the same layer, etc.).

Comment on lines 738 to 751
const auto& threadIdx = alpaka::getIdx<alpaka::Block, alpaka::Threads>(acc);
const auto& blockDim = alpaka::getWorkDiv<alpaka::Block, alpaka::Threads>(acc);

// Flatten the 2D thread indices within the block (Y, X) into one index
const int threadIndexFlat = threadIdx[1u] * blockDim[2u] + threadIdx[2u];
const int blockDimFlat = blockDim[1u] * blockDim[2u];

// Scan over lower modules, striding by the flattened block size
for (int lowerModuleIndex = lowerModuleBegin + threadIndexFlat; lowerModuleIndex < lowerModuleEnd;
     lowerModuleIndex += blockDimFlat) {
  // ... (rest of the reviewed range, lines 738 to 751, not shown in this excerpt)
@slava77 commented:

(following on an earlier comment)

> Cleaned this code up a bit to be similar to what Yanxi does in the T5 build kernel, but I'm not sure if there's something cleaner we can replace this with.

I'm not sure that analogy applies. Wouldn't a simple

for (auto lowerModuleIndex : cms::alpakatools::uniform_elements(acc, lowerModuleEnd))

be enough?

@GNiendorf (Member Author) replied:

I don't think this code example works, because we start at lowerModuleBegin rather than 0, but I think this gets at the question of why I am using Acc3D when I flatten two of those dimensions. It looks like Acc1D works just as well; I'll push that soon.

@GNiendorf (Member Author) commented Dec 11, 2025:

I can't find any cms::alpakatools wrapper functions that would allow me to do the block-level work I use here, even in 1D (each block handles its own TC, with shared memory for that TC/block, unlike uniform_elements, which I think distributes over multiple blocks?). Let me know if you know of one; otherwise I think Acc1D is as clean as this can go.

@slava77 replied:

Since the block direction is aligned with the candidates, perhaps use a 2D accelerator and uniform_elements_x? The range can always run over lowerModuleEnd - lowerModuleBegin, with lowerModuleBegin added back as an offset.

Note that I expect the saving from having lowerModuleBegin instead of just 0 to be relatively minor, and it would go away once we start looking for overlap hits on the same layer.
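A minimal sketch of what that 2D variant could look like (my illustration, not the code in this PR), assuming the usual CMSSW alpaka setup (Acc2D and cms::alpakatools from HeterogeneousCore/AlpakaInterface); the kernel name, arguments, and per-module work are placeholders:

struct MergeT5sKernelSketch {
  ALPAKA_FN_ACC void operator()(Acc2D const& acc,
                                unsigned int lowerModuleBegin,
                                unsigned int lowerModuleEnd) const {
    // blocks along y map to track candidates (dimension 0 is y for a 2D accelerator)
    const unsigned int tcIndex = alpaka::getIdx<alpaka::Grid, alpaka::Blocks>(acc)[0u];

    // stride along x over the module range, adding lowerModuleBegin back as an offset
    for (auto offset : cms::alpakatools::uniform_elements_x(acc, lowerModuleEnd - lowerModuleBegin)) {
      const unsigned int lowerModuleIndex = lowerModuleBegin + offset;
      // ... look for T5s on lowerModuleIndex that overlap candidate tcIndex and merge them ...
      // (placeholder; the actual overlap/merge logic lives in the PR's kernel)
    }
  }
};

With this layout each block still works on a single TC (picked by its y index), so block-level shared memory for that TC remains possible, while the x loop no longer needs the hand-rolled index flattening.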

@slava77 commented Dec 10, 2025

> The full set of validation and comparison plots can be found here.

[plot: TC nHits distribution from the CMSSW validation]

This looks problematic.

  • 15-16 is expected, OK
  • 17-18 is probably a 7-layer case for eta between 1.6-1.8, where we can have 5 endcap + 2 barrel layers.
  • 24-26 looks like a bug; my guess is there are some cases with initialization to 0 or perhaps some out-of-bounds reads (I started a CPU test to see if the issue persists there as well).

@slava77 commented Dec 10, 2025

/run all

@github-actions bot commented:

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     30.6    379.7    274.2    117.5     52.4    701.5     11.8    123.2    133.3    190.5      1.5    2016.2    1284.1+/- 305.4     619.3   explicit[s=4] (target branch)
   avg     30.8    376.2    271.9    118.3     52.1    688.5     11.9    127.3    132.9    185.3      1.8    1996.8    1277.5+/- 304.3     618.7   explicit[s=4] (this PR)

@github-actions bot commented:

The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@GNiendorf (Member Author) commented:

/run gpu-all

@github-actions bot commented:

The PR was built and ran successfully in standalone mode on GPU. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
[target branch]
   avg     33.9      0.4      0.4      0.5      0.9      0.3      0.6      0.6      0.4      1.2      0.0      39.2       5.0+/-  2.5      39.2   explicit[s=1]
   avg      1.1      0.5      0.6      0.7      1.1      0.3      0.9      0.9      0.5      1.6      0.0       8.2       6.8+/-  3.0       4.2   explicit[s=2]
   avg      1.9      0.8      0.9      1.2      1.6      0.4      1.4      1.4      0.6      2.4      0.0      12.8      10.5+/-  3.8       3.4   explicit[s=4]
   avg      2.8      1.2      1.3      1.8      2.3      0.6      1.9      1.9      1.0      3.3      0.0      18.2      14.8+/-  4.6       3.2   explicit[s=6]
   avg      3.6      1.7      1.9      2.4      2.9      0.8      2.7      2.8      1.2      4.2      0.0      24.1      19.7+/-  5.3       3.1   explicit[s=8]
[this PR]
   avg     34.6      0.4      0.4      0.5      0.9      0.3      0.6      0.7      0.4      1.4      0.0      40.1       5.3+/-  2.6      40.1   explicit[s=1]
   avg      1.1      0.5      0.5      0.7      1.1      0.3      0.9      0.9      0.5      1.9      0.0       8.5       7.0+/-  2.8       4.3   explicit[s=2]
   avg      1.9      0.8      1.0      1.2      1.6      0.4      1.3      1.4      0.6      2.9      0.0      13.1      10.8+/-  3.8       3.4   explicit[s=4]
   avg      2.7      1.3      1.4      1.8      2.2      0.6      1.9      2.0      0.9      3.8      0.0      18.6      15.4+/-  4.8       3.2   explicit[s=6]
   avg      3.7      1.7      1.9      2.6      2.9      0.7      2.5      2.5      1.3      4.8      0.0      24.7      20.3+/-  4.8       3.2   explicit[s=8]

@GNiendorf (Member Author) commented Dec 11, 2025

> • 15-16 is expected, OK
> • 17-18 is probably a 7-layer case for eta between 1.6-1.8, where we can have 5 endcap + 2 barrel layers.
> • 24-26 looks like a bug; my guess is there are some cases with initialization to 0 or perhaps some out-of-bounds reads (I started a CPU test to see if the issue persists there as well).

I've seen this bump in the CMSSW plots since the beginning of this PR, for both CPU and GPU. I don't see the bump in the standalone plots, so I assumed it had something to do with the final fit or something similar that CMSSW does differently from standalone.

Edit: I guess there are 11 OT layers, so 22 OT hits plus 3-4 pixel hits would give this? So maybe there is some bug where all hits get read as non-empty?

@github-actions bot commented:

The PR was built and ran successfully with CMSSW on GPU. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@slava77 commented Dec 11, 2025

> Edit: I guess there are 11 OT layers, so 22 OT hits plus 3-4 pixel hits would give this? So maybe there is some bug where all hits get read as non-empty?

That's my guess.

I don't see an explosion in fakes:
[plot: fake rate comparison]
So this is relatively rare. I don't see any change in the CPU variant.
It would help to find these tracks and inspect/visualize them in some way: just counting the number of OT hits or layers in the LSTOutputConverter should be a good way to catch the candidates.

In the CMSSW setup we are supposedly using the LST candidates directly to then run a fit on them, so there is no path to add more hits (though there is a way to lose some due to fitting).
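A rough sketch of that counting idea (illustrative only; the containers and the threshold below are assumptions, not the actual LSTOutputConverter interface):

#include <cstdio>
#include <vector>

// Flag candidates whose outer-tracker hit count exceeds what the geometry should allow,
// so the suspicious 24-26-hit entries can be dumped and inspected.
void flagOverlongCandidates(const std::vector<std::vector<unsigned int>>& candidateHitIndices,
                            const std::vector<std::vector<bool>>& hitIsPixel,
                            std::size_t maxExpectedOTHits) {
  for (std::size_t tc = 0; tc < candidateHitIndices.size(); ++tc) {
    std::size_t nOTHits = 0;
    for (std::size_t h = 0; h < candidateHitIndices[tc].size(); ++h) {
      if (!hitIsPixel[tc][h])
        ++nOTHits;  // count only outer-tracker hits
    }
    if (nOTHits > maxExpectedOTHits)
      std::printf("candidate %zu has %zu OT hits (%zu hits total)\n",
                  tc, nOTHits, candidateHitIndices[tc].size());
  }
}

Dumping the hits of whatever this flags would make it easy to pull up and visualize the suspicious candidates.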

GNiendorf force-pushed the t5_t5_merging branch 2 times, most recently from 02e949a to d55821d, on December 11, 2025 15:58
SegmentLinking deleted a comment from github-actions bot Dec 11, 2025
@GNiendorf (Member Author) commented:

/run gpu-CMSSW

@slava77 commented Dec 11, 2025

@github-actions bot commented:

The PR was built and ran successfully with CMSSW on GPU. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@GNiendorf (Member Author) commented:

[screenshot of the updated plot]

@slava77 Looks good now after adding resets to the pLS function plus some other cleanup.

@slava77 commented Dec 11, 2025

> Plots look good from what I see.
> [screenshot of comparison plots]

This kind of disproves my older "intuition" argument (from a different context) that adding hits to a pT5 should not improve the momentum resolution.
I will still, perhaps more quietly, think that it's the case, and speculate that the improvements here come from adding a hit to a pT5 that skipped B1, or perhaps from adding another hit to a T5. Not sure if someone wants to explore which category gains more.

github-actions bot merged commit 5e0394e into master on Dec 16, 2025 (1 check passed).