Track Embeddings for Improved Duplicate Removal in LST #48249
Conversation
cms-bot internal usage

+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-48249/45056
A new Pull Request was created by @GNiendorf for master. It involves the following packages:
@cmsbuild, @jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks. cms-bot commands are listed here
test parameters:

@cmsbuild please test
+1 Size: This PR adds an extra 292KB to repository
Comparison Summary:
CUDA Comparison Summary:
ROCM Comparison Summary:
@GNiendorf noticed that we do not have comparisons for the GPU workflows (.704). @iarspider @smuzaffar
please test
@slava77, thanks for pointing this out. This should be fixed now. For baseline relvals, we first run
assign heterogeneous |
+1 Size: This PR adds an extra 292KB to repository
Comparison Summary:
CUDA Comparison Summary:
ROCM Comparison Summary:
      const float radius,
-     const float betaIn) {
+     const float betaIn,
+     float (&output)[dnn::t3dnn::kOutputFeatures]) {
isn't

-     float (&output)[dnn::t3dnn::kOutputFeatures]) {
+     float output[dnn::t3dnn::kOutputFeatures]) {

equivalent?
I see what you mean, but with the reference form you do get a compile-time check that the array passed to output has exactly kOutputFeatures elements.
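For concreteness, a minimal standalone sketch of the distinction being discussed; kOutputFeatures here stands in for dnn::t3dnn::kOutputFeatures and the function names are hypothetical:

```cpp
#include <cstddef>

constexpr std::size_t kOutputFeatures = 4;  // stand-in for dnn::t3dnn::kOutputFeatures

// Reference-to-array parameter: the extent is part of the type, so passing
// an array of the wrong size is rejected at compile time.
void fillByRef(float (&output)[kOutputFeatures]) { output[0] = 1.f; }

// Plain array parameter: it decays to float*, so the stated extent is not
// enforced and any float array (or pointer) is accepted.
void fillByPtr(float output[kOutputFeatures]) { output[0] = 1.f; }

int main() {
  float right[kOutputFeatures] = {};
  float wrong[kOutputFeatures + 1] = {};

  fillByRef(right);    // OK: sizes match
  // fillByRef(wrong); // compile error: cannot bind float[5] to float (&)[4]
  fillByPtr(right);    // OK
  fillByPtr(wrong);    // also compiles: no size check
  return 0;
}
```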
| float& dBeta1, | ||
| float& dBeta2, | ||
| bool& tightCutFlag, | ||
| float (&t5Embed)[Params_T5::kEmbed], |
| float (&t5Embed)[Params_T5::kEmbed], | |
| float t5Embed[Params_T5::kEmbed], |
Ah, I see.
OK then :-)
+heterogeneous
Yes, exactly. In the current delta-R based cleaning, two real tracks with small angular separation can be mistakenly treated as duplicates, causing one to be removed and lowering efficiency. If a fake track is near a real one and the fake has a higher score (the sum of a few chi-squared values), the real track can also be incorrectly removed. This can happen, for example, when the real track is displaced. This PR fixes both cases, which is also why the fake rate increases slightly. Fake-real pairs are always treated as non-duplicates during training, so some fake tracks that were previously being cleaned away by the simple delta-R flag are no longer marked as duplicates.
The increase in fake rate is relatively small, and the fake-rate reduction that came from duplicate cleaning removing fakes close to other fakes or real tracks in the detector is a side effect I don't think we should rely on. This behavior could probably be replicated in the embeddings by lowering the 75% threshold for real hits during training to something like 55%, so that a “fake” track with 60% of its hits matched to a sim track would be marked as a duplicate of a “real” track with more than 75% matched hits to the same sim track, but again this could lower efficiency if the fake track gets chosen over the real one.
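For concreteness, a minimal sketch of the pair-labeling rule described above, assuming a simple matched-hit-fraction definition of a “real” track; the actual rule lives in the training notebook, and the types and names here are hypothetical:

```cpp
#include <optional>

// A track is treated as "real" only if at least 75% of its hits are matched
// to a sim track (the threshold discussed above).
constexpr float kRealThreshold = 0.75f;

struct TrackMatch {
  std::optional<int> simIdx;       // best-matched sim track index, if any
  float matchedHitFraction = 0.f;  // fraction of hits matched to that sim track
};

// Pairs are labeled as duplicates for training only when both tracks are
// "real" and matched to the same sim track; fake-real (and fake-fake) pairs
// are always labeled as non-duplicates.
bool isDuplicateLabel(const TrackMatch& a, const TrackMatch& b) {
  const bool aReal = a.simIdx && a.matchedHitFraction >= kRealThreshold;
  const bool bReal = b.simIdx && b.matchedHitFraction >= kRealThreshold;
  return aReal && bReal && *a.simIdx == *b.simIdx;
}
```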
Mhm... did you also consider implementing a duplicate removal based on shared hits?
Yes, many of the duplicate cleaning steps already check for shared hits. See, for example, cmssw/RecoTracker/LSTCore/src/alpaka/Kernels.h, lines 175 to 177 in f79df57.
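Roughly, a shared-hit check looks like the following sketch; this is not the actual Kernels.h code, and the per-track hit count and threshold are placeholders:

```cpp
#include <array>
#include <cstddef>

// Placeholder number of hits per track candidate.
constexpr std::size_t kHitsPerTrack = 10;

// Count hit indices that appear in both candidates.
unsigned int countSharedHits(const std::array<unsigned int, kHitsPerTrack>& hitsA,
                             const std::array<unsigned int, kHitsPerTrack>& hitsB) {
  unsigned int nShared = 0;
  for (unsigned int a : hitsA)
    for (unsigned int b : hitsB)
      if (a == b)
        ++nShared;
  return nShared;
}

// Flag the pair as duplicates when they share at least minShared hits.
bool isDuplicateBySharedHits(const std::array<unsigned int, kHitsPerTrack>& hitsA,
                             const std::array<unsigned int, kHitsPerTrack>& hitsB,
                             unsigned int minShared = 3) {
  return countSharedHits(hitsA, hitsB) >= minShared;
}
```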
+1
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @mandrenguyen, @sextonkennedy, @antoniovilela (and backports should be raised in the release meeting by the corresponding L2)
+1

This PR introduces a new method of duplicate removal using fully-connected neural networks to compute low-dimensional embeddings of tracks, creating a learned similarity measure for duplicate track rejection. Two small neural networks are trained to map pLS and T5 track features into a shared 6-dimensional embedding space using a contrastive loss function. Duplicate candidates are then identified by placing cuts on the Euclidean distance between tracks in the learned embedding space, replacing the current T5-T5, T5-pT5, and pLS-T5 delta-R based duplicate removal.
T5-T5 and pLS-T5 pairs with small angular separation (delta-R squared < 0.02) are used for DNN training. Cuts on the embedding distance introduced by this PR reduce the LST duplicate rate in the barrel by up to 50% and substantially increase displaced track efficiency. Timing differences from the additional embedding DNNs are negligible, in part because embeddings are computed per-track and this method only requires a simple pairwise Euclidean distance calculation between embedding vectors.
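For illustration, a minimal sketch of the embedding-distance test described above; the 6-dimensional embedding size comes from the PR text, while the function names, threshold value, and data layout are hypothetical:

```cpp
#include <cstddef>

// Embedding dimension taken from the PR description (6-dimensional space).
constexpr std::size_t kEmbedDim = 6;

// Squared Euclidean distance between two track embeddings.
inline float embedDist2(const float (&a)[kEmbedDim], const float (&b)[kEmbedDim]) {
  float d2 = 0.f;
  for (std::size_t i = 0; i < kEmbedDim; ++i) {
    const float diff = a[i] - b[i];
    d2 += diff * diff;
  }
  return d2;
}

// Two candidates (e.g. T5-T5 or pLS-T5) are flagged as duplicates when their
// embeddings are closer than a tuned threshold; the value here is a placeholder.
inline bool isDuplicateByEmbedding(const float (&a)[kEmbedDim],
                                   const float (&b)[kEmbedDim],
                                   float maxDist2 = 0.1f) {
  return embedDist2(a, b) < maxDist2;
}
```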
More details can be found here: Embed_T5_PLS.pdf
The DNN training notebook is also added to the standalone codebase in this PR, in line with previous ML-related PRs for LST: #47618, #46857, and #47995.
PR validation:
This PR was tested on CPU and GPU in the standalone configuration and runs without issue.
@slava77