Track classifier using TensorFlow #31682
Conversation
The code-checks are being triggered in jenkins.
| <use name="FWCore/PluginManager"/> | ||
| <use name="PhysicsTools/TensorFlow"/> | ||
| <use name="FWCore/Utilities"/> | ||
| <use name="RecoTracker/Record"/> |
hi @hajohajo - these dependencies on RecoTracker and PhysicsTools are not allowed in DataFormats.
I think you should move TrackTfGraphDefProducer, TfGraphDefWrapper to RecoTracker/FinalTrackSelectors.
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-31682/18820
Code check has found code style and quality issues which could be resolved by applying the following patch(es)
I thought that we had a matrix workflow for this; apparently not.
These have to be addressed before we can run further tests with the bot.
Once the dedicated workflow which enables the DNN approach is defined, a profile test should be performed.
@hajohajo thanks for this PR, a few suggestions to move forward with the review:
… RecoTracker/FinalTrackSelectors
The code-checks are being triggered in jenkins.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-31682/18921
A new Pull Request was created by @hajohajo for master.
It involves the following packages:
RecoTracker/FinalTrackSelectors
@perrotta, @jpata, @cmsbuild, @slava77 can you please review it and eventually sign? Thanks.
cms-bot commands are listed here
@JanFSchulte @hajohajo please do some more fine-grained checks with igprof if possible, as suggested by Slava: #31682 (comment). This and addressing the code review comments are currently the open items to address before moving forward. A look by the ML prod group would also be nice-to-have.
    class TfGraphDefWrapper {
    public:
      TfGraphDefWrapper(tensorflow::GraphDef*);
      tensorflow::GraphDef* GetGraphDef() const;
This must return

    tensorflow::GraphDef const* getGraphDef() const;

instead of

    tensorflow::GraphDef* GetGraphDef() const;

to ensure the graph is used in a const, thread-safe manner.
    // ------------ method called to produce the data ------------
    std::unique_ptr<TfGraphDefWrapper> TfGraphDefProducer::produce(const TfGraphRecord& iRecord) {
      return std::unique_ptr<TfGraphDefWrapper>(&wrapper_);
This leads to a double-delete (once by the destructor of TfGraphDefProducer, and once by the EventSetup system).
I would actually suggest

    return make_unique<TfGraphDefWrapper>(tensorflow::loadGraphDef(...));

instead of

    return std::unique_ptr<TfGraphDefWrapper>(&wrapper_);

This way the graph is loaded only if some other module consumes it, instead of at construction time.
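For concreteness, a minimal sketch of how the suggested produce() could look; the fileName_ member and the exact loadGraphDef argument are assumptions for illustration, not code from this PR:

    // Sketch: load the graph lazily and hand ownership to the EventSetup system.
    std::unique_ptr<TfGraphDefWrapper> TfGraphDefProducer::produce(const TfGraphRecord& iRecord) {
      // tensorflow::loadGraphDef returns a newly allocated GraphDef owned by the caller;
      // wrapping it here avoids the double delete of a member-owned wrapper.
      return std::make_unique<TfGraphDefWrapper>(tensorflow::loadGraphDef(fileName_));
    }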
    tensorflow::GraphDef* GetGraphDef() const;

    private:
      tensorflow::GraphDef* graphDef_;
How about

    std::unique_ptr<tensorflow::GraphDef> graphDef_;

instead of

    tensorflow::GraphDef* graphDef_;

?
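Taken together, a rough sketch of a wrapper incorporating both suggestions (the constructor signature and includes are assumptions for illustration):

    #include <memory>
    #include "PhysicsTools/TensorFlow/interface/TensorFlow.h"

    class TfGraphDefWrapper {
    public:
      // takes ownership of the graph loaded by the ESProducer
      explicit TfGraphDefWrapper(tensorflow::GraphDef* graphDef) : graphDef_(graphDef) {}
      // const access only, so consumers cannot modify the shared graph
      tensorflow::GraphDef const* getGraphDef() const { return graphDef_.get(); }

    private:
      std::unique_ptr<tensorflow::GraphDef> graphDef_;
    };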
      : tfDnnLabel_(cfg.getParameter<std::string>("tfDnnLabel"))
    {}
    TfDnnCache* cache_ = new TfDnnCache();
Would

    TfDnnCache cache_;

instead of

    TfDnnCache* cache_ = new TfDnnCache();

work?
    @@ -0,0 +1,3 @@
    #include "PhysicsTools/TensorFlow/interface/TensorFlow.h"
    #include "FWCore/Utilities/interface/typelookup.h"
    TYPELOOKUP_DATA_REG(tensorflow::Session);
This should indeed be unnecessary.
    // class declaration

    class TfGraphDefProducer : public edm::ESProducer {
Just to note that if this class were made an ESSource, an EmptyESSource for the TfGraphRecord would not be needed. (It depends mostly on whether other ESProducers producing products into the TfGraphRecord, without consuming the product of TfGraphDefProducer, are foreseen.)
Given that the ESProducer itself is generic (even if the Session is made part of the ESProduct), I can easily imagine other users in the future, in which case it would be better to keep using the EmptyESSource and not make this module an ESSource.
    void TfGraphDefProducer::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
      edm::ParameterSetDescription desc;
      desc.add<std::string>("ComponentName", "tfGraphDef");
      desc.add<edm::FileInPath>("FileName", edm::FileInPath());
Leaving out the default would make it clear already in the python configuration that a necessary parameter has not been set.
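For illustration, a sketch of fillDescriptions with the default left out (the module label passed to descriptions.add is a placeholder, not necessarily the one used in this PR):

    void TfGraphDefProducer::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
      edm::ParameterSetDescription desc;
      desc.add<std::string>("ComponentName", "tfGraphDef");
      // no default value: the python configuration must set FileName explicitly,
      // otherwise parameter validation fails at configuration time
      desc.add<edm::FileInPath>("FileName");
      descriptions.add("tfGraphDefProducer", desc);
    }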
Thanks for this further check. So the timing seems under control (1 inference per track), while the memory load is increased significantly. What are the possible approaches? Thanks!
This is 0 PU. I can rerun with PU 50 to see if that makes a large difference. For the memory, while it increased significantly, I would like to understand how much of an issue that actually is. It comes in at 899th place, so I am not sure if this is a "not nice, we would like to reduce it" or an "it can't possibly be deployed in this state" kind of situation.
please quantify this in MB.
the last two will scale with the number of threads.
According to our discussion offline and a multi-threaded test I just finished, the situation is the following:
In conclusion, the total memory cost is ~13MiB per thread/stream. I found 104MiB for the 8-thread test, accordingly. As the memory cost for the BDT is too low to show up in the profile, which only shows the top 1000 entries, I don't know how large the increase actually is. But for the single-threaded case the overall memory use of the job increased by only 6 MiB, from 2,833,769,503 to 2,839,613,441 (see https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/jschulte/PR-31682-wf10824.1/igreport_memlive_1stEvent_BDT compared to https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/jschulte/PR-31682-wf10824.1/igreport_memlive_1stEvent_BDT). @mtosi I checked PU 50 and found no dependence of these numbers on PU.
Thank you for these comprehensive checks. To me, 13MiB/thread (0.45% of the total?) seems to be in the ballpark of what's to be expected from a DNN with O(100k) float32-s; quantizing float32->int8 could help here.
One can always make it a OneProducer. My understanding is that it is not an issue of the constant weights alone: the network itself needs memory to compute; experts can confirm.
https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/jschulte/PR-31682-wf10824.1/igreport_memlive_1stEvent_BDT/1246
I'm still puzzled why the current case scales with nThreads; shouldn't there be just one session per job (covering all classifiers)?
is there an igprof file for that?
    edm::ESHandle<TfGraphDefWrapper> tfDnnHandle;
    es.get<TfGraphRecord>().get(tfDnnLabel_, tfDnnHandle);
    tensorflow::GraphDef* graphDef_ = tfDnnHandle.product()->GetGraphDef();
    cache_->session_ = tensorflow::createSession(
ah, ok, this creates a new session per stream.
This explains the scaling.
Ah, yes, I could have pointed you to that, sorry. I thought you were asking why it was implemented that way.
we need cache_ to be a part of the global cache, so the base class (TrackMVAClassifierBase : public edm::stream::EDProducer<>) needs some modification
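As a rough sketch of what such a modification could look like, assuming the cache is shared via edm::GlobalCache (class and member names here are illustrative, not the actual TrackMVAClassifierBase interface):

    #include <memory>
    #include "FWCore/Framework/interface/stream/EDProducer.h"
    #include "FWCore/ParameterSet/interface/ParameterSet.h"

    struct TfDnnCache {
      // in the real implementation this would hold the shared tensorflow::Session
    };

    class TrackTfClassifierSketch : public edm::stream::EDProducer<edm::GlobalCache<TfDnnCache>> {
    public:
      TrackTfClassifierSketch(edm::ParameterSet const& cfg, TfDnnCache const* cache) {}

      // called once per job, before any stream instance is constructed
      static std::unique_ptr<TfDnnCache> initializeGlobalCache(edm::ParameterSet const& cfg) {
        return std::make_unique<TfDnnCache>();
      }

      // called once per job, after all stream instances are done
      static void globalEndJob(TfDnnCache* cache) {
        // close the shared session here, if one was created
      }

      void produce(edm::Event&, edm::EventSetup const&) override {}
    };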
I thought Session is not thread safe, and therefore may not be shared across stream instances.
all other TF sessions that we have are in streams; although I'm still not sure if this was done deliberately due to the session being not MT safe.
@Dr15Jones @riga is a session not thread safe?
My understanding is that there is one Graph per job (in an ESProduct in this case), and then one Session per "inference caller".
In this case the number of Session objects also gets multiplied by the number of iterations where the DNN is used. And by construction the DNN modules are sequential (except jetCore, which is concurrent to everything else). One option would be to use
pinging @mtosi @JanFSchulte et al: please let us know your plans on how to take this forward. Code review items should be addressed (in a new PR?), while threading/memory efficiency can also be left for a follow-up.
@jpata We are in the process of preparing a new PR that will address the code comments. As this project is being taken over by a new PhD student with limited experience, it will take a little bit of time, but we plan to have this ready in 1-2 weeks so it will make it in time for pre9.
Thanks for the quick reply! Once that PR exists, we will ask to close this one here.
pinging :)
just another ping @JanFSchulte et al, let us know if there's any issue or progress.
Hi @jpata, sorry for the late reply. We are aiming for the new PR on Monday or Tuesday.
-1 The continuation PR has been opened as #32128; this can be closed by the author or the core team.
please close
This PR implements a TensorFlow based track classifier in RecoTracker/FinalTrackSelectors. It also changes the default behaviour when using Configuration/ProcessModifiers/trackdnn_cff. Since none of the standard workflows (as far as I am aware) use it, there are no changes expected in the output. The main purpose of this PR is to get the implementation into CMSSW to let other people continue developing and testing the TF based networks.
@mtosi @JanFSchulte
Description
The implementation is such that the tensorflow::GraphDef is shared with multiple modules as an ESProduct, while only one tensorflow::Session per module is initialized in the first call to produce() and then kept as a member variable of the edm::stream module, based on the discussion at #28912.
The preprocessing of the inputs happens inside the neural network, using custom layers to perform input scaling and transformations, to avoid the need to adjust CMSSW code too often. Only when the input variables (or their order) change should one need to adjust the code in RecoTracker/FinalTrackSelectors/plugins/TrackTfClassifier.cc.
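For illustration, the inference call inside the classifier could look roughly like the following sketch using the PhysicsTools/TensorFlow helpers; the function name, tensor layout, and the node names "x" and "Identity" are placeholders, not necessarily what TrackTfClassifier.cc actually uses:

    #include <vector>
    #include "PhysicsTools/TensorFlow/interface/TensorFlow.h"

    // session: a tensorflow::Session created from the GraphDef held in the ESProduct
    float trackScore(tensorflow::Session* session, const std::vector<float>& features) {
      const int nFeatures = static_cast<int>(features.size());
      tensorflow::Tensor input(tensorflow::DT_FLOAT, {1, nFeatures});
      for (int i = 0; i < nFeatures; ++i) {
        // raw track variables go in as-is; scaling/transformations are layers inside the graph
        input.matrix<float>()(0, i) = features[i];
      }
      std::vector<tensorflow::Tensor> outputs;
      tensorflow::run(session, {{"x", input}}, {"Identity"}, &outputs);
      return outputs[0].matrix<float>()(0, 0);
    }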
The selection thresholds for the different track qualities were collected into RecoTracker/IterativeTracking/python/dnnQualityCuts.py so they can be adjusted from one place for all the iterations. The different iterations read their DNN quality cut values from the same dictionary stored there.
A set of network weights for running the TrackTfClassifier can be found at:
cms-data/RecoTracker-FinalTrackSelectors#9
The 'frozen_graph.pb' file needs to be placed in the folder RecoTracker/FinalTrackSelectors/data in order to be found by the relevant code. The training was done using TF2, so a sufficiently recent CMSSW version is required to run it (11_X and onwards).
The classifier can be turned on by importing the trackdnn modifier in the reconstruction step:
    from Configuration.ProcessModifiers.trackdnn_cff import trackdnn

and adding the IOV for the TfGraphRecord in the process:

    process.tf_dummy_source = cms.ESSource("EmptyESSource",
        recordName = cms.string("TfGraphRecord"),
        firstValid = cms.vuint32(1),
        iovIsRunNotTime = cms.bool(True)
    )

PR validation:
The basic tests run with

    scram b runtests

and

    runTheMatrix.py -l limited -i all --ibeos

without issues. The standard workflow 10824.1 was tested with the above changes made to "step3" performing reconstruction, and expected changes were confirmed in the produced tracking validation plots.