
Macro to merge simulation files #95

Merged
merged 22 commits into master on Aug 20, 2023
Conversation

@lobis (Member) commented Mar 7, 2023

This is a simple macro to merge simulation files, which is convenient when working with the output of a batch system. When most simulations contain between 0 and 5 events, it is really costly to open each file individually in order to perform analysis; it becomes the bottleneck.

I implemented some required methods, such as the equality operator and the copy constructor. I think it would be a good idea to implement them for all metadata classes as a whole. It might be possible to do this automatically by implementing them in the base class, but I am not sure.
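As a rough illustration of the boilerplate involved, here is a minimal sketch (plain C++, with hypothetical member names, not the real TRestMetadata interface) of an equality operator and copy constructor for a simplified metadata class:

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified metadata class illustrating the equality
// operator and copy constructor discussed above. The real REST metadata
// classes (e.g. TRestGeant4Metadata) inherit from TRestMetadata and have
// many more members; this only sketches the pattern.
class SimMetadata {
   public:
    std::string fGenerator;       // e.g. generator type used in the simulation
    long long fNumberOfEvents{};  // primaries requested for this run

    SimMetadata() = default;

    // Copy constructor: member-wise copy is enough for plain data members.
    SimMetadata(const SimMetadata& other)
        : fGenerator(other.fGenerator), fNumberOfEvents(other.fNumberOfEvents) {}

    // Equality operator: two metadata objects are equal when all the
    // user-configurable members match.
    bool operator==(const SimMetadata& other) const {
        return fGenerator == other.fGenerator &&
               fNumberOfEvents == other.fNumberOfEvents;
    }
    bool operator!=(const SimMetadata& other) const { return !(*this == other); }
};
```

If such members were defined once in the base class, derived classes would only need to compare their own additional members, which is likely what "done automatically" would amount to in practice.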

Validation in rest-for-physics/restG4#101

@lobis lobis requested a review from a team March 7, 2023 16:57
@lobis lobis added the enhancement New feature or request label Mar 7, 2023
@lobis lobis marked this pull request as ready for review March 7, 2023 17:25
lobis added 2 commits March 7, 2023 21:13
@juanangp (Member) commented Mar 7, 2023

I have implemented a method to merge TRestRuns, please check rest-for-physics/framework#382

Perhaps you can use it out of the box...

@lobis (Member, Author) commented Mar 8, 2023

> I have implemented a method to merge TRestRuns, please check rest-for-physics/framework#382
>
> Perhaps you can use it out of the box...

Yes, I saw it, but the bulk of this change is the merging of the metadata structure. Currently I am doing a very rough merge of the TRestRun because I plan to include your changes for the TRestRun merge. Also, for simulations there are special cases, such as the geometry, which can only exist once in the root file.

@juanangp (Member) commented Mar 8, 2023

> Yes, I saw it, but the bulk of this change is the merging of the metadata structure. Currently I am doing a very rough merge of the TRestRun because I plan to include your changes for the TRestRun merge. Also, for simulations there are special cases, such as the geometry, which can only exist once in the root file.

In principle the TRestRunMerger should keep the metadata structure of the last file, while some variables, such as the start of the run, are updated. So I guess the geometry should still be there.

@lobis (Member, Author) commented Mar 8, 2023

> Yes, I saw it, but the bulk of this change is the merging of the metadata structure. Currently I am doing a very rough merge of the TRestRun because I plan to include your changes for the TRestRun merge. Also, for simulations there are special cases, such as the geometry, which can only exist once in the root file.
>
> In principle the TRestRunMerger should keep the metadata structure of the last file, while some variables, such as the start of the run, are updated. So I guess the geometry should still be there.

But in this case we require some processing of the metadata structure: for example, we want to add up the number of primary events across files, and also check (not done yet) that user settings such as generators match between files.

For metadata such as the physics lists, the TRestRunMerger should work fine, though.

So in my opinion this PR could be merged and later updated to use TRestRunMerger to merge all the other metadata structures except TRestGeant4Metadata, which needs specific logic. Not sure how easy this will be, though, as this PR took me quite a while to get right.
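The intended merge behaviour described above (accumulate event counts, refuse to merge when user settings differ) can be sketched as follows. The struct and function names are hypothetical, not the actual TRestGeant4Metadata API:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch of the metadata-merging logic discussed above:
// the number of primary events is summed across files, while
// user-configurable settings (here just the generator type) must match,
// otherwise the merge is refused.
struct SimMetadata {
    std::string generator;
    long long numberOfPrimaries{};
};

SimMetadata MergeMetadata(const std::vector<SimMetadata>& inputs) {
    if (inputs.empty()) throw std::invalid_argument("no input metadata");
    SimMetadata merged = inputs.front();
    for (size_t i = 1; i < inputs.size(); ++i) {
        // Settings that describe the simulation setup must be identical.
        if (inputs[i].generator != merged.generator)
            throw std::runtime_error("generator mismatch between input files");
        // Quantities that describe the amount of data are accumulated.
        merged.numberOfPrimaries += inputs[i].numberOfPrimaries;
    }
    return merged;
}
```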

@lobis lobis requested review from jgalan and juanangp March 8, 2023 08:38
@jgalan (Member) commented Mar 8, 2023

I think you should make simulations that produce at least 10 events, so that with 100 files you get 1000 events, which corresponds to roughly a 3% statistical error.

I don't like the idea of merging files; we should have tools that join data on the fly and produce datasets: https://sultan.unizar.es/rest/classTRestDataSet.html.

Even if you have to join 1000 files, those 1000 files keep the original production tracking (a few events in each file), while you get a TRestDataSet that allows quick access to all the data. You only have to do this once; it is like merging files, but in a clean way. You do not really merge: you compile data, and you know how the data compilation was generated (TRestDataSet metadata). This leads to full traceability with a persistent files database, rather than merged files containing replicated metadata information spread everywhere.

The dataset keeps metadata information on how it was generated. We do not need to spread/replicate metadata information everywhere. We just have a database with data+metadata, and datasets with different data compilations.

This will connect with https://github.com/rest-for-physics/framework/pull/355/files, where we have a class TRestExperiment that is initialised using a TRestDataSet.
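The "compile, don't merge" idea above can be illustrated with a toy model: a dataset object that stores the combined rows together with the list of source files, so the provenance of the compilation is preserved. This is not the real TRestDataSet API, only a sketch of the concept:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: a dataset holds the combined analysis rows plus
// metadata recording exactly which input files produced it, so the
// original production tracking is never lost.
struct AnalysisRow {
    double energy;  // example observable
};

class DataSetSketch {
   public:
    void AddFile(const std::string& fileName, const std::vector<AnalysisRow>& rows) {
        fSourceFiles.push_back(fileName);  // provenance: remember the origin
        fRows.insert(fRows.end(), rows.begin(), rows.end());
    }
    size_t GetNumberOfRows() const { return fRows.size(); }
    const std::vector<std::string>& GetSourceFiles() const { return fSourceFiles; }

   private:
    std::vector<std::string> fSourceFiles;  // how the compilation was generated
    std::vector<AnalysisRow> fRows;         // combined data for quick access
};
```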

@lobis (Member, Author) commented Mar 8, 2023

> I think you should make simulations that produce at least 10 events, so that with 100 files you get 1000 events, which corresponds to roughly a 3% statistical error.
>
> I don't like the idea of merging files; we should have tools that join data on the fly and produce datasets: https://sultan.unizar.es/rest/classTRestDataSet.html.
>
> Even if you have to join 1000 files, those 1000 files keep the original production tracking (a few events in each file), while you get a TRestDataSet that allows quick access to all the data. You only have to do this once; it is like merging files, but in a clean way. You do not really merge: you compile data, and you know how the data compilation was generated (TRestDataSet metadata). This leads to full traceability with a persistent files database, rather than merged files containing replicated metadata information spread everywhere.
>
> The dataset keeps metadata information on how it was generated. We do not need to spread/replicate metadata information everywhere. We just have a database with data+metadata, and datasets with different data compilations.
>
> This will connect with https://github.com/rest-for-physics/framework/pull/355/files, where we have a class TRestExperiment that is initialised using a TRestDataSet.

The problem with making simulations of at least 10 events each is that they take too many hours, which makes scheduling on the cluster a problem.

For example, if I submit simulations of 8 h duration, I can run maybe 50 concurrently. If they last 1 h, I can run 500 concurrently, effectively a 10x gain in simulation throughput; this is why I choose to have many smaller files.

The problem I see with the dataset (correct me if I am wrong) is that it still needs to open all the files. For example, if you have 10000 files (which may be the case for shielding contamination), this can take hours. Perhaps this is acceptable when the final analysis is done, but for trying out different analyses the merged files are much easier to work with.

In principle we could merge restG4 simulation files without loss of information; the only problem is that the seed would become meaningless (unless we store the seed information of all contributing files), but other than that the full information is preserved.
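The seed bookkeeping mentioned above could look something like this sketch, which records the seed of every contributing file in the merged output so each sub-run stays reproducible. All names here are hypothetical, not the real REST API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: instead of keeping a single (now meaningless) seed
// in the merged file, record the seed of every contributing input file.
struct InputRunInfo {
    std::string fileName;
    unsigned long seed;
};

class MergedRunInfo {
   public:
    void AddInput(const std::string& fileName, unsigned long seed) {
        fInputs.push_back({fileName, seed});
    }
    size_t GetNumberOfInputs() const { return fInputs.size(); }
    unsigned long GetSeed(size_t i) const { return fInputs.at(i).seed; }

   private:
    std::vector<InputRunInfo> fInputs;  // one entry per merged file
};
```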

@juanangp (Member) commented Mar 8, 2023

I personally think that being able to merge TRestRuns would at some point be very useful. For instance, in this case we just have a TRestG4Event tree while the analysis tree is empty, so we cannot use a TRestDataSet.

On the other hand, TRestDataSet just merges the analysis trees of different files, so I still don't see a clear advantage with respect to merging TRestRuns.

@jgalan (Member) commented Mar 10, 2023

> I personally think that being able to merge TRestRuns would at some point be very useful. For instance, in this case we just have a TRestG4Event tree while the analysis tree is empty, so we cannot use a TRestDataSet.
>
> On the other hand, TRestDataSet just merges the analysis trees of different files, so I still don't see a clear advantage with respect to merging TRestRuns.

In a TRestDataSet you detach many things. The dataset cannot be processed anymore, and it can be considered a final product of the data processing chain.

@jgalan (Member) commented Mar 10, 2023

> I personally think that being able to merge TRestRuns would at some point be very useful. For instance, in this case we just have a TRestG4Event tree while the analysis tree is empty, so we cannot use a TRestDataSet.

I don't understand, a REST processed file that contains a TRestRun entry will always have a TRestAnalysisTree.

@juanangp (Member)

> In a TRestDataSet you detach many things. The dataset cannot be processed anymore, and it can be considered a final product of the data processing chain.

If a TRestDataSet cannot be processed further, performance will suffer, since you would have to re-generate a TRestDataSet every time you analyse the data. For instance:

  • Calibrated dataset adding an energy observable (e.g. from ADC to keV).
  • Observable selection with energy cuts.
  • Event selection based on observable cuts.

I think we can perform the analysis of the data using just the datasets; otherwise, we would have to re-process all the TRestRuns using the output of the dataset, adding the corresponding observables to them with a TRestEventProcess, which is much less efficient. Since I cannot see any stopper to performing the analysis over the dataset, I am wondering why you don't want to do this. Otherwise, I don't see the advantage of using datasets, because we would spend most of the computation time generating them.

I think we should decide on the analysis workflow first, before adding any constraints.

@jgalan (Member) commented Mar 10, 2023

> If a TRestDataSet cannot be processed further, performance will suffer, since you would have to re-generate a TRestDataSet every time you analyse the data. For instance:
>
>   • Calibrated dataset adding an energy observable (e.g. from ADC to keV).

You will add an ADC-to-keV conversion through a dedicated TRestEventProcess (e.g. TRestCalibrationProcess or TRestGainMapCalibrationProcess). Then, from the processed files, you will generate a TRestDataSet that contains the observable produced by TRestXYZCalibrationProcess.

>   • Observable selection with energy cuts.
>   • Event selection based on observable cuts.

> I think we can perform the analysis of the data using just the datasets; otherwise, we would have to re-process all the TRestRuns using the output of the dataset, adding the corresponding observables to them with a TRestEventProcess, which is much less efficient. Since I cannot see any stopper to performing the analysis over the dataset, I am wondering why you don't want to do this. Otherwise, I don't see the advantage of using datasets, because we would spend most of the computation time generating them.
>
> I think we should decide on the analysis workflow first, before adding any constraints.

We should discuss this carefully; we must distinguish between an event data processing chain and an analysis. You generate datasets when you need to produce a data release that can be analysed. Such a release has already gone through all the data processing stages.

@jgalan (Member) commented Mar 10, 2023

> Since I cannot see any stopper to perform the analysis over the dataset, I am wondering why you don't want to do this.

There is some misunderstanding here (I don't see what it is that I supposedly don't want to do). Yes, a dataset is for doing the final analysis! That's why it gives access to a combined TTree or RDataFrame: you can use that analysis data as you want, filter data, combine columns, etc.

However, the ADC-to-keV correction should be done in the event data processing chain, because you might need access to, for example, a gain correction map and a readout, and thus to event-level information.

Such low-level information (readout, gain maps, calibration curves) will not be available to the final end user of a dataset. We are separating raw data from high-level data for final analysis.

@jgalan (Member) commented May 5, 2023

The issue with files containing a single event should have been solved in rest-for-physics/framework#393.

I think using TRestDataSet to merge final results would be the best policy.

@jgalan (Member) commented May 5, 2023

On the other hand, pipelines should succeed once we merge this rest-for-physics/framework#409

@lobis lobis requested a review from DavidDiezIb August 19, 2023 18:03
@lobis (Member, Author) commented Aug 20, 2023

I will merge this PR.

I understand this is a somewhat controversial topic, but I believe the option to merge simulation files is really useful, and the same effect cannot be achieved with other tools such as TRestDataSet. This is especially true when running jobs on a cluster, since you can automatically configure the jobs to produce a single root file instead of hundreds of thousands of smaller files. In my experience, just the overhead of opening and closing so many files is too much, with no real benefit.

At the end of the day, this is just an optional feature.

@lobis lobis merged commit c79d9ed into master Aug 20, 2023
@lobis lobis deleted the lobis-merge branch August 20, 2023 13:14