
Macro to merge simulation files #95

Merged
merged 22 commits into master on Aug 20, 2023
Conversation

@lobis (Member) commented Mar 7, 2023

This is a simple macro to merge simulation files, which is convenient when working with the output of a batch system. When most simulations contain between 0 and 5 events, it is really costly to open each file individually in order to perform analysis; it becomes the bottleneck.

I implemented some required methods, such as the equality operator and the copy constructor. I think it would be a good idea to implement them for all metadata classes as a whole. It might be possible to do this automatically by implementing them in the base class, but I am not sure.
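As a rough illustration of the boilerplate involved, here is a minimal sketch (plain C++, with hypothetical member names, not the real TRestMetadata interface) of an equality operator and copy constructor for a simplified metadata class:

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified metadata class illustrating the equality
// operator and copy constructor discussed above. The real REST metadata
// classes (e.g. TRestGeant4Metadata) inherit from TRestMetadata and have
// many more members; this only sketches the pattern.
class SimMetadata {
   public:
    std::string fGenerator;       // e.g. generator type used in the simulation
    long long fNumberOfEvents{};  // primaries requested for this run

    SimMetadata() = default;

    // Copy constructor: member-wise copy is enough for plain data members.
    SimMetadata(const SimMetadata& other)
        : fGenerator(other.fGenerator), fNumberOfEvents(other.fNumberOfEvents) {}

    // Equality operator: two metadata objects are equal when all the
    // user-configurable members match.
    bool operator==(const SimMetadata& other) const {
        return fGenerator == other.fGenerator &&
               fNumberOfEvents == other.fNumberOfEvents;
    }
    bool operator!=(const SimMetadata& other) const { return !(*this == other); }
};
```

If such members were defined once in the base class, derived classes would only need to compare their own additional members, which is likely what "done automatically" would amount to in practice.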

Validation in rest-for-physics/restG4#101

@lobis lobis requested a review from a team March 7, 2023 16:57
@lobis lobis added the enhancement New feature or request label Mar 7, 2023
@lobis lobis marked this pull request as ready for review March 7, 2023 17:25
lobis added 2 commits March 7, 2023 21:13
@juanangp (Member) commented Mar 7, 2023

I have implemented a method to merge TRestRuns, please check rest-for-physics/framework#382

Perhaps you can use it out of the box...

@lobis (Member, Author) commented Mar 8, 2023

> I have implemented a method to merge TRestRuns, please check rest-for-physics/framework#382
>
> Perhaps you can use it out of the box...

Yes, I saw it, but the bulk of this change is the merging of the metadata structure. Currently I am doing a very rough merge of the TRestRun because I plan to include your changes for the TRestRun merge. Also, for simulations there are special cases, such as the geometry, which can only exist once in the root file.

@juanangp (Member) commented Mar 8, 2023

> Yes, I saw it, but the bulk of this change is the merging of the metadata structure. Currently I am doing a very rough merge of the TRestRun because I plan to include your changes for the TRestRun merge. Also, for simulations there are special cases, such as the geometry, which can only exist once in the root file.

In principle the TRestRunMerger should keep the metadata structure of the last file, while some variables, such as the start of the run, are updated. So I guess the geometry should still be there.

@lobis (Member, Author) commented Mar 8, 2023

> Yes, I saw it, but the bulk of this change is the merging of the metadata structure. Currently I am doing a very rough merge of the TRestRun because I plan to include your changes for the TRestRun merge. Also, for simulations there are special cases, such as the geometry, which can only exist once in the root file.
>
> In principle the TRestRunMerger should keep the metadata structure of the last file, while some variables, such as the start of the run, are updated. So I guess the geometry should still be there.

But in this case we require some processing of the metadata structure: for example, we want to add up the number of primary events across files, and also check (not done yet) that user settings such as generators match between files.

For metadata such as the physics lists, the TRestRunMerger should work fine, though.

So in my opinion this PR could be merged and later updated to use TRestRunMerger to merge all the other metadata structures except TRestGeant4Metadata, which needs specific logic. Not sure how easy this will be, though, as this PR took me quite a while to get right.
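The intended merge behaviour described above (accumulate event counts, refuse to merge when user settings differ) can be sketched as follows. The struct and function names are hypothetical, not the actual TRestGeant4Metadata API:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch of the metadata-merging logic discussed above:
// the number of primary events is summed across files, while
// user-configurable settings (here just the generator type) must match,
// otherwise the merge is refused.
struct SimMetadata {
    std::string generator;
    long long numberOfPrimaries{};
};

SimMetadata MergeMetadata(const std::vector<SimMetadata>& inputs) {
    if (inputs.empty()) throw std::invalid_argument("no input metadata");
    SimMetadata merged = inputs.front();
    for (size_t i = 1; i < inputs.size(); ++i) {
        // Settings that describe the simulation setup must be identical.
        if (inputs[i].generator != merged.generator)
            throw std::runtime_error("generator mismatch between input files");
        // Quantities that describe the amount of data are accumulated.
        merged.numberOfPrimaries += inputs[i].numberOfPrimaries;
    }
    return merged;
}
```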

@lobis lobis requested review from jgalan and juanangp March 8, 2023 08:38
@jgalan (Member) commented Mar 8, 2023

I think you should make simulations that produce at least 10 events, so that with 100 files you get 1000 events, which corresponds to roughly a 3% statistical error.

I don't like the idea of merging files; we should have tools that join data on the fly and produce datasets: https://sultan.unizar.es/rest/classTRestDataSet.html.

Even if you have to join 1000 files, those 1000 files keep the original production tracking (a few events in each file), while you get a TRestDataSet that allows quick access to all the data. You only have to do this once; it is like merging files, but in a clean way. You do not really merge: you compile data, and you know how the data compilation was generated (TRestDataSet metadata). This leads to full traceability with a persistent files database, rather than merged files containing replicated metadata information spread everywhere.

The dataset keeps metadata information on how it was generated. We do not need to spread/replicate metadata information everywhere. We just have a database with data+metadata, and datasets with different data compilations.

This will connect with https://github.com/rest-for-physics/framework/pull/355/files, where we have a class TRestExperiment that is initialised using a TRestDataSet.
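The "compile, don't merge" idea above can be illustrated with a toy model: a dataset object that stores the combined rows together with the list of source files, so the provenance of the compilation is preserved. This is not the real TRestDataSet API, only a sketch of the concept:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: a dataset holds the combined analysis rows plus
// metadata recording exactly which input files produced it, so the
// original production tracking is never lost.
struct AnalysisRow {
    double energy;  // example observable
};

class DataSetSketch {
   public:
    void AddFile(const std::string& fileName, const std::vector<AnalysisRow>& rows) {
        fSourceFiles.push_back(fileName);  // provenance: remember the origin
        fRows.insert(fRows.end(), rows.begin(), rows.end());
    }
    size_t GetNumberOfRows() const { return fRows.size(); }
    const std::vector<std::string>& GetSourceFiles() const { return fSourceFiles; }

   private:
    std::vector<std::string> fSourceFiles;  // how the compilation was generated
    std::vector<AnalysisRow> fRows;         // combined data for quick access
};
```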

@lobis (Member, Author) commented Mar 8, 2023

> I think you should make simulations that produce at least 10 events, so that with 100 files you get 1000 events, which corresponds to roughly a 3% statistical error.
>
> I don't like the idea of merging files; we should have tools that join data on the fly and produce datasets: https://sultan.unizar.es/rest/classTRestDataSet.html.
>
> Even if you have to join 1000 files, those 1000 files keep the original production tracking (a few events in each file), while you get a TRestDataSet that allows quick access to all the data. You only have to do this once; it is like merging files, but in a clean way. You do not really merge: you compile data, and you know how the data compilation was generated (TRestDataSet metadata). This leads to full traceability with a persistent files database, rather than merged files containing replicated metadata information spread everywhere.
>
> The dataset keeps metadata information on how it was generated. We do not need to spread/replicate metadata information everywhere. We just have a database with data+metadata, and datasets with different data compilations.
>
> This will connect with https://github.com/rest-for-physics/framework/pull/355/files, where we have a class TRestExperiment that is initialised using a TRestDataSet.

The problem with making simulations of at least 10 events each is that they take too many hours, which makes scheduling on the cluster a problem.

For example, if I submit simulations of 8 h duration, I can run maybe 50 concurrently. If they last 1 h, I can run 500 concurrently, effectively a 10x gain in simulation throughput; this is why I choose to have many smaller files.

The problem I see with the dataset (correct me if I am wrong) is that it still needs to open all the files. For example, if you have 10000 files (which may be the case for shielding contamination), this can take hours. Perhaps this is acceptable when the final analysis is done, but for trying out different analyses the merged files are much easier to work with.

In principle we could merge restG4 simulation files without loss of information; the only problem is that the seed would become meaningless (unless we store the seed information of all contributing files), but other than that the full information is preserved.
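The seed bookkeeping mentioned above could look something like this sketch, which records the seed of every contributing file in the merged output so each sub-run stays reproducible. All names here are hypothetical, not the real REST API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: instead of keeping a single (now meaningless) seed
// in the merged file, record the seed of every contributing input file.
struct InputRunInfo {
    std::string fileName;
    unsigned long seed;
};

class MergedRunInfo {
   public:
    void AddInput(const std::string& fileName, unsigned long seed) {
        fInputs.push_back({fileName, seed});
    }
    size_t GetNumberOfInputs() const { return fInputs.size(); }
    unsigned long GetSeed(size_t i) const { return fInputs.at(i).seed; }

   private:
    std::vector<InputRunInfo> fInputs;  // one entry per merged file
};
```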

@juanangp (Member) commented Mar 8, 2023

I personally think that being able to merge TRestRuns would at some point be very useful. For instance, in this case we just have a TRestG4Event tree while the analysis tree is empty, so we cannot use a TRestDataSet.

On the other hand, TRestDataSet just merges the analysis trees of different files, so I still don't see a clear advantage with respect to merging TRestRuns.

@jgalan (Member) commented Mar 10, 2023

> I personally think that being able to merge TRestRuns would at some point be very useful. For instance, in this case we just have a TRestG4Event tree while the analysis tree is empty, so we cannot use a TRestDataSet.
>
> On the other hand, TRestDataSet just merges the analysis trees of different files, so I still don't see a clear advantage with respect to merging TRestRuns.

In a TRestDataSet you detach many things. The dataset cannot be processed anymore, and it can be considered a final product of the data processing chain.

@jgalan (Member) commented Mar 10, 2023

> I personally think that being able to merge TRestRuns would at some point be very useful. For instance, in this case we just have a TRestG4Event tree while the analysis tree is empty, so we cannot use a TRestDataSet.

I don't understand, a REST processed file that contains a TRestRun entry will always have a TRestAnalysisTree.

@juanangp (Member)

> In a TRestDataSet you detach many things. The dataset cannot be processed anymore, and it can be considered a final product of the data processing chain.

If a TRestDataSet cannot be processed further, performance will suffer, since you would have to re-generate a TRestDataSet every time you analyse the data. For instance:

  • Calibrated dataset adding an energy observable (e.g. from ADC to keV).
  • Observable selection with energy cuts.
  • Event selection based on observable cuts.

I think we can perform the analysis of the data using just the datasets; otherwise, we would have to re-process all the TRestRuns using the output of the dataset, adding the corresponding observables to them with a TRestEventProcess, which is much less efficient. Since I cannot see any stopper to performing the analysis over the dataset, I am wondering why you don't want to do this. Otherwise, I don't see the advantage of using datasets, because we would spend most of the computation time generating them.

I think we should decide on the analysis workflow first, before adding any constraints.

@jgalan (Member) commented Mar 10, 2023

> If a TRestDataSet cannot be processed further, performance will suffer, since you would have to re-generate a TRestDataSet every time you analyse the data. For instance:
>
>   • Calibrated dataset adding an energy observable (e.g. from ADC to keV).

You will add an ADC-to-keV conversion through a dedicated TRestEventProcess (e.g. TRestCalibrationProcess or TRestGainMapCalibrationProcess). Then, from the processed files, you will generate a TRestDataSet that contains the observable produced by TRestXYZCalibrationProcess.

>   • Observable selection with energy cuts.
>   • Event selection based on observable cuts.

> I think we can perform the analysis of the data using just the datasets; otherwise, we would have to re-process all the TRestRuns using the output of the dataset, adding the corresponding observables to them with a TRestEventProcess, which is much less efficient. Since I cannot see any stopper to performing the analysis over the dataset, I am wondering why you don't want to do this. Otherwise, I don't see the advantage of using datasets, because we would spend most of the computation time generating them.
>
> I think we should decide on the analysis workflow first, before adding any constraints.

We should discuss this carefully; we must distinguish between an event data processing chain and an analysis. You generate datasets when you need to produce a data release that can be analysed. Such a release has already gone through all the data processing stages.

@jgalan (Member) commented Mar 10, 2023

> Since I cannot see any stopper to perform the analysis over the dataset, I am wondering why you don't want to do this.

There is some misunderstanding here (I don't see what it is that I supposedly don't want to do). Yes, a dataset is for doing the final analysis! That's why it gives access to a combined TTree or RDataFrame: you can use that analysis data as you want, filter data, combine columns, etc.

However, the ADC-to-keV correction should be done in the event data processing chain, because you might need access to, for example, a gain correction map and a readout, and thus to event-level information.

Such low-level information (readout, gain maps, calibration curves) will not be available to the final end user of a dataset. We are separating raw data from high-level data for final analysis.

@jgalan (Member) commented May 5, 2023

The issue with files containing a single event should have been solved in rest-for-physics/framework#393.

I think using TRestDataSet to merge final results would be the best policy.

@jgalan (Member) commented May 5, 2023

On the other hand, pipelines should succeed once we merge this rest-for-physics/framework#409

@lobis lobis requested a review from DavidDiezIb August 19, 2023 18:03
@lobis (Member, Author) commented Aug 20, 2023

I will merge this PR.

I understand this is a somewhat controversial topic, but I believe the option to merge simulation files is really useful, and the same effect cannot be achieved with other tools such as TRestDataSet. This is especially true when running jobs on a cluster, since you can automatically configure the jobs to produce a single root file instead of hundreds of thousands of smaller files. In my experience, just the overhead of opening and closing so many files is too much, with no real benefit.

At the end of the day, this is just an optional feature.

@lobis lobis merged commit c79d9ed into master Aug 20, 2023
@lobis lobis deleted the lobis-merge branch August 20, 2023 13:14