Macro to merge simulation files #95
I have implemented a method to merge TRestRuns, please check rest-for-physics/framework#382. Perhaps you can use it out of the box...
Yes, I saw it, but the bulk of this change is the merging of the metadata structure. Currently I am doing a very rough merge of the TRestRun because I plan to include your changes for merging TRestRun. Also, for simulations there are special cases, such as the geometry, which can only exist once in the ROOT file, etc.
In principle the …
But in this case we require some processing of the metadata structure: for example, we want to sum the number of primary events across files, and also check (not done yet) that user settings such as the generator match between files. For metadata such as the physics lists the … So in my opinion this PR could be merged and later updated to use …
I think you should make simulations that produce at least 10 events, so that with 100 files you get 1000 events, which is about a 3% statistical error.

I don't like the idea of merging files; we should have tools that join data on the fly and produce datasets (https://sultan.unizar.es/rest/classTRestDataSet.html). Even if you have to join 1000 files, the 1000 files keep the original production tracking (few events in each file), while now you get a …

The dataset keeps metadata information on how the dataset was generated. We do not need to spread/replicate metadata information everywhere. We just have a database with data+metadata, and datasets with different data compilations. This will connect with https://github.com/rest-for-physics/framework/pull/355/files where we have a class …
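For reference, the quoted ~3% follows from Poisson counting statistics, where the relative uncertainty on a count of $N$ events is $1/\sqrt{N}$:

$$\frac{\sigma_N}{N} = \frac{\sqrt{N}}{N} = \frac{1}{\sqrt{N}} = \frac{1}{\sqrt{1000}} \approx 3.2\%$$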
The problem with making simulations with 10 events each is that they take too many hours, which makes scheduling in the cluster a problem. For example, if I make simulations of 8 h duration I can run maybe 50 simulations concurrently; if I make them 1 h long, I can run 500 concurrently, effectively a 10x in simulation speed. This is why I choose to have many smaller files.

The problem I see with the dataset (correct me if I am wrong) is that it still needs to open all files. For example, if you have 10000 files (which may be the case for shielding contamination, for example), this can take hours. Perhaps this is acceptable for the final analysis, but in order to try different analyses the merged files are much easier to work with.

In principle we could merge restG4 simulation files without loss of information; the only problem is that the seed will then be meaningless (unless we store the seed information for all contributing files), but other than that the full information is present.
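To make the scheduling argument concrete with the numbers quoted above: each core produces events at the same rate regardless of job length, so the event throughput scales with the number of concurrently running jobs:

$$\frac{R_{1\,\mathrm{h}}}{R_{8\,\mathrm{h}}} = \frac{500\ \text{concurrent jobs}}{50\ \text{concurrent jobs}} = 10$$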
I personally think that being able to merge TRestRuns at some point would be very useful. For instance, in this case we just have a … On the other hand, …
In a …
I don't understand: a REST processed file that contains a …
If a …
I think we can perform the analysis of the data just using the datasets; otherwise we should re-process all the … I think we should decide on the analysis workflow first, before adding any constraint.
You will add an ADC to keV conversion through a dedicated …
We should discuss this carefully; we must distinguish between an event data processing chain and an analysis. You generate datasets when you need to produce a data release that can be analysed. Such a data release has already gone through all the data processing stages.
There are some sources of misunderstanding here (I don't understand what it is I supposedly don't want to do). Yes, a dataset is for doing the final analysis! That's why it gives access to a combined TTree or RDataFrame; you can use that analysis data as you want: filter data, combine columns, etc. However, the ADC to keV correction should be done in the event data processing chain, because you might need access to a gain correction map and a readout, for example, and thus you may need to access event information. That level of information (readout, gain maps, calibration curves) will not be available to the final end user of a dataset. We are separating here between raw data and high-level data for final analysis.
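As an illustration of the dataset-style analysis described above, here is a minimal sketch using ROOT's RDataFrame; the file, tree, and column names ("dataset.root", "AnalysisTree", "signalCount", "energy") are placeholders for illustration, not the actual REST conventions.

```cpp
// Minimal sketch of a final analysis on a combined tree via RDataFrame.
#include <ROOT/RDataFrame.hxx>

void AnalyseDataset() {
    // Placeholder tree and file names
    ROOT::RDataFrame df("AnalysisTree", "dataset.root");

    // Filter rows and derive a new column, then book a histogram
    auto h = df.Filter("signalCount > 0")
               .Define("energyKeV", "energy * 1000.")
               .Histo1D("energyKeV");

    h->Draw(); // the lazy actions execute here
}
```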
The issue with files with a single event should have been solved here: rest-for-physics/framework#393. I think using …
On the other hand, pipelines should succeed once we merge rest-for-physics/framework#409.
I will merge this PR. I understand this is a bit of a controversial topic, but I believe the possibility of merging simulation files is really useful, and the same effect cannot be achieved with other tools such as TRestDataSet. At the end of the day this is just an optional feature.
This is a simple macro to merge simulation files, which is convenient when working with the output of a batch system. When most simulations have between 0 and 5 events, it is really costly to open each one in order to perform an analysis; it's the bottleneck.
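For context, plain ROOT can already concatenate files via the hadd utility or TFileMerger; a minimal sketch (with hypothetical file names) follows. What that generic merge cannot do, and what this macro adds on top, is combining the REST metadata structures consistently (summing primary-event counts, keeping a single geometry, etc.).

```cpp
// Generic ROOT file merging with TFileMerger; NOT the macro in this PR.
// File names are hypothetical. This merges trees and histograms but is
// unaware of the REST metadata, which the macro handles separately.
#include "TFileMerger.h"

void MergeRootFiles() {
    TFileMerger merger;
    merger.OutputFile("merged.root");
    merger.AddFile("run_00001.root");
    merger.AddFile("run_00002.root");
    merger.Merge();
}
```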
I implemented some required methods, such as the equality operator and the copy constructor. I think it would be a good idea to implement them for all metadata classes as a whole. It might be possible to do this automatically by implementing them in the base class, but I am not sure.
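As a sketch of the pattern being discussed (class and member names are illustrative only, not the actual REST metadata interface):

```cpp
#include <string>

// Hypothetical metadata-like class; names are illustrative only.
class MySimMetadata {
public:
    MySimMetadata() = default;

    // Copy constructor: a member-wise copy suffices when the class
    // owns no raw resources.
    MySimMetadata(const MySimMetadata& other) = default;

    // Equality operator comparing user-defined settings, so a merge
    // can verify that two files share the same configuration.
    bool operator==(const MySimMetadata& other) const {
        return fGenerator == other.fGenerator &&
               fPhysicsLists == other.fPhysicsLists;
    }

private:
    std::string fGenerator;
    std::string fPhysicsLists;
};
```

Regarding the base-class idea: a base-class operator== cannot automatically compare derived-class members, so a fully automatic solution is not straightforward in C++; since C++20 each class can at least declare `bool operator==(const MySimMetadata&) const = default;` to get a member-wise comparison generated per class.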
Validation in rest-for-physics/restG4#101