CMS - 2017 MC data-curation scripts #243

Create a new directory `cms-2016-simulated-datasets` with:

### inputs

- [x] get the input files, Mini and Nano separately, with:

      
      dasgoclient -query="dataset=/*/RunIISummer20UL17*MiniAOD*v2-106X*/MINIAODSIM"  > inputs/CMS-2017-mc-mini-datasets.txt
      dasgoclient -query="dataset=/*/RunIISummer20UL17*NanoAOD*v9-106X*/NANOAODSIM"  > inputs/CMS-2017-mc-nano-datasets.txt
      

- [x] an empty placeholder file `doi-sim.txt`
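
Before running the chain, the two dataset lists can be sanity-checked. The following is a minimal sketch, assuming one dataset name per line in the files written by the `dasgoclient` queries above; `parse_dataset_name` and `load_dataset_list` are hypothetical helpers, not part of the existing scripts.

```python
from pathlib import Path
from typing import List, NamedTuple


class DatasetName(NamedTuple):
    primary: str
    processed: str
    tier: str


def parse_dataset_name(name: str) -> DatasetName:
    """Split '/primary/processed/TIER' into its three components."""
    parts = name.strip().split("/")
    if len(parts) != 4 or parts[0] != "":
        raise ValueError(f"not a dataset name: {name!r}")
    return DatasetName(parts[1], parts[2], parts[3])


def load_dataset_list(path: str, expected_tier: str) -> List[str]:
    """Read one dataset name per line, skip blank lines, check the data tier."""
    datasets = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if parse_dataset_name(line).tier != expected_tier:
            raise ValueError(f"{line} is not a {expected_tier} dataset")
        datasets.append(line)
    return datasets
```

For example, `load_dataset_list("inputs/CMS-2017-mc-mini-datasets.txt", "MINIAODSIM")` would return the Mini list, or fail early if a Nano name slipped into it.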

### code

- [x] copy all `*.py` files from the cms-2016-sim-test branch of [cms-2016-sim-test/cms-2016-simulated-datasets/code](https://github.com/cernopendata/data-curation/tree/cms-2016-sim-test/cms-2016-simulated-datasets/code)
- [x] make threading configurable in every `*.py` script that uses it
- [x] modify `lhe_generators.py` so that it is integrated with `interface.py` in a similar way to the other scripts
- [x] test the chain with a small number of input files (with option `--ignore-eos-store`) when needed
- [x] modify the scripts to take into account #239: it applies to the `outputs/docs` and `lhe_generators` directories, but confirm with Tibor or Pablo (no need to modify `outputs/records`, as it will be aggregated with [aggregate_dataset_records.py](https://github.com/cernopendata/data-curation/blob/cms-2016-sim-test/cms-2016-simulated-datasets/code/aggregate_dataset_records.py))
    - see https://github.com/cernopendata/data-curation/blob/cms-2016-sim-test/cms-2016-simulated-datasets/code/dataset_records.py#L317-L318
- [x] test the timing of the different steps (select datasets that have the LHE step, see `mcm_store`)
   - note, for example, that the provenance information is the same for Nano and Mini datasets, but [dataset_records.py](https://github.com/cernopendata/data-curation/blob/cms-2016-sim-test/cms-2016-simulated-datasets/code/dataset_records.py) builds it twice, once for Nano and once for Mini
- [x] currently the scripts expect every Mini dataset to have a corresponding Nano. Take care of the cases where a Mini dataset has no Nano. It is probably best to build a Mini-Nano correspondence map at an early stage so that the lookup does not need to be repeated every time. You could store it in `inputs` (similar to `recid_info.py`).
   - the mcm cache is built only for Nano; it could also be built for those Mini datasets that have no Nano
   - [this](https://github.com/cernopendata/data-curation/blob/cms-2016-sim-test/cms-2016-simulated-datasets/code/dataset_records.py#L299) could be done once only at the very beginning of the script chain
- [x] The methodology (`get_all_generator_text`) is the time-consuming part of the dataset record building, and it is exactly the same for Mini and Nano. Check whether the related Nano dataset record has already been built and, if so, take the full methodology from there.
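
The methodology reuse described in the last item could be sketched as a small cache keyed on the Nano dataset. This is only an illustration: `methodology_for`, the `mini_to_nano` mapping, and the `build` callable (a stand-in for whatever wraps `get_all_generator_text`, whose real signature is not shown here) are all assumptions.

```python
from typing import Callable, Dict, Optional


def methodology_for(
    dataset: str,
    mini_to_nano: Dict[str, Optional[str]],
    cache: Dict[str, str],
    build: Callable[[str], str],
) -> str:
    """Return the methodology text, building it at most once per Mini/Nano pair."""
    # A Mini dataset shares its cache key with its Nano counterpart, if it has one.
    # Nano datasets, and Mini datasets without a Nano, are keyed on their own name.
    key = mini_to_nano.get(dataset) or dataset
    if key not in cache:
        # Expensive step: only reached for the first dataset of each pair.
        cache[key] = build(dataset)
    return cache[key]
```

Whichever of the two records is built first pays the cost; the second one reads the text back from `cache`. The same pattern might apply to the provenance information noted in the timing item.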
  
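The Mini-Nano correspondence map suggested in the checklist could be built once and stored under `inputs`. The sketch below uses a purely name-based heuristic: it assumes that a Mini and its Nano share the primary dataset name and differ only in the `MiniAODvN` / `NanoAODvN` token and the trailing `-vN` processing version. That assumption may not hold for every dataset, and the real correspondence may need to come from the dataset parentage instead. The function names and the output file name are hypothetical.

```python
import json
import re
from typing import Dict, Iterable, Optional

# Assumed naming pattern: '...MiniAODv2-<conditions>-v1' / '...NanoAODv9-<conditions>-v1'
_TIER_TOKEN = re.compile(r"(Mini|Nano)AODv\d+")
_PROC_VERSION = re.compile(r"-v\d+$")


def match_key(dataset: str) -> str:
    """Reduce a dataset name to the part assumed common to Mini and Nano."""
    _, primary, processed, _tier = dataset.split("/")
    processed = _PROC_VERSION.sub("", _TIER_TOKEN.sub("AOD", processed))
    return f"{primary}/{processed}"


def build_mini_nano_map(
    minis: Iterable[str], nanos: Iterable[str]
) -> Dict[str, Optional[str]]:
    """Map each Mini dataset to its Nano counterpart, or None if there is none."""
    nano_by_key: Dict[str, str] = {}
    for nano in nanos:
        # If several Nano datasets share a key, only the first one is kept.
        nano_by_key.setdefault(match_key(nano), nano)
    return {mini: nano_by_key.get(match_key(mini)) for mini in minis}


def save_map(mapping: Dict[str, Optional[str]], path: str) -> None:
    """Store the map as JSON; Mini datasets without a Nano are written as null."""
    with open(path, "w") as fh:
        json.dump(mapping, fh, indent=2, sort_keys=True)
```

A call such as `save_map(build_mini_nano_map(minis, nanos), "inputs/mini_nano_map.json")` at the start of the chain would let the later scripts read the map instead of recomputing it, and the `None` entries identify the Mini datasets that need their own mcm cache.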

### README.md

- [x] Update `README.md` to describe the configurable threading
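
As an illustration of what the README could document, configurable threading might take the following shape. This is a sketch under assumptions: the `--threads` option name and both helper functions are hypothetical and do not reflect how `interface.py` actually exposes its options.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def add_threads_option(parser: argparse.ArgumentParser) -> None:
    """Register a hypothetical --threads option; 1 means run sequentially."""
    parser.add_argument(
        "--threads", type=int, default=1,
        help="number of worker threads (default: 1, no threading)",
    )


def run_in_threads(func: Callable[[T], R], items: Iterable[T], threads: int) -> List[R]:
    """Apply func to every item, using a thread pool when threads > 1."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if threads == 1:
        # Sequential path, convenient for debugging and small test runs.
        return [func(item) for item in items]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in the same order as the input items.
        return list(pool.map(func, items))
```

Keeping a sequential path for `--threads 1` makes it easier to test the chain on a handful of input datasets and to compare timings between steps.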


