CMS - 2017 MC data-curation scripts #243

Open
@katilp

Create a new directory cms-2017-simulated-datasets with

inputs

  • get the input files, Mini and Nano separately with

    dasgoclient -query="dataset=/*/RunIISummer20UL17*MiniAOD*v2-106X*/MINIAODSIM"  > inputs/CMS-2017-mc-mini-datasets.txt
    dasgoclient -query="dataset=/*/RunIISummer20UL17*NanoAOD*v9-106X*/NANOAODSIM"  > inputs/CMS-2017-mc-nano-datasets.txt
    
  • an empty placeholder file doi-sim.txt

code

  • copy all *.py files from the cms-2016-sim-test branch of cms-2016-sim-test/cms-2016-simulated-datasets/code
  • make threading configurable in all *.py scripts in which it is present
  • modify lhe_generators.py so that it is integrated with interface.py in a similar way as the other scripts
  • test the chain with a small number of input files (with option --ignore-eos-store) when needed
  • modify the scripts to take into account issue #239 (CMS - avoid single directories with too many entries): it applies to the outputs/docs and lhe_generators directories, but confirm with Tibor or Pablo (no need to modify outputs/records, as it will be aggregated with aggregate_dataset_records.py)
  • test the timing of the different steps (select datasets that have the LHE step; see mcm_store)
    • note, for example, that the provenance information is the same for Nano and Mini datasets, but dataset_records.py builds that same information twice, once for Nano and then again for Mini
  • currently the scripts expect every Mini dataset to have a corresponding Nano dataset. Take care of the cases where a Mini dataset does not have a Nano. It is probably best to build a Mini-Nano correspondence map at an early stage so that it does not need to be rebuilt every time. You could store it in inputs (similar to recid_info.py).
    • the mcm cache is built only for Nano; it could also be built for those Mini datasets that do not have a Nano
    • this could be done only once, at the very beginning of the script chain
  • The methodology (get_all_generator_text) is the time-consuming part of the dataset record building, and it is exactly the same for Mini and Nano. Check whether the related Nano dataset record has already been built and, if so, take the full methodology from there.
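
The Mini-Nano correspondence map and the reuse of the methodology text described in the last items could be sketched roughly as follows. This is only an illustration under stated assumptions, not the existing code: the function names, the JSON storage format and the `get_parent` and `build_text` callables are all hypothetical. `get_parent` stands in for whatever lookup gives the parent Mini dataset of a Nano dataset, and `build_text` stands in for the expensive methodology step.

```python
import json
from pathlib import Path


def build_mini_nano_map(mini_datasets, nano_datasets, get_parent):
    """Map every Mini dataset to its Nano dataset, or to None if it has none.

    get_parent is a hypothetical callable returning the parent Mini dataset
    name of a Nano dataset (for example from a cached parent lookup).
    """
    mini_to_nano = {mini: None for mini in mini_datasets}
    for nano in nano_datasets:
        parent = get_parent(nano)
        if parent in mini_to_nano:
            mini_to_nano[parent] = nano
    return mini_to_nano


def save_map(mini_to_nano, path):
    """Write the map as JSON so that later steps can reuse it."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(mini_to_nano, indent=2, sort_keys=True))


def load_map(path):
    """Read a map previously written by save_map."""
    return json.loads(Path(path).read_text())


def minis_without_nano(mini_to_nano):
    """Return the Mini datasets that have no Nano counterpart."""
    return sorted(mini for mini, nano in mini_to_nano.items() if nano is None)


def shared_methodology(dataset, mini_to_nano, cache, build_text):
    """Build the methodology text once per Mini/Nano pair and reuse it.

    The cache is keyed by the Mini name; a Nano dataset is resolved to its
    Mini first, so both members of a pair hit the same cache entry.
    """
    nano_to_mini = {nano: mini for mini, nano in mini_to_nano.items() if nano}
    key = nano_to_mini.get(dataset, dataset)
    if key not in cache:
        cache[key] = build_text(dataset)
    return cache[key]
```

With a map like this written once at the start of the chain, the list returned by `minis_without_nano` identifies the Mini datasets that need their own mcm cache, and the expensive text is built only once per pair regardless of whether the Nano or the Mini record is processed first.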

README.md

  • Update README.md to describe the configurable threading
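
As a starting point for both the code change and the README text, configurable threading could look roughly like the sketch below. It makes several assumptions: the `--threads` option name and its default are hypothetical, `process_dataset` is a placeholder for the real per-dataset work, and `argparse` is used only to keep the example self-contained. The actual scripts may wire their options differently through interface.py.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def parse_args(argv=None):
    """Parse the command line; --threads is a hypothetical option name."""
    parser = argparse.ArgumentParser(description="Process datasets in parallel.")
    parser.add_argument(
        "--threads",
        type=int,
        default=4,
        help="number of worker threads (assumed default: 4)",
    )
    return parser.parse_args(argv)


def process_dataset(dataset):
    """Placeholder for the real work; here it returns the primary dataset name."""
    return dataset.strip("/").split("/")[0]


def run(datasets, threads):
    """Process all datasets with a pool of the requested size, keeping input order."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(process_dataset, datasets))


if __name__ == "__main__":
    args = parse_args()
    print(run(["/A/B/MINIAODSIM", "/C/D/NANOAODSIM"], args.threads))
```

Passing the thread count down as a plain function argument, rather than reading a module-level constant, keeps each script testable with a single thread and makes the timing tests easy to repeat with different values.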
