CMS - 2017 MC data-curation scripts #243

Open
11 tasks done
katilp opened this issue Jul 18, 2024 · 0 comments

katilp commented Jul 18, 2024

Create a new directory cms-2017-simulated-datasets with

inputs

  • get the input dataset lists, for Mini and Nano separately, with

    dasgoclient -query="dataset=/*/RunIISummer20UL17*MiniAOD*v2-106X*/MINIAODSIM"  > inputs/CMS-2017-mc-mini-datasets.txt
    dasgoclient -query="dataset=/*/RunIISummer20UL17*NanoAOD*v9-106X*/NANOAODSIM"  > inputs/CMS-2017-mc-nano-datasets.txt
    
  • an empty placeholder file doi-sim.txt
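
Once the two list files exist, a quick sanity check can catch an empty query result or a dataset of the wrong data tier before the script chain runs. The following is only a minimal sketch: the helper names `read_dataset_list`, `split_dataset_name` and `check_tier` are hypothetical and not part of the existing scripts, and it assumes dataset names follow the usual `/primary/processed/TIER` pattern.

```python
from pathlib import Path


def read_dataset_list(path):
    """Return the non-empty, stripped dataset names listed in a text file."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def split_dataset_name(name):
    """Split '/primary/processed/TIER' into its three components."""
    _, primary, processed, tier = name.split("/")
    return primary, processed, tier


def check_tier(names, expected_tier):
    """Return the dataset names whose data tier differs from the expected one."""
    return [n for n in names if split_dataset_name(n)[2] != expected_tier]
```

For example, `check_tier(read_dataset_list("inputs/CMS-2017-mc-mini-datasets.txt"), "MINIAODSIM")` should return an empty list if the Mini query produced only MINIAODSIM datasets.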

code

  • copy all *.py files from the cms-2016-sim-test branch of cms-2016-sim-test/cms-2016-simulated-datasets/code
  • make threading configurable in all *.py scripts in which it is present
  • modify lhe_generators.py so that it is integrated with interface.py in a similar way as the other scripts
  • test the chain with a small number of input files (using the option --ignore-eos-store when needed)
  • modify the scripts to take into account CMS - avoid single directories with too many entries #239: it applies to the outputs/docs and lhe_generators directories, but confirm this with Tibor or Pablo (outputs/records needs no change, as it will be aggregated with aggregate_dataset_records.py)
  • test the timing of the different steps (select datasets that have the LHE step, see mcm_store)
    • note, for example, that the provenance information is the same for Nano and Mini datasets, but dataset_records.py builds it twice, once for Nano and once for Mini
  • currently the scripts expect each Mini to have a corresponding Nano. Handle the cases where a Mini does not have a Nano. It is probably best to build a Mini-Nano correspondence map at an early stage so that it does not need to be rebuilt every time. You could store it in inputs (similar to recid_info.py).
    • the mcm cache is built only for Nano; it could also be built for those Mini that do not have a Nano
    • this could be done only once, at the very beginning of the script chain
  • The methodology (get_all_generator_text) is the time-consuming part of the dataset record building, and it is exactly the same for Mini and Nano. Check whether the related Nano dataset record has already been built and, if so, take the full methodology from there.
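
The Mini-Nano correspondence map mentioned above could look roughly like the sketch below. It is an illustration under assumptions, not the existing implementation: the function names, the JSON format and the file name `inputs/mini_nano_map.json` are hypothetical, and `nano_parents` stands for a dictionary `{nano_dataset: parent_mini_dataset}` that would have to be collected once by whatever parent lookup the scripts already use.

```python
import json
from pathlib import Path


def build_mini_nano_map(mini_datasets, nano_parents):
    """Map every Mini dataset to its Nano dataset, or to None if it has none.

    nano_parents is a hypothetical input: {nano_dataset: parent_mini_dataset}.
    """
    mini_to_nano = {mini: None for mini in mini_datasets}
    for nano, mini in nano_parents.items():
        if mini in mini_to_nano:
            mini_to_nano[mini] = nano
    return mini_to_nano


def minis_without_nano(mini_to_nano):
    """Return the sorted Mini datasets that have no corresponding Nano."""
    return sorted(mini for mini, nano in mini_to_nano.items() if nano is None)


def save_map(mini_to_nano, path):
    """Store the map as JSON so later steps can reuse it."""
    Path(path).write_text(json.dumps(mini_to_nano, indent=2, sort_keys=True))


def load_map(path):
    """Load a previously stored Mini-Nano map."""
    return json.loads(Path(path).read_text())
```

Built once at the start of the chain and stored with, say, `save_map(mini_to_nano, "inputs/mini_nano_map.json")`, the map would let later steps look up the related Nano record for a Mini (for example, to reuse an already built methodology text), while `minis_without_nano` lists the Mini datasets that still need their own mcm cache entries.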

README.md

  • Update README.md to describe the configurable threading
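
As an illustration of what configurable threading might look like, here is a minimal sketch. The `--threads` option name, its default value and the `process_all` helper are assumptions made for this example; the existing scripts may organise their threading differently.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def parse_args(argv=None):
    """Parse a hypothetical --threads option controlling the worker count."""
    parser = argparse.ArgumentParser(description="Sketch of a configurable thread count")
    parser.add_argument(
        "--threads",
        type=int,
        default=4,
        help="maximum number of worker threads (default: 4)",
    )
    return parser.parse_args(argv)


def process_all(items, worker, threads):
    """Apply worker to every item with at most `threads` threads, keeping input order."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(worker, items))
```

A script would then call `args = parse_args()` and pass `args.threads` to `process_all`, and the README could document the option alongside the other command-line flags.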