cms-2016-simulated-datasets: updates done #207

katilp · 2023-10-20T14:54:47Z

Addresses #182

Adds code for all steps.
The logic has been changed to find the provenance through the production chain.

Input files are for testing only.

Tested on 3 datasets only; for those it works fine and gives the full provenance, LHE included.
Ready for the final updates in #182 (comment)

cms-2016-simulated-datasets/code/conffiles_records.py

cms-2016-simulated-datasets/inputs/recid_info.py

cms-2016-simulated-datasets/code/mcm_store.py

tiborsimko · 2023-12-01T13:04:56Z

cms-2016-simulated-datasets/code/das_json_store.py

+    cmd = 'dasgoclient -query "'
+    if query != "dataset":
+        cmd += query + ' '
+    cmd += 'dataset=' + dataset + '" -json'


BTW, since we are using Python 3, expressions like this can be made more readable by using f-strings:

old:

cmd += 'dataset=' + dataset + '" -json'

new:

cmd += f'dataset={dataset}" -json'

The same technique of formatting strings with variable substitution could be used elsewhere in the code to simplify string concatenation.
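As an illustration of the suggestion, the whole command from the diff above could be assembled with f-strings in one place. This is only a sketch: the function name `build_das_command` is hypothetical and not taken from the PR, and the dataset name in the usage line is made up.

```python
def build_das_command(query: str, dataset: str) -> str:
    """Build the dasgoclient command line for a dataset query.

    Hypothetical helper: mirrors the string concatenation shown in the
    diff, where a plain "dataset" query gets no extra keyword before the
    dataset= selector.
    """
    prefix = "" if query == "dataset" else f"{query} "
    return f'dasgoclient -query "{prefix}dataset={dataset}" -json'


if __name__ == "__main__":
    # Made-up dataset name, for illustration only.
    print(build_das_command("parent", "/Hypothetical/Example-v1/MINIAODSIM"))
```

Keeping the formatting in a single function also makes the quoting of the `-query` argument easier to check by eye.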

cms-2016-simulated-datasets/code/mcm_store.py

katilp · 2023-12-14T08:08:18Z

@tiborsimko Missing file added and tested. Updates are done (apart from the print format).

Resulting record JSON of the six test datasets:
cms-simulated-datasets-2016.json

katilp · 2024-02-05T17:38:44Z

For the pileup (see cernopendata/opendata.cern.ch#3569):

- check a free RECID
- in code/dataset_records.py: in pileup_dataset_recif =

'Neutrino_E-10_gun/RunIISummer20ULPrePremix-UL16_106X_mcRun2_asymptotic_v13-v1/PREMIX' : <RECID>

katilp · 2024-03-07T21:31:59Z

@tiborsimko The HTML docs contain some process-specific information, so we need to run the generation for all of them.
Perhaps we can make the doc file generation a separate step in the workflow.

katilp · 2024-04-05T08:08:46Z

Take care of the cases where DAS finds two parent datasets. The script fails when AODSIM is taken as the parent (instead of MINIAODSIM) in https://github.com/cernopendata/data-curation/blob/cms-2016-sim-test/cms-2016-simulated-datasets/code/dataset_records.py#L456

Make sure that MINIAODSIM is picked.

There were 56 such AODSIM error messages.
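One possible way to make the choice explicit is sketched below. It assumes the parents returned by DAS are available as a list of dataset names whose last path component is the data tier. The function name `pick_miniaodsim_parent` is hypothetical and this is not the code at the linked line.

```python
from typing import Iterable, Optional


def pick_miniaodsim_parent(parents: Iterable[str]) -> Optional[str]:
    """Return the MINIAODSIM parent among the candidates, or None.

    Hypothetical helper. The data tier (last path component) is compared
    exactly, because a substring test for "AODSIM" would also match
    "MINIAODSIM".
    """
    for name in parents:
        tier = name.rstrip("/").rsplit("/", 1)[-1]
        if tier == "MINIAODSIM":
            return name
    return None
```

Returning `None` when no MINIAODSIM parent is present leaves it to the caller to decide whether to log the case or stop.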

katilp · 2024-04-11T11:09:02Z

Many datasets (including those with the gridpack available) lack the LHE information (or only have the production script displayed).
A madgraph example has only the script (and the link gives 404): https://opendata.cern.ch/record/33703
A powheg example displays the correct link, but it gives 404: https://opendata.cern.ch/record/35757

There is indeed no 2016-sim directory under lhe_generators:

$ eos ls /eos/opendata/cms/lhe_generators/
2015-sim

katilp · 2024-05-20T09:37:59Z

For nano variable display, follow up from cernopendata/opendata.cern.ch#3607 and make the corresponding changes in https://github.com/cernopendata/data-curation/blob/cms-2016-sim-test/cms-2016-simulated-datasets/external-scripts/inspectNanoFile.py#L316

The css file is in /eos/opendata/cms/upload/kati/patsize.css

This file is now https://opendata.cern.ch/eos/opendata/cms/dataset-semantics/patsize.css so inspectNanoFile should be updated to use it

(I'm updating the existing doc html files under /eos/opendata/cms/dataset-semantics/NanoAOD and /eos/opendata/cms/dataset-semantics/NanoAODSIM so no need to rerun them)

katilp · 2024-05-20T10:07:00Z

For the record, the list of 2016 MC datasets that do not have CODP records yet is in /eos/user/c/cmsdpoa/data-curation/cms-2016-simulated-datasets/missing-2024-03.txt (with / -> @ in the listing).
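To turn a line of that listing back into a dataset name, the substitution can simply be reversed. This is a minimal sketch, assuming that `@` never occurs in a real dataset name and that the listing has one name per line. The function name is hypothetical.

```python
def decode_dataset_name(listed: str) -> str:
    """Map a listing entry back to a dataset name by undoing / -> @.

    Hypothetical helper; assumes '@' is used only as the stand-in for '/'.
    """
    return listed.strip().replace("@", "/")
```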

katilp · 2024-05-21T11:50:47Z

Add the mcdb handling to the lhe_generators.py as in https://github.com/cernopendata/data-curation/blob/master/cms-2015-simulated-datasets/code/lhe_generators.py#L51-L103

Also change the output directory name to lhe_generators/2016-sim/ and update the corresponding path in dataset_records.py: https://github.com/cernopendata/data-curation/blob/cms-2016-sim-test/cms-2016-simulated-datasets/code/dataset_records.py#L605

An example record that should have mcdb info: https://opendata.cern.ch/record/72661

katilp · 2024-05-30T20:31:30Z

Many datasets (including those with the gridpack available) lack the LHE information (or only have the production script displayed).
A madgraph example has only the script (and the link gives 404): https://opendata.cern.ch/record/33703
A powheg example displays the correct link, but it gives 404: https://opendata.cern.ch/record/35757

There is indeed no 2016-sim directory under lhe_generators:
$ eos ls /eos/opendata/cms/lhe_generators/
2015-sim

The directory has now been copied to /eos/opendata/cms/lhe_generators/2016-sim/

Notes:

- lhe_generators/2016-sim/gridpacks also contains the 1705 datasets without an LHE step (in that case LOG.txt says `There is no LHE directory`)
- it now also contains the 132 datasets that will have <recid>_lhe_header.txt in lhe_generators/2016-sim/mcdb (in that case LOG.txt says `Skipping because of mcdb_id value`)
- in total, it contains 17536 <recid> subdirectories, whereas there are 21707 input NANO datasets
- most (but not all) datasets have the generator parameters in the <recid>/InputCards subdirectory, whereas the code only looks for files directly under the <recid> subdirectory
  - the 1316 datasets with powheg.input show the generator parameters properly, e.g. 35759
  - the 173 datasets with readInput.DAT show the generator parameters properly, e.g. 40044
  - the 14019 datasets with InputCards do not show the generator parameters, as the code only looks for files directly under the <recid> subdirectory
  - 3 datasets have only a process directory (67225, 67227, 67229) with
    ```
    $ ls /eos/opendata/cms/lhe_generators/2016-sim/gridpacks/67225/process/madevent/Cards/
    param_card.dat  proc_card_mg5.dat  run_card.dat
    ```
  - there are 182 datasets with runcmsgrid.sh and no generator parameter files (the one I checked was JHUGen); check those!

If we keep the directory structure as it is now, the code should

- take the files from the InputCards subdir, if it exists
- take the files from /process/madevent/Cards/ if InputCards does not exist
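The lookup order proposed above could be sketched as follows. This is a hedged illustration only: it assumes the `<recid>` directories are reachable as ordinary filesystem paths (the real scripts may access EOS differently), and the function name `find_generator_card_files` is hypothetical. The final fallback to the `<recid>` directory itself reflects the current behaviour described in the notes.

```python
from pathlib import Path
from typing import List


def find_generator_card_files(recid_dir: Path) -> List[Path]:
    """List candidate generator parameter files for one <recid> directory.

    Hypothetical helper. Preference order: InputCards, then
    process/madevent/Cards, then files directly under <recid>.
    """
    candidates = [
        recid_dir / "InputCards",
        recid_dir / "process" / "madevent" / "Cards",
        recid_dir,
    ]
    for directory in candidates:
        if directory.is_dir():
            # Only the first existing directory is used; no merging.
            return sorted(p for p in directory.iterdir() if p.is_file())
    return []
```

Stopping at the first existing directory keeps the behaviour predictable; whether files from several locations should instead be merged is a separate decision.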

katilp · 2024-06-26T18:59:41Z

I still observe missing items in the provenance, unfortunately also among the ZZZ samples that get displayed first, e.g. https://opendata.cern.ch/record/75597

That is due to missing information in lhe_generators/2016-sim/gridpacks (also in my local cache).
I know why it happened and it is fixed in the code. The log reported `No 'cms.vstring(/cvmfs' found in fragment; skipping` because the fragment has args=cms.vstring(["/cvmfs/cms.cern.ch/phys_generator/gridpacks/slc7_amd64_gcc700/madgraph/V5_2.6....5/VVV/ZZZ_Dim6_cW_cHd_cHWB_cHW_4F_slc7_amd64_gcc700_CMSSW_10_6_19_tarball.tar.xz"])

The fixed code would find these, but the lhe_generators directory for those datasets was generated before the fix.
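For illustration, a search that tolerates both the plain and the list form of `cms.vstring` might look like the sketch below. This is not the fix that went into the code. The regular expression and the function name `find_gridpack_path` are assumptions made for this example.

```python
import re
from typing import Optional

# Accept cms.vstring("/cvmfs/...") as well as cms.vstring(["/cvmfs/..."]),
# with either quote style. The pattern is an assumption for this sketch.
GRIDPACK_RE = re.compile(r"""cms\.vstring\(\s*\[?\s*["']?(/cvmfs/[^"'\]\),\s]+)""")


def find_gridpack_path(fragment: str) -> Optional[str]:
    """Return the first /cvmfs gridpack path in a fragment, or None."""
    match = GRIDPACK_RE.search(fragment)
    return match.group(1) if match else None
```

A fragment for which this returns `None` would still need the kind of log message quoted above, so that missing cases remain visible.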

I will open a separate issue to fix the records that need to be completed, because I do not want to rerun the full record generation.
Most likely there are not that many, but some detective work is needed to figure out which ones.

As the code update is already done, it probably does not require changes in this PR.

And there are only 15, see cernopendata/opendata.cern.ch#3652


tiborsimko

Rebasing and merging the set of scripts used to publish CMS 2016 SIM records, with several post-release updates.

katilp requested a review from tiborsimko October 20, 2023 14:55

katilp force-pushed the cms-2016-sim-test branch from 60e5220 to 996f971 Compare October 26, 2023 15:23

katilp mentioned this pull request Nov 8, 2023

CMS - add creation of NanoAOD content documentation to NanoAOD scripts #213

Closed

katilp mentioned this pull request Dec 1, 2023

CMS: 2016 data release checklist #124

Open

22 tasks

tiborsimko reviewed Dec 1, 2023


katilp changed the title ~~cms-2016-simulated-datasets: work in progress~~ cms-2016-simulated-datasets: updates done Dec 14, 2023

tiborsimko force-pushed the cms-2016-sim-test branch 4 times, most recently from 1fb5083 to a36eec2 Compare January 16, 2024 13:26

tiborsimko force-pushed the cms-2016-sim-test branch from a36eec2 to 440bfab Compare January 24, 2024 13:23

tiborsimko force-pushed the cms-2016-sim-test branch 7 times, most recently from 5ffc345 to 1496945 Compare February 5, 2024 10:41

tiborsimko force-pushed the cms-2016-sim-test branch 4 times, most recently from 600d931 to b922c4c Compare February 9, 2024 10:54

katilp mentioned this pull request Feb 12, 2024

records: add 2016 pileup record cernopendata/opendata.cern.ch#3574

Merged

tiborsimko force-pushed the cms-2016-sim-test branch 2 times, most recently from a1a9df2 to d7cc784 Compare February 13, 2024 12:44

tiborsimko force-pushed the cms-2016-sim-test branch 2 times, most recently from 2a2ca53 to 75ee4a9 Compare March 7, 2024 14:24

katilp mentioned this pull request Jun 2, 2024

CMS - dataset, recid, mcdb_id correspondance #235

Open

tiborsimko force-pushed the cms-2016-sim-test branch 3 times, most recently from 57696dd to 4397736 Compare June 10, 2024 13:41

Kati Lassila-Perini and others added 13 commits September 9, 2024 18:30

cms-2016-simulated-datasets: work in progress

a8aae62

cms-2016-simulated-datasets: updates as in #182

603080f

cms-2016-simulated-datasets: further enriching and cleaning as in #182

698b4bf

cms-2016-simulated-datasets: further enriching and cleaning as in #182

7443041

cms-2016-simulated-datasets: add a mini-nano cache, get reprocessing year

c5c9c29

cms-2016-simulated-datasets: minor fixes to the PR

4c571e7

cms-2016-simulated-datasets: add missing lhe file and fix output dataset format

0f4db89

cms-2016-simulated-datasets: richer dataset sample

8bc5363

Uses O(1k) CMS 2016 MC datasets in order to have a richer dataset sample for testing of metadata extraction. Enriches the documentation and the global `.gitignore` file.

cms-2016-simulated-datasets: first full run

ef48342

This commit brings numerous improvements and changes necessary to complete the first full data extraction run and the record generation run on the complete CMS 2016 data input.

cms-2016-simulated-datasets: mcdb support and code cleanup

a0a1d0c

cms-2016-simulated-datasets: JHUGen updates

9099941

cms-2016-simulated-datasets: JHUGen mcfm option and simpler gridpack path search

65eb25f

Closes #238

cms-2016-simulated-datasets: read InputCards and _JHUGen subdirectories

0878405

tiborsimko force-pushed the cms-2016-sim-test branch from 4397736 to 0878405 Compare September 9, 2024 16:33

tiborsimko approved these changes Sep 9, 2024


tiborsimko merged commit 0878405 into master Sep 9, 2024
6 checks passed

tiborsimko deleted the cms-2016-sim-test branch September 9, 2024 16:39

This was referenced Sep 9, 2024

Replace system calls by xrootdpyfs #236

Closed

CMS 2017 optimize #244

Closed
