@iguinn
Contributor

@iguinn iguinn commented Nov 17, 2025

This PR will ultimately add a few functions for querying our data. The goal is an easier, more flexible way to access large amounts of data without writing many lines of code, which costs a lot of coder time and slows down the iterative process of looking at data. The queries will be declarative and resemble SQL queries (SELECT fields FROM data-source WHERE condition). These functions will take advantage of the structures and codebase we have built up:

  • Use dbetto to dig through metadata and parameter databases
  • Use the LH5Iterator to scale well in terms of memory and enable parallelism
  • Take advantage of the dataflow config file to locate all of the necessary pieces (this will not need a file db, unless there is a need to query at the level of cycles, which is not planned)
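
As a rough illustration of the SELECT/FROM/WHERE idea combined with chunked iteration, here is a minimal sketch in plain Python. It is not the code in this PR: `iter_chunks` and `select` are hypothetical names, the rows are made up, and the chunk generator only stands in for a chunked reader such as the LH5Iterator.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def iter_chunks(rows: Iterable[dict], chunk_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most chunk_size rows (hypothetical stand-in for a chunked file reader)."""
    it = iter(rows)
    while chunk := list(islice(it, chunk_size)):
        yield chunk


def select(
    fields: list[str],
    source: Iterable[dict],
    where: Callable[[dict], bool],
    chunk_size: int = 2,
) -> list[dict]:
    """SELECT fields FROM source WHERE where(row), holding one chunk in memory at a time."""
    out = []
    for chunk in iter_chunks(source, chunk_size):
        # Filter first, then keep only the requested fields
        out.extend({f: row[f] for f in fields} for row in chunk if where(row))
    return out


# Made-up rows, for illustration only
rows = [
    {"channel": "ch1", "energy": 2614.5, "is_pulser": False},
    {"channel": "ch2", "energy": 583.2, "is_pulser": True},
    {"channel": "ch1", "energy": 1460.8, "is_pulser": False},
]
result = select(["channel", "energy"], rows, lambda r: not r["is_pulser"])
```

Because each chunk is filtered independently, the same loop body could in principle be handed to separate workers, which is the property that makes chunked iteration attractive for parallelism.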

So far, I have added a metadata query, with a data query, a data hist-query, and (maybe) an event query to come. The metadata query is split into a run query (based on period, run, datatype (e.g. cal, phy, etc.), and starttime) and a channel query (using anything from our databases, identified by shortcuts prefixed with @). Right now these are @det for the detector database, @par for analysis parameters, and @run for run info; this can be extended. The needed databases are found using dbetto, and are currently hard-coded. Because the names here can be long, the ability to alias them has been added. Here's what the metadata query looks like:
[image: example metadata query]
This function uses the dataflow config (pointed to by a refprod argument or the REFPROD environment variable) to construct a legend metadata instance, then uses meta.datasets.runinfo to find and query the runs. The channelmap is used to loop over detectors and get information for @det. The paths pointed to in the config by pars_* are used for @par. Currently, eval is used to evaluate the run and channel queries. Results can be returned as pd, ak, or lgdo, although lgdo won't always work due to unsupported data types.
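
To make the two-stage evaluation concrete, here is a minimal, self-contained sketch of how a run query and a channel query with @ shortcuts and aliases could be evaluated with eval. Everything in it is an assumption for illustration: `expand`, `query_meta`, the in-memory `RUNS` and `CHANNELS` tables, and the field names are hypothetical, and the real function, its signature, and its dbetto lookups may differ.

```python
import re
from types import SimpleNamespace

# Hypothetical in-memory stand-ins for the run info and channel databases
RUNS = [
    {"period": "p03", "run": "r000", "datatype": "cal"},
    {"period": "p03", "run": "r001", "datatype": "phy"},
]
CHANNELS = {
    "V01": {"det": {"type": "icpc", "mass_in_g": 1800}, "par": {"fwhm": 2.4}},
    "B02": {"det": {"type": "bege", "mass_in_g": 700}, "par": {"fwhm": 2.9}},
}

# Empty builtins keep expressions simple; this is NOT a security sandbox
NO_BUILTINS = {"__builtins__": {}}


def expand(expr, aliases):
    """Substitute whole-word aliases, then turn @name shortcuts into plain identifiers."""
    for alias, target in aliases.items():
        expr = re.sub(rf"\b{re.escape(alias)}\b", lambda m, t=target: t, expr)
    return re.sub(r"@(\w+)", r"\1", expr)


def query_meta(fields, runs_where, channels_where, aliases=None):
    """Filter runs first, then evaluate the channel query per detector in each matching run."""
    aliases = aliases or {}
    rows = []
    for run in RUNS:
        # Run query: keys of the run record are visible as bare names
        if not eval(runs_where, NO_BUILTINS, dict(run)):
            continue
        for name, dbs in CHANNELS.items():
            # Channel query: each database is exposed as det, par, run for dotted access
            ns = {key: SimpleNamespace(**db) for key, db in dbs.items()}
            ns["run"] = SimpleNamespace(**run)
            if eval(expand(channels_where, aliases), NO_BUILTINS, ns):
                row = {"run": run["run"], "channel": name}
                for field in fields:
                    row[field] = eval(expand(field, aliases), NO_BUILTINS, ns)
                rows.append(row)
    return rows


rows = query_meta(
    fields=["mass", "@par.fwhm"],
    runs_where="period == 'p03' and datatype == 'cal'",
    channels_where="@det.type == 'icpc'",
    aliases={"mass": "@det.mass_in_g"},
)
```

In this sketch the result is a plain list of dicts, which could then be converted to whichever output container is requested. Since eval executes arbitrary expressions, a design like this is only appropriate when the query strings come from trusted users.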

To-do/requests:

  • Add tests. This will require the addition to legend-testdata of some directories structured like our productions, with config files, analysis parameters, metadata, etc.
  • Tutorials
  • More queries. A prototype already exists for data querying.
  • When querying data, the evt tier is going to be a sticking point since it is structured differently from the other tiers. This is the basis for my suggestion of adding dataset views to lh5 (legend-data-format-specs#13, "Views"), which would benefit from having views from each detector's events to the corresponding events. This can be worked around for global trigger data, but is very challenging for cal data.

@iguinn iguinn requested review from ggmarshall and gipert November 17, 2025 18:00
@codecov

codecov bot commented Nov 17, 2025

Codecov Report

❌ Patch coverage is 9.75610% with 111 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.60%. Comparing base (c842646) to head (d4a6372).

Files with missing lines Patch % Lines
src/pygama/flow/query_meta.py 8.26% 111 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #633      +/-   ##
==========================================
- Coverage   51.22%   50.60%   -0.62%     
==========================================
  Files          60       61       +1     
  Lines        8219     8341     +122     
==========================================
+ Hits         4210     4221      +11     
- Misses       4009     4120     +111     


@gipert
Member

gipert commented Nov 18, 2025

Hi @iguinn, this looks interesting! I will have a look as soon as I can.

@gipert
Member

gipert commented Dec 10, 2025

So let's move this to pylegendmeta?
