Data querying #633

iguinn · 2025-11-17T17:59:57Z

This PR will ultimately add a few functions for querying our data. The goal is an easier, more flexible way to access large amounts of data without writing many lines of code, which costs a lot of coder time and slows down the iterative process of looking at data. The queries will be declarative and resemble SQL queries (SELECT fields FROM data-source WHERE condition). These functions will take advantage of the structures and codebase we have already built up:

Use dbetto to dig through metadata and parameter databases
Use the LH5Iterator to scale well in terms of memory and enable parallelism
Take advantage of the dataflow config file to locate all of the necessary pieces (this will not need a file db, unless there is a need to query at the level of cycles, which is not planned)
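
To make the declarative idea concrete, here is a minimal sketch in plain Python of a SELECT fields FROM data-source WHERE condition query evaluated one chunk at a time, so that memory use stays bounded. It is not the implementation in this PR: `iter_chunks`, `select`, and the column names are hypothetical, and a plain list of dicts stands in for data that would really be read through the LH5Iterator.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows (stand-in for a chunked file reader)."""
    it = iter(rows)
    while chunk := list(islice(it, chunk_size)):
        yield chunk


def select(fields, source, where, chunk_size=2):
    """SELECT fields FROM source WHERE where, evaluated one chunk at a time."""
    cond = compile(where, "<where>", "eval")
    out = []
    for chunk in iter_chunks(source, chunk_size):
        for row in chunk:
            # Each row's columns are exposed as variable names in the condition.
            if eval(cond, {"__builtins__": {}}, row):
                out.append({f: row[f] for f in fields})
    return out


# Toy data; the column names are made up for illustration.
hits = [
    {"channel": 1, "energy": 511.0, "is_valid": True},
    {"channel": 2, "energy": 2614.5, "is_valid": True},
    {"channel": 1, "energy": 1460.8, "is_valid": False},
    {"channel": 3, "energy": 2039.0, "is_valid": True},
]

result = select(["channel", "energy"], hits, "energy > 1000 and is_valid")
print(result)
```

Only one chunk is held at a time while filtering, which is the property the iterator-based design is meant to provide; a real version would also avoid row-by-row Python loops.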

So far, I have added a metadata query, with a data query, a data hist-query, and (maybe) an event query to come. It splits the query into a run query (based on period, run, datatype (e.g. cal, phy), and starttime) and a channel query (using anything from our databases, identified by shortcuts prefixed with @). Right now these are @det for the detector database, @par for analysis parameters, and @run for run info; this can be extended. The needed databases are found using dbetto and are currently hard-coded. Because the names can get long, the ability to alias them has been added. Here's what the meta-data query looks like:

This function uses the dataflow config (pointed to by a refprod argument or the REFPROD environment variable) to construct a legend metadata instance, and uses meta.datasets.runinfo to find and query the runs. The channelmap is used to loop over detectors and get information for @det. The paths pointed to in the config by pars_* are used for @par. Currently, eval is used to evaluate the run and channel queries. The result can be returned as a pandas (pd), Awkward (ak), or LGDO (lgdo) object, although lgdo won't always work due to unsupported data types.

To-do/requests:

Add tests. This will require the addition to legend-testdata of some directories structured like our productions, with config files, analysis parameters, metadata, etc.
Tutorials
More queries. A prototype already exists for data querying.
When querying data, the evt tier is going to be a sticking point, since it is structured differently from the other tiers. This is the basis for my suggestion of adding dataset views to lh5 (Views, legend-data-format-specs#13), since evt querying would benefit from having views from each detector's events to the corresponding events. This can be worked around for global trigger data, but is very challenging for cal data.

codecov · 2025-11-17T18:07:06Z

Codecov Report

❌ Patch coverage is 9.75610% with 111 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.60%. Comparing base (c842646) to head (d4a6372).

Files with missing lines	Patch %	Lines
src/pygama/flow/query_meta.py	8.26%	111 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #633      +/-   ##
==========================================
- Coverage   51.22%   50.60%   -0.62%     
==========================================
  Files          60       61       +1     
  Lines        8219     8341     +122     
==========================================
+ Hits         4210     4221      +11     
- Misses       4009     4120     +111


gipert · 2025-11-18T09:19:05Z

hi @iguinn this looks interesting! i will have a look as soon as i can

gipert · 2025-12-10T08:04:13Z

So let's move this to pylegendmeta?

iguinn and others added 2 commits November 17, 2025 09:45

Added query_meta

f24d918

style: pre-commit fixes

a2b33b1

iguinn requested review from ggmarshall and gipert November 17, 2025 18:00

Merge branch 'main' into query

d4a6372
