Is your feature request related to a problem? Please describe.
We run thousands of anomaly validations on hundreds of data sources, and we often need to explain why they fail. The first step is to look at the underlying metrics, which are persisted in PyDeequ-managed JSON. Processing JSON in a Spark environment is inefficient, both programmatically and otherwise.
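For context, here is a minimal sketch of what inspecting the current repository with Spark looks like, assuming the file was written by FileSystemMetricsRepository and serialized as a single JSON document (the path is hypothetical):

```python
# A minimal sketch, assuming the repository was written by
# FileSystemMetricsRepository; the path below is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The whole repository lives in one JSON document, so Spark must read it
# in multiLine mode and gets no column pruning or predicate pushdown.
metrics_df = (
    spark.read
    .option("multiLine", True)
    .json("s3://my-bucket/deequ/metrics.json")  # hypothetical path
)
metrics_df.printSchema()
metrics_df.show(truncate=False)
```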
Describe the solution you'd like
One way to provide explanations for failed anomalies is to expose the JSON repository file underlying FileSystemMetricsRepository to analytical users. Since PyDeequ is a PySpark framework, the natural choice is to store repository data as .parquet/Delta files instead of JSON (see the sketch after this list). This would also cover:
Enabling anomaly explainability via a live metrics repository table
Establishing a data contract for the repository (currently the PyDeequ JSON repository does not really assume a schema)
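As a hedged sketch of the proposal, the existing JSON repository could be flattened into one metric per row and materialized as Parquet. The paths are hypothetical, and the nested field names assume one common layout of the serialized JSON rather than a documented PyDeequ schema:

```python
# A hedged sketch of the proposal: flatten the repository into one metric
# per row and persist it as Parquet so it can be queried with Spark SQL.
# Paths are hypothetical; the nested field names are assumptions about
# the serialized JSON layout, not a documented PyDeequ schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.option("multiLine", True).json(
    "s3://my-bucket/deequ/metrics.json"  # hypothetical path
)

flat = (
    raw
    .select(
        F.col("resultKey.dataSetDate").alias("dataset_ts"),  # assumed field
        F.explode("analyzerContext.metricMap").alias("m"),   # assumed field
    )
    .select(
        "dataset_ts",
        F.col("m.metric.entity").alias("entity"),      # assumed fields
        F.col("m.metric.instance").alias("instance"),
        F.col("m.metric.name").alias("name"),
        F.col("m.metric.value").alias("value"),
    )
)

flat.write.mode("append").parquet("s3://my-bucket/deequ/metrics_parquet/")
```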
Describe alternatives you've considered
Consider the Delta format - probably a far better choice, though it requires more development effort to enable incremental processing and rollbacks.
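A hedged sketch of the Delta alternative (requires the delta-spark package; paths are hypothetical): appends would give incremental history, and time travel would give rollback-style reads.

```python
# A hedged sketch of the Delta alternative; requires the delta-spark
# package, and the paths are hypothetical.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

run_metrics = spark.read.option("multiLine", True).json(
    "s3://my-bucket/deequ/metrics.json"  # hypothetical path
)

# Incremental processing: each validation run appends its metrics.
run_metrics.write.format("delta").mode("append").save(
    "s3://my-bucket/deequ/metrics_delta/"
)

# Rollback-style reads: time travel to an earlier version of the repository.
v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("s3://my-bucket/deequ/metrics_delta/")
)
```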
Additional context
It would also be good to know why the .json format was chosen, and whether there is any way we can benefit from this choice.
WiktorMadejski changed the title from "FileRepository file to be a parquet (/ delta)" to "FileSystemMetricsRepository file to be a parquet (/ delta)" on Jan 6, 2024
I agree Parquet would be a much better storage format and that Deequ metrics should be properly data modeled.
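To make "properly data modeled" concrete, here is a purely illustrative schema sketch; none of these column names come from PyDeequ itself:

```python
# A purely illustrative sketch of a pinned-down metrics schema (a "data
# contract"); none of these names come from PyDeequ itself.
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType,
    TimestampType, MapType,
)

metrics_schema = StructType([
    StructField("dataset_ts", TimestampType(), nullable=False),
    StructField("tags", MapType(StringType(), StringType())),
    StructField("entity", StringType(), nullable=False),    # e.g. Column / Dataset
    StructField("instance", StringType(), nullable=False),  # e.g. the column name
    StructField("name", StringType(), nullable=False),      # e.g. Completeness
    StructField("value", DoubleType()),
])
```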
Could we have a separate feature request for supporting Delta/Iceberg, which use the Parquet file format in the storage layer? It would be good to spec out the use case and benefits in that one.
It would also be good to know why the .json format was chosen, and whether there is any way we can benefit from this choice.
I think human-readability might be the biggest benefit, but the context of that design decision may have been lost to history.