Is your feature request related to a problem? Please describe.
We run thousands of anomaly validations on hundreds of data sources, and we often need to explain why they fail. The first step is to look at the underlying metrics, which are persisted in PyDeequ-managed JSON. Processing JSON in a Spark environment is inefficient, both programmatically and otherwise.
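For context, here is a minimal sketch of what inspecting the current repository with Spark looks like, assuming the file was written by FileSystemMetricsRepository and serialized as a single JSON document (the path is hypothetical):

```python
# A minimal sketch, assuming the repository was written by
# FileSystemMetricsRepository; the path below is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The whole repository lives in one JSON document, so Spark must read it
# in multiLine mode and gets no column pruning or predicate pushdown.
metrics_df = (
    spark.read
    .option("multiLine", True)
    .json("s3://my-bucket/deequ/metrics.json")  # hypothetical path
)
metrics_df.printSchema()
metrics_df.show(truncate=False)
```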
Describe the solution you'd like
One way to provide explanations for failed anomalies is to expose the JSON repository file underlying FileSystemMetricsRepository to analytical users. Since PyDeequ is a PySpark framework, the natural choice is to store repository data as .parquet/Delta files instead of JSON (see the sketch after this list). This would also cover:
Enabling anomaly explainability via a live metrics repository table
Establishing a data contract for the repository (currently the PyDeequ JSON repository does not really assume a schema)
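As a hedged sketch of the proposal, the existing JSON repository could be flattened into one metric per row and materialized as Parquet. The paths are hypothetical, and the nested field names assume one common layout of the serialized JSON rather than a documented PyDeequ schema:

```python
# A hedged sketch of the proposal: flatten the repository into one metric
# per row and persist it as Parquet so it can be queried with Spark SQL.
# Paths are hypothetical; the nested field names are assumptions about
# the serialized JSON layout, not a documented PyDeequ schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.option("multiLine", True).json(
    "s3://my-bucket/deequ/metrics.json"  # hypothetical path
)

flat = (
    raw
    .select(
        F.col("resultKey.dataSetDate").alias("dataset_ts"),  # assumed field
        F.explode("analyzerContext.metricMap").alias("m"),   # assumed field
    )
    .select(
        "dataset_ts",
        F.col("m.metric.entity").alias("entity"),      # assumed fields
        F.col("m.metric.instance").alias("instance"),
        F.col("m.metric.name").alias("name"),
        F.col("m.metric.value").alias("value"),
    )
)

flat.write.mode("append").parquet("s3://my-bucket/deequ/metrics_parquet/")
```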
Describe alternatives you've considered
Consider the Delta format - probably a far better choice, though it requires more development effort to enable incremental processing and rollbacks.
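A hedged sketch of the Delta alternative (requires the delta-spark package; paths are hypothetical): appends would give incremental history, and time travel would give rollback-style reads.

```python
# A hedged sketch of the Delta alternative; requires the delta-spark
# package, and the paths are hypothetical.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

run_metrics = spark.read.option("multiLine", True).json(
    "s3://my-bucket/deequ/metrics.json"  # hypothetical path
)

# Incremental processing: each validation run appends its metrics.
run_metrics.write.format("delta").mode("append").save(
    "s3://my-bucket/deequ/metrics_delta/"
)

# Rollback-style reads: time travel to an earlier version of the repository.
v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("s3://my-bucket/deequ/metrics_delta/")
)
```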
Additional context
It would also be good to know why the .json format was chosen, and whether there is any way we can benefit from this choice.
WiktorMadejski changed the title from "FileRepository file to be a parquet (/ delta)" to "FileSystemMetricsRepository file to be a parquet (/ delta)" on Jan 6, 2024
I agree Parquet would be a much better storage format and that Deequ metrics should be properly data modeled.
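To make "properly data modeled" concrete, here is a purely illustrative schema sketch; none of these column names come from PyDeequ itself:

```python
# A purely illustrative sketch of a pinned-down metrics schema (a "data
# contract"); none of these names come from PyDeequ itself.
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType,
    TimestampType, MapType,
)

metrics_schema = StructType([
    StructField("dataset_ts", TimestampType(), nullable=False),
    StructField("tags", MapType(StringType(), StringType())),
    StructField("entity", StringType(), nullable=False),    # e.g. Column / Dataset
    StructField("instance", StringType(), nullable=False),  # e.g. the column name
    StructField("name", StringType(), nullable=False),      # e.g. Completeness
    StructField("value", DoubleType()),
])
```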
Could we have a separate feature request for supporting Delta/Iceberg, which use the Parquet file format in the storage layer? It would be good to spec out the use case and benefits in that one.
It would also be good to know why the .json format was chosen, and whether there is any way we can benefit from this choice.
I think human-readability might be the biggest benefit, but the context of that design decision may have been lost to history.