FileSystemMetricsRepository file to be a parquet (/ delta) #185

Open
WiktorMadejski opened this issue Jan 6, 2024 · 1 comment
Labels: feature request

Comments

WiktorMadejski (Contributor) commented Jan 6, 2024

Is your feature request related to a problem? Please describe.
We run thousands of anomaly validations on hundreds of data sources, and we often want to explain why they fail. The first step is to look at the underlying metrics, which are persisted in the JSON file managed by PyDeequ. Processing JSON in a Spark environment is inefficient, both programmatically and manually.

Describe the solution you'd like
One way to provide explanations for failed anomaly checks is to expose the JSON repository file underlying FileSystemMetricsRepository to analytical users. Since PyDeequ is a PySpark framework, the natural choice is to store repository data as Parquet/Delta files instead of JSON. This could also:

  • Enable anomaly explainability through a live metrics repository table
  • Establish a data contract for the repository (currently the PyDeequ JSON repository does not really assume a schema)
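
To make the data contract idea concrete, here is a minimal sketch of flattening nested repository entries into rows with a fixed set of typed columns. The JSON layout in `SAMPLE`, the `MetricRow` class, and the `flatten` helper are all assumptions made for illustration; the real repository file may be laid out differently.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterator

# Assumed, simplified layout of a repository file (hypothetical sample data).
SAMPLE = """
[
  {
    "resultKey": {"dataSetDate": 1704499200000, "tags": {"source": "orders"}},
    "analyzerContext": {
      "metricMap": [
        {"metric": {"entity": "Dataset", "instance": "*", "name": "Size", "value": 120.0}},
        {"metric": {"entity": "Column", "instance": "id", "name": "Completeness", "value": 0.98}}
      ]
    }
  }
]
"""


@dataclass(frozen=True)
class MetricRow:
    """Hypothetical fixed schema: one row per metric per result key."""
    dataset_date: int
    tags: str  # tags kept as a JSON string with sorted keys
    entity: str
    instance: str
    name: str
    value: float


def flatten(raw: str) -> Iterator[MetricRow]:
    """Turn nested repository entries into flat, typed rows."""
    for entry in json.loads(raw):
        key = entry["resultKey"]
        for item in entry["analyzerContext"]["metricMap"]:
            metric = item["metric"]
            # Explicit casts fail loudly if a field is missing or mistyped.
            yield MetricRow(
                dataset_date=int(key["dataSetDate"]),
                tags=json.dumps(key.get("tags", {}), sort_keys=True),
                entity=str(metric["entity"]),
                instance=str(metric["instance"]),
                name=str(metric["name"]),
                value=float(metric["value"]),
            )


rows = list(flatten(SAMPLE))
for row in rows:
    print(asdict(row))
```

Rows in this shape could then be handed to Spark (for example via `spark.createDataFrame`) and written with `DataFrame.write.parquet`, so that analysts query a table rather than parse nested JSON.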

Describe alternatives you've considered
Consider the Delta format: probably a much better choice, but it requires more development to enable incremental processing and rollbacks.

Additional context
It would also be good to know why the .json format was chosen. Is there any way we can benefit from this choice?

WiktorMadejski changed the title from "FileRepository file to be a parquet (/ delta)" to "FileSystemMetricsRepository file to be a parquet (/ delta)" on Jan 6, 2024
chenliu0831 added the feature request label on Jan 8, 2024
chenliu0831 (Contributor) commented

I agree Parquet would be a much better storage format and that Deequ metrics should be properly data modeled.

Could we have a separate feature request for supporting Delta/Iceberg, which use the Parquet file format in the storage layer? It would be good to spec out the use case and benefits in that one.

> Would also be good to know why .json format was a chosen? And is there any way we can benefit from this choice?

I think human readability might be the biggest benefit, but the context of the design might have been lost in history.
