feat(ingest/snowflake,bigquery): Stateful time window ingestion for queries v2 with bucket alignment #15040
This impacts user-facing config, right? A before/after example would help to understand what we are really addressing here.
    "enable_stateful_lineage_ingestion and enable_stateful_usage_ingestion are deprecated "
    "when using use_queries_v2=True. These configs only work with the legacy (non-queries v2) extraction path. "
    "For queries v2, use enable_stateful_time_window instead to enable stateful ingestion "
    "for the unified time window extraction (lineage + usage + operations + queries)."
additionally, docs for enable_stateful_lineage_ingestion and enable_stateful_usage_ingestion should be updated to mention that they only work with use_queries_v2=False
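To make the user-facing effect concrete, here is a minimal sketch of the check in plain Python. It is not the PR's pydantic validator: the helper name `check_queries_v2_stateful_flags`, the flat dict layout, and the choice to warn rather than fail are assumptions for illustration. Only the config keys and the gist of the message come from the diff above.

```python
import warnings
from typing import Any, Dict, List

# Flag names taken from the warning text in the diff.
LEGACY_FLAGS = (
    "enable_stateful_lineage_ingestion",
    "enable_stateful_usage_ingestion",
)


def check_queries_v2_stateful_flags(values: Dict[str, Any]) -> List[str]:
    """Return the legacy stateful flags that are set while use_queries_v2 is on.

    Hypothetical helper: it emits a warning (assumed behaviour) and never raises.
    """
    if not values.get("use_queries_v2"):
        return []  # legacy extraction path: the old flags still apply
    offending = [flag for flag in LEGACY_FLAGS if values.get(flag)]
    if offending:
        warnings.warn(
            f"{' and '.join(offending)} only work with the legacy "
            "(non-queries v2) extraction path. For queries v2, use "
            "enable_stateful_time_window instead.",
            stacklevel=2,
        )
    return offending


# "Before": a legacy flag combined with queries v2 (triggers the warning).
before = {"use_queries_v2": True, "enable_stateful_lineage_ingestion": True}
# "After": the single unified flag named in the warning text.
after = {"use_queries_v2": True, "enable_stateful_time_window": True}
```

Calling the helper on `before` reports `enable_stateful_lineage_ingestion`, while `after` passes cleanly. Where these keys sit in a real recipe is not shown in this conversation, so treat the flat dict as a simplification.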
    self.structured_report = structured_report
    self.redundant_run_skip_handler = redundant_run_skip_handler

    self.start_time, self.end_time = self._get_time_window()
If we are not tracking them already, we should log these effective start/end times or add them to the report.
We log it
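For readers following this thread, a rough sketch of what "effective start/end time" can mean for a stateful run, and how it could be both logged and recorded. Everything here is hypothetical: `WindowReport`, `resolve_time_window`, and the resume-from-checkpoint rule are illustrative stand-ins, not the PR's `_get_time_window()` implementation.

```python
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

logger = logging.getLogger(__name__)


@dataclass
class WindowReport:
    """Hypothetical stand-in for a structured ingestion report."""

    effective_start_time: Optional[datetime] = None
    effective_end_time: Optional[datetime] = None


def resolve_time_window(
    configured_start: datetime,
    configured_end: datetime,
    last_run_end: Optional[datetime],
    report: WindowReport,
) -> Tuple[datetime, datetime]:
    """Pick the window to extract, then log it and store it on the report."""
    start = configured_start
    if last_run_end is not None and configured_start < last_run_end < configured_end:
        # Part of the window was already ingested: resume from the checkpoint.
        start = last_run_end
    report.effective_start_time = start
    report.effective_end_time = configured_end
    logger.info("Effective time window: %s to %s", start, configured_end)
    return start, configured_end


# Example: a two-day window where the first day was covered by a previous run.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
report = WindowReport()
window = resolve_time_window(t0, t0 + timedelta(days=2), t0 + timedelta(days=1), report)
```

Writing the values to the report as well as the log keeps them visible in the run summary even when log output is not retained.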
    @root_validator(pre=False, skip_on_failure=True)
    def validate_queries_v2_stateful_ingestion(cls, values: Dict) -> Dict:
FYI I'm replacing these v1 validators in #15057
LGTM
Documentation updates completed: Changes made: Updated field documentation in
Both fields now include the following note:
These changes automatically apply to both BigQuery and Snowflake configs since they inherit from these base mixins.
Summary
This PR implements unified stateful time window ingestion for BigQuery and Snowflake queries extractors (queries v2), replacing the previous lineage-specific approach with a more flexible time window-based handler. The new implementation supports bucket alignment for usage statistics aggregation and works consistently across all queries v2 features (lineage, usage, operations, queries).
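As an illustration of bucket alignment, the sketch below floors a window start to the beginning of its day or hour when usage statistics are enabled. It is an approximation under stated assumptions: `floor_to_bucket` and `align_start_time` are hypothetical names, and the `BucketDuration` enum is defined locally so that the example is self-contained. The PR itself refers to a `get_time_bucket()` function, whose exact behaviour is not shown here.

```python
from datetime import datetime, timezone
from enum import Enum


class BucketDuration(str, Enum):
    """Local stand-in for the bucket duration setting (assumed values)."""

    DAY = "DAY"
    HOUR = "HOUR"


def floor_to_bucket(ts: datetime, bucket: BucketDuration) -> datetime:
    """Floor a timestamp to the start of its hour or day."""
    if bucket == BucketDuration.HOUR:
        return ts.replace(minute=0, second=0, microsecond=0)
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def align_start_time(
    start: datetime, bucket: BucketDuration, include_usage_statistics: bool
) -> datetime:
    """Align the window start only when usage statistics are aggregated."""
    if not include_usage_statistics:
        return start  # no aggregation buckets to protect
    return floor_to_bucket(start, bucket)


start = datetime(2024, 1, 15, 14, 37, 12, tzinfo=timezone.utc)
daily = align_start_time(start, BucketDuration.DAY, True)    # 2024-01-15 00:00:00+00:00
hourly = align_start_time(start, BucketDuration.HOUR, True)  # 2024-01-15 14:00:00+00:00
unaligned = align_start_time(start, BucketDuration.DAY, False)  # unchanged
```

The presumed motivation is that a window starting mid-bucket would yield partial counts for that bucket, so moving the start back to the bucket boundary keeps each aggregated bucket complete.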
Changes
Feature Implementation
Unified State Handler:
- Replaced RedundantLineageRunSkipHandler with RedundantQueriesRunSkipHandler
- Applies to every queries v2 feature (lineage, usage, operations, queries), not only include_lineage
Bucket Alignment for Usage Statistics:
- Applied when include_usage_statistics=True
- Uses config.window.bucket_duration
- Uses the get_time_bucket() function in time_window_config.py
State Updates:
- update_state() now includes bucket_duration parameter for proper state tracking
Files Modified:
- src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
- src/datahub/ingestion/source/snowflake/snowflake_queries.py
- src/datahub/ingestion/source/state/redundant_run_skip_handler.py
Test Coverage
Updated and enhanced test suites for both BigQuery and Snowflake:
Files Modified:
- tests/unit/bigquery/test_bigquery_queries.py
- tests/unit/snowflake/test_snowflake_queries.py
Test Updates:
- Renamed *StatefulLineageIngestion to *StatefulTimeWindowIngestion
- Updated to use RedundantQueriesRunSkipHandler
- Updated update_state assertions to include BucketDuration parameter
- test_bucket_alignment_with_usage_statistics - Daily bucket alignment
- test_bucket_alignment_hourly_with_usage_statistics - Hourly bucket alignment
- test_no_bucket_alignment_without_usage_statistics - No alignment when disabled
Bug Fix
Fixed mypy Duplicate Module Error:
- Renamed tests/integration/bigquery_v2/test_bigquery_queries.py to test_bigquery_queries_integration.py
Benefits
Testing
Unit Tests:
Linting:
- ruff check passes
- mypy passes (no duplicate module errors)
Test Execution: