feat(ingest/snowflake,bigquery): Stateful time window ingestion for queries v2 with bucket alignment #15040

treff7es · 2025-10-17T13:08:41Z

Summary

This PR implements unified stateful time window ingestion for the BigQuery and Snowflake queries extractors (queries v2), replacing the previous lineage-specific approach with a more flexible time window-based handler. The new implementation supports bucket alignment for usage statistics aggregation and works consistently across all queries v2 features (lineage, usage, operations, queries).

Changes

Feature Implementation

Unified State Handler:

Replaced RedundantLineageRunSkipHandler with RedundantQueriesRunSkipHandler
Handler is now always active when provided (not conditional on include_lineage)
State tracking now works for all queries v2 features, not just lineage

Bucket Alignment for Usage Statistics:

Added automatic start time alignment to bucket boundaries when include_usage_statistics=True
Supports both daily and hourly bucket durations via config.window.bucket_duration
Ensures complete bucket periods for accurate usage aggregation:
- Daily buckets: Rounds start time down to 00:00:00 of the day
- Hourly buckets: Rounds start time down to HH:00:00 of the hour
Alignment uses existing get_time_bucket() function in time_window_config.py
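
To make the rounding rule above concrete, here is a minimal standalone sketch. It does not reproduce get_time_bucket(); the names Bucket and align_start are hypothetical and exist only for this illustration.

```python
from datetime import datetime, timezone
from enum import Enum


class Bucket(Enum):
    """Hypothetical stand-in for the configured bucket duration."""

    DAY = "DAY"
    HOUR = "HOUR"


def align_start(
    start: datetime, bucket: Bucket, include_usage_statistics: bool
) -> datetime:
    """Round start down to the bucket boundary, only when usage stats are on."""
    if not include_usage_statistics:
        # No usage aggregation requested, so the window is left untouched.
        return start
    if bucket is Bucket.DAY:
        # Daily buckets: 00:00:00 of the same day.
        return start.replace(hour=0, minute=0, second=0, microsecond=0)
    # Hourly buckets: HH:00:00 of the same hour.
    return start.replace(minute=0, second=0, microsecond=0)


start = datetime(2024, 1, 1, 14, 37, 12, tzinfo=timezone.utc)
print(align_start(start, Bucket.DAY, True))    # 2024-01-01 00:00:00+00:00
print(align_start(start, Bucket.HOUR, True))   # 2024-01-01 14:00:00+00:00
print(align_start(start, Bucket.HOUR, False))  # 2024-01-01 14:37:12+00:00
```

Rounding down (rather than up) means the first bucket is always re-extracted in full, which is what keeps per-bucket usage counts complete.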

State Updates:

update_state() now includes bucket_duration parameter for proper state tracking
State is updated after successful extraction for all features
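
As a rough illustration of the state flow described above, the sketch below keeps the last extracted window in memory and suggests the next one. It is a simplified assumption, not the logic of RedundantQueriesRunSkipHandler; WindowState, InMemoryWindowTracker, and suggest_window are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass
class WindowState:
    """Hypothetical record of the last successfully extracted window."""

    start: datetime
    end: datetime
    bucket_duration: str


class InMemoryWindowTracker:
    """Toy tracker: remembers one window and trims the next run to skip it."""

    def __init__(self) -> None:
        self._state: Optional[WindowState] = None

    def suggest_window(
        self, start: datetime, end: datetime
    ) -> Tuple[datetime, datetime]:
        if self._state is None:
            # First run: nothing to skip.
            return start, end
        # Resume where the last run ended, clamped to the requested window.
        new_start = max(start, min(self._state.end, end))
        return new_start, end

    def update_state(
        self, start: datetime, end: datetime, bucket_duration: str
    ) -> None:
        # Called only after extraction succeeds, mirroring the PR description.
        self._state = WindowState(start, end, bucket_duration)


def day(n: int) -> datetime:
    return datetime(2024, 1, n, tzinfo=timezone.utc)


tracker = InMemoryWindowTracker()
first = tracker.suggest_window(day(1), day(8))   # full window on the first run
tracker.update_state(*first, bucket_duration="DAY")
second = tracker.suggest_window(day(2), day(9))  # starts at day(8), ends at day(9)
```

Storing the bucket duration alongside the window lets a later run detect that the granularity changed; how the real handler reacts to such a change is not covered here.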

Files Modified:

src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
src/datahub/ingestion/source/snowflake/snowflake_queries.py
src/datahub/ingestion/source/state/redundant_run_skip_handler.py

Test Coverage

Updated and enhanced test suites for both BigQuery and Snowflake:

Files Modified:

tests/unit/bigquery/test_bigquery_queries.py
tests/unit/snowflake/test_snowflake_queries.py

Test Updates:

Renamed test classes from *StatefulLineageIngestion to *StatefulTimeWindowIngestion
Updated imports to use RedundantQueriesRunSkipHandler
Removed lineage-specific conditional tests (handler now always active)
Updated update_state assertions to include BucketDuration parameter
Added comprehensive bucket alignment tests:
- test_bucket_alignment_with_usage_statistics - Daily bucket alignment
- test_bucket_alignment_hourly_with_usage_statistics - Hourly bucket alignment
- test_no_bucket_alignment_without_usage_statistics - No alignment when disabled

Bug Fix

Fixed mypy Duplicate Module Error:

Renamed tests/integration/bigquery_v2/test_bigquery_queries.py to test_bigquery_queries_integration.py
Resolves mypy duplicate module name conflict with unit test file

Benefits

More Flexible State Management: State tracking now works for all queries v2 features, not just lineage
Accurate Usage Aggregation: Bucket alignment ensures complete time periods for usage statistics
Configurable Granularity: Supports both hourly and daily bucket durations
Better Incremental Ingestion: Properly tracks time windows for all feature combinations
Cleaner Architecture: Unified handler reduces complexity and conditional logic

Testing

Unit Tests:

✅ All 65 tests pass (8 BigQuery + 57 Snowflake)
✅ New bucket alignment tests verify both daily and hourly configurations
✅ Tests cover handler presence/absence, time window adjustment, state updates

Linting:

✅ ruff check passes
✅ mypy passes (no duplicate module errors)

Test Execution:

pytest tests/unit/bigquery/test_bigquery_queries.py tests/unit/snowflake/test_snowflake_queries.py -v
# 65 passed in 2.07s

Usage Example

source:
  type: bigquery-queries
  config:
    include_usage_statistics: true
    window:
      start_time: "-7d"
      end_time: "2024-01-01"
      bucket_duration: DAY  # or HOUR for hourly granularity
    stateful_ingestion:
      enabled: true

When stateful ingestion is enabled with usage statistics:
- Start time will be automatically aligned to bucket boundaries
- State is preserved across runs for incremental ingestion
- Works consistently for all queries v2 features

Breaking Changes

None - this is an enhancement that maintains backward compatibility. Existing configurations will continue to work as before.

Related Issues

Addresses the need for unified state management across all queries v2 features and ensures accurate usage statistics aggregation through proper bucket alignment.

codecov · 2025-10-17T13:11:51Z

Codecov Report

❌ Patch coverage is 78.57143% with 15 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
...gestion/source/state/redundant_run_skip_handler.py	30.00%	7 Missing ⚠️
.../ingestion/source/bigquery_v2/queries_extractor.py	86.66%	2 Missing ⚠️
.../ingestion/source/state/stateful_ingestion_base.py	80.00%	2 Missing ⚠️
...c/datahub/ingestion/source/bigquery_v2/bigquery.py	75.00%	1 Missing ⚠️
...ub/ingestion/source/bigquery_v2/bigquery_config.py	83.33%	1 Missing ⚠️
...hub/ingestion/source/snowflake/snowflake_config.py	83.33%	1 Missing ⚠️
...datahub/ingestion/source/snowflake/snowflake_v2.py	75.00%	1 Missing ⚠️

sgomezvillamor · 2025-10-21T08:55:31Z

replacing the previous lineage-specific approach with a more flexible time window-based handler

This impacts user-facing config, right? A before/after example would help clarify what we are actually addressing here.

sgomezvillamor · 2025-10-21T08:59:41Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py

+                    "enable_stateful_lineage_ingestion and enable_stateful_usage_ingestion are deprecated "
+                    "when using use_queries_v2=True. These configs only work with the legacy (non-queries v2) extraction path. "
+                    "For queries v2, use enable_stateful_time_window instead to enable stateful ingestion "
+                    "for the unified time window extraction (lineage + usage + operations + queries)."


Additionally, the docs for enable_stateful_lineage_ingestion and enable_stateful_usage_ingestion should be updated to mention that they only work with use_queries_v2=False.

sgomezvillamor · 2025-10-21T08:59:52Z

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py

+                    "enable_stateful_lineage_ingestion and enable_stateful_usage_ingestion are deprecated "
+                    "when using use_queries_v2=True. These configs only work with the legacy (non-queries v2) extraction path. "
+                    "For queries v2, use enable_stateful_time_window instead to enable stateful ingestion "
+                    "for the unified time window extraction (lineage + usage + operations + queries)."


Additionally, the docs for enable_stateful_lineage_ingestion and enable_stateful_usage_ingestion should be updated to mention that they only work with use_queries_v2=False.
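
For readers who want to see the shape of such a check, here is a minimal sketch in plain Python rather than a pydantic validator. It assumes the check only warns (it does not reject the config) and that it fires only when a legacy flag is explicitly truthy; check_queries_v2_stateful_flags is a hypothetical name, and the message text is taken from the diff above.

```python
import warnings
from typing import Any, Dict

# Legacy flags named in the deprecation message.
LEGACY_FLAGS = (
    "enable_stateful_lineage_ingestion",
    "enable_stateful_usage_ingestion",
)


def check_queries_v2_stateful_flags(values: Dict[str, Any]) -> Dict[str, Any]:
    """Warn when legacy stateful flags are combined with use_queries_v2=True."""
    uses_v2 = bool(values.get("use_queries_v2"))
    legacy_set = any(values.get(flag) for flag in LEGACY_FLAGS)
    if uses_v2 and legacy_set:
        warnings.warn(
            "enable_stateful_lineage_ingestion and enable_stateful_usage_ingestion "
            "are deprecated when using use_queries_v2=True. "
            "For queries v2, use enable_stateful_time_window instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    # Validators conventionally return the (possibly unchanged) values.
    return values


check_queries_v2_stateful_flags(
    {"use_queries_v2": True, "enable_stateful_lineage_ingestion": True}
)
```

With use_queries_v2=False the same input passes silently, which matches the note that the legacy flags only apply to the non-queries v2 path.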

sgomezvillamor · 2025-10-21T09:04:33Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py

        self.structured_report = structured_report
+        self.redundant_run_skip_handler = redundant_run_skip_handler
+
+        self.start_time, self.end_time = self._get_time_window()


If we are not already tracking them, we should log the effective start/end times or add them to the report.
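
One way to act on this suggestion is sketched below, under the assumption that the report is a simple object with two optional timestamp fields. QueriesWindowReport and record_effective_window are hypothetical names, not part of the PR.

```python
import logging
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class QueriesWindowReport:
    """Hypothetical report fragment holding the window actually used."""

    effective_start_time: Optional[datetime] = None
    effective_end_time: Optional[datetime] = None


def record_effective_window(
    report: QueriesWindowReport, start: datetime, end: datetime
) -> None:
    """Store the effective window on the report and log it once."""
    report.effective_start_time = start
    report.effective_end_time = end
    logger.info(
        "Effective time window: %s to %s", start.isoformat(), end.isoformat()
    )


report = QueriesWindowReport()
record_effective_window(
    report,
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 8, tzinfo=timezone.utc),
)
```

Recording the values after state-based adjustment and bucket alignment, rather than the configured ones, is what makes the report useful for debugging skipped or shortened runs.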

sgomezvillamor · 2025-10-21T09:07:55Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py

+    @root_validator(pre=False, skip_on_failure=True)
+    def validate_queries_v2_stateful_ingestion(cls, values: Dict) -> Dict:


FYI I'm replacing these v1 validators in #15057

sgomezvillamor · 2025-10-21T09:08:04Z

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py

+    @root_validator(pre=False, skip_on_failure=True)
+    def validate_queries_v2_stateful_ingestion(cls, values: Dict) -> Dict:


FYI I'm replacing these v1 validators in #15057

sgomezvillamor

LGTM

treff7es · 2025-10-21T11:21:21Z

Documentation updates completed:

Changes made:

Updated field documentation in stateful_ingestion_base.py for:

enable_stateful_lineage_ingestion (lines 100-107)
enable_stateful_usage_ingestion (lines 151-158)

Both fields now include the following note:

NOTE: This only works with use_queries_v2=False (legacy extraction path). For queries v2, use enable_stateful_time_window instead.

These changes automatically apply to both BigQuery and Snowflake configs since they inherit from these base mixins (StatefulLineageConfigMixin and StatefulUsageConfigMixin).

github-actions bot added the ingestion label Oct 17, 2025

datahub-cyborg bot added the needs-review label Oct 17, 2025

sgomezvillamor reviewed Oct 21, 2025

datahub-cyborg bot added pending-submitter-response and removed needs-review labels Oct 21, 2025

sgomezvillamor reviewed Oct 21, 2025

sgomezvillamor approved these changes Oct 21, 2025

datahub-cyborg bot added pending-submitter-merge and removed pending-submitter-response labels Oct 21, 2025

treff7es force-pushed the window_state branch from ed777cf to 7997267 Compare October 21, 2025 12:11

treff7es force-pushed the window_state branch from 7997267 to b69d90c Compare October 21, 2025 13:43

treff7es added 2 commits October 21, 2025 17:25

Add way to store time window in statefile

8c4f096

Fix tests

0f130e1

treff7es added 2 commits October 21, 2025 17:25

Fix 3.9 linter issue

da3c362

Update doc

a8545b4

treff7es force-pushed the window_state branch from b69d90c to a8545b4 Compare October 21, 2025 15:25

treff7es merged commit e9becdd into master Oct 22, 2025
90 of 107 checks passed

treff7es deleted the window_state branch October 22, 2025 09:16
