fix: yield partitions for unique stream slices in StreamSlicerPartitionGenerator #508
Conversation
…stream slices, avoiding duplicates
/autofix
📝 Walkthrough
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant StreamSlicerPartitionGenerator
    Caller->>StreamSlicerPartitionGenerator: generate()
    StreamSlicerPartitionGenerator->>StreamSlicerPartitionGenerator: For each stream_slice
    StreamSlicerPartitionGenerator->>StreamSlicerPartitionGenerator: _make_hashable(stream_slice)
    alt stream_slice not seen
        StreamSlicerPartitionGenerator->>StreamSlicerPartitionGenerator: Add to seen_slices
        StreamSlicerPartitionGenerator-->>Caller: yield partition
    else stream_slice seen
        StreamSlicerPartitionGenerator-->>StreamSlicerPartitionGenerator: Skip
    end
```
Suggested labels
Would you like to add a test to ensure the duplicate filtering logic remains robust in future changes, wdyt?
Actionable comments posted: 1
🧹 Nitpick comments (1)

`airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py` (lines 92-99): Excellent implementation to prevent duplicate partition processing.

The approach to tracking unique stream slices using a set and hashable representations is clean and effective. This directly addresses the PR objective of preventing errors when processing duplicate partitions.

A small suggestion: would it be valuable to log when a duplicate slice is skipped, perhaps at debug level? This could help with troubleshooting if needed. wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py` (2 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py
[error] 107-107: mypy error: Returning Any from function declared to return "Hashable" [no-any-return]
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Analyze (python)
🔇 Additional comments (1)

`airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py` (line 3): Import of `Hashable` type to support new functionality.

The addition of `Hashable` from the typing module aligns with the new functionality to track unique stream slices. This enables proper type hinting for the hashable representations.
… github.com:airbytehq/airbyte-python-cdk into lazebnyi/generate-partiitons-only-for-unique-slices
Accepting right now to unblock the oncall issue, but I think this will require some tests.

Do we fear this set could grow and cause memory issues at some point? Otherwise, I thought about grouping partition routing, but I fear it would mess up the state.
In theory, yes, we could have memory issues, since we store each partition. So maybe we should introduce a flag to skip duplicate validation in cases where we are sure that parent records are unique, or even consider switching off concurrency for them. Alternatively, we could use key-value storage for these cases, especially for tier0 or tier1 users.
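One way to soften the memory concern without a flag-only opt-out would be to store a fixed-size digest of each slice instead of its full hashable form, so the "seen" set grows by a constant 32 bytes per partition. This is a sketch of the idea, not anything in the PR; the `dedupe` parameter models the suggested opt-out flag:

```python
import hashlib
import json
from typing import Any, Iterable, Iterator, Mapping


def slice_digest(stream_slice: Mapping[str, Any]) -> bytes:
    # A compact 32-byte fingerprint per slice; sort_keys makes the
    # digest independent of dict key order.
    payload = json.dumps(stream_slice, sort_keys=True, default=str).encode()
    return hashlib.sha256(payload).digest()


def generate_unique(
    stream_slices: Iterable[Mapping[str, Any]],
    dedupe: bool = True,  # models the proposed flag for sources with known-unique parents
) -> Iterator[Mapping[str, Any]]:
    seen: set[bytes] = set()
    for stream_slice in stream_slices:
        if dedupe:
            digest = slice_digest(stream_slice)
            if digest in seen:
                continue
            seen.add(digest)
        yield stream_slice
```

The trade-off is a (negligible) chance of hash collision versus holding full slice contents in memory for long-running syncs.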
Some parent streams may return duplicate record IDs in the response. As a result, we can end up with several identical partitions. After one partition is processed, the next one causes an error because it is already closed.

For example, in the `source-stripe` connector, the `payout_balance_transactions` stream depends on `balance_transactions`, which uses the `events` endpoint to fetch data incrementally. This can lead to multiple events for the same parent with the same state but different values in other fields.