Destination bigquery: Use merge statement to do T+D #31529
Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉 To merge this PR, ensure the following has been done/considered for each connector added or updated:

If the checklist is complete, but the CI check is failing, …
Force-pushed from 60a309a to d6b581a
I am not a SQL expert, but overall LGTM.
cdcDeleteClause = "WHEN MATCHED AND new_record._ab_cdc_deleted_at IS NOT NULL AND " + cursorComparison + " THEN DELETE";
// And skip insertion entirely if there's no matching record.
// (This is possible if a single T+D batch contains both an insertion and deletion for the same PK)
cdcSkipInsertClause = "AND new_record._ab_cdc_deleted_at IS NULL";
To confirm my understanding: in the scenario of WHEN NOT MATCHED AND new_record._ab_cdc_deleted_at IS NOT NULL, we don't insert (or do anything else) because the record has been deleted?

I'm also confused by the naming cdcSkipInsertClause, which is then immediately followed by an INSERT on L549.
> confirm my understanding

Correct. E.g. if a single T+D batch includes a record for INSERT INTO users ... and also DELETE FROM users ..., then new_record will have a non-null _ab_cdc_deleted_at, and we don't want to insert it.
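To make that concrete, here's a small illustrative query (a hypothetical users stream with primary key id and cursor updated_at; not the connector's generated SQL) showing why only the deletion record survives deduplication in that case:

```sql
-- Hypothetical batch for primary key id = 1: an insert followed by a CDC delete.
WITH batch AS (
  SELECT 1 AS id, 'alice' AS name,
         TIMESTAMP '2023-10-23 10:00:00' AS updated_at,
         CAST(NULL AS TIMESTAMP) AS _ab_cdc_deleted_at
  UNION ALL
  SELECT 1, 'alice',
         TIMESTAMP '2023-10-23 10:05:00',
         TIMESTAMP '2023-10-23 10:05:00'
)
-- Keep only the record with the latest cursor per primary key: the surviving row is the
-- deletion record (non-null _ab_cdc_deleted_at), so the MERGE has nothing left to insert.
SELECT * EXCEPT (rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM batch
)
WHERE rn = 1
```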
> cdcSkipInsertClause

yeah I'm not great at naming :/ how about... cdcDeletionClause? idk
prerelease publish running https://github.com/airbytehq/airbyte/actions/runs/6643634205
destination-bigquery test report (commit …)
| Step | Result |
|---|---|
| Build connector tar | ✅ |
| Java Connector Unit Tests | ✅ |
| Build destination-bigquery docker image for platform(s) linux/amd64 | ✅ |
| Java Connector Integration Tests | ✅ |
| Validate metadata for destination-bigquery | ✅ |
| Connector version semver check | ✅ |
| Connector version increment check | ❌ |
| QA checks | ✅ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PRs that are ready for review. Please set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool, using the following command:
airbyte-ci connectors --name=destination-bigquery test
part of #30764
The merge statement replaces three separate statements used in the previous (insert+delete) version.
Example Bigquery SQL MERGE
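A rough sketch of the shape of such a statement, using hypothetical names (my_dataset.users_final target, users_raw_dedup source, primary key id, cursor updated_at); the connector's actual generated SQL is more involved:

```sql
-- Minimal sketch with hypothetical names; not the connector's exact output.
MERGE `my_dataset`.`users_final` AS target_table
USING (
  -- Keep only the latest record per primary key from the batch being typed and deduped.
  SELECT * EXCEPT (rn)
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM `my_dataset`.`users_raw_dedup`
  )
  WHERE rn = 1
) AS new_record
ON target_table.id = new_record.id
-- cdcDeleteClause: drop the existing row when the incoming record is a CDC delete
-- and wins the cursor comparison.
WHEN MATCHED AND new_record._ab_cdc_deleted_at IS NOT NULL
  AND new_record.updated_at > target_table.updated_at THEN
  DELETE
-- Normal dedup path: the newer record overwrites the existing row.
WHEN MATCHED AND new_record._ab_cdc_deleted_at IS NULL
  AND new_record.updated_at > target_table.updated_at THEN
  UPDATE SET name = new_record.name,
             updated_at = new_record.updated_at,
             _ab_cdc_deleted_at = new_record._ab_cdc_deleted_at
-- cdcSkipInsertClause ("AND new_record._ab_cdc_deleted_at IS NULL"): don't insert a record
-- that arrives already deleted and has no matching row in the final table.
WHEN NOT MATCHED AND new_record._ab_cdc_deleted_at IS NULL THEN
  INSERT (id, name, updated_at, _ab_cdc_deleted_at)
  VALUES (new_record.id, new_record.name, new_record.updated_at, new_record._ab_cdc_deleted_at)
```

Folding deduplication, CDC deletes, and upserts into a single statement is what the cdcDeleteClause / cdcSkipInsertClause discussion above refers to.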
Performance/cost notes

I ran a source-faker sync with 10M records to populate a dataset, and then ran some 1M-record syncs to simulate incremental updates. tl;dr: this is a significant improvement in most cases.
Performance on master: (i.e. still deduping raw tables AND still using insert+delete)
Performance from this branch: (i.e. not deduping raw tables AND using merge)
However, running with source-faker emitting 10M records (i.e. rewriting the entire table) in incremental mode resulted in longer T+D times (but still lower costs). IMO that's fine; typical incremental syncs aren't rewriting the entire dataset (and they should probably just use full refresh/overwrite in that case anyway).