T&D Should not dedupe raw table (and maintain current CDC delete behavior) #30710

Closed
evantahler opened this issue Sep 23, 2023 · 4 comments · Fixed by #31520
Labels: team/destinations (Destinations team's backlog)

Comments

@evantahler
Contributor

evantahler commented Sep 23, 2023

One of the options we have to speed up T&D is to skip deduplication of the raw table, i.e. to stop running a statement like this one (code):

DELETE FROM ${raw_table_id}
  WHERE "_airbyte_raw_id" NOT IN (
     SELECT "_AIRBYTE_RAW_ID" FROM ${final_table_id}
  );

Working on this story includes:

  1. Learn what old normalization did. I think that old normalization did not dedupe the raw table. This means that we've been comfortable trading increased disk usage for faster performance/lower cost. I think that, at least for cloud data warehouses, this is a reasonable tradeoff (storage is very cheap, compute is expensive). As we partition tables by extracted_at, we are already in a good position to ignore old raw records in future T&D runs.
  2. Confirm that removing the deduplication of the raw tables saves significant time or expense. We should only do this if there's a clear win here.
  3. Decide whether we disable raw-table deduplication for all syncs, or whether this becomes another advanced option (e.g. Allow users to opt-out of Destinations V2 Typing and Deduping #30454).
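
A minimal sketch of the tradeoff in item 1, not the actual T&D implementation: the raw table is never pruned, and each run reads only rows at or after a cutoff on the extraction timestamp. It uses SQLite through Python's `sqlite3` so that it runs anywhere. The table name `raw_table`, the column `_airbyte_extracted_at`, and the helper `unprocessed_raw_ids` are assumptions made for illustration.

```python
import sqlite3


def unprocessed_raw_ids(conn, min_extracted_at):
    """Return raw ids extracted at or after the cutoff.

    Older rows stay in the raw table; they are just not read.
    """
    rows = conn.execute(
        'SELECT "_airbyte_raw_id" FROM raw_table '
        'WHERE "_airbyte_extracted_at" >= ? '
        'ORDER BY "_airbyte_raw_id"',
        (min_extracted_at,),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical raw table: two versions of record 1, one version of record 2.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE raw_table ('
    '"_airbyte_raw_id" TEXT, "_airbyte_extracted_at" TEXT, "_airbyte_data" TEXT)'
)
conn.executemany(
    "INSERT INTO raw_table VALUES (?, ?, ?)",
    [
        ("a", "2023-09-01T00:00:00Z", '{"id": 1}'),
        ("b", "2023-09-02T00:00:00Z", '{"id": 1}'),
        ("c", "2023-09-03T00:00:00Z", '{"id": 2}'),
    ],
)

# Only rows from the second sync onward are considered by this run.
new_ids = unprocessed_raw_ids(conn, "2023-09-02T00:00:00Z")
```

In a warehouse that partitions on the extraction timestamp, a filter like this lets the engine skip old partitions, which is why keeping duplicates in the raw table mostly costs storage rather than compute.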
@evantahler evantahler added the team/destinations Destinations team's backlog label Sep 23, 2023
@evantahler evantahler self-assigned this Sep 25, 2023
@evantahler evantahler removed their assignment Sep 27, 2023
@evantahler
Contributor Author

Un-assigning myself. For whoever picks this up next, feel free to keep working on #30742 if you like (or not). Note this comment from @edgao: #30742 (comment)

@edgao edgao assigned edgao and unassigned edgao Oct 2, 2023
@evantahler evantahler changed the title T&D Should not dedupe raw table T&D Should not dedupe raw table (BLOCKED FOR DISCUSSION) Oct 3, 2023
@evantahler
Contributor Author

Grooming:

  • We think that we need to delete the raw table deletes to handle out-of-order inserts + cdc-delete
  • ... we stop doing CDC deletes (_ab_deleted_at) for both the raw and final tables (tombstone column)
    • This would be a big breaking change we need to manage properly

The goal is to research whether we can skip deduping the raw table and still maintain our current CDC behavior (which is really deleting rows in the final table). If so, we should do this work now. If not, this work is blocked on changing what CDC deletes do (e.g. tombstone column).

@evantahler evantahler changed the title T&D Should not dedupe raw table (BLOCKED FOR DISCUSSION) T&D Should not dedupe raw table (and maintain current CDC delete behavior) Oct 10, 2023
@edgao edgao self-assigned this Oct 16, 2023
@edgao
Contributor

edgao commented Oct 16, 2023

maybe figured out how to do this as part of #30764. will unassign myself if that doesn't pan out.

@edgao
Contributor

edgao commented Oct 16, 2023

pretty sure I have a line on this. I think the problem I ran into last week was that we were deduping the final table in a way that didn't interact well with duplicate _airbyte_raw_ids (I forget the specifics). Realized that we can just dedup new raw records before upserting them, which means we can safely not dedup the raw table.

(yes, that's somewhat incoherent. no, I don't remember the exact issue I was facing last week. 🤷)
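
For readers following along, here is a rough sketch of the idea in the comment above: dedupe only the new raw records, then upsert them (or apply the CDC delete) against the final table, and never delete from the raw table. This is an illustration under stated assumptions, not the code from the linked PRs. The table layout, the `deleted_at` column, and the helper `apply_new_raw_records` are all hypothetical, and SQLite stands in for the warehouse.

```python
import sqlite3


def apply_new_raw_records(conn, since):
    """Dedupe raw records extracted at or after `since`, then apply them."""
    new_rows = conn.execute(
        'SELECT "_airbyte_raw_id", id, name, deleted_at FROM raw_table '
        'WHERE "_airbyte_extracted_at" >= ? '
        'ORDER BY "_airbyte_extracted_at"',
        (since,),
    ).fetchall()

    # Rows arrive oldest to newest, so the last row seen per key wins.
    latest = {}
    for raw_id, pk, name, deleted_at in new_rows:
        latest[pk] = (raw_id, name, deleted_at)

    for pk, (raw_id, name, deleted_at) in latest.items():
        if deleted_at is not None:
            # CDC delete: remove the row from the final table.
            conn.execute("DELETE FROM final_table WHERE id = ?", (pk,))
        else:
            # Upsert: id is the primary key, so REPLACE overwrites any old row.
            conn.execute(
                "INSERT OR REPLACE INTO final_table "
                '(id, name, "_airbyte_raw_id") VALUES (?, ?, ?)',
                (pk, name, raw_id),
            )
    # The raw table is left untouched, duplicates included.


conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE raw_table ("_airbyte_raw_id" TEXT, '
    '"_airbyte_extracted_at" TEXT, id INTEGER, name TEXT, deleted_at TEXT)'
)
conn.execute(
    'CREATE TABLE final_table '
    '(id INTEGER PRIMARY KEY, name TEXT, "_airbyte_raw_id" TEXT)'
)
conn.executemany(
    "INSERT INTO raw_table VALUES (?, ?, ?, ?, ?)",
    [
        ("r1", "2023-10-01", 1, "alice", None),
        ("r2", "2023-10-02", 1, "alicia", None),
        ("r3", "2023-10-02", 2, "bob", None),
        ("r4", "2023-10-03", 2, "bob", "2023-10-03"),
    ],
)
apply_new_raw_records(conn, "2023-10-01")
final_rows = conn.execute("SELECT id, name FROM final_table").fetchall()
```

This sketch orders only by extraction time within a single run. It does not address records that arrive out of order across runs, which is the harder case raised in the grooming notes.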
