Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do we need the PK checks in T&D if we can make the PK columns non-null? #30762

Closed
evantahler opened this issue Sep 26, 2023 · 2 comments · Fixed by #30779
Closed

Do we need the PK checks in T&D if we can make the PK columns non-null? #30762

evantahler opened this issue Sep 26, 2023 · 2 comments · Fixed by #30779
Assignees
Labels
team/destinations Destinations team's backlog

Comments

@evantahler
Copy link
Contributor

evantahler commented Sep 26, 2023

@aaronsteers thinks that Snowflake at least can really handle non-null columns. Redshift doesn't do that.

Also, a COUNT(1) is a full stable scan. WHERE EXISTS are much faster - do that. But in the success case, where there are no null PKs, it will be a full table scan, so it won't help that much

create or replace TABLE AIRBYTE_DEVELOP."v2-internal-staging"."USERS_FINAL" (
	"_airbyte_raw_id" VARCHAR(16777216) NOT NULL, -- Added by Airbyte, this id links the rows in these 2 tables (UUID)
	"_airbyte_extracted_at" TIMESTAMP_TZ(9) NOT NULL,
	"_airbyte_meta" VARIANT NOT NULL, -- Contains errors (and eventually other metadata like sync id, sync time, etc)
	"address" OBJECT,
	"occupation" VARCHAR(16777216),
	"gender" VARCHAR(16777216),
	"academic_degree" VARCHAR(16777216),
	"weight" NUMBER(38,0),
	"created_at" TIMESTAMP_TZ(9),
	"language" VARCHAR(16777216),
	"telephone" VARCHAR(16777216),
	"title" VARCHAR(16777216),
	"updated_at" TIMESTAMP_TZ(9),
	"nationality" VARCHAR(16777216),
	"blood_type" VARCHAR(16777216),
	"name" VARCHAR(16777216),
	"id" NUMBER(38,0), -- <--- I SHOULD BE NON NULL
	"age" NUMBER(38,0),
	"email" VARCHAR(16777216),
	"height" VARCHAR(16777216)
);
@evantahler evantahler added the team/destinations Destinations team's backlog label Sep 26, 2023
@evantahler
Copy link
Contributor Author

evantahler commented Sep 27, 2023

TESTING:

Snowflake: ✅

CREATE TABLE EVAN_TEST (
  ID NUMBER NOT NULL,
  DATA STRING
);

INSERT INTO EVAN_TEST (ID, DATA) VALUES (1, 'hi'); -- OK
INSERT INTO EVAN_TEST (ID, DATA) VALUES (1, 'hi'); -- OK, ID is not unique
INSERT INTO EVAN_TEST (ID, DATA) VALUES (NULL, 'hi'); -- FAILS!
INSERT INTO EVAN_TEST (DATA) VALUES ('hi'); -- FAILS!

BigQuery ✅

CREATE TABLE x.evan_test (
  id INT NOT NULL,
  data STRING
);

INSERT INTO x.evan_test (id, data) VALUES (1, 'hi'); -- OK
INSERT INTO x.evan_test (id, data) VALUES (1, 'hi'); -- OK, ID is not unique
INSERT INTO x.evan_test (id, data) VALUES (NULL, 'hi'); -- FAILS!
INSERT INTO x.evan_test (data) VALUES ('hi'); -- FAILS!

@evantahler
Copy link
Contributor Author

evantahler commented Sep 29, 2023

To understand the cost improvements, we can analyze this sync, which moves ~2 million faker records each sync to BigQuery. Looking at just the users table:

2023-09-29 22:07:19 destination > INFO i.a.i.d.b.t.BigQueryDestinationHandler(execute):79 Root-level job 3d3ffca8-5fda-46fe-ab8f-4bfc2beb418a completed in 137722 ms; processed 21622116380 bytes; billed for 21624782848 bytes
2023-09-29 22:07:21 destination > INFO i.a.i.d.b.t.BigQueryDestinationHandler(lambda$execute$1):101 Child sql BEGIN TRANSACTION completed in 84 ms; processed 0 bytes; billed for 0 bytes
2023-09-29 22:07:21 destination > INFO i.a.i.d.b.t.BigQueryDestinationHandler(lambda$execute$1):101 Child sql SELECT STRUCT<int64>(( SELECT COUNT(1) FROM `destinationsv2-perf-testing`.`airbyte_internal`.`v2_sta... completed in 1252 ms; processed 5350251221 bytes; billed for 5350883328 bytes</int64>
2023-09-29 22:07:21 destination > INFO i.a.i.d.b.t.BigQueryDestinationHandler(lambda$execute$1):101 Child sql INSERT INTO `destinationsv2-perf-testing`.`v2_staging_inserts`.`users` ( `address`, `occupation`, `g... completed in 115790 ms; processed 5810251221 bytes; billed for 5811208192 bytes
2023-09-29 22:07:21 destination > INFO i.a.i.d.b.t.BigQueryDestinationHandler(lambda$execute$1):101 Child sql DELETE FROM `destinationsv2-perf-testing`.`v2_staging_inserts`.`users` WHERE `_airbyte_raw_id` IN ( ... completed in 6524 ms; processed 4651362717 bytes; billed for 4651483136 bytes
2023-09-29 22:07:21 destination > INFO i.a.i.d.b.t.BigQueryDestinationHandler(lambda$execute$1):101 Child sql UPDATE `destinationsv2-perf-testing`.`airbyte_internal`.`v2_staging_inserts_raw__stream_users` SET `... completed in 8469 ms; processed 5810251221 bytes; billed for 5811208192 bytes
2023-09-29 22:07:21 destination > INFO i.a.i.d.b.t.BigQueryDestinationHandler(lambda$execute$1):101 Child sql COMMIT TRANSACTION completed in 386 ms; processed 0 bytes; billed for 0 bytes

This T&D query has a few sub-parts you can see broken out below. Of the total (21622116380 bytes), the NULL-PK check "cost" 5350883328 bytes - or ~25% of the total expense for this T&D operation.

Note: this sync is also not deduping the raw table, which is an added improvement provided by #30710, which also removes about ~25% of the total cost of the sync

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team/destinations Destinations team's backlog
Projects
None yet
1 participant