Do we need the PK checks in T&D if we can make the PK columns non-null? #30762

evantahler · 2023-09-26T17:42:50Z

@aaronsteers thinks that Snowflake at least can really handle non-null columns. Redshift doesn't do that.

Also, a COUNT(1) is a full stable scan. WHERE EXISTS are much faster - do that. But in the success case, where there are no null PKs, it will be a full table scan, so it won't help that much

create or replace TABLE AIRBYTE_DEVELOP."v2-internal-staging"."USERS_FINAL" (
	"_airbyte_raw_id" VARCHAR(16777216) NOT NULL, -- Added by Airbyte, this id links the rows in these 2 tables (UUID)
	"_airbyte_extracted_at" TIMESTAMP_TZ(9) NOT NULL,
	"_airbyte_meta" VARIANT NOT NULL, -- Contains errors (and eventually other metadata like sync id, sync time, etc)
	"address" OBJECT,
	"occupation" VARCHAR(16777216),
	"gender" VARCHAR(16777216),
	"academic_degree" VARCHAR(16777216),
	"weight" NUMBER(38,0),
	"created_at" TIMESTAMP_TZ(9),
	"language" VARCHAR(16777216),
	"telephone" VARCHAR(16777216),
	"title" VARCHAR(16777216),
	"updated_at" TIMESTAMP_TZ(9),
	"nationality" VARCHAR(16777216),
	"blood_type" VARCHAR(16777216),
	"name" VARCHAR(16777216),
	"id" NUMBER(38,0), -- <--- I SHOULD BE NON NULL
	"age" NUMBER(38,0),
	"email" VARCHAR(16777216),
	"height" VARCHAR(16777216)
);

evantahler · 2023-09-27T00:22:13Z

TESTING:

Snowflake: ✅

CREATE TABLE EVAN_TEST (
  ID NUMBER NOT NULL,
  DATA STRING
);

INSERT INTO EVAN_TEST (ID, DATA) VALUES (1, 'hi'); -- OK
INSERT INTO EVAN_TEST (ID, DATA) VALUES (1, 'hi'); -- OK, ID is not unique
INSERT INTO EVAN_TEST (ID, DATA) VALUES (NULL, 'hi'); -- FAILS!
INSERT INTO EVAN_TEST (DATA) VALUES ('hi'); -- FAILS!

BigQuery ✅

CREATE TABLE x.evan_test (
  id INT NOT NULL,
  data STRING
);

INSERT INTO x.evan_test (id, data) VALUES (1, 'hi'); -- OK
INSERT INTO x.evan_test (id, data) VALUES (1, 'hi'); -- OK, ID is not unique
INSERT INTO x.evan_test (id, data) VALUES (NULL, 'hi'); -- FAILS!
INSERT INTO x.evan_test (data) VALUES ('hi'); -- FAILS!

evantahler · 2023-09-29T22:36:56Z

To understand the cost improvements, we can analyze this sync, which moves ~2 million faker records each sync to BigQuery. Looking at just the users table:

2023-09-29 22:07:19 destination > INFO i.a.i.d.b.t.BigQueryDestinationHandler(execute):79 Root-level job 3d3ffca8-5fda-46fe-ab8f-4bfc2beb418a completed in 137722 ms; processed 21622116380 bytes; billed for 21624782848 bytes
2023-09-29 22:07:21 destination > INFO i.a.i.d.b.t.BigQueryDestinationHandler(lambda$execute$1):101 Child sql BEGIN TRANSACTION completed in 84 ms; processed 0 bytes; billed for 0 bytes
2023-09-29 22:07:21 destination > INFO i.a.i.d.b.t.BigQueryDestinationHandler(lambda$execute$1):101 Child sql SELECT STRUCT<int64>(( SELECT COUNT(1) FROM `destinationsv2-perf-testing`.`airbyte_internal`.`v2_sta... completed in 1252 ms; processed 5350251221 bytes; billed for 5350883328 bytes</int64>
2023-09-29 22:07:21 destination > INFO i.a.i.d.b.t.BigQueryDestinationHandler(lambda$execute$1):101 Child sql INSERT INTO `destinationsv2-perf-testing`.`v2_staging_inserts`.`users` ( `address`, `occupation`, `g... completed in 115790 ms; processed 5810251221 bytes; billed for 5811208192 bytes
2023-09-29 22:07:21 destination > INFO i.a.i.d.b.t.BigQueryDestinationHandler(lambda$execute$1):101 Child sql DELETE FROM `destinationsv2-perf-testing`.`v2_staging_inserts`.`users` WHERE `_airbyte_raw_id` IN ( ... completed in 6524 ms; processed 4651362717 bytes; billed for 4651483136 bytes
2023-09-29 22:07:21 destination > INFO i.a.i.d.b.t.BigQueryDestinationHandler(lambda$execute$1):101 Child sql UPDATE `destinationsv2-perf-testing`.`airbyte_internal`.`v2_staging_inserts_raw__stream_users` SET `... completed in 8469 ms; processed 5810251221 bytes; billed for 5811208192 bytes
2023-09-29 22:07:21 destination > INFO i.a.i.d.b.t.BigQueryDestinationHandler(lambda$execute$1):101 Child sql COMMIT TRANSACTION completed in 386 ms; processed 0 bytes; billed for 0 bytes

This T&D query has a few sub-parts you can see broken out below. Of the total (21622116380 bytes), the NULL-PK check "cost" 5350883328 bytes - or ~25% of the total expense for this T&D operation.

Note: this sync is also not deduping the raw table, which is an added improvement provided by #30710, which also removes about ~25% of the total cost of the sync

evantahler added the team/destinations Destinations team's backlog label Sep 26, 2023

evantahler self-assigned this Sep 27, 2023

evantahler mentioned this issue Sep 27, 2023

T&D: Use non-null columns for PKs rather then scanning raw table to ensure all records have PKs #30779

Merged

edgao closed this as completed in #30779 Oct 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do we need the PK checks in T&D if we can make the PK columns non-null? #30762

Do we need the PK checks in T&D if we can make the PK columns non-null? #30762

evantahler commented Sep 26, 2023 •

edited

Loading

evantahler commented Sep 27, 2023 •

edited

Loading

evantahler commented Sep 29, 2023 •

edited

Loading

Do we need the PK checks in T&D if we can make the PK columns non-null? #30762

Do we need the PK checks in T&D if we can make the PK columns non-null? #30762

Comments

evantahler commented Sep 26, 2023 • edited Loading

evantahler commented Sep 27, 2023 • edited Loading

Snowflake: ✅

BigQuery ✅

evantahler commented Sep 29, 2023 • edited Loading

evantahler commented Sep 26, 2023 •

edited

Loading

evantahler commented Sep 27, 2023 •

edited

Loading

evantahler commented Sep 29, 2023 •

edited

Loading