Do we need the PK checks in T&D if we can make the PK columns non-null? #30762
TESTING:

Snowflake: ✅

```sql
CREATE TABLE EVAN_TEST (
  ID NUMBER NOT NULL,
  DATA STRING
);
INSERT INTO EVAN_TEST (ID, DATA) VALUES (1, 'hi');    -- OK
INSERT INTO EVAN_TEST (ID, DATA) VALUES (1, 'hi');    -- OK, ID is not unique
INSERT INTO EVAN_TEST (ID, DATA) VALUES (NULL, 'hi'); -- FAILS!
INSERT INTO EVAN_TEST (DATA) VALUES ('hi');           -- FAILS!
```

BigQuery: ✅

```sql
CREATE TABLE x.evan_test (
  id INT NOT NULL,
  data STRING
);
INSERT INTO x.evan_test (id, data) VALUES (1, 'hi');    -- OK
INSERT INTO x.evan_test (id, data) VALUES (1, 'hi');    -- OK, ID is not unique
INSERT INTO x.evan_test (id, data) VALUES (NULL, 'hi'); -- FAILS!
INSERT INTO x.evan_test (data) VALUES ('hi');           -- FAILS!
```
To understand the cost improvements, we can analyze this sync, which moves ~2 million faker records to BigQuery on each run. Looking at just the T&D query:

This T&D query has a few sub-parts, broken out below. Of the total (21,622,116,380 bytes), the NULL-PK check "cost" 5,350,883,328 bytes, or ~25% of the total expense for this T&D operation. Note: this sync is also not deduping the raw table; that deduping is an added improvement provided by #30710, which removes another ~25% of the total cost of the sync.
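For context, here is a minimal sketch of the kind of NULL-PK check that cost refers to, assuming a raw table named `airbyte_internal.evan_test_raw` with the record payload in an `_airbyte_data` column (both names are hypothetical; the actual SQL generated by the destination may differ):

```sql
-- Hypothetical NULL-PK check: count raw records whose primary key is missing.
-- Table and column names are assumptions, not the destination's real identifiers.
SELECT COUNT(1) AS null_pk_count
FROM airbyte_internal.evan_test_raw
WHERE JSON_VALUE(_airbyte_data, '$.id') IS NULL;
```

Because this filter inspects every raw record, BigQuery bills it as a scan over the payload column for the whole table, which is consistent with the ~25% share reported above.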
@aaronsteers thinks that Snowflake, at least, can genuinely enforce non-null columns; Redshift doesn't do that.
Also, a COUNT(1) is a full table scan. A WHERE EXISTS check is much faster, so do that. But in the success case, where there are no null PKs, it will still scan the full table, so it won't help that much.
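A sketch of that suggestion, using the same hypothetical table and column names as above: the EXISTS form can stop at the first offending row, whereas COUNT(1) always reads every row.

```sql
-- COUNT(1) form: must read every matching row to produce the count.
SELECT COUNT(1) AS null_pk_count
FROM airbyte_internal.evan_test_raw
WHERE JSON_VALUE(_airbyte_data, '$.id') IS NULL;

-- EXISTS form: can short-circuit as soon as one null PK is found.
-- In the success case (no null PKs) it still ends up scanning the whole table.
SELECT EXISTS (
  SELECT 1
  FROM airbyte_internal.evan_test_raw
  WHERE JSON_VALUE(_airbyte_data, '$.id') IS NULL
) AS has_null_pk;
```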