DENG-9727: Added mode last struct retain nulls udf #8130
Closed
9 commits
8d3a25d Added mode last struct (wwyc)
601af84 Added udf comment (wwyc)
303f6aa Removed extra select (wwyc)
dd988a3 Fixed null in tests (wwyc)
9359eef Fixed tests with null values (wwyc)
c604122 Renamed folder (wwyc)
adc278e Fixed udf tests (wwyc)
8dcabf6 Fixed comment (wwyc)
46e7322 Use struct equals (wwyc)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
--- | ||
description: Given an array of structs, return the most frequently occurring full struct; | ||
break ties by the latest occurrence (mode-last). Equality is on the entire struct, not per-field. | ||
Nulls are retained. | ||
friendly_name: Map Mode Last Struct Retain Nulls |
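
To make the described semantics concrete, here is a minimal Python model of the behaviour. It is a sketch, not the UDF itself (which is BigQuery SQL): structs are modeled as tuples, `None` stands in for SQL NULL, and the name `mode_last_struct` is hypothetical. Returning `None` for an empty array is an assumption about how a scalar subquery with no rows behaves.

```python
from typing import Optional, Sequence, Tuple

# One tuple per struct; None models a NULL field.
Struct = Tuple[Optional[str], ...]


def mode_last_struct(entries: Sequence[Struct]) -> Optional[Struct]:
    """Return the most frequent tuple, breaking ties by latest occurrence."""
    stats = {}  # struct -> (frequency, last position seen)
    for pos, entry in enumerate(entries):
        freq, _ = stats.get(entry, (0, -1))
        stats[entry] = (freq + 1, pos)
    if not stats:
        return None  # assumption: an empty array yields NULL
    # Compare (frequency, last position): higher frequency wins,
    # and among equal frequencies the later last position wins.
    return max(stats, key=lambda entry: stats[entry])


berlin = ("Berlin", "BE", None, "DE")
munich = ("Munich", "BY", None, "DE")

# Berlin x2 and Munich x2, with Munich last: the tie goes to Munich.
print(mode_last_struct([berlin, munich, berlin, munich]))
```

Because whole tuples are used as dictionary keys, two entries count as the same value only when every field matches, including fields that are `None`.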
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,223 @@ | ||
/* | ||
Return the most frequent STRUCT from an array, breaking ties by latest occurrence | ||
(i.e., mode_last_retain_nulls over whole structs). Use to keep related fields aggregated together. | ||
See also: `map.mode_last`, which determines each value using `stats.mode_last`. | ||
*/ | ||
CREATE OR REPLACE FUNCTION map.mode_last_struct_retain_nulls(entries ANY TYPE) AS ( | ||
( | ||
SELECT AS STRUCT | ||
s.* | ||
FROM | ||
( | ||
SELECT | ||
s, | ||
COUNT(*) AS freq, | ||
MAX(pos) AS last_pos | ||
FROM | ||
UNNEST(entries) AS s | ||
WITH OFFSET pos | ||
GROUP BY | ||
s | ||
) | ||
ORDER BY | ||
freq DESC, | ||
last_pos DESC | ||
LIMIT | ||
1 | ||
) | ||
); | ||
|
||
-- Tests | ||
SELECT | ||
-- 1) Most frequent wins (Berlin appears twice) | ||
assert.struct_equals( | ||
STRUCT( | ||
'Berlin' AS city, | ||
'BE' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
), | ||
map.mode_last_struct_retain_nulls( | ||
[ | ||
STRUCT( | ||
'Berlin' AS city, | ||
'BE' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
), | ||
STRUCT( | ||
'Munich' AS city, | ||
'BY' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
), | ||
STRUCT( | ||
'Berlin' AS city, | ||
'BE' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
) | ||
] | ||
) | ||
), | ||
-- 2) Tie -> latest wins (Berlin x2, Munich x2, last element is Munich) | ||
assert.struct_equals( | ||
STRUCT( | ||
'Munich' AS city, | ||
'BY' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
), | ||
map.mode_last_struct_retain_nulls( | ||
[ | ||
STRUCT( | ||
'Berlin' AS city, | ||
'BE' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
), | ||
STRUCT( | ||
'Munich' AS city, | ||
'BY' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
), | ||
STRUCT( | ||
'Berlin' AS city, | ||
'BE' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
), | ||
STRUCT( | ||
'Munich' AS city, | ||
'BY' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
) -- latest among the tied | ||
] | ||
) | ||
), | ||
-- 3) FULL-struct equality: different subdivision2 means different value | ||
assert.struct_equals( | ||
STRUCT('Berlin' AS city, 'BE' AS subdivision1, 'A' AS subdivision2, 'DE' AS country), | ||
map.mode_last_struct_retain_nulls( | ||
[ | ||
STRUCT('Berlin' AS city, 'BE' AS subdivision1, 'A' AS subdivision2, 'DE' AS country), | ||
STRUCT('Berlin' AS city, 'BE' AS subdivision1, 'B' AS subdivision2, 'DE' AS country), | ||
STRUCT('Berlin' AS city, 'BE' AS subdivision1, 'A' AS subdivision2, 'DE' AS country) | ||
] | ||
) | ||
), | ||
-- 4) Single element returns itself | ||
assert.struct_equals( | ||
STRUCT( | ||
'Cologne' AS city, | ||
'NW' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
), | ||
map.mode_last_struct_retain_nulls( | ||
[ | ||
STRUCT( | ||
'Cologne' AS city, | ||
'NW' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
) | ||
] | ||
) | ||
), | ||
-- 5) Tie between an all-NULL struct and a non-NULL struct; latest wins -> expect Berlin | ||
assert.struct_equals( | ||
STRUCT( | ||
'Berlin' AS city, | ||
'BE' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
), | ||
map.mode_last_struct_retain_nulls( | ||
[ | ||
STRUCT( | ||
CAST(NULL AS STRING) AS city, | ||
CAST(NULL AS STRING) AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
CAST(NULL AS STRING) AS country | ||
), | ||
STRUCT( | ||
'Berlin' AS city, | ||
'BE' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
), | ||
STRUCT( | ||
CAST(NULL AS STRING) AS city, | ||
CAST(NULL AS STRING) AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
CAST(NULL AS STRING) AS country | ||
), | ||
STRUCT( | ||
'Berlin' AS city, | ||
'BE' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
) -- latest among the tied | ||
] | ||
) | ||
), | ||
-- 6) All-NULL struct occurs most frequently -> expect the all-NULL struct | ||
assert.struct_equals( | ||
STRUCT( | ||
CAST(NULL AS STRING) AS city, | ||
CAST(NULL AS STRING) AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
CAST(NULL AS STRING) AS country | ||
), | ||
map.mode_last_struct_retain_nulls( | ||
[ | ||
STRUCT( | ||
CAST(NULL AS STRING) AS city, | ||
CAST(NULL AS STRING) AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
CAST(NULL AS STRING) AS country | ||
), | ||
STRUCT( | ||
'Berlin' AS city, | ||
'BE' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
), | ||
STRUCT( | ||
CAST(NULL AS STRING) AS city, | ||
CAST(NULL AS STRING) AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
CAST(NULL AS STRING) AS country | ||
) | ||
] | ||
) | ||
), | ||
-- 7) City is NULL but other fields present; that exact struct is most frequent -> expect that struct (with city = NULL) | ||
assert.struct_equals( | ||
STRUCT( | ||
CAST(NULL AS STRING) AS city, | ||
'BY' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
), | ||
map.mode_last_struct_retain_nulls( | ||
[ | ||
STRUCT( | ||
CAST(NULL AS STRING) AS city, | ||
'BY' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
), | ||
STRUCT('Berlin' AS city, 'BE' AS subdivision1, CAST(NULL AS STRING) AS subdivision2, 'DE' AS country), | ||
STRUCT( | ||
CAST(NULL AS STRING) AS city, | ||
'BY' AS subdivision1, | ||
CAST(NULL AS STRING) AS subdivision2, | ||
'DE' AS country | ||
) | ||
] | ||
) | ||
); |
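
Test 3 depends on whole-struct equality. The Python sketch below contrasts that with a per-field mode, which is how the UDF's header comment describes `map.mode_last`. The helper names are hypothetical, and `per_field_mode_last` is a simplified stand-in: it does not reproduce the actual implementation or NULL handling of `map.mode_last` or `stats.mode_last`. Both helpers assume a non-empty input.

```python
from collections import Counter


def mode_last(values):
    """Most frequent value; ties go to the value seen latest."""
    counts = Counter(values)
    last_pos = {value: pos for pos, value in enumerate(values)}
    return max(counts, key=lambda value: (counts[value], last_pos[value]))


def whole_struct_mode_last(entries):
    """Treat each tuple as one value, so related fields stay together."""
    return mode_last(entries)


def per_field_mode_last(entries):
    """Take the mode of each field independently, then recombine."""
    return tuple(mode_last(column) for column in zip(*entries))


# Illustrative (city, country) pairs, not data from the pull request.
entries = [
    ("Berlin", "DE"),
    ("Paris", "FR"),
    ("Lyon", "FR"),
    ("Berlin", "DE"),
    ("Nice", "FR"),
]

print(whole_struct_mode_last(entries))  # ('Berlin', 'DE')
print(per_field_mode_last(entries))     # ('Berlin', 'FR')
```

The per-field result pairs the most common city with the most common country, producing a combination that never appears in the input. Aggregating whole structs avoids that.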
This makes sense for this udf but for #7974, wouldn't we want to consider Berlin as the first/last seen city in this case?
This UDF would first be applied to stable tables to produce one row per client per day (mirroring baseline_clients_daily). We should retain NULLs at this stage: clients in cities with populations <15k or in locations MaxMind can’t map should remain NULL. If we drop NULLs here, we’d misrepresent a client’s true location and only capture them when they travel to a resolvable city. Downstream, the city_seen table can then keep only the non-NULL city values after this step. I hope that makes sense?
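
The two-stage flow described in this reply can be sketched in Python as follows. This is only an illustration of the reasoning: the ping tuples, client IDs, and dates are hypothetical, `daily` and `city_seen` are plain dictionaries standing in for the tables mentioned above, and `mode_last` is a simplified model of the UDF.

```python
from collections import defaultdict


def mode_last(values):
    """Most frequent value; ties go to the value seen latest."""
    stats = {}  # value -> (frequency, last position seen)
    for pos, value in enumerate(values):
        freq, _ = stats.get(value, (0, -1))
        stats[value] = (freq + 1, pos)
    return max(stats, key=lambda value: stats[value])


NULL_GEO = (None, None, None, None)  # location could not be resolved
BERLIN = ("Berlin", "BE", None, "DE")

# Hypothetical pings: (client_id, submission_date, geo struct).
pings = [
    ("c1", "2025-01-01", NULL_GEO),
    ("c1", "2025-01-01", BERLIN),
    ("c1", "2025-01-01", NULL_GEO),
    ("c2", "2025-01-01", BERLIN),
]

# Stage 1: one row per client per day, retaining NULLs.
by_client_day = defaultdict(list)
for client_id, day, geo in pings:
    by_client_day[(client_id, day)].append(geo)
daily = {key: mode_last(geos) for key, geos in by_client_day.items()}

# Stage 2: downstream, keep only rows whose city is non-NULL.
city_seen = {key: geo for key, geo in daily.items() if geo[0] is not None}

print(daily[("c1", "2025-01-01")])  # (None, None, None, None)
print(sorted(city_seen))            # [('c2', '2025-01-01')]
```

Client `c1` is mostly in an unresolvable location, so its daily row stays NULL and it is left out of `city_seen`. Had NULLs been dropped before taking the mode, `c1` would have been recorded as a Berlin client on the strength of a single ping.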
I'm not sure that matches what I was thinking, so I might be misunderstanding where this is going to be used. I'll see if it makes sense when I look at where the UDF is used.