-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix type coercion for unsigned and signed integers (Int64
vs UInt64
, etc)
#15341
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Omega359
I think this PR nicely reduces code replication and does what is explained on #15340
However, I am struggling to understand the implications of this change to a user. Like for example, if we were going to add a note about this in the upgrade / release notes, what would it say?
Or put another way, what problem is this PR solving (the ticket just describes what the code does as wrong but it doesn't say why) 🤔
@@ -267,8 +267,8 @@ fn push_down_filter_groupby_expr_contains_alias() { | |||
let sql = "SELECT * FROM (SELECT (col_int32 + col_uint32) AS c, count(*) FROM test GROUP BY 1) where c > 3"; | |||
let plan = test_sql(sql).unwrap(); | |||
let expected = "Projection: test.col_int32 + test.col_uint32 AS c, count(Int64(1)) AS count(*)\ | |||
\n Aggregate: groupBy=[[test.col_int32 + CAST(test.col_uint32 AS Int32)]], aggr=[[count(Int64(1))]]\ | |||
\n Filter: test.col_int32 + CAST(test.col_uint32 AS Int32) > Int32(3)\ | |||
\n Aggregate: groupBy=[[CAST(test.col_int32 AS Int64) + CAST(test.col_uint32 AS Int64)]], aggr=[[count(Int64(1))]]\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be slower-- as now the larger column type is used (so it needs to do a 64 bit comparison rather than 32 bit) 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it probably also doesn't lose precision
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correctness > performance.
An issue was fixed where type coercion between expressions using certain mathematical operations having unsigned / signed types could produce values with an incorrect type that is not large enough to encompass all the possible values for both types. For example, comparing an unsigned int32 with a signed int32 could result in values having int32 type (where it should be int64) and could result in "Can't cast .." error for any unsigned values larger than the maximum int32 value. This change may result in expressions unexpectedly having a 'larger' output type than they would have had in previous releases.
let df = df
.select_columns(&[P_ID, IDENTITY_KEY_VALUE])?
.with_column(
IDENTITY_KEY_VALUE,
cast(hex_to_u64.call(vec![col(IDENTITY_KEY_VALUE)]), DataType::UInt64),
)?
.with_column(PARTITION_COLUMN, col(IDENTITY_KEY_VALUE).rem(lit(64)))?; That currently throws Is it easy to work around? Yes. Should it happen? No. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Int64
vs UInt64
, etc)
Which issue does this PR close?
Rationale for this change
Better handle type coercion when unsigned numerics are involved
What changes are included in this PR?
code, extensive tests, existing test updates.
Are these changes tested?
Yes
Are there any user-facing changes?
Possibly. Some previous results from dataframe/sql queries may have had the incorrect type if one of the types that were coerced was an unsigned type.