Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix type coercion for unsigned and signed integers (Int64 vs UInt64, etc) #15341

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

Omega359
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Better handle type coercion when unsigned numerics are involved

What changes are included in this PR?

code, extensive tests, existing test updates.

Are these changes tested?

Yes

Are there any user-facing changes?

Possibly. Some previous results from dataframe/sql queries may have had the incorrect type if one of the types that were coerced was an unsigned type.

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Mar 20, 2025
@Omega359 Omega359 marked this pull request as ready for review March 20, 2025 21:30
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Omega359

I think this PR nicely reduces code replication and does what is explained on #15340

However, I am struggling to understand the implications of this change to a user. Like for example, if we were going to add a note about this in the upgrade / release notes, what would it say?

Or put another way, what problem is this PR solving (the ticket just describes what the code does as wrong but it doesn't say why) 🤔

@@ -267,8 +267,8 @@ fn push_down_filter_groupby_expr_contains_alias() {
let sql = "SELECT * FROM (SELECT (col_int32 + col_uint32) AS c, count(*) FROM test GROUP BY 1) where c > 3";
let plan = test_sql(sql).unwrap();
let expected = "Projection: test.col_int32 + test.col_uint32 AS c, count(Int64(1)) AS count(*)\
\n Aggregate: groupBy=[[test.col_int32 + CAST(test.col_uint32 AS Int32)]], aggr=[[count(Int64(1))]]\
\n Filter: test.col_int32 + CAST(test.col_uint32 AS Int32) > Int32(3)\
\n Aggregate: groupBy=[[CAST(test.col_int32 AS Int64) + CAST(test.col_uint32 AS Int64)]], aggr=[[count(Int64(1))]]\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be slower-- as now the larger column type is used (so it needs to do a 64 bit comparison rather than 32 bit) 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it probably also doesn't lose precision

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctness > performance.

@Omega359
Copy link
Contributor Author

However, I am struggling to understand the implications of this change to a user. Like for example, if we were going to add a note about this in the upgrade / release notes, what would it say?

An issue was fixed where type coercion between expressions using certain mathematical operations having unsigned / signed types could produce values with an incorrect type that is not large enough to encompass all the possible values for both types. For example, comparing an unsigned int32 with a signed int32 could result in values having int32 type (where it should be int64) and could result in "Can't cast .." error for any unsigned values larger than the maximum int32 value.

This change may result in expressions unexpectedly having a 'larger' output type than they would have had in previous releases.

Or put another way, what problem is this PR solving (the ticket just describes what the code does as wrong but it doesn't say why) 🤔

let df = df
            .select_columns(&[P_ID, IDENTITY_KEY_VALUE])?
            .with_column(
                IDENTITY_KEY_VALUE,
                cast(hex_to_u64.call(vec![col(IDENTITY_KEY_VALUE)]), DataType::UInt64),
            )?
            .with_column(PARTITION_COLUMN, col(IDENTITY_KEY_VALUE).rem(lit(64)))?;

That currently throws Cast error: Can't cast value 16858640775341098663 to type Int32

Is it easy to work around? Yes. Should it happen? No.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Omega359

This makes sense to me

I added a link to add a note in the release notes

@alamb alamb changed the title fix type coercion for uint/int's Fix type coercion for unsigned and signed integers (Int64 vs UInt64, etc) Mar 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

type coercion for arthmetic/binary ops fails for some unsigned/signed mappings
3 participants