Fix type coercion for unsigned and signed integers (`Int64` vs `UInt64`, etc) #15341

Omega359 · 2025-03-20T20:50:36Z

Which issue does this PR close?

Closes type coercion for arithmetic/binary ops fails for some unsigned/signed mappings #15340

Rationale for this change

Better handle type coercion when unsigned numeric types are involved.

What changes are included in this PR?

Code changes, extensive new tests, and updates to existing tests.

Are these changes tested?

Yes

Are there any user-facing changes?

Possibly. Some previous results from DataFrame or SQL queries may have had an incorrect type if one of the coerced types was an unsigned type.

datafusion/expr-common/src/type_coercion/binary.rs

alamb

Thanks @Omega359

I think this PR nicely reduces code duplication and does what is explained in #15340

However, I am struggling to understand the implications of this change to a user. Like for example, if we were going to add a note about this in the upgrade / release notes, what would it say?

Or put another way, what problem is this PR solving (the ticket just describes what the code does as wrong but it doesn't say why) 🤔

alamb · 2025-03-22T16:06:30Z

datafusion/optimizer/tests/optimizer_integration.rs

@@ -267,8 +267,8 @@ fn push_down_filter_groupby_expr_contains_alias() {
    let sql = "SELECT * FROM (SELECT (col_int32 + col_uint32) AS c, count(*) FROM test GROUP BY 1) where c > 3";
    let plan = test_sql(sql).unwrap();
    let expected = "Projection: test.col_int32 + test.col_uint32 AS c, count(Int64(1)) AS count(*)\
-    \n  Aggregate: groupBy=[[test.col_int32 + CAST(test.col_uint32 AS Int32)]], aggr=[[count(Int64(1))]]\
-    \n    Filter: test.col_int32 + CAST(test.col_uint32 AS Int32) > Int32(3)\
+    \n  Aggregate: groupBy=[[CAST(test.col_int32 AS Int64) + CAST(test.col_uint32 AS Int64)]], aggr=[[count(Int64(1))]]\


This might be slower, since the larger column type is now used (so it needs a 64-bit comparison rather than a 32-bit one) 🤔

But it probably also doesn't lose precision.

Correctness > performance.
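To make the trade-off in the plan diff above concrete, here is a minimal sketch in plain Rust, not DataFusion code. The function names `add_via_int32` and `add_via_int64` are hypothetical; they only mimic the two plan shapes (casting the `UInt32` operand down to `Int32` versus widening both operands to `Int64`). A failed cast is modeled here as `None`.

```rust
use std::convert::TryFrom;

/// Old plan shape (illustrative): cast the UInt32 operand to Int32, then add.
/// Returns None when the unsigned value does not fit in an i32,
/// or when the 32-bit sum overflows.
fn add_via_int32(a: i32, b: u32) -> Option<i32> {
    let b_narrow = i32::try_from(b).ok()?;
    a.checked_add(b_narrow)
}

/// New plan shape (illustrative): widen both operands to Int64, then add.
/// Every i32 and every u32 fits in an i64, and so does their sum.
fn add_via_int64(a: i32, b: u32) -> i64 {
    i64::from(a) + i64::from(b)
}
```

With these definitions, `add_via_int32(1, 3_000_000_000)` yields `None` because 3,000,000,000 exceeds `i32::MAX`, while `add_via_int64(1, 3_000_000_000)` yields `3_000_000_001`. The cost is that the arithmetic and the comparison run on 64-bit values.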

Omega359 · 2025-03-22T16:26:24Z

> However, I am struggling to understand the implications of this change to a user. Like for example, if we were going to add a note about this in the upgrade / release notes, what would it say?

Fixed an issue where type coercion for certain mathematical operations on mixed unsigned and signed types could produce an output type that is not large enough to hold all possible values of both input types. For example, comparing an unsigned int32 with a signed int32 could produce values of type int32 (where the type should be int64), and could result in a "Can't cast .." error for any unsigned value larger than the maximum int32 value.

This change may result in expressions unexpectedly having a 'larger' output type than they would have had in previous releases.
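The idea of picking a type "large enough to encompass all the possible values for both types" can be sketched as a small standalone rule. This is a toy model under stated assumptions, not the logic in `binary.rs`: the `IntType` enum and the `widen_mixed` function are hypothetical, and the sketch simply returns `None` when no signed 64-bit-or-smaller integer can cover both inputs (the `Int64` vs `UInt64` case).

```rust
/// Hypothetical stand-in for the integer data types discussed in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IntType { Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64 }

impl IntType {
    fn is_signed(self) -> bool {
        matches!(self, IntType::Int8 | IntType::Int16 | IntType::Int32 | IntType::Int64)
    }

    fn bits(self) -> u32 {
        match self {
            IntType::Int8 | IntType::UInt8 => 8,
            IntType::Int16 | IntType::UInt16 => 16,
            IntType::Int32 | IntType::UInt32 => 32,
            IntType::Int64 | IntType::UInt64 => 64,
        }
    }
}

/// Maps a bit width to the signed type of that width, if one exists in this model.
fn signed_with_bits(bits: u32) -> Option<IntType> {
    match bits {
        8 => Some(IntType::Int8),
        16 => Some(IntType::Int16),
        32 => Some(IntType::Int32),
        64 => Some(IntType::Int64),
        _ => None,
    }
}

/// Toy rule: return the smallest integer type in this model that can
/// represent every value of both inputs, or None if there is none.
fn widen_mixed(lhs: IntType, rhs: IntType) -> Option<IntType> {
    if lhs.is_signed() == rhs.is_signed() {
        // Same signedness: the wider operand type already covers both.
        return Some(if lhs.bits() >= rhs.bits() { lhs } else { rhs });
    }
    let (signed, unsigned) = if lhs.is_signed() { (lhs, rhs) } else { (rhs, lhs) };
    // A signed type needs at least one more bit than the unsigned type,
    // which in this model means the next width up (double the bits).
    let needed = signed.bits().max(unsigned.bits() * 2);
    signed_with_bits(needed)
}
```

Under this rule `widen_mixed(IntType::Int32, IntType::UInt32)` gives `Some(IntType::Int64)`, matching the int32 example above. What the PR does for pairs that this sketch rejects is not stated in this conversation.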

> Or put another way, what problem is this PR solving (the ticket just describes what the code does as wrong but it doesn't say why) 🤔

let df = df
    .select_columns(&[P_ID, IDENTITY_KEY_VALUE])?
    .with_column(
        IDENTITY_KEY_VALUE,
        cast(hex_to_u64.call(vec![col(IDENTITY_KEY_VALUE)]), DataType::UInt64),
    )?
    .with_column(PARTITION_COLUMN, col(IDENTITY_KEY_VALUE).rem(lit(64)))?;

That currently fails with `Cast error: Can't cast value 16858640775341098663 to type Int32`

Is it easy to work around? Yes. Should it happen? No.
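For illustration, the numbers in that error can be checked with plain Rust integers. This is a sketch outside DataFusion; `REPORTED_VALUE`, `fits_in_i32`, `fits_in_i64`, and `partition_of` are names made up for this example. It shows that the reported value needs the unsigned 64-bit range, while the remainder itself is small once it is computed in the unsigned domain.

```rust
use std::convert::TryFrom;

/// The value from the reported cast error.
const REPORTED_VALUE: u64 = 16_858_640_775_341_098_663;

/// True if the value can be represented as a signed 32-bit integer.
fn fits_in_i32(v: u64) -> bool {
    i32::try_from(v).is_ok()
}

/// True if the value can be represented as a signed 64-bit integer.
fn fits_in_i64(v: u64) -> bool {
    i64::try_from(v).is_ok()
}

/// Remainder computed entirely on unsigned 64-bit values,
/// so no narrowing cast of the left operand is needed.
fn partition_of(v: u64, partitions: u64) -> u64 {
    v % partitions
}
```

Here `fits_in_i32(REPORTED_VALUE)` is `false`, which is why narrowing the column to `Int32` cannot succeed, and `partition_of(REPORTED_VALUE, 64)` is `39`.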

alamb

Thank you @Omega359

This makes sense to me

I added a link so that we add a note about this in the release notes

https://github.com/apache/datafusion/pull/15341/files

type coercion fix for uint/int's.

47d2cf8

github-actions bot added the logical-expr, physical-expr, optimizer, and sqllogictest labels Mar 20, 2025

Omega359 marked this pull request as ready for review March 20, 2025 21:30

berkaysynnada reviewed Mar 21, 2025

datafusion/expr-common/src/type_coercion/binary.rs

Omega359 added 2 commits March 21, 2025 16:16

Refactored common numerical coercion logic into a single function.

ced045e

Cargo fmt.

08effa8

alamb reviewed Mar 22, 2025

alamb mentioned this pull request Mar 23, 2025

Release DataFusion 47.0.0 (April 2025) #15072

Open

21 tasks

alamb approved these changes Mar 23, 2025

alamb changed the title ~~fix type coercion for uint/int's~~ Fix type coercion for unsigned and signed integers (Int64 vs UInt64, etc) Mar 23, 2025
