fix(expr-common): Coerce to Decimal(20, 0) when combining UInt64 with signed integers #14223

nuno-faria · 2025-01-21T12:48:10Z

Previously, when combining UInt64 with any signed integer, the resulting type would be Int64, which would result in lost information. Now, combining UInt64 with a signed integer results in a Decimal(20, 0), which is able to encode all (64-bit) integer types. Thanks @jonahgao for the pointers.
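To make the range argument concrete, here is a minimal sketch that checks the bounds with plain `i128` arithmetic. The helper names (`fits_decimal_20_0`, `fits_int64`) are made up for illustration and are not part of DataFusion; the only assumption is that a decimal with precision 20 and scale 0 holds integers of up to twenty decimal digits.

```rust
/// Largest value representable by a decimal with precision 20 and scale 0
/// (twenty nines). The smallest value is its negation.
const DECIMAL_20_0_MAX: i128 = 10_i128.pow(20) - 1;

/// Hypothetical helper: returns true if `v` fits in Decimal(20, 0).
fn fits_decimal_20_0(v: i128) -> bool {
    (-DECIMAL_20_0_MAX..=DECIMAL_20_0_MAX).contains(&v)
}

/// Hypothetical helper: returns true if `v` fits in a signed 64-bit integer.
fn fits_int64(v: i128) -> bool {
    i64::try_from(v).is_ok()
}
```

With these helpers, `fits_int64(u64::MAX as i128)` is false, while `fits_decimal_20_0` is true for both `u64::MAX` and `i64::MIN`, which is why Int64 loses information and Decimal(20, 0) does not.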

The function bitwise_coercion remains the same, since it's probably not a good idea to introduce decimals when performing bitwise operations. In this case, it converts (UInt64 | _) to UInt64.

Which issue does this PR close?

Closes #14208.

What changes are included in this PR?

Updated binary_numeric_coercion in expr-common/type_coercion/binary.rs.
Added new tests to expr-common/type_coercion/binary.rs.
Updated existing sqllogic tests to use the new coercion.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.


jonahgao · 2025-01-21T14:49:25Z

datafusion/expr-common/src/type_coercion/binary.rs

+        // accommodates all values of both types. Note that to avoid information
+        // loss when combining UInt64 with signed integers we use Decimal128(20, 0).
+        (Decimal128(20, 0), _)
+        | (_, Decimal128(20, 0))


This does not look correct. For example, combining Decimal128(20, 0) with Decimal128(30, 0) should not result in Decimal128(20, 0).

I think the case where both types are decimal is handled before this match, by the call to the decimal_coercion function.

I rechecked it, and decimal_coercion already covers them in

datafusion/datafusion/expr-common/src/type_coercion/binary.rs

Lines 932 to 940 in 2f28327

    fn coerce_numeric_type_to_decimal(numeric_type: &DataType) -> Option<DataType> {
        use arrow::datatypes::DataType::*;
        // This conversion rule is from spark
        // https://github.com/apache/spark/blob/1c81ad20296d34f137238dadd67cc6ae405944eb/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L127
        match numeric_type {
            Int8 => Some(Decimal128(3, 0)),
            Int16 => Some(Decimal128(5, 0)),
            Int32 => Some(Decimal128(10, 0)),
            Int64 => Some(Decimal128(20, 0)),

Although it doesn't handle unsigned integer types, we can supplement it there, maybe as a follow-up PR.

So I think we shouldn't combine decimal with integer types here because decimal_coercion has already handled it.

I tried that initially but then it would not handle UInt64 and Decimal128(20, 0):

Cannot infer common argument type for comparison operation UInt64 = Decimal128(20, 0)

So maybe it would be best to add new arms to the coerce_numeric_type_to_decimal to include unsigned integers as well?

    match numeric_type {
        Int8 => Some(Decimal128(3, 0)),
        Int16 => Some(Decimal128(5, 0)),
        Int32 => Some(Decimal128(10, 0)),
        Int64 => Some(Decimal128(20, 0)),
        Float32 => Some(Decimal128(14, 7)),
        Float64 => Some(Decimal128(30, 15)),
        _ => None,
    }

> So maybe it would be best to add new arms to the coerce_numeric_type_to_decimal to include unsigned integers as well?

I think so.
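A rough sketch of what adding unsigned arms could look like, written against a local stand-in enum so it compiles without arrow. The signed and float arms are the ones quoted in this thread; the unsigned precisions are an assumption, chosen as the number of decimal digits of each type's maximum value, and may differ from what the final commit uses. `decimal_digits` is a hypothetical helper used only to sanity-check those numbers.

```rust
/// Simplified stand-in for arrow's `DataType`, so the sketch compiles alone.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int8,
    Int16,
    Int32,
    Int64,
    UInt8,
    UInt16,
    UInt32,
    UInt64,
    Float32,
    Float64,
    Utf8,
    Decimal128(u8, i8),
}

fn coerce_numeric_type_to_decimal(numeric_type: &DataType) -> Option<DataType> {
    use DataType::*;
    match numeric_type {
        // Arms quoted in the discussion above.
        Int8 => Some(Decimal128(3, 0)),
        Int16 => Some(Decimal128(5, 0)),
        Int32 => Some(Decimal128(10, 0)),
        Int64 => Some(Decimal128(20, 0)),
        // Assumed new arms: precision = decimal digits of the type's MAX.
        UInt8 => Some(Decimal128(3, 0)),
        UInt16 => Some(Decimal128(5, 0)),
        UInt32 => Some(Decimal128(10, 0)),
        UInt64 => Some(Decimal128(20, 0)),
        Float32 => Some(Decimal128(14, 7)),
        Float64 => Some(Decimal128(30, 15)),
        // Non-numeric types (and decimals themselves) are not handled here.
        _ => None,
    }
}

/// Hypothetical helper: number of decimal digits needed to write `v`.
fn decimal_digits(v: u64) -> u8 {
    v.to_string().len() as u8
}
```

With arms like these, a UInt64 operand would be turned into Decimal128(20, 0) before being combined with another decimal, instead of failing with "Cannot infer common argument type".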

jonahgao · 2025-01-21T14:51:12Z

Perhaps we should add a sqllogictest test for #14208.

nuno-faria · 2025-01-21T15:17:01Z

> Perhaps we should add a sqllogictest test for #14208.

Done.


datafusion/expr-common/src/type_coercion/binary.rs

jonahgao

LGTM, thank you @nuno-faria

alamb · 2025-01-22T23:36:32Z

Thanks again @nuno-faria - it is great to see you contributing ❤️

alamb

Thanks for the contribution @nuno-faria and the review @jonahgao

However, this change doesn't seem like a good one to me.

alamb · 2025-01-22T23:38:59Z

datafusion/expr-common/src/type_coercion/binary.rs

-        // for largest signed (signed sixteen-byte integer) and unsigned integer (unsigned sixteen-byte integer)
+        // accommodates all values of both types. Note that to avoid information
+        // loss when combining UInt64 with signed integers we use Decimal128(20, 0).
+        (UInt64, Int64 | Int32 | Int16 | Int8)


I think this has potentially (large) performance implications.

My understanding is that this means Int64+Int64 will always produce a 128-bit result?

So even though performing int64+int64 will never overflow, all queries will pay the price of 2x space (and some time) overhead?

No, Int64+Int64 is not affected; it uses mathematics_numerical_coercion.

I did some validation on this PR branch.

    DataFusion CLI v44.0.0
    > create table test(a bigint, b bigint unsigned) as values(1,1);
    0 row(s) fetched.
    Elapsed 0.008 seconds.

    > select arrow_typeof(a+b), arrow_typeof(a+a), arrow_typeof(a), arrow_typeof(b) from test;
    +-------------------------------+-------------------------------+----------------------+----------------------+
    | arrow_typeof(test.a + test.b) | arrow_typeof(test.a + test.a) | arrow_typeof(test.a) | arrow_typeof(test.b) |
    +-------------------------------+-------------------------------+----------------------+----------------------+
    | Int64                         | Int64                         | Int64                | UInt64               |
    +-------------------------------+-------------------------------+----------------------+----------------------+
    1 row(s) fetched.
    Elapsed 0.009 seconds.

I really liked this way to verify the behavior. I made a PR with this type of test and verified that the tests still pass with the changes in this PR:

Minor: Add tests for types of arithmetic operator output types #14250

alamb · 2025-01-22T23:40:07Z

datafusion/sqllogictest/test_files/union.slt

 04)------TableScan: aggregate_test_100 projection=[c1, c9]
-05)----Projection: aggregate_test_100.c1, CAST(aggregate_test_100.c3 AS Int64) AS c9
+05)----Projection: aggregate_test_100.c1, CAST(aggregate_test_100.c3 AS Decimal128(20, 0)) AS c9


This seems like a regression to me (it now needs 2x the space).

This query unions Int16 with UInt64. We need to find a common type that can accommodate all possible values of these two types, such as -1 and u64::MAX. Although this increases the space used, it makes the following query work:

    create table t1(a smallint) as values(1);
    create table t2(a bigint unsigned) as values(10000000000000000000);
    select * from t1 union select * from t2;
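The rule being discussed can be modelled with a toy function. This is a deliberately reduced sketch, not DataFusion's `binary_numeric_coercion`: `ToyType` and `common_type` are hypothetical names, and only the UInt64-with-signed-integer arm from the diff is modelled. Every other mixed pair returns `None` here simply because the sketch does not cover it.

```rust
/// Toy subset of the types involved; not arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ToyType {
    Int8,
    Int16,
    Int32,
    Int64,
    UInt64,
    Decimal128(u8, i8),
}

/// Hypothetical model of the rule under review.
fn common_type(lhs: ToyType, rhs: ToyType) -> Option<ToyType> {
    use ToyType::*;
    match (lhs, rhs) {
        // Identical types need no coercion.
        (a, b) if a == b => Some(a),
        // UInt64 with any signed integer: widen to a decimal that holds
        // both -1 and u64::MAX.
        (UInt64, Int64 | Int32 | Int16 | Int8)
        | (Int64 | Int32 | Int16 | Int8, UInt64) => Some(Decimal128(20, 0)),
        // Pairs outside the scope of this sketch.
        _ => None,
    }
}
```

For the union above, `common_type(ToyType::Int16, ToyType::UInt64)` yields `Decimal128(20, 0)`, which matches the new cast shown in the plan.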

alamb · 2025-01-22T23:41:44Z

datafusion/sqllogictest/test_files/insert.slt

+3
+
+query I rowsort
+select * from unsigned_bigint_test


I agree that overflow when multiplying two 64-bit numbers is more likely, and automatically coercing to Decimal128 may make sense.

However, I would argue that if the user cares about avoiding overflows when doing integer arithmetic, they should use Decimal128 in their input types.

Overflow is unexpected in this case, as all these values are valid unsigned bigint.

The problem is that in values (10000000000000000001), (1), 10000000000000000001 is parsed as UInt64, and 1 is parsed as Int64. They were coerced to Int64, which can't accommodate 10000000000000000001.

This is very similar to the union case mentioned above.
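The literal typing described here can be mimicked with a small sketch. It assumes a parser that tries Int64 first and falls back to UInt64, which is consistent with the behavior described in this comment but is not taken from DataFusion's source; `LiteralType` and `literal_type` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum LiteralType {
    Int64,
    UInt64,
}

/// Hypothetical helper: classify an integer literal, preferring Int64 and
/// falling back to UInt64 when the value is too large for a signed type.
fn literal_type(text: &str) -> Option<LiteralType> {
    if text.parse::<i64>().is_ok() {
        Some(LiteralType::Int64)
    } else if text.parse::<u64>().is_ok() {
        Some(LiteralType::UInt64)
    } else {
        // Not an integer literal that fits in 64 bits.
        None
    }
}
```

Under that assumption, `1` is classified as Int64 and `10000000000000000001` as UInt64, and `i64::try_from(10000000000000000001_u64)` fails, which is the overflow the old Int64 coercion ran into.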

alamb

Thank you @nuno-faria and @jonahgao for the explanation. I agree this PR makes sense and my concerns did not apply.

alamb · 2025-01-23T14:38:16Z

datafusion/expr-common/src/type_coercion/binary.rs

-        // for largest signed (signed sixteen-byte integer) and unsigned integer (unsigned sixteen-byte integer)
+        // accommodates all values of both types. Note that to avoid information
+        // loss when combining UInt64 with signed integers we use Decimal128(20, 0).
+        (UInt64, Int64 | Int32 | Int16 | Int8)


I really liked this way to verify the behavior. I made a PR with this type of test and verified that the tests still pass with the changes in this PR:

Minor: Add tests for types of arithmetic operator output types #14250

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 21, 2025

jonahgao reviewed Jan 21, 2025


test: Add sqllogictest for apache#14208

d632d7a

nuno-faria added 2 commits January 22, 2025 09:36

refactor: Move unsigned integer and decimal coercion to coerce_numeric_type_to_decimal

c21610c

fix: Also handle unsigned integers when coercing to Decimal256

befdc43

jonahgao reviewed Jan 22, 2025


fix: Coerce UInt64 and other unsigned integer to UInt64

bb2c5c0

jonahgao approved these changes Jan 22, 2025


alamb reviewed Jan 22, 2025


alamb mentioned this pull request Jan 23, 2025

Minor: Add tests for types of arithmetic operator output types #14250


alamb approved these changes Jan 23, 2025
