-
Notifications
You must be signed in to change notification settings - Fork 1.4k
fix(parquet): Fix inserting decimal type values precision and scale #12846
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for meta-velox canceled.
|
2664d97 to
afed39d
Compare
afed39d to
d07f829
Compare
velox/connectors/Connector.h
Outdated
|
|
||
| virtual std::string toString() const = 0; | ||
|
|
||
| virtual const std::vector<TypePtr> getHiveColumnHandleDataType() const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to leave "hive" out of the name here. This is a generic connector handle.
| auto rowVector = makeRowVector( | ||
| {makeFlatVector<int64_t>({1, 2}, DECIMAL(8, 2)), | ||
| makeFlatVector<int64_t>({1, 2}, DECIMAL(10, 2)), | ||
| {makeFlatVector<double>({1.23, 203.43}, DECIMAL(8, 2)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need an error test where the literal does not fit into the column? Is that stopped in the writer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When decimal does not fit in the column it fails and this is done in java code before it comes to the writer, so I believe that test is not needed here.
| max_row_group_length_(DEFAULT_MAX_ROW_GROUP_LENGTH), | ||
| pagesize_(kDefaultDataPageSize), | ||
| version_(ParquetVersion::PARQUET_2_6), | ||
| version_(ParquetVersion::PARQUET_1_0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we mixing changes for separate PRs into one here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I believe the issue mentions The Created by: for Prestissimo needs to be fixed as well. But makes sense to make that into another PR, so i'll switch it out
29a100b to
da24fec
Compare
| ASSERT_EQ( | ||
| 0, | ||
| strcmp( | ||
| insertTableHandle->getColumnHandleDataType()[0]->name(), "DECIMAL")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of name, why don't you check on toString of the TypePtr? That will give you precision and scale for DECIMAL.
That way you don't need to check the table later which could have changes that you really care about.
da24fec to
a868a08
Compare
czentgr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good so far.
Can you please add more description to the commit and description in the PR as to the problem solved here?
We don't have an OSS issue for this that describes the problem? If so lets link it.
f73d5ab to
cf69190
Compare
cf69190 to
96eb244
Compare
| * This method assumes each input `ColumnarBatch` have same schema. | ||
| */ | ||
| void Writer::write(const VectorPtr& data) { | ||
| VELOX_USER_CHECK( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this too strong to remove this here? So this would trigger because the type is DECIMAL(8,2) but the schema type is DECIMAL(6,2)? For other types we may not have this problem. Perhaps for DECIMAL we need a separate chat to check that we allow a larger precision?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check is done in the Java code before it comes here so I believe it should be safe to remove this check here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should validate in Velox as well since it's a general-purpose library.
For decimals, it should be enough to check if the data decimal value fits in the decimal table type.
| } | ||
|
|
||
| TEST_F(HiveDataSinkTest, insertTableHandleGetColumnHandleDataType) { | ||
| const int32_t numBuckets = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: const int32_t kNumBuckets = 4
velox/dwio/parquet/writer/Writer.cpp
Outdated
| arrowMapType->item_field(), *mapType.valueType()); | ||
| return newField->WithType( | ||
| ::arrow::map(newKeyField->type(), newValueField->type())); | ||
| } else if (name != "" && type.isDecimal()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: !name.empty()
| * This method assumes each input `ColumnarBatch` have same schema. | ||
| */ | ||
| void Writer::write(const VectorPtr& data) { | ||
| VELOX_USER_CHECK( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should validate in Velox as well since it's a general-purpose library.
For decimals, it should be enough to check if the data decimal value fits in the decimal table type.
velox/dwio/parquet/writer/Writer.cpp
Outdated
| auto precisionScale = getDecimalPrecisionScale(type); | ||
| // Parquet type is set from the column type rather than inferred from the | ||
| // field data. | ||
| return newField->WithType( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we have to rescale the value? If the data is 1200.00 and say the precision, scale is (6, 2), the value will be 120'000. If the Table type is (7, 3) the value must be 1'200'000?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in test scaleDecimal to test out the values getting rescaled with parquet file that was created with values having less scale being inserted into table with higher scale example 12.34(DECIMAL(4,2)) becomes 123400 in the parquet file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to check in a file. Can't we write and read with this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does the rescale happen on write?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change just sets the precision and scale of the decimal type based on the table that was created, rather than the value that was inserted.
I believe the rescaling is happening before any changes in the PR since the first part hit is this file TableWriter.cpp and const auto& inputType = tableWriteNode->sources()[0]->outputType(); already has the data rescaled coming from LocalPlanner.cpp auto planNode = planNodes[i]; If the table is (10,4) and I insert 12.12 inputType will read as 10,4 but if I insert 12.1234 then inputType will be 6,4, but will look into this value rescaling some more to make sense of it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question here, I suppose the literal is checked on the coordinator as well in the sense that it is checked that it will fit into the table column type.
For example, 12.1234 won't fit into DECIMAL(10,2).
But for your comment, the plan will have the table column type along with the literal value that is passed down (not sure which plan node the literal is - maybe a VALUES node).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#14446 (comment) explains this
0573133 to
88aa1f3
Compare
0bd4424 to
1230440
Compare
velox/connectors/hive/HiveDataSink.h
Outdated
|
|
||
| std::string toString() const override; | ||
|
|
||
| const std::vector<TypePtr> getColumnHandleDataType() const override { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A column handle represents a single column.
So this function doesn't return the data type for a single handle but a vector of types for each handle.
So the name is kinda odd. Should this not be plural getColumnHandleDataTypes?
velox/dwio/parquet/writer/Writer.cpp
Outdated
| auto precisionScale = getDecimalPrecisionScale(type); | ||
| // Parquet type is set from the column type rather than inferred from the | ||
| // field data. | ||
| return newField->WithType( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question here, I suppose the literal is checked on the coordinator as well in the sense that it is checked that it will fit into the table column type.
For example, 12.1234 won't fit into DECIMAL(10,2).
But for your comment, the plan will have the table column type along with the literal value that is passed down (not sure which plan node the literal is - maybe a VALUES node).
velox/dwio/parquet/writer/Writer.cpp
Outdated
| arrowContext_->stagingBytes += bytes; | ||
| } | ||
|
|
||
| bool Writer::equivalentWithSimilarDecimal( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be a member function?
Why can't it be a helper function in the anonymous namespace? No specific Writer information is needed.
1230440 to
c58ee2c
Compare
| @@ -254,6 +254,14 @@ class ParquetInsertTableHandle : public ConnectorInsertTableHandle { | |||
|
|
|||
| static void registerSerDe(); | |||
|
|
|||
| const std::vector<TypePtr> getColumnHandleDataTypes() const override { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this as well.
velox/dwio/parquet/writer/Writer.cpp
Outdated
| if (!(typeid(*dataType) == typeid(schemaType))) { | ||
| return false; | ||
| } | ||
| for (size_t i = 0; i < dataType->size(); ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not handle nested types where the decimal type could be inside an array for example.
velox/connectors/Connector.h
Outdated
|
|
||
| virtual std::string toString() const = 0; | ||
|
|
||
| virtual const std::vector<TypePtr> getColumnHandleDataTypes() const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking if we should add this API to the TableWriteNode. This interface seems to have minimal API calls so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we should fix tableWriteNode->columns() to use HiveInsertTableHandle to infer the output columns.
c58ee2c to
b7b722c
Compare
b7b722c to
d714bba
Compare
| ROW({"c0", "c5", "c6"}, {DECIMAL(10, 2), VARCHAR(), BOOLEAN()}); | ||
| auto insertTableHandle = createHiveInsertTableHandle( | ||
| rowType, | ||
| "/path/to/test", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| "/path/to/test", | |
| TempDirectoryPath::create()->getPath(); |
PingLiuPing
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just happened to hit this decimal issue, wondering what's the status of this PR? Thanks.
d714bba to
3d28f7d
Compare
|
@PingLiuPing Have launched an issue to discuss design that can be found here: #14446. |
| dwio::common::FileFormat::PARQUET, | ||
| {"c5", "c6"}, | ||
| bucketProperty); | ||
| auto precision = insertTableHandle->getColumnHandleDataTypes()[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| auto precision = insertTableHandle->getColumnHandleDataTypes()[0] | |
| auto [precision, scale] = getDecimalPrecisionScale(insertTableHandle->getColumnHandleDataTypes()[0]); |
velox/dwio/parquet/writer/Writer.cpp
Outdated
| void Writer::write(const VectorPtr& data) { | ||
| VELOX_USER_CHECK( | ||
| data->type()->equivalent(*schema_), | ||
| equivalentWithSimilarDecimal(*data->type(), *schema_), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please explain why we implement a new function here?
Should we just enhance equivalent for DecimalType?
And the function name equivalentWithSimilarDecimal is misleading. Should not include decimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current equivalent checks for a match for everything and in Decimal case it will check if the scale and precision is the same. Which will fail for our Decimal case. For this equivalentWithSimilarDecimal check we can have a Decimal table column value like Decimal(10,2) and insert something like 1234.56 which is Decimal(6,2). So the precision and scale are not the same but the precision fits inside the table and the scale is the same, which is what we are checking for now in this case. It is named equivalentWithSimilarDecimal since we check for equivalence, but also similar decimal case is handled instead of exact match for decimal. The current equivalent needs to stay for cases that we are checking an exact match for both precision and scale on Decimal values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, understood. Here is the common code path, but you just need special logic for decimal. So the ideal way is re-implement equivalent in DecimalType class. Or at least you can check the data type here and only do special handling for decimal type. Adding a new function for all data type is not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, let's leave this code as is. The problem should be fixed in upstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a similar isTypeOnlyCoercion as Java
https://github.com/prestodb/presto/blob/master/presto-main-base/src/main/java/com/facebook/presto/type/TypeCoercer.java#L87C20-L87C38
3d28f7d to
2e3e438
Compare
9726724 to
6be6b5f
Compare
velox/dwio/parquet/writer/Writer.cpp
Outdated
| void Writer::write(const VectorPtr& data) { | ||
| VELOX_USER_CHECK( | ||
| data->type()->equivalent(*schema_), | ||
| equivalentWithSimilarDecimal(*data->type(), *schema_), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a similar isTypeOnlyCoercion as Java
https://github.com/prestodb/presto/blob/master/presto-main-base/src/main/java/com/facebook/presto/type/TypeCoercer.java#L87C20-L87C38
|
|
||
| TEST_F(HiveDataSinkTest, insertTableHandleGetColumnHandleDataTypes) { | ||
| const int32_t kNumBuckets = 4; | ||
| auto bucketProperty = std::make_shared<HiveBucketProperty>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the bucketProperty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed, removed from test.
6be6b5f to
55f7e50
Compare
55f7e50 to
074b8d4
Compare
074b8d4 to
09874fc
Compare
Prestissimo decimal type is incorrect when inserting values in Parquet. Decimal precision, scale are written incorrectly to Parquet. Need to change presto velox parquet-tools inspect output on parquet tools created by with decimal to be similar to presto java.
To Reproduce
Before the PR the output for prestissimo would be showing the
col0type asDECIMAL(6,2)instead ofDECIMAL(10,2).Expected behavior
Java writer does not have this issue. It uses the right precision, scale. AND with this PR we are getting the same output for the DECIMAL type and scale.