Skip to content

Conversation

@jkhaliqi
Copy link
Collaborator

@jkhaliqi jkhaliqi commented Mar 29, 2025

Prestissimo decimal type is incorrect when inserting values in Parquet. Decimal precision, scale are written incorrectly to Parquet. Need to change presto velox parquet-tools inspect output on parquet tools created by with decimal to be similar to presto java.

To Reproduce

create table decimaltest(col0 DECIMAL(10, 2));
CREATE TABLE
insert into decimaltest values(1200.00);
INSERT: 1 row

parquet meta file.parquet

Before the PR the output for prestissimo would be showing the col0 type as DECIMAL(6,2) instead of DECIMAL(10,2).

Created by: parquet-mr version 2.6.0 (build parquet-cpp-velox)
Properties: (none)
Schema:
message schema {
  optional int32 col0 (DECIMAL(6,2));
}

Expected behavior
Java writer does not have this issue. It uses the right precision, scale. AND with this PR we are getting the same output for the DECIMAL type and scale.

Created by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
Properties: (none)
Schema:
message hive_schema {
  optional fixed_len_byte_array(5) col0 (DECIMAL(10,2));
}

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 29, 2025
@netlify
Copy link

netlify bot commented Mar 29, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 09874fc
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/69370a851c112b00084a1af2

@jkhaliqi jkhaliqi force-pushed the parquet_tools_inspect branch 4 times, most recently from 2664d97 to afed39d Compare April 1, 2025 22:25
@jkhaliqi jkhaliqi marked this pull request as ready for review April 1, 2025 22:28
@jkhaliqi jkhaliqi requested a review from majetideepak as a code owner April 1, 2025 22:28
@jkhaliqi jkhaliqi force-pushed the parquet_tools_inspect branch from afed39d to d07f829 Compare April 2, 2025 23:10

virtual std::string toString() const = 0;

virtual const std::vector<TypePtr> getHiveColumnHandleDataType() const = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to leave "hive" out of the name here. This is a generic connector handle.

auto rowVector = makeRowVector(
{makeFlatVector<int64_t>({1, 2}, DECIMAL(8, 2)),
makeFlatVector<int64_t>({1, 2}, DECIMAL(10, 2)),
{makeFlatVector<double>({1.23, 203.43}, DECIMAL(8, 2)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need an error test where the literal does not fit into the column? Is that stopped in the writer?

Copy link
Collaborator Author

@jkhaliqi jkhaliqi Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When decimal does not fit in the column it fails and this is done in java code before it comes to the writer, so I believe that test is not needed here.

max_row_group_length_(DEFAULT_MAX_ROW_GROUP_LENGTH),
pagesize_(kDefaultDataPageSize),
version_(ParquetVersion::PARQUET_2_6),
version_(ParquetVersion::PARQUET_1_0),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we mixing changes for separate PRs into one here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I believe the issue mentions The Created by: for Prestissimo needs to be fixed as well. But makes sense to make that into another PR, so i'll switch it out

@jkhaliqi jkhaliqi force-pushed the parquet_tools_inspect branch 4 times, most recently from 29a100b to da24fec Compare April 10, 2025 20:41
ASSERT_EQ(
0,
strcmp(
insertTableHandle->getColumnHandleDataType()[0]->name(), "DECIMAL"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of name, why don't you check on toString of the TypePtr? That will give you precision and scale for DECIMAL.
That way you don't need to check the table later which could have changes that you really care about.

@jkhaliqi jkhaliqi force-pushed the parquet_tools_inspect branch from da24fec to a868a08 Compare April 10, 2025 21:24
Copy link
Collaborator

@czentgr czentgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far.
Can you please add more description to the commit and description in the PR as to the problem solved here?
We don't have an OSS issue for this that describes the problem? If so lets link it.

@jkhaliqi jkhaliqi force-pushed the parquet_tools_inspect branch 2 times, most recently from f73d5ab to cf69190 Compare April 29, 2025 16:48
@jkhaliqi jkhaliqi changed the title fix(parquet): Fix inserting decimal type values fix(parquet): Fix inserting decimal type values precision and scale Apr 29, 2025
@jkhaliqi jkhaliqi force-pushed the parquet_tools_inspect branch from cf69190 to 96eb244 Compare April 29, 2025 21:37
* This method assumes each input `ColumnarBatch` have same schema.
*/
void Writer::write(const VectorPtr& data) {
VELOX_USER_CHECK(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this too strong to remove this here? So this would trigger because the type is DECIMAL(8,2) but the schema type is DECIMAL(6,2)? For other types we may not have this problem. Perhaps for DECIMAL we need a separate chat to check that we allow a larger precision?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check is done in the Java code before it comes here so I believe it should be safe to remove this check here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should validate in Velox as well since it's a general-purpose library.
For decimals, it should be enough to check if the data decimal value fits in the decimal table type.

}

TEST_F(HiveDataSinkTest, insertTableHandleGetColumnHandleDataType) {
const int32_t numBuckets = 4;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: const int32_t kNumBuckets = 4

arrowMapType->item_field(), *mapType.valueType());
return newField->WithType(
::arrow::map(newKeyField->type(), newValueField->type()));
} else if (name != "" && type.isDecimal()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: !name.empty()

* This method assumes each input `ColumnarBatch` have same schema.
*/
void Writer::write(const VectorPtr& data) {
VELOX_USER_CHECK(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should validate in Velox as well since it's a general-purpose library.
For decimals, it should be enough to check if the data decimal value fits in the decimal table type.

auto precisionScale = getDecimalPrecisionScale(type);
// Parquet type is set from the column type rather than inferred from the
// field data.
return newField->WithType(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we have to rescale the value? If the data is 1200.00 and say the precision, scale is (6, 2), the value will be 120'000. If the Table type is (7, 3) the value must be 1'200'000?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in test scaleDecimal to test out the values getting rescaled with parquet file that was created with values having less scale being inserted into table with higher scale example 12.34(DECIMAL(4,2)) becomes 123400 in the parquet file.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to check in a file. Can't we write and read with this change?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does the rescale happen on write?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change just sets the precision and scale of the decimal type based on the table that was created, rather than the value that was inserted.
I believe the rescaling is happening before any changes in the PR since the first part hit is this file TableWriter.cpp and const auto& inputType = tableWriteNode->sources()[0]->outputType(); already has the data rescaled coming from LocalPlanner.cpp auto planNode = planNodes[i]; If the table is (10,4) and I insert 12.12 inputType will read as 10,4 but if I insert 12.1234 then inputType will be 6,4, but will look into this value rescaling some more to make sense of it...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question here, I suppose the literal is checked on the coordinator as well in the sense that it is checked that it will fit into the table column type.
For example, 12.1234 won't fit into DECIMAL(10,2).

But for your comment, the plan will have the table column type along with the literal value that is passed down (not sure which plan node the literal is - maybe a VALUES node).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#14446 (comment) explains this

@jkhaliqi jkhaliqi force-pushed the parquet_tools_inspect branch 4 times, most recently from 0573133 to 88aa1f3 Compare June 11, 2025 18:26
@jkhaliqi jkhaliqi force-pushed the parquet_tools_inspect branch from 0bd4424 to 1230440 Compare July 24, 2025 01:52

std::string toString() const override;

const std::vector<TypePtr> getColumnHandleDataType() const override {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A column handle represents a single column.
So this function doesn't return the data type for a single handle but a vector of types for each handle.
So the name is kinda odd. Should this not be plural getColumnHandleDataTypes?

auto precisionScale = getDecimalPrecisionScale(type);
// Parquet type is set from the column type rather than inferred from the
// field data.
return newField->WithType(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question here, I suppose the literal is checked on the coordinator as well in the sense that it is checked that it will fit into the table column type.
For example, 12.1234 won't fit into DECIMAL(10,2).

But for your comment, the plan will have the table column type along with the literal value that is passed down (not sure which plan node the literal is - maybe a VALUES node).

arrowContext_->stagingBytes += bytes;
}

bool Writer::equivalentWithSimilarDecimal(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a member function?
Why can't it be a helper function in the anonymous namespace? No specific Writer information is needed.

@jkhaliqi jkhaliqi force-pushed the parquet_tools_inspect branch from 1230440 to c58ee2c Compare July 25, 2025 21:04
@devavret devavret removed their request for review July 28, 2025 05:19
@@ -254,6 +254,14 @@ class ParquetInsertTableHandle : public ConnectorInsertTableHandle {

static void registerSerDe();

const std::vector<TypePtr> getColumnHandleDataTypes() const override {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this as well.

if (!(typeid(*dataType) == typeid(schemaType))) {
return false;
}
for (size_t i = 0; i < dataType->size(); ++i) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not handle nested types where the decimal type could be inside an array for example.


virtual std::string toString() const = 0;

virtual const std::vector<TypePtr> getColumnHandleDataTypes() const = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking if we should add this API to the TableWriteNode. This interface seems to have minimal API calls so far.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we should fix tableWriteNode->columns() to use HiveInsertTableHandle to infer the output columns.

ROW({"c0", "c5", "c6"}, {DECIMAL(10, 2), VARCHAR(), BOOLEAN()});
auto insertTableHandle = createHiveInsertTableHandle(
rowType,
"/path/to/test",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"/path/to/test",
TempDirectoryPath::create()->getPath();

Copy link
Collaborator

@PingLiuPing PingLiuPing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just happened to hit this decimal issue, wondering what's the status of this PR? Thanks.

@jkhaliqi jkhaliqi force-pushed the parquet_tools_inspect branch from d714bba to 3d28f7d Compare August 26, 2025 20:35
@jkhaliqi
Copy link
Collaborator Author

@PingLiuPing Have launched an issue to discuss design that can be found here: #14446.

dwio::common::FileFormat::PARQUET,
{"c5", "c6"},
bucketProperty);
auto precision = insertTableHandle->getColumnHandleDataTypes()[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto precision = insertTableHandle->getColumnHandleDataTypes()[0]
auto [precision, scale] = getDecimalPrecisionScale(insertTableHandle->getColumnHandleDataTypes()[0]);

void Writer::write(const VectorPtr& data) {
VELOX_USER_CHECK(
data->type()->equivalent(*schema_),
equivalentWithSimilarDecimal(*data->type(), *schema_),
Copy link
Collaborator

@PingLiuPing PingLiuPing Aug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain why we implement a new function here?
Should we just enhance equivalent for DecimalType?
And the function name equivalentWithSimilarDecimal is misleading. Should not include decimal.

Copy link
Collaborator Author

@jkhaliqi jkhaliqi Aug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current equivalent checks for a match for everything and in Decimal case it will check if the scale and precision is the same. Which will fail for our Decimal case. For this equivalentWithSimilarDecimal check we can have a Decimal table column value like Decimal(10,2) and insert something like 1234.56 which is Decimal(6,2). So the precision and scale are not the same but the precision fits inside the table and the scale is the same, which is what we are checking for now in this case. It is named equivalentWithSimilarDecimal since we check for equivalence, but also similar decimal case is handled instead of exact match for decimal. The current equivalent needs to stay for cases that we are checking an exact match for both precision and scale on Decimal values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, understood. Here is the common code path, but you just need special logic for decimal. So the ideal way is re-implement equivalent in DecimalType class. Or at least you can check the data type here and only do special handling for decimal type. Adding a new function for all data type is not necessary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, let's leave this code as is. The problem should be fixed in upstream.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

void Writer::write(const VectorPtr& data) {
VELOX_USER_CHECK(
data->type()->equivalent(*schema_),
equivalentWithSimilarDecimal(*data->type(), *schema_),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


TEST_F(HiveDataSinkTest, insertTableHandleGetColumnHandleDataTypes) {
const int32_t kNumBuckets = 4;
auto bucketProperty = std::make_shared<HiveBucketProperty>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the bucketProperty?

Copy link
Collaborator Author

@jkhaliqi jkhaliqi Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed, removed from test.

@jkhaliqi jkhaliqi force-pushed the parquet_tools_inspect branch from 074b8d4 to 09874fc Compare December 8, 2025 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants