[SUPPORT] hoodie.parquet.outputtimestamptype setting not converting to TIMESTAMP_MILLIS #12339
Comments
Created Hudi Jira - https://issues.apache.org/jira/browse/HUDI-8592
@rangareddy @KendallRackley Hello! I tried to reproduce this problem, but it looks like it works for bulk_insert mode. Here is the code.
Which write mode do you want it to work for? Can you please give me more details?
Hi @ktblsva, in my reproducible code I have not set any write operation parameter, so it uses the default value 'upsert'. Could you please remove the write operation parameter and test it once again?
From #3429 (comment) and #4749, it looks like this config is intended only for BULK_INSERT.
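If the config only applies to the bulk_insert (row-writer) path, a minimal sketch of forcing that operation would look like the following. The table name, record key, precombine field, and save path are illustrative placeholders, not taken from the original post:

```python
# Sketch: Hudi write options forcing bulk_insert so that
# hoodie.parquet.outputtimestamptype is honored by the row writer.
# Key/precombine fields and paths below are illustrative placeholders.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "field1",
    "hoodie.datasource.write.precombine.field": "field3",
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.parquet.outputtimestamptype": "TIMESTAMP_MILLIS",
}

# With a DataFrame `df` in scope, the write would then be:
# df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/path")
```

Whether this produces TIMESTAMP_MILLIS in the Parquet footer would still need to be verified against the Hudi version in use.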
Describe the problem you faced
Hey team,
Here's a link to the thread on the Apache Hudi Slack channel where I posted this issue:
https://apache-hudi.slack.com/archives/C4D716NPQ/p1731532187806959
I'm running a PySpark script in AWS Glue ETL. It is reading from a Postgres database table via a JDBC connection and writing the dataframe to Hudi. This DataFrame contains 7 columns. Three of the columns are type Long, with LogicalType "timestamp-micros".
I used these settings in the hoodie config:
Added this in the spark config also:
conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
but it still outputs "timestamp-micros" for field3, field4 and field7:
I tried converting it to timestamp-millis by manually setting the schema and generating a new dataframe from it:
I've tried casting it to milliseconds within the timestamp and this does not work either:
new_df = new_df.withColumn("field3", to_timestamp(col("field3"), 'yyyy-MM-dd HH:mm:ss.SSS'))
It truncates the field values from microseconds to milliseconds, but it does not convert the Parquet logical type for those columns, e.g.
2007-03-11 15:46:41.540000 -----> 2007-03-11 15:46:41.5400
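This value-vs-type distinction can be seen with plain Python datetimes: truncating to whole milliseconds changes the stored value but not the type, which mirrors what the cast above does to the column. The timestamp below is illustrative:

```python
from datetime import datetime

# Illustrative timestamp with sub-millisecond precision
ts = datetime(2007, 3, 11, 15, 46, 41, 540123)

# Truncate microseconds down to whole milliseconds
truncated = ts.replace(microsecond=(ts.microsecond // 1000) * 1000)

print(truncated)        # 2007-03-11 15:46:41.540000
print(type(truncated))  # still <class 'datetime.datetime'>
```

In the same way, Spark's TimestampType stays the same type after the cast; only the writer config controls which Parquet logical type is emitted.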
Does the setting "hoodie.parquet.outputtimestamptype" just not work? Is it not possible to output timestamp-milliseconds with Spark?
To Reproduce
Steps to reproduce the behavior:
Ranga Reddy on the channel attempted to recreate this issue by setting a dataframe schema with the TimestampType class. He inserted some rows that had timestamps up to microseconds. The setting
hoodie.parquet.outputtimestamptype
was set to TIMESTAMP_MILLIS, but when writing to Hudi, the logicalType of the timestamp was still TIMESTAMP_MICROS even though the schema was set and the outputtimestamptype setting was added too.
Expected behavior
I expect the parquet schema for field3 to be TIMESTAMP_MILLIS instead of TIMESTAMP_MICROS. This is the schema output I am currently getting, where field3 still shows timestamp-micros:
"fields" : [ {
"name" : "field1",
"type" : "integer"
}, {
"name" : "field2",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "field3",
"type" : [ "null", {
"type" : "long",
"logicalType" : "timestamp-micros"
} ],
"default" : null
}
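For comparison, the desired field3 entry with millisecond precision would carry the timestamp-millis logical type, in the same Avro-style schema notation as above:

```json
{
  "name" : "field3",
  "type" : [ "null", {
    "type" : "long",
    "logicalType" : "timestamp-millis"
  } ],
  "default" : null
}
```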
Environment Description
Hudi version : Hudi/AWS Bundle 0.14
Spark version : 3.3
Hive version : Not sure
Hadoop version : N/A
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : No
Stacktrace
No Stacktrace, just output