
[SUPPORT] Metadata compaction periodically fails/hangs #12261

Open · liiang-huang opened this issue Nov 15, 2024 · 5 comments
Labels: metadata (metadata table), priority:critical (production down; pipelines stalled; need help asap)

liiang-huang commented Nov 15, 2024

Describe the problem you faced

Hi Hudi community, I have a Glue job that is ingesting data into a Hudi MOR table. However, the job periodically fails in the stage shown below.

[Screenshots: Spark UI views of the failing stage]

Could you help investigate this issue? I have gone through this issue, but it doesn't seem to be the same one. I deleted the requested/inflight deltacommits and also tried increasing resources, but the errors still persisted. Thanks!
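
(For reference, a minimal Scala sketch of that kind of timeline cleanup, assuming it is done through the Hadoop FileSystem API from a Spark shell or job; the bucket/path placeholders are hypothetical, and for metadata-table instants the timeline would be under `.hoodie/metadata/.hoodie` instead:)

```scala
// Sketch only: delete requested/inflight deltacommit instants from the
// timeline folder. Assumes a SparkSession `spark` is in scope;
// <bucket>/<table-base-path> are placeholders, not real paths.
import org.apache.hadoop.fs.Path

val timelinePath = new Path("s3://<bucket>/<table-base-path>/.hoodie")
val fs = timelinePath.getFileSystem(spark.sparkContext.hadoopConfiguration)

fs.listStatus(timelinePath)
  .map(_.getPath)
  .filter(p => p.getName.endsWith(".deltacommit.requested") ||
               p.getName.endsWith(".deltacommit.inflight"))
  .foreach(p => fs.delete(p, false)) // non-recursive delete of instant files
```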

Environment Description

  • Hudi version : 0.13.1

  • Spark version : 3.1

  • Storage (HDFS/S3/GCS..) : S3


Stacktrace

Exception in User Class: jp.ne.paypay.daas.data.exceptions.JobFatalError : Streaming batch load failed with error: Could not compact s3://pay2-datalake-prod-standard/datasets/bronze/payment-accounting-db1-20241010-aurora-prod/payment_accounting/sub_payments_accounting-1761348391


Job aborted due to stage failure: Task 169 in stage 87.0 failed 4 times, most recent failure: Lost task 169.3 in stage 87.0 (TID 21675) (10.12.56.40 executor 13): ExecutorLostFailure (executor 13 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 508519 ms

@liiang-huang liiang-huang changed the title [SUPPORT] Metadata compaction periodically failure/hang [SUPPORT] Metadata compaction periodically fails/hangs Nov 15, 2024
ad1happy2go (Collaborator) commented:

@liiang-huang Can you collect more stats from the metadata table? I see executors getting lost.
You can open the Spark UI's Executors page to see the reason for the executor loss.
How many files do you see under the .metadata directory? Is column stats or RLI (record-level index) enabled? Please share the Hudi configs.
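
(As an aside, counting the objects under the metadata path can be scripted; a rough Scala sketch, assuming a SparkSession `spark` is in scope and with placeholder paths:)

```scala
// Sketch only: recursively count files under the metadata table path
// with the Hadoop FileSystem API. <bucket>/<table-base-path> are
// placeholders for the real S3 location.
import org.apache.hadoop.fs.Path

val mdtPath = new Path("s3://<bucket>/<table-base-path>/.hoodie/metadata")
val fs = mdtPath.getFileSystem(spark.sparkContext.hadoopConfiguration)

val files = fs.listFiles(mdtPath, /* recursive = */ true)
var count = 0
while (files.hasNext) { files.next(); count += 1 }
println(s"Objects under .hoodie/metadata: $count")
```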

ad1happy2go added the metadata (metadata table) and priority:critical labels Nov 15, 2024
liiang-huang (Author) commented Nov 18, 2024

@ad1happy2go Yes, the reason is:

Executor heartbeat timed out after 636587 ms

There are 229 objects in the .hoodie/metadata/.hoodie folder, and there is a column_stats partition in the metadata folder. Let me know what I should look into further!
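
(A note not from the thread itself: the Hudi metadata table is itself an MOR table stored under the base path, so it can be read with the Hudi datasource to see how large each of its partitions, such as files and column_stats, has grown. A sketch, with placeholder paths and assuming a SparkSession `spark`:)

```scala
// Sketch only: read the metadata table as a regular Hudi table and
// count records per metadata record type. `type` is a field of the
// metadata record schema; placeholders as above.
val mdt = spark.read.format("hudi")
  .load("s3://<bucket>/<table-base-path>/.hoodie/metadata")
mdt.groupBy("type").count().show()
```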

rangareddy (Contributor) commented:

Hi @liiang-huang

Could you please share the Hudi writer configuration and Spark configuration? If possible, could you also provide the timeline so we can check on our end?

ad1happy2go (Collaborator) commented:

@liiang-huang Were you able to get this resolved? I see no update here. Can you please share insights on what the issue was?

liiang-huang (Author) commented:

Hi @ad1happy2go @rangareddy, this is still happening sometimes. There are not many logs other than the heartbeat timeout.

SaveIntoDataSourceCommand org.apache.hudi.Spark31DefaultSource@4fed7de0, with the following options:

hoodie.payload.ordering.field -> daas_internal_ts
hoodie.datasource.hive_sync.database -> pay2bronze
hoodie.datasource.hive_sync.mode -> GLUE
hoodie.filesystem.view.incr.timeline.sync.enable -> false
hoodie.schema.on.read.enable -> true
path -> s3://pay2-datalake-prod-standard/datasets/bronze/paylite-payment-db1-w-slave-20220523-aurora-prod/paylite_payment/sub_payments-1661338391
hoodie.compact.inline.max.delta.seconds -> 3600
hoodie.datasource.write.precombine.field -> daas_internal_ts
hoodie.datasource.write.payload.class -> jp.ne.paypay.daas.data.util.DaaSOverwritePayload
hoodie.compact.inline.trigger.strategy -> NUM_OR_TIME
hoodie.cleaner.fileversions.retained -> 6
hoodie.datasource.meta.sync.enable -> true
hoodie.write.commit.callback.on -> true
hoodie.metadata.enable -> true
hoodie.datasource.hive_sync.table -> paylite_payment_sub_payments
hoodie.datasource.meta_sync.condition.sync -> false
hoodie.write.commit.callback.class -> jp.ne.paypay.daas.data.metrics.DaasHudiWriteCommitCallback
hoodie.index.type -> BLOOM
hoodie.datasource.write.operation -> upsert
hoodie.rollback.using.markers -> false
hoodie.metrics.reporter.type -> CLOUDWATCH
hoodie.datasource.write.recordkey.field -> id
hoodie.table.name -> paylite_payment_sub_payments
hoodie.datasource.write.table.type -> MERGE_ON_READ
hoodie.datasource.write.hive_style_partitioning -> true
hoodie.datasource.write.table.name -> paylite_payment_sub_payments
hoodie.cleaner.policy -> KEEP_LATEST_FILE_VERSIONS
hoodie.write.markers.type -> DIRECT
hoodie.compact.inline -> true
hoodie.datasource.compaction.async.enable -> false
hoodie.metrics.on -> true
hoodie.upsert.shuffle.parallelism -> 200
hoodie.meta.sync.client.tool.class -> org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
hoodie.datasource.write.partitionpath.field -> daas_date
hoodie.compact.inline.max.delta.commits -> 1
hoodie.payload.event.time.field -> daas_internal_ts
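
(For context, these options correspond to a Scala write of roughly this shape; this is an illustrative reconstruction only, `df` stands for the incoming batch DataFrame, and just a subset of the options above is repeated:)

```scala
// Illustrative sketch of the Hudi datasource write implied by the
// config dump above; values are taken verbatim from that dump.
df.write.format("hudi")
  .option("hoodie.table.name", "paylite_payment_sub_payments")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "daas_date")
  .option("hoodie.datasource.write.precombine.field", "daas_internal_ts")
  .option("hoodie.compact.inline", "true")
  .option("hoodie.compact.inline.max.delta.commits", "1")
  .option("hoodie.metadata.enable", "true")
  .mode("append")
  .save("s3://pay2-datalake-prod-standard/datasets/bronze/paylite-payment-db1-w-slave-20220523-aurora-prod/paylite_payment/sub_payments-1661338391")
```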

Spark config:

Property | Value
-- | --
spark.driver.memory | 80g
spark.driver.port | 35617
spark.dynamicAllocation.enabled | true
spark.dynamicAllocation.executorIdleTimeout | 600s
spark.dynamicAllocation.initialExecutors | 5
spark.dynamicAllocation.maxExecutors | 19
spark.dynamicAllocation.minExecutors | 3
spark.dynamicAllocation.shuffleTracking.enabled | true
spark.eventLog.dir | /tmp/spark-event-logs/
spark.eventLog.enabled | true
spark.executor.cores | 24
spark.executor.extraClassPath | /tmp:/opt/amazon/conf:/opt/amazon/glue-manifest.jar
spark.executor.heartbeatInterval | 3000s
spark.executor.id | driver
spark.executor.memory | 96g
spark.executor.memoryOverhead | 12g
spark.extraListeners | com.amazonaws.services.glueexceptionanalysis.GlueExceptionAnalysisListener
spark.files.overwrite | true
spark.glue.connection-names | daas_ingester_connection
spark.glue.enable-continuous-cloudwatch-log | false
spark.glue.enable-continuous-log-filter | true
spark.glue.enable-job-insights | true
spark.glue.endpoint | https://glue-jes.ap-northeast-1.amazonaws.com
spark.glue.extra-files | s3://pay2-datalake-prod-scripts/daas/log4jproperties/WARN/log4j.properties
spark.glue.extra-jars | s3://pay2-datalake-prod-scripts/daas/libs/daas-data-core-assembly-latest.jar
spark.glue.GLUE_COMMAND_CRITERIA | glueetl
spark.glue.GLUE_TASK_GROUP_ID | 8944fe9d-6a5d-449b-8fe7-8b160959b19b
spark.glue.GLUE_VERSION | 3.0
spark.glue.java-options | -XX:+UseCompressedOops -XX:+UseG1GC -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails
spark.glue.JOB_NAME | paylite_payment_sub_payments-binlog-ingester-1661338391
spark.glue.JOB_RUN_ID | jr_35d97af6b19f1fd8227f7a2db329e1e90fc05bbfebf90125180de591acb7298d
spark.glue.USE_PROXY | true
spark.glue.user-jars-first | false
spark.glueAppInsightsLog.dir | /tmp/glue-app-insights-logs/
spark.glueExceptionAnalysisEventLog.dir | /tmp/glue-exception-analysis-logs/
spark.glueJobInsights.enabled | true
spark.hadoop.aws.glue.endpoint | https://glue.ap-northeast-1.amazonaws.com
spark.hadoop.aws.glue.proxy.host | 169.254.76.0
spark.hadoop.aws.glue.proxy.port | 8888
spark.hadoop.fs.s3.buffer.dir | /tmp/hadoop-spark/s3
spark.hadoop.fs.s3.impl | com.amazon.ws.emr.hadoop.fs.EmrFileSystem
spark.hadoop.glue.michiganCredentialsProviderProxy | com.amazonaws.services.glue.remote.LakeformationCredentialsProvider
spark.hadoop.hive.metastore.client.factory.class | com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
spark.hadoop.hive.metastore.warehouse.dir | /tmp/spark-warehouse
spark.hadoop.lakeformation.credentials.url | http://169.254.76.0:9998/lakeformationcredentials
spark.hadoop.mapred.output.committer.class | org.apache.hadoop.mapred.DirectOutputCommitter
spark.hadoop.mapred.output.direct.EmrFileSystem | true
spark.hadoop.mapred.output.direct.NativeS3FileSystem | true
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version | 2
spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs | false
spark.hadoop.parquet.enable.summary-metadata | false
spark.jars | CanalBinlog2Hudi.scala.jar
spark.kryo.registrator | org.apache.spark.HoodieSparkKryoRegistrar
spark.master | jes
spark.metrics.conf.*.sink.GlueCloudwatch.class | org.apache.spark.metrics.sink.GlueCloudwatchSink
spark.metrics.conf.*.sink.GlueCloudwatch.jobName | paylite_payment_sub_payments-binlog-ingester-1661338391
spark.metrics.conf.*.sink.GlueCloudwatch.jobRunId | jr_35d97af6b19f1fd8227f7a2db329e1e90fc05bbfebf90125180de591acb7298d
spark.metrics.conf.*.sink.GlueCloudwatch.namespace | Glue
spark.metrics.conf.*.source.jvm.class | org.apache.spark.metrics.source.JvmSource
spark.metrics.conf.*.source.s3.class | org.apache.spark.metrics.source.S3FileSystemSource
spark.metrics.conf.*.source.system.class | org.apache.spark.metrics.source.SystemMetricsSource
spark.metrics.conf.driver.source.aggregate.class | org.apache.spark.metrics.source.AggregateMetricsSource
spark.network.timeout | 3100s
spark.pyFiles |  
spark.pyspark.python | /usr/bin/python3
spark.rpc.askTimeout | 600
spark.scheduler.mode | FIFO
spark.serializer | org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enable | false
spark.shuffle.service.enabled | false
spark.sql.avro.datetimeRebaseModeInRead | CORRECTED
spark.sql.avro.datetimeRebaseModeInWrite | CORRECTED
spark.sql.catalogImplementation | hive
spark.sql.extensions | org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.sql.legacy.avro.datetimeRebaseModeInRead | CORRECTED
spark.sql.legacy.avro.datetimeRebaseModeInWrite | CORRECTED
spark.sql.legacy.parquet.datetimeRebaseModeInRead | CORRECTED
spark.sql.legacy.parquet.datetimeRebaseModeInWrite | CORRECTED
spark.sql.legacy.parquet.int96RebaseModeInRead | CORRECTED
spark.sql.legacy.parquet.int96RebaseModeInWrite | CORRECTED
spark.sql.parquet.datetimeRebaseModeInRead | CORRECTED
spark.sql.parquet.datetimeRebaseModeInWrite | CORRECTED
spark.sql.parquet.fs.optimized.committer.optimization-enabled | true
spark.sql.parquet.int96RebaseModeInRead | CORRECTED
spark.sql.parquet.int96RebaseModeInWrite | CORRECTED
spark.sql.parquet.output.committer.class | com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter
spark.sql.shuffle.partitions | 1500
spark.ui.enabled | false
spark.unsafe.sorter.spill.read.ahead.enabled | false
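
(One observation on the table above, not a confirmed fix: spark.executor.heartbeatInterval is set to 3000s, only just below spark.network.timeout at 3100s, while Spark's documentation says the heartbeat interval should be significantly less than the network timeout. A sketch of more conventional values, with a hypothetical app name:)

```scala
// Sketch only: keep the heartbeat interval far below the network
// timeout so a few missed heartbeats don't get an executor killed.
// In Glue the session is usually provided, so this is illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("paylite-ingest") // hypothetical app name
  .config("spark.executor.heartbeatInterval", "60s")
  .config("spark.network.timeout", "600s")
  .getOrCreate()
```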

