[HUDI-8596] Fix Hudi Spark SQL cancelling issue #12358

Davis-Zhang-Onehouse · 2024-11-28T00:56:10Z

Change Logs

In Hudi's HoodieSparkEngineContext, the jobGroupId is set to activeModule, so when a query is cancelled and the Thrift server looks for active jobs by the SQL statement ID, nothing is found.

Any user of spark-hudi, including OSS users, has this issue, not just OH.
The issue can be avoided simply by not overriding jobGroupId with activeModule.
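
To illustrate the lookup mismatch described above, here is a minimal plain-Java toy model. It is not Spark or Hudi code: the class JobGroupCancelSketch, its methods, and the IDs "statement-1" and "example-hudi-module" are all hypothetical and exist only for this sketch. It assumes the scheduler tracks running jobs by job group ID and that cancellation looks them up by the SQL statement ID.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a scheduler that tracks running jobs by job group ID (hypothetical, not Spark code). */
public class JobGroupCancelSketch {
    // job group ID -> IDs of the jobs currently running in that group
    private final Map<String, List<Integer>> activeJobsByGroup = new HashMap<>();
    private int nextJobId = 0;

    /** Registers a running job under the given job group ID and returns its job ID. */
    public int submit(String jobGroupId) {
        int jobId = nextJobId++;
        activeJobsByGroup.computeIfAbsent(jobGroupId, k -> new ArrayList<>()).add(jobId);
        return jobId;
    }

    /** Cancels every running job in the group and returns how many jobs were cancelled. */
    public int cancelJobGroup(String jobGroupId) {
        List<Integer> jobs = activeJobsByGroup.remove(jobGroupId);
        return jobs == null ? 0 : jobs.size();
    }

    public static void main(String[] args) {
        String statementId = "statement-1"; // placeholder for the SQL statement ID

        // Before the fix: the job group ID was overridden with the module name,
        // so a lookup by statement ID matches nothing.
        JobGroupCancelSketch before = new JobGroupCancelSketch();
        before.submit("example-hudi-module");
        System.out.println("before fix: cancelled " + before.cancelJobGroup(statementId) + " job(s)");

        // After the fix: the job keeps the statement ID as its group ID,
        // so the same lookup finds and cancels it.
        JobGroupCancelSketch after = new JobGroupCancelSketch();
        after.submit(statementId);
        System.out.println("after fix: cancelled " + after.cancelJobGroup(statementId) + " job(s)");
    }
}
```

In this model the first lookup cancels 0 jobs and the second cancels 1, which mirrors the Before and After behavior reported below.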

Initial state: a Spark job is inflight; press Ctrl+C in beeline.

After the change:
The job group of the Spark task and the query statement ID are the same.
The inflight job is aborted as soon as the interruption is acked, which can be told from two things: 1. the failed job has 0 tasks completed, as I stopped the job at the very beginning of its execution; 2. the inflight job ends up as a "Failed job". I can also see that the job execution throws an exception whose cause is "InterruptedException", which is caused by thread.cancel(). (the 1st figure)
Also, no new jobs are scheduled, as shown by no new jobs appearing after the failed job occurs (the 1st figure). In addition, the same query run without interruption has many more completed jobs (the 3rd figure).
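
The InterruptedException mentioned above is the standard JVM signal that a blocked thread was interrupted. As a rough plain-Java sketch (not Spark code), the class below uses Future.cancel(true) as a stand-in for the scheduler interrupting a running task; the class name InterruptSketch and the 60-second sleep are assumptions made for illustration only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Shows a long-running task being stopped by thread interruption (hypothetical demo class). */
public class InterruptSketch {

    /** Starts a task, cancels it with interruption, and reports whether it saw InterruptedException. */
    public static boolean runAndCancel() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);

        Future<?> job = pool.submit(() -> {
            started.countDown();
            try {
                Thread.sleep(60_000); // stands in for long-running work
            } catch (InterruptedException e) {
                sawInterrupt.set(true); // the task is aborted instead of running to completion
            } finally {
                finished.countDown();
            }
        });

        started.await();      // make sure the task is actually running
        job.cancel(true);     // true = interrupt the worker thread
        boolean done = finished.await(5, TimeUnit.SECONDS);
        pool.shutdownNow();
        return done && sawInterrupt.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("task interrupted: " + runAndCancel());
    }
}
```

Calling cancel(false) instead would leave the running task untouched, which is loosely analogous to the Before behavior where the inflight job runs to completion.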

Before:
Delivering the query interruption at the same point, where a new Spark job is inflight, we saw:
[Same as After] No new jobs are scheduled after the interruption is acked.
[Diff from After] The inflight job continues to execute until it becomes a "Completed Job".
[Diff from After] Each Spark job group ID is overridden with the Hudi module that is currently running, instead of using the SQL statement ID.
(start of spark jobs)

(after cancellation) the interrupted task is shown as a "Completed job"

Other coverage:
When there is an inflight Spark job, terminating the beeline connection behaves the same as Ctrl+C.
If the localhost port-forward is killed (simulating connection loss), the behavior is the same as cancelling the query with Ctrl+C.
The logs show that when Spark loses its connection with the client, the relevant job is cleaned up:

24/11/27 11:01:54 INFO Executor: Executor killed task 0.0 in stage 1.0 (TID 1), reason: Stage cancelled
24/11/27 11:01:54 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (10.0.0.72 executor driver): TaskKilled (Stage cancelled)
24/11/27 11:01:54 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
24/11/27 11:01:55 INFO DAGScheduler: Asked to cancel job group ebecc7ab-14f4-4d17-9060-7b7ed054fe63
24/11/27 11:01:55 WARN SparkExecuteStatementOperation: Ignore exception in terminal state with ebecc7ab-14f4-4d17-9060-7b7ed054fe63: org.apache.hudi.exception.HoodieException: Failed to instantiate Metadata table <--- the cause of this exception is thread interruption

Impact

When we cancel Spark SQL queries, either by disconnecting the client from the network or by pressing Ctrl+C while running queries through tools like beeline, the entire query is killed, including the inflight Spark job.

Without this change, a cancelled query does not exit until the current Spark job finishes.

Risk level (write none, low medium or high below)

None

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

...i-spark-client/src/test/java/org/apache/hudi/client/common/TestHoodieSparkEngineContext.java

hudi-bot · 2025-01-03T17:58:49Z

CI report:

bd10225 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

yihua

LGTM

[HUDI-8596] Fix hudi spark sql cancelling issue

b74899f

Davis-Zhang-Onehouse force-pushed the HUDI-8596 branch from 36f00d6 to b74899f Compare November 28, 2024 00:59

github-actions bot added the size:S PR with lines of changes in (10, 100] label Nov 28, 2024

yihua reviewed Jan 2, 2025


...i-spark-client/src/test/java/org/apache/hudi/client/common/TestHoodieSparkEngineContext.java

address PR comments

bd10225

yihua approved these changes Jan 3, 2025


yihua changed the title ~~[HUDI-8596] Fix hudi spark sql cancelling issue~~ [HUDI-8596] Fix Hudi Spark SQL cancelling issue Jan 3, 2025

yihua merged commit d6d4e88 into apache:master Jan 3, 2025
44 checks passed
