[HUDI-8596] Fix Hudi Spark SQL cancelling issue #12358

Merged: 2 commits, Jan 3, 2025


Davis-Zhang-Onehouse
Contributor

Change Logs

In Hudi, HoodieSparkEngineContext sets the Spark jobGroupId to the activeModule. As a result, when a query is cancelled, the Thrift server looks for active jobs by the SQL statement id and finds nothing.

Any user of Spark with Hudi, including OSS users, hits this issue, not just OH.
The issue can be avoided simply by not overriding jobGroupId with activeModule.
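
The mismatch can be sketched with a toy Python model. This is not Spark or Hudi code: `ToyScheduler`, `run_statement`, the statement id `stmt-1`, and the module name `hudi-module` are all hypothetical. The only assumption is that cancellation looks up active jobs by the group id they were tagged with at submission.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyScheduler:
    """Toy stand-in for a scheduler that tracks active jobs by job group id."""

    current_group: Optional[str] = None
    active_jobs: Dict[int, Optional[str]] = field(default_factory=dict)  # job id -> group id
    next_id: int = 0

    def set_job_group(self, group_id: str) -> None:
        # Every job submitted after this call is tagged with group_id.
        self.current_group = group_id

    def submit_job(self) -> int:
        job_id = self.next_id
        self.next_id += 1
        self.active_jobs[job_id] = self.current_group
        return job_id

    def cancel_job_group(self, group_id: str) -> List[int]:
        # Cancel only the active jobs whose tag equals group_id.
        cancelled = [j for j, g in self.active_jobs.items() if g == group_id]
        for j in cancelled:
            del self.active_jobs[j]
        return cancelled


def run_statement(override_group: bool) -> List[int]:
    """Submit one job for a statement, then cancel by statement id."""
    statement_id = "stmt-1"  # hypothetical SQL statement id
    scheduler = ToyScheduler()
    scheduler.set_job_group(statement_id)  # server side: group id = statement id
    if override_group:
        # Pre-fix behavior in this model: the engine context replaces the group id.
        scheduler.set_job_group("hudi-module")  # hypothetical module name
    scheduler.submit_job()
    return scheduler.cancel_job_group(statement_id)


cancelled_before_fix = run_statement(override_group=True)
cancelled_after_fix = run_statement(override_group=False)
```

In this model the overridden run cancels nothing, because no active job carries the statement id, while the run that leaves the group id alone cancels the in-flight job.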

Initial state: a Spark job is in flight; press Ctrl+C in beeline.
image

After the change:
The job group of the Spark task and the query statement id are the same.
The in-flight job is aborted as soon as the interruption is acked. Two observations show this: 1. the failed job has 0 tasks completed, because I stopped the job at the very beginning of execution; 2. the in-flight job ends up as a "Failed job". The job execution also throws an exception whose cause is "InterruptedException", triggered by thread.cancel() (the 1st figure).
No new jobs are scheduled either: none appear after the failed job (the 1st figure). By comparison, the same query run without interruption produces many more completed jobs (the 3rd figure).

image
image
image

Before:
Delivering the query interruption at the same point, with a new Spark job in flight, we saw:
[Same as After] No new jobs are scheduled after the interruption is acked.
[Diff from After] The in-flight job continues to execute until it becomes a "Completed Job".
[Diff from After] Each Spark job group id is overridden with the Hudi module that is currently running, instead of using the SQL statement id.
(start of spark jobs)
image

(after cancellation) the interrupted task is shown as a "Completed job"
image

Other coverage:
When there is an in-flight Spark job and the beeline connection is terminated, the behavior is the same as Ctrl+C.
If the localhost port-forward is killed (simulating connection loss), the behavior is the same as cancelling the query with Ctrl+C.
The logs show that when Spark loses its connection to the client, the relevant job is cleaned up:

24/11/27 11:01:54 INFO Executor: Executor killed task 0.0 in stage 1.0 (TID 1), reason: Stage cancelled
24/11/27 11:01:54 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (10.0.0.72 executor driver): TaskKilled (Stage cancelled)
24/11/27 11:01:54 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
24/11/27 11:01:55 INFO DAGScheduler: Asked to cancel job group ebecc7ab-14f4-4d17-9060-7b7ed054fe63
24/11/27 11:01:55 WARN SparkExecuteStatementOperation: Ignore exception in terminal state with ebecc7ab-14f4-4d17-9060-7b7ed054fe63: org.apache.hudi.exception.HoodieException: Failed to instantiate Metadata table <--- the cause of this exception is thread interruption
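
One way to check a log like this is to confirm that the job group the DAGScheduler was asked to cancel is the same id the statement operation reports. The sketch below does that with regular expressions over the two relevant lines copied from the excerpt above. The patterns and helper names are assumptions written for this excerpt, not a general Spark log parser.

```python
import re

# The two relevant lines, copied from the log excerpt above.
LOG_LINES = [
    "24/11/27 11:01:55 INFO DAGScheduler: Asked to cancel job group "
    "ebecc7ab-14f4-4d17-9060-7b7ed054fe63",
    "24/11/27 11:01:55 WARN SparkExecuteStatementOperation: Ignore exception in "
    "terminal state with ebecc7ab-14f4-4d17-9060-7b7ed054fe63: "
    "org.apache.hudi.exception.HoodieException: Failed to instantiate Metadata table",
]

UUID = r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"
# Assumed message shapes, taken only from the lines above.
CANCEL_RE = re.compile(rf"DAGScheduler: Asked to cancel job group ({UUID})")
STATEMENT_RE = re.compile(rf"SparkExecuteStatementOperation: .*? with ({UUID})")


def ids_matching(pattern, lines):
    """Collect the captured id from every line the pattern matches."""
    found = set()
    for line in lines:
        match = pattern.search(line)
        if match:
            found.add(match.group(1))
    return found


cancelled_groups = ids_matching(CANCEL_RE, LOG_LINES)
statement_ids = ids_matching(STATEMENT_RE, LOG_LINES)
# Ids that appear both as a cancelled job group and as a statement id.
matched = cancelled_groups & statement_ids
```

A non-empty `matched` set means the cancelled job group and the statement id agree, which is the post-fix behavior described in this PR.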

Impact

When a Spark SQL query is cancelled, either by disconnecting the client from the network or by pressing Ctrl+C while running queries through tools like beeline, the entire query is killed, including the in-flight Spark job.

Without this change, a cancelled query does not exit until the current Spark job finishes.
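
The difference in outcome can be summarized with a small simulation. It is a simplified model of the behavior described in this PR, not Spark code: `simulate_query` and its parameters are hypothetical, and a query is assumed to run its jobs one after another.

```python
from typing import List, Tuple


def simulate_query(total_jobs: int, cancel_during: int, group_matches: bool) -> List[Tuple[int, str]]:
    """Return (job index, final state) for every job that gets scheduled.

    cancel_during is the index of the job in flight when the cancel arrives.
    group_matches says whether the cancel can find that job by its group id.
    """
    outcomes = []
    for job in range(total_jobs):
        if job < cancel_during:
            # Jobs that finished before the cancel arrived.
            outcomes.append((job, "COMPLETED"))
        elif job == cancel_during:
            # The in-flight job is aborted only if the group id lookup succeeds.
            outcomes.append((job, "FAILED" if group_matches else "COMPLETED"))
            break  # in both cases, no further jobs are scheduled
    return outcomes


with_fix = simulate_query(total_jobs=5, cancel_during=1, group_matches=True)
without_fix = simulate_query(total_jobs=5, cancel_during=1, group_matches=False)
```

In both runs only two of the five jobs are ever scheduled. With the fix the second job ends as failed; without it the second job runs to completion before the query exits.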

Risk level (write none, low medium or high below)

None

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Nov 28, 2024
@hudi-bot

hudi-bot commented Jan 3, 2025

CI report:


Contributor

@yihua yihua left a comment


LGTM

@yihua yihua changed the title [HUDI-8596] Fix hudi spark sql cancelling issue [HUDI-8596] Fix Hudi Spark SQL cancelling issue Jan 3, 2025
@yihua yihua merged commit d6d4e88 into apache:master Jan 3, 2025
44 checks passed