[HUDI-8596] Fix Hudi Spark SQL cancelling issue #12358
Merged
Change Logs
In Hudi's HoodieSparkEngineContext, the Spark jobGroupId is overridden with the name of the active Hudi module. So when a query cancellation happens and the Thrift Server tries to find active jobs by the SQL statement id (which it had set as the job group), nothing is found and the inflight jobs keep running.
Any user running Spark SQL with Hudi has this issue, including OSS users, not just OH.
The issue is avoided simply by not overriding the jobGroupId with activeModule.
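For context, here is a minimal sketch of the Spark job-group contract this bug breaks; the class, ids, and descriptions below are illustrative, not Hudi's actual code:

```java
import org.apache.spark.api.java.JavaSparkContext;

// Illustrative sketch of the job-group contract that query cancellation
// relies on; ids and descriptions are made up for the example.
public class JobGroupCancellationSketch {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[2]", "cancel-demo");

    // The Spark Thrift Server tags every job of a statement with the SQL
    // statement id so it can cancel them later.
    String statementId = "stmt-123";
    jsc.setJobGroup(statementId, "SQL statement", true /* interruptOnCancel */);

    // If a downstream component re-sets the job group, e.g. to the Hudi
    // module currently running, the statement id tag is lost:
    jsc.setJobGroup("clean", "cleaning old file slices", true);

    // The Thrift Server's cancellation now matches no inflight jobs, so
    // they run to completion instead of being interrupted.
    jsc.cancelJobGroup(statementId);

    jsc.stop();
  }
}
```

Because cancelJobGroup only matches jobs tagged with the exact group id, any component that re-sets the group id effectively opts those jobs out of cancellation.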
Initial state: with a Spark job inflight, press Ctrl+C in beeline.
After the change:
The job group of the Spark tasks and the query statement id are the same (see the sketch after this list).
The inflight job is aborted as soon as the interruption is acknowledged, which can be seen from: 1. the failed job has 0 tasks completed, since I stopped the job at the very beginning of its execution; 2. the inflight job ends up as a "Failed Job". The job execution also throws an exception whose cause is an InterruptedException, raised by the task thread being interrupted (the 1st figure).
No new jobs are scheduled after the failed job occurs (the 1st figure). By comparison, the same query run without interruption completes many more jobs (the 3rd figure).
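To make the "job group and statement id are the same" point concrete, here is a rough sketch of the fix direction, assuming a setJobStatus-style hook like HoodieSparkEngineContext's; the method shapes are illustrative, not a verbatim copy of the change:

```java
import org.apache.spark.api.java.JavaSparkContext;

// Sketch only: the real change lives in HoodieSparkEngineContext; this
// just contrasts the two behaviors.
public class EngineContextSketch {
  private final JavaSparkContext javaSparkContext;

  public EngineContextSketch(JavaSparkContext jsc) {
    this.javaSparkContext = jsc;
  }

  // Before: the job group id (set to the SQL statement id by the Thrift
  // Server) is overwritten with the active Hudi module name.
  public void setJobStatusBefore(String activeModule, String activityDescription) {
    javaSparkContext.setJobGroup(activeModule, activityDescription);
  }

  // After: the existing job group is preserved, so cancelJobGroup(statementId)
  // still matches; only the description shown in the Spark UI is updated.
  public void setJobStatusAfter(String activeModule, String activityDescription) {
    javaSparkContext.sc().setJobDescription(activeModule + ": " + activityDescription);
  }
}
```

Since the group id set by the Thrift Server is left untouched, its interruptOnCancel flag also stays in effect, which is consistent with the InterruptedException observed above.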
Before:
Delivering the query interruption at the same point, with a new Spark job inflight, we observed:
[Same as After] No new jobs were scheduled after the interruption was acknowledged.
[Diff from After] The inflight job continues to execute until it becomes a "Completed Job".
[Diff from After] Each Spark job group id is overridden with the Hudi module currently running, instead of the SQL statement id.
(start of Spark jobs)
(after cancellation) the interrupted job is shown as a "Completed Job"
Other coverage:
With a Spark job inflight, terminating the beeline connection behaves the same as Ctrl+C.
Killing the localhost port-forward (simulating connection loss) behaves the same as cancelling the query with Ctrl+C.
The logs show that when Spark loses the connection with the client, the relevant job is cleaned up (a JDBC reproduction sketch follows this list).
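For reproducing the coverage above without beeline, the sketch below cancels a running statement over JDBC, which exercises the same Thrift Server cancellation path as Ctrl+C. The URL, port, and table names are assumptions for a local Spark Thrift Server, and the Hive JDBC driver is assumed to be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Reproduction sketch: cancel a long-running statement from another thread.
public class CancelOverJdbc {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement()) {
      Thread canceller = new Thread(() -> {
        try {
          Thread.sleep(5_000); // let some Spark jobs get inflight
          stmt.cancel();       // same cancellation request beeline sends on Ctrl+C
        } catch (Exception e) {
          e.printStackTrace();
        }
      });
      canceller.start();
      try {
        // A long-running query against a Hudi table (table name is an assumption).
        stmt.execute("SELECT count(*) FROM hudi_tbl a JOIN hudi_tbl b ON a.id = b.id");
      } catch (Exception e) {
        System.out.println("Query cancelled: " + e.getMessage());
      }
      canceller.join();
    }
  }
}
```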
Impact
When we cancel Spark SQL queries, either by disconnecting the client from the network or by pressing Ctrl+C in tools like beeline, the entire query is killed, including the inflight Spark job.
Without this change, a cancelled query does not exit until the current Spark job finishes.
Risk level (write none, low medium or high below)
None
Documentation Update
None
Contributor's checklist