-
Notifications
You must be signed in to change notification settings - Fork 117
handle failed executor event #602
base: branch-2.2-kubernetes
Are you sure you want to change the base?
handle failed executor event #602
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general question: how does Yarn handle this case, i.e., of executors that fail to register?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/executors
/Executors
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty space before (
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 1000? What about totalExpectedExecutors
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totalExpectedExecutors
sounds a good idea to limit the executor numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Javadoc needs a bit of rewording to explain what the return value represents and how it is calculated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about this
// It represent current created executors exclude failed one.
// To avoid to create too many failed executor,
// we limit the accounting size of failed executors to totalExpectedExecutors
// so after create 2*totalExpectedExecutors executors,
// we stop create more even if all of them failed
BTW: this fix should also be upstreamed. Can you file a PR against upstream apache/master? |
Not so familiar with spark-on-yarn, I think if allocateResponse.getCompletedContainersStatuses can return this kind of executors, then in yarn mode, it can handle this scenario just like registered but failed executor.
Will do this after this is merged |
d7906c4
to
1e22fab
Compare
Signed-off-by: forrestchen <[email protected]>
1e22fab
to
cf35cda
Compare
Signed-off-by: forrestchen [email protected]
see #600
(Please fill in changes proposed in this fix)
How was this patch tested?
manual test.