Potential Split Brain Between Task Executor State and Job Actor State #781
I think this might fix the issue: #782. We haven't had luck reproducing the issue, but will try and see if this helps.
Hi @kmg-stripe, thank you so much for the detailed insights here. I think I understand the problem now, and it is definitely a bug of out-of-sync state between resource cluster tracking and job actors (this is not the first bug we have found and fixed on this topic).
Hello @Andyz26 !
We recently hit a weird case that seems like a potential bug, or maybe a limitation.
We have a two-stage job where, for a short time, all stage 2 workers were pulling events from what they thought was a stage 1 worker but was actually a worker for an unrelated job.
As far as I can tell from the logs and code, it appears that:
1. A-worker-1 on task executor TE-1 fails hard (looks like a JVM segfault)
2. TE-1 re-registers itself and is marked available
3. B-worker-2 is mapped to TE-1
4. A-worker-1 is still mapped to TE-1 in its scheduling info, so workers in stage 2 of A are pulling events from B-worker-2
5. A-worker-1 hits a heartbeat timeout and is failed by the JobActor.

The main issue: Even though TE-1 is bound to A-worker-1 according to the JobActor, when it fails hard and comes back up, it is re-registered as available. I would expect there to be state knowing that TE-1 should not be marked as available.

Have you seen this shape of problem before?
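To make the sequence concrete, here is a minimal toy model of it in Python. It is only a sketch: `ResourceCluster`, `stale_worker`, and `simulate` are hypothetical names invented for illustration, not Mantis classes, and the guard shown is one possible shape of the state check I would expect, not necessarily what #782 does.

```python
from typing import Dict, Optional, Set


class ResourceCluster:
    """Toy stand-in for the resource cluster's view of task executors."""

    def __init__(self) -> None:
        self.available: Set[str] = set()
        self.running: Dict[str, str] = {}  # executor id -> worker id

    def register(self, te_id: str) -> None:
        # A freshly (re)started executor runs nothing and becomes available.
        self.running.pop(te_id, None)
        self.available.add(te_id)

    def assign(self, te_id: str, worker_id: str) -> None:
        self.available.remove(te_id)
        self.running[te_id] = worker_id


def stale_worker(te_id: str, scheduling_info: Dict[str, str]) -> Optional[str]:
    """Return the worker that the job's scheduling info still binds to te_id."""
    for worker_id, bound_te in scheduling_info.items():
        if bound_te == te_id:
            return worker_id
    return None


def simulate(guarded: bool) -> Optional[str]:
    """Replay the incident; return the worker that stage 2 of job A reads from."""
    cluster = ResourceCluster()
    job_a_scheduling: Dict[str, str] = {}  # job A's view: worker id -> executor id

    cluster.register("TE-1")
    cluster.assign("TE-1", "A-worker-1")
    job_a_scheduling["A-worker-1"] = "TE-1"

    # TE-1 crashes and re-registers before job A's heartbeat timeout fires.
    if guarded:
        # Hypothetical guard: drop the stale binding before releasing TE-1.
        worker = stale_worker("TE-1", job_a_scheduling)
        if worker is not None:
            del job_a_scheduling[worker]
    cluster.register("TE-1")

    # The scheduler hands the "available" executor to an unrelated job.
    cluster.assign("TE-1", "B-worker-2")

    # Stage 2 of job A resolves its upstream through job A's scheduling info.
    upstream_te = job_a_scheduling.get("A-worker-1")
    return cluster.running.get(upstream_te) if upstream_te else None


print(simulate(guarded=False))  # B-worker-2
print(simulate(guarded=True))   # None
```

Without the guard, stage 2 of A ends up connected to B-worker-2 until the heartbeat timeout fires. With it, the stale binding is gone before TE-1 is handed out again, so stage 2 has no upstream until A-worker-1 is rescheduled.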
Thanks!