Potential Split Brain Between Task Executor State and Job Actor State #781

kmg-stripe · 2025-08-01T19:26:50Z

kmg-stripe
Aug 1, 2025
Collaborator

We recently hit a weird case that seems like a potential bug, or maybe a limitation.

We have a 2-stage job where, for a short time, all stage 2 workers were pulling events from what they thought was a stage 1 worker, but which turned out to be a worker for an unrelated job.

As far as I can tell from the logs and code, it appears that:

1. A stage 1 worker, say A-worker-1 on task executor TE-1, fails hard (looks like a JVM segfault).
2. A minute or so later, TE-1 re-registers itself and is marked available.
3. A minute later, B-worker-2 is mapped to TE-1.
4. At this point Job A still has A-worker-1 mapped to TE-1 in its scheduling info, so workers in stage 2 of A are pulling events from B-worker-2.
5. This persists for 2 more minutes, until A-worker-1 hits a heartbeat timeout and is failed by the JobActor.
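To make the sequence concrete, here is a minimal toy model of the two views of state drifting apart. It is only a sketch: `ResourceCluster`, `JobActor`, and their methods are hypothetical stand-ins invented for illustration, not the actual Mantis classes.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceCluster:
    """Hypothetical stand-in for resource cluster tracking: TE id -> worker (or None)."""
    assignments: dict = field(default_factory=dict)

    def register(self, te_id):
        # A (re-)registering TE is treated as empty and available,
        # regardless of what any job actor still believes.
        self.assignments[te_id] = None

    def assign(self, te_id, worker):
        assert self.assignments.get(te_id) is None, "TE is not available"
        self.assignments[te_id] = worker


@dataclass
class JobActor:
    """Hypothetical stand-in for a job actor's scheduling info: worker -> TE id."""
    scheduling_info: dict = field(default_factory=dict)

    def fail_worker(self, worker):
        self.scheduling_info.pop(worker, None)


cluster = ResourceCluster()
job_a = JobActor()

# Initial healthy state: A-worker-1 runs on TE-1 and both sides agree.
cluster.register("TE-1")
cluster.assign("TE-1", "A-worker-1")
job_a.scheduling_info["A-worker-1"] = "TE-1"

# TE-1 crashes hard, then re-registers and is marked available.
cluster.register("TE-1")

# An unrelated job's worker is mapped to TE-1.
cluster.assign("TE-1", "B-worker-2")

# Job A's stage 2 workers still resolve A-worker-1 to TE-1 ...
te_for_a_worker = job_a.scheduling_info["A-worker-1"]
# ... but the process actually running there belongs to Job B.
actually_running = cluster.assignments[te_for_a_worker]

# Only the heartbeat timeout eventually clears the stale mapping.
job_a.fail_worker("A-worker-1")
```

Between the reassignment and the heartbeat timeout, `actually_running` is `"B-worker-2"` even though Job A looked up `A-worker-1`, which is the window described above.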

The main issue: even though TE-1 is still bound to A-worker-1 according to the JobActor, when it fails hard and comes back up it is re-registered as available. I would expect some state to record that TE-1 still has a worker bound to it and so should not be marked available yet.
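As a rough illustration of the behavior I would expect, here is a sketch of a registration path that checks for an existing binding before marking a TE available. All names here (`TaskExecutorRegistry`, `TEState`, `PENDING_RELEASE`) are hypothetical and made up for this example; they are not taken from the Mantis codebase.

```python
from enum import Enum


class TEState(Enum):
    AVAILABLE = "available"
    ASSIGNED = "assigned"
    PENDING_RELEASE = "pending_release"  # re-registered, but a worker is still bound


class TaskExecutorRegistry:
    """Hypothetical registry that remembers which worker a TE was bound to."""

    def __init__(self):
        self.bound_worker = {}  # te_id -> worker id still believed to be bound
        self.state = {}         # te_id -> TEState

    def bind(self, te_id, worker):
        self.bound_worker[te_id] = worker
        self.state[te_id] = TEState.ASSIGNED

    def register(self, te_id):
        # Do not hand the TE out again while a job still thinks it owns it.
        if te_id in self.bound_worker:
            self.state[te_id] = TEState.PENDING_RELEASE
        else:
            self.state[te_id] = TEState.AVAILABLE
        return self.state[te_id]

    def release(self, te_id):
        # Called once the owning job has dropped the worker from its scheduling info.
        self.bound_worker.pop(te_id, None)
        self.state[te_id] = TEState.AVAILABLE


registry = TaskExecutorRegistry()
registry.bind("TE-1", "A-worker-1")

state_after_crash = registry.register("TE-1")  # TE-1 restarts after the crash
registry.release("TE-1")                       # Job A fails A-worker-1
state_after_release = registry.state["TE-1"]
```

In this sketch the TE only returns to the available pool after `release` is called, so a fresh TE with no prior binding is unaffected.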

Have you seen this shape of problem before?

Thanks!

kmg-stripe · 2025-08-01T21:33:24Z

kmg-stripe
Aug 1, 2025
Collaborator Author

I think this might fix the issue: #782

We haven't had luck reproducing the issue, but will try and see if this helps.

0 replies

Andyz26 · 2025-08-04T17:16:08Z

Andyz26
Aug 4, 2025
Maintainer

Hi @kmg-stripe, thank you so much for the detailed insights here. I think I understand the problem now, and it is definitely a bug caused by out-of-sync state between resource cluster tracking and job actors (this is not the first bug we have found/fixed on this topic).
For our use cases we DO recycle containers in the resource clusters to avoid waiting for a full container provision, so we need the capability to mark a previously assigned container as available again.
I can think of two potential options here:

1. Broadcast a message to job actors when a previously used TE is re-marked as available, so that any existing job actor schedulingInfo can be refreshed. (Perf/load is a concern here, though, given the potential broadcast blast radius.)
2. Mark the TE available only after the default heartbeat timeout expires. Less efficient, but probably a good enough brute-force fix for the short term.
WDYT?
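For reference, the second option above could look roughly like the sketch below: a re-registering TE that previously held a worker is held back for one heartbeat timeout before it is offered again. `DelayedAvailabilityPool` and its parameters are hypothetical names for illustration, the timeout value is made up, and time is passed in explicitly to keep the example deterministic.

```python
class DelayedAvailabilityPool:
    """Hypothetical pool that delays availability of recycled TEs."""

    def __init__(self, heartbeat_timeout_s):
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self._eligible_at = {}  # te_id -> earliest time the TE may be offered

    def register(self, te_id, now, was_assigned):
        # A recycled TE waits out the heartbeat timeout so the owning job actor
        # has had time to fail the old worker; a fresh TE is available immediately.
        delay = self.heartbeat_timeout_s if was_assigned else 0.0
        self._eligible_at[te_id] = now + delay

    def available(self, now):
        return sorted(te for te, t in self._eligible_at.items() if t <= now)


# Illustrative timeout only; the real default is whatever the cluster is configured with.
pool = DelayedAvailabilityPool(heartbeat_timeout_s=240.0)
pool.register("TE-1", now=0.0, was_assigned=True)   # crashed and came back
pool.register("TE-2", now=0.0, was_assigned=False)  # freshly provisioned

early = pool.available(now=60.0)
late = pool.available(now=240.0)
```

The trade-off is the one already noted: recycled capacity sits idle for up to one timeout, in exchange for never handing out a TE that a job actor may still be pointing at.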

3 replies

kmg-stripe Aug 4, 2025
Collaborator Author

Thanks for the help @Andyz26 ! I'll explore doing (2) this week, since I think it should work for us and should be a quick fix.

Andyz26 Aug 5, 2025
Maintainer

@kmg-stripe I tried a direct message update approach, using some internal state to avoid the broadcast. Can you take a look at #784?
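To illustrate the general idea of a targeted update (this is an assumption-laden sketch, not the contents of #784): if the resource cluster keeps a reverse index from TE to its last known owner, it can notify just that one job actor on re-registration instead of broadcasting. The class and method names below are hypothetical.

```python
class JobActor:
    """Hypothetical job actor holding worker -> TE id scheduling info."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.scheduling_info = {}

    def on_te_reclaimed(self, worker, te_id):
        # Drop the mapping only if it still points at the reclaimed TE.
        if self.scheduling_info.get(worker) == te_id:
            del self.scheduling_info[worker]


class ResourceCluster:
    """Hypothetical resource cluster with a TE -> (job actor, worker) reverse index."""

    def __init__(self):
        self.owner = {}

    def assign(self, te_id, job, worker):
        self.owner[te_id] = (job, worker)
        job.scheduling_info[worker] = te_id

    def register(self, te_id):
        # On re-registration, message only the previous owner, if there was one.
        previous = self.owner.pop(te_id, None)
        if previous is not None:
            job, worker = previous
            job.on_te_reclaimed(worker, te_id)


cluster = ResourceCluster()
job_a = JobActor("A")
job_b = JobActor("B")
cluster.assign("TE-1", job_a, "A-worker-1")
cluster.assign("TE-2", job_b, "B-worker-1")

cluster.register("TE-1")  # TE-1 crashed and came back
```

Here only Job A receives a message when TE-1 re-registers; Job B's scheduling info is never touched, which is what keeps the blast radius small compared with a broadcast.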

kmg-stripe Aug 5, 2025
Collaborator Author

Awesome, thank you! I'll take a look today.
