Hello folks! I have a pretty lazy bug report here, so apologies for not going deeper, but I wanted to float a stance by you and see if I can get away with it.
We had a cluster of Nomad servers that lost quorum and would not elect a new leader.
Looking at the logs, the leader at the time was logging this:
2022-08-17T03:11:20.634Z 2022-08-17T03:11:20.634Z [ERROR] snapshot: failed to get snapshots: error="open /run/nomad-server/server/raft/snapshots: no such file or directory"
2022-08-17T03:11:20.634Z 2022-08-17T03:11:20.634Z [ERROR] snapshot: failed to scan snapshot directory: error="open /run/nomad-server/server/raft/snapshots: no such file or directory"
2022-08-17T03:11:20.634Z failed to send snapshot to
2022-08-17T03:11:20.634Z failed to list snapshots
2022-08-17T03:11:20.634Z failed to get log
2022-08-17T03:11:20.608Z failed to list snapshots
2022-08-17T03:11:20.608Z failed to send snapshot to
And the other servers were logging this:
2022-08-17T03:07:55.677Z error waiting for Raft index error=timed out after 5s waiting for index=1525203
So here is my stance:
If the leader is broken because it cannot load the snapshots (I have no idea how we got into this situation, but let's ignore that for now), the other server should realize the leader is useless and usurp him, perhaps by invoking the Praetorian Guard.
Or, more down to earth: this state should cause a heartbeat failure in some way so that we can move past it and elect a new leader.
What do you think?
I'm not persuaded this is something that ought to be handled in the raft library itself. Moreover, the log you cite ("error waiting for Raft index") doesn't look like something from the library, but from Nomad, so it may be that what you're experiencing isn't purely a raft issue. I suggest you file this proposal as an issue on the https://github.com/hashicorp/nomad repo, and the maintainers of that project can decide whether it's better addressed in Nomad or here in the raft library.
I retract what I said earlier: I agree with your original statement
If the leader is broken because it cannot load the snapshots [...] the other server should realize the leader is useless and usurp him
Possible fix: in replicateTo, if we can't load a snapshot, we should step down as leader. The current code specifically doesn't stop replication for this error; it probably should, but there are likely other details we need to consider here.