Skip to content

Conversation

@leviramsey
Copy link
Contributor

Scenario:

  • A marks B unreachable, gossips the observation
  • clean partition putting A & B on minority side
  • all nodes in majority side mark A & B unreachable
  • A and B have not yet seen latest gossip from majority side
  • A may or may not mark B reachable again: in any case A is unable to gossip the reachability change to the majority side

Thus A is indirectly connected (both a subject and observer of unreachability) but B is not. SBR resolution will be DownIndirectlyConnected (adding A to the downing decision) and then consider a secondary decision.

In considering that decision, unreachability edges between nodes where both are in the set of observers and subjects are removed, as are all edges where both of "the subject has seen the latest gossip" and "observer has observed a subject which had seen the latest gossip" are true. If indirectly connected nodes remain, the decision effectively becomes DownAll. These nodes can only be indirectly connected by the first criterion (intersection between observers and subjects): the definition of the second criterion (unreachable but seen latest gossip) is always removed (as any observers of that subject are also indirectly connected by the second criterion).

In this scenario, the SBR on the other side from A & B will DownAll: all nodes on either side of the partition after this are either crashed or downed.

The double-down-indirectly-connected is safe, but it seems to be literal overkill.

Copy link
Contributor

@patriknw patriknw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good optimization with the disclaimer that it might have unforeseen consequences.

}
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add more tests that are covering similar thing for the other strategies? KeepOldest, DownAllNodes, LeaseMajority

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would also be good to expand or add new multi-jvm test that cover this scenario. You can run locally with:

multi-jvm:testOnly akka.cluster.sbr.IndirectlyConnected5NodeSpec

Note that we also have SBR tests in akka-cluster-sharding/src/multi-jvm/scala/akka/cluster/sbr/

val unreachableByDirectlyConnectedObservers =
reachability.records.iterator.flatMap { r =>
if (directlyConnectedObservers(r.observer)) Some(r.subject) else None
}.toSet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't that a collect?

          reachability.records.iterator.collect { case r if directlyConnectedObservers(r.observer) =>
               r.subject
            }.toSet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I tend to forget about collect...

throw new IllegalStateException(
s"SBR double $additionalDecision decision, downing all instead. " +
s"originalReachability: [$originalReachability], filtered reachability [$reachability], " +
s"still indirectlyConnected: [$indirectlyConnected], seenBy: [$seenBy]")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we should keep this message around and include it in the final IllegalStateException, so that we have full information about both decisions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants