[DPE-5232;DPE-5233] feat: support for scaling operations in KRaft mode (single & multi-app) #281
base: main
Conversation
At a high level, the code looks good to me. I really appreciate the introduction of managers and handlers for the Controller, which I believe improves the structure of the code.
The business logic seems reasonable to me, although I'm not an expert on KRaft, so I'd defer to others on that. If @zmraul has bandwidth, it would also be good to get his thoughts/review on the more KRaft-specific business logic and concepts, unless both Marc and Alex feel comfortable with this.
Just a few small points here and there. Great work @imanenami!
Amazing work! I'm surprised how simple the implementation feels compared to the complexity of the task at hand, great work :)
I left a couple of TODOs, and some points that should be considered in the future.
@@ -226,6 +232,13 @@ def unit_broker(self) -> KafkaBroker:
            substrate=self.substrate,
        )

    @property
    def kraft_unit_id(self) -> int:
todo: this needs to be called bootstrap_unit_id to be consistent with models, no?
Thanks, I prefer to keep it this way because bootstrap_unit_id is the KRaft unit id of our assumed leader in the whole cluster.
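For context, a minimal sketch of the naming distinction being discussed; the offset value, attribute names, and databag key below are illustrative assumptions, not the charm's actual code:

```python
# Sketch of the distinction between the unit's own KRaft node id and the cluster-wide
# bootstrap leader's id. All names and values here are assumptions for illustration.

KRAFT_NODE_ID_OFFSET = 100  # hypothetical offset separating controller node ids from broker ids


class ClusterState:
    def __init__(self, unit_number: int, relation_data: dict[str, str]):
        self.unit_number = unit_number
        self.relation_data = relation_data

    @property
    def kraft_unit_id(self) -> int:
        """KRaft node id of *this* unit, derived from its own Juju unit number."""
        return KRAFT_NODE_ID_OFFSET + self.unit_number

    @property
    def bootstrap_unit_id(self) -> int:
        """KRaft node id of the assumed leader for the whole cluster, read from shared relation data."""
        return int(self.relation_data.get("bootstrap-unit-id", "-1"))
```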
        return self.relation_data.get("directory-id", "")

    @property
    def added_to_quorum(self) -> bool:
A couple of important points here, not really blocking, but I think they are good candidates for future improvements:
- Logic around added_to_quorum should probably be checked against the cluster itself, and be tracked as little as possible on the charm.
- Something like an upgrade or a failure on the unit will not be correctly resolved. At the moment there is no unsetting of this property: it will stay as true once the unit is first added to the quorum, so a failing unit would look at the databag, see added_to_quorum=True, when in reality the unit is in a recovery state.

Even if we unset the property, a failing unit is still problematic. Chances are remove_from_quorum was never called before the failure, still leading to a wrong recovery state on the unit. This is why I'm suggesting that added_to_quorum generally be a live check against Kafka itself.
These are great and valid points. For the time being, I've fixed the remove_from_quorum part. Adding live checks against the active quorum could be the next logical step.
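As a rough sketch of what such a live check could look like, using Kafka's kafka-metadata-quorum.sh tool; the binary path, the node-id handling, and the output parsing below are assumptions and would need adapting to the actual deployment and Kafka version:

```python
# Illustration of a "live" quorum-membership check against the cluster itself,
# rather than a flag stored in the peer relation databag. Paths and parsing are
# assumptions for this sketch, not the charm's actual implementation.
import re
import subprocess


def added_to_quorum(node_id: int, bootstrap_server: str) -> bool:
    """Ask the cluster whether `node_id` is currently one of the quorum voters."""
    result = subprocess.run(
        [
            "/opt/kafka/bin/kafka-metadata-quorum.sh",  # assumed install path
            "--bootstrap-server", bootstrap_server,
            "describe", "--status",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    # `describe --status` prints a line like `CurrentVoters: [0,1,2]`; the exact
    # format varies across Kafka versions, so real parsing would need to be hardened.
    match = re.search(r"CurrentVoters:\s*\[([^\]]*)\]", result.stdout)
    if not match:
        return False
    voters = {int(v) for v in match.group(1).split(",") if v.strip().isdigit()}
    return node_id in voters
```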
            initial_controllers=f"{self.charm.state.peer_cluster.bootstrap_unit_id}@{self.charm.state.peer_cluster.bootstrap_controller}:{self.charm.state.peer_cluster.bootstrap_replica_id}",
        )

    def _leader_elected(self, event: LeaderElectedEvent) -> None:
todo: I'm a bit itchy about seeing/using the leader_elected event. Following up on @marcoppenheimer's comment, I would rather have this logic happen in update_status or config_changed (since update-status calls config-changed).
- afaik, there are no guarantees that leader_elected ever triggers after the first deployment.
- Triggers of leader_elected tend to happen in the context of the controller failing to see a unit, which can mean unit failure, so having this code close to failure states is not the best :)
Thanks for the comment. I had a discussion with @marcoppenheimer and we agreed to keep it on the leader_elected hook after refactoring it into the controller's own event handler. Basically, it's something we want to change when Juju leadership changes. While I agree that checking it on other hooks (update-status or config-changed) wouldn't hurt, it seems redundant to me.
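As a rough illustration of that refactor, the controller-specific handler would observe leader_elected itself; class and attribute names below are assumptions, not the actual charm code:

```python
# Sketch of wiring controller-specific logic to leader_elected inside the
# controller's own event handler; names are illustrative assumptions.
from ops.charm import CharmBase, LeaderElectedEvent
from ops.framework import Object


class KRaftHandler(Object):
    """Event handler owning the KRaft/controller-specific hooks."""

    def __init__(self, charm: CharmBase):
        super().__init__(charm, key="kraft")
        self.charm = charm
        self.framework.observe(self.charm.on.leader_elected, self._leader_elected)

    def _leader_elected(self, event: LeaderElectedEvent) -> None:
        # React to a change in Juju leadership, e.g. re-advertise this unit as the
        # bootstrap controller in the peer relation data.
        if not self.charm.unit.is_leader():
            return
        ...
```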
    if not self.peer_cluster.bootstrap_controller:  # FIXME: peer_cluster or cluster?
        return Status.NO_BOOTSTRAP_CONTROLLER
suggestion: Do I understand correctly that the FIXME here is saying "Where should we look to determine if we have a bootstrap controller? The peer relation (cluster) or the 'large deployment' relation (peer_cluster)?"
If so, I think it should be a property, state.bootstrap_controller, and that property should look in both places.
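Something along these lines, for illustration; attribute and relation names are assumed from the surrounding discussion rather than taken from the actual codebase:

```python
# Sketch of a state-level property that resolves the bootstrap controller from either the
# peer relation ("cluster") or the large-deployment relation ("peer_cluster"); all names
# here are assumptions based on the review discussion.
class ClusterState:
    def __init__(self, cluster, peer_cluster):
        self.cluster = cluster            # peer relation data abstraction
        self.peer_cluster = peer_cluster  # 'large deployment' relation data abstraction

    @property
    def bootstrap_controller(self) -> str:
        """Bootstrap controller endpoint, whichever relation provides it."""
        return self.cluster.bootstrap_controller or self.peer_cluster.bootstrap_controller
```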