Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DPE-5232;DPE-5233] feat: support for scaling operations in KRaft mode (single & multi-app) #281

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

imanenami
Copy link
Contributor

@imanenami imanenami commented Dec 2, 2024

Changes

  • Upgrade to kafka 3.9.0
  • Switch from KRaft static quorum to dynamic quorum
  • Add scaling operations on single & multi-app charm

@imanenami imanenami force-pushed the wip-kafka-3-9 branch 2 times, most recently from 9ccaf8a to 40bf7bf Compare December 9, 2024 05:54
@imanenami imanenami marked this pull request as ready for review December 12, 2024 06:52
@imanenami imanenami force-pushed the wip-kafka-3-9 branch 5 times, most recently from 5465d74 to 0f35c71 Compare December 13, 2024 18:54
@imanenami imanenami changed the title WIP: upgrade to kafka 3.9 [DPE-5232;DPE-5233] feat: support for scaling operations in KRaft mode (single & multi-app) Dec 17, 2024
src/core/models.py Outdated Show resolved Hide resolved
src/core/models.py Show resolved Hide resolved
src/core/models.py Outdated Show resolved Hide resolved
src/core/models.py Show resolved Hide resolved
src/events/broker.py Outdated Show resolved Hide resolved
src/events/broker.py Outdated Show resolved Hide resolved
src/events/broker.py Outdated Show resolved Hide resolved
src/events/broker.py Outdated Show resolved Hide resolved
src/events/broker.py Outdated Show resolved Hide resolved
src/managers/config.py Outdated Show resolved Hide resolved
Copy link
Contributor

@deusebio deusebio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From high-level, the code looks good to me. I have really appreciated the introduction of managers and handlers for the Controller, which I believe it is improving the structure of the code.

The business logic seems reasonable to me although I'm not a big expert on using Kraft, so I'd defer this to others. Probably if @zmraul has bandwidth, it could also be good to get his thoughts/review for the more Kraft specific business logic and concepts, unless both Marc and Alex feel comfortable with this.

Just a few small points here and there. Great work @imanenami !

I

src/managers/controller.py Outdated Show resolved Hide resolved
src/events/broker.py Show resolved Hide resolved
src/events/controller.py Show resolved Hide resolved
src/literals.py Outdated Show resolved Hide resolved
Copy link
Contributor

@zmraul zmraul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work! I'm surprised how simple the implementation feels compared to the complexity of the task at hand, great work :)

I left a couple of TODOs, and some points that should be considered on the future.

@@ -226,6 +232,13 @@ def unit_broker(self) -> KafkaBroker:
substrate=self.substrate,
)

@property
def kraft_unit_id(self) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: this needs to be called bootstrap_unit_id to be consistent with models no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I prefer to keep it this way because bootstrap_unit_id is the KRaft unit id of our assumed leader in the whole cluster.

src/core/cluster.py Show resolved Hide resolved
return self.relation_data.get("directory-id", "")

@property
def added_to_quorum(self) -> bool:
Copy link
Contributor

@zmraul zmraul Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of important points here, not really blocking but I think they are good for future improvements

  • Logic around added_to_quorum should probably be checked against the cluster itself, and be tracked as little as possible on the charm.
  • Something like an upgrade or a failure on the unit will not be correctly resolved.
    At the moment there is no unsetting of this property, it will stay as true once the unit is first added to the quorum, so a failing unit would look at the databag, see added_to_quorum=True when the reality is that the unit is on a recovery state.

Even if we unset the property, a failing unit still is problematic. Chances are remove_from_quorum was never called before the failure, thus still leading to a wrong recovery state on the unit. This is why I'm suggesting to generally have added_to_quorum be a live check against Kafka itself

Copy link
Contributor Author

@imanenami imanenami Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are great and valid points. For the time being, I've fixed the remove_from_quorum part. Adding live checks against active quorum could be the next logical step.

src/events/controller.py Outdated Show resolved Hide resolved
src/events/controller.py Show resolved Hide resolved
src/events/broker.py Outdated Show resolved Hide resolved
initial_controllers=f"{self.charm.state.peer_cluster.bootstrap_unit_id}@{self.charm.state.peer_cluster.bootstrap_controller}:{self.charm.state.peer_cluster.bootstrap_replica_id}",
)

def _leader_elected(self, event: LeaderElectedEvent) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: I'm a bit itchy when seeing/using leader_elected event. Following up @marcoppenheimer comment I would rather have this logic happen in update_status or config_changed (since update status calls config_changed).

  • afaik, there are no guarantees that leader_elected triggers ever after the fist deployment
  • Triggers of leader_elected tend to happen in the context of controller failing to see a unit, which can mean unit failure, so this code being close to failure states is not the best :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment. I had a discussion with @marcoppenheimer and agreed to keep it on leader_elected hook after refactoring to controller's own event handler. Basically, it's something we want to change when changes in juju leadership happen. While I agree that checking it on other hooks (update-status or config-changed) won't be hurtful, it seems redundant to me.

@imanenami imanenami force-pushed the wip-kafka-3-9 branch 3 times, most recently from b93c564 to 7c449b7 Compare January 6, 2025 09:42
Comment on lines +547 to +548
if not self.peer_cluster.bootstrap_controller: # FIXME: peer_cluster or cluster?
return Status.NO_BOOTSTRAP_CONTROLLER
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Do I understand correctly that the FIXME here is saying "Where should we look to determine if we have a bootstrap-controller? Peer relation (cluster) or the 'large deployment' relation (peer_cluster)?"

If so, I think it should be a property in state.bootstrap_controller, and that property looks in both places.

src/events/broker.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Client relation-changed and relation-broken fails with ACL error in Kraft mode
4 participants