[DPE-5232;DPE-5233] feat: support for scaling operations in KRaft mode (single & multi-app) #281

imanenami · 2024-12-02T09:55:42Z

Changes

Upgrade to Kafka 3.9.0
Switch from KRaft static quorum to dynamic quorum
Add scaling operations on single & multi-app charm

src/core/models.py

src/events/broker.py

src/managers/config.py

deusebio

From a high level, the code looks good to me. I really appreciate the introduction of managers and handlers for the Controller, which I believe improves the structure of the code.

The business logic seems reasonable to me, although I'm not an expert on KRaft, so I'd defer this to others. If @zmraul has bandwidth, it would also be good to get his review of the more KRaft-specific business logic and concepts, unless both Marc and Alex feel comfortable with this.

Just a few small points here and there. Great work @imanenami!

src/managers/controller.py

src/events/broker.py

src/events/controller.py

src/literals.py

zmraul

Amazing work! I'm surprised how simple the implementation feels compared to the complexity of the task at hand, great work :)

I left a couple of TODOs and some points that should be considered in the future.

zmraul · 2024-12-20T11:15:48Z

src/core/cluster.py

@@ -226,6 +232,13 @@ def unit_broker(self) -> KafkaBroker:
            substrate=self.substrate,
        )

+    @property
+    def kraft_unit_id(self) -> int:


todo: this needs to be called bootstrap_unit_id to be consistent with models, no?

Thanks, I prefer to keep it this way because bootstrap_unit_id is the KRaft unit id of our assumed leader in the whole cluster.

src/core/cluster.py

zmraul · 2024-12-20T11:29:33Z

src/core/models.py

+        return self.relation_data.get("directory-id", "")
+
+    @property
+    def added_to_quorum(self) -> bool:


A couple of important points here. They're not blocking, but I think they'd make good future improvements.

Logic around added_to_quorum should probably be checked against the cluster itself and tracked as little as possible in the charm.

Something like an upgrade or a failure on the unit will not be correctly resolved.
At the moment the property is never unset: it stays true once the unit is first added to the quorum, so a failing unit would look at the databag and see added_to_quorum=True when in reality the unit is in a recovery state.

Even if we unset the property, a failing unit is still problematic. Chances are remove_from_quorum was never called before the failure, which still leads to a wrong recovery state on the unit. This is why I'm suggesting that added_to_quorum generally be a live check against Kafka itself.

These are great and valid points. For the time being, I've fixed the remove_from_quorum part. Adding live checks against the active quorum could be the next logical step.
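For illustration, the "live check" idea could look roughly like the sketch below. It is not the charm's implementation. It assumes the text output of `kafka-metadata-quorum.sh describe --replication` has already been captured, and that it is a whitespace-separated table with a header row containing `NodeId` and `Status` columns. The sample output, the function names, and the choice to treat only `Leader` and `Follower` as quorum members are assumptions.

```python
def parse_quorum_replication(output: str) -> dict[int, str]:
    """Map node id -> status from tabular describe output (format is assumed)."""
    rows = [line.split() for line in output.strip().splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    # Look columns up by header name so extra columns do not break parsing.
    id_col = header.index("NodeId")
    status_col = header.index("Status")
    return {int(row[id_col]): row[status_col] for row in body}


def is_in_quorum(output: str, node_id: int) -> bool:
    """Return True if the node is reported as a voter (assumed: Leader or Follower)."""
    return parse_quorum_replication(output).get(node_id) in {"Leader", "Follower"}


# Hypothetical sample output, for illustration only.
SAMPLE = """
NodeId  DirectoryId             LogEndOffset  Lag  LastFetchTimestamp  LastCaughtUpTimestamp  Status
0       AAAAAAAAAAAAAAAAAAAAAA  1200          0    1736150000000       1736150000000          Leader
1       BBBBBBBBBBBBBBBBBBBBBB  1200          0    1736150000000       1736150000000          Follower
100     CCCCCCCCCCCCCCCCCCCCCC  1180          20   1736150000000       1736149990000          Observer
"""

in_quorum = is_in_quorum(SAMPLE, 1)
observer_only = is_in_quorum(SAMPLE, 100)
```

With a check like this, the databag flag would become at most a cache, and a unit recovering from a failure would read its membership from the cluster rather than from stale relation data.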

src/events/controller.py

src/events/broker.py

zmraul · 2024-12-20T11:56:15Z

src/events/controller.py

+            initial_controllers=f"{self.charm.state.peer_cluster.bootstrap_unit_id}@{self.charm.state.peer_cluster.bootstrap_controller}:{self.charm.state.peer_cluster.bootstrap_replica_id}",
+        )
+
+    def _leader_elected(self, event: LeaderElectedEvent) -> None:


todo: I'm a bit itchy when seeing/using the leader_elected event. Following up on @marcoppenheimer's comment, I would rather have this logic happen in update_status or config_changed (since update_status calls config_changed).

afaik, there are no guarantees that leader_elected ever triggers after the first deployment.

Triggers of leader_elected tend to happen in the context of the controller failing to see a unit, which can mean unit failure, so having this code close to failure states is not the best :)

Thanks for the comment. I had a discussion with @marcoppenheimer and we agreed to keep it on the leader_elected hook after refactoring it into the controller's own event handler. Basically, it's something we want to change when Juju leadership changes. While I agree that checking it on other hooks (update-status or config-changed) wouldn't hurt, it seems redundant to me.
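One way to make the choice of hook less critical is to keep the handler idempotent, so that running it again from update-status or config-changed is harmless. A minimal sketch, assuming a plain dict stands in for the relation databag; the function name `sync_bootstrap_data`, the databag keys, and the sample values are hypothetical and not taken from the charm:

```python
def sync_bootstrap_data(databag: dict[str, str], desired: dict[str, str]) -> bool:
    """Write only the keys whose values differ; return True if anything changed."""
    changed = {key: value for key, value in desired.items() if databag.get(key) != value}
    databag.update(changed)
    return bool(changed)


# Hypothetical values describing the current bootstrap controller.
desired = {
    "bootstrap-unit-id": "0",
    "bootstrap-controller": "10.0.0.1:9097",
    "bootstrap-replica-id": "AAAAAAAAAAAAAAAAAAAAAA",
}

databag: dict[str, str] = {}
first_run = sync_bootstrap_data(databag, desired)   # writes all three keys
second_run = sync_bootstrap_data(databag, desired)  # nothing left to write
```

Because a repeated call with the same desired state changes nothing, the same function could be invoked from leader_elected and from a periodic hook without triggering extra relation-changed events.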

marcoppenheimer · 2025-01-07T18:41:59Z

src/core/cluster.py

+            if not self.peer_cluster.bootstrap_controller:  # FIXME: peer_cluster or cluster?
+                return Status.NO_BOOTSTRAP_CONTROLLER


suggestion: Do I understand correctly that the FIXME here is saying "Where should we look to determine if we have a bootstrap-controller? Peer relation (cluster) or the 'large deployment' relation (peer_cluster)?"

If so, I think it should be a property in state.bootstrap_controller, and that property looks in both places.
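A minimal sketch of that suggestion, under stated assumptions: `RelationState` and `ClusterState` are hypothetical stand-ins for the charm's state objects, and preferring the peer relation (cluster) over the large-deployment relation (peer_cluster) is an arbitrary choice made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RelationState:
    """Hypothetical stand-in for a relation-backed state object."""

    bootstrap_controller: str = ""


@dataclass
class ClusterState:
    """Hypothetical stand-in for the charm's top-level state."""

    cluster: RelationState = field(default_factory=RelationState)
    peer_cluster: RelationState = field(default_factory=RelationState)

    @property
    def bootstrap_controller(self) -> str:
        """Look in both relations; the lookup order here is an assumption."""
        return self.cluster.bootstrap_controller or self.peer_cluster.bootstrap_controller


single_app = ClusterState(cluster=RelationState("10.0.0.1:9097"))
multi_app = ClusterState(peer_cluster=RelationState("10.0.0.2:9097"))
unset = ClusterState()
```

Callers such as the status check would then test `state.bootstrap_controller` without needing to know which relation supplied the value.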

src/events/broker.py

imanenami force-pushed the wip-kafka-3-9 branch 2 times, most recently from 9ccaf8a to 40bf7bf Compare December 9, 2024 05:54

imanenami marked this pull request as ready for review December 12, 2024 06:52

imanenami force-pushed the wip-kafka-3-9 branch 5 times, most recently from 5465d74 to 0f35c71 Compare December 13, 2024 18:54

imanenami changed the title ~~WIP: upgrade to kafka 3.9~~ [DPE-5232;DPE-5233] feat: support for scaling operations in KRaft mode (single & multi-app) Dec 17, 2024

imanenami requested review from marcoppenheimer and Batalex December 17, 2024 10:05

marcoppenheimer reviewed Dec 18, 2024

imanenami requested review from marcoppenheimer and deusebio December 18, 2024 10:39

deusebio approved these changes Dec 18, 2024

imanenami requested a review from zmraul December 19, 2024 06:13

imanenami linked an issue Dec 19, 2024 that may be closed by this pull request

Client relation-changed and relation-broken fails with ACL error in Kraft mode #285

Open

zmraul requested changes Dec 20, 2024

zmraul reviewed Dec 20, 2024

imanenami force-pushed the wip-kafka-3-9 branch 3 times, most recently from b93c564 to 7c449b7 Compare January 6, 2025 09:42

Iman Enami added 8 commits January 6, 2025 13:44

feat: apply snap patch

2435640

skip rack awareness integration test

e4009cb

fix: snap file name

29242b1

fix: bump kafka dependency version

448584e

fix: test_inter_broker_protocol_version logic

73e0097

fix: use internal address instead of 0.0.0.0 for controller listener

af9cc35

[DPE-5232] feat: scaling operation in KRaft mode

16a56c5

rebase with SCRAM auth changes

c55e02d

Iman Enami added 10 commits January 6, 2025 13:44

add wait to test_scale_in & some tweaks

5b85d12

fix: race condition issue

1daf1f0

improve test_kraft

ca0f50c

revert hackish changes & update snap revision to 48

ff90a3c

add controller manager and event handler

397a8d5

some fixes

aea8707

don't emit restart on leader_elected

8d6323f

cast scope

5c1662e

apply fixes from Enrico's review

bbc47fe

remove controller-quorum-uris from codebase & apply Raul's comments

d9235d4

imanenami force-pushed the wip-kafka-3-9 branch from 7c449b7 to 378a867 Compare January 6, 2025 09:45

imanenami requested a review from zmraul January 6, 2025 11:35

imanenami force-pushed the wip-kafka-3-9 branch from 378a867 to 36eb160 Compare January 6, 2025 11:51

fix CI

eb186f1

imanenami force-pushed the wip-kafka-3-9 branch from 36eb160 to eb186f1 Compare January 7, 2025 06:55

marcoppenheimer approved these changes Jan 8, 2025
