✨ Implement warm replica support for controllers #3192

Open
wants to merge 52 commits into
base: main

Conversation

godwinpang
Contributor

@godwinpang godwinpang commented Apr 9, 2025

This change implements the warm replicas proposal from #3121.
It adds an EnableWarmup option that lets controllers start as warm replicas, meaning their sources are started before the instance has won leader election.

It also adds a new internal runnable interface, warmupRunnable, with a single Warmup method that is called before leader election. There is no guarantee that Warmup returns before leader election is won, only that it is called.

The controller's implementation of Warmup starts the sources through a thread-safe method, startEventSourcesAndQueueLocked, which also adds events to the queue. In the case of a non-leader elected controller, Warmup is intended to race with Start to start the sources. The two methods are synchronized by the didStartEventSourcesOnce sync.Once used in startEventSourcesAndQueueLocked.

For the most part, Warmup behaves exactly the same as Start, with the following differences:

  • Warmup can be called any number of times, whereas Start can only be called once.
  • Warmup doesn't initialize metrics for the controller, nor does it start the worker goroutines.

Shutdown

Warmup runnables are shut down in a separate goroutine from the one that stops the leader election runnables, because there is no guarantee as to whether the Warmup or the Start method is holding the lock for the didStartEventSourcesOnce sync.Once instance.

Testing

On top of the unit and integration tests, this PR was tested with an internal project that ran this branch of controller-runtime for a week with EnableWarmup=true, and verified that the controllers on the non-leader elected replica had a populated workqueue.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 9, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 9, 2025
@k8s-ci-robot
Contributor

Hi @godwinpang. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@godwinpang godwinpang marked this pull request as draft April 9, 2025 07:12
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 9, 2025
@sbueringer
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2025
@godwinpang
Contributor Author

/retest

@godwinpang
Contributor Author

/retest

@godwinpang
Contributor Author

/retest

@godwinpang
Contributor Author

/retest

@godwinpang
Contributor Author

/retest

@godwinpang godwinpang force-pushed the warm-replica-impl branch from e05677a to 1d07efc Compare May 2, 2025 05:56
@godwinpang
Contributor Author

/retest

@godwinpang
Contributor Author

/retest

@godwinpang godwinpang force-pushed the warm-replica-impl branch from 1c32d37 to 7cc29dc Compare May 2, 2025 06:26
@alvaroaleman
Member

@godwinpang any plans on wrapping this up?

@godwinpang
Contributor Author

@godwinpang any plans on wrapping this up?

Sorry, I missed this GitHub notification. I've updated the PR description and also duplicated the tests between Start and Warmup as appropriate.

Member

@alvaroaleman alvaroaleman left a comment

thanks!
/hold

in case Stefan has anything to add

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 781742318cfc9c5c72173ae51ebe9fb41673044d

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, godwinpang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2025
@alvaroaleman alvaroaleman changed the title from "✨ [Warm Replicas] Implement warm replica support for controllers." to "✨ Implement warm replica support for controllers" Jun 26, 2025
@godwinpang godwinpang requested a review from sbueringer June 26, 2025 23:14
@sbueringer
Member

@godwinpang Thank you very much! We're almost there (only one idea for simplification in the prod code, otherwise just a few comments on tests)

@sbueringer
Member

@godwinpang would be nice if we could get this PR done 😃

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 23, 2025
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@godwinpang
Contributor Author

@godwinpang would be nice if we could get this PR done 😃

Sorry for the long wait, I've addressed the PR comments

@godwinpang godwinpang requested a review from sbueringer July 23, 2025 14:00
Member

@sbueringer sbueringer left a comment

Thx, just some very minor findings now.

@alvaroaleman Can you please also take another look?

Comment on lines +340 to +341
q := c.Queue
if err := watch.Start(ctx, q); err != nil {
Member

Suggested change
- q := c.Queue
- if err := watch.Start(ctx, q); err != nil {
+ if err := watch.Start(ctx, c.Queue); err != nil {

By("Unblocking leader election")
resourceLockWithHooks.UnblockLeaderElection()

By("Waiting for the leader election runnable to be executed without leader election being won")
Member

Suggested change
- By("Waiting for the leader election runnable to be executed without leader election being won")
+ By("Waiting for the leader election runnable to be executed after leader election was won")


// BlockLeaderElection blocks the leader election process when called. It will not be unblocked
// until UnblockLeaderElection is called.
// Not thread safe.
Member

nit: I'm not sure I follow what makes this not thread safe (same in l.46)

Or maybe the comment is outdated?

Contributor Author

outdated, sorry!
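
For reference, a block/unblock helper of this kind can be made safe for concurrent use with a mutex-guarded channel. The sketch below is a hypothetical `electionGate`, not the test lock from this PR; the method names and fields are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// electionGate is a hypothetical helper, not the PR's test lock.
type electionGate struct {
	mu      sync.Mutex
	blocked chan struct{} // non-nil while leader election is blocked
}

// Block makes subsequent Wait calls block until Unblock is called.
func (g *electionGate) Block() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.blocked == nil {
		g.blocked = make(chan struct{})
	}
}

// Unblock releases every waiter. Calling it while not blocked is a no-op.
func (g *electionGate) Unblock() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.blocked != nil {
		close(g.blocked)
		g.blocked = nil
	}
}

// Wait returns immediately when not blocked, otherwise it waits for Unblock.
func (g *electionGate) Wait() {
	g.mu.Lock()
	ch := g.blocked
	g.mu.Unlock()
	if ch != nil {
		<-ch
	}
}

func main() {
	g := &electionGate{}
	g.Block()

	acquired := make(chan struct{})
	go func() {
		g.Wait() // stands in for acquiring the leader election lock
		close(acquired)
	}()

	select {
	case <-acquired:
		fmt.Println("acquired too early")
	default:
		fmt.Println("still blocked") // always taken: Wait cannot return before Unblock
	}

	g.Unblock()
	<-acquired
	fmt.Println("acquired")
}
```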

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
5 participants