
perf(controllers): implement warm-pool assignment and async binding#497

Open
vicentefb wants to merge 3 commits into kubernetes-sigs:main from vicentefb:isolateKubeApiServer

Conversation

@vicentefb
Member

This PR re-architects how the SandboxClaimReconciler handles massive concurrency bursts (e.g., 300+ simultaneous claims). It shifts the controller from a polling model to an event-driven Watcher model, reducing CPU load and eliminating excess memory allocations.

  1. In-Memory Sandbox Assignment
  • During a burst, 1,000 concurrent workers were executing r.List() to scan the entire namespace for available pods. The resulting O(N) allocations per reconcile drove Go garbage-collection overhead to 36%+ of CPU.
  • The Fix: Built the WarmPoolAssigner to hook directly into the controller-runtime Informer cache. Ready Sandboxes are pushed into a buffered Go channel (chan types.NamespacedName). Workers now execute an O(1) channel pop, eliminating the O(N) memory allocations and full-namespace scans.
  2. Atomic-Like Handoffs & Mitigation of Sandbox Leaks
  • The old two-step assignment process was non-atomic. If a transient error occurred during the claim status update, sandboxes would permanently leak.
  • The Fix: The Go channel guarantees exclusive ownership of a Sandbox ID the moment it is popped. The adoption PATCH is now fired in an asynchronous background worker, and the claim UID is locked in an inFlightClaims sync.Map. This prevents double assignments and mitigates informer-lag race conditions.

args used:

        args:
        - --leader-elect=true
        - --extensions
        - --enable-pprof-debug
        - --kube-api-qps=600
        - --kube-api-burst=600
        - --sandbox-concurrent-workers=1000
        - --sandbox-claim-concurrent-workers=1000
        - --sandbox-warm-pool-concurrent-workers=20

A single burst of 300 SandboxClaims with a warm pool size of 600.

Agent Sandbox Claim Startup Latency (ms)

| Run | P50 (ms) | P90 (ms) | P99 (ms) |
| --- | --- | --- | --- |
| Run 1 | 1588.3 | 3178.1 | 4817.8 |
| Run 2 | 1174.9 | 2265.6 | 3221.8 |
| Run 3 | 1369.4 | 2394.9 | 4576.4 |
| Run 4 | 1577.9 | 3033.7 | 4803.3 |
| Run 5 | 1691.1 | 3507.73 | 4850.7 |
| Run 6 | 1744.20 | 3563.7 | 4925.2 |
| Run 7 | 1402.3 | 2297.93 | 2499.4 |
| Run 8 | 2036.2 | 4298.9 | 5200.0 |
| Run 9 | 1723.76 | 3754.2 | 4875.4 |
| Run 10 | 1391.6 | 2417.21 | 4630.76 |

Average Latencies

| Percentile | Average (ms) |
| --- | --- |
| P50 | 1569.97 |
| P90 | 3071.20 |
| P99 | 4440.08 |

CPU pprof baseline: (screenshot omitted)

CPU pprof this PR: (screenshot omitted)

@netlify

netlify bot commented Mar 31, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit bb5c2ac
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/69cd85fd9708be00080577e3

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 31, 2026
@vicentefb
Member Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Mar 31, 2026
@vicentefb vicentefb force-pushed the isolateKubeApiServer branch 2 times, most recently from c719809 to dd1c815 on March 31, 2026 at 21:09
}

if isReady {
templateName := sandbox.Labels["agents.x-k8s.io/sandbox-template-ref"]
Contributor

Use constant SandboxTemplateRefAnnotation

@aditya-shantanu
Contributor

/assign @barney-s

@aditya-shantanu
Contributor

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 31, 2026
@vicentefb vicentefb changed the title perf(controllers): implement warm-pool assignment, async binding, and API request reduction perf(controllers): implement warm-pool assignment and async binding Apr 1, 2026
@krzysied krzysied left a comment

IMO these 3 changes (assigner, async, and patches) shouldn't be in a single PR. They are mostly independent of each other and can be done as separate changes.
Can you split them into separate PRs?

w.Pools[hash] = ch

var sandboxes v1alpha1.SandboxList
if err := w.Client.List(ctx, &sandboxes, client.MatchingLabels{"agents.x-k8s.io/sandbox-template-ref-hash": hash}); err == nil {

Why do we need this?

}
}

if isReady {

In Go we try to omit unneeded nesting. Invert the if and return early.

}

isReady := false
for _, cond := range sandbox.Status.Conditions {

helper method, pls

ch, exists := w.Pools[templateHash]
w.mu.RUnlock()

if !exists {

Why do we drop the entries instead of adding a queue when needed?


if _, inFlight := w.InFlight.Load(sandbox.Name); !inFlight {
select {
case ch <- types.NamespacedName{Name: sandbox.Name, Namespace: sandbox.Namespace}:

IIUC this is going to drop the pod key when the capacity is reached. That will work well for small warm pools, but with warm pools above 1,000 it will create an issue.


// Must be Ready
isReady := false
for _, cond := range sb.Status.Conditions {

helper method


type WarmPoolAssigner struct {
client.Client
mu sync.RWMutex

nit: Let's keep the fields guarded by the mutex separated:

  • field1 w/o mutex
  • field2 w/o mutex
  • 1 line break
  • mutex
  • field3 w/ mutex

}

func (w *WarmPoolAssigner) GetOrCreatePool(ctx context.Context, hash string) chan types.NamespacedName {
w.mu.RLock()

helper method? It's easier to manage mutex this way

if err := r.Get(bgCtx, targetID, freshSandbox); err != nil {
logger.Error(err, "Async bind failed to fetch sandbox", "sandbox", targetID.Name)
r.inFlightClaims.Delete(owningClaim.UID)
r.Assigner.InFlight.Delete(targetID.Name)

The internal details of the implementation are spilled across the code. Can we create a separation layer? An interface:

  • Get()
  • Forget(sandbox)
  • Done(sandbox)
    etc.?

return
}

if _, inFlight := w.InFlight.Load(sandbox.Name); !inFlight {

It would be more coherent to keep the same key as in the queue.


if extensions {
// 1. Initialize the Assigner
assigner := &extensionscontrollers.WarmPoolAssigner{

Please add NewWarmPoolAssigner(client) method. Pools is internal field - can be initialized with a constructor method

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aditya-shantanu, vicentefb
Once this PR has been reviewed and has the lgtm label, please ask for approval from barney-s. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2026
@vicentefb vicentefb force-pushed the isolateKubeApiServer branch from a254f15 to 0db12d0 on April 1, 2026 at 20:53
…oxes with channel pop

nit

fix

fix

nit

nit

nit

nit

clean up

clean up comments

nit

lint

fix
@vicentefb vicentefb force-pushed the isolateKubeApiServer branch from 0db12d0 to bb5c2ac on April 1, 2026 at 20:54
@k8s-ci-robot
Contributor

@vicentefb: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| presubmit-agent-sandbox-lint-go | bb5c2ac | link | true | /test presubmit-agent-sandbox-lint-go |
| presubmit-agent-sandbox-e2e-test | bb5c2ac | link | true | /test presubmit-agent-sandbox-e2e-test |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@justinsb
Contributor

justinsb commented Apr 2, 2026

/unassign
/assign

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 2, 2026
@k8s-ci-robot
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@vicentefb
Member Author

This PR will be divided into multiple smaller PRs

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 2, 2026

Labels

- `cncf-cla: yes`: Indicates the PR's author has signed the CNCF CLA.
- `do-not-merge/hold`: Indicates that a PR should not merge because someone has issued a /hold command.
- `needs-rebase`: Indicates a PR cannot be merged because it has merge conflicts with HEAD.
- `next-step:contributor`
- `ok-to-test`: Indicates a non-member PR verified by an org member that is safe to test.
- `priority/critical-urgent`: Highest priority. Must be actively worked on as someone's top priority right now.
- `size/XL`: Denotes a PR that changes 500-999 lines, ignoring generated files.

6 participants