perf(controllers): implement warm-pool assignment and async binding #497
vicentefb wants to merge 3 commits into kubernetes-sigs:main
Conversation
/ok-to-test
```go
}

if isReady {
	templateName := sandbox.Labels["agents.x-k8s.io/sandbox-template-ref"]
```

Use the `SandboxTemplateRefAnnotation` constant.
/assign @barney-s

/priority critical-urgent
krzysied left a comment:

IMO these 3 changes (assigner, async binding, and patches) shouldn't be in a single PR. They are mostly independent of each other and can be made as separate changes. Can you split them into separate PRs?
```go
w.Pools[hash] = ch

var sandboxes v1alpha1.SandboxList
if err := w.Client.List(ctx, &sandboxes, client.MatchingLabels{"agents.x-k8s.io/sandbox-template-ref-hash": hash}); err == nil {
```
```go
	}
}

if isReady {
```

In Go we try to omit unneeded nesting. Invert the if and return early.
```go
}

isReady := false
for _, cond := range sandbox.Status.Conditions {
```
```go
ch, exists := w.Pools[templateHash]
w.mu.RUnlock()

if !exists {
```

Why do we drop the entries instead of creating the queue when needed?
```go
if _, inFlight := w.InFlight.Load(sandbox.Name); !inFlight {
	select {
	case ch <- types.NamespacedName{Name: sandbox.Name, Namespace: sandbox.Namespace}:
```

IIUC this is going to drop the pod key when the capacity is reached. That will work well for small warm pools, but with warm pools above 1000 it will create an issue.
```go
// Must be Ready
isReady := false
for _, cond := range sb.Status.Conditions {
```
```go
type WarmPoolAssigner struct {
	client.Client
	mu sync.RWMutex
```

nit: let's keep the fields guarded by the mutex separated:
- field1 w/o mutex
- field2 w/o mutex
- 1 line break
- mutex
- field3 w/ mutex
```go
}

func (w *WarmPoolAssigner) GetOrCreatePool(ctx context.Context, hash string) chan types.NamespacedName {
	w.mu.RLock()
```

Helper method? It's easier to manage the mutex this way.
```go
if err := r.Get(bgCtx, targetID, freshSandbox); err != nil {
	logger.Error(err, "Async bind failed to fetch sandbox", "sandbox", targetID.Name)
	r.inFlightClaims.Delete(owningClaim.UID)
	r.Assigner.InFlight.Delete(targetID.Name)
```

The internal details of the implementation are spilled across the code. Can we create a separation layer? An interface:
- Get()
- Forget(sandbox)
- Done(sandbox)
etc.?
```go
	return
}

if _, inFlight := w.InFlight.Load(sandbox.Name); !inFlight {
```

It will be more coherent to keep the same key as in the queue.
```go
if extensions {
	// 1. Initialize the Assigner
	assigner := &extensionscontrollers.WarmPoolAssigner{
```

Please add a NewWarmPoolAssigner(client) method. Pools is an internal field and can be initialized with a constructor method.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: aditya-shantanu, vicentefb. Needs approval from an approver for each of the affected files.
@vicentefb: The following tests failed. See the full PR test history and your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
/unassign
PR needs rebase.
This PR will be divided into multiple smaller PRs.

/hold
This PR re-architects how the SandboxClaimReconciler handles massive concurrency bursts (e.g., 300+ simultaneous claims). It shifts the controller from a polling model to an event-driven watcher model, reducing CPU load and eliminating memory leaks. Previously, the controller used r.List() to scan the entire namespace for available pods; this caused O(N) memory churn, forcing the Go garbage collector to consume 36%+ of the CPU.

Args used: a single burst of 300 SandboxClaims with a warm pool size of 600.
Agent Sandbox Claim Startup Latency (ms): [chart]
Average Latencies: [table]
CPU pprof baseline: [profile screenshot]
CPU pprof this PR: [profile screenshot]