
optimization: replace .Update() with .Patch() for sandbox updateStatus #509

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main from vicentefb:patchSandboxStatus
Apr 3, 2026

Conversation

@vicentefb
Member

@vicentefb vicentefb commented Apr 2, 2026

In an effort to reduce "Operation cannot be fulfilled..." conflicts at scale, this PR switches from updating to patching the status of the Sandbox resource.

Tests from main without this change show:

526 "Operation cannot be fulfilled" conflicts from sandboxclaim (protoPayload.resourceName="pods/sandboxclaim-" OR protoPayload.resourceName="sandboxclaims/")

With this change, the count decreased to ~16 conflicts.

Test parameters:

# BURST_SIZE * TOTAL_BURSTS = Total sandbox claims created
BURST_SIZE=300
QPS=300
TOTAL_BURSTS=5
WARMPOOL_SIZE=600
RUNTIME_CLASS="" # Change to "gvisor" if your cluster supports it

Deployment args:

        args:
        - "--leader-elect=true"
        - "--extensions"
        - "--enable-tracing=true"
        - --zap-log-level=debug
        - --zap-encoder=json
        - --enable-pprof-debug
        - --kube-api-qps=1000
        - --kube-api-burst=2000
        - --sandbox-concurrent-workers=400
        - --sandbox-claim-concurrent-workers=400
        - --sandbox-warm-pool-concurrent-workers=1

@k8s-ci-robot k8s-ci-robot requested review from barney-s and soltysh April 2, 2026 19:11
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 2, 2026
@netlify

netlify bot commented Apr 2, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit 3a8fbf2
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/69d02ff405e56c00083921cc

@justinsb
Contributor

justinsb commented Apr 2, 2026

Duplicate of #508 or different?

@justinsb
Contributor

justinsb commented Apr 2, 2026

Never mind - different controllers!

@justinsb justinsb self-assigned this Apr 2, 2026
@justinsb
Contributor

justinsb commented Apr 2, 2026

If there's a conflict here, does that mean a different controller is also updating Sandbox Status?

@aditya-shantanu
Contributor

> If there's a conflict here, does that mean a different controller is also updating Sandbox Status?

I think the situation here is two claim requests trying to claim the same sandbox.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 3, 2026
@vicentefb
Member Author

vicentefb commented Apr 3, 2026

Hello @justinsb @aditya-shantanu

I ran two very small tests by having 1 claim with a warmpool of size 2.

BURST_SIZE=1
QPS=1
TOTAL_BURSTS=10
WARMPOOL_SIZE=2

This is the timeline of API calls from OSS main (no optimizations).

  • At step 4 the claim controller successfully updates the status, but at step 8 it tries again and hits a 409 Conflict.
  • The core sandbox controller hits 409 Conflicts on two different sandboxes: steps 14 and 16 on ngxh4, and step 21 on zdhtk. This suggests that r.Status().Update() cannot keep up with the state changes during standard lifecycle events.
  • Because of the conflicts, it takes roughly 160 ms for a single claim to fully settle its own status.

Also, because r.Update() causes the API server to return 409 Conflicts, the reconcile loop is forced to restart from the top five separate times. Every time it restarts, it fires a DELETE request for a NetworkPolicy that doesn't exist, resulting in five wasted 404 Not Found API calls (this may be a separate issue, possibly fixable by checking the cache first).

| Step | Delta (ms) | Method | Target Resource | Result / Notes |
|---|---|---|---|---|
| 1 | 0 | PATCH | sandboxclaims/agent-claim-3 | 🟢 Start: Test Runner injects Claim 3 |
| 2 | 19 | UPDATE | sandboxes/warmpool-0-ngxh4 | 🟢 Claim Controller adopts Sandbox |
| 3 | 31 | DELETE | networkpolicies/agent-claim-3-network-policy | 🟡 404 Not Found |
| 4 | 47 | UPDATE | sandboxclaims/agent-claim-3/status | 🟢 Claim Controller updates status |
| 5 | 47 | CREATE | sandboxes | 🟢 WarmPool notices missing pod, orders replacement |
| 6 | 57 | DELETE | networkpolicies/agent-claim-3-network-policy | 🟡 404 Not Found (Reconcile retry) |
| 7 | 65 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |
| 8 | 68 | UPDATE | sandboxclaims/agent-claim-3/status | 🔴 409 Conflict: Claim controller collides on status update. |
| 9 | 76 | PATCH | sandboxes/warmpool-0-zdhtk | 🟢 WarmPool configures replacement sandbox |
| 10 | 76 | UPDATE | pods/warmpool-0-ngxh4 | 🟢 Sandbox Controller updates adopted pod |
| 11 | 87 | DELETE | networkpolicies/agent-claim-3-network-policy | 🟡 404 Not Found (Reconcile retry) |
| 12 | 96 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |
| 13 | 97 | UPDATE | sandboxes/warmpool-0-ngxh4/status | 🟢 Sandbox Controller updates status of adopted pod |
| 14 | 112 | UPDATE | sandboxes/warmpool-0-ngxh4/status | 🔴 409 Conflict: Sandbox controller collides on status. |
| 15 | 114 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |
| 16 | 134 | UPDATE | sandboxes/warmpool-0-ngxh4/status | 🔴 409 Conflict: Sandbox controller collides again. |
| 17 | 135 | CREATE | pods/warmpool-0-zdhtk | 🟢 Sandbox Controller provisions physical replacement pod |
| 18 | 141 | DELETE | networkpolicies/agent-claim-3-network-policy | 🟡 404 Not Found (Reconcile retry) |
| 19 | 157 | UPDATE | sandboxes/warmpool-0-zdhtk/status | 🟢 Sandbox Controller updates new pod status |
| 20 | 159 | UPDATE | sandboxclaims/agent-claim-3/status | 🟢 Retry Succeeds: Claim status finally resolves. |
| 21 | 176 | UPDATE | sandboxes/warmpool-0-zdhtk/status | 🔴 409 Conflict: Sandbox controller collides on replacement pod status. |
| 22 | 187 | DELETE | networkpolicies/agent-claim-3-network-policy | 🟡 404 Not Found (Reconcile retry) |

On the other hand, with .Patch() for the sandbox status, we see the following behavior:

| Step | Delta (ms) | Method | Target Resource | Result / Notes |
|---|---|---|---|---|
| 1 | 0 | PATCH | sandboxclaims/agent-claim-5 | 🟢 Start: Test Runner injects Claim 5 |
| 2 | 20 | UPDATE | sandboxes/warmpool-0-5vghw | 🟢 Claim Controller adopts Sandbox |
| 3 | 30 | DELETE | networkpolicies/agent-claim-5-network-policy | 🟡 404 Not Found (Only fires once!) |
| 4 | 48 | CREATE | sandboxes | 🟢 WarmPool orders replacement |
| 5 | 49 | UPDATE | sandboxclaims/agent-claim-5/status | 🟢 Claim Controller updates status |
| 6 | 58 | DELETE | networkpolicies/agent-claim-5-network-policy | 🟡 404 Not Found |
| 7 | 66 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |
| 8 | 68 | UPDATE | sandboxclaims/agent-claim-5/status | 🔴 409 Conflict! (Claim Controller still unoptimized) |
| 9 | 76 | PATCH | sandboxes/warmpool-0-r5c52 | 🟢 WarmPool configures replacement sandbox |
| 10 | 78 | DELETE | networkpolicies/agent-claim-5-network-policy | 🟡 404 Not Found |
| 11 | 89 | DELETE | networkpolicies/agent-claim-5-network-policy | 🟡 404 Not Found |
| 12 | 92 | PATCH | sandboxes/warmpool-0-5vghw/status | 🟢 Success: Sandbox Status Patched w/ no conflict! |
| 13 | 96 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |
| 14 | 109 | PATCH | sandboxes/warmpool-0-5vghw/status | 🟢 Success: Sandbox Status Patched w/ no conflict! |
| 15 | 112 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |
| 16 | 113 | DELETE | networkpolicies/agent-claim-5-network-policy | 🟡 404 Not Found |
| 17 | 118 | CREATE | pods/warmpool-0-r5c52 | 🟢 Sandbox Controller provisions physical replacement pod |
| 18 | 129 | UPDATE | sandboxclaims/agent-claim-5/status | 🟢 Retry Succeeds: Claim status resolves. |
| 19 | 141 | CREATE | services/warmpool-0-r5c52 | 🟢 Service provisioned for new pod |
| 20 | 1091 | PATCH | sandboxes/warmpool-0-r5c52/status | 🟢 Success: Replacement Sandbox Status Patched! |
| 21 | 1120 | PATCH | sandboxwarmpools/warmpool-0/status | 🟢 WarmPool updates pool status |

So technically, two controllers are acting on the exact same resource, but on different sub-domains of it: the claim controller owns the metadata, and the sandbox controller owns the status. The 409 Conflicts weren't caused by a logic collision; they were caused by r.Update() writing the entire object. By moving to r.Patch(), we allow the API server to safely merge metadata updates and status updates concurrently.

Also, at scale, if a worker thread fails an Update due to a 409 Conflict, it puts the sandbox back in the queue and says, "Wait 5 milliseconds before trying again." If it hits another 409, it says, "Wait 10ms." Then 20ms, 40ms, 80ms, and so on.

@vicentefb vicentefb force-pushed the patchSandboxStatus branch from dfff39d to 3a8fbf2 Compare April 3, 2026 21:24
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 3, 2026
@barney-s
Contributor

barney-s commented Apr 3, 2026

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 3, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: barney-s, vicentefb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 3, 2026
@k8s-ci-robot k8s-ci-robot merged commit 953032b into kubernetes-sigs:main Apr 3, 2026
10 checks passed
@barney-s
Contributor

barney-s commented Apr 3, 2026

@vicentefb do we see conflicts in status update behavior with a single resource (not scale testing)?


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.


5 participants