Add ControllerStartupLatency metric for SandboxClaims #522

Open: igooch wants to merge 1 commit into kubernetes-sigs:main from igooch:sandboxclaim-controller-latency

igooch (Contributor) commented Apr 4, 2026

This PR introduces a new metric, agent_sandbox_claim_controller_startup_latency_ms, to provide higher precision tracking of SandboxClaim startup performance.

Problem

Currently, startup latency is measured using the standard Kubernetes creationTimestamp. However, this timestamp has one-second granularity. For fast-provisioning resources like SandboxClaims, where target latencies are often in the millisecond range, this granularity is too coarse and leads to inaccurate P50/P90 metrics.

Proposed Solution

The controller now stamps a high-precision controller-first-observed-at annotation during its first reconciliation cycle. The new metric measures the duration from this observation point to the "Ready" state.

Notes for the reviewer

  • Measures Controller-Observed Latency: This tracks the duration from the controller's first observation to the "Ready" state, rather than total client-perceived creation time. (A separate SDK metric will be created to track the full client-to-Ready latency).
  • Excludes Pre-Reconciliation Overhead: It omits initial API server processing, watch latency, and workqueue delays occurring before the first reconciliation cycle. This makes it a "partial" server-side metric focused strictly on controller performance.
  • Requires Inline Patching: Recording the high-precision timestamp adds an extra API call (inline patch) during the first reconciliation. To minimize API overhead, this is bundled with the tracing annotation patch whenever tracing is enabled.


netlify bot commented Apr 4, 2026

Deploy Preview for agent-sandbox canceled.

🔨 Latest commit: 638ce4f
🔍 Latest deploy log: https://app.netlify.com/projects/agent-sandbox/deploys/69d0ac3e83823e0008a55658

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: igooch

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added labels on Apr 4, 2026:
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.

// Inline patch, no early return, to avoid forcing a second reconcile cycle.
tc := r.Tracer.GetTraceContext(ctx)
if tc != "" && (claim.Annotations == nil || claim.Annotations[asmetrics.TraceContextAnnotation] == "") {
obsAnnotation := "agent-sandbox.kubernetes.io/controller-first-observed-at"

Nit: I would just call the variables traceContext, observabilityAnnotation, needObservabilityPatch, needTraceContextPatch; it makes it easier to read

tc := r.Tracer.GetTraceContext(ctx)
if tc != "" && (claim.Annotations == nil || claim.Annotations[asmetrics.TraceContextAnnotation] == "") {
obsAnnotation := "agent-sandbox.kubernetes.io/controller-first-observed-at"
needObsPatch := claim.Annotations == nil || claim.Annotations[obsAnnotation] == ""

Suggested change
- needObsPatch := claim.Annotations == nil || claim.Annotations[obsAnnotation] == ""
+ needObsPatch := claim.Annotations[obsAnnotation] == ""

You can read from a nil map; it is treated as an empty map.
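
A small standalone sketch of the nil-map behavior this suggestion relies on. `needsAnnotation` is a hypothetical helper used only for illustration; the language behavior itself (reads on a nil map return the zero value, writes panic) is defined by the Go specification.

```go
package main

import "fmt"

// needsAnnotation (hypothetical) reports whether key is missing or empty.
// No nil check is required: indexing a nil map yields the zero value.
func needsAnnotation(annotations map[string]string, key string) bool {
	return annotations[key] == ""
}

func main() {
	var annotations map[string]string // nil, like an object with no annotations

	value, ok := annotations["some-key"]
	fmt.Printf("%q %v %d\n", value, ok, len(annotations)) // prints "" false 0

	fmt.Println(needsAnnotation(annotations, "some-key")) // prints true

	// Writing is different: assigning into a nil map panics, so allocate first.
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations["some-key"] = "v"
	fmt.Println(needsAnnotation(annotations, "some-key")) // prints false
}
```

The nil guard is still needed before the write path, which is why only the read-side checks can be simplified.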

if tc != "" && (claim.Annotations == nil || claim.Annotations[asmetrics.TraceContextAnnotation] == "") {
obsAnnotation := "agent-sandbox.kubernetes.io/controller-first-observed-at"
needObsPatch := claim.Annotations == nil || claim.Annotations[obsAnnotation] == ""
needTcPatch := tc != "" && (claim.Annotations == nil || claim.Annotations[asmetrics.TraceContextAnnotation] == "")

Suggested change
- needTcPatch := tc != "" && (claim.Annotations == nil || claim.Annotations[asmetrics.TraceContextAnnotation] == "")
+ needTcPatch := tc != "" && claim.Annotations[asmetrics.TraceContextAnnotation] == ""


// Record controller startup latency
obsAnnotation := "agent-sandbox.kubernetes.io/controller-first-observed-at"
if claim.Annotations != nil && claim.Annotations[obsAnnotation] != "" {

Suggested change
- if claim.Annotations != nil && claim.Annotations[obsAnnotation] != "" {
+ if obsStr := claim.Annotations[obsAnnotation]; obsStr != "" {

}
claim.Annotations[asmetrics.TraceContextAnnotation] = tc
if needObsPatch {
claim.Annotations[obsAnnotation] = time.Now().Format(time.RFC3339Nano)

You might consider just keeping an in memory map, but ... given we're already writing to the apiserver for the trace... sgtm

@aditya-shantanu (Contributor)

Thanks Ivy.

I think we also need a version of the original metric that optionally reads a client-provided timestamp from a pre-defined annotation.
If that timestamp isn't set, we use now(), and the metric effectively "falls back" to this one. Thoughts?

OR

We can skip emitting the claim_latency_metric if that annotation is not set.
