Skip to content

Conversation

OdedViner
Copy link

@OdedViner OdedViner commented Aug 21, 2025

  • integrate controller-runtime healthz and readyz endpoints into the namespace-scoped manager
  • expose probe server on :8081 and register default checks
  • update operator Deployment with liveness, readiness, and startup probes

Tested locally on my machine:

1.Create a private image and push to quay.io:
$ make all
$ docker tag noobaa/noobaa-operator:5.20.0 quay.io/oviner/noobaa-operator:liveness1
$ docker push quay.io/oviner/noobaa-operator:liveness1

2.Create minikube:
$ minikube delete --all
$ minikube config set memory 8000
$ minikube config set cpus 5
$ minikube start

3.create noobaa-cli
$ make cli
oviner~/DEV_REPOS/noobaa-operator/build/_output/bin(liveness_probe)$ ./noobaa-operator-local version
INFO[0000] CLI version: 5.20.0                          
INFO[0000] noobaa-image: noobaa/noobaa-core:master-20250521 
INFO[0000] operator-image: noobaa/noobaa-operator:5.20.0 

4.Deploy Noobaa-operator
oviner~/DEV_REPOS/noobaa-operator/build/_output/bin(liveness_probe)$ ./noobaa-operator-local install --dev --operator-image='quay.io/oviner/noobaa-operator:liveness1'
$ kubectl get pods
NAME                                               READY   STATUS    RESTARTS   AGE
cnpg-controller-manager-bc6cc5889-wl8n8            1/1     Running   0          5m37s
noobaa-core-0                                      2/2     Running   0          3m51s
noobaa-db-pg-cluster-1                             1/1     Running   0          4m43s
noobaa-db-pg-cluster-2                             1/1     Running   0          4m3s
noobaa-default-backing-store-noobaa-pod-9c730ee1   1/1     Running   0          2m38s
noobaa-endpoint-66499bd857-25tjn                   1/1     Running   0          2m51s
noobaa-operator-7fdbdbdfb6-6t5pq                   1/1     Running   0          5m37s

$ kubectl get pods noobaa-operator-7fdbdbdfb6-6t5pq -o yaml| grep oviner
    image: quay.io/oviner/noobaa-operator:liveness1
    image: quay.io/oviner/noobaa-operator:liveness1
    imageID: docker-pullable://quay.io/oviner/noobaa-operator@sha256:e46e35f954f8722db16283ad36d7b742dec76875e2f2d1fcf8c39490373c29b2

$ kubectl get pods noobaa-operator-7fdbdbdfb6-6t5pq -o yaml| grep -i liveness -C 4
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: healthz


5. Force Deployment to Single Pod (no overlap rollouts)
$ kubectl patch deploy noobaa-operator --type='json' -p='[
  {"op":"remove","path":"/spec/strategy/rollingUpdate"},
  {"op":"replace","path":"/spec/strategy/type","value":"Recreate"}
]'
deployment.apps/noobaa-operator patched

$ kubectl scale deploy noobaa-operator --replicas=0
deployment.apps/noobaa-operator scaled

$ kubectl wait --for=delete pod -l noobaa-operator=deployment --timeout=90s || true

$ kubectl scale deploy noobaa-operator --replicas=1
deployment.apps/noobaa-operator scaled

$ kubectl rollout status deploy/noobaa-operator
Waiting for deployment "noobaa-operator" rollout to finish: 0 of 1 updated replicas are available...
deployment "noobaa-operator" successfully rolled out

$ kubectl get pods noobaa-operator-7fdbdbdfb6-2wkss
NAME                               READY   STATUS    RESTARTS   AGE
noobaa-operator-7fdbdbdfb6-2wkss   1/1     Running   0          2m2s

6 Break Liveness Probe:
$ kubectl patch deploy noobaa-operator --type='json' -p='[
  {"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/httpGet/path","value":"/bad"},
  {"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds","value":5},
  {"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/periodSeconds","value":5},
  {"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/failureThreshold","value":1}
]'
deployment.apps/noobaa-operator patched

7.Observe Pod Restarts:
$ kubectl get pods noobaa-operator-7dff47889d-9b2kt 
NAME                               READY   STATUS             RESTARTS     AGE
noobaa-operator-7dff47889d-9b2kt   0/1     CrashLoopBackOff   4 (7s ago)   2m2s

8. Restore Healthy Liveness Probe
$ kubectl patch deploy noobaa-operator --type='json' -p='[
  {"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/httpGet/path","value":"/healthz"},
  {"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/httpGet/port","value":"healthz"},
  {"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds","value":15},
  {"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/periodSeconds","value":10},
  {"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/failureThreshold","value":3}
]'
deployment.apps/noobaa-operator patched

$ kubectl get pods noobaa-operator-7fdbdbdfb6-75mcw
NAME                               READY   STATUS    RESTARTS   AGE
noobaa-operator-7fdbdbdfb6-75mcw   1/1     Running   0          2m7s

Explain the changes

Issues: Fixed #xxx / Gap #xxx

Testing Instructions:

  • Doc added/updated
  • Tests added

Summary by CodeRabbit

  • New Features

    • Added health monitoring endpoints with readiness, liveness, and startup probes for improved reliability.
    • Exposed a dedicated health port to support external health checks.
  • Chores

    • Updated default memory resources to improve stability (requests: 256Mi, limits: 512Mi).

Copy link

coderabbitai bot commented Aug 21, 2025

Walkthrough

Adds health probes and memory settings to the operator Deployment, wires controller-runtime health probe endpoint in the manager, and updates the embedded deployment bundle and its SHA to reflect these YAML changes.

Changes

Cohort / File(s) Summary
Operator deployment YAML updates
deploy/operator.yaml, pkg/bundle/deploy.go
Adds container args (operator, run, --health-probe-bind-address=:8081), exposes port healthz:8081, and configures readinessProbe, livenessProbe, startupProbe hitting /readyz on healthz. Updates memory requests/limits (256Mi/512Mi). Bundled YAML and Sha256_deploy_operator_yaml updated accordingly.
Manager health probe wiring
pkg/operator/manager.go
Introduces HealthProbeBindAddress=":8081" for the non-cluster manager and registers readiness check mgr.AddReadyzCheck("readyz", healthz.Ping) with error handling.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant KS as Kubernetes Scheduler
    participant Pod as Operator Pod/Container
    participant Mgr as Controller-Runtime Manager
    participant HP as Health Probe Endpoint (:8081 /readyz)
    participant Kubelet as Kubelet Probes

    KS->>Pod: Start container with args (operator run --health-probe-bind-address=:8081)
    Pod->>Mgr: Initialize managers
    Mgr->>Mgr: Set HealthProbeBindAddress=":8081"
    Mgr->>HP: Serve health endpoints
    Mgr->>Mgr: AddReadyzCheck("readyz", healthz.Ping)

    rect rgba(230,245,255,0.5)
    Note over Kubelet,HP: Startup probe sequence
    loop until success or 30 failures
      Kubelet->>HP: HTTP GET /readyz
      HP-->>Kubelet: 200/Failure
    end
    end

    rect rgba(240,255,240,0.5)
    Note over Kubelet,HP: Readiness probe
    loop every 10s (post initialDelay 5s)
      Kubelet->>HP: HTTP GET /readyz
      HP-->>Kubelet: 200/Failure
    end
    end

    rect rgba(255,245,230,0.5)
    Note over Kubelet,HP: Liveness probe
    loop every 10s (post initialDelay 15s)
      Kubelet->>HP: HTTP GET /readyz
      HP-->>Kubelet: 200/Failure (restart on threshold)
    end
    end
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • dannyzaken
  • tangledbytes

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f87a9a7 and 3fa2e24.

📒 Files selected for processing (1)
  • pkg/operator/manager.go (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/operator/manager.go
✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (5)
pkg/operator/manager.go (3)

48-50: Harden flag parsing: fall back to a sane default and log.

If the flag isn’t present (e.g., invoked via the wrong subcommand) or empty, you’ll end up setting an empty HealthProbeBindAddress, which disables the probe server. Add a defensive fallback.

- // Probe address from CLI flag (defaults to :8081)
- probeAddr, _ := cmd.Flags().GetString("health-probe-bind-address")
+ // Probe address from CLI flag; fall back to default if missing
+ probeAddr, err := cmd.Flags().GetString("health-probe-bind-address")
+ if err != nil || probeAddr == "" {
+   probeAddr = ":8081"
+   log.Infof("health-probe-bind-address not set; defaulting to %s", probeAddr)
+ }

72-73: Health probe server bound only on the namespace-scoped manager. Confirm this is intentional.

Only mgr has HealthProbeBindAddress; cmgr does not. If the cluster-scoped manager wedges, probes will still report healthy. If that’s acceptable, ignore. Otherwise consider proxying a simple cmgr health indicator into mgr’s checks.


104-112: Consider a stronger readiness check (cache sync) instead of Ping-only.

healthz.Ping always returns success. For readiness, it’s common to gate on informer cache sync to avoid advertising Ready before controllers are usable.

Example:

// after mgr is created
if err := mgr.AddReadyzCheck("cache-sync", func(_ *http.Request) error {
    if mgr.GetCache().WaitForCacheSync(util.Context()) {
        return nil
    }
    return fmt.Errorf("caches not yet synced")
}); err != nil {
    log.Fatalf("Failed to add readiness check: %s", err)
}

If blocking per request is a concern, you can set a one-time “synced” atomic flag from a Runnable that waits on cache sync and have the checker read that flag.

deploy/operator.yaml (2)

48-49: Fix indentation for ports block to satisfy yamllint and avoid accidental structural errors.

Yamllint warns: expected indent 10 but found 12. Align ports with args/volumeMounts under the container spec.

-          ports:
-            - name: healthz
+          ports:
+          - name: healthz
             containerPort: 8081

You can validate with:

  • yamllint deploy/operator.yaml
  • kubectl apply --dry-run=client -f deploy/operator.yaml

59-67: Readiness probe looks good.

/readyz on the named port is correct. Timings are reasonable for controller-runtime defaults. If you adopt cache-sync readiness, /readyz will naturally gate until caches are synced.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e0a1a8d and e07011f.

📒 Files selected for processing (3)
  • deploy/operator.yaml (2 hunks)
  • pkg/operator/manager.go (4 hunks)
  • pkg/operator/operator.go (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
deploy/operator.yaml

[error] 45-45: trailing spaces

(trailing-spaces)


[warning] 48-48: wrong indentation: expected 10 but found 12

(indentation)

🔇 Additional comments (3)
pkg/operator/manager.go (1)

27-27: Import of controller-runtime healthz is appropriate.

This is the correct package for registering /healthz and /readyz checks.

deploy/operator.yaml (2)

68-75: Liveness probe is sensible.

/healthz with a modest initialDelaySeconds is a standard setup for controller-runtime. No concerns.


76-81: Startup probe configuration is conservative and safe.

failureThreshold 30 with periodSeconds 10 gives up to 5 minutes for cold starts and image pulls. This helps prevent premature restarts.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e07011f and c6acf02.

📒 Files selected for processing (3)
  • deploy/operator.yaml (2 hunks)
  • pkg/operator/manager.go (4 hunks)
  • pkg/operator/operator.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/operator/operator.go
  • pkg/operator/manager.go
🧰 Additional context used
🪛 YAMLlint (1.37.1)
deploy/operator.yaml

[error] 45-45: trailing spaces

(trailing-spaces)


[error] 46-46: trailing spaces

(trailing-spaces)


[warning] 49-49: wrong indentation: expected 10 but found 12

(indentation)

🔇 Additional comments (2)
deploy/operator.yaml (2)

60-82: LGTM! Health probe configuration looks good.

The probe configuration is well-structured:

  • Readiness probe uses /readyz endpoint with a short initial delay (5s)
  • Liveness probe uses /healthz endpoint with appropriate initial delay (15s)
  • Startup probe provides sufficient time for initialization (300s total with 30 failures × 10s)
  • All probes use consistent timeout and period settings

The probe timings follow Kubernetes best practices for operator workloads.


52-54: Approved: bound-sa-token mount can safely remain readOnly

I searched the codebase for any write operations targeting /var/run/secrets/openshift/serviceaccount (including Go standard library file writes) and found only references for reading the token file—no code writes to this path. Therefore, mounting the service account token as readOnly: true is safe and aligns with security best practices.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
deploy/operator.yaml (3)

44-47: Good: explicit "run" subcommand included.

This addresses the earlier issue where the manager wouldn't start without "run". With the whitespace fix above, the CLI will parse as intended.


48-50: Fix ports block indentation to conform with lint rules.

Yamllint expects the dash items to align directly under the key. Apply:

-          ports:
-            - name: healthz
-              containerPort: 8081
+          ports:
+          - name: healthz
+            containerPort: 8081

44-47: Trailing spaces in args will break subcommand parsing; fix and align list indentation to satisfy yamllint.

The values "operator␠␠" and "run␠␠␠␠␠␠␠" include trailing spaces, which become part of the argv and will not match Cobra subcommands. Also, yamllint flags the sequence indentation under args. Apply:

-          args:
-            - operator  
-            - run       
-            - --health-probe-bind-address=:8081
+          args:
+          - operator
+          - run
+          - --health-probe-bind-address=:8081
🧹 Nitpick comments (1)
deploy/operator.yaml (1)

68-75: When using startupProbe, drop liveness initialDelaySeconds to avoid redundant delay.

Kubelet suppresses liveness until startup passes, so initialDelaySeconds on liveness is unnecessary and can delay recovery after restarts. Suggest:

           livenessProbe:
             httpGet:
               path: /healthz
               port: healthz
-            initialDelaySeconds: 15
             periodSeconds: 10
             timeoutSeconds: 5
             failureThreshold: 3
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c6acf02 and 63cf177.

📒 Files selected for processing (3)
  • deploy/operator.yaml (2 hunks)
  • pkg/operator/manager.go (4 hunks)
  • pkg/operator/operator.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/operator/operator.go
  • pkg/operator/manager.go
🧰 Additional context used
🪛 YAMLlint (1.37.1)
deploy/operator.yaml

[warning] 45-45: wrong indentation: expected 10 but found 12

(indentation)


[error] 45-45: trailing spaces

(trailing-spaces)


[error] 46-46: trailing spaces

(trailing-spaces)


[warning] 49-49: wrong indentation: expected 10 but found 12

(indentation)

🔇 Additional comments (3)
deploy/operator.yaml (3)

60-67: Readiness probe looks correct and matches controller-runtime defaults.

/readyz on the named port is standard; timings are reasonable.


76-82: Startup probe parameters are sane.

/healthz with failureThreshold: 30 and period: 10s gives up to ~5 minutes for the operator to come up, which is acceptable for CRD installs/migrations.


44-82: Health/Readiness Probes Configuration Verified

All checks confirm that the operator binary and manifest are correctly configured:

  • The --health-probe-bind-address flag is defined with default :8081 in pkg/operator/operator.go:37 and consumed in pkg/operator/manager.go:49.
  • The health and readiness endpoints are registered via mgr.AddHealthzCheck("healthz", healthz.Ping) and mgr.AddReadyzCheck("readyz", healthz.Ping) in pkg/operator/manager.go:105,109.
  • No other container in the manifest binds port 8081, and the probe in deploy/operator.yaml:47–50 matches these settings.

No changes are required.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
deploy/operator.yaml (2)

44-47: Trim trailing spaces and fix args indentation (yamllint errors).

YAMLlint flags trailing spaces on Lines 45–46 and wrong indentation on Line 45. Reduce the list item indent by 2 and remove trailing spaces so CI passes and the manifest parses consistently.

Apply:

           args:
-            - operator  
-            - run       
-            - --health-probe-bind-address=:8081
+          - operator
+          - run
+          - --health-probe-bind-address=:8081

Run to verify:

#!/bin/bash
pip install -q yamllint
yamllint -d "{extends: default, rules: {line-length: disable}}" deploy/operator.yaml

48-50: Align ports list indentation with container-level keys.

List item under ports is over-indented (yamllint: expected 10, found 12). Align the dash with two spaces more than the ports key.

Apply:

           ports:
-            - name: healthz
-              containerPort: 8081
+          - name: healthz
+            containerPort: 8081

Re-run yamllint (same command as above) to confirm no indentation warnings.

🧹 Nitpick comments (5)
pkg/operator/manager.go (2)

72-72: Health probes attached only to the namespace-scoped manager

You bind the probe server only on mgr (namespace-scoped). If the cluster-scoped manager (cmgr) fails to start or crashes, the pod’s readiness/liveness may still report healthy because the probe targets only mgr. If that’s intentional, ignore; otherwise consider one of:

  • Also enabling a probe server for cmgr (e.g., on a sibling port like :8082) and wiring probes for it.
  • Or keep a single probe server (on mgr) but add a readiness check that reflects cmgr health (e.g., flip a flag once cmgr caches sync or controllers start).

If you prefer the second approach, I can sketch a checker that reports NotReady until cmgr signals it’s up.


104-111: Readiness uses Ping only; consider gating on cache sync to avoid false positives

Registering healthz.Ping for both health and readiness works, but readiness will go green as soon as the HTTP server is up, not when informers/controllers are actually ready. Consider adding a cache-sync readiness check so /readyz turns 200 only after caches synchronize.

Example change (adds a cache-sync runnable and checker):

  // Register health and readiness endpoints on mgr
  if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
    log.Fatalf("Failed to add health check: %s", err)
  }

- if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
-   log.Fatalf("Failed to add readiness check: %s", err)
- }
+ // Ready when caches have synced
+ if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
+   log.Fatalf("Failed to add readiness ping: %s", err)
+ }
+ if err := mgr.AddReadyzCheck("caches", cacheReadyChecker()); err != nil {
+   log.Fatalf("Failed to add cache readiness check: %s", err)
+ }

Add this helper (outside the shown range), which uses a one-time sync and a fast check per request:

// at file top-level:
var cachesSynced atomic.Bool

// near other runnables, before starting managers:
util.Panic(mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
  if ok := mgr.GetCache().WaitForCacheSync(ctx); !ok {
    return fmt.Errorf("cache sync failed")
  }
  cachesSynced.Store(true)
  <-ctx.Done()
  return nil
})))

// helper:
func cacheReadyChecker() healthz.Checker {
  return func(_ *http.Request) error {
    if !cachesSynced.Load() {
      return fmt.Errorf("caches not yet synced")
    }
    return nil
  }
}

Note: you’ll need import "net/http" and import "sync/atomic" for the helper.

pkg/bundle/deploy.go (2)

6251-6273: Probe set is sensible; consider strengthening readyz semantics.

  • The readiness, liveness, and startup probes use /readyz and /healthz on the dedicated probe server. Timing defaults (readiness=5s initial delay, liveness=15s, startup=30×10s) look balanced for controller-runtime operators.
  • One potential gap: if your manager registers readyz with healthz.Ping, readiness will always succeed once the probe server is up (it doesn’t wait for caches/leader election). If that’s your current implementation, consider a readiness check that flips true only after caches are synced (or after leader election if that’s your intended signal).

Example (Go) of a non-blocking cache-sync readiness gate:

// after mgr is created:
var cachesSynced atomic.Bool
go func() {
    // Wait until the manager starts and caches sync once
    <-mgr.Elected() // optional: if leader election enabled
    if ok := mgr.GetCache().WaitForCacheSync(context.Background()); ok {
        cachesSynced.Store(true)
    }
}()

// register probes
_ = mgr.AddHealthzCheck("healthz", healthz.Ping)
_ = mgr.AddReadyzCheck("cache-sync", func(_ *http.Request) error {
    if cachesSynced.Load() {
        return nil
    }
    return fmt.Errorf("caches not yet synced")
})

This avoids reporting Pods as Ready before controllers can actually reconcile.

If you’ve already wired a stronger readyz check in manager.go, feel free to ignore; otherwise I can open a follow-up PR with this change.


6251-6273: Nit: explicitly set probe scheme for forward-compatibility (optional).

Kubelet defaults HTTP for httpGet, but being explicit can prevent surprises if defaults change or if TLS is enabled later:

 readinessProbe:
   httpGet:
     path: /readyz
-    port: healthz
+    port: healthz
+    scheme: HTTP
 ...
 livenessProbe:
   httpGet:
     path: /healthz
-    port: healthz
+    port: healthz
+    scheme: HTTP
 ...
 startupProbe:
   httpGet:
     path: /healthz
-    port: healthz
+    port: healthz
+    scheme: HTTP
deploy/operator.yaml (1)

60-82: Probes wiring looks correct; optional tuning with startupProbe present.

Endpoints and named port resolve properly. With startupProbe enabled, initialDelaySeconds on liveness/readiness is redundant; kubelet suppresses them until startup succeeds. Consider removing the initial delays to simplify timing semantics.

Optional diff:

           readinessProbe:
             httpGet:
               path: /readyz
               port: healthz
-            initialDelaySeconds: 5
             periodSeconds: 10
             timeoutSeconds: 5
             failureThreshold: 3
           livenessProbe:
             httpGet:
               path: /healthz
               port: healthz
-            initialDelaySeconds: 15
             periodSeconds: 10
             timeoutSeconds: 5
             failureThreshold: 3

If you keep the delays, that’s fine too—behavior is still correct.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 63cf177 and 3eb84c9.

📒 Files selected for processing (4)
  • deploy/operator.yaml (2 hunks)
  • pkg/bundle/deploy.go (3 hunks)
  • pkg/operator/manager.go (4 hunks)
  • pkg/operator/operator.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/operator/operator.go
🧰 Additional context used
🪛 YAMLlint (1.37.1)
deploy/operator.yaml

[warning] 45-45: wrong indentation: expected 10 but found 12

(indentation)


[error] 45-45: trailing spaces

(trailing-spaces)


[error] 46-46: trailing spaces

(trailing-spaces)


[warning] 49-49: wrong indentation: expected 10 but found 12

(indentation)

🔇 Additional comments (3)
pkg/operator/manager.go (1)

27-27: LGTM: correct import for controller-runtime health checks

Importing sigs.k8s.io/controller-runtime/pkg/healthz is appropriate for registering /healthz and /readyz endpoints.

pkg/bundle/deploy.go (2)

6235-6238: Args look correct; verify entrypoint supports subcommand and flag.

  • The container now invokes: operator run --health-probe-bind-address=:8081. That matches controller-runtime convention, but only if your binary’s root command exposes the run subcommand and recognizes the probe flag.
  • Please sanity-check the operator image you ship in all deployment channels accepts these args; otherwise the container may exit immediately.

Would you like a quick script to pull the image used in CI and validate the help/flags exposed by the binary? I can supply one.


6240-6241: Port naming and probe port reference are consistent.

Defining a named container port healthz: 8081 and referencing it from the probes is a good practice that makes refactors safer. No issues here.

Comment on lines 48 to 50
// Probe address from CLI flag (defaults to :8081)
probeAddr, _ := cmd.Flags().GetString("health-probe-bind-address")

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Don’t ignore flag parsing errors; add a safe default for probe address

cmd.Flags().GetString(...) error is discarded. While it should succeed if the flag is defined, ignoring the error can mask regressions and leaves probeAddr empty (which disables the probe server) if the flag is missing or mistyped.

Apply this minimal diff to handle the error and ensure a sane default:

-	// Probe address from CLI flag (defaults to :8081)
-	probeAddr, _ := cmd.Flags().GetString("health-probe-bind-address")
+	// Probe address from CLI flag (defaults to :8081)
+	probeAddr, err := cmd.Flags().GetString("health-probe-bind-address")
+	if err != nil {
+		log.Warnf("Failed reading --health-probe-bind-address flag: %v (defaulting to :8081)", err)
+		probeAddr = ":8081"
+	}
+	if probeAddr == "" {
+		log.Warn("Empty --health-probe-bind-address disables probes; defaulting to :8081")
+		probeAddr = ":8081"
+	}

To double-check the flag wiring across the repo, run:


🏁 Script executed:

#!/bin/bash
# Verify the flag is defined and used in manifests
rg -n --type=go -C2 'health-probe-bind-address'
rg -n -C3 'health-probe-bind-address' -- 'deploy/**' 'deploy/*.yaml' 'pkg/bundle/**' 2>/dev/null || true

Length of output: 1109


Handle health-probe-bind-address flag parsing errors in pkg/operator/manager.go

  • Verified that --health-probe-bind-address is defined as a persistent flag in pkg/operator/operator.go with a default of :8081, and is referenced in the bundled manifests.
  • In pkg/operator/manager.go (around line 49), calling cmd.Flags().GetString(...)
    1. Discards any lookup error (e.g., if the flag name is mistyped or not registered on this command)
    2. May never pick up the persistent flag, since it lives in the parent command’s flag set
    3. Leaves probeAddr empty on error, inadvertently disabling health probes
  • To ensure the flag is read correctly and to guard against missing or empty values, switch to the persistent flag set, check the error, and fall back to the known default.

Apply this diff:

--- a/pkg/operator/manager.go
+++ b/pkg/operator/manager.go
@@ -46,4 +46,12 @@ func NewManagerCommand(...) {
-   // Probe address from CLI flag (defaults to :8081)
-   probeAddr, _ := cmd.Flags().GetString("health-probe-bind-address")
+   // Probe address from CLI flag (defaults to :8081)
+   probeAddr, err := cmd.PersistentFlags().GetString("health-probe-bind-address")
+   if err != nil {
+       log.Warnf("Failed to read --health-probe-bind-address: %v, defaulting to :8081", err)
+       probeAddr = ":8081"
+   }
+   if probeAddr == "" {
+       log.Warn("Empty --health-probe-bind-address disables probes; defaulting to :8081")
+       probeAddr = ":8081"
+   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Probe address from CLI flag (defaults to :8081)
probeAddr, _ := cmd.Flags().GetString("health-probe-bind-address")
// Probe address from CLI flag (defaults to :8081)
probeAddr, err := cmd.PersistentFlags().GetString("health-probe-bind-address")
if err != nil {
log.Warnf("Failed to read --health-probe-bind-address: %v, defaulting to :8081", err)
probeAddr = ":8081"
}
if probeAddr == "" {
log.Warn("Empty --health-probe-bind-address disables probes; defaulting to :8081")
probeAddr = ":8081"
}
🤖 Prompt for AI Agents
In pkg/operator/manager.go around lines 48 to 50, the call to
cmd.Flags().GetString("health-probe-bind-address") ignores errors and reads the
local flagset (so it can miss the persistent flag), which can leave probeAddr
empty and disable probes; update to use
cmd.PersistentFlags().GetString("health-probe-bind-address"), check and handle
the returned error (log/return or fallback), and if the value is empty or an
error occurred, set probeAddr to the known default ":8081" so health probes
remain enabled.

@nimrod-becker
Copy link
Contributor

@dannyzaken @tangledbytes please take a look

@OdedViner OdedViner force-pushed the liveness_probe branch 2 times, most recently from 4e1b572 to 6095110 Compare August 31, 2025 14:09
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
deploy/operator.yaml (2)

44-47: Trim trailing spaces in args to keep YAML clean and avoid linter failures.
Matches prior feedback; still present on "operator" and "run".

Apply:

           args:
-            - operator  
-            - run       
+            - operator
+            - run
            - --health-probe-bind-address=:8081

48-50: Fix ports block indentation.
Keep it aligned with other container-level keys.

-          ports:
-            - name: healthz
-              containerPort: 8081
+          ports:
+          - name: healthz
+            containerPort: 8081
pkg/operator/manager.go (1)

48-49: Don’t ignore flag errors; read from PersistentFlags and default safely.
Ignoring the error can disable probes if the flag isn’t found.

-	// Probe address from CLI flag (defaults to :8081)
-	probeAddr, _ := cmd.Flags().GetString("health-probe-bind-address")
+	// Probe address from CLI flag (defaults to :8081)
+	probeAddr, err := cmd.PersistentFlags().GetString("health-probe-bind-address")
+	if err != nil {
+		log.Warnf("Failed reading --health-probe-bind-address: %v (defaulting to :8081)", err)
+		probeAddr = ":8081"
+	}
+	if probeAddr == "" {
+		log.Warn("Empty --health-probe-bind-address disables probes; defaulting to :8081")
+		probeAddr = ":8081"
+	}

Run to confirm the flag is defined and referenced:

#!/bin/bash
rg -n --type=go -C2 'health-probe-bind-address'
🧹 Nitpick comments (1)
pkg/operator/manager.go (1)

104-111: Checks registration LGTM; consider clearer names and optional cache-ready check.
Minor: use names like "ping" instead of repeating endpoint names; optionally add a cache sync readiness check if needed.

-	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
+	if err := mgr.AddHealthzCheck("ping", healthz.Ping); err != nil {
 		log.Fatalf("Failed to add health check: %s", err)
 	}
-	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
+	if err := mgr.AddReadyzCheck("ping", healthz.Ping); err != nil {
 		log.Fatalf("Failed to add readiness check: %s", err)
 	}

Optional cache-ready example (if you want stricter readiness):

ready := atomic.Bool{}
_ = mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
    // mark ready after caches start
    ready.Store(true)
    <-ctx.Done()
    return nil
}))
_ = mgr.AddReadyzCheck("started", func(_ *http.Request) error {
    if !ready.Load() { return fmt.Errorf("not started") }
    return nil
})
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4e1b572 and 6095110.

📒 Files selected for processing (4)
  • deploy/operator.yaml (2 hunks)
  • pkg/bundle/deploy.go (3 hunks)
  • pkg/operator/manager.go (4 hunks)
  • pkg/operator/operator.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/operator/operator.go
  • pkg/bundle/deploy.go
🧰 Additional context used
🪛 YAMLlint (1.37.1)
deploy/operator.yaml

[warning] 45-45: wrong indentation: expected 10 but found 12

(indentation)


[error] 45-45: trailing spaces

(trailing-spaces)


[error] 46-46: trailing spaces

(trailing-spaces)


[warning] 49-49: wrong indentation: expected 10 but found 12

(indentation)

🔇 Additional comments (3)
deploy/operator.yaml (1)

76-82: Startup probe endpoint choice is fine.
Using /readyz here is acceptable; no change needed.

pkg/operator/manager.go (2)

27-27: Import healthz looks good.
Required for adding checks.


72-72: Wiring HealthProbeBindAddress into mgr is correct.
This enables /healthz and /readyz.

Comment on lines +60 to +76
readinessProbe:
httpGet:
path: /readyz
port: healthz
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
livenessProbe:
httpGet:
path: /readyz
port: healthz
initialDelaySeconds: 15
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
startupProbe:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Use /healthz for liveness; /readyz for readiness (current YAML uses /readyz for liveness).
This avoids killing a live process just because it isn’t ready yet and matches the PR description.

           readinessProbe:
             httpGet:
               path: /readyz
               port: healthz
             initialDelaySeconds: 5
             periodSeconds: 10
             timeoutSeconds: 5
             failureThreshold: 3
           livenessProbe:
             httpGet:
-              path: /readyz
+              path: /healthz
               port: healthz
             initialDelaySeconds: 15
             periodSeconds: 10
             timeoutSeconds: 5
             failureThreshold: 3

🏁 Script executed:

#!/usr/bin/env bash
set -e
# Locate the operator.yaml file
OP_FILE=$(fd --type f operator.yaml || true)
echo "Found file: $OP_FILE"
[ -n "$OP_FILE" ] || { echo "operator.yaml not found"; exit 1; }
# Print lines 60–76 for verification
sed -n '60,76p' "$OP_FILE"

Length of output: 698


Use /healthz for livenessProbe
Swap the livenessProbe HTTP path to /healthz (keep readinessProbe on /readyz) to avoid killing a live process before it’s ready.

🤖 Prompt for AI Agents
In deploy/operator.yaml around lines 60 to 76, the livenessProbe is using the
/readyz path which can cause a live process to be killed before it is fully
ready; change the livenessProbe httpGet.path value from /readyz to /healthz
while leaving the readinessProbe on /readyz (no other probe changes required).

failureThreshold: 3
livenessProbe:
httpGet:
path: /readyz
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should all probes be the same?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we only perform ping, I suppose it's okay for them to be set to the same path. In the future, we might want to use more advanced probes, but for now, it's enough.

}

// Register health and readiness endpoints on mgr
if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use readyz for both probes, I think we can remove healthz

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Short: "Deployment using operator",
}
// health-probe flag available on operator and inherited by subcommands
cmd.PersistentFlags().String("health-probe-bind-address", ":8081", "HTTP address for health/readiness probes (e.g., :8081)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to have a flag for the probe address? If there isn't a need to change it for some cases, we can have it defined in manager.go (e.g.: similar to the metrics address\port here)

- integrate controller-runtime healthz and readyz endpoints into the namespace-scoped manager
- expose probe server on :8081 and register default checks
- update operator Deployment with liveness, readiness, and startup probes

Signed-off-by: Oded Viner <[email protected]>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
deploy/operator.yaml (2)

48-50: Align ports list indentation under 'ports:'.

Re-align the sequence to satisfy yamllint’s expected indentation.

-          ports:
-            - name: healthz
-              containerPort: 8081
+          ports:
+          - name: healthz
+            containerPort: 8081

68-75: Use /healthz for livenessProbe (keep /readyz for readiness/startup).

Best practice is liveness→/healthz, readiness/startup→/readyz. With current ping checks it’s mostly equivalent, but /healthz avoids coupling liveness to readiness state.

           livenessProbe:
             httpGet:
-              path: /readyz
+              path: /healthz
               port: healthz
🧹 Nitpick comments (1)
pkg/bundle/deploy.go (1)

6270-6275: Optional: consider /healthz for startupProbe for consistency.

Startup can also target /healthz; keeping readiness on /readyz and liveness/startup on /healthz is a common pattern.

           startupProbe:
             httpGet:
-              path: /readyz
+              path: /healthz
               port: healthz
             periodSeconds: 10
             timeoutSeconds: 5
             failureThreshold: 30
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6095110 and f87a9a7.

📒 Files selected for processing (4)
  • deploy/operator.yaml (2 hunks)
  • pkg/bundle/deploy.go (3 hunks)
  • pkg/operator/manager.go (4 hunks)
  • pkg/operator/operator.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/operator/manager.go
  • pkg/operator/operator.go
🧰 Additional context used
🪛 YAMLlint (1.37.1)
deploy/operator.yaml

[warning] 45-45: wrong indentation: expected 10 but found 12

(indentation)


[error] 45-45: trailing spaces

(trailing-spaces)


[error] 46-46: trailing spaces

(trailing-spaces)


[warning] 49-49: wrong indentation: expected 10 but found 12

(indentation)

🔇 Additional comments (3)
pkg/bundle/deploy.go (3)

6237-6241: Add health probe bind address CLI flag (good).

Flag looks correct for controller-runtime; confirms the probe server will listen on :8081. Ensure the operator binary exposes this flag and uses it to start the health server.


6242-6244: Expose named health port (good).

Named port "healthz" matches the httpGet probe references; no issues.


6253-6261: Readiness probe config looks sensible.

/readyz path, timing, and thresholds are reasonable defaults.

Comment on lines +44 to +47
args:
- operator
- run
- --health-probe-bind-address=:8081
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Trim trailing spaces and normalize args indentation.

Lines 45–46 have trailing spaces; fix to keep CI/yamllint green and avoid noisy diffs.

-          args:
-            - operator  
-            - run       
-            - --health-probe-bind-address=:8081
+          args:
+            - operator
+            - run
+            - --health-probe-bind-address=:8081

To verify no trailing spaces remain:


🏁 Script executed:

#!/usr/bin/env bash
grep -nE ' +$' deploy/operator.yaml || echo "No trailing spaces"

Length of output: 93


Trim trailing spaces in deploy/operator.yaml args
Lines 45–46 contain trailing spaces; remove them to keep CI/yamllint green.

🧰 Tools
🪛 YAMLlint (1.37.1)

[warning] 45-45: wrong indentation: expected 10 but found 12

(indentation)


[error] 45-45: trailing spaces

(trailing-spaces)


[error] 46-46: trailing spaces

(trailing-spaces)

🤖 Prompt for AI Agents
In deploy/operator.yaml around lines 44 to 47, the args entries ("operator" and
"run") include trailing spaces on lines 45–46; remove the trailing whitespace
characters from those two args so each arg is trimmed to exact tokens (e.g.,
"operator" and "run") to satisfy yamllint/CI and keep YAML clean.

Comment on lines +6262 to +6269
httpGet:
path: /readyz
port: healthz
initialDelaySeconds: 15
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
startupProbe:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Liveness probe should hit /healthz, not /readyz.

controller-runtime exposes both /healthz and /readyz; liveness is expected to target /healthz. Pointing liveness to /readyz can mask unhealthy-but-ready states. Switch the path to /healthz.

           livenessProbe:
             httpGet:
-              path: /readyz
+              path: /healthz
               port: healthz
             initialDelaySeconds: 15
             periodSeconds: 10
             timeoutSeconds: 5
             failureThreshold: 3
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
httpGet:
path: /readyz
port: healthz
initialDelaySeconds: 15
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
startupProbe:
httpGet:
path: /healthz
port: healthz
initialDelaySeconds: 15
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
startupProbe:
🤖 Prompt for AI Agents
In pkg/bundle/deploy.go around lines 6262 to 6269 the livenessProbe httpGet path
is set to /readyz but should target /healthz; change the livenessProbe httpGet
path value from "/readyz" to "/healthz" so the liveness check uses the
controller-runtime health endpoint and does not mask unhealthy-but-ready states.

Signed-off-by: Oded Viner <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants