robinlem

Add dedicated health check endpoints for improved Kubernetes integration:

  • /probe/livez: Liveness probe with heartbeat-based monitoring
  • /probe/readyz: Readiness probe checking data availability

Key features:

  • Lightweight checks (<10µs response time) using atomic operations
  • Support for both passive (interval=0) and active collection modes
  • JSON response format with status, timestamp, and duration
  • Configurable tolerance (2x collection interval) for liveness detection
  • Thread-safe implementation with comprehensive test coverage

Implementation:

  • Add LiveChecker and ReadyChecker interfaces to service package
  • Implement health checks in PowerMonitor with heartbeat tracking
  • Create HealthProbeService for HTTP endpoint handling
  • Update Helm chart to use new endpoints by default

Breaking change: Helm chart now uses /probe/* endpoints instead of /metrics for health probes, providing more accurate health status detection.

Closes #2282
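As a rough illustration of the JSON response format described above (status, timestamp, duration), here is a minimal sketch. It is not the code from this PR: `probeResponse`, `probeHandler`, `probeStatus`, the field names, and the "ok"/"unavailable" status strings are all assumptions made for the example; only the `/probe/livez` and `/probe/readyz` paths come from the description.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeResponse is a hypothetical payload shape; the field names are assumed.
type probeResponse struct {
	Status    string `json:"status"`
	Timestamp string `json:"timestamp"`
	Duration  string `json:"duration"`
	Error     string `json:"error,omitempty"`
}

// probeHandler wraps a check function and reports its outcome as JSON.
// A failed check maps to 503 so the kubelet treats the probe as failed.
func probeHandler(check func() (bool, error)) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		start := time.Now()
		ok, err := check()

		resp := probeResponse{
			Status:    "ok",
			Timestamp: start.UTC().Format(time.RFC3339),
		}
		code := http.StatusOK
		if !ok || err != nil {
			resp.Status = "unavailable"
			code = http.StatusServiceUnavailable
			if err != nil {
				resp.Error = err.Error()
			}
		}
		resp.Duration = time.Since(start).String()

		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(resp)
	}
}

// probeStatus calls a handler in-process and returns the HTTP code and the
// decoded "status" field, which is handy for trying the handler without a server.
func probeStatus(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	var resp probeResponse
	_ = json.Unmarshal(rec.Body.Bytes(), &resp)
	return rec.Code, resp.Status
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/probe/livez", probeHandler(func() (bool, error) { return true, nil }))
	mux.Handle("/probe/readyz", probeHandler(func() (bool, error) {
		return false, fmt.Errorf("no data yet")
	}))

	for _, path := range []string{"/probe/livez", "/probe/readyz"} {
		code, status := probeStatus(mux, path)
		fmt.Println(path, code, status)
	}
}
```

Returning a non-2xx code on failure is what matters to Kubernetes; the JSON body is only there for humans and tooling.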

@github-actions github-actions bot added the feat A new feature or enhancement label Sep 17, 2025
Add healthCheckTolerance option to monitor for flexible liveness probe timing.
Default remains 2.0x interval for backward compatibility.
Comment on lines +483 to +485
if pm.snapshot.Load() == nil {
	return false, fmt.Errorf("no data yet")
}
Collaborator


This doesn't make it not ready. A missing snapshot only means that no scrape has been made yet; it does not mean the monitor is not in a ready state ...

Collaborator


Monitor is Ready once Run is called.

Author

@robinlem robinlem Sep 19, 2025


I think snapshot-based readiness check is actually better.

Why: Run() can be called successfully, but if firstReading() or calculatePower() fails in refreshSnapshot() (called from run()), we get:

  • running = true (service started)
  • snapshot = nil (no data due to collection error)

For a monitoring service, don't you think "ready" should mean "can provide data", not just "is running"?
I think your reasoning makes more sense for a startup probe than for a readiness probe.

I have read this documentation: https://kubernetes.io/docs/reference/using-api/health-checks/
It's not crystal clear, but it says: “The kubelet uses readiness probes to know when a container is ready to start accepting traffic.”

)

// Create health probe service
healthProbeService := server.NewHealthProbeService(apiServer, pm, pm, logger)
Collaborator


Why do we need 2 pm 🤔? Shouldn't the health probe have access to all services and filter those that have liveness and readiness checks?

Also keep in mind that when a service's Init() is done, and all services' Run (see internal/service.Runner) is blocked, kepler should be in Ready state. (We may have to rethink the readiness probe; there is a chance to simplify it.)

Author


The pm (PowerMonitor) is passed twice because it implements both the LiveChecker and ReadyChecker interfaces, so it serves as both the liveness and the readiness checker. I wanted a clear split between the two, but we can change that.
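The alternative raised in the review, handing the health probe every service and letting it pick out the ones that implement the checker interfaces, could look roughly like the sketch below. The interface names come from this PR, but the method signatures, the `Service` interface, and the helper and type names are assumptions for the example.

```go
package main

import "fmt"

// Service is a hypothetical minimal service interface.
type Service interface{ Name() string }

// LiveChecker and ReadyChecker use assumed method signatures.
type LiveChecker interface{ IsLive() (bool, error) }
type ReadyChecker interface{ IsReady() (bool, error) }

// collectCheckers picks out the services that implement each checker
// interface using type assertions, so callers pass the service list once.
func collectCheckers(services []Service) (live []LiveChecker, ready []ReadyChecker) {
	for _, s := range services {
		if lc, ok := s.(LiveChecker); ok {
			live = append(live, lc)
		}
		if rc, ok := s.(ReadyChecker); ok {
			ready = append(ready, rc)
		}
	}
	return live, ready
}

// allReady fails on the first checker that reports not ready.
func allReady(checkers []ReadyChecker) (bool, error) {
	for _, c := range checkers {
		if ok, err := c.IsReady(); !ok || err != nil {
			return false, err
		}
	}
	return true, nil
}

// powerMonitor implements both checker interfaces.
type powerMonitor struct{}

func (powerMonitor) Name() string           { return "power-monitor" }
func (powerMonitor) IsLive() (bool, error)  { return true, nil }
func (powerMonitor) IsReady() (bool, error) { return true, nil }

// apiServer implements neither, so it is skipped.
type apiServer struct{}

func (apiServer) Name() string { return "api-server" }

func main() {
	live, ready := collectCheckers([]Service{powerMonitor{}, apiServer{}})
	ok, err := allReady(ready)
	fmt.Println(len(live), len(ready), ok, err)
}
```

The trade-off is explicitness versus convenience: passing pm twice makes the two roles visible at the call site, while discovery by type assertion keeps the constructor stable as more services gain health checks.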

@robinlem robinlem marked this pull request as ready for review September 22, 2025 13:42
@sthaha sthaha requested a review from vimalk78 September 30, 2025 09:30
Labels
feat A new feature or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API endpoints for health(z)
2 participants