feat: add Kubernetes health probe endpoints #2327

robinlem · 2025-09-17T23:42:11Z

Add dedicated health check endpoints for improved Kubernetes integration:

- /probe/livez: Liveness probe with heartbeat-based monitoring
- /probe/readyz: Readiness probe checking data availability

Key features:

* Lightweight checks (<10µs response time) using atomic operations
* Support for both passive (interval=0) and active collection modes
* JSON response format with status, timestamp, and duration
* Configurable tolerance (2x collection interval) for liveness detection
* Thread-safe implementation with comprehensive test coverage
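
To illustrate the heartbeat idea, here is a minimal sketch of a tolerance-based liveness check built on an atomic timestamp. It is not the code in this PR: the `heartbeat` type, its fields, and the `beatAt`/`isLiveAt` methods are hypothetical names, and treating passive mode (interval=0) as always live, and a missing first heartbeat as not live, are assumptions.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// heartbeat is a hypothetical helper, not the PR's actual type.
type heartbeat struct {
	lastBeat  atomic.Int64  // UnixNano of the last completed collection, 0 if none
	interval  time.Duration // collection interval; 0 means passive mode
	tolerance float64       // multiple of interval allowed before reporting "not live"
}

func newHeartbeat(interval time.Duration, tolerance float64) *heartbeat {
	return &heartbeat{interval: interval, tolerance: tolerance}
}

// beatAt records a completed collection; a single atomic store, no locks.
func (h *heartbeat) beatAt(unixNano int64) { h.lastBeat.Store(unixNano) }

// isLiveAt reports whether the last heartbeat is recent enough at the given time.
func (h *heartbeat) isLiveAt(unixNano int64) (bool, error) {
	if h.interval == 0 {
		// Assumption: in passive mode there is no periodic heartbeat to check.
		return true, nil
	}
	last := h.lastBeat.Load()
	if last == 0 {
		// Assumption: no heartbeat yet counts as not live.
		return false, fmt.Errorf("no heartbeat recorded yet")
	}
	age := time.Duration(unixNano - last)
	limit := time.Duration(float64(h.interval) * h.tolerance)
	if age > limit {
		return false, fmt.Errorf("heartbeat stale: %v > %v", age, limit)
	}
	return true, nil
}

func main() {
	h := newHeartbeat(5*time.Second, 2.0)
	start := time.Now()
	h.beatAt(start.UnixNano())

	// 9s after the last beat is within 2 x 5s; 11s after is not.
	fmt.Println(h.isLiveAt(start.Add(9 * time.Second).UnixNano()))
	fmt.Println(h.isLiveAt(start.Add(11 * time.Second).UnixNano()))
}
```

Because the check is one atomic load plus a comparison, it stays cheap even when the kubelet probes frequently.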

Implementation:

* Add LiveChecker and ReadyChecker interfaces to service package
* Implement health checks in PowerMonitor with heartbeat tracking
* Create HealthProbeService for HTTP endpoint handling
* Update Helm chart to use new endpoints by default

Breaking change: Helm chart now uses /probe/* endpoints instead of /metrics for health probes, providing more accurate health status detection.

Closes #2282

sthaha · 2025-09-18T05:12:00Z

internal/monitor/monitor.go

+	if pm.snapshot.Load() == nil {
+		return false, fmt.Errorf("no data yet")
+	}


A missing snapshot doesn't make the monitor not ready. It only means that no scrape has happened yet, not that the monitor is not in a ready state ...

The monitor is Ready once Run is called.

robinlem · 2025-09-19

I think a snapshot-based readiness check is actually better.

Why: Run() can be called successfully, but if firstReading() or calculatePower() fails in refreshSnapshot() (called within run()), we get:

- running = true (service started)
- snapshot = nil (no data due to a collection error)

Don't you think that, for a monitoring service, "ready" should mean "can provide data", not just "is running"?
I think your reasoning makes more sense for a startup probe than for a readiness probe.

I have read this documentation: https://kubernetes.io/docs/reference/using-api/health-checks/
It's not crystal clear, but it says: “The kubelet uses readiness probes to know when a container is ready to start accepting traffic.”
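
The difference between the two readiness definitions discussed in this thread can be sketched as follows. This is a simplified model, not the monitor's real code: the `monitor` and `powerSnapshot` types, the `run` method taking a `collect` function (a stand-in for refreshSnapshot), and the `failingCollect`/`okCollect` helpers are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// powerSnapshot is a hypothetical stand-in for the monitor's collected data.
type powerSnapshot struct{ watts float64 }

// monitor is a simplified model with the two pieces of state under discussion.
type monitor struct {
	running  atomic.Bool
	snapshot atomic.Pointer[powerSnapshot]
}

// run marks the monitor as started, then attempts one collection.
// A failed collection leaves the snapshot pointer nil.
func (m *monitor) run(collect func() (*powerSnapshot, error)) {
	m.running.Store(true)
	if s, err := collect(); err == nil {
		m.snapshot.Store(s)
	}
}

// readyByRunning: "ready once Run is called".
func (m *monitor) readyByRunning() bool { return m.running.Load() }

// readyBySnapshot: "ready once data can be served", mirroring the check in the diff.
func (m *monitor) readyBySnapshot() (bool, error) {
	if m.snapshot.Load() == nil {
		return false, errors.New("no data yet")
	}
	return true, nil
}

func failingCollect() (*powerSnapshot, error) { return nil, errors.New("sensor read failed") }
func okCollect() (*powerSnapshot, error)      { return &powerSnapshot{watts: 42}, nil }

func main() {
	m := &monitor{}
	m.run(failingCollect)
	ok, err := m.readyBySnapshot()
	fmt.Println(m.readyByRunning(), ok, err) // true false no data yet
}
```

With a failing collector the two definitions disagree, which is the case this comment is concerned about.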

sthaha · 2025-09-18T05:27:32Z

cmd/kepler/main.go

 	)

+	// Create health probe service
+	healthProbeService := server.NewHealthProbeService(apiServer, pm, pm, logger)


Why do we need to pass pm twice 🤔? Shouldn't the health probe have access to all services and filter those that have Liveness and Readiness checks?

Also keep in mind that once every service's Init() is done and all services' Run calls (see internal/service.Runner) are blocking, kepler should be in the Ready state. (We may have to rethink the readiness probe; there is a chance to simplify it.)

robinlem · 2025-09-19

The pm (PowerMonitor) is passed twice because it implements both the LiveChecker and ReadyChecker interfaces, so it serves as both the liveness and the readiness checker. I wanted a clear split between the two, but we can change that.
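
One way the alternative raised in this thread could look is to hand the probe service the full list of services and let it pick out the checkers with type assertions. This is only a sketch of that idea: the `Service` interface with a `Name()` method, the `IsLive`/`IsReady` method names, the `collectCheckers` helper, and the two toy service types are assumptions, not kepler's actual API.

```go
package main

import "fmt"

// Service is an assumed minimal service interface for this sketch.
type Service interface{ Name() string }

// Interface names follow the PR; the method signatures are assumed.
type LiveChecker interface{ IsLive() (bool, error) }
type ReadyChecker interface{ IsReady() (bool, error) }

// collectCheckers filters services down to those implementing each checker interface.
func collectCheckers(services []Service) (live []LiveChecker, ready []ReadyChecker) {
	for _, s := range services {
		if lc, ok := s.(LiveChecker); ok {
			live = append(live, lc)
		}
		if rc, ok := s.(ReadyChecker); ok {
			ready = append(ready, rc)
		}
	}
	return live, ready
}

// powerMonitor implements both checker interfaces.
type powerMonitor struct{}

func (powerMonitor) Name() string           { return "monitor" }
func (powerMonitor) IsLive() (bool, error)  { return true, nil }
func (powerMonitor) IsReady() (bool, error) { return true, nil }

// apiServer implements neither, so it is skipped.
type apiServer struct{}

func (apiServer) Name() string { return "api-server" }

func main() {
	live, ready := collectCheckers([]Service{powerMonitor{}, apiServer{}})
	fmt.Println(len(live), len(ready)) // 1 1
}
```

With this shape the constructor takes one slice instead of the same value twice, and new services gain probes simply by implementing the interfaces.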

github-actions bot added the feat (A new feature or enhancement) label Sep 17, 2025

feat: make health check tolerance configurable

50efc05

Add healthCheckTolerance option to monitor for flexible liveness probe timing. Default remains 2.0x interval for backward compatibility.

sthaha reviewed Sep 18, 2025

robinlem marked this pull request as ready for review September 22, 2025 13:42

sthaha requested a review from vimalk78 September 30, 2025 09:30
