-
Notifications
You must be signed in to change notification settings - Fork 215
feat: add Kubernetes health probe endpoints #2327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add Kubernetes health probe endpoints #2327
Conversation
Add dedicated health check endpoints for improved Kubernetes integration: - /probe/livez: Liveness probe with heartbeat-based monitoring - /probe/readyz: Readiness probe checking data availability Key features: * Lightweight checks (<10µs response time) using atomic operations * Support for both passive (interval=0) and active collection modes * JSON response format with status, timestamp, and duration * Configurable tolerance (2x collection interval) for liveness detection * Thread-safe implementation with comprehensive test coverage Implementation: * Add LiveChecker and ReadyChecker interfaces to service package * Implement health checks in PowerMonitor with heartbeat tracking * Create HealthProbeService for HTTP endpoint handling * Update Helm chart to use new endpoints by default Breaking change: Helm chart now uses /probe/* endpoints instead of /metrics for health probes, providing more accurate health status detection.
Add healthCheckTolerance option to monitor for flexible liveness probe timing. Default remains 2.0x interval for backward compatibility.
if pm.snapshot.Load() == nil { | ||
return false, fmt.Errorf("no data yet") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't make it not ready. No snapshot only means that no scrape has been made and does not mean monitor is not in ready state ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Monitor is Ready once Run
is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think snapshot-based readiness check is actually better.
Why: Run() can be called successfully, but if firstReading() or calculatePower() fail in refreshSnapshot() (called with run() ), we get:
- running = true (service started)
- snapshot = nil (no data due to collection error)
Don't you think for a monitoring service, "ready" should mean "can provide data", not just "is running".
I think your thoughs make more sense for a startup probe but not a readyness probe.
I have read this documentation : https://kubernetes.io/docs/reference/using-api/health-checks/
It's not crystal clear but they say “The kubelet uses readiness probes to know when a container is ready to start accepting traffic.”
) | ||
|
||
// Create health probe service | ||
healthProbeService := server.NewHealthProbeService(apiServer, pm, pm, logger) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need 2 pm 🤔 ? Shouldn't the health-probe have access to all services, filter those that have Liveness and Readyness checks?
Also keep in mind that when a service's Init() is done, and all services's Run (see internal/service.Runner) is blocked, kepler should be in Ready state. (We may have to rethink the readiness probe, there is chance to simplify it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pm (PowerMonitor) is passed twice because it implements both LiveChecker and ReadyChecker interfaces, serving as both the liveness and readiness probe checker. I wanted a clear split for both, but we can change.
Add dedicated health check endpoints for improved Kubernetes integration:
Key features:
Implementation:
Breaking change: Helm chart now uses /probe/* endpoints instead of /metrics for health probes, providing more accurate health status detection.
Closes 2282