etcdserver: add metric counters for livez/readyz health checks. #16797
rootHealthCheckGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "etcd",
		Subsystem: "server",
		Name:      "root_health",
		Help:      "This metric records the result of the root readyz/livez check.",
	},
	[]string{"type"},
)
Why do we need this root metric? Can't we just compute this from healthCheckGauge?
The user would need to know that the min over all healthCheckGauge values gives the overall health. That might be inconvenient or error-prone.
It's pretty easy to do that in a Prometheus query. That's exactly what labels are meant for.
ok. Removed rootHealthCheckGauge.
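The point about deriving the root result from labels can be sketched in a few lines. This is only an illustration: it assumes each per-check gauge reports 1 for a passing check and 0 for a failing one (the thread only says that the min gives the overall result), and `overallHealth` is a hypothetical helper that does in Go what a `min` aggregation would do in a Prometheus query. The check name `data_corruption` is made up for the example.

```go
package main

import "fmt"

// overallHealth returns the minimum of the per-check gauge values.
// Assumption: 1 means the check passed and 0 means it failed, so the
// minimum is 1 only when every check passed. With no checks at all it
// returns 1, because nothing is failing.
func overallHealth(checks map[string]float64) float64 {
	overall := 1.0
	for _, v := range checks {
		if v < overall {
			overall = v
		}
	}
	return overall
}

func main() {
	// Hypothetical per-check gauge values, for illustration only.
	checks := map[string]float64{
		"serializable_read": 1,
		"data_corruption":   0,
	}
	fmt.Println(overallHealth(checks)) // prints 0
}
```

Because the aggregate is a pure function of the labeled series, a separate root gauge would carry no extra information.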
})
rootHealthCheckCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
same concern here.
I think the accumulated counter cannot easily be derived from the individual checks. The sum over all the checks depends on how many checks there are.
I agree with Han, this is redundant with the other counter.
Moved the root counts to healthCheckCounter with name="/".
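A minimal sketch of that resolution follows. It is not the PR's code: the counter is modeled as a plain map so the example runs with the standard library only, and the label names other than name="/" (`checkType`, `status`) as well as the status values are assumptions made for illustration.

```go
package main

import "fmt"

// labels is a hypothetical label set. Only name="/" for the root check
// comes from the thread; the other labels and values are assumptions.
type labels struct {
	checkType string // e.g. "livez" or "readyz"
	name      string // individual check name, or "/" for the root result
	status    string // "success" or "error" (assumed values)
}

// counterVec stands in for a Prometheus CounterVec.
type counterVec map[labels]int

func (c counterVec) inc(l labels) { c[l]++ }

func statusOf(ok bool) string {
	if ok {
		return "success"
	}
	return "error"
}

// record counts every individual check result and then the root result
// under name "/", so no separate root counter is needed.
func record(c counterVec, checkType string, results map[string]bool) {
	rootOK := true
	for name, ok := range results {
		c.inc(labels{checkType, name, statusOf(ok)})
		rootOK = rootOK && ok
	}
	c.inc(labels{checkType, "/", statusOf(rootOK)})
}

func main() {
	c := counterVec{}
	// Hypothetical check names and results.
	record(c, "readyz", map[string]bool{"serializable_read": true, "data_corruption": false})
	record(c, "readyz", map[string]bool{"serializable_read": true, "data_corruption": true})
	fmt.Println(c[labels{"readyz", "/", "error"}], c[labels{"readyz", "/", "success"}]) // prints 1 1
}
```

Counting the root result once per probe sidesteps the objection that a sum over individual checks depends on the number of checks.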
/lgtm
/approve
It's a known flake. Could you please rebase on top of main, or ping @etcd-io/maintainers-etcd to rerun the test :P
@@ -241,7 +268,7 @@ func (reg *CheckRegistry) InstallHttpEndpoints(lg *zap.Logger, mux *http.ServeMu
 }

 func (reg *CheckRegistry) runHealthChecks(ctx context.Context, checkNames ...string) Health {
-	h := Health{Health: "true"}
+	h := Health{Status: HealthStatusSuccess}
Let's not break the existing /health endpoint in 3.6.
-	h := Health{Status: HealthStatusSuccess}
+	h := Health{Health: "true", Status: HealthStatusSuccess}
This is not used in the /health endpoint. It is only used for /livez and /readyz.
@@ -250,9 +277,11 @@ func (reg *CheckRegistry) runHealthChecks(ctx context.Context, checkNames ...str
 		}
 		if err := check(ctx); err != nil {
 			fmt.Fprintf(&individualCheckOutput, "[-]%s failed: %v\n", checkName, err)
-			h.Health = "false"
Ditto, let's keep it so as not to break the existing /health endpoint in 3.6.
This is not used in the /health endpoint. It is only used for /livez and /readyz.
	Health string `json:"health"`
	Reason string `json:"reason"`
	// Status field is used in new /readyz or /livez health checks instead of the Health field.
	Status string `json:"-"`
Did you intentionally use "-"?
-	Status string `json:"-"`
+	Status string `json:"status"`
This is ignored because I don't want to change the response of the existing /health endpoint. If we used json:"status", the response would add an empty status field.
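The effect of the two tags can be checked with `encoding/json` directly. The sketch below uses two hypothetical struct names (`healthHidden`, `healthExposed`) that copy the three fields from the diff and differ only in the tag on `Status`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthHidden copies the fields from the diff: Status is tagged json:"-",
// so encoding/json never emits it.
type healthHidden struct {
	Health string `json:"health"`
	Reason string `json:"reason"`
	Status string `json:"-"`
}

// healthExposed is the reviewer's suggested alternative, for comparison.
type healthExposed struct {
	Health string `json:"health"`
	Reason string `json:"reason"`
	Status string `json:"status"`
}

// encode marshals v and returns the JSON text.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(encode(healthHidden{Health: "true"}))  // {"health":"true","reason":""}
	fmt.Println(encode(healthExposed{Health: "true"})) // {"health":"true","reason":"","status":""}
}
```

With `json:"status"` the legacy /health response would gain an empty `status` key, which is the change the author wanted to avoid.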
Why introduce this field Status? Why not reuse the existing Health field to check whether the health check succeeded or failed? If you are planning to remove the fields "Health" and "Reason" and the legacy health check endpoint /health in a future release, e.g. 4.0, please add a comment.
We would still like to use the Health struct. But the Health string field itself is not very descriptive and does not use constant strings. Added a comment about deprecation.
Can we have a separate Health struct for the new endpoints?
The old one will use the Health and Reason fields, while the new one will use Reason and Status.
> Old will use Health and Reason fields, while new will use Reason and Status.

Sounds good to me.
Added a new HealthStatus struct.
Signed-off-by: Siyuan Zhang <[email protected]>
 }

 // newHealthHandler generates a http HandlerFunc for a health check function hfunc.
-func newHealthHandler(path string, lg *zap.Logger, hfunc func(*http.Request) Health) http.HandlerFunc {
+func newHealthHandler(path string, lg *zap.Logger, hfunc func(*http.Request) HealthStatus) http.HandlerFunc {
Should path be replaced with checkType? It would make more sense for health probe logs to emit the checkType instead of the path.
The path seems better here, because it may be
- a root path, something like /livez or /readyz;
- a subpath, something like /livez/serializable_read.
Paths are useful for HTTP access-level logging.
lgtm
Thank you!
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.
part of the work for #16007
label field consistent with K8s health metrics