Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: add metric counters for livez/readyz health checks. #16797

Merged
merged 1 commit into from
Nov 15, 2023

Conversation

siyuanfoundation
Copy link
Contributor

@siyuanfoundation siyuanfoundation commented Oct 18, 2023

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

part of the work for #16007

label field consistent with K8s health metrics

server/etcdserver/api/etcdhttp/health.go Outdated Show resolved Hide resolved
server/etcdserver/api/etcdhttp/health.go Show resolved Hide resolved
server/etcdserver/api/etcdhttp/health.go Outdated Show resolved Hide resolved
server/etcdserver/api/etcdhttp/health.go Outdated Show resolved Hide resolved
server/etcdserver/api/etcdhttp/health_test.go Outdated Show resolved Hide resolved
@siyuanfoundation siyuanfoundation force-pushed the metrics branch 4 times, most recently from 703ad1d to 83858b6 Compare October 18, 2023 22:04
@siyuanfoundation
Copy link
Contributor Author

cc @logicalhan @serathius

Comment on lines 120 to 127
rootHealthCheckGauge = prometheus.NewGaugeVec(prometheus.GaugeOpts{
Namespace: "etcd",
Subsystem: "server",
Name: "root_health",
Help: "This metric records the result of the root readyz/livez check.",
},
[]string{"type"},
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this root metric? Can't we just compute this from healthCheckGauge?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user would need to know that min of all healthCheckGauge gives you the overall health gauge. That might be inconvenient or error prone.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's pretty easy to do that in a prometheus query.. That's exactly what labels are meant for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. Removed rootHealthCheckGauge.

})
rootHealthCheckCounter = prometheus.NewCounterVec(prometheus.CounterOpts{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same concern here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the accum counter can not be easily be derived from individual checks. The sum of all the checks depends on the number of checks there is.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Han, this is redundant with the other counter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved the root counts to healthCheckCounter with name="/"

Copy link

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

server/etcdserver/api/etcdhttp/health_test.go Show resolved Hide resolved
@chaochn47
Copy link
Member

=== FAIL: integration/clientv3/examples  (0.00s)
2023/10/19 17:53:43 Running tests (examples) in dir(/tmp/etcd-integration1751247847): ...
panic: test timed out after 15m0s

It's a known flake. Could you please rebase on top of main or git commit --amend --no-edit and then push ?

@siyuanfoundation

@serathius
Copy link
Member

serathius commented Oct 28, 2023

It's a known flake. Could you please rebase on top of main or git commit --amend --no-edit and then push ?

Or ping @etcd-io/maintainers-etcd to rerun the test :P
Sorry that this is not available to etcd members, Github sucks. #15659

@@ -241,7 +268,7 @@ func (reg *CheckRegistry) InstallHttpEndpoints(lg *zap.Logger, mux *http.ServeMu
}

func (reg *CheckRegistry) runHealthChecks(ctx context.Context, checkNames ...string) Health {
h := Health{Health: "true"}
h := Health{Status: HealthStatusSuccess}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not to break the existing /health endpoint in 3.6.

Suggested change
h := Health{Status: HealthStatusSuccess}
h := Health{Health: "true", Status: HealthStatusSuccess}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not used in /health endpoint. It is only used for livez/healthz.

@@ -250,9 +277,11 @@ func (reg *CheckRegistry) runHealthChecks(ctx context.Context, checkNames ...str
}
if err := check(ctx); err != nil {
fmt.Fprintf(&individualCheckOutput, "[-]%s failed: %v\n", checkName, err)
h.Health = "false"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, let's keep it so as not to break the existing /health in 3.6.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not used in /health endpoint. It is only used for livez/healthz.

server/etcdserver/api/etcdhttp/health.go Outdated Show resolved Hide resolved
server/etcdserver/api/etcdhttp/health.go Outdated Show resolved Hide resolved
server/etcdserver/api/etcdhttp/health.go Outdated Show resolved Hide resolved
server/etcdserver/api/etcdhttp/health.go Outdated Show resolved Hide resolved
Health string `json:"health"`
Reason string `json:"reason"`
// Status field is used in new /readyz or /livez health checks instead of the Health field.
Status string `json:"-"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you intentionally use "-"?

Suggested change
Status string `json:"-"`
Status string `json:"status"`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ignored because I don't want to change the response of existing /health endpoint. If use json:"status", the response would add an empty status field.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why introduce this field Status? Why not to reuse the existing Health to check whether the health check is successful or failed? If you are planning to remove the fields "Health" and "Reason" and the legacy health check endpoint /health in future release e.g. 4.0, please add a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still would like to use the Health struct. But the Health string field is itself not very descriptive and not using constant strings. Added a comment about deprecation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have separate Health struct for new endpoints?
Old will use Health and Reason fields, while new will use Reason and Status.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old will use Health and Reason fields, while new will use Reason and Status.

Sounds good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new HealthStatus struct.

server/etcdserver/api/etcdhttp/health.go Show resolved Hide resolved
server/etcdserver/api/etcdhttp/health.go Show resolved Hide resolved
server/etcdserver/api/etcdhttp/health.go Outdated Show resolved Hide resolved
}

// newHealthHandler generates a http HandlerFunc for a health check function hfunc.
func newHealthHandler(path string, lg *zap.Logger, hfunc func(*http.Request) Health) http.HandlerFunc {
func newHealthHandler(path string, lg *zap.Logger, hfunc func(*http.Request) HealthStatus) http.HandlerFunc {
Copy link
Member

@serathius serathius Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should path be replaced with checkType ? It would make more sense that health probe logs emit the checkType instead of the path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path seems better here, because it may be

  • a root path something like /livez or /readyz;
  • a subpath something like /livez/serializable_read.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paths are useful for HTTP access level logging,

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Thank you!

@ahrtr ahrtr merged commit d0114cf into etcd-io:main Nov 15, 2023
33 checks passed
@siyuanfoundation siyuanfoundation deleted the metrics branch November 15, 2023 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants