[DENA-512] add cockroachdb stock alerts #89

Merged
merged 3 commits into from
Jun 17, 2024

Conversation

MarcinGinszt
Contributor

@MarcinGinszt MarcinGinszt commented Jun 6, 2024

I wrote the queries myself, since the example CockroachDB alerts are mostly covered by our container stock alerts.

@MarcinGinszt MarcinGinszt requested a review from a team June 6, 2024 13:44
@MarcinGinszt MarcinGinszt requested a review from a team as a code owner June 6, 2024 13:44

linear bot commented Jun 6, 2024

groups:
  - name: CockroachDB
    rules:
      - alert: CockroachdbQueriesErroring
Contributor Author

I'm not sure whether we should alert on this: I see errors happening for a few teams right now, even though I think there is no reason to accept erroring queries.

Or should we maybe set a higher threshold?

Contributor

Judging by the description of the metric:

> This metric is a high-level indicator of workload and application degradation with query failures. Use the [Insights page](https://www.cockroachlabs.com/docs/v24.1/ui-insights-page) to find failed executions with their error code to troubleshoot or use application-level logs, if instrumented, to determine the cause of error.

it sounds important. I'd say keep it until somebody complains about it, and then revisit.
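
If the alert does turn out to be too noisy, one way to add a higher threshold is to alert on the share of failing queries rather than on any failure. A minimal sketch, assuming the metric described above is exposed to Prometheus as `sql_failure_count`, that `sql_query_count` is scraped alongside it, and that both carry the `kubernetes_namespace` and `app` labels. The alert name, the 5% ratio and the 10m window are hypothetical placeholders, not values from this PR:

```yaml
groups:
  - name: CockroachDB
    rules:
      # Hypothetical variant of CockroachdbQueriesErroring: fires only when
      # more than 5% of queries have been failing for 10 minutes straight.
      # Metric names and labels are assumptions; check them against the
      # cluster's /_status/vars output before using this.
      - alert: CockroachdbQueriesErrorRatioHigh
        expr: |
          (
            sum by (kubernetes_namespace, app) (rate(sql_failure_count[5m]))
            /
            sum by (kubernetes_namespace, app) (rate(sql_query_count[5m]))
          ) > 0.05
        for: 10m
```

When no queries run at all, the division yields NaN, which the comparison filters out, so an idle cluster does not fire.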

        annotations:
          summary: "CockroachDB cluster {{$labels.app}} in namespace {{$labels.namespace}} cannot access some chunks of data"
          impact: "Data inaccessibility"
          action: "See replication dashboard, check if there is there enough nodes in the cluster"
Contributor

Suggested change
- action: "See replication dashboard, check if there is there enough nodes in the cluster"
+ action: "See replication dashboard, check if there are enough nodes in the cluster"

          action: "See replication dashboard, check if there is there enough nodes in the cluster"
          link: "https://www.cockroachlabs.com/docs/v24.1/cluster-setup-troubleshooting#replication-issues"
      - alert: CockroachdbAdmissionOverload
        expr: (sum(label_replace(admission_io_overload, "namespace", "$1", "kubernetes_namespace", "(.*)")) by (namespace, app) > 0.8) * on (namespace) group_left(team) uw_namespace_oncall_team
Contributor

Looking at the description in here: https://www.cockroachlabs.com/docs/stable/essential-metrics-dedicated, I think we should alert if greater than 1:

> If the value of this metric exceeds 1, then it indicates overload. You can also look at the metrics storage.l0-num-files, storage.l0-sublevels or rocksdb.read-amplification directly. A healthy LSM shape is defined as “read-amp < 20” and “L0-files < 1000”, looking at [cluster settings](https://www.cockroachlabs.com/docs/v24.1/cluster-settings) admission.l0_sub_level_count_overload_threshold and admission.l0_file_count_overload_threshold respectively.

Contributor

Actually, I see you describe it as "close to overload". In that case 0.8 is OK.
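
To keep both readings of the metric, the rule could be split into a "close to overload" warning at 0.8 and a stricter alert once the value exceeds 1, which is where the docs say overload starts. A sketch that reuses the expression from the diff unchanged apart from the threshold. The alert names, the `severity` labels and the `for` durations are assumptions for illustration, not part of this PR:

```yaml
# Warning tier: close to overload (0.8, the threshold used in this PR).
- alert: CockroachdbAdmissionCloseToOverload
  expr: >-
    (sum(label_replace(admission_io_overload, "namespace", "$1", "kubernetes_namespace", "(.*)")) by (namespace, app) > 0.8)
    * on (namespace) group_left(team) uw_namespace_oncall_team
  for: 5m            # hypothetical duration
  labels:
    severity: warning  # hypothetical label value

# Stricter tier: the documented overload condition (value exceeds 1).
- alert: CockroachdbAdmissionOverloaded
  expr: >-
    (sum(label_replace(admission_io_overload, "namespace", "$1", "kubernetes_namespace", "(.*)")) by (namespace, app) > 1)
    * on (namespace) group_left(team) uw_namespace_oncall_team
  for: 5m            # hypothetical duration
  labels:
    severity: critical # hypothetical label value
```

Both rules fire once the value passes 1, so routing or inhibition would need to account for that if only one notification is wanted.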

@MarcinGinszt
Contributor Author

@george-angel sorry for the late answer!

crdb... dashboards are example dashboards provided by Cockroach Labs (see this comment). They were created just for the sake of creating "our" Grafana dashboard (you can see their saved versions).

I created our own version of the CockroachDB dashboard instead of using theirs directly, to keep it consistent with the other "common tools" dashboards.

They indeed contain some metrics that our dashboard doesn't contain (and vice versa).

I was planning to remove them. WDYT?

@george-angel
Member

> I was planning to remove them. WDYT?

I would prefer that they stay. I maintain them, and they are useful for debugging issues, which happens often enough to justify maintaining them.

If you want to introduce new, separate dashboards, that's fine.

@MarcinGinszt MarcinGinszt requested a review from sbuliarca June 17, 2024 07:38
@MarcinGinszt MarcinGinszt merged commit 80add2d into main Jun 17, 2024
1 check passed
@MarcinGinszt MarcinGinszt deleted the DENA-512 branch June 17, 2024 07:50