[DENA-512] add cockroachdb stock alerts #89

MarcinGinszt · 2024-06-06T13:44:17Z

I wrote the queries myself, since example cockroachdb alerts are mostly covered by our container stock alerts.

linear · 2024-06-06T13:44:19Z

DENA-512 Add stock alerts for cockroachdb

MarcinGinszt · 2024-06-06T13:46:28Z

common/stock/cockroachdb.yaml.tmpl

+groups:
+  - name: CockroachDB
+    rules:
+      - alert: CockroachdbQueriesErroring


I'm not sure whether we should alert on that: I see errors are happening for a few teams right now, even though I think there is no reason to accept erroring queries.

Or should we maybe add a higher threshold?

Judging by the description of the metric:

This metric is a high-level indicator of workload and application degradation with query failures. Use the [Insights page](https://www.cockroachlabs.com/docs/v24.1/ui-insights-page) to find failed executions with their error code to troubleshoot or use application-level logs, if instrumented, to determine the cause of error.

It sounds important. I'd say keep it until somebody complains about it, and then revisit.
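
For illustration, a minimal sketch of what a "higher threshold" variant could look like if the alert does turn out to be noisy. The metric name `sql_failure_count`, the 5m rate window, the 0.1 errors/s threshold, and the `for` duration are all assumptions and are not taken from this PR; only the `label_replace` and `uw_namespace_oncall_team` join pattern mirrors the other rule in this file.

```yaml
# Sketch only: metric name, rate window, threshold and `for` duration are
# assumed values for illustration, not the rule that was merged.
- alert: CockroachdbQueriesErroring
  # Fire only when the per-app failure rate stays above the assumed threshold.
  expr: (sum(label_replace(rate(sql_failure_count[5m]), "namespace", "$1", "kubernetes_namespace", "(.*)")) by (namespace, app) > 0.1) * on (namespace) group_left(team) uw_namespace_oncall_team
  # Require the condition to hold for a while, so short error bursts do not page.
  for: 10m
```

Alerting on a sustained rate rather than on any non-zero count would tolerate occasional failed statements while still surfacing persistent degradation.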


sbuliarca · 2024-06-06T14:18:31Z

common/stock/cockroachdb.yaml.tmpl

+        annotations:
+          summary: "CockroachDB cluster {{$labels.app}} in namespace {{$labels.namespace}} cannot access some chunks of data"
+          impact: "Data inaccessibility"
+          action: "See replication dashboard, check if there is there enough nodes in the cluster"


Suggested change

action: "See replication dashboard, check if there is there enough nodes in the cluster"

action: "See replication dashboard, check if there are enough nodes in the cluster"

Fixed. I also added a link to the CockroachDB dashboard.

sbuliarca · 2024-06-06T14:21:34Z

common/stock/cockroachdb.yaml.tmpl

+          action: "See replication dashboard, check if there is there enough nodes in the cluster"
+          link: "https://www.cockroachlabs.com/docs/v24.1/cluster-setup-troubleshooting#replication-issues"
+      - alert: CockroachdbAdmissionOverload
+        expr: (sum(label_replace(admission_io_overload,  "namespace", "$1", "kubernetes_namespace", "(.*)")) by (namespace, app) > 0.8)  * on (namespace) group_left(team) uw_namespace_oncall_team


Looking at the description here: https://www.cockroachlabs.com/docs/stable/essential-metrics-dedicated, I think we should alert only if the value is greater than 1:

If the value of this metric exceeds 1, then it indicates overload. You can also look at the metrics storage.l0-num-files, storage.l0-sublevels or rocksdb.read-amplification directly. A healthy LSM shape is defined as “read-amp < 20” and “L0-files < 1000”, looking at [cluster settings](https://www.cockroachlabs.com/docs/v24.1/cluster-settings) admission.l0_sub_level_count_overload_threshold and admission.l0_file_count_overload_threshold respectively.

Actually, I see you mention it is "close to overload". In that case, 0.8 is OK.
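
To make the two thresholds discussed here concrete, a minimal sketch of a two-tier setup: the existing "close to overload" rule at 0.8, plus a hypothetical second rule at 1, which is the value the CockroachDB docs describe as actual overload. The second alert name, the `severity` labels, and the `for` durations are assumptions and are not part of this PR.

```yaml
# Sketch only: the 0.8 expression is the one quoted in this review thread;
# the second rule, the severity labels and the `for` durations are assumed.
groups:
  - name: CockroachDB
    rules:
      # Early warning: the cluster is approaching IO overload.
      - alert: CockroachdbAdmissionOverload
        expr: (sum(label_replace(admission_io_overload, "namespace", "$1", "kubernetes_namespace", "(.*)")) by (namespace, app) > 0.8) * on (namespace) group_left(team) uw_namespace_oncall_team
        for: 5m
        labels:
          severity: warning
      # Hypothetical stricter tier: a value above 1 indicates overload per the docs.
      - alert: CockroachdbAdmissionOverloaded
        expr: (sum(label_replace(admission_io_overload, "namespace", "$1", "kubernetes_namespace", "(.*)")) by (namespace, app) > 1) * on (namespace) group_left(team) uw_namespace_oncall_team
        for: 5m
        labels:
          severity: critical
```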

george-angel · 2024-06-07T05:37:16Z

@MarcinGinszt Have you seen these dashboards?

MarcinGinszt · 2024-06-17T06:32:02Z

@george-angel sorry for the late answer!

The crdb... dashboards are example dashboards provided by Cockroach Labs (see this comment). They were added only as a basis for creating "our" Grafana dashboard (you can see their saved versions).

I created our own version of the CockroachDB dashboard instead of using theirs directly, in order to stay consistent with the other "common tools" dashboards.

They indeed contain some metrics that our dashboard doesn't contain (and vice versa).

I was planning to remove them. WDYT?

george-angel · 2024-06-17T07:02:02Z

I was planning to remove them. WDYT?

I would prefer that they stay. I maintain them, and they are useful for debugging issues, which happens often enough that I want to keep maintaining them.

If you want to introduce new, separate dashboards, that's fine.

add cockroachdb stock alerts

60f3231

MarcinGinszt requested a review from a team June 6, 2024 13:44

MarcinGinszt requested a review from a team as a code owner June 6, 2024 13:44

MarcinGinszt commented Jun 6, 2024


sbuliarca reviewed Jun 6, 2024


sbuliarca reviewed Jun 6, 2024


MarcinGinszt added 2 commits June 17, 2024 09:34

fix typo

7762a11

add link to cockroachdb dashboard

eeda992

MarcinGinszt requested a review from sbuliarca June 17, 2024 07:38

sbuliarca approved these changes Jun 17, 2024


MarcinGinszt merged commit 80add2d into main Jun 17, 2024
1 check passed

MarcinGinszt deleted the DENA-512 branch June 17, 2024 07:50
