[DENA-512] add cockroachdb stock alerts #89
groups:
  - name: CockroachDB
    rules:
      - alert: CockroachdbQueriesErroring
I'm not sure whether we should alert on that. I see errors happening for a few teams right now, even though I think there is no reason to accept erroring queries.
Or should we maybe add a higher threshold?
Judging by the description of the metric:
This metric is a high-level indicator of workload and application degradation with query failures. Use the [Insights page](https://www.cockroachlabs.com/docs/v24.1/ui-insights-page) to find failed executions with their error code to troubleshoot or use application-level logs, if instrumented, to determine the cause of error.
it sounds important. I'd say keep it until somebody complains about it and then revisit
common/stock/cockroachdb.yaml.tmpl
Outdated
annotations:
  summary: "CockroachDB cluster {{$labels.app}} in namespace {{$labels.namespace}} cannot access some chunks of data"
  impact: "Data inaccessibility"
  action: "See replication dashboard, check if there is there enough nodes in the cluster"
Suggested change:
- action: "See replication dashboard, check if there is there enough nodes in the cluster"
+ action: "See replication dashboard, check if there are enough nodes in the cluster"
  action: "See replication dashboard, check if there is there enough nodes in the cluster"
  link: "https://www.cockroachlabs.com/docs/v24.1/cluster-setup-troubleshooting#replication-issues"
- alert: CockroachdbAdmissionOverload
  expr: (sum(label_replace(admission_io_overload, "namespace", "$1", "kubernetes_namespace", "(.*)")) by (namespace, app) > 0.8) * on (namespace) group_left(team) uw_namespace_oncall_team
Looking at the description in here: https://www.cockroachlabs.com/docs/stable/essential-metrics-dedicated, I think we should alert if greater than 1:
If the value of this metric exceeds 1, then it indicates overload. You can also look at the metrics storage.l0-num-files, storage.l0-sublevels or rocksdb.read-amplification directly. A healthy LSM shape is defined as “read-amp < 20” and “L0-files < 1000”, looking at [cluster settings](https://www.cockroachlabs.com/docs/v24.1/cluster-settings) admission.l0_sub_level_count_overload_threshold and admission.l0_file_count_overload_threshold respectively.
Actually, I see you mention it is "close to overload". In that case 0.8 is OK.
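For reference, the two thresholds discussed here (0.8 as "close to overload", and above 1 as actual overload per the CockroachDB docs quoted above) could coexist as two tiers. This is only a sketch, not part of the PR: the second alert name, the `for` durations and the `severity` labels are assumptions made for illustration, and annotations are omitted.

```yaml
groups:
  - name: CockroachDB
    rules:
      # Early warning: the 0.8 threshold used in this PR ("close to overload").
      - alert: CockroachdbAdmissionOverload
        expr: (sum(label_replace(admission_io_overload, "namespace", "$1", "kubernetes_namespace", "(.*)")) by (namespace, app) > 0.8) * on (namespace) group_left(team) uw_namespace_oncall_team
        for: 5m            # assumed duration, not taken from the PR
        labels:
          severity: warning  # hypothetical label
      # Actual overload: the docs say a value above 1 indicates overload.
      # The alert name below is hypothetical.
      - alert: CockroachdbAdmissionOverloaded
        expr: (sum(label_replace(admission_io_overload, "namespace", "$1", "kubernetes_namespace", "(.*)")) by (namespace, app) > 1) * on (namespace) group_left(team) uw_namespace_oncall_team
        for: 5m            # assumed duration, not taken from the PR
        labels:
          severity: critical  # hypothetical label
```

Both expressions keep the same `label_replace` and `group_left(team)` join as the rule under review, so only the comparison value differs between the tiers.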
@george-angel sorry for the late answer!
I created our own version of the CockroachDB dashboard instead of directly using their dashboards, in order to keep it consistent with the other "common tools" dashboards. Their dashboards do contain some metrics that ours doesn't (and vice versa). I was planning to remove them. WDYT?
I would prefer if they stayed. I maintain them, and they are useful for debugging issues that happen often enough to make maintaining them worthwhile. If you want to introduce new, separate dashboards, that's fine.
I wrote the queries myself, since the example CockroachDB alerts are mostly covered by our container stock alerts.