This repository was archived by the owner on Sep 30, 2024. It is now read-only.
monitoring: relax mean_blocked_seconds_per_conn_request alerts (#59507)
https://github.com/sourcegraph/sourcegraph/pull/59284 dramatically reduced the `mean_blocked_seconds_per_conn_request` issues we've been seeing, but overall delays are still higher, even with generally healthy Cloud SQL resource utilization.
<img width="1630" alt="image" src="https://github.com/sourcegraph/sourcegraph/assets/23356519/91615471-5187-4d15-83e7-5cc94595303c">
Spot-checking the spikes in load in Cloud SQL, it seems that there is a variety of causes for each spike (analytics workloads, Cody Gateway syncs, code intel workloads, gitserver things, `ListSourcegraphDotComIndexableRepos` etc) so I'm chalking this up to "expected". Since this alert is seen firing on a Cloud instance, let's just relax it for now so that it only fires a critical alert on very significant delays.
(cherry picked from commit fc37f74)
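As a rough illustration of what the relaxed thresholds mean in practice, here is a minimal Python sketch of the ratio the generated queries compute (blocked seconds divided by connections waited for, over a 5m window) and how one sample would be classified before and after this change. The names `ConnWindow`, `mean_blocked_seconds`, and `severity` are hypothetical and not part of the Sourcegraph monitoring generator. The sketch also ignores the "for 10m0s" duration requirement and only looks at a single sample.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnWindow:
    """Counter increases over one 5m window (hypothetical container)."""
    blocked_seconds: float  # increase of src_pgsql_conns_blocked_seconds
    waited_for: float       # increase of src_pgsql_conns_waited_for


def mean_blocked_seconds(window: ConnWindow) -> Optional[float]:
    """Blocked seconds per waited-for connection, or None if nothing waited."""
    if window.waited_for == 0:
        return None  # no waits recorded, so there is no mean to report
    return window.blocked_seconds / window.waited_for


def severity(window: ConnWindow, warning: float = 0.1, critical: float = 0.5) -> str:
    """Classify one sample; defaults are the relaxed thresholds from this change."""
    mean = mean_blocked_seconds(window)
    if mean is None:
        return "ok"
    if mean >= critical:
        return "critical"
    if mean >= warning:
        return "warning"
    return "ok"


# 12 blocked seconds across 40 waits is a mean of 0.3s per conn request.
sample = ConnWindow(blocked_seconds=12.0, waited_for=40.0)
old_level = severity(sample, warning=0.05, critical=0.1)  # previous thresholds
new_level = severity(sample)                              # relaxed thresholds
```

With the previous thresholds (0.05 and 0.1) this sample would have been critical. With the relaxed ones it is only a warning, which matches the intent of firing critical alerts only on very significant delays.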
 - Scale up Postgres memory/cpus - [see our scaling guide](https://docs.sourcegraph.com/admin/config/postgres-conf)
+- If using GCP Cloud SQL, check for high lock waits or CPU usage in query insights
 - Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#frontend-mean-blocked-seconds-per-conn-request).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
@@ -922,9 +923,9 @@ Generated query for critical alert: `max((sum(increase(src_cloudkms_cryptographi
 <details>
 <summary>Technical details</summary>

-Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="frontend"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="frontend"}[5m]))) >= 0.05)`
+Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="frontend"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="frontend"}[5m]))) >= 0.1)`

-Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="frontend"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="frontend"}[5m]))) >= 0.1)`
+Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="frontend"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="frontend"}[5m]))) >= 0.5)`

 </details>
@@ -1644,13 +1645,14 @@ Generated query for critical alert: `max((max(max_over_time(src_conf_client_time

 **Descriptions**

-- <span class="badge badge-warning">warning</span> gitserver: 0.05s+ mean blocked seconds per conn request for 10m0s
-- <span class="badge badge-critical">critical</span> gitserver: 0.1s+ mean blocked seconds per conn request for 15m0s
+- <span class="badge badge-warning">warning</span> gitserver: 0.1s+ mean blocked seconds per conn request for 10m0s
+- <span class="badge badge-critical">critical</span> gitserver: 0.5s+ mean blocked seconds per conn request for 10m0s

 **Next steps**

 - Increase SRC_PGSQL_MAX_OPEN together with giving more memory to the database if needed
 - Scale up Postgres memory/cpus - [see our scaling guide](https://docs.sourcegraph.com/admin/config/postgres-conf)
+- If using GCP Cloud SQL, check for high lock waits or CPU usage in query insights
 - Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#gitserver-mean-blocked-seconds-per-conn-request).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
@@ -1666,9 +1668,9 @@ Generated query for critical alert: `max((max(max_over_time(src_conf_client_time
 <details>
 <summary>Technical details</summary>

-Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="gitserver"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="gitserver"}[5m]))) >= 0.05)`
+Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="gitserver"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="gitserver"}[5m]))) >= 0.1)`

-Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="gitserver"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="gitserver"}[5m]))) >= 0.1)`
+Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="gitserver"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="gitserver"}[5m]))) >= 0.5)`

 </details>
@@ -2519,13 +2521,14 @@ Generated query for warning alert: `max((sum by (category) (increase(src_fronten

 **Descriptions**

-- <span class="badge badge-warning">warning</span> precise-code-intel-worker: 0.05s+ mean blocked seconds per conn request for 10m0s
-- <span class="badge badge-critical">critical</span> precise-code-intel-worker: 0.1s+ mean blocked seconds per conn request for 15m0s
+- <span class="badge badge-warning">warning</span> precise-code-intel-worker: 0.1s+ mean blocked seconds per conn request for 10m0s
+- <span class="badge badge-critical">critical</span> precise-code-intel-worker: 0.5s+ mean blocked seconds per conn request for 10m0s

 **Next steps**

 - Increase SRC_PGSQL_MAX_OPEN together with giving more memory to the database if needed
 - Scale up Postgres memory/cpus - [see our scaling guide](https://docs.sourcegraph.com/admin/config/postgres-conf)
+- If using GCP Cloud SQL, check for high lock waits or CPU usage in query insights
 - Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#precise-code-intel-worker-mean-blocked-seconds-per-conn-request).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
@@ -2541,9 +2544,9 @@ Generated query for warning alert: `max((sum by (category) (increase(src_fronten
 <details>
 <summary>Technical details</summary>

-Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="precise-code-intel-worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="precise-code-intel-worker"}[5m]))) >= 0.05)`
+Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="precise-code-intel-worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="precise-code-intel-worker"}[5m]))) >= 0.1)`

-Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="precise-code-intel-worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="precise-code-intel-worker"}[5m]))) >= 0.1)`
+Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="precise-code-intel-worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="precise-code-intel-worker"}[5m]))) >= 0.5)`

 </details>
@@ -3531,13 +3534,14 @@ Generated query for warning alert: `max((sum by (category) (increase(src_fronten

 **Descriptions**

-- <span class="badge badge-warning">warning</span> worker: 0.05s+ mean blocked seconds per conn request for 10m0s
-- <span class="badge badge-critical">critical</span> worker: 0.1s+ mean blocked seconds per conn request for 15m0s
+- <span class="badge badge-warning">warning</span> worker: 0.1s+ mean blocked seconds per conn request for 10m0s
+- <span class="badge badge-critical">critical</span> worker: 0.5s+ mean blocked seconds per conn request for 10m0s

 **Next steps**

 - Increase SRC_PGSQL_MAX_OPEN together with giving more memory to the database if needed
 - Scale up Postgres memory/cpus - [see our scaling guide](https://docs.sourcegraph.com/admin/config/postgres-conf)
+- If using GCP Cloud SQL, check for high lock waits or CPU usage in query insights
 - Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#worker-mean-blocked-seconds-per-conn-request).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
@@ -3553,9 +3557,9 @@ Generated query for warning alert: `max((sum by (category) (increase(src_fronten
 <details>
 <summary>Technical details</summary>

-Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="worker"}[5m]))) >= 0.05)`
+Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="worker"}[5m]))) >= 0.1)`

-Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="worker"}[5m]))) >= 0.1)`
+Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="worker"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="worker"}[5m]))) >= 0.5)`

 </details>
@@ -4779,13 +4783,14 @@ Generated query for warning alert: `max((sum by (category) (increase(src_fronten

 **Descriptions**

-- <span class="badge badge-warning">warning</span> repo-updater: 0.05s+ mean blocked seconds per conn request for 10m0s
-- <span class="badge badge-critical">critical</span> repo-updater: 0.1s+ mean blocked seconds per conn request for 15m0s
+- <span class="badge badge-warning">warning</span> repo-updater: 0.1s+ mean blocked seconds per conn request for 10m0s
+- <span class="badge badge-critical">critical</span> repo-updater: 0.5s+ mean blocked seconds per conn request for 10m0s

 **Next steps**

 - Increase SRC_PGSQL_MAX_OPEN together with giving more memory to the database if needed
 - Scale up Postgres memory/cpus - [see our scaling guide](https://docs.sourcegraph.com/admin/config/postgres-conf)
+- If using GCP Cloud SQL, check for high lock waits or CPU usage in query insights
 - Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#repo-updater-mean-blocked-seconds-per-conn-request).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
@@ -4801,9 +4806,9 @@ Generated query for warning alert: `max((sum by (category) (increase(src_fronten
 <details>
 <summary>Technical details</summary>

-Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="repo-updater"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="repo-updater"}[5m]))) >= 0.05)`
+Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="repo-updater"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="repo-updater"}[5m]))) >= 0.1)`

-Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="repo-updater"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="repo-updater"}[5m]))) >= 0.1)`
+Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="repo-updater"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="repo-updater"}[5m]))) >= 0.5)`

 </details>
@@ -5223,13 +5228,14 @@ Generated query for critical alert: `max((max(max_over_time(src_conf_client_time

 **Descriptions**

-- <span class="badge badge-warning">warning</span> searcher: 0.05s+ mean blocked seconds per conn request for 10m0s
-- <span class="badge badge-critical">critical</span> searcher: 0.1s+ mean blocked seconds per conn request for 15m0s
+- <span class="badge badge-warning">warning</span> searcher: 0.1s+ mean blocked seconds per conn request for 10m0s
+- <span class="badge badge-critical">critical</span> searcher: 0.5s+ mean blocked seconds per conn request for 10m0s

 **Next steps**

 - Increase SRC_PGSQL_MAX_OPEN together with giving more memory to the database if needed
 - Scale up Postgres memory/cpus - [see our scaling guide](https://docs.sourcegraph.com/admin/config/postgres-conf)
+- If using GCP Cloud SQL, check for high lock waits or CPU usage in query insights
 - Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#searcher-mean-blocked-seconds-per-conn-request).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
@@ -5245,9 +5251,9 @@ Generated query for critical alert: `max((max(max_over_time(src_conf_client_time
 <details>
 <summary>Technical details</summary>

-Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="searcher"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="searcher"}[5m]))) >= 0.05)`
+Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="searcher"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="searcher"}[5m]))) >= 0.1)`

-Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="searcher"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="searcher"}[5m]))) >= 0.1)`
+Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="searcher"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="searcher"}[5m]))) >= 0.5)`

 </details>
@@ -5644,13 +5650,14 @@ Generated query for critical alert: `max((max(max_over_time(src_conf_client_time

 **Descriptions**

-- <span class="badge badge-warning">warning</span> symbols: 0.05s+ mean blocked seconds per conn request for 10m0s
-- <span class="badge badge-critical">critical</span> symbols: 0.1s+ mean blocked seconds per conn request for 15m0s
+- <span class="badge badge-warning">warning</span> symbols: 0.1s+ mean blocked seconds per conn request for 10m0s
+- <span class="badge badge-critical">critical</span> symbols: 0.5s+ mean blocked seconds per conn request for 10m0s

 **Next steps**

 - Increase SRC_PGSQL_MAX_OPEN together with giving more memory to the database if needed
 - Scale up Postgres memory/cpus - [see our scaling guide](https://docs.sourcegraph.com/admin/config/postgres-conf)
+- If using GCP Cloud SQL, check for high lock waits or CPU usage in query insights
 - Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#symbols-mean-blocked-seconds-per-conn-request).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
@@ -5666,9 +5673,9 @@ Generated query for critical alert: `max((max(max_over_time(src_conf_client_time
 <details>
 <summary>Technical details</summary>

-Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="symbols"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="symbols"}[5m]))) >= 0.05)`
+Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="symbols"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="symbols"}[5m]))) >= 0.1)`

-Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="symbols"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="symbols"}[5m]))) >= 0.1)`
+Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="symbols"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="symbols"}[5m]))) >= 0.5)`

 </details>
@@ -8060,13 +8067,14 @@ Generated query for critical alert: `max((max(max_over_time(src_conf_client_time

 **Descriptions**

-- <span class="badge badge-warning">warning</span> embeddings: 0.05s+ mean blocked seconds per conn request for 10m0s
-- <span class="badge badge-critical">critical</span> embeddings: 0.1s+ mean blocked seconds per conn request for 15m0s
+- <span class="badge badge-warning">warning</span> embeddings: 0.1s+ mean blocked seconds per conn request for 10m0s
+- <span class="badge badge-critical">critical</span> embeddings: 0.5s+ mean blocked seconds per conn request for 10m0s

 **Next steps**

 - Increase SRC_PGSQL_MAX_OPEN together with giving more memory to the database if needed
 - Scale up Postgres memory/cpus - [see our scaling guide](https://docs.sourcegraph.com/admin/config/postgres-conf)
+- If using GCP Cloud SQL, check for high lock waits or CPU usage in query insights
 - Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#embeddings-mean-blocked-seconds-per-conn-request).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
@@ -8082,9 +8090,9 @@ Generated query for critical alert: `max((max(max_over_time(src_conf_client_time
 <details>
 <summary>Technical details</summary>

-Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="embeddings"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="embeddings"}[5m]))) >= 0.05)`
+Generated query for warning alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="embeddings"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="embeddings"}[5m]))) >= 0.1)`

-Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="embeddings"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="embeddings"}[5m]))) >= 0.1)`
+Generated query for critical alert: `max((sum by (app_name, db_name) (increase(src_pgsql_conns_blocked_seconds{app_name="embeddings"}[5m])) / sum by (app_name, db_name) (increase(src_pgsql_conns_waited_for{app_name="embeddings"}[5m]))) >= 0.5)`