Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[risk=low][RW-14558] Set all duration+alignment windows to (23h30m + 1 hr) #65

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jmthibault79
Copy link
Contributor

@jmthibault79 jmthibault79 commented Mar 12, 2025

There is some configuration difference between Test/Local and Staging/Stable (I haven't tried Prod/Preprod) which made it impossible to apply the Test alerting rules to Staging/Stable: cannot set duration above 23h30m. Oddly, this appears to be a GCP-wide restriction, and I can't figure out why it's allowed in Test!

For example, I see this error on Staging when I try to apply workbench-terraform-modules v0.4.6, but not in Test.

╷
│ Error: Error updating AlertPolicy "projects/all-of-us-rw-staging/alertPolicies/3191228135367459128": googleapi: Error 400: Field alert_policy.conditions[0].condition_absent.duration had an invalid value of "24h59m": Durations longer than 23h30m are not supported.
│
│   with module.workbench.module.monitoring.module.alert_policies.google_monitoring_alert_policy.policy["cron_failure_upload_reporting_snapshot"],
│   on .terraform/modules/workbench/modules/workbench/modules/monitoring/modules/alert_policies/main.tf line 20, in resource "google_monitoring_alert_policy" "policy":
│   20: resource "google_monitoring_alert_policy" "policy" {

The reason we want a duration of ~ 25 hours is to avoid false positives for daily runs: we don't want to trigger if we see a completion signal at 11:01 one day but 11:03 the next. But looking further, it's not only the duration which matters. It's the duration plus an alignmentPeriod. We can satisfy the rule and our requirement in all environments (confirmed) by setting a duration of 23h30m (84600 seconds) and an alignmentPeriod of 1h (3600 s) for a total of 24h30m.

https://cloud.google.com/monitoring/alerts/concepts-indepth

Crons which run more frequently than daily are not affected by this, so they are not updated here.

@jmthibault79 jmthibault79 marked this pull request as ready for review March 12, 2025 13:25
@jmthibault79 jmthibault79 requested review from evrii and Qi77Qi March 12, 2025 13:25
Copy link
Contributor

@evrii evrii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me

@jmthibault79 jmthibault79 changed the title [risk=low][RW-14558] Set all horizons to (23h30m + 1 hr) [risk=low][RW-14558] Set all duration+alignment windows to (23h30m + 1 hr) Mar 12, 2025
@Qi77Qi Qi77Qi requested a review from yonghaoy March 12, 2025 18:52
@Qi77Qi
Copy link

Qi77Qi commented Mar 12, 2025

huddled with @jmthibault79 ...the logic is still pretty weird to me, but seems it's been working...

@yonghaoy do u know the alert policy well enough to see if the change makes sense

Copy link
Collaborator

@yonghaoy yonghaoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think when applied in test last time, we might be using older version of terraform

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants