DP-30933: add generic alert module, rename existing alerts #164
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This looks much bigger than it really is, so I will give a bit of background info:
In DP-30669: add tags to alert-conditions-cloudfront #160, I added the option to add new relic tags to the CloudFront alert conditions so we could use the alert muting rules. This was actually kind of a pain, because tags are a separete resource type, and must contain at least 1 tag. Since many of the alerts these modules create are optional, this made it awkward to try to stack the conditions, and I ended up with something like
count = length(var.tags) > 0 ? (var.throughput_enabled ? 1 : 0) : 0on the tags resource.I wanted to add tags to the other alerts as well, but it didn't seem worth the extra effort since we only needed them on CloudFront alerts right now.
Many of our alerts were set up in NR and then exported as terraform, so they include all of the default values, as well as the ones we actually set intentionally. I would have removed them, but the NR terraform documentation is weirdly vague about default values, and I was worried that a future version may change the behavior of defaults and break all our alerts.
Something about the way the alert resource is structured just makes it seem difficult to edit for me. Trying to add in a bunch of extra name formatting made it even worse.
The end result is that I made a
nrql-alertsubmodule that handles the alert + any tags. It contains reasonable defaults for many of the values, so we don't have to supply those manually in every alert anymore. The resulting terraform ismuch easier to read, and as proof, I spotted/fixed 2 or 3 bugs in the existing code.
The real, noticible change this module makes is that the names are formatted nicely now. Examples:
DC EC2 prod - CPU utilization over 90% for at least 120 secondsDC EC2 prod - No metrics reported for at least 600 secondsDC EC2 prod - Memory usage over 90% for at least 120 secondsDC EC2 prod - Storage usage over 90% for at least 120 secondsDC ECS Container nonprod - Memory usage over 90% for at least 300 secondsDC ECS Container nonprod - More than 5 restarts in 7200 secondsDC ECS Cluster prod - CPU utilization over 90% for at least 300 secondsDC ECS Cluster prod - Memory usage over 90% for at least 300 secondsDC Lambda - Average function duration greater than 300 seconds for at least 3600 secondsDC Lambda - Error percent over 5% for at least 3600 secondsDC Lambda - More than 1 events dropped in 3600 secondsDC RDS - 1600GB - CPU utilization over 90% for at least 300 secondsDC RDS - 1600GB - Less than 10% space free for at least 300 secondsDC CloudFront public - Error rate over 8% for at least 300 secondsDC CloudFront public - Less than 5 requests per 60 seconds for over 300 secondsI really wish terraform had a convenient way to format durations of time (
3600 seconds->1 hour, etc). But it doesn't, so for now, everything is seconds.