DP-30933: add generic alert module, rename existing alerts #164

cs-ma · 2023-12-04T17:32:15Z

This looks much bigger than it really is, so I will give a bit of background info:

In DP-30669: add tags to alert-conditions-cloudfront #160, I added the option to add New Relic tags to the CloudFront alert conditions so we could use the alert muting rules. This was actually kind of a pain, because tags are a separate resource type and must contain at least 1 tag. Since many of the alerts these modules create are optional, this made it awkward to stack the conditions, and I ended up with something like
count = length(var.tags) > 0 ? (var.throughput_enabled ? 1 : 0) : 0 on the tags resource.
I wanted to add tags to the other alerts as well, but it didn't seem worth the extra effort since we only needed them on CloudFront alerts right now.
Many of our alerts were set up in NR and then exported as Terraform, so they include all of the default values as well as the ones we actually set intentionally. I would have removed the defaults, but the NR Terraform documentation is weirdly vague about default values, and I was worried that a future version might change the behavior of defaults and break all our alerts.
Something about the way the alert resource is structured just makes it difficult for me to edit. Trying to add a bunch of extra name formatting made it even worse.
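For anyone who hasn't looked at #160, here is a rough sketch of how that stacked count ends up on the tags resource. The resource labels, the policy_id variable, the NRQL query, the thresholds, and the type of var.tags are all assumptions for illustration, not the actual module code; only the count expression is taken from above.

```hcl
# Illustrative sketch only. Assumes the newrelic provider is already configured.

variable "policy_id" {
  type = number # hypothetical input: the alert policy to attach conditions to
}

variable "throughput_enabled" {
  type    = bool
  default = true
}

variable "tags" {
  # Assumed shape: tag key => list of values. The real module may differ.
  type    = map(list(string))
  default = {}
}

# The optional alert condition: created only when the alert is enabled.
resource "newrelic_nrql_alert_condition" "throughput" {
  count     = var.throughput_enabled ? 1 : 0
  policy_id = var.policy_id
  name      = "throughput" # placeholder name

  nrql {
    query = "SELECT count(*) FROM Transaction" # placeholder query
  }

  critical {
    operator              = "below"
    threshold             = 5
    threshold_duration    = 300
    threshold_occurrences = "all"
  }
}

# Tags are a separate resource and need at least one tag block, so this
# resource must be skipped when the alert is disabled OR when no tags are given.
resource "newrelic_entity_tags" "throughput" {
  count = length(var.tags) > 0 ? (var.throughput_enabled ? 1 : 0) : 0
  guid  = newrelic_nrql_alert_condition.throughput[0].entity_guid

  dynamic "tag" {
    for_each = var.tags
    content {
      key    = tag.key
      values = tag.value
    }
  }
}
```

Every optional alert needs its own copy of that pairing, which is what made adding tags to the other alerts feel like more effort than it was worth.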

The end result is that I made a nrql-alert submodule that handles the alert + any tags. It contains reasonable defaults for many of the values, so we don't have to supply those manually in every alert anymore. The resulting Terraform is much easier to read, and as proof, I spotted/fixed 2 or 3 bugs in the existing code.
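To give a feel for the shape of such a submodule, here is a minimal sketch, not the code in this PR. Every variable name and default value below is hypothetical; the point is only that the defaults live in one place and the tags count collapses to a single check, because the optional on/off decision moves to the caller.

```hcl
# Hypothetical internals of a nrql-alert style submodule.
# Assumes the newrelic provider is already configured.

variable "policy_id" {
  type = number
}

variable "name" {
  type = string
}

variable "query" {
  type = string
}

variable "threshold" {
  type = number
}

variable "operator" {
  type    = string
  default = "above" # assumed default
}

variable "threshold_duration" {
  type    = number
  default = 300 # assumed default, in seconds
}

variable "tags" {
  # Assumed shape: tag key => list of values.
  type    = map(list(string))
  default = {}
}

resource "newrelic_nrql_alert_condition" "this" {
  policy_id = var.policy_id
  name      = var.name

  nrql {
    query = var.query
  }

  critical {
    operator              = var.operator
    threshold             = var.threshold
    threshold_duration    = var.threshold_duration
    threshold_occurrences = "all"
  }
}

# Tags are only created when at least one tag is supplied.
resource "newrelic_entity_tags" "this" {
  count = length(var.tags) > 0 ? 1 : 0
  guid  = newrelic_nrql_alert_condition.this.entity_guid

  dynamic "tag" {
    for_each = var.tags
    content {
      key    = tag.key
      values = tag.value
    }
  }
}
```

With a layout like this, a caller would put count on the module block itself to make an alert optional and pass only the values that differ from the defaults.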

The real, noticeable change this module makes is that the names are formatted nicely now. Examples:

DC EC2 prod - CPU utilization over 90% for at least 120 seconds
DC EC2 prod - No metrics reported for at least 600 seconds
DC EC2 prod - Memory usage over 90% for at least 120 seconds
DC EC2 prod - Storage usage over 90% for at least 120 seconds
DC ECS Container nonprod - Memory usage over 90% for at least 300 seconds
DC ECS Container nonprod - More than 5 restarts in 7200 seconds
DC ECS Cluster prod - CPU utilization over 90% for at least 300 seconds
DC ECS Cluster prod - Memory usage over 90% for at least 300 seconds
DC Lambda - Average function duration greater than 300 seconds for at least 3600 seconds
DC Lambda - Error percent over 5% for at least 3600 seconds
DC Lambda - More than 1 events dropped in 3600 seconds
DC RDS - 1600GB - CPU utilization over 90% for at least 300 seconds
DC RDS - 1600GB - Less than 10% space free for at least 300 seconds
DC CloudFront public - Error rate over 8% for at least 300 seconds
DC CloudFront public - Less than 5 requests per 60 seconds for over 300 seconds

I really wish Terraform had a convenient way to format durations of time (3600 seconds -> 1 hour, etc). But it doesn't, so for now, everything is seconds.
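For reference, one possible workaround is a chained conditional in a local. This is only a sketch and is not part of this PR; the variable and local names are made up, and it only handles durations that divide evenly into hours or minutes, falling back to seconds otherwise.

```hcl
variable "duration_seconds" {
  type    = number
  default = 3600 # hypothetical input
}

locals {
  # Use the largest unit that divides the duration evenly.
  # Newlines are allowed inside the parentheses.
  duration_label = (
    var.duration_seconds % 3600 == 0 ? "${var.duration_seconds / 3600} hour(s)" :
    var.duration_seconds % 60 == 0 ? "${var.duration_seconds / 60} minute(s)" :
    "${var.duration_seconds} seconds"
  )
}

output "duration_label" {
  value = local.duration_label
}
```

It works, but repeating something like this for every alert name is clunky enough that sticking with seconds seemed like the simpler choice.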


jaredhm

this is really great, nice one

add generic alert module with reasonable defaults, rename existing al…

d285f5d

…erts

jaredhm approved these changes Dec 4, 2023


Update CHANGELOG.md

ce0b8a8

cs-ma merged commit be84c6c into 1.x Dec 4, 2023
