@cs-ma cs-ma commented Dec 4, 2023

This looks much bigger than it really is, so I will give a bit of background info:

  • In DP-30669: add tags to alert-conditions-cloudfront #160, I added the option to add New Relic tags to the CloudFront alert conditions so we could use the alert muting rules. This was actually kind of a pain, because tags are a separate resource type, and the tags resource must contain at least 1 tag. Since many of the alerts these modules create are optional, this made it awkward to stack the conditions, and I ended up with something like
    count = length(var.tags) > 0 ? (var.throughput_enabled ? 1 : 0) : 0 on the tags resource.
    I wanted to add tags to the other alerts as well, but it didn't seem worth the extra effort since we only needed them on CloudFront alerts right now.

  • Many of our alerts were set up in NR and then exported as Terraform, so they include all of the default values alongside the ones we actually set intentionally. I would have removed the defaults, but the NR Terraform documentation is weirdly vague about default values, and I was worried that a future version might change the behavior of defaults and break all our alerts.

  • Something about the way the alert resource is structured makes it difficult for me to edit. Trying to add in a bunch of extra name formatting made it even worse.
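
To illustrate the stacked condition from the first point, here is a minimal sketch of what the tags resource could look like. The resource labels and the shape of var.tags (a map of string lists) are assumptions for illustration, not the actual code from #160; the nested ternary is written in its equivalent && form.

```hcl
# Sketch only: resource labels and the var.tags type are assumed.
variable "tags" {
  type    = map(list(string))
  default = {}
}

variable "throughput_enabled" {
  type    = bool
  default = true
}

resource "newrelic_entity_tags" "throughput" {
  # The tags resource needs at least one tag, and the alert itself is
  # optional, so both conditions have to hold before creating it.
  count = length(var.tags) > 0 && var.throughput_enabled ? 1 : 0

  # Assumes a condition named "throughput" that is also created with count.
  guid = newrelic_nrql_alert_condition.throughput[0].entity_guid

  dynamic "tag" {
    for_each = var.tags
    content {
      key    = tag.key
      values = tag.value
    }
  }
}
```

Repeating this pair of conditions for every optional alert is the part that got awkward.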

The end result is that I made a nrql-alert submodule that handles the alert + any tags. It contains reasonable defaults for many of the values, so we don't have to supply those manually in every alert anymore. The resulting Terraform is much easier to read, and as proof, I spotted/fixed 2 or 3 bugs in the existing code.
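
As a rough idea of what a call to such a submodule might look like, here is a hypothetical sketch. The source path, the input names, and the NRQL query are all assumptions made for illustration; the real submodule interface may differ.

```hcl
# Hypothetical call: every input name below is illustrative only.
module "cpu_alert" {
  source = "./modules/nrql-alert" # assumed path

  policy_id   = var.policy_id
  name_prefix = "DC EC2 prod"
  query       = "SELECT average(cpuPercent) FROM SystemSample" # example query
  threshold   = 90
  duration    = 120

  # May be empty; the submodule would decide whether to create
  # the tags resource at all.
  tags = var.tags
}
```

The point of the design is that anything not listed here falls back to a default defined once inside the submodule.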

The real, noticeable change this module makes is that the names are formatted nicely now. Examples:

  • DC EC2 prod - CPU utilization over 90% for at least 120 seconds

  • DC EC2 prod - No metrics reported for at least 600 seconds

  • DC EC2 prod - Memory usage over 90% for at least 120 seconds

  • DC EC2 prod - Storage usage over 90% for at least 120 seconds

  • DC ECS Container nonprod - Memory usage over 90% for at least 300 seconds

  • DC ECS Container nonprod - More than 5 restarts in 7200 seconds

  • DC ECS Cluster prod - CPU utilization over 90% for at least 300 seconds

  • DC ECS Cluster prod - Memory usage over 90% for at least 300 seconds

  • DC Lambda - Average function duration greater than 300 seconds for at least 3600 seconds

  • DC Lambda - Error percent over 5% for at least 3600 seconds

  • DC Lambda - More than 1 events dropped in 3600 seconds

  • DC RDS - 1600GB - CPU utilization over 90% for at least 300 seconds

  • DC RDS - 1600GB - Less than 10% space free for at least 300 seconds

  • DC CloudFront public - Error rate over 8% for at least 300 seconds

  • DC CloudFront public - Less than 5 requests per 60 seconds for over 300 seconds
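
Names like the ones above can be assembled with Terraform's built-in format function. This is a minimal sketch with assumed local names, not necessarily how the submodule builds them:

```hcl
# Sketch only: the local names are assumptions.
locals {
  name_prefix  = "DC EC2 prod"
  metric_label = "CPU utilization"
  threshold    = 90
  duration     = 120

  # "%%" emits a literal percent sign.
  alert_name = format(
    "%s - %s over %d%% for at least %d seconds",
    local.name_prefix,
    local.metric_label,
    local.threshold,
    local.duration,
  )
}

output "alert_name" {
  value = local.alert_name
}
```

With these values the output matches the first example in the list.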

I really wish Terraform had a convenient way to format durations of time (3600 seconds -> 1 hour, etc). But it doesn't, so for now, everything is seconds.
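
For reference, a workaround is possible with chained conditionals, though it is not exactly convenient. This is a sketch under the assumption that only whole hours and whole minutes need special handling; it is not part of this PR.

```hcl
# Sketch only: not used by the module; local names are assumptions.
locals {
  duration_seconds = 3600

  # Evenly divisible by an hour -> hours, by a minute -> minutes,
  # otherwise fall back to seconds.
  duration_label = (
    local.duration_seconds % 3600 == 0 ? "${local.duration_seconds / 3600} hour(s)" :
    local.duration_seconds % 60 == 0 ? "${local.duration_seconds / 60} minute(s)" :
    "${local.duration_seconds} seconds"
  )
}
```

Handling plurals and mixed values such as 90 minutes would add more branches, which is why seconds seemed simpler for now.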


@jaredhm jaredhm left a comment

this is really great, nice one

@cs-ma cs-ma merged commit be84c6c into 1.x Dec 4, 2023
4 participants