DatadogMonitor finalizer removed on deletion despite monitor still existing within DataDog. #1327

Open
cehoffman opened this issue Jul 26, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@cehoffman

cehoffman commented Jul 26, 2024

Output of the info page (if this is a bug)

{"Monitor ID":"149735579", "datadogmonitor":"multi/dynamic-pooled-cost-runner-develop-platform-failed", "error":"error deleting monitor: 503 Service Unavailable: upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: delayed connect error: 111", "level":"ERROR", "logger":"controllers.DatadogMonitor", "msg":"failed to finalize monitor", "ts":"2024-07-24T05:03:08Z"}
{"datadogmonitor":"multi/dynamic-pooled-cost-runner-develop-platform-failed", "level":"INFO", "logger":"controllers.DatadogMonitor", "msg":"Reconciling DatadogMonitor", "ts":"2024-07-24T05:04:08Z"}
{"datadogmonitor":"multi/dynamic-pooled-cost-runner-develop-platform-failed", "level":"INFO", "logger":"controllers.DatadogMonitor", "msg":"Reconciling DatadogMonitor", "ts":"2024-07-24T16:20:25Z"}
{"datadogmonitor":"multi/dynamic-pooled-cost-runner-develop-platform-failed", "level":"INFO", "logger":"controllers.DatadogMonitor", "msg":"Adding Finalizer for the DatadogMonitor", "ts":"2024-07-24T16:20:25Z"}
{"datadogmonitor":"multi/dynamic-pooled-cost-runner-develop-platform-failed", "level":"INFO", "logger":"controllers.DatadogMonitor", "msg":"Reconciling DatadogMonitor", "ts":"2024-07-24T16:22:10Z"}
{"Monitor ID":0, "Monitor Name":"dynamic-pooled-cost-runner-develop-platform-failed", "Monitor Namespace":"multi", "datadogmonitor":"multi/dynamic-pooled-cost-runner-develop-platform-failed", "level":"INFO", "logger":"controllers.DatadogMonitor", "msg":"Added required tags", "ts":"2024-07-24T16:22:10Z"}
{"datadogmonitor":"multi/dynamic-pooled-cost-runner-develop-platform-failed", "error":"error creating monitor: 400 Bad Request: {"errors":["Duplicate of an existing monitor_id:149735579 org_id:313359"]}", "level":"ERROR", "logger":"controllers.DatadogMonitor", "msg":"error creating monitor", "ts":"2024-07-24T16:22:10Z"}

These logs start with a DatadogMonitor deletion being processed and then later the DatadogMonitor resource is recreated and fails due to the previous incarnation still existing within DataDog.

Describe what happened:
We have some ephemeral applications that come and go at irregular times. Their related monitors, defined by DatadogMonitor resources, are created and deleted along with the application. Sometimes, if the operator encounters an error response from the DataDog API, the DatadogMonitor can get garbage collected while the monitor remains within DataDog. Once the application comes back into existence and the DatadogMonitors are recreated, some will fail to create due to an already existing monitor.

Describe what you expected:
Expect the operator to not allow a DatadogMonitor to be garbage collected until the monitor has been confirmed deleted from DataDog.
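
A minimal sketch of the behavior I would expect, not the operator's actual code. `MonitorDeleter`, `ErrMonitorNotFound`, `FinalizeMonitor`, and `flakyAPI` are all hypothetical names made up for illustration. The idea is that the finalizer is only released when the delete call succeeded or the monitor is already gone; any other error (such as the 503 in the logs above) keeps the finalizer so the deletion is retried on a later reconcile.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMonitorNotFound is a hypothetical sentinel meaning the remote monitor
// no longer exists, so there is nothing left to clean up.
var ErrMonitorNotFound = errors.New("monitor not found")

// MonitorDeleter is a hypothetical abstraction over the remote delete call.
type MonitorDeleter interface {
	DeleteMonitor(id int64) error
}

// FinalizeMonitor reports whether the finalizer may be removed.
// It returns true only when the remote deletion is confirmed (success or
// already gone). On any other error it returns false plus the wrapped error,
// so the caller keeps the finalizer and requeues.
func FinalizeMonitor(api MonitorDeleter, id int64) (bool, error) {
	err := api.DeleteMonitor(id)
	switch {
	case err == nil, errors.Is(err, ErrMonitorNotFound):
		return true, nil
	default:
		return false, fmt.Errorf("monitor %d not confirmed deleted, keeping finalizer: %w", id, err)
	}
}

// flakyAPI is a stand-in that fails a fixed number of times before
// succeeding, imitating a transient 503 from the remote API.
type flakyAPI struct {
	failuresLeft int
}

func (f *flakyAPI) DeleteMonitor(id int64) error {
	if f.failuresLeft > 0 {
		f.failuresLeft--
		return errors.New("503 Service Unavailable")
	}
	return nil
}
```

With `&flakyAPI{failuresLeft: 1}`, the first call to `FinalizeMonitor` returns `false` with an error and the second returns `true`, so the resource would only be garbage collected after the second attempt.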

Steps to reproduce the issue:
We delete 51 monitors at the same time as part of the application teardown. It is unknown if this burst is an issue for the DataDog API. Creating and deleting a batch of monitors with unchanging details will cause this to happen intermittently. Only a few, > 5, will fail to delete at DataDog.

Additional environment details (Operating System, Cloud provider, etc):
Google Cloud GKE 1.29.6

@cehoffman
Author

We've had to disable operator managed monitors in our more volatile environments due to this issue.

@fanny-jiang
Contributor

Hi @cehoffman, thanks for reporting this issue. I've created a card in our backlog to address this.

fanny-jiang added the bug label Oct 3, 2024