Add metrics about exponential backoff #246
Merged
This pull request introduces new Prometheus metrics to improve observability for retry and backoff behaviors in both runner deletion and instance addition workflows. Metrics are now recorded for the number of retries and the duration of exponential backoff in each process, allowing for better monitoring and troubleshooting.
Metrics instrumentation for runner deletion (`pkg/runner`):
- Added the `DeleteRunnerBackoffDuration` histogram and `DeleteRunnerRetryTotal` counter to `metrics.go` to track backoff duration and retry counts when deleting runners, labeled by `runner_uuid`.
- Updated the `removeRunners` logic in `runner_delete.go` to increment the retry counter and observe the backoff duration in Prometheus whenever a retry occurs.

Metrics instrumentation for instance addition (`pkg/starter`):
- Added the `AddInstanceBackoffDuration` histogram and `AddInstanceRetryTotal` counter to `metrics.go` to track backoff duration and retry counts when adding instances, labeled by `job_uuid`.
- Updated the `run` logic in `starter.go` to increment the retry counter and observe the backoff duration in Prometheus whenever a retry occurs.