@whywaita (Owner) commented Oct 3, 2025

This pull request introduces new Prometheus metrics to improve observability for retry and backoff behaviors in both runner deletion and instance addition workflows. Metrics are now recorded for the number of retries and the duration of exponential backoff in each process, allowing for better monitoring and troubleshooting.

Metrics instrumentation for runner deletion (pkg/runner):

  • Added DeleteRunnerBackoffDuration histogram and DeleteRunnerRetryTotal counter to metrics.go to track backoff duration and retry counts when deleting runners, labeled by runner_uuid.
  • Updated the removeRunners logic in runner_delete.go to increment the retry counter and observe the backoff duration whenever a retry occurs.
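
As a rough illustration of what these two metric types capture (this is not the code from the PR): a counter keeps one running total per label value, and a histogram keeps cumulative bucket counts plus a sum and a count per label value. The sketch below models that with the standard library only. The names labeledCounter and labeledHistogram, the bucket bounds, and the "runner-a" label value are all hypothetical stand-ins; the real metrics.go presumably relies on the Prometheus Go client rather than hand-rolled types.

```go
package main

import (
	"fmt"
	"sort"
)

// labeledCounter is a hypothetical in-memory stand-in for a counter
// labeled by runner_uuid: one increasing total per label value.
type labeledCounter map[string]float64

// Inc adds one to the series identified by label.
func (c labeledCounter) Inc(label string) { c[label]++ }

// histSeries holds the data one histogram series would expose.
type histSeries struct {
	buckets []uint64 // buckets[i] counts observations <= bounds[i]
	sum     float64  // sum of all observed values, in seconds
	count   uint64   // number of observations
}

// labeledHistogram is a hypothetical stand-in for a labeled histogram.
type labeledHistogram struct {
	bounds []float64 // bucket upper bounds in seconds, ascending
	series map[string]*histSeries
}

func newLabeledHistogram(bounds []float64) *labeledHistogram {
	sorted := append([]float64(nil), bounds...)
	sort.Float64s(sorted)
	return &labeledHistogram{bounds: sorted, series: map[string]*histSeries{}}
}

// Observe records one value (in seconds) for the series identified by label.
func (h *labeledHistogram) Observe(label string, seconds float64) {
	s, ok := h.series[label]
	if !ok {
		s = &histSeries{buckets: make([]uint64, len(h.bounds))}
		h.series[label] = s
	}
	for i, upper := range h.bounds {
		if seconds <= upper {
			s.buckets[i]++ // cumulative: every bucket that covers the value
		}
	}
	s.sum += seconds
	s.count++
}

func main() {
	retries := labeledCounter{}
	backoff := newLabeledHistogram([]float64{1, 2, 4, 8})

	// Three retries for one runner, waiting 1s, 2s, then 4s.
	for _, seconds := range []float64{1, 2, 4} {
		retries.Inc("runner-a")
		backoff.Observe("runner-a", seconds)
	}

	s := backoff.series["runner-a"]
	fmt.Println("retries:", retries["runner-a"])
	fmt.Println("cumulative buckets:", s.buckets, "sum:", s.sum, "count:", s.count)
}
```

Because every distinct label value gets its own series in this model, the number of series grows with the number of runner UUIDs that ever retried.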

Metrics instrumentation for instance addition (pkg/starter):

  • Added AddInstanceBackoffDuration histogram and AddInstanceRetryTotal counter to metrics.go to track backoff duration and retry counts when adding instances, labeled by job_uuid.
  • Updated the run logic in starter.go to increment the retry counter and observe the backoff duration whenever a retry occurs.
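
The instrumentation pattern described for both workflows (record a retry and its backoff duration each time a retry occurs) can be sketched roughly as follows. This is a minimal standard-library sketch, not the code in starter.go or runner_delete.go: retryRecorder, memoryRecorder, retryWithBackoff, simulate, the doubling delay, the attempt limit of 5, and the "job-a" label value are all assumptions made for illustration. In the PR, the recorder role would be played by the Prometheus counter and histogram.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryRecorder is a hypothetical abstraction over the two metrics:
// a retry counter and a backoff-duration histogram, keyed by a label.
type retryRecorder interface {
	IncRetry(label string)
	ObserveBackoff(label string, d time.Duration)
}

// memoryRecorder is an in-memory recorder used only for this sketch.
type memoryRecorder struct {
	retries  map[string]int
	backoffs map[string][]time.Duration
}

func newMemoryRecorder() *memoryRecorder {
	return &memoryRecorder{
		retries:  map[string]int{},
		backoffs: map[string][]time.Duration{},
	}
}

func (m *memoryRecorder) IncRetry(label string) { m.retries[label]++ }

func (m *memoryRecorder) ObserveBackoff(label string, d time.Duration) {
	m.backoffs[label] = append(m.backoffs[label], d)
}

// retryWithBackoff calls op until it succeeds or maxAttempts is reached.
// Before each retry it records the retry and the delay it is about to
// wait, then sleeps. The delay doubles after every failed attempt.
func retryWithBackoff(label string, maxAttempts int, base time.Duration,
	sleep func(time.Duration), rec retryRecorder, op func() error) error {
	if maxAttempts < 1 {
		return errors.New("maxAttempts must be at least 1")
	}
	var err error
	delay := base
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break // no further retry follows, so nothing is recorded
		}
		rec.IncRetry(label)
		rec.ObserveBackoff(label, delay)
		sleep(delay)
		delay *= 2
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// simulate runs an operation that fails the given number of times and
// returns the recorded retry count and backoff durations in seconds.
func simulate(failures int) (int, []float64, error) {
	const label = "job-a" // hypothetical job_uuid value
	rec := newMemoryRecorder()
	calls := 0
	op := func() error {
		calls++
		if calls <= failures {
			return errors.New("transient failure")
		}
		return nil
	}
	noSleep := func(time.Duration) {} // skip real waiting in the sketch
	err := retryWithBackoff(label, 5, time.Second, noSleep, rec, op)

	waits := make([]float64, 0, len(rec.backoffs[label]))
	for _, d := range rec.backoffs[label] {
		waits = append(waits, d.Seconds())
	}
	return rec.retries[label], waits, err
}

func main() {
	retries, waits, err := simulate(2)
	fmt.Println("retries:", retries, "backoff seconds:", waits, "err:", err)
}
```

Passing the sleep function and the recorder as parameters keeps the loop testable without real waiting, which is a design choice of this sketch rather than something the PR states.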

@whywaita requested a review from Copilot on October 3, 2025 06:35

Copilot AI left a comment

Pull Request Overview

This PR adds Prometheus metrics to improve observability for exponential backoff behavior in both runner deletion and instance addition workflows. The changes enable monitoring of retry counts and backoff durations to help with troubleshooting and performance analysis.

  • Added histogram metrics for exponential backoff duration tracking
  • Added counter metrics for retry count tracking
  • Instrumented retry logic in both starter and runner modules with metrics recording

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/starter/metrics.go | Defines Prometheus metrics for instance addition backoff duration and retry counts |
| pkg/starter/starter.go | Instruments retry logic to record metrics when retries occur during instance addition |
| pkg/runner/metrics.go | Defines Prometheus metrics for runner deletion backoff duration and retry counts |
| pkg/runner/runner_delete.go | Instruments retry logic to record metrics when retries occur during runner deletion |

@site0801 (Collaborator) left a comment

LGTM

@whywaita force-pushed the feat/metrisc-exponential-backoff branch from 9d004c1 to 6ee4998 on October 3, 2025 07:09
@whywaita merged commit f0e965c into master on Oct 15, 2025
5 checks passed