Add metrics about exponential backoff #246

whywaita · 2025-10-03T06:35:45Z

This pull request introduces new Prometheus metrics to improve observability for retry and backoff behaviors in both runner deletion and instance addition workflows. Metrics are now recorded for the number of retries and the duration of exponential backoff in each process, allowing for better monitoring and troubleshooting.

Metrics instrumentation for runner deletion (pkg/runner):

Added DeleteRunnerBackoffDuration histogram and DeleteRunnerRetryTotal counter to metrics.go to track backoff duration and retry counts when deleting runners, labeled by runner_uuid.
Updated removeRunners logic in runner_delete.go to increment retry count and observe backoff duration in Prometheus metrics whenever a retry occurs.

Metrics instrumentation for instance addition (pkg/starter):

Added AddInstanceBackoffDuration histogram and AddInstanceRetryTotal counter to metrics.go to track backoff duration and retry counts when adding instances, labeled by job_uuid.
Updated run logic in starter.go to increment retry count and observe backoff duration in Prometheus metrics whenever a retry occurs.

Copilot

Pull Request Overview

This PR adds Prometheus metrics to improve observability for exponential backoff behavior in both runner deletion and instance addition workflows. The changes enable monitoring of retry counts and backoff durations to help with troubleshooting and performance analysis.

Added histogram metrics for exponential backoff duration tracking
Added counter metrics for retry count tracking
Instrumented retry logic in both starter and runner modules with metrics recording

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
pkg/starter/metrics.go	Defines Prometheus metrics for instance addition backoff duration and retry counts
pkg/starter/starter.go	Instruments retry logic to record metrics when retries occur during instance addition
pkg/runner/metrics.go	Defines Prometheus metrics for runner deletion backoff duration and retry counts
pkg/runner/runner_delete.go	Instruments retry logic to record metrics when retries occur during runner deletion

site0801

LGTM

whywaita requested a review from Copilot October 3, 2025 06:35

Copilot AI reviewed Oct 3, 2025

pkg/starter/starter.go (comment resolved)

pkg/runner/runner_delete.go (comment resolved)

whywaita mentioned this pull request Oct 3, 2025

We don't handling exponential backoff to infinite #245

Open

2 tasks

site0801 approved these changes Oct 3, 2025

whywaita added 2 commits October 3, 2025 16:09

Add metrics about exponential backoff

376c322

Record job / runner uuid

6ee4998

whywaita force-pushed the feat/metrisc-exponential-backoff branch from 9d004c1 to 6ee4998 Compare October 3, 2025 07:09

whywaita merged commit f0e965c into master Oct 15, 2025
5 checks passed
