Add CICD metrics #1681

Open
wants to merge 35 commits into base: main

Conversation

christophe-kamphaus-jemmic
Contributor

Fixes #1600

Changes

This PR adds metrics for CICD systems and related attributes.

Merge requirement checklist

docs/attributes-registry/cicd.md (outdated, resolved)
docs/cicd/cicd-metrics.md (outdated, resolved)
model/cicd/metrics.yaml (outdated, resolved)
model/cicd/metrics.yaml (outdated, resolved)
model/cicd/metrics.yaml (resolved)
model/cicd/metrics.yaml (outdated, resolved)
model/cicd/metrics.yaml (outdated, resolved)
model/cicd/metrics.yaml (resolved)
  - id: metric.cicd.errors
    type: metric
    metric_name: cicd.errors
    brief: 'The number of errors in the controller of the CICD system.'
Contributor

@lmolkova, Jan 8, 2025

The cicd.errors name seems too broad: it would cover all kinds of errors in every part of a CI/CD system, at which point it stops being practical.

The first time I read it, I assumed it counted pipeline run errors, but it's probably something else.

I wonder if we could limit the scope of this metric to something reasonable. E.g. should it be cicd.controller.errors? I could envision wanting to differentiate per component, e.g. having a cicd.scheduler.errors, etc.

Contributor Author

I intended to count the errors from the CICD system (controller, scheduler, agent) in this metric and not errors from the pipeline run.

It is indeed good to be able to distinguish the different system components as well as errors from the pipeline runs.
Adriel also made a good point: pipeline run errors should be countable separately, rather than only through cicd.pipeline.run.duration with result=error, because a single pipeline run might encounter several different errors (some recoverable, others not).

Most likely this will be a derived metric, e.g. produced by a span metrics connector on the pipeline run spans or a count connector on the controller logs.

I will define the following metrics to cover these:
- cicd.pipeline.run.errors for errors encountered as part of the pipeline run execution
- cicd.system.errors for errors encountered in CICD system components, with a component attribute (e.g. scheduler, agent, controller, …)
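
To make the distinction between the two proposed counters concrete, here is a minimal sketch in plain Python. It is not the collector connectors mentioned above and not the PR's implementation; the record shapes (`runs`, `system_logs`) and the helper names are hypothetical, chosen only to show why counting runs with result=error differs from counting errors.

```python
from collections import Counter

# Hypothetical pipeline runs: each has a final result and the errors it hit
# along the way (a run can recover from an error and still succeed).
runs = [
    {"pipeline": "build", "result": "success", "errors": ["timeout"]},
    {"pipeline": "build", "result": "error", "errors": ["timeout", "oom"]},
    {"pipeline": "deploy", "result": "success", "errors": []},
]

# Hypothetical log records emitted by CICD system components.
system_logs = [
    {"component": "scheduler", "severity": "ERROR"},
    {"component": "agent", "severity": "INFO"},
    {"component": "controller", "severity": "ERROR"},
    {"component": "scheduler", "severity": "ERROR"},
]


def failed_runs(runs):
    """Number of runs that would be recorded with result=error."""
    return sum(1 for run in runs if run["result"] == "error")


def pipeline_run_errors(runs):
    """Sketch of cicd.pipeline.run.errors: one count per error, per pipeline."""
    counts = Counter()
    for run in runs:
        counts[run["pipeline"]] += len(run["errors"])
    return counts


def system_errors(logs):
    """Sketch of cicd.system.errors: error-level records, per component."""
    counts = Counter()
    for record in logs:
        if record["severity"] == "ERROR":
            counts[record["component"]] += 1
    return counts
```

With this sample data only one run ends with result=error, yet the "build" pipeline encountered three errors in total, which is the gap a separate cicd.pipeline.run.errors counter would close.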

Contributor Author

Done in fc48dd2, 0be8d4d and 996e809.

model/cicd/registry.yaml (outdated, resolved)
model/cicd/metrics.yaml (outdated, resolved)
model/cicd/metrics.yaml (outdated, resolved)
model/cicd/metrics.yaml (resolved)
model/cicd/registry.yaml (outdated, resolved)
@carlosalberto
Contributor

Overall LGTM. A small (non-blocking) question: were cicd.pipeline.run.queued and cicd.pipeline.run.active considered as a single metric, with a state or similar label? Not proposing that, just curious on whether that was considered, as they are very similar.

@christophe-kamphaus-jemmic
Contributor Author

> Overall LGTM. A small (non-blocking) question: were cicd.pipeline.run.queued and cicd.pipeline.run.active considered as a single metric, with a state or similar label? Not proposing that, just curious on whether that was considered, as they are very similar.

No, I did not consider it until now. It's only with the latest changes that the similarity between these two metrics became apparent.
Also, cicd.pipeline.run.duration and cicd.pipeline.run.time_in_queue seem very similar.

I'm open to combining these two metric pairs and distinguishing them with a phase attribute if additional reviewers are in favor of this change.
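
For reviewers weighing that option, a rough sketch of what a single instrument with a phase attribute could look like, in plain Python rather than any OpenTelemetry API. The class `RunTracker`, its methods, and the phase values "queued" and "active" are assumptions made for illustration; nothing here is defined by this PR.

```python
from collections import Counter


class RunTracker:
    """Toy stand-in for one up-down counter keyed by a hypothetical phase attribute."""

    def __init__(self):
        self.values = Counter()

    def add(self, delta, phase):
        # One instrument, with the phase recorded as an attribute value.
        self.values[phase] += delta

    def enqueue(self):
        self.add(1, "queued")

    def start(self):
        # A run leaves the queue and becomes active.
        self.add(-1, "queued")
        self.add(1, "active")

    def finish(self):
        self.add(-1, "active")

    def total(self):
        # Summing over the attribute gives all runs currently in the system.
        return sum(self.values.values())
```

One consequence of this shape is that the total number of in-flight runs falls out of a simple sum over the attribute, whereas with two separate metrics a consumer has to know to add them together.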

Labels
None yet
Projects
Status: Needs More Approval
Development

Successfully merging this pull request may close these issues.

Add metrics for CICD job queues
9 participants