
Conversation

@empovit
Member

@empovit empovit commented Dec 22, 2025

Users of this console plugin are required to replace their current list of DCGM metrics with the list installed by the plugin, because the plugin uses a few metrics that are not included with the DCGM exporter by default.

In order to make this configuration switch less intrusive, and to cover the most likely case when no custom DCGM configuration is used, we want to include the metrics required by this plugin in addition to, and not instead of, the default ones.

Summary by CodeRabbit

  • Chores
    • Released version 0.2.6.
    • Reorganized DCGM metrics handling to use external file sourcing for improved maintainability and easier updates.
    • Added metrics synchronization utility to streamline metric management processes.


@openshift-ci

openshift-ci bot commented Dec 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: empovit

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@empovit empovit force-pushed the auto-generate-metrics-configmap branch 3 times, most recently from ced5098 to ce6cfd0 Compare December 22, 2025 19:41
@empovit
Member Author

empovit commented Dec 22, 2025

/hold

@empovit empovit changed the title from "Auto-generate metrics config by merging with upstream" to "[WIP] Auto-generate metrics config by merging with upstream" Dec 26, 2025
@yakovbeder
Collaborator

Request changes

  • Dedup is whitespace-sensitive: scripts/update-metrics.sh uses awk -F',' '!seen[$1]++', so $1 variants that differ only in whitespace, such as "DCGM_FI_DEV_DEC_UTIL" vs "DCGM_FI_DEV_DEC_UTIL " (trailing space), won't dedupe. The generated deployment/.../files/dcgm-metrics.csv already contains whitespace variants (space before comma), risking duplicates/invalid metric names. Fix by trimming/normalizing $1 before keying.
  • Upstream fetch should fail hard: curl -sSL exits 0 on HTTP error responses and saves the error body as if it were the metrics file; use curl -fsSL and/or validate that the parsed upstream is non-empty before writing output.
  • Nit: .vscode/settings.json change is unrelated — drop or split.

Helm change (.Files.Get "files/dcgm-metrics.csv" | indent 4) LGTM once the generated CSV is normalized.
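
For illustration, the whitespace-insensitive dedup requested above could look roughly like the sketch below. It is not the code from scripts/update-metrics.sh; the function name dedupe_metrics is hypothetical, and the sketch assumes each row starts with the metric name followed by a comma.

```shell
#!/usr/bin/env bash

# Hypothetical helper: keep the first row for each metric name, ignoring
# whitespace around the name. Reads CSV rows on stdin, writes to stdout.
dedupe_metrics() {
  awk 'BEGIN { FS = OFS = "," }
  {
    key = $1
    gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim leading and trailing blanks
    if (!seen[key]++) {
      $1 = key                          # write the normalized name back
      print
    }
  }'
}
```

Setting OFS to a comma matters here: assigning to $1 makes awk rebuild the row, and without it the commas would be replaced by spaces.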

@empovit
Member Author

empovit commented Dec 29, 2025

Thanks @yakovbeder, good catch! Fixing those

@empovit empovit force-pushed the auto-generate-metrics-configmap branch from ce6cfd0 to 0ddcdc5 Compare December 29, 2025 04:09
@empovit
Member Author

empovit commented Jan 1, 2026

/cc @yakovbeder @TomerNewman

Collaborator

@yakovbeder yakovbeder left a comment

Hey @empovit 👋

The script fixes look good:

✅ curl -fsSL for proper error handling
✅ Whitespace trimming in awk dedup logic
✅ .vscode/settings.json removed
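
As a rough sketch of the fail-hard fetch mentioned in the first item, assuming nothing about the actual script beyond the curl flags named in this thread: fetch_upstream and require_metrics are hypothetical names, and the upstream URL is left as a parameter because it is not shown here.

```shell
#!/usr/bin/env bash

# Hypothetical helper: download a file and fail on HTTP errors.
#   -f  exit non-zero on HTTP responses >= 400 instead of saving the error page
#   -sS stay quiet but still print errors
#   -L  follow redirects
fetch_upstream() {
  local url="$1" out="$2"
  curl -fsSL "$url" -o "$out"
}

# Hypothetical helper: fail unless the file has at least one line that
# starts with a DCGM field name, so an empty or comment-only download
# never reaches the output file.
require_metrics() {
  local file="$1"
  if ! grep -Eq '^[[:space:]]*DCGM_' "$file"; then
    echo "error: no DCGM metrics found in ${file}" >&2
    return 1
  fi
}
```

A caller would run fetch_upstream first and require_metrics on the result before writing anything to the generated CSV.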

One question about the relationship with PR #69:

required-metrics.csv includes DCGM_FI_PROF_GR_ENGINE_ACTIVE, but #69 aims to remove profiling metrics due to pre-Volta GPU compatibility issues. Should this PR use DCGM_FI_DEV_GPU_UTIL instead in the required list, or is the plan to have both available and let the plugin code choose?

Also, small nit: required-metrics.csv has a trailing blank line (line 28).

@empovit empovit force-pushed the auto-generate-metrics-configmap branch from 0ddcdc5 to 0ce70bd Compare January 4, 2026 07:41
@coderabbitai

coderabbitai bot commented Jan 4, 2026

Warning

Rate limit exceeded

@empovit has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 0 minutes and 46 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 9780cb2 and 61ad14c.

⛔ Files ignored due to path filters (2)
  • deployment/console-plugin-nvidia-gpu/files/dcgm-metrics.csv is excluded by !**/*.csv
  • scripts/required-metrics.csv is excluded by !**/*.csv
📒 Files selected for processing (5)
  • .github/workflows/helm-lint-test.yaml
  • deployment/console-plugin-nvidia-gpu/.helmignore
  • deployment/console-plugin-nvidia-gpu/Chart.yaml
  • deployment/console-plugin-nvidia-gpu/templates/configmap.yaml
  • scripts/update-metrics.sh

Walkthrough

Bumps the chart version in Chart.yaml and refactors the DCGM metric configuration, replacing the inline metric definitions in the ConfigMap with a reference to an external file. A new shell script fetches metrics from upstream, filters them, merges them with the required metrics, and deduplicates the result.

Changes

Cohort / File(s): Summary

  • Version Update (deployment/console-plugin-nvidia-gpu/Chart.yaml): Version incremented from 0.2.5 to 0.2.6.
  • ConfigMap Template Refactoring (deployment/console-plugin-nvidia-gpu/templates/configmap.yaml): Replaced hardcoded DCGM metric definitions with an external file reference using {{ .Files.Get "files/dcgm-metrics.csv" | indent 4 }}. Metrics data is now sourced from an external file rather than embedded inline.
  • Metrics Management Script (scripts/update-metrics.sh): New bash utility script that fetches upstream DCGM exporter metrics, filters and merges them with the required metrics, deduplicates entries by unique key using awk, and writes a consolidated metrics file. Includes strict bash options, temporary file cleanup, and remote CSV sourcing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately reflects the main change: auto-generating metrics config by merging with upstream DCGM metrics, which is the primary purpose across all three modified files.

@empovit empovit force-pushed the auto-generate-metrics-configmap branch from 0ce70bd to 9780cb2 Compare January 4, 2026 07:42
@empovit
Copy link
Member Author

empovit commented Jan 4, 2026

@yakovbeder #69 is wrong. We will have to support both metrics, depending on which one is available. I believe in some cases both old and new can be found in the same cluster (nodes with old and new GPUs).


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
deployment/console-plugin-nvidia-gpu/templates/configmap.yaml (1)

9-9: The external metrics file exists and is properly configured.

The dcgm-metrics.csv file is present (2,357 bytes, 30 lines) in files/ and contains valid metric definitions. The scripts/update-metrics.sh script is in place to manage updates. The Helm template syntax is correct and will properly load the file.

To strengthen the setup, consider:

  • Adding a .helmignore check to ensure files/dcgm-metrics.csv is packaged with releases
  • Documenting in the README that the metrics file is auto-generated and the update script must be run before deployment
  • Adding CI validation to verify the file is non-empty before release
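
One possible shape for such a CI check is sketched below. It goes slightly beyond the suggestion: besides rejecting an empty file, it assumes the check should also confirm that every metric in the required list made it into the generated CSV. The function name check_generated is hypothetical, and both file paths are passed in rather than hardcoded.

```shell
#!/usr/bin/env bash

# Hypothetical CI check: the generated CSV must be non-empty and must
# contain every metric named in the required list (an added assumption).
check_generated() {
  local generated="$1" required="$2" name rest missing=0
  if [ ! -s "$generated" ]; then
    echo "error: ${generated} is missing or empty" >&2
    return 1
  fi
  # "|| [ -n "$name" ]" also handles a last line without a trailing newline.
  while IFS=',' read -r name rest || [ -n "$name" ]; do
    name="$(printf '%s' "$name" | tr -d '[:space:]')"
    case "$name" in
      ''|'#'*) continue ;;            # skip blank lines and comments
    esac
    if ! grep -Eq "^[[:space:]]*${name}[[:space:]]*," "$generated"; then
      echo "error: required metric ${name} not found in ${generated}" >&2
      missing=1
    fi
  done < "$required"
  return "$missing"
}
```

In a workflow step this might be called as check_generated deployment/console-plugin-nvidia-gpu/files/dcgm-metrics.csv scripts/required-metrics.csv, using the two paths named in this review.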
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6535796 and 9780cb2.

⛔ Files ignored due to path filters (2)
  • deployment/console-plugin-nvidia-gpu/files/dcgm-metrics.csv is excluded by !**/*.csv
  • scripts/required-metrics.csv is excluded by !**/*.csv
📒 Files selected for processing (3)
  • deployment/console-plugin-nvidia-gpu/Chart.yaml
  • deployment/console-plugin-nvidia-gpu/templates/configmap.yaml
  • scripts/update-metrics.sh
🔇 Additional comments (3)
deployment/console-plugin-nvidia-gpu/Chart.yaml (1)

15-15: LGTM! Appropriate version bump.

The patch version increment aligns with the addition of the auto-generated metrics configuration feature.

scripts/update-metrics.sh (2)

1-19: LGTM! Good use of strict bash options and cleanup trap.

The script setup follows best practices with set -euo pipefail for error handling and a trap to clean up temporary files on exit.
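
The pattern being praised can be sketched as follows. This is an illustration of strict options plus a cleanup trap, not the contents of scripts/update-metrics.sh; the helper name with_scratch_dir is hypothetical, and wrapping the pattern in a subshell function is a choice made here so it can be exercised in isolation.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command with a scratch directory appended as
# its last argument. The parentheses make the function body a subshell, so
# the strict options and the EXIT trap stay local to this call.
with_scratch_dir() (
  set -euo pipefail               # stop on errors, unset variables, failed pipeline stages
  scratch="$(mktemp -d)"
  trap 'rm -rf "$scratch"' EXIT   # runs whether the command succeeds or fails
  "$@" "$scratch"
)
```

A standalone script would typically put the same three lines (set, mktemp, trap) at the top of the file instead of inside a function.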


28-30: LGTM! Clear auto-generated file header.

The header comment appropriately indicates the file is auto-generated.

@empovit empovit force-pushed the auto-generate-metrics-configmap branch 2 times, most recently from 6c56b8f to 766ab7f Compare January 4, 2026 08:06
@empovit empovit force-pushed the auto-generate-metrics-configmap branch from 766ab7f to 61ad14c Compare January 4, 2026 08:11
@yakovbeder
Copy link
Collaborator

I see. In that case, this PR is /lgtm

@yakovbeder
Copy link
Collaborator

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 4, 2026