A way to install metrics exporter based monitoring for a group of nodes or cluster#208

Merged
venksrin09 merged 1 commit into
mainfrom
metrics_exp
Jun 3, 2026

@venksrin09

Collaborator

Initial commit for setting up time-series monitoring with the AMD metrics exporter, Grafana, Loki, and Prometheus for any node group or cluster. The current cluster-mon in CVS is based on SSH and gathers information from amd-smi commands, whereas this tool uses AMD's metrics exporter and provides time-series monitoring data. A fleet metrics monitoring web application simplifies the whole installation. It also creates all the standard Grafana dashboards needed for monitoring.
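
For illustration only (this is not code from the PR): pointing Prometheus at an exporter on each node of a group can be sketched with Prometheus file-based service discovery. The helper name `build_file_sd`, the `node_group` label, the host names, and the port value are all assumptions made for this example; use whatever port your exporter deployment actually listens on.

```python
import json

# Assumed exporter port for this sketch; adjust to your deployment.
EXPORTER_PORT = 5000


def build_file_sd(node_groups, port=EXPORTER_PORT):
    """Return Prometheus file_sd entries, one per node group.

    node_groups maps a group name to a list of host names.
    """
    entries = []
    for group, hosts in sorted(node_groups.items()):
        entries.append({
            "targets": [f"{host}:{port}" for host in hosts],
            # Hypothetical label used to filter dashboards by group.
            "labels": {"node_group": group},
        })
    return entries


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    groups = {"rack-a": ["gpu-node-01", "gpu-node-02"]}
    print(json.dumps(build_file_sd(groups), indent=2))
```

The printed JSON can be saved to a file referenced from a `file_sd_configs` entry in the Prometheus scrape configuration, so adding a node to a group only requires regenerating that file.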

@venksrin09

Collaborator Author

Ignatious, please review.

@@ -0,0 +1,65 @@
2026-05-28T19:12:57.884Z [ERROR] Failed to save config with lock: Error: ENOENT: no such file or directory, lstat '.claude/.claude.json'

Collaborator

@venksrin09 I see the .claude directory was also added to the commit.

Collaborator Author

Taken care of this. Thanks for catching this.

install all required packages on the GPU nodes and setup Grafana
dashboards

Signed-off-by: venksrin09 <venksrin@amd.com>
@venksrin09 venksrin09 merged commit f4cf92a into main Jun 3, 2026
2 checks passed