
Namespace NVIDIA telemetry metrics#976

Draft
FrankD412 wants to merge 2 commits into main from fdinatale/aip-905-namespace-nvidia-specific-telemetry-metrics

Conversation

Contributor

@FrankD412 FrankD412 commented May 21, 2026

Refs AIP-905

Summary

  • Namespace NVIDIA GPU telemetry metrics under nvidia_*.
  • Keep legacy NVIDIA field aliases for ingest/plot compatibility.
  • Update exports, plots, docs, and tests for vendor-specific telemetry names.

Validation

  • pre-commit run

Summary by CodeRabbit

  • New Features

    • Added per-GPU platform identification (nvidia/amd/unknown) and a Platform column in CSV exports.
    • Console telemetry now shows a platform-specific disclaimer and updated summary header wording.
  • Breaking Changes

    • JSON export schema bumped to 1.4.
    • NVIDIA telemetry metric names are renamed to nvidia_* (AMD remains amd_*); CSV/JSON/plots reflect namespaced metric keys.
  • Documentation

    • Tutorials and reference docs updated with platform-specific naming and examples.


Refs AIP-905

Signed-off-by: Francesco Di Natale <[email protected]>

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


github-actions Bot commented May 21, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@f97aa484c2757a51d67863e8bf3e5c42ad93ea68

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@f97aa484c2757a51d67863e8bf3e5c42ad93ea68

Last updated for commit: f97aa48


coderabbitai Bot commented May 21, 2026

Walkthrough

This PR standardizes NVIDIA GPU telemetry field naming to nvidia_*, adds a per-GPU platform field propagated through collectors/accumulator/exports, updates plotting/data-loading to prefer NVIDIA fields, bumps JSON export schema to 1.4, and updates docs and tests accordingly.
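
As a rough illustration of the legacy-name remapping mentioned in the walkthrough, the sketch below renames old NVIDIA keys to their nvidia_* form while loading one JSONL record. The alias table and the names remap_legacy_fields and load_jsonl_line are hypothetical, and the record is assumed to be a flat mapping of metric name to value; the PR's actual mapping lives in src/aiperf/gpu_telemetry/constants.py and may differ.

```python
import json

# Illustrative alias table (assumption): only a few legacy names are shown.
LEGACY_TO_NVIDIA = {
    "gpu_utilization": "nvidia_gpu_utilization",
    "energy_consumption": "nvidia_energy_consumption",
    "sm_utilization": "nvidia_sm_utilization",
}


def remap_legacy_fields(record: dict) -> dict:
    """Return a copy of record with legacy NVIDIA keys renamed to nvidia_*."""
    remapped = {}
    for key, value in record.items():
        new_key = LEGACY_TO_NVIDIA.get(key, key)
        # If both spellings are present, the namespaced value wins.
        if key != new_key and new_key in record:
            continue
        remapped[new_key] = value
    return remapped


def load_jsonl_line(line: str) -> dict:
    """Parse one JSONL line and normalize its metric names."""
    return remap_legacy_fields(json.loads(line))
```

Keys that are not in the alias table, such as amd_* metrics, pass through unchanged, which matches the stated intent that AMD names are not touched.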

Changes

NVIDIA Telemetry Field Standardization (nvidia_* Prefix)

| Layer | File(s) | Summary |
|-------|---------|---------|
| Data Models, Schema Version, and Platform Constants | src/aiperf/common/models/telemetry_models.py, src/aiperf/common/models/export_models.py, src/aiperf/gpu_telemetry/constants.py | TelemetryMetrics replaced/expanded with nvidia_* fields (AliasChoices aliases); GpuMetadata and GpuSummary include platform; JsonExportData.SCHEMA_VERSION → "1.4"; platform constants and NVIDIA alias mapping added. |
| GPU Telemetry Collectors: DCGM, PyNVML, AMDSMI | src/aiperf/gpu_telemetry/dcgm_collector.py, src/aiperf/gpu_telemetry/pynvml_collector.py, src/aiperf/gpu_telemetry/amdsmi_collector.py, src/aiperf/gpu_telemetry/constants.py | Collectors set platform on GpuMetadata (nvidia for DCGM/PyNVML, amd for AMDSMI); scaling-factor identifiers and internal field mappings switched to nvidia_* keys where applicable. |
| Telemetry Accumulator and Export Integration | src/aiperf/gpu_telemetry/accumulator.py, src/aiperf/exporters/metrics_csv_exporter.py, src/aiperf/exporters/gpu_telemetry_console_exporter.py, src/aiperf/gpu_telemetry/manager.py | Accumulator populates GpuSummary.platform; CSV exporter adds Platform column and writes per-GPU platform; console exporter prepends platform-disclaimer text, adjusts header text and reachability wording to “telemetry sources reachable”; manager status strings updated. |
| Data Loading, Aggregation, Plotting, and Metric Utilities | src/aiperf/plot/core/data_loader.py, src/aiperf/plot/core/data_preparation.py, src/aiperf/plot/metric_names.py, src/aiperf/plot/default_plot_config.yaml, src/aiperf/plot/handlers/single_run_handlers.py | DataLoader remaps legacy NVIDIA field names via aliases when loading JSONL; aggregate_gpu_telemetry prefers nvidia_gpu_utilization with gpu_utilization fallback and supports output_col; metric_names integrates NVIDIA aliases; plot config and handlers accept nvidia_gpu_utilization. |
| Metrics Config / Custom CSV Handling | src/aiperf/gpu_telemetry/metrics_config.py | Unit inference accepts additional size aliases (gib, mib, kib); CSV custom metric name derivation strips DCGM_FI_PROF_ as well and generates nvidia_<suffix> internal names. |
| Documentation: JSON Schema and GPU Telemetry Tutorial | docs/reference/json-export-schema.md, docs/tutorials/gpu-telemetry.md | Schema docs updated to 1.4 showing telemetry_data examples with per-GPU platform and nvidia_* keys; tutorial updated with platform naming guidance and example console/CSV/JSON outputs using nvidia_* names. |
| Test Fixtures and Exporter Tests | tests/unit/exporters/conftest.py, tests/unit/exporters/test_gpu_telemetry_console_exporter.py, tests/unit/exporters/test_metrics_csv_exporter.py, tests/unit/exporters/test_metrics_json_exporter.py | Exporter fixtures now populate TelemetryMetrics with nvidia_* keys and platform="nvidia"; exporter tests updated to expect schema_version 1.4, Platform CSV column, nvidia_* JSON keys, and updated console wording. |
| GPU Telemetry Unit Tests: Core Models and Collectors | tests/unit/gpu_telemetry/* and related test modules | Extensive test updates across fixtures and assertions to use nvidia_* metric names, validate AliasChoices behavior, check platform propagation, and update counter-metric allowlists to nvidia_* names. |
| Integration Tests and Plotting Tests | tests/integration/*, tests/unit/plot/*, tests/component_integration/* | Integration and plotting tests updated to expect nvidia_* telemetry keys, adjusted filtering/aggregation logic for nvidia_gpu_utilization, and verified plotting prep registry includes nvidia_gpu_utilization. |
| Remaining Test Updates and Baselines | multiple tests/* files | All remaining tests, fixtures, and numeric-baseline files updated to the nvidia_* naming and platform parameter where applicable. |
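
The custom-metric naming rule summarized for metrics_config.py can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name derive_internal_name is hypothetical, and DCGM_FI_DEV_ is assumed to be the prefix that was already stripped before DCGM_FI_PROF_ was added (the summary only says "as well").

```python
# Prefixes assumed to be stripped from DCGM field names (DCGM_FI_DEV_ is an assumption).
DCGM_PREFIXES = ("DCGM_FI_PROF_", "DCGM_FI_DEV_")


def derive_internal_name(dcgm_field: str) -> str:
    """Map a DCGM field name to a namespaced nvidia_<suffix> internal name."""
    suffix = dcgm_field
    for prefix in DCGM_PREFIXES:
        if dcgm_field.startswith(prefix):
            suffix = dcgm_field[len(prefix):]
            break
    # Internal names are lowercase and carry the vendor namespace.
    return f"nvidia_{suffix.lower()}"
```

With this rule, a field such as DCGM_FI_DEV_SM_CLOCK would yield nvidia_sm_clock, the name the updated test in test_metrics_config.py asserts on.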

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through code with twitching nose,

Replaced old tags where the telemetry flows.
nvidia_* now sings the GPU tune,
A schema bumped beneath the moon.
Platform fields hop in, tidy and spry.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title "Namespace NVIDIA telemetry metrics" clearly and specifically summarizes the primary change: renaming/prefixing NVIDIA GPU telemetry metrics with the nvidia_* namespace. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 94.24% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/unit/gpu_telemetry/test_metrics_config.py (1)

328-335: 💤 Low value

Minor: Unused unpacked variables in loop.

The loop unpacks display_name and unit but only uses field_name. Consider using _ for clarity:

-            for metric in custom_metrics:
-                display_name, field_name, unit = metric
+            for _, field_name, _ in custom_metrics:
                 assert field_name in [
                     "nvidia_sm_clock",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/gpu_telemetry/test_metrics_config.py` around lines 328 - 335, The
loop that iterates over custom_metrics unpacks display_name, field_name, unit
but only uses field_name; change the unpacking in that loop to use placeholders
(e.g., _, field_name, _) instead of display_name and unit so unused variables
are clearly ignored and lint warnings are removed; update the for-loop header
where custom_metrics is iterated to only bind field_name and use the existing
assertion on field_name.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/reference/json-export-schema.md`:
- Line 146: The docs state a telemetry metric-name breaking change in `1.4`
(renaming built-in NVIDIA metrics to `nvidia_*`, keeping AMD under `amd_*`, and
adding per-GPU `platform`) but the versioning policy above requires coordination
and a major bump for renames/removals; update docs to remove ambiguity by either
(A) adding an explicit exception note next to the `1.4` entry that explains why
this telemetry rename is allowed under the policy (e.g., consumer-safe,
transitional mapping provided, and date/compatibility guarantees), or (B) modify
the versioning policy text to state that telemetry metric-name normalizations
(specifically `telemetry_data` metric renames like `nvidia_*`/`amd_*` and
`platform`) are allowed within minor releases with required migration guidance;
reference the `1.4` version entry, the `telemetry_data` term, and the
`nvidia_*`/`amd_*` metric names so readers can find and reconcile the change.

In `@tests/unit/gpu_telemetry/test_pynvml_collector.py`:
- Line 559: The async test functions (e.g.,
test_nvidia_energy_consumption_collected and the other renamed tests referenced)
are missing explicit return type annotations; update each async test definition
to include an explicit return hint (use "-> None") to comply with the repo
standard for type hints on all functions, ensuring the annotation is added to
the async def lines for functions like test_nvidia_energy_consumption_collected
and the other tests noted.

---

Nitpick comments:
In `@tests/unit/gpu_telemetry/test_metrics_config.py`:
- Around line 328-335: The loop that iterates over custom_metrics unpacks
display_name, field_name, unit but only uses field_name; change the unpacking in
that loop to use placeholders (e.g., _, field_name, _) instead of display_name
and unit so unused variables are clearly ignored and lint warnings are removed;
update the for-loop header where custom_metrics is iterated to only bind
field_name and use the existing assertion on field_name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4e3e0901-66d8-4cac-bdbe-da180293e84e

📥 Commits

Reviewing files that changed from the base of the PR and between ea1ce37 and 1727c80.

📒 Files selected for processing (46)
  • docs/reference/json-export-schema.md
  • docs/tutorials/gpu-telemetry.md
  • src/aiperf/common/models/export_models.py
  • src/aiperf/common/models/telemetry_models.py
  • src/aiperf/exporters/gpu_telemetry_console_exporter.py
  • src/aiperf/exporters/metrics_csv_exporter.py
  • src/aiperf/gpu_telemetry/accumulator.py
  • src/aiperf/gpu_telemetry/amdsmi_collector.py
  • src/aiperf/gpu_telemetry/constants.py
  • src/aiperf/gpu_telemetry/dcgm_collector.py
  • src/aiperf/gpu_telemetry/manager.py
  • src/aiperf/gpu_telemetry/metrics_config.py
  • src/aiperf/gpu_telemetry/pynvml_collector.py
  • src/aiperf/plot/core/data_loader.py
  • src/aiperf/plot/core/data_preparation.py
  • src/aiperf/plot/default_plot_config.yaml
  • src/aiperf/plot/handlers/single_run_handlers.py
  • src/aiperf/plot/metric_names.py
  • tests/component_integration/gpu_telemetry/test_gpu_telemetry.py
  • tests/integration/test_custom_gpu_metrics.py
  • tests/integration/test_dcgm_faker.py
  • tests/integration/test_gpu_telemetry.py
  • tests/unit/common/models/test_record_models.py
  • tests/unit/exporters/conftest.py
  • tests/unit/exporters/test_gpu_telemetry_console_exporter.py
  • tests/unit/exporters/test_metrics_csv_exporter.py
  • tests/unit/exporters/test_metrics_json_exporter.py
  • tests/unit/gpu_telemetry/conftest.py
  • tests/unit/gpu_telemetry/test_accumulator.py
  • tests/unit/gpu_telemetry/test_amdsmi_collector.py
  • tests/unit/gpu_telemetry/test_jsonl_writer.py
  • tests/unit/gpu_telemetry/test_metrics_config.py
  • tests/unit/gpu_telemetry/test_pynvml_collector.py
  • tests/unit/gpu_telemetry/test_telemetry_data_collector.py
  • tests/unit/gpu_telemetry/test_telemetry_integration.py
  • tests/unit/gpu_telemetry/test_telemetry_manager.py
  • tests/unit/gpu_telemetry/test_telemetry_models.py
  • tests/unit/metrics/test_metrics_registry.py
  • tests/unit/plot/test_data_loader.py
  • tests/unit/plot/test_metric_names.py
  • tests/unit/plot/test_png_exporter.py
  • tests/unit/plot/test_single_run_handlers.py
  • tests/unit/post_processors/conftest.py
  • tests/unit/property/_numeric_bounds_baseline.txt
  • tests/unit/records/test_records_manager.py
  • tests/unit/server/test_dcgm_faker.py

| `1.1` | Added `count` and `sum` to per-metric stats blocks. Backward-compatible for readers that ignore unknown fields; the new fields are present only on record-type metrics, omitted on derived/aggregate. |
| `1.2` | Added top-level `run_info` block (`random_seed`, `trial`, `run_label`, `variation_label`, `variation_index`, `variation_values`). Backward-compatible: readers that don't need reproducibility can ignore the field. |
| `1.3` | Added `benchmark_id`, `sweep_id`, and `cli_command` to `run_info`. `benchmark_id` duplicates the top-level field so `run_info` is self-contained; `sweep_id` (UUID4 of the outer sweep) lets readers join all per-run exports from one plan without consulting the parent multi-run artifact directory; `cli_command` records the redacted command line when available. Backward-compatible: nullable fields default to `null` when unavailable. |
| `1.4` | Added per-GPU telemetry `platform` and renamed built-in NVIDIA GPU telemetry metrics to `nvidia_*`. AMD telemetry remains under `amd_*`. This is a telemetry metric-name breaking change for consumers of `telemetry_data`. |

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify versioning policy vs this breaking change.

Line 146 documents a telemetry metric-name breaking change in 1.4, but the versioning policy above says renames/removals should coordinate a major bump. Please explicitly clarify the exception (or update the policy text) to avoid downstream parser confusion.
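
To make the downstream-parser concern concrete, here is a hedged sketch of how a consumer of telemetry_data might pick the NVIDIA utilization key from schema_version. The helper names are hypothetical, and the sketch assumes versions are always of the form "major.minor" and covers NVIDIA GPUs only (AMD metrics stay under amd_*).

```python
def parse_schema_version(version: str) -> tuple[int, int]:
    """Split a 'major.minor' schema version into integers for safe comparison."""
    major, minor = version.split(".")
    return int(major), int(minor)


def nvidia_utilization_key(schema_version: str) -> str:
    """Return the GPU utilization key a reader should expect for NVIDIA GPUs."""
    # Compare as integer tuples: a plain string comparison would sort "1.10" before "1.4".
    if parse_schema_version(schema_version) >= (1, 4):
        return "nvidia_gpu_utilization"
    return "gpu_utilization"
```

A reader built this way keeps working across the 1.3 to 1.4 boundary, which is the kind of migration guidance the comment asks the docs to spell out.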

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/json-export-schema.md` at line 146, The docs state a telemetry
metric-name breaking change in `1.4` (renaming built-in NVIDIA metrics to
`nvidia_*`, keeping AMD under `amd_*`, and adding per-GPU `platform`) but the
versioning policy above requires coordination and a major bump for
renames/removals; update docs to remove ambiguity by either (A) adding an
explicit exception note next to the `1.4` entry that explains why this telemetry
rename is allowed under the policy (e.g., consumer-safe, transitional mapping
provided, and date/compatibility guarantees), or (B) modify the versioning
policy text to state that telemetry metric-name normalizations (specifically
`telemetry_data` metric renames like `nvidia_*`/`amd_*` and `platform`) are
allowed within minor releases with required migration guidance; reference the
`1.4` version entry, the `telemetry_data` term, and the `nvidia_*`/`amd_*`
metric names so readers can find and reconcile the change.


     @pytest.mark.asyncio
-    async def test_energy_consumption_collected(self, initialized_collector):
+    async def test_nvidia_energy_consumption_collected(self, initialized_collector):

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add return type hints to the renamed async test functions.

These changed test definitions should include explicit return annotations to match the repo standard.

Proposed fix
-    async def test_nvidia_energy_consumption_collected(self, initialized_collector):
+    async def test_nvidia_energy_consumption_collected(
+        self, initialized_collector
+    ) -> None:
@@
-    async def test_nvidia_sm_utilization_sums_multiple_processes(self, patch_pynvml):
+    async def test_nvidia_sm_utilization_sums_multiple_processes(
+        self, patch_pynvml
+    ) -> None:
@@
-    async def test_empty_process_list_zero_nvidia_sm_utilization(self, patch_pynvml):
+    async def test_empty_process_list_zero_nvidia_sm_utilization(
+        self, patch_pynvml
+    ) -> None:
@@
-    async def test_nvidia_sm_utilization_capped_at_100(self, patch_pynvml):
+    async def test_nvidia_sm_utilization_capped_at_100(
+        self, patch_pynvml
+    ) -> None:

As per coding guidelines, "Type hints on ALL functions (params and return)".

Also applies to: 570-570, 593-593, 612-612

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/gpu_telemetry/test_pynvml_collector.py` at line 559, The async
test functions (e.g., test_nvidia_energy_consumption_collected and the other
renamed tests referenced) are missing explicit return type annotations; update
each async test definition to include an explicit return hint (use "-> None") to
comply with the repo standard for type hints on all functions, ensuring the
annotation is added to the async def lines for functions like
test_nvidia_energy_consumption_collected and the other tests noted.

After the telemetry metrics were namespaced, aggregate_gpu_telemetry always
returned a column named nvidia_gpu_utilization. The dual-axis plot
generator looks up df[y2_metric] using the literal spec name, so any
PlotSpec still referencing gpu_utilization raised KeyError. Pass the
requested name through aggregate_gpu_telemetry so both the new and the
legacy spec names work, restoring the 5 failing TestDualAxisHandler
tests.

Refs AIP-905

Signed-off-by: Francesco Di Natale <[email protected]>
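
The behavior this commit describes can be sketched with a simplified stand-in for aggregate_gpu_telemetry. Everything here is an assumption made for illustration: the function name aggregate_utilization, the timestamp_s column, and averaging across GPUs per timestamp are not taken from the PR; only the output_col pass-through and the nvidia_gpu_utilization/gpu_utilization fallback come from the discussion.

```python
import pandas as pd


def aggregate_utilization(
    gpu_df: pd.DataFrame, output_col: str = "nvidia_gpu_utilization"
) -> pd.DataFrame:
    """Average GPU utilization per timestamp and name the result output_col."""
    # Prefer the namespaced column; fall back to the legacy one.
    if "nvidia_gpu_utilization" in gpu_df.columns:
        utilization_col = "nvidia_gpu_utilization"
    else:
        utilization_col = "gpu_utilization"

    averaged = gpu_df.groupby("timestamp_s", as_index=False)[utilization_col].mean()

    # Hand back the column under the name the plot spec asked for.
    if output_col != utilization_col:
        averaged = averaged.rename(columns={utilization_col: output_col})
    return averaged
```

A PlotSpec that still uses gpu_utilization as y2_metric would call this with output_col="gpu_utilization" and then find df["gpu_utilization"] as expected.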

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/plot/core/data_preparation.py`:
- Around line 356-358: The current rename can create duplicate column labels
when gpu_index is missing and both nvidia_gpu_utilization and gpu_utilization
exist; before renaming in the block that uses gpu_df, check if output_col
already exists in gpu_df.columns: if output_col is absent, proceed with rename
as now; if output_col is present and different from utilization_col, copy
gpu_df[utilization_col] into gpu_df[output_col] (or skip copying) and then drop
the original utilization_col to avoid duplicate labels; reference gpu_df,
output_col, utilization_col, gpu_index, nvidia_gpu_utilization, gpu_utilization
and downstream df[y2_metric] lookups when making this change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 66483101-1242-4bfa-96cc-6353432055ee

📥 Commits

Reviewing files that changed from the base of the PR and between 1727c80 and f97aa48.

📒 Files selected for processing (2)
  • src/aiperf/plot/core/data_preparation.py
  • src/aiperf/plot/handlers/single_run_handlers.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/aiperf/plot/handlers/single_run_handlers.py

Comment on lines +356 to +358
    if output_col != utilization_col:
        gpu_df = gpu_df.rename(columns={utilization_col: output_col})


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent duplicate telemetry column names during rename.

When gpu_index is absent and both nvidia_gpu_utilization and gpu_utilization already exist, this rename can create duplicate column labels. Downstream df[y2_metric] lookups may then return a DataFrame instead of a Series.

Proposed fix
-    if output_col != utilization_col:
-        gpu_df = gpu_df.rename(columns={utilization_col: output_col})
+    if output_col != utilization_col:
+        if output_col in gpu_df.columns:
+            gpu_df = gpu_df.drop(columns=[output_col])
+        gpu_df = gpu_df.rename(columns={utilization_col: output_col})
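
The failure mode is easy to reproduce in isolation. The sketch below uses a made-up two-column frame and a hypothetical helper, rename_without_duplicates, that mirrors the proposed fix; it is not the code in data_preparation.py.

```python
import pandas as pd


def rename_without_duplicates(
    gpu_df: pd.DataFrame, utilization_col: str, output_col: str
) -> pd.DataFrame:
    """Rename utilization_col to output_col, dropping any stale output_col first."""
    if output_col != utilization_col:
        if output_col in gpu_df.columns:
            gpu_df = gpu_df.drop(columns=[output_col])
        gpu_df = gpu_df.rename(columns={utilization_col: output_col})
    return gpu_df


# A frame that carries both the namespaced and the legacy column.
frame = pd.DataFrame(
    {"nvidia_gpu_utilization": [10.0, 20.0], "gpu_utilization": [1.0, 2.0]}
)

# Naive rename: two columns now share the label "gpu_utilization",
# so selecting that label returns a DataFrame rather than a Series.
naive = frame.rename(columns={"nvidia_gpu_utilization": "gpu_utilization"})

# Guarded rename: the stale legacy column is dropped first, leaving one Series.
fixed = rename_without_duplicates(frame, "nvidia_gpu_utilization", "gpu_utilization")
```

Note that the guarded version keeps the namespaced values and discards the legacy column, which is the choice the proposed diff makes.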
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/plot/core/data_preparation.py` around lines 356 - 358, The current
rename can create duplicate column labels when gpu_index is missing and both
nvidia_gpu_utilization and gpu_utilization exist; before renaming in the block
that uses gpu_df, check if output_col already exists in gpu_df.columns: if
output_col is absent, proceed with rename as now; if output_col is present and
different from utilization_col, copy gpu_df[utilization_col] into
gpu_df[output_col] (or skip copying) and then drop the original utilization_col
to avoid duplicate labels; reference gpu_df, output_col, utilization_col,
gpu_index, nvidia_gpu_utilization, gpu_utilization and downstream df[y2_metric]
lookups when making this change.
