Namespace NVIDIA telemetry metrics#976
Conversation
Refs AIP-905 Signed-off-by: Francesco Di Natale <[email protected]>
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@f97aa484c2757a51d67863e8bf3e5c42ad93ea68Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@f97aa484c2757a51d67863e8bf3e5c42ad93ea68Last updated for commit: |
WalkthroughThis PR standardizes NVIDIA GPU telemetry field naming to ChangesNVIDIA Telemetry Field Standardization (nvidia_ Prefix)*
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/unit/gpu_telemetry/test_metrics_config.py (1)
328-335: 💤 Low valueMinor: Unused unpacked variables in loop.
The loop unpacks
display_nameandunitbut only usesfield_name. Consider using_for clarity:- for metric in custom_metrics: - display_name, field_name, unit = metric + for _, field_name, _ in custom_metrics: assert field_name in [ "nvidia_sm_clock",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/gpu_telemetry/test_metrics_config.py` around lines 328 - 335, The loop that iterates over custom_metrics unpacks display_name, field_name, unit but only uses field_name; change the unpacking in that loop to use placeholders (e.g., _, field_name, _) instead of display_name and unit so unused variables are clearly ignored and lint warnings are removed; update the for-loop header where custom_metrics is iterated to only bind field_name and use the existing assertion on field_name.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/reference/json-export-schema.md`:
- Line 146: The docs state a telemetry metric-name breaking change in `1.4`
(renaming built-in NVIDIA metrics to `nvidia_*`, keeping AMD under `amd_*`, and
adding per-GPU `platform`) but the versioning policy above requires coordination
and a major bump for renames/removals; update docs to remove ambiguity by either
(A) adding an explicit exception note next to the `1.4` entry that explains why
this telemetry rename is allowed under the policy (e.g., consumer-safe,
transitional mapping provided, and date/compatibility guarantees), or (B) modify
the versioning policy text to state that telemetry metric-name normalizations
(specifically `telemetry_data` metric renames like `nvidia_*`/`amd_*` and
`platform`) are allowed within minor releases with required migration guidance;
reference the `1.4` version entry, the `telemetry_data` term, and the
`nvidia_*`/`amd_*` metric names so readers can find and reconcile the change.
In `@tests/unit/gpu_telemetry/test_pynvml_collector.py`:
- Line 559: The async test functions (e.g.,
test_nvidia_energy_consumption_collected and the other renamed tests referenced)
are missing explicit return type annotations; update each async test definition
to include an explicit return hint (use "-> None") to comply with the repo
standard for type hints on all functions, ensuring the annotation is added to
the async def lines for functions like test_nvidia_energy_consumption_collected
and the other tests noted.
---
Nitpick comments:
In `@tests/unit/gpu_telemetry/test_metrics_config.py`:
- Around line 328-335: The loop that iterates over custom_metrics unpacks
display_name, field_name, unit but only uses field_name; change the unpacking in
that loop to use placeholders (e.g., _, field_name, _) instead of display_name
and unit so unused variables are clearly ignored and lint warnings are removed;
update the for-loop header where custom_metrics is iterated to only bind
field_name and use the existing assertion on field_name.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4e3e0901-66d8-4cac-bdbe-da180293e84e
📒 Files selected for processing (46)
docs/reference/json-export-schema.mddocs/tutorials/gpu-telemetry.mdsrc/aiperf/common/models/export_models.pysrc/aiperf/common/models/telemetry_models.pysrc/aiperf/exporters/gpu_telemetry_console_exporter.pysrc/aiperf/exporters/metrics_csv_exporter.pysrc/aiperf/gpu_telemetry/accumulator.pysrc/aiperf/gpu_telemetry/amdsmi_collector.pysrc/aiperf/gpu_telemetry/constants.pysrc/aiperf/gpu_telemetry/dcgm_collector.pysrc/aiperf/gpu_telemetry/manager.pysrc/aiperf/gpu_telemetry/metrics_config.pysrc/aiperf/gpu_telemetry/pynvml_collector.pysrc/aiperf/plot/core/data_loader.pysrc/aiperf/plot/core/data_preparation.pysrc/aiperf/plot/default_plot_config.yamlsrc/aiperf/plot/handlers/single_run_handlers.pysrc/aiperf/plot/metric_names.pytests/component_integration/gpu_telemetry/test_gpu_telemetry.pytests/integration/test_custom_gpu_metrics.pytests/integration/test_dcgm_faker.pytests/integration/test_gpu_telemetry.pytests/unit/common/models/test_record_models.pytests/unit/exporters/conftest.pytests/unit/exporters/test_gpu_telemetry_console_exporter.pytests/unit/exporters/test_metrics_csv_exporter.pytests/unit/exporters/test_metrics_json_exporter.pytests/unit/gpu_telemetry/conftest.pytests/unit/gpu_telemetry/test_accumulator.pytests/unit/gpu_telemetry/test_amdsmi_collector.pytests/unit/gpu_telemetry/test_jsonl_writer.pytests/unit/gpu_telemetry/test_metrics_config.pytests/unit/gpu_telemetry/test_pynvml_collector.pytests/unit/gpu_telemetry/test_telemetry_data_collector.pytests/unit/gpu_telemetry/test_telemetry_integration.pytests/unit/gpu_telemetry/test_telemetry_manager.pytests/unit/gpu_telemetry/test_telemetry_models.pytests/unit/metrics/test_metrics_registry.pytests/unit/plot/test_data_loader.pytests/unit/plot/test_metric_names.pytests/unit/plot/test_png_exporter.pytests/unit/plot/test_single_run_handlers.pytests/unit/post_processors/conftest.pytests/unit/property/_numeric_bounds_baseline.txttests/unit/records/test_records_manager.pytests/unit/server/test_dcgm_faker.py
| | `1.1` | Added `count` and `sum` to per-metric stats blocks. Backward-compatible for readers that ignore unknown fields; the new fields are present only on record-type metrics, omitted on derived/aggregate. | | ||
| | `1.2` | Added top-level `run_info` block (`random_seed`, `trial`, `run_label`, `variation_label`, `variation_index`, `variation_values`). Backward-compatible: readers that don't need reproducibility can ignore the field. | | ||
| | `1.3` | Added `benchmark_id`, `sweep_id`, and `cli_command` to `run_info`. `benchmark_id` duplicates the top-level field so `run_info` is self-contained; `sweep_id` (UUID4 of the outer sweep) lets readers join all per-run exports from one plan without consulting the parent multi-run artifact directory; `cli_command` records the redacted command line when available. Backward-compatible: nullable fields default to `null` when unavailable. | | ||
| | `1.4` | Added per-GPU telemetry `platform` and renamed built-in NVIDIA GPU telemetry metrics to `nvidia_*`. AMD telemetry remains under `amd_*`. This is a telemetry metric-name breaking change for consumers of `telemetry_data`. | |
There was a problem hiding this comment.
Clarify versioning policy vs this breaking change.
Line 146 documents a telemetry metric-name breaking change in 1.4, but the versioning policy above says renames/removals should coordinate a major bump. Please explicitly clarify the exception (or update the policy text) to avoid downstream parser confusion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/reference/json-export-schema.md` at line 146, The docs state a telemetry
metric-name breaking change in `1.4` (renaming built-in NVIDIA metrics to
`nvidia_*`, keeping AMD under `amd_*`, and adding per-GPU `platform`) but the
versioning policy above requires coordination and a major bump for
renames/removals; update docs to remove ambiguity by either (A) adding an
explicit exception note next to the `1.4` entry that explains why this telemetry
rename is allowed under the policy (e.g., consumer-safe, transitional mapping
provided, and date/compatibility guarantees), or (B) modify the versioning
policy text to state that telemetry metric-name normalizations (specifically
`telemetry_data` metric renames like `nvidia_*`/`amd_*` and `platform`) are
allowed within minor releases with required migration guidance; reference the
`1.4` version entry, the `telemetry_data` term, and the `nvidia_*`/`amd_*`
metric names so readers can find and reconcile the change.
|
|
||
| @pytest.mark.asyncio | ||
| async def test_energy_consumption_collected(self, initialized_collector): | ||
| async def test_nvidia_energy_consumption_collected(self, initialized_collector): |
There was a problem hiding this comment.
Add return type hints to the renamed async test functions.
These changed test definitions should include explicit return annotations to match the repo standard.
Proposed fix
- async def test_nvidia_energy_consumption_collected(self, initialized_collector):
+ async def test_nvidia_energy_consumption_collected(
+ self, initialized_collector
+ ) -> None:
@@
- async def test_nvidia_sm_utilization_sums_multiple_processes(self, patch_pynvml):
+ async def test_nvidia_sm_utilization_sums_multiple_processes(
+ self, patch_pynvml
+ ) -> None:
@@
- async def test_empty_process_list_zero_nvidia_sm_utilization(self, patch_pynvml):
+ async def test_empty_process_list_zero_nvidia_sm_utilization(
+ self, patch_pynvml
+ ) -> None:
@@
- async def test_nvidia_sm_utilization_capped_at_100(self, patch_pynvml):
+ async def test_nvidia_sm_utilization_capped_at_100(
+ self, patch_pynvml
+ ) -> None:As per coding guidelines, "Type hints on ALL functions (params and return)".
Also applies to: 570-570, 593-593, 612-612
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/gpu_telemetry/test_pynvml_collector.py` at line 559, The async
test functions (e.g., test_nvidia_energy_consumption_collected and the other
renamed tests referenced) are missing explicit return type annotations; update
each async test definition to include an explicit return hint (use "-> None") to
comply with the repo standard for type hints on all functions, ensuring the
annotation is added to the async def lines for functions like
test_nvidia_energy_consumption_collected and the other tests noted.
After namespacing the telemetry rename, aggregate_gpu_telemetry always returned a column named nvidia_gpu_utilization. The dual-axis plot generator looks up df[y2_metric] using the literal spec name, so any PlotSpec still referencing gpu_utilization raised KeyError. Pass the requested name through aggregate_gpu_telemetry so both the new and the legacy spec names work, restoring the 5 failing TestDualAxisHandler tests. Refs AIP-905 Signed-off-by: Francesco Di Natale <[email protected]>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/aiperf/plot/core/data_preparation.py`:
- Around line 356-358: The current rename can create duplicate column labels
when gpu_index is missing and both nvidia_gpu_utilization and gpu_utilization
exist; before renaming in the block that uses gpu_df, check if output_col
already exists in gpu_df.columns: if output_col is absent, proceed with rename
as now; if output_col is present and different from utilization_col, copy
gpu_df[utilization_col] into gpu_df[output_col] (or skip copying) and then drop
the original utilization_col to avoid duplicate labels; reference gpu_df,
output_col, utilization_col, gpu_index, nvidia_gpu_utilization, gpu_utilization
and downstream df[y2_metric] lookups when making this change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 66483101-1242-4bfa-96cc-6353432055ee
📒 Files selected for processing (2)
src/aiperf/plot/core/data_preparation.pysrc/aiperf/plot/handlers/single_run_handlers.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/aiperf/plot/handlers/single_run_handlers.py
| if output_col != utilization_col: | ||
| gpu_df = gpu_df.rename(columns={utilization_col: output_col}) | ||
|
|
There was a problem hiding this comment.
Prevent duplicate telemetry column names during rename.
When gpu_index is absent and both nvidia_gpu_utilization and gpu_utilization already exist, this rename can create duplicate column labels. Downstream df[y2_metric] lookups may then return a DataFrame instead of a Series.
Proposed fix
- if output_col != utilization_col:
- gpu_df = gpu_df.rename(columns={utilization_col: output_col})
+ if output_col != utilization_col:
+ if output_col in gpu_df.columns:
+ gpu_df = gpu_df.drop(columns=[output_col])
+ gpu_df = gpu_df.rename(columns={utilization_col: output_col})🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/aiperf/plot/core/data_preparation.py` around lines 356 - 358, The current
rename can create duplicate column labels when gpu_index is missing and both
nvidia_gpu_utilization and gpu_utilization exist; before renaming in the block
that uses gpu_df, check if output_col already exists in gpu_df.columns: if
output_col is absent, proceed with rename as now; if output_col is present and
different from utilization_col, copy gpu_df[utilization_col] into
gpu_df[output_col] (or skip copying) and then drop the original utilization_col
to avoid duplicate labels; reference gpu_df, output_col, utilization_col,
gpu_index, nvidia_gpu_utilization, gpu_utilization and downstream df[y2_metric]
lookups when making this change.
Refs AIP-905
Summary
Validation
Summary by CodeRabbit
New Features
Breaking Changes
Documentation