Draft
32 changes: 31 additions & 1 deletion docs/reference/json-export-schema.md
@@ -31,7 +31,7 @@ A run with 20 requests against a streaming chat endpoint produces entries shaped

```json
{
- "schema_version": "1.3",
+ "schema_version": "1.4",
"request_latency": {
"unit": "ms",
"avg": 2620.71,
@@ -72,6 +72,35 @@ In addition to the per-metric stats blocks, `profile_export_aiperf.json` include
| `telemetry_data` | object | GPU telemetry summaries when telemetry collection was active. |
| `error_summary` | array | Per-error counts collected during the run. |

### `telemetry_data`

Schema 1.4 adds a `platform` field to each GPU summary and vendor-scopes GPU
telemetry metric names. NVIDIA metrics collected through DCGM or pynvml use
`nvidia_*` names, and AMD metrics collected through amdsmi use `amd_*` names.
Metric semantics are platform-specific; cross-platform comparisons require
validation of the workload, collector behavior, and metric definitions.

```json
"telemetry_data": {
"endpoints": {
"localhost:9400": {
"gpus": {
"gpu_0": {
"gpu_index": 0,
"gpu_name": "NVIDIA H100",
"gpu_uuid": "GPU-...",
"platform": "nvidia",
"metrics": {
"nvidia_power_usage": {"unit": "W", "avg": 310.0},
"nvidia_gpu_utilization": {"unit": "%", "avg": 85.0}
}
}
}
}
}
}
```
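
As a rough illustration of how a consumer might read this block, the sketch below walks `telemetry_data` and collects metric averages per platform. It relies only on the fields shown in the example above. The helper name `metric_averages_by_platform` and the fallback to `"unknown"` when a GPU summary has no `platform` key are assumptions made for illustration; they are not part of AIPerf's API.

```python
import json
from collections import defaultdict

# Sample shaped like the documented example; the values are illustrative only.
SAMPLE = json.loads("""
{
  "schema_version": "1.4",
  "telemetry_data": {
    "endpoints": {
      "localhost:9400": {
        "gpus": {
          "gpu_0": {
            "gpu_index": 0,
            "gpu_name": "NVIDIA H100",
            "gpu_uuid": "GPU-...",
            "platform": "nvidia",
            "metrics": {
              "nvidia_power_usage": {"unit": "W", "avg": 310.0},
              "nvidia_gpu_utilization": {"unit": "%", "avg": 85.0}
            }
          }
        }
      }
    }
  }
}
""")


def metric_averages_by_platform(export: dict) -> dict:
    """Hypothetical helper: return {platform: {metric_name: [avg, ...]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    telemetry = export.get("telemetry_data") or {}
    for endpoint in telemetry.get("endpoints", {}).values():
        for gpu in endpoint.get("gpus", {}).values():
            # Assumption: summaries without a platform field are treated as "unknown".
            platform = gpu.get("platform", "unknown")
            for name, stats in gpu.get("metrics", {}).items():
                if "avg" in stats:
                    grouped[platform][name].append(stats["avg"])
    # Convert the nested defaultdicts to plain dicts before returning.
    return {platform: dict(metrics) for platform, metrics in grouped.items()}


result = metric_averages_by_platform(SAMPLE)
```

Keeping averages in separate per-platform buckets follows the note above that metric semantics are platform-specific: nothing is merged across vendors unless the caller does so deliberately.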

### `run_info`

Schema 1.2 introduced `run_info` to surface the seed and sweep coordinates needed to reproduce a run from the JSON file alone, without consulting the internal `run_config.json` handoff file. Schema 1.3 extends it with identifiers and the redacted CLI command.
@@ -114,6 +143,7 @@ The current schema version is exported as the top-level `schema_version` field o
| `1.1` | Added `count` and `sum` to per-metric stats blocks. Backward-compatible for readers that ignore unknown fields; the new fields are present only on record-type metrics, omitted on derived/aggregate. |
| `1.2` | Added top-level `run_info` block (`random_seed`, `trial`, `run_label`, `variation_label`, `variation_index`, `variation_values`). Backward-compatible: readers that don't need reproducibility can ignore the field. |
| `1.3` | Added `benchmark_id`, `sweep_id`, and `cli_command` to `run_info`. `benchmark_id` duplicates the top-level field so `run_info` is self-contained; `sweep_id` (UUID4 of the outer sweep) lets readers join all per-run exports from one plan without consulting the parent multi-run artifact directory; `cli_command` records the redacted command line when available. Backward-compatible: nullable fields default to `null` when unavailable. |
| `1.4` | Added per-GPU telemetry `platform` and renamed built-in NVIDIA GPU telemetry metrics to `nvidia_*`. AMD telemetry remains under `amd_*`. This is a telemetry metric-name breaking change for consumers of `telemetry_data`. |

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify versioning policy vs this breaking change.

Line 146 documents a telemetry metric-name breaking change in 1.4, but the versioning policy above says renames/removals should coordinate a major bump. Please explicitly clarify the exception (or update the policy text) to avoid downstream parser confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/json-export-schema.md` at line 146, The docs state a telemetry
metric-name breaking change in `1.4` (renaming built-in NVIDIA metrics to
`nvidia_*`, keeping AMD under `amd_*`, and adding per-GPU `platform`) but the
versioning policy above requires coordination and a major bump for
renames/removals; update docs to remove ambiguity by either (A) adding an
explicit exception note next to the `1.4` entry that explains why this telemetry
rename is allowed under the policy (e.g., consumer-safe, transitional mapping
provided, and date/compatibility guarantees), or (B) modify the versioning
policy text to state that telemetry metric-name normalizations (specifically
`telemetry_data` metric renames like `nvidia_*`/`amd_*` and `platform`) are
allowed within minor releases with required migration guidance; reference the
`1.4` version entry, the `telemetry_data` term, and the `nvidia_*`/`amd_*`
metric names so readers can find and reconcile the change.
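
One way a downstream reader could absorb the rename while the policy question is settled is a small normalization shim. The sketch below is an assumption about consumer-side code, not something this PR ships. The `LEGACY_TO_NVIDIA` mapping is copied from the `AliasChoices` pairs in `telemetry_models.py`; the names `parse_version` and `normalize_metric_names`, the treatment of a missing version as `(0, 0)`, and the assumption that pre-1.4 exports used the legacy field names as metric keys are all hypothetical.

```python
from typing import Optional

# Legacy name -> schema 1.4 name, taken from the AliasChoices pairs in this PR.
LEGACY_TO_NVIDIA = {
    "gpu_power_usage": "nvidia_power_usage",
    "energy_consumption": "nvidia_energy_consumption",
    "gpu_utilization": "nvidia_gpu_utilization",
    "mem_utilization": "nvidia_memory_utilization",
    "gpu_memory_used": "nvidia_memory_used",
    "gpu_temperature": "nvidia_temperature",
    "sm_utilization": "nvidia_sm_utilization",
    "decoder_utilization": "nvidia_decoder_utilization",
    "encoder_utilization": "nvidia_encoder_utilization",
    "jpg_utilization": "nvidia_jpg_utilization",
    "xid_errors": "nvidia_xid_errors",
    "power_violation": "nvidia_power_violation",
}


def parse_version(version: Optional[str]) -> tuple:
    """Parse 'major.minor' into an int tuple; a missing version becomes (0, 0)."""
    if not version:
        return (0, 0)
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def normalize_metric_names(schema_version: Optional[str], metrics: dict) -> dict:
    """Rename legacy metric keys to nvidia_* for exports older than 1.4."""
    if parse_version(schema_version) >= (1, 4):
        return dict(metrics)
    # Unmapped keys (for example amd_* metrics) pass through unchanged.
    return {LEGACY_TO_NVIDIA.get(name, name): stats for name, stats in metrics.items()}
```

Comparing integer tuples rather than raw strings keeps a later `1.10` ordered after `1.4`, which a plain string comparison would get wrong.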


### Other JSON exports use independent schema versions

155 changes: 83 additions & 72 deletions docs/tutorials/gpu-telemetry.md

Large diffs are not rendered by default.

6 changes: 5 additions & 1 deletion src/aiperf/common/models/export_models.py
@@ -89,6 +89,10 @@ class GpuSummary(AIPerfBaseModel):
gpu_index: int
gpu_name: str
gpu_uuid: str
platform: str = Field(
default="unknown",
description="GPU telemetry platform namespace, e.g. 'nvidia', 'amd', or 'unknown'",
)
hostname: str | None
namespace: str | None = None
pod_name: str | None = None
@@ -238,7 +242,7 @@ class JsonExportData(AIPerfBaseModel):
model_config = ConfigDict(extra="allow")

# Increment on breaking changes to the export structure
- SCHEMA_VERSION: ClassVar[str] = "1.3"
+ SCHEMA_VERSION: ClassVar[str] = "1.4"

schema_version: str | None = Field(
default=None,
84 changes: 58 additions & 26 deletions src/aiperf/common/models/telemetry_models.py
@@ -3,13 +3,14 @@

import numpy as np
from numpy.typing import NDArray
- from pydantic import ConfigDict, Field
+ from pydantic import AliasChoices, ConfigDict, Field

from aiperf.common.exceptions import NoMetricValue
from aiperf.common.models.base_models import AIPerfBaseModel
from aiperf.common.models.export_models import TelemetryExportData
from aiperf.common.models.record_models import MetricResult
from aiperf.common.models.server_metrics_models import TimeRangeFilter
from aiperf.gpu_telemetry.constants import UNKNOWN_GPU_TELEMETRY_PLATFORM


class TelemetryMetrics(AIPerfBaseModel):
@@ -23,47 +24,73 @@ class TelemetryMetrics(AIPerfBaseModel):

model_config = ConfigDict(extra="allow")

- gpu_power_usage: float | None = Field(
- default=None, description="Current GPU power usage in W"
+ nvidia_power_usage: float | None = Field(
+ default=None,
+ validation_alias=AliasChoices("nvidia_power_usage", "gpu_power_usage"),
+ description="Current NVIDIA GPU power usage in W",
)
- energy_consumption: float | None = Field(
- default=None, description="Cumulative energy consumption in MJ"
+ nvidia_energy_consumption: float | None = Field(
+ default=None,
+ validation_alias=AliasChoices(
+ "nvidia_energy_consumption", "energy_consumption"
+ ),
+ description="NVIDIA GPU cumulative energy consumption in MJ",
)
- gpu_utilization: float | None = Field(
+ nvidia_gpu_utilization: float | None = Field(
default=None,
- description="GPU utilization percentage (0-100). "
+ validation_alias=AliasChoices("nvidia_gpu_utilization", "gpu_utilization"),
+ description="NVIDIA GPU utilization percentage (0-100). "
"Percent of time over the past sample period during which one or more kernels was executing on the GPU.",
)
- gpu_memory_used: float | None = Field(
- default=None, description="GPU memory used in GB"
+ nvidia_memory_utilization: float | None = Field(
+ default=None,
+ validation_alias=AliasChoices("nvidia_memory_utilization", "mem_utilization"),
+ description="NVIDIA memory bandwidth utilization percentage (0-100). "
+ "Percent of time over the past sample period during which global (device) memory was being read or written.",
)
- gpu_temperature: float | None = Field(
- default=None, description="GPU temperature in °C"
+ nvidia_memory_used: float | None = Field(
+ default=None,
+ validation_alias=AliasChoices("nvidia_memory_used", "gpu_memory_used"),
+ description="NVIDIA GPU memory used in GB",
)
- mem_utilization: float | None = Field(
+ nvidia_temperature: float | None = Field(
default=None,
- description="Memory bandwidth utilization percentage (0-100). "
- "Percent of time over the past sample period during which global (device) memory was being read or written.",
+ validation_alias=AliasChoices("nvidia_temperature", "gpu_temperature"),
+ description="NVIDIA GPU temperature in °C",
)
- sm_utilization: float | None = Field(
+ nvidia_sm_utilization: float | None = Field(
default=None,
- description="Streaming multiprocessor utilization percentage (0-100)",
+ validation_alias=AliasChoices("nvidia_sm_utilization", "sm_utilization"),
+ description="NVIDIA streaming multiprocessor utilization percentage (0-100)",
)
- decoder_utilization: float | None = Field(
- default=None, description="Video decoder (NVDEC) utilization percentage (0-100)"
+ nvidia_decoder_utilization: float | None = Field(
+ default=None,
+ validation_alias=AliasChoices(
+ "nvidia_decoder_utilization", "decoder_utilization"
+ ),
+ description="NVIDIA video decoder (NVDEC) utilization percentage (0-100)",
)
- encoder_utilization: float | None = Field(
- default=None, description="Video encoder (NVENC) utilization percentage (0-100)"
+ nvidia_encoder_utilization: float | None = Field(
+ default=None,
+ validation_alias=AliasChoices(
+ "nvidia_encoder_utilization", "encoder_utilization"
+ ),
+ description="NVIDIA video encoder (NVENC) utilization percentage (0-100)",
)
- jpg_utilization: float | None = Field(
- default=None, description="JPEG decoder utilization percentage (0-100)"
+ nvidia_jpg_utilization: float | None = Field(
+ default=None,
+ validation_alias=AliasChoices("nvidia_jpg_utilization", "jpg_utilization"),
+ description="NVIDIA JPEG decoder utilization percentage (0-100)",
)
- xid_errors: float | None = Field(
- default=None, description="Value of the last XID error encountered"
+ nvidia_xid_errors: float | None = Field(
+ default=None,
+ validation_alias=AliasChoices("nvidia_xid_errors", "xid_errors"),
+ description="Value of the last NVIDIA XID error encountered",
)
- power_violation: float | None = Field(
+ nvidia_power_violation: float | None = Field(
default=None,
- description="Throttling duration due to power constraints in microseconds",
+ validation_alias=AliasChoices("nvidia_power_violation", "power_violation"),
+ description="NVIDIA throttling duration due to power constraints in microseconds",
)

# AMD ROCm telemetry (collected by AMDSMITelemetryCollector). These mirror
@@ -141,6 +168,10 @@ class GpuMetadata(AIPerfBaseModel):
pod_name: str | None = Field(
default=None, description="Pod name where the GPU is located (kubernetes only)"
)
platform: str = Field(
default=UNKNOWN_GPU_TELEMETRY_PLATFORM,
description="GPU telemetry platform namespace, e.g. 'nvidia', 'amd', or 'unknown'",
)


class TelemetryRecord(GpuMetadata):
@@ -649,6 +680,7 @@ def add_record(self, record: TelemetryRecord) -> None:
hostname=record.hostname,
namespace=record.namespace,
pod_name=record.pod_name,
platform=record.platform,
),
)

29 changes: 24 additions & 5 deletions src/aiperf/exporters/gpu_telemetry_console_exporter.py
@@ -71,6 +71,7 @@ def get_renderable(self) -> RenderableType:
RenderableType: Rich Group containing multiple Tables, or Text message if no data
"""
renderables = []
renderables.append(self._create_platform_disclaimer())
first_table = True

# TelemetryExportData uses: endpoints[endpoint_display] -> EndpointData.gpus[gpu_key] -> GpuSummary
@@ -98,11 +99,29 @@ def get_renderable(self) -> RenderableType:
)
renderables.append(metrics_table)

- if not renderables:
+ if len(renderables) == 1:
return self._create_no_data_message()

return Group(*renderables)

def _create_platform_disclaimer(self) -> Text:
"""Create platform-specific comparability warning for telemetry summaries."""
platforms = sorted(
{
gpu_summary.platform
for endpoint_data in self._telemetry_results.endpoints.values()
for gpu_summary in endpoint_data.gpus.values()
}
)
platform_text = ", ".join(platforms) if platforms else "unknown"
return Text(
"GPU telemetry platform: "
f"{platform_text}. "
"Metric semantics are platform-specific; cross-platform comparisons "
"require workload and collector validation.",
style="yellow",
)

def _create_summary_header(self, table_title_base: str) -> str:
"""Create the summary header with endpoint reachability status.

@@ -112,7 +131,7 @@ def _create_summary_header(self, table_title_base: str) -> str:
Returns:
Formatted title string with endpoint status
"""
- title_lines = ["NVIDIA AIPerf | GPU Telemetry Summary"]
+ title_lines = ["AIPerf | GPU Telemetry Summary"]

endpoints_configured = self._telemetry_results.summary.endpoints_configured
endpoints_successful = self._telemetry_results.summary.endpoints_successful
@@ -122,15 +141,15 @@ def _create_summary_header(self, table_title_base: str) -> str:

if failed_count == 0:
title_lines.append(
- f"[bold green]{successful_count}/{total_count} DCGM endpoints reachable[/bold green]"
+ f"[bold green]{successful_count}/{total_count} telemetry sources reachable[/bold green]"
)
elif successful_count == 0:
title_lines.append(
- f"[bold red]{successful_count}/{total_count} DCGM endpoints reachable[/bold red]"
+ f"[bold red]{successful_count}/{total_count} telemetry sources reachable[/bold red]"
)
else:
title_lines.append(
- f"[bold yellow]{successful_count}/{total_count} DCGM endpoints reachable[/bold yellow]"
+ f"[bold yellow]{successful_count}/{total_count} telemetry sources reachable[/bold yellow]"
)

for endpoint in endpoints_configured:
10 changes: 7 additions & 3 deletions src/aiperf/exporters/metrics_csv_exporter.py
@@ -28,7 +28,9 @@ def __init__(self, exporter_config: ExporterConfig, **kwargs) -> None:
self._percentile_keys = _percentile_keys_from(STAT_KEYS)
self.trace_or_debug(
lambda: f"Initializing MetricsCsvExporter with config: {exporter_config}",
- lambda: f"Initializing MetricsCsvExporter with file path: {self._file_path}",
+ lambda: (
+ f"Initializing MetricsCsvExporter with file path: {self._file_path}"
+ ),
)

def get_export_info(self) -> FileExportInfo:
@@ -203,6 +205,7 @@ def _write_telemetry_section(self, writer: csv.writer) -> None:
"GPU_Index",
"GPU_Name",
"GPU_UUID",
"Platform",
]
optional_headers, optional_fields = self._get_optional_headers_and_fields(
"Hostname", "Namespace", "Pod Name"
@@ -259,8 +262,8 @@ def _write_gpu_metric_row_from_summary(
endpoint_display: Display name of the endpoint
gpu_summary: GpuSummary with pre-computed metrics (from TelemetryExportData)
optional_fields: List of optional fields to write to the row
- metric_key: Internal metric name (e.g., "gpu_power_usage")
- metric_display: Display name for the metric (e.g., "GPU Power Usage")
+ metric_key: Internal metric name (e.g., "nvidia_power_usage")
+ metric_display: Display name for the metric (e.g., "NVIDIA GPU Power Usage")
unit: Unit of measurement (e.g., "W", "GB", "%")
"""
try:
@@ -274,6 +277,7 @@ def _write_gpu_metric_row_from_summary(
str(gpu_summary.gpu_index),
gpu_summary.gpu_name,
gpu_summary.gpu_uuid,
gpu_summary.platform,
metric_with_unit,
]

1 change: 1 addition & 0 deletions src/aiperf/gpu_telemetry/accumulator.py
@@ -288,6 +288,7 @@ def export_results(
gpu_index=gpu_data.metadata.gpu_index,
gpu_name=gpu_data.metadata.gpu_model_name,
gpu_uuid=gpu_uuid,
platform=gpu_data.metadata.platform,
hostname=gpu_data.metadata.hostname,
namespace=gpu_data.metadata.namespace,
pod_name=gpu_data.metadata.pod_name,
6 changes: 5 additions & 1 deletion src/aiperf/gpu_telemetry/amdsmi_collector.py
@@ -42,7 +42,10 @@
TelemetryMetrics,
TelemetryRecord,
)
- from aiperf.gpu_telemetry.constants import AMDSMI_SOURCE_IDENTIFIER
+ from aiperf.gpu_telemetry.constants import (
+ AMD_GPU_TELEMETRY_PLATFORM,
+ AMDSMI_SOURCE_IDENTIFIER,
+ )
from aiperf.gpu_telemetry.protocols import TErrorCallback, TRecordCallback

__all__ = ["AMDSMITelemetryCollector"]
@@ -285,6 +288,7 @@ def _build_gpu_state(self, index: int, handle: Any) -> _AMDGpuDeviceState | None
pci_bus_id=pci_bus_id,
device=f"amd{index}",
hostname="localhost",
platform=AMD_GPU_TELEMETRY_PLATFORM,
),
)
