feat: add phase baseline handshake#956
Conversation
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]> Signed-off-by: Anthony Casagrande <[email protected]>
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@4134e4659e783f0eed8414001af8b262e7b86dd4Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@4134e4659e783f0eed8414001af8b262e7b86dd4Last updated for commit: |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
💤 Files with no reviewable changes (2)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughThis PR implements a phase baseline handshake mechanism that synchronizes baseline collection at phase boundaries. Services advertise baseline-collection or result-producer capabilities; the system broadcasts requests to collectors via ChangesPhase Baseline Handshake & Coordination
🎯 4 (Complex) | ⏱️ ~60 minutes
|
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (9)
tests/unit/common/mixins/test_baseline_collector_mixin.py (1)
65-75: ⚡ Quick winStrengthen failure-path contract assertions.
This test checks the error payload, but it should also assert the ack envelope (
service_id,phase_id,kind) and single publish to lock the handshake contract on failures too.Suggested test hardening
`@pytest.mark.asyncio` async def test_handler_acks_failure_when_collect_raises() -> None: bus = _StubBus() svc = _StubCollector(bus, fail=True) await svc._on_phase_baseline_request( PhaseBaselineRequestMessage( phase_id="p1", phase_name="profiling", kind=BaselineKind.END ) ) + assert len(bus.published) == 1 ack = bus.published[0] assert ack.success is False + assert ack.service_id == "svc-stub" + assert ack.phase_id == "p1" + assert ack.kind == BaselineKind.END assert "simulated DCGM failure" in ack.error🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/common/mixins/test_baseline_collector_mixin.py` around lines 65 - 75, Update the test_handler_acks_failure_when_collect_raises to also assert the ack envelope and ensure exactly one publish: after calling _StubCollector._on_phase_baseline_request and obtaining ack = bus.published[0], assert ack.service_id (or the expected service identifier from _StubCollector), ack.phase_id == "p1", and ack.kind == BaselineKind.END, and additionally assert len(bus.published) == 1 to guarantee a single publish; keep the existing assertions that ack.success is False and that "simulated DCGM failure" appears in ack.error.tests/unit/common/enums/test_baseline_enums.py (1)
11-40: ⚡ Quick winParametrize repeated enum/parser cases.
These are clean tests already, but the repeated single-assert patterns are a good fit for
@pytest.mark.parametrizeto reduce duplication and improve coverage extensibility.As per coding guidelines: "Use
@pytest.mark.parametrizefor data-driven tests".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/common/enums/test_baseline_enums.py` around lines 11 - 40, Several tests duplicate single-assert patterns; convert them to data-driven parameterized tests using pytest.mark.parametrize to reduce duplication. Replace the repeated tests for BaselineKind values and case-insensitive parsing with a single parameterized test that iterates over (input, expected) tuples and references BaselineKind and the existing test functions (e.g., test_baseline_kind_values_lowercase, test_baseline_kind_case_insensitive). Do the same for ServiceCapability and the capability helpers by parameterizing expected string values for ServiceCapability.BASELINE_COLLECTOR and ServiceCapability.RESULT_PRODUCER, and parameterize make_result_producer_capability and parse_result_producer_capability cases (including negative cases that should return None) so each assertion becomes an iteration in a single param test rather than separate test functions.src/aiperf/controller/baseline_coordinator.py (1)
82-93: 💤 Low valueExceptions in gathered futures are silently ignored.
With
return_exceptions=True, if a future raises an exception (e.g., fromset_exceptionor internal asyncio errors), it appears inresultsbut the loop only checks forPhaseBaselineAckMessage. These exceptions would be silently dropped.♻️ Suggested improvement to log unexpected exceptions
for ack in results: if isinstance(ack, PhaseBaselineAckMessage) and not ack.success: _logger.warning( f"Baseline {kind} for phase '{phase_name}' " f"(id={phase_id[:8]}) collector {ack.service_id!r} " f"reported failure: {ack.error}" ) + elif isinstance(ack, BaseException): + _logger.warning( + f"Baseline {kind} for phase '{phase_name}' " + f"(id={phase_id[:8]}) unexpected exception: {ack!r}" + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/aiperf/controller/baseline_coordinator.py` around lines 82 - 93, The loop over results from asyncio.gather in baseline_coordinator.py currently only handles PhaseBaselineAckMessage and silently drops exceptions because return_exceptions=True; update the loop that processes results (the variable results from asyncio.wait_for / asyncio.gather) to detect non-PhaseBaselineAckMessage items that are Exception instances and log them (include context such as phase_name, phase_id[:8], kind and the exception/traceback) so raised exceptions from futures are not ignored; keep the existing handling for PhaseBaselineAckMessage.success failures and add a branch to log unexpected exceptions from the gathered futures.tests/unit/records/test_records_manager.py (1)
347-351: ⚡ Quick winAdd type hints to the new helper function.
Line 347 introduces
close_background_taskwithout parameter/return annotations; please type it for consistency with project rules.Proposed patch
+from collections.abc import Coroutine +from typing import Any ... - def close_background_task(coro): + def close_background_task(coro: Coroutine[Any, Any, Any]) -> MagicMock: coro.close() return MagicMock()As per coding guidelines, "Add type hints on ALL functions (params and return)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/records/test_records_manager.py` around lines 347 - 351, The helper close_background_task lacks type annotations; update it to have param and return types (e.g., annotate coro as Coroutine[Any, Any, Any] and the function return as MagicMock) and add the needed imports (typing: Any, Coroutine and unittest.mock: MagicMock) so the signature becomes typed and consistent with project rules, and keep manager.execute_async = MagicMock(side_effect=close_background_task) unchanged.tests/unit/common/test_environment_baseline.py (2)
6-8: ⚡ Quick winAvoid asserting defaults via the global
Environmentsingleton.Using
Environmentat Line 7 can make this test order-dependent if another test initializes env-backed settings earlier with overrides. Prefer constructing a fresh_Environment()here too, so this test always validates defaults deterministically.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/common/test_environment_baseline.py` around lines 6 - 8, The test test_baseline_defaults currently reads defaults from the global Environment singleton which can be mutated by other tests; instead instantiate a fresh _Environment() and assert against its BASELINE values (e.g., create env = _Environment() and use env.BASELINE.GATE_TIMEOUT_S and env.BASELINE.GATE_ENABLED) so the test validates deterministic defaults independent of global state.
11-13: ⚡ Quick winAdd the fixture type hint and keep imports at module scope.
At Line 11, type
monkeypatch(e.g.,pytest.MonkeyPatch). Also move the_Environmentimport from Line 12 to the file top for consistency with test style.As per coding guidelines, "Add type hints on ALL functions (params and return)" and "Put imports at file top, fixtures for setup, one focus per test."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/common/test_environment_baseline.py` around lines 11 - 13, Add a proper fixture type hint and move imports to module scope: change the test function signature test_baseline_env_override to accept a typed fixture (e.g., monkeypatch: pytest.MonkeyPatch) and keep the return annotation (-> None), and move the from aiperf.common.environment import _Environment import to the top of the test module so imports are module-scoped; update any other local imports in this test file similarly to follow the project style.tests/unit/controller/test_system_controller_baseline.py (1)
41-42: ⚡ Quick winType the
_no_op_publishparameter.At Line 41, add a type for
msgto satisfy the repo-wide function typing rule.As per coding guidelines, "Add type hints on ALL functions (params and return)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/controller/test_system_controller_baseline.py` around lines 41 - 42, The helper function _no_op_publish is missing a parameter type for msg; update its signature to include an explicit type (e.g., msg: Any) so it conforms to the repo rule requiring type hints on all functions, and add the corresponding import from typing (Any) at the top of the test module if not already present; keep the existing async and return type (-> None) intact.tests/unit/gpu_telemetry/test_telemetry_manager.py (1)
1250-1250: ⚡ Quick winAdd return type hints to the new async test functions.
Line 1250 and Line 1263 are missing explicit return annotations (
-> None), which this repo requires for all functions.Suggested patch
- async def test_collect_baseline_reports_scrape_failures(self): + async def test_collect_baseline_reports_scrape_failures(self) -> None: @@ - async def test_profile_complete_stops_without_final_scrape(self): + async def test_profile_complete_stops_without_final_scrape(self) -> None:As per coding guidelines: "Add type hints on ALL functions (params and return)" for
**/*.py.Also applies to: 1263-1263
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/gpu_telemetry/test_telemetry_manager.py` at line 1250, Two async test functions are missing explicit return type hints; add "-> None" to the async def signature for test_collect_baseline_reports_scrape_failures and the other async test declared at the nearby location (the function starting at line 1263) so both signatures read "async def <name>(...) -> None:". Update the function declarations (not body) for these tests to include the return annotation to satisfy the repository's typing requirement.tests/unit/server_metrics/test_server_metrics_manager.py (1)
466-466: ⚡ Quick winAdd explicit
-> Nonereturn annotations on updated async tests.Line 466, Line 535, and Line 559 define new/renamed test functions without return type hints.
Suggested patch
- async def test_configure_does_not_capture_baseline( + async def test_configure_does_not_capture_baseline( self, cli_config: CLIConfig, cfg_with_endpoint: CLIConfig, - ): + ) -> None: @@ - async def test_profile_complete_stops_without_final_scrape( + async def test_profile_complete_stops_without_final_scrape( self, cli_config: CLIConfig, cfg_with_endpoint: CLIConfig, - ): + ) -> None: @@ - async def test_collect_baseline_reports_scrape_failures( + async def test_collect_baseline_reports_scrape_failures( self, cli_config: CLIConfig, cfg_with_endpoint: CLIConfig, - ): + ) -> None:As per coding guidelines: "Add type hints on ALL functions (params and return)" for
**/*.py.Also applies to: 535-535, 559-559
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/server_metrics/test_server_metrics_manager.py` at line 466, The async test functions (e.g., async def test_configure_does_not_capture_baseline) are missing explicit return type annotations; update each updated/renamed async test definition to include an explicit "-> None" return type (for the functions at the mentioned locations, and the other two new/renamed tests referenced) so their signatures read "async def <function_name>(...) -> None:" to satisfy the project's type-hinting guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/aiperf/common/enums/baseline_enums.py`:
- Around line 33-36: make_result_producer_capability currently allows an empty
domain which yields an invalid tag that parse_result_producer_capability cannot
round-trip; validate the domain in make_result_producer_capability (ensure
domain is a non-empty string) and reject empty input by raising a ValueError (or
similar explicit exception) with a clear message; update the function that
builds the tag (make_result_producer_capability) to check domain before
returning f"{_RESULT_PRODUCER_PREFIX}{domain}" and reference
parse_result_producer_capability in the error message if helpful.
In `@src/aiperf/common/messages/baseline_messages.py`:
- Around line 40-46: Add a Pydantic validation to enforce consistency between
the success and error fields on the baseline ACK model: ensure that when success
is True then error is None/empty, and when success is False then error is a
non-empty string. Locate the model containing the Field definitions for success
and error (the Baseline ACK dataclass/model in this file) and add a validator or
root_validator that raises a ValueError with a clear message when these
conditions are violated so the wire contract remains unambiguous.
In `@src/aiperf/common/mixins/baseline_collector_mixin.py`:
- Around line 27-37: The class BaselineCollectorMixin declares an
`@abstractmethod` (collect_baseline) but doesn't inherit from ABC, so Python won't
enforce implementation in subclasses; update BaselineCollectorMixin to inherit
from abc.ABC (or make it subclass of ABC) so the abstract hook on
collect_baseline is enforced and instantiating subclasses without an override
will raise TypeError, leaving the method name collect_baseline and the mixin
class name unchanged.
In `@tests/unit/common/enums/test_baseline_enums.py`:
- Around line 11-40: Add explicit return type hints "-> None" to every test
function in this file: test_baseline_kind_values_lowercase,
test_baseline_kind_case_insensitive,
test_service_capability_baseline_collector_value,
test_service_capability_result_producer_value,
test_make_result_producer_capability_includes_domain,
test_parse_result_producer_capability_returns_domain, and
test_parse_result_producer_capability_ignores_other_capabilities; update each
def signature to include "-> None" to comply with the repo-wide "Type hints on
ALL functions" rule.
In `@tests/unit/common/messages/test_register_service_capabilities.py`:
- Around line 32-34: The test test_capabilities_accepts_tuple_of_strings
currently asserts string-equality and can pass if values were coerced to enums;
change the assertion on cmd.capabilities (from _make(...)) to verify the
normalized type by asserting it is a tuple of Capability enum members
corresponding to the input strings (i.e., the Capability enum instances for
"baseline_collector"), rather than comparing to the raw string tuple.
---
Nitpick comments:
In `@src/aiperf/controller/baseline_coordinator.py`:
- Around line 82-93: The loop over results from asyncio.gather in
baseline_coordinator.py currently only handles PhaseBaselineAckMessage and
silently drops exceptions because return_exceptions=True; update the loop that
processes results (the variable results from asyncio.wait_for / asyncio.gather)
to detect non-PhaseBaselineAckMessage items that are Exception instances and log
them (include context such as phase_name, phase_id[:8], kind and the
exception/traceback) so raised exceptions from futures are not ignored; keep the
existing handling for PhaseBaselineAckMessage.success failures and add a branch
to log unexpected exceptions from the gathered futures.
In `@tests/unit/common/enums/test_baseline_enums.py`:
- Around line 11-40: Several tests duplicate single-assert patterns; convert
them to data-driven parameterized tests using pytest.mark.parametrize to reduce
duplication. Replace the repeated tests for BaselineKind values and
case-insensitive parsing with a single parameterized test that iterates over
(input, expected) tuples and references BaselineKind and the existing test
functions (e.g., test_baseline_kind_values_lowercase,
test_baseline_kind_case_insensitive). Do the same for ServiceCapability and the
capability helpers by parameterizing expected string values for
ServiceCapability.BASELINE_COLLECTOR and ServiceCapability.RESULT_PRODUCER, and
parameterize make_result_producer_capability and
parse_result_producer_capability cases (including negative cases that should
return None) so each assertion becomes an iteration in a single param test
rather than separate test functions.
In `@tests/unit/common/mixins/test_baseline_collector_mixin.py`:
- Around line 65-75: Update the test_handler_acks_failure_when_collect_raises to
also assert the ack envelope and ensure exactly one publish: after calling
_StubCollector._on_phase_baseline_request and obtaining ack = bus.published[0],
assert ack.service_id (or the expected service identifier from _StubCollector),
ack.phase_id == "p1", and ack.kind == BaselineKind.END, and additionally assert
len(bus.published) == 1 to guarantee a single publish; keep the existing
assertions that ack.success is False and that "simulated DCGM failure" appears
in ack.error.
In `@tests/unit/common/test_environment_baseline.py`:
- Around line 6-8: The test test_baseline_defaults currently reads defaults from
the global Environment singleton which can be mutated by other tests; instead
instantiate a fresh _Environment() and assert against its BASELINE values (e.g.,
create env = _Environment() and use env.BASELINE.GATE_TIMEOUT_S and
env.BASELINE.GATE_ENABLED) so the test validates deterministic defaults
independent of global state.
- Around line 11-13: Add a proper fixture type hint and move imports to module
scope: change the test function signature test_baseline_env_override to accept a
typed fixture (e.g., monkeypatch: pytest.MonkeyPatch) and keep the return
annotation (-> None), and move the from aiperf.common.environment import
_Environment import to the top of the test module so imports are module-scoped;
update any other local imports in this test file similarly to follow the project
style.
In `@tests/unit/controller/test_system_controller_baseline.py`:
- Around line 41-42: The helper function _no_op_publish is missing a parameter
type for msg; update its signature to include an explicit type (e.g., msg: Any)
so it conforms to the repo rule requiring type hints on all functions, and add
the corresponding import from typing (Any) at the top of the test module if not
already present; keep the existing async and return type (-> None) intact.
In `@tests/unit/gpu_telemetry/test_telemetry_manager.py`:
- Line 1250: Two async test functions are missing explicit return type hints;
add "-> None" to the async def signature for
test_collect_baseline_reports_scrape_failures and the other async test declared
at the nearby location (the function starting at line 1263) so both signatures
read "async def <name>(...) -> None:". Update the function declarations (not
body) for these tests to include the return annotation to satisfy the
repository's typing requirement.
In `@tests/unit/records/test_records_manager.py`:
- Around line 347-351: The helper close_background_task lacks type annotations;
update it to have param and return types (e.g., annotate coro as Coroutine[Any,
Any, Any] and the function return as MagicMock) and add the needed imports
(typing: Any, Coroutine and unittest.mock: MagicMock) so the signature becomes
typed and consistent with project rules, and keep manager.execute_async =
MagicMock(side_effect=close_background_task) unchanged.
In `@tests/unit/server_metrics/test_server_metrics_manager.py`:
- Line 466: The async test functions (e.g., async def
test_configure_does_not_capture_baseline) are missing explicit return type
annotations; update each updated/renamed async test definition to include an
explicit "-> None" return type (for the functions at the mentioned locations,
and the other two new/renamed tests referenced) so their signatures read "async
def <function_name>(...) -> None:" to satisfy the project's type-hinting
guideline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e5388858-cca6-4bbd-93a1-43cc76a9b315
📒 Files selected for processing (46)
docs/architecture.mddocs/dev/patterns.mddocs/environment-variables.mddocs/index.ymldocs/reference/phase-baseline-handshake.mddocs/server-metrics/server-metrics.mdsrc/aiperf/common/base_component_service.pysrc/aiperf/common/enums/__init__.pysrc/aiperf/common/enums/baseline_enums.pysrc/aiperf/common/enums/enums.pysrc/aiperf/common/environment.pysrc/aiperf/common/messages/__init__.pysrc/aiperf/common/messages/base_messages.pysrc/aiperf/common/messages/baseline_messages.pysrc/aiperf/common/messages/command_messages.pysrc/aiperf/common/mixins/__init__.pysrc/aiperf/common/mixins/baseline_collector_mixin.pysrc/aiperf/controller/baseline_coordinator.pysrc/aiperf/controller/result_join_coordinator.pysrc/aiperf/controller/system_controller.pysrc/aiperf/gpu_telemetry/manager.pysrc/aiperf/gpu_telemetry/pynvml_collector.pysrc/aiperf/records/records_manager.pysrc/aiperf/server_metrics/manager.pysrc/aiperf/timing/manager.pysrc/aiperf/timing/phase/phase_gate.pysrc/aiperf/timing/phase/runner.pysrc/aiperf/timing/phase_orchestrator.pytests/component_integration/conftest.pytests/integration/conftest.pytests/unit/common/enums/test_baseline_enums.pytests/unit/common/messages/test_baseline_messages.pytests/unit/common/messages/test_register_service_capabilities.pytests/unit/common/mixins/test_baseline_collector_mixin.pytests/unit/common/test_environment_baseline.pytests/unit/controller/test_baseline_coordinator.pytests/unit/controller/test_result_join_coordinator.pytests/unit/controller/test_system_controller_baseline.pytests/unit/gpu_telemetry/test_baseline_integration.pytests/unit/gpu_telemetry/test_telemetry_manager.pytests/unit/records/test_records_manager.pytests/unit/server_metrics/test_baseline_integration.pytests/unit/server_metrics/test_server_metrics_manager.pytests/unit/timing/phase/test_phase_gate_client.pytests/unit/timing/phase/test_runner_gate_ordering.pytests/unit/timing/phase/test_runner_gates.py
💤 Files with no reviewable changes (3)
- docs/server-metrics/server-metrics.md
- tests/integration/conftest.py
- tests/component_integration/conftest.py
| def make_result_producer_capability(domain: str) -> str: | ||
| """Build a result-producer capability tag for a result domain.""" | ||
|
|
||
| return f"{_RESULT_PRODUCER_PREFIX}{domain}" |
There was a problem hiding this comment.
Reject empty result-producer domains at construction.
make_result_producer_capability("") currently produces a tag that parse_result_producer_capability(...) cannot round-trip (it returns None). Guarding this early avoids silent invalid capabilities.
Suggested fix
def make_result_producer_capability(domain: str) -> str:
"""Build a result-producer capability tag for a result domain."""
-
- return f"{_RESULT_PRODUCER_PREFIX}{domain}"
+ if not domain:
+ raise ValueError("domain must be non-empty")
+ return f"{_RESULT_PRODUCER_PREFIX}{domain}"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/aiperf/common/enums/baseline_enums.py` around lines 33 - 36,
make_result_producer_capability currently allows an empty domain which yields an
invalid tag that parse_result_producer_capability cannot round-trip; validate
the domain in make_result_producer_capability (ensure domain is a non-empty
string) and reject empty input by raising a ValueError (or similar explicit
exception) with a clear message; update the function that builds the tag
(make_result_producer_capability) to check domain before returning
f"{_RESULT_PRODUCER_PREFIX}{domain}" and reference
parse_result_producer_capability in the error message if helpful.
| success: bool = Field( | ||
| ..., | ||
| description="False if collect_baseline() raised; coordinator still counts as ack.", | ||
| ) | ||
| error: str | None = Field( | ||
| default=None, description="Error string when success=False." | ||
| ) |
There was a problem hiding this comment.
Enforce success/error consistency on baseline acks.
The model currently allows success=True with a non-null error, and success=False without an error. Enforcing this keeps the wire contract unambiguous.
Suggested fix
-from pydantic import Field
+from pydantic import Field, model_validator
@@
class PhaseBaselineAckMessage(BaseServiceMessage):
@@
error: str | None = Field(
default=None, description="Error string when success=False."
)
+
+ `@model_validator`(mode="after")
+ def validate_ack_fields(self) -> "PhaseBaselineAckMessage":
+ if self.success and self.error is not None:
+ raise ValueError("error must be None when success=True")
+ if not self.success and not self.error:
+ raise ValueError("error is required when success=False")
+ return self🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/aiperf/common/messages/baseline_messages.py` around lines 40 - 46, Add a
Pydantic validation to enforce consistency between the success and error fields
on the baseline ACK model: ensure that when success is True then error is
None/empty, and when success is False then error is a non-empty string. Locate
the model containing the Field definitions for success and error (the Baseline
ACK dataclass/model in this file) and add a validator or root_validator that
raises a ValueError with a clear message when these conditions are violated so
the wire contract remains unambiguous.
| class BaselineCollectorMixin: | ||
| """Mix into a BaseComponentService to participate in the phase baseline handshake.""" | ||
|
|
||
| extra_capabilities: ClassVar[tuple[str, ...]] = ( | ||
| ServiceCapability.BASELINE_COLLECTOR, | ||
| ) | ||
|
|
||
| @abstractmethod | ||
| async def collect_baseline( | ||
| self, kind: BaselineKind, phase_id: str, phase_name: str | ||
| ) -> None: |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "1) Inspect BaselineCollectorMixin declaration:"
rg -n --type=py 'class\s+BaselineCollectorMixin' src/aiperf/common/mixins/baseline_collector_mixin.py
rg -n --type=py '`@abstractmethod`|def collect_baseline' src/aiperf/common/mixins/baseline_collector_mixin.py
echo
echo "2) List classes inheriting BaselineCollectorMixin and whether they override collect_baseline:"
python - <<'PY'
import ast
from pathlib import Path
def base_name(node):
if isinstance(node, ast.Name):
return node.id
if isinstance(node, ast.Attribute):
parts = []
cur = node
while isinstance(cur, ast.Attribute):
parts.append(cur.attr)
cur = cur.value
if isinstance(cur, ast.Name):
parts.append(cur.id)
return ".".join(reversed(parts))
return None
for path in Path("src").rglob("*.py"):
text = path.read_text(encoding="utf-8")
try:
tree = ast.parse(text)
except SyntaxError:
continue
for n in [x for x in ast.walk(tree) if isinstance(x, ast.ClassDef)]:
bases = {base_name(b) for b in n.bases}
if "BaselineCollectorMixin" in bases or "aiperf.common.mixins.baseline_collector_mixin.BaselineCollectorMixin" in bases:
has_override = any(
isinstance(m, (ast.FunctionDef, ast.AsyncFunctionDef)) and m.name == "collect_baseline"
for m in n.body
)
print(f"{path}:{n.lineno} class {n.name} collect_baseline_override={has_override}")
PYRepository: ai-dynamo/aiperf
Length of output: 487
🏁 Script executed:
head -40 src/aiperf/common/mixins/baseline_collector_mixin.pyRepository: ai-dynamo/aiperf
Length of output: 1507
🏁 Script executed:
rg -n 'class.*BaselineCollectorMixin' --type py -A 10 src/Repository: ai-dynamo/aiperf
Length of output: 3095
Enforce the abstract hook by inheriting ABC.
Line 34 uses @abstractmethod, but Line 27 does not inherit ABC. This means subclasses can be instantiated without implementing collect_baseline, and baseline requests may silently succeed. While both known subclasses (GPUTelemetryManager and ServerMetricsManager) correctly override the method, Python does not enforce the contract without ABC inheritance.
Proposed fix
-from abc import abstractmethod
+from abc import ABC, abstractmethod
@@
-class BaselineCollectorMixin:
+class BaselineCollectorMixin(ABC):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/aiperf/common/mixins/baseline_collector_mixin.py` around lines 27 - 37,
The class BaselineCollectorMixin declares an `@abstractmethod` (collect_baseline)
but doesn't inherit from ABC, so Python won't enforce implementation in
subclasses; update BaselineCollectorMixin to inherit from abc.ABC (or make it
subclass of ABC) so the abstract hook on collect_baseline is enforced and
instantiating subclasses without an override will raise TypeError, leaving the
method name collect_baseline and the mixin class name unchanged.
| def test_baseline_kind_values_lowercase(): | ||
| assert BaselineKind.START == "start" | ||
| assert BaselineKind.END == "end" | ||
|
|
||
|
|
||
| def test_baseline_kind_case_insensitive(): | ||
| assert BaselineKind("START") == BaselineKind.START | ||
| assert BaselineKind("End") == BaselineKind.END | ||
|
|
||
|
|
||
| def test_service_capability_baseline_collector_value(): | ||
| assert ServiceCapability.BASELINE_COLLECTOR == "baseline_collector" | ||
|
|
||
|
|
||
| def test_service_capability_result_producer_value(): | ||
| assert ServiceCapability.RESULT_PRODUCER == "result_producer" | ||
|
|
||
|
|
||
| def test_make_result_producer_capability_includes_domain(): | ||
| assert make_result_producer_capability("profile") == "result_producer:profile" | ||
|
|
||
|
|
||
| def test_parse_result_producer_capability_returns_domain(): | ||
| assert parse_result_producer_capability("result_producer:telemetry") == "telemetry" | ||
|
|
||
|
|
||
| def test_parse_result_producer_capability_ignores_other_capabilities(): | ||
| assert parse_result_producer_capability("baseline_collector") is None | ||
| assert parse_result_producer_capability("result_producer") is None | ||
| assert parse_result_producer_capability("result_producer:") is None |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Add return type hints to test functions.
These tests should declare -> None for consistency with the repo-wide typing rule.
Proposed diff
-def test_baseline_kind_values_lowercase():
+def test_baseline_kind_values_lowercase() -> None:
@@
-def test_baseline_kind_case_insensitive():
+def test_baseline_kind_case_insensitive() -> None:
@@
-def test_service_capability_baseline_collector_value():
+def test_service_capability_baseline_collector_value() -> None:
@@
-def test_service_capability_result_producer_value():
+def test_service_capability_result_producer_value() -> None:
@@
-def test_make_result_producer_capability_includes_domain():
+def test_make_result_producer_capability_includes_domain() -> None:
@@
-def test_parse_result_producer_capability_returns_domain():
+def test_parse_result_producer_capability_returns_domain() -> None:
@@
-def test_parse_result_producer_capability_ignores_other_capabilities():
+def test_parse_result_producer_capability_ignores_other_capabilities() -> None:As per coding guidelines: "Type hints on ALL functions (params and return)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/common/enums/test_baseline_enums.py` around lines 11 - 40, Add
explicit return type hints "-> None" to every test function in this file:
test_baseline_kind_values_lowercase, test_baseline_kind_case_insensitive,
test_service_capability_baseline_collector_value,
test_service_capability_result_producer_value,
test_make_result_producer_capability_includes_domain,
test_parse_result_producer_capability_returns_domain, and
test_parse_result_producer_capability_ignores_other_capabilities; update each
def signature to include "-> None" to comply with the repo-wide "Type hints on
ALL functions" rule.
| def test_capabilities_accepts_tuple_of_strings() -> None: | ||
| cmd = _make(capabilities=("baseline_collector",)) | ||
| assert cmd.capabilities == ("baseline_collector",) |
There was a problem hiding this comment.
Assert normalized capability type, not just string-equivalent value.
This currently passes even if capabilities are coerced to enum instances because string enums compare equal to strings. Consider asserting the normalized enum tuple to lock the contract.
Proposed diff
def test_capabilities_accepts_tuple_of_strings() -> None:
cmd = _make(capabilities=("baseline_collector",))
- assert cmd.capabilities == ("baseline_collector",)
+ assert cmd.capabilities == (ServiceCapability.BASELINE_COLLECTOR,)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/common/messages/test_register_service_capabilities.py` around
lines 32 - 34, The test test_capabilities_accepts_tuple_of_strings currently
asserts string-equality and can pass if values were coerced to enums; change the
assertion on cmd.capabilities (from _make(...)) to verify the normalized type by
asserting it is a tuple of Capability enum members corresponding to the input
strings (i.e., the Capability enum instances for "baseline_collector"), rather
than comparing to the raw string tuple.
| self.info( | ||
| f"GPU Telemetry: Capturing {kind} baseline for phase '{phase_name}'..." | ||
| ) | ||
| await self._collect_once(label=str(kind)) |
There was a problem hiding this comment.
The START gate can call collect_baseline() before this service's PROFILE_START handler has initialized collectors, because TimingManager starts phases in a background task as soon as it handles PROFILE_START, so this scrape can run on uninitialized GPU collectors. Fix: initialize reachable GPU collectors during configure before advertising telemetry enabled, or delay TimingManager phase execution until all services complete PROFILE_START.
| END gate to `_on_return_wait_complete`. All other phases wait synchronously, | ||
| cancel the progress task, and fire the END gate inline. | ||
| """ | ||
| if self._config.seamless and not is_final_phase: |
There was a problem hiding this comment.
The seamless non-final path returns after sending completes and defers the END gate to a background task, so the next phase can issue credits before the previous phase's END baseline and contaminate the next START baseline with prior-phase in-flight work. Fix: when phase gating is enabled, wait for returns and the END gate before starting the next phase, or disable baseline gates for seamless overlapping phases.
| self._should_wait_for_telemetry = message.enabled | ||
|
|
||
| if not message.enabled: | ||
| self._result_join_coordinator.unregister("telemetry", message.service_id) |
There was a problem hiding this comment.
Disabled telemetry and server-metrics services are removed from ResultJoinCoordinator but remain registered in BaselineCoordinator, so after their delayed shutdown later phase gates wait for the full timeout for services that will never ack. Fix: unregister disabled, stopped, or erroring baseline collectors from BaselineCoordinator and resolve any in-flight wait for that service.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Summary by CodeRabbit
New Features
Configuration
Documentation