feat: add phase baseline handshake #956
Open: ajcasagrande wants to merge 3 commits into main from ajc/phase-baseline-handshake
Changes from 2 commits
@@ -0,0 +1,125 @@

# Phase Baseline Handshake

The phase baseline handshake captures point-in-time baseline readings at phase boundaries without coupling `TimingManager` to specific collectors. `PhaseRunner` pauses at each boundary through `PhaseGateClient`; `SystemController` fans the request out through `BaselineCoordinator`; registered baseline collectors scrape once and ACK; then the gate releases and the benchmark continues.

## Component map

```mermaid
flowchart LR
    subgraph TimingManager
        PR[PhaseRunner]
        PGC[PhaseGateClient]
        PR -->|before_phase / after_phase| PGC
    end

    subgraph Controller
        SC[SystemController]
        BC[BaselineCoordinator]
        SC --> BC
    end

    subgraph Collectors[Baseline collector services]
        GTM[GPUTelemetryManager]
        SMM[ServerMetricsManager]
        BCM[BaselineCollectorMixin]
        GTM --> BCM
        SMM --> BCM
    end

    PGC -->|PhaseStartGateCommand / PhaseEndGateCommand| SC
    BC -->|PhaseBaselineRequestMessage| BCM
    BCM -->|PhaseBaselineAckMessage| SC
    SC -->|PhaseGateGrantedResponse| PGC
```

## Per-phase message sequence

```mermaid
sequenceDiagram
    autonumber
    participant Runner as PhaseRunner
    participant Gate as PhaseGateClient
    participant SC as SystemController
    participant Coord as BaselineCoordinator
    participant Collector as Baseline collectors

    Runner->>Gate: before_phase(phase_id, phase_name)
    Gate->>SC: PhaseStartGateCommand
    SC->>Coord: gate_phase(kind=START)
    Coord->>Collector: PhaseBaselineRequestMessage(kind=START)
    Collector->>Collector: collect_baseline(START, phase_id, phase_name)
    Collector-->>SC: PhaseBaselineAckMessage(success=True)
    SC->>Coord: handle_ack(ack)
    Coord-->>SC: all registered collectors acked
    SC-->>Gate: PhaseGateGrantedResponse
    Gate-->>Runner: start gate released

    Runner->>Runner: issue credits and wait for returns

    Runner->>Gate: after_phase(phase_id, phase_name)
    Gate->>SC: PhaseEndGateCommand
    SC->>Coord: gate_phase(kind=END)
    Coord->>Collector: PhaseBaselineRequestMessage(kind=END)
    Collector->>Collector: collect_baseline(END, phase_id, phase_name)
    Collector-->>SC: PhaseBaselineAckMessage(success=True)
    SC->>Coord: handle_ack(ack)
    Coord-->>SC: all registered collectors acked
    SC-->>Gate: PhaseGateGrantedResponse
    Gate-->>Runner: end gate released
```

## Credit ordering at phase boundaries

```mermaid
sequenceDiagram
    autonumber
    participant Runner as PhaseRunner
    participant Gate as PhaseGateClient
    participant Controller as SystemController
    participant Coord as BaselineCoordinator
    participant Collectors as Baseline collectors
    participant Issuer as CreditIssuer
    participant Workers as Workers

    Runner->>Gate: before_phase(phase_id, phase_name)
    Gate->>Controller: PhaseStartGateCommand
    Controller->>Coord: gate_phase(kind=START)
    Coord->>Collectors: PhaseBaselineRequestMessage(kind=START)
    Collectors-->>Controller: PhaseBaselineAckMessage(success=True)
    Controller-->>Gate: PhaseGateGrantedResponse
    Gate-->>Runner: START gate released
    Runner->>Issuer: start strategy.execute_phase()
    Issuer->>Workers: publish credits for this phase
    Workers-->>Runner: return credit results
    Runner->>Runner: wait for sends complete, then returns drain
    Runner->>Gate: after_phase(phase_id, phase_name)
    Gate->>Controller: PhaseEndGateCommand
    Controller->>Coord: gate_phase(kind=END)
    Coord->>Collectors: PhaseBaselineRequestMessage(kind=END)
    Collectors-->>Controller: PhaseBaselineAckMessage(success=True)
    Controller-->>Gate: PhaseGateGrantedResponse
    Gate-->>Runner: END gate released
    Runner->>Runner: phase transition may complete
```

## TimingManager phase flow

```mermaid
flowchart TD
    A[PhaseRunner starts phase] --> B[Generate phase_id and phase_name]
    B --> C[PhaseGateClient.before_phase]
    C --> D[SystemController handles PHASE_START_GATE]
    D --> E[BaselineCoordinator broadcasts START request]
    E --> F[Collectors scrape START baseline]
    F --> G[Collectors publish START ACKs]
    G --> H[SystemController returns PhaseGateGrantedResponse]
    H --> I[PhaseRunner issues credits]
    I --> J[PhaseRunner waits for phase completion and returns]
    J --> K[PhaseGateClient.after_phase]
    K --> L[SystemController handles PHASE_END_GATE]
    L --> M[BaselineCoordinator broadcasts END request]
    M --> N[Collectors scrape END baseline]
    N --> O[Collectors publish END ACKs]
    O --> P[SystemController returns PhaseGateGrantedResponse]
    P --> Q[PhaseRunner completes phase transition]
```
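
The gate-and-ACK flow in the diagrams above can be sketched in a few lines of asyncio. This is a minimal illustration under stated assumptions, not the PR's implementation: `SketchBaselineCoordinator`, `run_one_phase`, the collector ids `"gpu"` and `"server"`, and the `phase_id` value are all hypothetical, and ACKs are delivered by a direct method call rather than over the message bus.

```python
import asyncio
from enum import Enum


class BaselineKind(str, Enum):
    """Local stand-in for the PR's BaselineKind enum."""

    START = "start"
    END = "end"


class SketchBaselineCoordinator:
    """Hypothetical coordinator: a gate stays closed until every collector ACKs."""

    def __init__(self, collector_ids: set[str]) -> None:
        self._registered = set(collector_ids)
        # (phase_id, kind) -> (collectors still pending, event set when none remain)
        self._pending: dict[tuple[str, BaselineKind], tuple[set[str], asyncio.Event]] = {}

    async def gate_phase(self, phase_id: str, kind: BaselineKind) -> None:
        """Open a gate and block until all registered collectors have ACKed."""
        released = asyncio.Event()
        pending = set(self._registered)
        self._pending[(phase_id, kind)] = (pending, released)
        if not pending:
            released.set()  # nothing registered, so nothing to wait for
        await released.wait()
        del self._pending[(phase_id, kind)]

    def handle_ack(self, phase_id: str, kind: BaselineKind, collector_id: str) -> None:
        """Record one ACK; release the gate once the pending set is empty."""
        entry = self._pending.get((phase_id, kind))
        if entry is None:
            return  # late or unknown ACK; ignored in this sketch
        pending, released = entry
        pending.discard(collector_id)
        if not pending:
            released.set()


async def run_one_phase() -> list[str]:
    """Drive one phase through the START and END gates, logging the order."""
    events: list[str] = []
    coord = SketchBaselineCoordinator({"gpu", "server"})

    async def collectors_ack(kind: BaselineKind) -> None:
        # Stands in for "scrape once, then publish PhaseBaselineAckMessage".
        for collector_id in ("gpu", "server"):
            events.append(f"{kind.value}:ack:{collector_id}")
            coord.handle_ack("phase-1", kind, collector_id)

    for kind in (BaselineKind.START, BaselineKind.END):
        ack_task = asyncio.create_task(collectors_ack(kind))
        await coord.gate_phase("phase-1", kind)
        await ack_task
        events.append(f"{kind.value}:released")
        if kind is BaselineKind.START:
            events.append("credits issued and drained")
    return events


if __name__ == "__main__":
    for line in asyncio.run(run_one_phase()):
        print(line)
```

The point of the sketch is the ordering guarantee: credits are only issued after the START gate has collected every ACK, and the phase transition only completes after the END gate has done the same. A real coordinator would likely also need a timeout and a policy for `success=False` ACKs, which this sketch leaves out.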
@@ -0,0 +1,45 @@

```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Enums for the phase baseline handshake.

BaselineKind tags whether a baseline reading is taken before a phase starts
issuing credits (START) or after credits have drained (END).

ServiceCapability is a generic capability tag advertised by services in their
RegisterServiceCommand; SystemController dispatches based on membership
(e.g. BASELINE_COLLECTOR services join the BaselineCoordinator's registered set).
"""

from aiperf.common.enums.base_enums import CaseInsensitiveStrEnum


class BaselineKind(CaseInsensitiveStrEnum):
    """Direction of a baseline reading relative to a phase."""

    START = "start"
    END = "end"


class ServiceCapability(CaseInsensitiveStrEnum):
    """Capability tags a service may advertise at registration time."""

    BASELINE_COLLECTOR = "baseline_collector"
    RESULT_PRODUCER = "result_producer"


_RESULT_PRODUCER_PREFIX = f"{ServiceCapability.RESULT_PRODUCER}:"


def make_result_producer_capability(domain: str) -> str:
    """Build a result-producer capability tag for a result domain."""

    return f"{_RESULT_PRODUCER_PREFIX}{domain}"


def parse_result_producer_capability(capability: str) -> str | None:
    """Return the result domain if capability is a result-producer tag."""

    if not capability.startswith(_RESULT_PRODUCER_PREFIX):
        return None
    domain = capability.removeprefix(_RESULT_PRODUCER_PREFIX)
    return domain or None
```
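
To illustrate how membership-based dispatch on these tags might look, here is a small self-contained sketch. It does not import aiperf: the prefix is written as the literal `"result_producer:"` on the assumption that the enum member formats to its value, and `classify_services`, the registration dict shape, and the service ids are hypothetical names introduced only for this example.

```python
from __future__ import annotations

# Assumed literal forms of the capability tags defined in the module above.
BASELINE_COLLECTOR = "baseline_collector"
RESULT_PRODUCER_PREFIX = "result_producer:"


def parse_result_producer_capability(capability: str) -> str | None:
    """Same logic as the PR's helper, restated so the sketch runs on its own."""
    if not capability.startswith(RESULT_PRODUCER_PREFIX):
        return None
    return capability.removeprefix(RESULT_PRODUCER_PREFIX) or None


def classify_services(
    registrations: dict[str, list[str]],
) -> tuple[set[str], dict[str, set[str]]]:
    """Group hypothetical service ids by the capabilities they advertise."""
    baseline_collectors: set[str] = set()
    producers_by_domain: dict[str, set[str]] = {}
    for service_id, capabilities in registrations.items():
        for capability in capabilities:
            if capability == BASELINE_COLLECTOR:
                baseline_collectors.add(service_id)
                continue
            domain = parse_result_producer_capability(capability)
            if domain is not None:
                producers_by_domain.setdefault(domain, set()).add(service_id)
    return baseline_collectors, producers_by_domain


if __name__ == "__main__":
    collectors, producers = classify_services(
        {
            "gpu_telemetry": ["baseline_collector", "result_producer:gpu"],
            "records": ["result_producer:records"],
            "worker": [],
        }
    )
    print(sorted(collectors), {k: sorted(v) for k, v in producers.items()})
```

Note that a tag with an empty domain (`"result_producer:"`) parses to `None` and is therefore silently dropped by this kind of dispatch.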
Reject empty result-producer domains at construction.

`make_result_producer_capability("")` currently produces a tag that `parse_result_producer_capability(...)` cannot round-trip (it returns `None`). Guarding this early avoids silent invalid capabilities.

Suggested fix

```diff
 def make_result_producer_capability(domain: str) -> str:
     """Build a result-producer capability tag for a result domain."""
-
-    return f"{_RESULT_PRODUCER_PREFIX}{domain}"
+    if not domain:
+        raise ValueError("domain must be non-empty")
+    return f"{_RESULT_PRODUCER_PREFIX}{domain}"
```
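
The round-trip property behind this suggestion can be checked with a short standalone sketch. It restates both helpers with the suggested guard applied and uses a literal prefix instead of the aiperf enum; the literal prefix and the `round_trips` helper are assumptions made for illustration only.

```python
from __future__ import annotations

# Assumed literal value of _RESULT_PRODUCER_PREFIX from the module under review.
_RESULT_PRODUCER_PREFIX = "result_producer:"


def make_result_producer_capability(domain: str) -> str:
    """Build a tag, rejecting the empty domain as the review suggests."""
    if not domain:
        raise ValueError("domain must be non-empty")
    return f"{_RESULT_PRODUCER_PREFIX}{domain}"


def parse_result_producer_capability(capability: str) -> str | None:
    """Return the result domain if capability is a result-producer tag."""
    if not capability.startswith(_RESULT_PRODUCER_PREFIX):
        return None
    return capability.removeprefix(_RESULT_PRODUCER_PREFIX) or None


def round_trips(domain: str) -> bool:
    """True when parsing a freshly built tag recovers the original domain."""
    return parse_result_producer_capability(make_result_producer_capability(domain)) == domain


if __name__ == "__main__":
    print(round_trips("gpu"))
    try:
        make_result_producer_capability("")
    except ValueError as exc:
        print(f"rejected: {exc}")
```

With the guard in place, every tag that `make_result_producer_capability` can return parses back to its domain, so the empty-domain case fails loudly at construction instead of surfacing later as `None`.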