Add IFoE L2 connectivity preflight check [AIMVT-180] #188
Merged
Adds an opt-in preflight check that validates IFoE L2 reachability by invoking `afmctl test ping` on every reachable cluster node and parsing the per-port pass/fail table plus the aggregate `Summary:` block from afmctl's output. The check is disabled by default (`connectivity_check.ifoe.connectivity_mode = "skip"`) so it has no effect on clusters that don't use IFoE.

Changes:
- New module `cvs/lib/preflight/ifoe_l2_connectivity.py` with `IfoeL2ConnectivityCheck` (orchestration via parallel SSH) and `AfmctlPingParser` / `parse_afmctl_show_device` (output parsers).
- New pytest entry `test_ifoe_l2_connectivity` wired into `cvs/tests/preflight/preflight_checks.py` between the interface presence check and the existing RDMA connectivity check.
- New `connectivity_check.ifoe` config block in `preflight_config.json` (afmctl_path, bdf_discovery, dst_accelerators, ports, pings_per_port, per_ping_timeout, traffic_types, loss_threshold_pct, ssh_timeout, use_sudo).
- Executive-summary + dedicated HTML section in `cvs/lib/preflight/report.py` with a per-node breakdown of failing invocations, failed ports, and the raw afmctl output for debugging.
- 20 unit tests covering the parser (pass/fail/partial-loss/garbage inputs, multi-device `afmctl show device`) and the orchestrator (command rendering, sudo, port specs, traffic-type subsetting, auto-discovery, multiple dst-accelerators, threshold logic).
- README updates for both `tests/preflight/` and the config guide.

Configuration:
A node is marked FAIL when any enabled traffic type exceeds `loss_threshold_pct` (default 0.0) or any per-port table row reports FAIL. Multiple destination accelerators issue one invocation each via `(bdf, dst_accelerator)` pairing. When `bdfs` is empty and `bdf_discovery` is `"auto"` (default), BDFs are discovered per-node via `afmctl show device`.

Test plan:
- `python3 -m unittest cvs.lib.preflight.unittests.test_ifoe_l2_connectivity` -> 20/20 pass.
- `preflight_config.json` validates cleanly with `python -m json.tool`.
- Default `connectivity_mode = "skip"` keeps existing preflight runs unchanged on clusters that don't enable IFoE.

Co-authored-by: Cursor <cursoragent@cursor.com>
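The FAIL rule stated under Configuration can be sketched in a few lines of Python. This is only an illustration of the rule as described, not the code in `ifoe_l2_connectivity.py`: the helper name `node_verdict` and the input shapes (a dict of per-traffic-type loss percentages taken from the `Summary:` block, plus a list of per-port row statuses) are assumptions made for the example, as is treating a traffic type missing from the summary as total loss.

```python
from typing import Iterable, Mapping


def node_verdict(
    summary_loss_pct: Mapping[str, float],
    port_row_statuses: Iterable[str],
    enabled_traffic_types: Iterable[str],
    loss_threshold_pct: float = 0.0,
) -> str:
    """Return "PASS" or "FAIL" for one node (hypothetical helper)."""
    # Rule 1: any enabled traffic type whose loss exceeds the threshold fails the node.
    for traffic_type in enabled_traffic_types:
        # Assumption: a traffic type absent from the summary counts as 100% loss.
        if summary_loss_pct.get(traffic_type, 100.0) > loss_threshold_pct:
            return "FAIL"
    # Rule 2: any per-port table row that reports FAIL fails the node,
    # regardless of the loss threshold.
    if any(status.upper() == "FAIL" for status in port_row_statuses):
        return "FAIL"
    return "PASS"
```

With the default threshold of 0.0 any loss on an enabled traffic type fails the node. Raising `loss_threshold_pct` tolerates loss in the summary, but a FAIL row in the per-port table still fails the node, and loss on a traffic type that is not enabled is ignored.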
cijohnson approved these changes on Jun 2, 2026
amd-droy pushed a commit that referenced this pull request on Jun 25, 2026
* Add IFoE L2 connectivity preflight check [AIMVT-180]
* fmt, lint and tests

Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Adds an opt-in preflight check that validates IFoE L2 reachability across the cluster by running `afmctl test ping` on every reachable node and parsing the per-port pass/fail table plus the aggregate `Summary:` block from afmctl's output. Disabled by default, so existing preflight runs on clusters that don't use IFoE are unaffected.

Motivation
The existing preflight covers RDMA connectivity (GID consistency, RDMA interface presence, `ibv_rc_pingpong`) but doesn't exercise the IFoE-based fabric path. AIMVT-180 fills that gap with a fast L2-level gate before more expensive downstream tests.

Technical Details
New module: `cvs/lib/preflight/ifoe_l2_connectivity.py`
- `IfoeL2ConnectivityCheck`: orchestrates one `afmctl test ping` invocation per `(bdf, dst_accelerator)` pairing on every reachable node via parallel SSH. No pairwise client/server coordination is needed; `afmctl` drives the request/response state machine in the device.
- `AfmctlPingParser`: tolerant parser for the per-port table (one row per `(local_accel, port#)`, three traffic-type columns) and the aggregate `Summary:` block.
- `parse_afmctl_show_device`: parses `afmctl show device` for BDF auto-discovery.

New pytest entry: `cvs/tests/preflight/preflight_checks.py`
- `test_ifoe_l2_connectivity` is wired in between the interface-presence check and the existing RDMA connectivity check. Opt-in via `connectivity_check.ifoe.connectivity_mode` (`"run"` or `"skip"`; default `"skip"`).

New config block: `cvs/input/config_file/preflight/preflight_config.json`
- `connectivity_check.ifoe`: `afmctl_path`, `use_sudo`, `bdf_discovery` (`auto` / `config`), `bdfs`, `dst_accelerators`, `ports`, `pings_per_port`, `per_ping_timeout`, `traffic_types` (`ifoe_req` / `ifoe_resp` / `non_ifoe`), `loss_threshold_pct`, `ssh_timeout`. All keys have inline `_comment_*` docs.

Reporting: `cvs/lib/preflight/report.py`
- Executive-summary entry plus a dedicated HTML section with a per-node breakdown of failing invocations, failed ports, and the raw afmctl output for debugging.

PASS/FAIL logic
- A node is marked FAIL when any enabled traffic type exceeds `loss_threshold_pct` (default `0.0`) or any per-port row reports `FAIL`.

Documentation
`cvs/tests/preflight/README.md` and `cvs/input/config_file/preflight/README_preflight_config.md` updated with the new check, configuration reference, and example config block.

Test Plan
- `python3 -m unittest cvs.lib.preflight.unittests.test_ifoe_l2_connectivity` → 20/20 pass. Coverage:
  - Parser: pass/fail/partial-loss/garbage inputs, multi-device `afmctl show device`.
  - Orchestrator: command rendering (sudo, port specs, `-t` timeout, `--traffic-type` subsetting and aliases), passing/failing runs, loss-threshold leniency, auto-discovery fallback, multiple destination accelerators.
- `preflight_config.json` parses cleanly with `python -m json.tool`.
- Default `connectivity_mode = "skip"` means no behavioral change for existing clusters; the new pytest entry records a SKIPPED result and returns immediately without contacting nodes.

Out of Scope
- `afmctl` install: operators are expected to install these and apply fabric config before flipping `connectivity_mode` to `"run"`.

Refs: AIMVT-180
Made with Cursor
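As a rough sketch of how the `connectivity_check.ifoe` keys described above could drive the check on one node, the snippet below plans the `(bdf, dst_accelerator)` invocations. It is not the PR's implementation: `plan_invocations` is a hypothetical helper, the BDF strings in the usage example are made-up sample values, reading the pairing as a full cross product is an assumption, and so is raising `ValueError` when no BDFs are available.

```python
from itertools import product
from typing import Any, Dict, List, Sequence, Tuple


def plan_invocations(
    ifoe_cfg: Dict[str, Any],
    discovered_bdfs: Sequence[str] = (),
) -> List[Tuple[str, str]]:
    """List the (bdf, dst_accelerator) pairs to ping on one node (hypothetical helper)."""
    # The check is opt-in: anything other than "run" plans no work at all.
    if ifoe_cfg.get("connectivity_mode", "skip") != "run":
        return []
    bdfs = list(ifoe_cfg.get("bdfs") or [])
    # Empty `bdfs` with bdf_discovery == "auto" falls back to the BDFs
    # discovered on the node (e.g. parsed from `afmctl show device`).
    if not bdfs and ifoe_cfg.get("bdf_discovery", "auto") == "auto":
        bdfs = list(discovered_bdfs)
    if not bdfs:
        # Assumption: a run with no BDFs is treated as a configuration error.
        raise ValueError("no BDFs configured or discovered")
    dst_accelerators = [str(d) for d in ifoe_cfg.get("dst_accelerators") or []]
    # Assumption: every BDF is paired with every destination accelerator.
    return list(product(bdfs, dst_accelerators))


# Usage with sample values: one discovered BDF, two destination accelerators.
cfg = {"connectivity_mode": "run", "bdfs": [], "bdf_discovery": "auto",
       "dst_accelerators": [1, 2]}
pairs = plan_invocations(cfg, discovered_bdfs=["0000:05:00.0"])
```

Each returned pair would correspond to one `afmctl test ping` invocation; the actual command-line rendering is left out here because the flag set is not spelled out in this description.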