42 changes: 39 additions & 3 deletions cvs/input/config_file/preflight/README_preflight_config.md
@@ -4,12 +4,14 @@ This document explains how to configure the GPU cluster preflight checks system.

## Overview

The preflight checks system validates essential cluster health before running performance tests like IB performance tests, RCCL training, and inference workloads. It performs four key validations:
The preflight checks system validates essential cluster health before running performance tests like IB performance tests, RCCL training, and inference workloads. It performs the following validations:

1. **GID Consistency** - Ensures RDMA interfaces have valid GID entries
2. **RDMA Connectivity** - Tests node-to-node RDMA communication using ibv_rc_pingpong
3. **ROCm Version Consistency** - Verifies consistent ROCm versions across nodes
4. **Interface Name Consistency** - Validates RDMA interface naming patterns
5. **IFoE L2 Connectivity (AIMVT-180; opt-in)** - Runs `afmctl test ping`
on each node and enforces per-port and Summary pass/fail accounting

## Configuration File Structure

@@ -55,8 +57,8 @@ preflight/
├── debug/ # Debug and troubleshooting options
├── node_check/ # Individual node validation parameters
├── connectivity_check/ # Inter-node connectivity tests
│ ├── rdma/ # RDMA-specific parameters (including nodes_per_full_mesh_group)
│ └── ifoe/ # (Future: IFoE parameters)
│ ├── rdma/ # RDMA-specific parameters (including nodes_per_full_mesh_group)
│ └── ifoe/ # IFoE L2 ping parameters (AIMVT-180; opt-in)
└── reporting/ # Output and report generation
```

@@ -163,6 +165,40 @@ All parameters below are optional and have sensible defaults. The sample configu
- Legacy hint for reporting: preflight now prunes interface/GID-failed nodes automatically
- Interface failures are excluded from mesh testing regardless of this flag

#### IFoE Settings (`connectivity_check.ifoe`) — opt-in (AIMVT-180)

Runs `afmctl test ping` on each reachable node and validates the per-port
pass/fail table plus the aggregate `Summary:` block in afmctl's output.

- **`connectivity_mode`** (default: `"skip"`)
- `"run"` — execute the L2 ping on every reachable node
- `"skip"` — preflight records a SKIPPED result and does not invoke afmctl
- **`afmctl_path`** (default: `"afmctl"`)
- Absolute path or PATH-resolved binary name on each node
- **`use_sudo`** (default: `false`)
- Prepend `sudo` to the afmctl invocation when the cluster image requires root
- **`bdf_discovery`** (default: `"auto"`)
- `"auto"` — run `afmctl show device` on each node and use the reported BDFs
- `"config"` — use only the `bdfs` list below; nodes with no matching BDFs FAIL
- **`bdfs`** (default: `[]`)
- Optional explicit list of accelerator BDFs to test on every node
- Example: `["0001:01:00.1"]`
- **`dst_accelerators`** (default: `[0]`)
- One afmctl invocation is issued per `(bdf, dst_accelerator)` combination
- **`ports`** (default: `"all"`)
- `"all"` (omit `-p`), a string like `"0-7"` or `"0,1,2"`, or a list `[0, 1, 2]`
- **`pings_per_port`** (default: `1`)
- Passed to afmctl as `-c <count>`
- **`per_ping_timeout`** (default: `null`)
- Optional afmctl `-t <seconds>` value; omitted when `null`
- **`traffic_types`** (default: `["ifoe_req", "ifoe_resp", "non_ifoe"]`)
- Determines which afmctl traffic categories are required to pass
- When all three are selected, `--traffic-type` is omitted so afmctl runs them all
- **`loss_threshold_pct`** (default: `0.0`)
- Maximum tolerated loss percentage per traffic type (Summary line)
- **`ssh_timeout`** (default: `180`)
- Per-invocation SSH timeout (seconds); raise for high `pings_per_port`
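
As a rough illustration of how the settings above combine, the sketch below plans one entry per `(bdf, dst_accelerator)` pair and assembles only the flags named in this section (`--dst-accelerator`, `-p`, `-c`, `-t`). It is not the preflight implementation: `normalize_ports` and `plan_invocations` are hypothetical helper names, the BDF is kept as a separate field because this section does not say how afmctl receives it, and the handling of `bdf_discovery` is a simplifying assumption.

```python
from itertools import product

# Traffic categories named in the settings list above.
ALL_TRAFFIC_TYPES = {"ifoe_req", "ifoe_resp", "non_ifoe"}


def normalize_ports(ports):
    """Return the value for afmctl's -p flag, or None when -p is omitted."""
    if ports == "all":
        return None
    if isinstance(ports, (list, tuple)):
        return ",".join(str(p) for p in ports)
    return str(ports)  # e.g. "0-7" or "0,1,2" is passed through unchanged


def plan_invocations(cfg, discovered_bdfs):
    """Hypothetical helper: one plan entry per (bdf, dst_accelerator) pair."""
    if cfg.get("connectivity_mode", "skip") != "run":
        return []  # "skip": afmctl is never invoked

    # Assumption: "config" uses only the explicit list, "auto" uses discovery.
    if cfg.get("bdf_discovery", "auto") == "config":
        bdfs = cfg.get("bdfs", [])
    else:
        bdfs = discovered_bdfs

    prefix = ["sudo"] if cfg.get("use_sudo", False) else []
    prefix += [cfg.get("afmctl_path", "afmctl"), "test", "ping"]

    shared = ["-c", str(cfg.get("pings_per_port", 1))]
    timeout = cfg.get("per_ping_timeout")
    if timeout is not None:
        shared += ["-t", str(timeout)]
    port_arg = normalize_ports(cfg.get("ports", "all"))
    if port_arg is not None:
        shared += ["-p", port_arg]

    selected = set(cfg.get("traffic_types", ALL_TRAFFIC_TYPES))
    return [
        {
            "bdf": bdf,  # kept separate: the BDF flag is not documented here
            "command_prefix": prefix,
            "flags": ["--dst-accelerator", str(dst)] + shared,
            "omit_traffic_type": selected == ALL_TRAFFIC_TYPES,
        }
        for bdf, dst in product(bdfs, cfg.get("dst_accelerators", [0]))
    ]
```

With two discovered BDFs and `dst_accelerators` set to `[0, 1]`, this plan contains four entries, which is a quick way to estimate how long a node will take before raising `ssh_timeout`.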

### Reporting Settings (`reporting`)

- **`generate_html_report`** (default: "true")
41 changes: 41 additions & 0 deletions cvs/input/config_file/preflight/preflight_config.json
@@ -48,6 +48,47 @@

"exclude_failed_interface_nodes": "true",
"_comment_exclude_failed_interface_nodes": "Legacy hint for reporting: preflight now prunes interface/GID-failed nodes from the SSH host list automatically. Interface failures are excluded from mesh testing regardless of this flag."
},

"ifoe": {
"_comment": "IFoE L2 connectivity testing via 'afmctl test ping' (AIMVT-180). Opt-in: defaults to 'skip'.",

"connectivity_mode": "skip",
"_comment_connectivity_mode": "IFoE L2 ping mode. Options: 'run' (execute afmctl L2 ping on every reachable node) or 'skip' (default; preflight will not invoke afmctl). Enable once afmctl and the IFoE driver are available on every node.",

"afmctl_path": "afmctl",
"_comment_afmctl_path": "Absolute path or PATH-resolved name of the afmctl binary on each cluster node. Examples: 'afmctl', '/usr/local/bin/afmctl'.",

"use_sudo": false,
"_comment_use_sudo": "When true, afmctl is invoked with sudo. Enable if afmctl needs root on the cluster image.",

"bdf_discovery": "auto",
"_comment_bdf_discovery": "How to determine which accelerator BDFs to ping on each node. 'auto' runs 'afmctl show device' on each node and uses the BDFs reported there. 'config' uses only the explicit 'bdfs' list below.",

"_example_bdfs": ["0001:01:00.1"],
"bdfs": [],
"_comment_bdfs": "Optional explicit list of accelerator BDFs (e.g. ['0001:01:00.1']) shared across the cluster. Leave empty to defer to bdf_discovery='auto'.",

"dst_accelerators": [0],
"_comment_dst_accelerators": "Destination accelerator IDs passed to --dst-accelerator. One afmctl ping is issued per (bdf, dst_accelerator). Use [0] for a single-accelerator destination; use a list like [0, 1] to ping multiple peers.",

"ports": "all",
"_comment_ports": "Ports passed to -p. Use 'all' (default; omits -p so afmctl tests every port), a string like '0,1,2' or '0-7', or a list [0,1,2].",

"pings_per_port": 1,
"_comment_pings_per_port": "Value for -c (pings per port pair). Larger values smooth over transient losses.",

"per_ping_timeout": null,
"_comment_per_ping_timeout": "Optional value for afmctl's -t flag (per-ping timeout). Leave null to use afmctl's default.",

"traffic_types": ["ifoe_req", "ifoe_resp", "non_ifoe"],
"_comment_traffic_types": "Traffic categories to enforce when evaluating PASS/FAIL. Maps to afmctl's --traffic-type (request, response, non-ifoe). When all three are selected (default) --traffic-type is omitted so afmctl exercises every category.",

"loss_threshold_pct": 0.0,
"_comment_loss_threshold_pct": "Maximum tolerated packet loss percentage per traffic type. Defaults to 0.0 (any failure marks the node as FAIL).",

"ssh_timeout": 180,
"_comment_ssh_timeout": "Overall SSH timeout (seconds) for each afmctl invocation. Increase for large port counts or high pings_per_port values."
}
},
