| Test Domain | Function | Description | Example | Test ID | Req ID | GH | Labels | Notes | Priority | Issues | Dependencies | Milestone | Release | Test Cases | Status | Justification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Foundational Services |
Image Registry |
Stores and distributes boot images, VM images, containers, and deployment artifacts. Acts as a trusted source for all deployable assets. |
GCP Artifact Registry + Images |
BOOT01 |
image_registry, min_req |
P0 |
LimitedEnv |
M0 |
CRUD a custom OS image (ISO file): upload an ISO file or reference a remote URL, whichever is supported. |
approved |
||||||
BOOT01 |
image_registry |
P0 |
LimitedEnv |
M1 |
CRUD an OS install configuration (e.g. iPXE config like Carbide) |
approved |
||||||||||
BOOT03 |
min_req |
P0 |
LimitedEnv |
M0 |
CRUD a custom VM image |
approved |
||||||||||
BOOT03 |
image_registry, min_req |
P0 |
M3 |
CRUD (get, list, create, delete) custom OS images for fast boot (raw, qcow2, img, etc) for BM/VM |
approved |
|||||||||||
BOOT01 |
image_registry, min_req |
P0 |
LimitedEnv |
M1 |
Check that an OS image can be installed on a BMaaS system |
approved |
||||||||||
BOOT01 |
image_registry, min_req |
P0 |
LimitedEnv |
M1 |
Check that an OS install configuration can be installed on a BMaaS system |
approved |
||||||||||
BOOT01 |
image_registry, min_req |
P0 |
LimitedEnv |
M0 |
Check that a VM image can be installed onto a VMaaS system |
approved |
||||||||||
IMG01 |
P1 |
M4 |
Verify that an OS image (iso file) with full NVIDIA support is readily available |
approved |
||||||||||||
IMG02 |
P1 |
M4 |
Verify that an OS install configuration (e.g. iPXE config like Carbide) with full NVIDIA support is readily available |
approved |
||||||||||||
IMG03 |
P1 |
M4 |
Verify that a VM image with full NVIDIA support is readily available |
approved |
||||||||||||
Key Secret Mgmt |
Securely stores and manages cryptographic keys, secrets, and certificates. Supports key rotation. |
AWS Secrets Manager / KMS |
AUTH01 |
P1 |
LimitedEnv |
M4 |
CRUD an SSH/Teleport compatible public key for system login |
approved |
||||||||
AUTH02 |
vm |
P1 |
M5 |
Spin up an instance which uses a specified key |
approved |
|||||||||||
AUTH03 |
P1 |
M5 |
Access other components via a specified key where possible (SOL, network devices) |
approved |
||||||||||||
Identity and Access Mgmt (IAM) |
Manages users, service identities, roles, and permissions across tenants and services. Forms the security foundation of the platform. |
AWS: IAM |
IAM01 |
iam, min_req |
P0 |
LimitedEnv |
M0 |
Create user and log in as that user |
approved |
|||||||
IAM02 |
P0 |
LimitedEnv |
M0 |
Delete user |
approved |
|||||||||||
IAM03 |
iam |
P1 |
LimitedEnv |
M5 |
Validate that a user created can access the API and authorized resources |
approved |
||||||||||
DDI |
Core network services for name resolution, IP allocation, and address lifecycle management |
GCP Cloud DNS + IPAM |
IPAM01 |
network, ssh |
P0 |
MiniCloud |
M4 |
Check that IP addresses are managed by the platform, and that DHCP is able to provide IP addresses |
approved |
|||||||
IPAM02 |
network |
P0 |
MiniCloud |
M4 |
Check that IP addresses are managed sensibly as VPCs are configured (Move to SDN section?) |
approved |
||||||||||
Network Underlay & Mgmt |
Brings up the physical network fabric for Day 0, base configuration, and network OS installation. Manages switch OS versions, compliance, and safe upgrades over time. Provides base network that overlays and tenants are built upon. |
Netris |
NETMGMT01 |
P0 |
MiniCloud |
M4 |
Verify that all switches are reachable and healthy (SSH, API, etc.) |
pending |
||||||||
Resource Database |
Resource inventory for usable resources within the data center |
RESDB01 |
P1 |
LimitedEnv |
M4 |
Verify that all resources assigned to the cluster are accounted for |
pending |
|||||||||
Infrastructure and Data Services |
Control Plane accessible |
Control plane can be accessed |
CP03 |
P0 |
LimitedEnv |
M0 |
Make sure we can ping the control plane, or get a heartbeat/status code |
approved |
||||||||
CP04 |
P2 |
LimitedEnv |
M5 |
Control Plane can be upgraded (measure impact) |
approved |
|||||||||||
CP05 |
control_plane, iam |
P0 |
LimitedEnv |
M0 |
Access keys can be created, received, and used to log in |
approved |
||||||||||
CP06 |
control_plane, iam, min_req |
P0 |
LimitedEnv |
M0 |
Access keys can expire or be disabled |
approved |
||||||||||
CP07 |
control_plane, iam |
P0 |
LimitedEnv |
M0 |
Create tenants |
approved |
||||||||||
CP08 |
control_plane, iam |
P0 |
LimitedEnv |
M0 |
Retrieve list of tenants |
approved |
||||||||||
CP09 |
control_plane, iam |
P0 |
LimitedEnv |
M0 |
Retrieve info about individual tenant |
approved |
||||||||||
CP10 |
control_plane |
P0 |
LimitedEnv |
M0 |
Delete Tenant |
approved |
||||||||||
CP11 |
min_req |
P0 |
LimitedEnv |
M0 |
Add user to tenant |
approved |
||||||||||
Hardware ingestion |
Makes sure that all hardware under test has been ingested |
HWING01 |
bare_metal, ingestion |
P1 (check w/ ISVs) |
MiniCloud |
M4 |
Makes sure that all hardware under test has been ingested, and matches the provided hardware |
approved |
||||||||
HWING02 |
P1 |
MiniCloud |
M4 |
Make sure that a device can be removed from the system |
approved |
|||||||||||
Attestation |
SEC22 |
attestation, bare_metal, security |
P1 |
MiniCloud |
M5 |
Check that hardware passes nonce check |
approved |
|||||||||
ATTEST01 |
P1 |
MiniCloud |
M5 |
Check that OS is approved |
approved |
|||||||||||
SEC22 |
P1 |
MiniCloud |
M5 |
Check that TPM Configuration can be set per host |
approved |
|||||||||||
ATTEST02 |
P1 |
MiniCloud |
M5 |
Check that an updated BIOS is installed on all hardware |
approved |
|||||||||||
CNP09 |
min_req |
P1 |
MiniCloud |
M4 |
Check that the updated firmware is installed |
approved |
||||||||||
CNP09 |
attestation, bare_metal, firmware, min_req, security |
P1 |
M5 |
Verify all firmware is cryptographically signed and attested during boot |
pending |
|||||||||||
SEC22 |
REVIEW |
TPM Secure Boot in BIOS |
pending |
|||||||||||||
Compute Services: BMaaS |
Supports the basic lifecycle of Bare Metal nodes |
AWS EC2/Forge |
CNP01 |
min_req |
P0 |
LimitedEnv |
M1 |
Create a Bare Metal node |
approved |
|||||||
CNP01, CNP04 |
min_req |
P0 |
M5 |
Supports the basic lifecycle of Bare Metal nodes, following the standard CRUD pattern |
approved |
|||||||||||
CNP01 |
min_req |
P0 |
LimitedEnv |
M1 |
Delete a Bare Metal node |
approved |
||||||||||
SEC21 |
bare_metal, min_req, security |
P0 |
LimitedEnv |
M1 |
Confirm sanitization on delete - Tenant View |
approved |
||||||||||
SEC21 |
bare_metal, disk, min_req, sanitization, security |
P0 |
MiniCloud |
M4 |
Confirm sanitization on delete - Low Level |
approved |
||||||||||
CNP01 |
bare_metal, min_req |
P0 |
MiniCloud |
M3 |
Topology-based placement |
approved |
||||||||||
BMAAS01 |
REVIEW |
Node attached storage |
pending |
|||||||||||||
BMAAS02 |
REVIEW |
Storage: Config/default/access to local NVME drives |
pending |
|||||||||||||
Once created, nodes can be accessed for usage, and have the high level support controls of reboot, reinstall. |
AWS EC2/Forge |
BMAAS03 |
bare_metal, ssh |
P0 |
LimitedEnv |
M1 |
A running node can be accessed via SSH/Teleport for further configuration (Bare Metal) |
approved |
||||||||
CNP01 |
bare_metal, min_req |
P0 |
M1 |
A running node can be rebooted in case of issue (Bare Metal) |
approved |
|||||||||||
CNP01 |
bare_metal, min_req |
P0 |
MiniCloud |
M4 |
A running node can be power-cycled in case of issue |
approved |
||||||||||
BMAAS04 |
P0 |
LimitedEnv |
M1 |
DEPRECATED: A node can be reinstalled from its configured stock operating system |
approved |
|||||||||||
CNP01 |
bare_metal, min_req |
MinAPI |
P0 |
MiniCloud |
M3 |
A running node can be powered off (Without being destroyed) |
approved |
|||||||||
BOOT02 |
bare_metal, min_req, ssh |
MinAPI |
P0 |
M3 |
Support for cloud-init and instance metadata discovery via link-local addresses (e.g. 169.254.169.254) or virtual devices (Bare Metal) |
approved |
||||||||||
CNP01 |
bare_metal, min_req |
MinAPI |
P0 |
MiniCloud |
M3 |
A powered off node can be powered back on |
approved |
|||||||||
CNP05 |
bare_metal, min_req |
MinAPI |
P0 |
LimitedEnv |
M3.5 |
Support for user-defined tags/labels and cloud-init metadata on instances. (Bare Metal) |
approved |
|||||||||
Nodes will be able to access expected resources on the local networks |
AWS EC2/Forge |
BMAAS05 |
bare_metal, dpu |
P0 |
MiniCloud |
M4 |
Check the health and status of the DPU on instantiation |
approved |
||||||||
BMAAS06 |
bare_metal, gpu, network, ssh |
P0 |
LimitedEnv |
M1 |
Nodes can communicate across Ethernet, InfiniBand, and NVLink (Bare Metal) |
approved |
||||||||||
Observability & Debug |
AWS EC2/Forge |
BMAAS07 |
bare_metal |
REVIEW |
P1 |
M5 |
Check for any per-host status log over time. |
approved |
||||||||
CNP06 |
bare_metal, min_req |
P0 |
LimitedEnv |
M3 |
Serial console access is required (read-only sufficient, interactive preferred) (Bare Metal) |
approved |
||||||||||
BMAAS08 |
bare_metal, gpu, ssh |
P0 |
LimitedEnv |
M1 |
Verify NVIDIA hardware |
approved |
||||||||||
BMAAS09 |
bare_metal, gpu, min_req, ssh, workload |
P0 |
LimitedEnv |
M1 |
GPU Stress workload (Bare Metal) |
approved |
||||||||||
BMAAS10 |
bare_metal, gpu, min_req, slow, ssh, workload |
P0 |
LimitedEnv |
M1 |
NIM inference jobs (Bare Metal) |
approved |
||||||||||
BMAAS11 |
bare_metal, gpu, min_req, ssh, workload |
P0 |
LimitedEnv |
M1 |
NCCL tests |
approved |
||||||||||
BMAAS12 |
bare_metal, gpu, min_req, ssh, workload |
P0 |
LimitedEnv |
M1 |
Training workload test |
approved |
||||||||||
CNP06 |
bare_metal, min_req |
P0 |
M5 |
Verify that serial console output is logged and queryable for at least 1 month of history |
pending |
|||||||||||
Stable Identifiers |
CNP08 |
bare_metal, min_req |
P0 |
M5 |
Verify that a BM node’s ID does not change after a reboot |
pending |
||||||||||
CNP08 |
min_req |
P0 |
M5 |
Verify that a BM node’s ID does not change after a maintenance/reprovision event |
pending |
|||||||||||
Compute Service: VMaaS |
Supports the basic lifecycle of VM nodes |
CNP01 |
min_req, vm |
P0 |
LimitedEnv |
M0 |
Create a VM node |
approved |
||||||||
CNP01, CNP04 |
min_req |
P0 |
LimitedEnv |
M2 |
Get info about a specific VM node, including if it is ready for use |
approved |
||||||||||
CNP01 |
min_req, vm |
P0 |
LimitedEnv |
M1 |
Get a list of all VM nodes in a VPC |
approved |
||||||||||
CNP01 |
min_req |
P0 |
LimitedEnv |
M0 |
Delete a VM node |
approved |
||||||||||
SEC21 |
min_req |
P0 |
LimitedEnv |
M0 |
Confirm sanitization on delete |
approved |
||||||||||
Once created, VM nodes can be accessed for usage, and have the high level support controls of reboot, reinstall |
VMAAS01 |
ssh, vm |
P0 |
LimitedEnv |
M0 |
A running node can be accessed via SSH/Teleport for further configuration (VM) |
approved |
|||||||||
CNP01 |
vm |
P0 |
LimitedEnv |
M0 |
A running node can be rebooted in case of issue (VM) |
approved |
||||||||||
VMAAS02 |
P0 |
LimitedEnv |
M1 |
DEPRECATED: A node can be reinstalled from its configured stock operating system |
pending |
|||||||||||
CNP01 |
min_req, vm |
MinAPI |
P0 |
LimitedEnv |
M3 |
A running VM can be stopped (Without being destroyed) |
approved |
|||||||||
CNP01 |
min_req, vm |
MinAPI |
P0 |
LimitedEnv |
M3 |
A powered off VM can be started |
approved |
|||||||||
CNP05 |
min_req, vm |
MinAPI |
P0 |
LimitedEnv |
M3 |
Support for user-defined tags/labels and cloud-init metadata on instances. (VM) |
approved |
|||||||||
BOOT02 |
min_req, ssh, vm |
P0 |
M3 |
Support for cloud-init and instance metadata discovery via link-local addresses (e.g. 169.254.169.254) or virtual devices (VM) |
approved |
|||||||||||
Quota should be enforced |
VMAAS03 |
P2 |
Noisy Neighbor CPU/GPU tests |
approved |
||||||||||||
VMAAS04 |
P2 |
Tenant cannot allocate more VMs than their quota allows |
approved |
|||||||||||||
The underlying system allows VMs to access the GPU with high performance, and VMs run with the correct parameters for optimal performance |
VMAAS05 |
ssh, vm |
P0 |
LimitedEnv |
M0 |
Check that the host OS is a known acceptable image (like Ubuntu/DGXOS) |
approved |
|||||||||
VMAAS06 |
gpu, ssh, vm |
P0 |
LimitedEnv |
M0 |
Check that the GPU is visible and accessible from the VM |
approved |
||||||||||
VMAAS07 |
gpu, ssh, vm |
P0 |
LimitedEnv |
M0 |
Check that the correct Linux kernel, libvirt, SBIOS, and NVIDIA drivers are installed |
approved |
||||||||||
VMAAS08 |
P0 |
LimitedEnv |
M1 |
Check that vEGM is configured correctly |
approved |
|||||||||||
VMAAS09 |
gpu, ssh, vm |
Does this include GPU passthrough? |
P0 |
LimitedEnv |
M0 |
Check that vCPU pinning is set correctly, PCI bus is configured correctly |
approved |
|||||||||
VMAAS10 |
P1 |
LimitedEnv |
M5 |
TODO: More things from the "NVIDIA Grace I/O Virtualization Guide" |
approved |
|||||||||||
Can run AI/ML Jobs |
VMAAS11 |
P0 |
LimitedEnv |
M0 |
GPU Stress workload (VM) |
approved |
||||||||||
VMAAS12 |
gpu, slow, ssh, vm, workload |
P0 |
LimitedEnv |
M1 |
NIM inference jobs (VM) |
approved |
||||||||||
VMAAS13 |
P0 |
M5 |
Run NCCL tests |
approved |
||||||||||||
Nodes will be able to access expected resources on the local networks |
AWS EC2/Forge |
VMAAS14 |
P0 |
LimitedEnv |
M5 |
Nodes can communicate across Ethernet, InfiniBand, and NVLink (VM) |
approved |
|||||||||
Observability & Debug |
CNP06 |
min_req, vm |
P0 |
M3 |
Serial console access is required (read-only sufficient, interactive preferred) (VM) |
approved |
||||||||||
Stable Identifiers |
CNP08 |
min_req, vm |
P0 |
M5 |
Verify that a VM’s ID does not change after a stop/start cycle |
pending |
||||||||||
CNP08 |
min_req |
P0 |
M5 |
Verify that a VM’s ID does not change after a live migration or maintenance event |
pending |
|||||||||||
VM Hardening |
CNP01 |
iam, min_req, security, vm |
P0 |
M5 |
Verify console access is restricted via RBAC |
pending |
||||||||||
CNP01 |
min_req, security, vm |
P1 |
M5 |
Verify USB, clipboard, and unnecessary virtual devices are disabled |
pending |
|||||||||||
SDN Controller |
Controls software-defined networking overlays and fabric behavior to match the compute-on-demand requirements. Enables network segmentation, isolation, dynamic network configuration, and security groups. |
SDN01 |
min_req, network |
P0 |
LimitedEnv |
M0 |
Create a VPC |
approved |
||||||||
SDN01 |
min_req, network |
P0 |
LimitedEnv |
M0 |
Retrieve a VPC |
approved |
||||||||||
SDN01 |
min_req, network |
P0 |
LimitedEnv |
M0 |
Update a VPC |
approved |
||||||||||
SDN01 |
min_req, network |
P0 |
LimitedEnv |
M0 |
Delete a VPC |
approved |
||||||||||
SDN04 |
P0 |
LimitedEnv |
M0 |
Verify that a VPC has an exclusive subnet |
approved |
|||||||||||
SDN04 |
min_req, network, security |
P0 |
LimitedEnv |
M0 |
Verify that nodes in two VPCs cannot communicate over the N/S (TAN) network |
approved |
||||||||||
SDN04 |
min_req |
P0 |
LimitedEnv |
M0 |
Verify that nodes in two VPCs cannot communicate over the E/W (CIN) network |
approved |
||||||||||
NET03, SDN01 |
min_req, network |
P0 |
M3 |
Support non-conflicting Bring-Your-Own-IP (including 7.0.0.0/8) |
approved |
|||||||||||
SDN11 |
min_req, network |
P0 |
M3 |
Support Stable Private IP allocations, where if a VM crashes and restarts the same IP address remains pinned until the node is deleted |
approved |
|||||||||||
SDN05 |
min_req, network |
P0 |
M3 |
Atomically switch a floating private IP between nodes via API in under 10 seconds without requiring an instance reboot. |
approved |
|||||||||||
SDN06 |
min_req, network |
P0 |
M3 |
Localized DNS: Support for custom, localized DNS settings to enable internal domain resolution to private endpoints (e.g. storage endpoints) |
approved |
|||||||||||
SDN07 |
min_req, network |
P0 |
M3 |
Support for VPC peering with full bandwidth and no "hairpin" routing. |
approved |
|||||||||||
SDN02, SDN03 |
min_req, network, security |
P0 |
Create a Security Group |
approved |
||||||||||||
SDN02, SDN03 |
min_req, network, security |
P0 |
Retrieve a Security Group/list Security Groups |
approved |
||||||||||||
SDN02, SDN03 |
min_req, network, security |
P0 |
Update a Security Group |
approved |
||||||||||||
SDN02, SDN03 |
min_req, network, security |
P0 |
Delete a Security Group |
approved |
||||||||||||
CNP03 |
P0 |
LimitedEnv |
M0 |
NVLink Partitioning |
approved |
|||||||||||
SDN12 |
LimitedEnv |
M4 |
NVLink: Create a new partition; verify success response with partition key. |
approved |
||||||||||||
SDN13 |
LimitedEnv |
M4 |
NVLink: List partitions; retrieve a specific partition by partition ID |
approved |
||||||||||||
SDN14 |
LimitedEnv |
M4 |
NVLink: Delete an empty partition (no instances); retrieve by ID and verify not found. |
approved |
||||||||||||
SDN15 |
LimitedEnv |
M4 |
NVLink: Create a partition, create instance in that partition, make sure the instance becomes ready, delete instance, verify GPUs removed from partition. |
approved |
||||||||||||
SDN16 |
LimitedEnv |
M4 |
NVLink: Create partition, create 2 instances, verify both share the same nvlink_partition_id and both GPUs are reported for that partition |
approved |
||||||||||||
SDN17 |
LimitedEnv |
M4 |
IMEX: Assert nvidia-imex and nvidia-imex-ctl packages are installed and nvidia-imex.service unit is registered with systemd. |
approved |
||||||||||||
SDN18 |
LimitedEnv |
M4 |
IMEX: Start nvidia-imex via systemctl start nvidia-imex and assert the service reaches active/ready state. |
approved |
||||||||||||
SDN19 |
LimitedEnv |
M4 |
IMEX: Stop nvidia-imex via systemctl stop nvidia-imex and assert clean shutdown (other nodes show disabled node status). |
approved |
||||||||||||
SDN20 |
LimitedEnv |
M4 |
IMEX: Enable IMEX at boot with systemctl enable nvidia-imex, reboot, and confirm IMEX starts automatically. |
approved |
||||||||||||
SDN21 |
LimitedEnv |
M4 |
IMEX: After all nodes start IMEX, query domain state via nvidia-imex-ctl -N and assert the state is UP with a fully connected matrix |
approved |
||||||||||||
SDN22 |
LimitedEnv |
M4 |
IMEX: With IMEX_ENABLE_AUTH_ENCRYPTION=0, form a multi-node domain and assert connectivity without any auth/encryption. |
approved |
||||||||||||
SDN23 |
LimitedEnv |
M4 |
IMEX: Enable IMEX_ENABLE_AUTH_ENCRYPTION=1 with SSL_TLS mode, provide valid server/client certs, and assert mutual authentication and encrypted communication. |
approved |
||||||||||||
SDN24 |
LimitedEnv |
M4 |
IMEX: Enable GSSAPI/Kerberos mode (GSS_AUTH_ENCRYPT), configure KDC/keytabs, and assert mutual auth + encryption between nodes. |
approved |
||||||||||||
SDN25 |
LimitedEnv |
M4 |
IMEX/K8s: Submit a workload requesting a ComputeDomain; assert ComputeDomain pods are created on each allocated node. |
approved |
||||||||||||
SDN26 |
LimitedEnv |
M4 |
IMEX: From workload pods in a ComputeDomain run nvbandwidth-test-job |
approved |
||||||||||||
SDN27 |
LimitedEnv |
M4 |
IMEX: After workload completes, assert ComputeDomain and associated IMEX resources are cleaned up. |
approved |
||||||||||||
SDN28 |
REVIEW |
Redundant Gateways |
pending |
|||||||||||||
Security Group Scope & Timing |
SDN02 |
min_req, network, security |
P0 |
M5 |
Verify security group rules can be scoped at workload level |
pending |
||||||||||
SDN02 |
min_req, network, security |
P0 |
M5 |
Verify security group rules can be scoped at node level |
pending |
|||||||||||
SDN02 |
min_req, network, security |
P0 |
M5 |
Verify security group rules can be scoped at subnet/tenant level |
pending |
|||||||||||
SDN02 |
min_req, network, security |
P1 |
M5 |
Measure and verify policy propagation timing is within acceptable bounds |
pending |
|||||||||||
SDN02 |
min_req, network, security |
P0 |
M5 |
Verify security group rules can be scoped at K8s API service level |
pending |
|||||||||||
All-to-All Storage Routing |
SDN08 |
min_req, network |
P1 |
M5 |
Verify that storage hosts can route to each other with all-to-all L3 communication |
pending |
||||||||||
Port Security |
SDN02 |
min_req, network, security |
P1 |
M5 |
Verify customizable port security policies can be applied to virtual interfaces |
pending |
||||||||||
Network Infrastructure Observability |
SDN09 |
min_req, network |
P1 |
M5 |
Verify logging is available for network hardware faults |
pending |
||||||||||
SDN09 |
min_req, network |
P1 |
M5 |
Verify logging captures latency/performance fluctuations |
pending |
|||||||||||
SDN09 |
min_req, network, security |
P1 |
M5 |
Verify a detailed audit trail exists for all configuration changes to network filtering rules |
pending |
|||||||||||
Metadata Service |
Shared-service source of truth for resource metadata such as nodes, GPUs, topology, and ownership. Enables scheduling, automation, and lifecycle management. |
GCP Resource Manager APIs |
META01 |
P1 |
LimitedEnv |
M4 |
TODO |
approved |
||||||||
Cloud Control Plane |
Primary access mechanism for IaaS, providing provisioning and management of compute, networking, and storage. |
AWS EC2 Control Plane |
CTRL01 |
P0 |
M4 |
Needs to have an API we can access from our tests to do actions |
approved |
|||||||||
CAP02 |
Verify that the fleet management API provides the full set of information listed here: Node ID (unique identifier for a GPU node); Health State (Healthy/Unhealthy classification); Instance ID (identifier for the virtual workload); Creation Timestamp (time the workload/node was created); Hardware Type (descriptor for the hardware model); GPU Count (number of GPUs per node); Top-Level Account/ID (identifier for the top-level organization/account); Sub-Level Project/ID (identifier for the nested project/sub-account); In Use (True/False status indicating whether the GPU node is turned on and in use); Region (region of the data center where nodes are deployed) |
approved |
||||||||||||||
CAP03 |
Resource discovery API: newly delivered capacity must be discoverable. This "Resource Index" must provide a stable resource identifier. |
approved |
||||||||||||||
Stable Identifiers (Switches/Infrastructure) |
CNP08 |
min_req |
P1 |
M5 |
Verify that switch identifiers are stable across reboots and firmware updates |
pending |
||||||||||
Load Balancing |
Distributes traffic across services (e.g. inference endpoints) for availability and scale. |
AWS Network Load Balancer |
LB01 |
P2 |
LimitedEnv |
CRUD a load balancer |
approved |
|||||||||
LB02 |
LimitedEnv |
Verify that traffic is distributed across multiple nodes |
approved |
|||||||||||||
LB03 |
Verify that nodes can be marked as down or up (for rolling deployments/etc) |
approved |
||||||||||||||
LB04 |
LimitedEnv |
Verify that no traffic is dropped during outages |
approved |
|||||||||||||
Break-Fix |
A break-fix service works with a variety of components to detect, triage, remediate, and validate nodes in an AI factory. The critical focus here is on the GPU nodes, but the system can scale beyond. |
Lazarus, Shoreline |
BFX04 |
P1 |
LimitedEnv |
Check that GPUd or Sentinel is running |
approved |
|||||||||
BFX05 |
LimitedEnv |
Verify that Tenants can be notified, by some communication system, of planned future node maintenance |
approved |
|||||||||||||
BFX06 |
LimitedEnv |
Verify that Tenants can be notified, by some communication system, of immediate node failure |
approved |
|||||||||||||
Breakfix API Lifecycle |
BFX01 |
min_req |
P2 |
M5 |
Reset GPUs on an individual node via the Breakfix API |
pending |
||||||||||
BFX01 |
min_req |
P1 |
M5 |
Return an individual node to the provider for maintenance via the API |
pending |
|||||||||||
BFX01 |
min_req |
P1 |
M5 |
Return a rack to the provider for maintenance via the API |
pending |
|||||||||||
BFX01 |
min_req |
P1 |
M5 |
Cordon a node (mark unschedulable); verify no new workloads are placed; verify existing workloads continue |
pending |
|||||||||||
BFX01 |
min_req |
P1 |
M5 |
Request a host replacement when health thresholds are breached; verify the node is removed from the pool |
pending |
|||||||||||
Breakfix Events Query |
BFX02 |
min_req |
P1 |
M5 |
Query upcoming/current maintenance events for a node |
pending |
||||||||||
BFX02 |
min_req |
P1 |
M5 |
Query retirement notices for a node/rack |
pending |
|||||||||||
BFX02 |
min_req |
P1 |
M5 |
Query historical repair status for a node |
pending |
|||||||||||
Breakfix Diagnostics |
BFX03 |
min_req |
P1 |
M5 |
Query serial numbers of installed hardware (chassis, baseboard, NICs, CPU, GPU) — obfuscated but stable IDs are OK |
pending |
||||||||||
BFX03 |
min_req |
P1 |
M5 |
Inspect firmware versions of NV switch trays |
pending |
|||||||||||
BFX03 |
min_req |
P1 |
M5 |
Obtain BMC kernel log messages for a node |
pending |
|||||||||||
Data Services |
SDS Controller |
Manages software-defined storage services and policies. Orchestrates provisioning, tiers, and resiliency. |
AWS: EBS + FSx backend control planes |
HSS01 |
min_req |
P1 |
M5 |
Verify storage provisioning via vendor/NCP API |
pending |
|||||||
HSS02 |
min_req |
P1 |
M5 |
Verify QoS — provisioned throughput meets requested minimum bandwidth and IOPS |
pending |
|||||||||||
HSS03, K8S23 |
min_req |
P1 |
M5 |
Verify K8s CSI integration for storage |
pending |
|||||||||||
HSS04 |
min_req |
P1 |
M5 |
Verify quota limits can be set on user workloads/volumes |
pending |
|||||||||||
HSS05 |
min_req |
P1 |
M5 |
Verify non-disruptive upgrades (NVIDIA can defer maintenance up to 2 weeks) |
pending |
|||||||||||
STOR01 |
REVIEW |
Storage: Config/default/access to network provided storage |
pending |
|||||||||||||
Home Directory Storage |
DIR02 |
min_req |
P1 |
M5 |
Verify NFS protocol shared storage is available |
pending |
||||||||||
DIR01 |
min_req |
P1 |
M5 |
Verify configurable filesystem-wide quota limits |
pending |
|||||||||||
DIR01 |
min_req |
P1 |
M5 |
Verify usage accounting for uid/gids |
pending |
|||||||||||
Data Movement |
DMS01 |
min_req |
P1 |
M5 |
Verify a dedicated K8s cluster (or ability to create one) for the data mover stack |
pending |
||||||||||
DMS02 |
min_req |
P1 |
M5 |
Verify dedicated CPU nodes are available for data mover with high-performance networking |
pending |
|||||||||||
DMS03 |
min_req |
P1 |
M5 |
Verify the same filesystem mounted on GPU nodes is accessible from data mover nodes |
pending |
|||||||||||
DMS05 |
min_req, network |
P1 |
M5 |
Verify stable egress IP for allowlisting access to NVIDIA services |
pending |
|||||||||||
DGXC-Managed Storage Deployment |
STG02 |
min_req |
P1 |
M5 |
Verify optional skip-sanitization flag during break/fix where tenancy doesn’t change |
pending |
||||||||||
STG03 |
min_req |
P1 |
M5 |
Verify stable IP assignment for storage nodes across lifecycle operations |
pending |
|||||||||||
STG04 |
min_req |
P1 |
M5 |
Verify out-of-band failure detection (device, network, memory, drive) |
pending |
|||||||||||
STG05 |
min_req |
P1 |
M5 |
Verify topology observability — visibility into failure domains for physical diversity |
pending |
|||||||||||
Storage Security |
SEC04 |
iam, min_req, security |
P1 |
M5 |
Verify least-privilege access policies (resource-based, user-based, network-based) |
pending |
||||||||||
RDMA Memory Protection |
HSS06 |
min_req |
P1 |
M5 |
Verify storage systems using RDMA enforce memory protection via authorization keys for both local and remote access |
pending |
||||||||||
Parallel File System Services |
High-performance file system optimized for AI and HPC workloads |
HSS07 |
min_req |
P1 |
M5 |
Verify that a parallel high-speed filesystem can be provisioned via API |
pending |
|||||||||
HSS09 |
min_req |
P1 |
M5 |
Verify multiple filesystems can exist within total capacity (minimum FS size ≤ 50 TiB) |
pending |
|||||||||||
HSS10 |
min_req |
P1 |
M5 |
Verify live filesystem expansion (capacity, inodes, IO, metadata) |
pending |
|||||||||||
HSS12 |
min_req |
P1 |
M5 |
Verify uid/gid/project-id soft and hard quotas with enforcement |
pending |
|||||||||||
HSS13 |
min_req |
P1 |
M5 |
Verify root-squash can be enabled and disabled at any time |
pending |
|||||||||||
HSS14 |
min_req |
P1 |
M5 |
Verify the filesystem can be mounted with flock |
pending |
|||||||||||
HSS15 |
min_req |
P1 |
M5 |
Verify changelog/audit data is accessible (tracking by uid/gid, file/dir operations) |
pending |
|||||||||||
HSS18 |
min_req |
P1 |
M5 |
Verify client multipathing to all storage servers |
pending |
|||||||||||
Object Storage Service |
Durable, scalable object storage for things like datasets, model artifacts, logs, and backups. |
S3 |
DATASVC01 |
control_plane, min_req |
P2 |
M5 |
Verify S3-compatible API access with authenticated endpoints |
pending |
||||||||
Block Storage Services |
Persistent block volumes for a wide range of uses |
AWS EBS, GCP Persistent Disk |
K8S23 |
min_req |
P2 |
M5 |
Verify dynamic volume provisioning via CSI |
pending |
||||||||
DATASVC02 |
min_req |
P1 |
M5 |
Verify volume snapshots |
pending |
|||||||||||
DATASVC03 |
min_req |
P1 |
M5 |
Verify volume resizing |
pending |
|||||||||||
DATASVC04 |
min_req |
P1 |
M5 |
Verify persistent block volumes survive instance restarts |
pending |
|||||||||||
Cache Services |
In-memory data store for frequently used data |
ElastiCache |
DATASVC05 |
P2 |
LimitedEnv |
Create a managed cache instance, write a key-value pair, read it back, verify data integrity |
pending |
|||||||||
Vector Store |
Stores and indexes embeddings for semantic search and RAG |
Vertex AI Vector Search |
DATASVC06 |
P2 |
LimitedEnv |
Create a vector store, insert an embedding, perform similarity search, verify result |
pending |
|||||||||
Relational Database |
Structured database (SQL) service |
GCP Cloud SQL |
DATASVC07 |
P2 |
LimitedEnv |
Create a managed SQL database, create a table, insert/query data, verify results |
pending |
|||||||||
Workload Orchestration |
Backup and Recovery |
Centralized managed service to automate and govern data backup across services |
AWS Backup |
DATASVC08 |
P2 |
M4 |
Create a backup of a block volume, delete original, restore from backup, verify data is the same as the original |
pending |
||||||||
Model Registry |
Storage with versioning for models and AI artifacts |
AWS SageMaker Model Registry |
DATASVC09 |
P2 |
Push a model artifact with a version tag, retrieve by version, verify it matches (prefer a smaller reference model) |
pending |
||||||||||
Slurm Control Plane |
Control plane to schedule and manage large-scale GPU batch training with SLURM semantics. |
AWS ParallelCluster |
SLURM01 |
P0 |
LimitedEnv |
M0 |
Create a SLURM instance using the software platform |
done |
||||||||
SLURM02 |
slurm |
P0 |
LimitedEnv |
M0 |
SlurmInfoAvailable |
done |
||||||||||
SLURM03 |
slurm |
P0 |
LimitedEnv |
M0 |
SlurmPartition cpu and gpu |
done |
||||||||||
SLURM04 |
slurm |
P0 |
LimitedEnv |
M0 |
SlurmJobSubmission |
done |
||||||||||
SLURM05 |
slurm |
P0 |
LimitedEnv |
M0 |
SlurmGpuAllocation 1gpu and 2gpu |
done |
||||||||||
SLURM06 |
slurm |
P0 |
LimitedEnv |
M0 |
SlurmNodeJobExecution cpu and gpu |
done |
||||||||||
SLURM07 |
gpu, slow, slurm, workload |
P0 |
LimitedEnv |
M0 |
SlurmGpuStressWorkload |
done |
||||||||||
SLURM08 |
gpu, slow, slurm, workload |
P0 |
LimitedEnv |
M0 |
SlurmNcclMultiNodeWorkload |
done |
||||||||||
SLURM09 |
slow, slurm, workload |
P0 |
LimitedEnv |
M0 |
SlurmSbatchWorkload gpu and cpu |
done |
||||||||||
SLURM10 |
P0 |
LimitedEnv |
M0 |
SlurmSbatchWorkload-inline |
done |
|||||||||||
Managed Kubernetes Control Plane |
Tenant-isolated Kubernetes control planes for managing K8s workloads. Handles placement, networking, and lifecycle management. |
AWS EKS, GCP GKE |
K8S05 |
kubernetes, min_req |
P0 |
LimitedEnv |
M0 |
Create a K8s Instance using the software platform (CLI, Terraform) |
done |
|||||||
Full lifecycle management |
K8S38 |
min_req |
M4 |
Run through all the steps of the K8s lifecycle management, including a user-initiated CP update |
approved |
|||||||||||
K8S41 |
The control plane should automatically add more capacity when load increases |
pending |
||||||||||||||
K8S27 |
min_req |
Pin control plane instances to a specific count to handle a particular load limit |
pending |
|||||||||||||
Control Plane Isolation |
K8S14 |
min_req |
M4 |
Per-tenant k8s control plane runs outside of the tenant cluster/VPC |
approved |
|||||||||||
K8S39 |
min_req |
P0 |
LimitedEnv |
M0 |
Be able to run k8s workloads |
done |
||||||||||
K8S40 |
gpu, kubernetes, slow, workload |
P0 |
LimitedEnv |
M0 |
K8s Nim Inference |
done |
||||||||||
K8S32 |
gpu, kubernetes, slow, workload |
P0 |
LimitedEnv |
M0 |
K8s Nim Helm |
done |
||||||||||
Topology based placement |
K8S33 |
gpu, kubernetes, min_req, slow, workload |
P0 |
K8s Nccl |
approved |
|||||||||||
k8s upstream proxy |
K8S34 |
min_req |
k8s05 |
P0 |
Validate adherence to upstream proxy requirements (service-to-pod load balancing, access to internal services) |
approved |
||||||||||
Dynamic Resource Allocation |
K8S24 |
min_req |
k8s06/22 |
P0 |
Verify that the k8s cluster supports the DRA API (resource.k8s.io) to get GPUs on the same NVLink domain |
approved |
||||||||||
Run k8s conformance tests |
K8S01 |
kubernetes, l2, min_req, slow |
k8s01 |
P0 |
Get versions, topology, and core APIs by: (a) passing CNCF k8s conformance: https://github.com/cncf/k8s-conformance; (b) adding AI conformance: [https://github.com/cncf/k8s-ai-conformance](https://github.com/cncf/k8s-ai-conformance) |
pending |
||||||||||
k8s service account OIDC |
K8S17 |
min_req |
test that the service account can be authorized with OIDC |
approved |
||||||||||||
Test support for CRDs and Validating/Mutating Admission Controllers (webhooks) |
K8S21 |
kubernetes, min_req |
k8s19 |
P1 |
Create a CRD, create a webhook, and test the validation/mutation requirements |
approved |
||||||||||
TODO: L3 networking logging |
K8S35 |
min_req |
K8s16 |
P1 |
Verify that pod-to-pod L3 traffic flow logs are available and queryable |
approved |
||||||||||
CNI Compliance |
K8S22 |
kubernetes, min_req |
K8s20 |
P0 |
Apply a NetworkPolicy; verify pod connectivity is restricted as expected; verify IPv4 and IPv6 addresses are assigned on dual-stack nodes |
approved |
||||||||||
TODO: Dynamic provisioning, snapshots, and resizing |
K8S23 |
min_req |
K8s21 |
TODO: Helm version |
approved |
|||||||||||
K8S23 |
min_req |
K8s21 |
TODO: Kustomize version |
approved |
||||||||||||
K8S23 |
kubernetes, min_req |
P0 |
M5 |
Verify CSI supports block, shared filesystem, and NFS storage |
pending |
|||||||||||
K8S23 |
kubernetes, min_req |
P0 |
M5 |
Verify CSI supports both static and dynamic provisioning via PVs and PVCs |
pending |
|||||||||||
K8S23 |
kubernetes, min_req |
P0 |
M5 |
Verify CSI credentials are tenant cluster scoped (no cross-cluster access) |
pending |
|||||||||||
K8S23 |
kubernetes, min_req |
P0 |
M5 |
Verify APIs to query storage usage against overall cluster quota with per-PVC/Volume breakdown |
pending |
|||||||||||
NVIDIA Operator support |
K8S25 |
kubernetes, min_req |
K8S23 |
P0 |
TODO: Compatibility with the mainline NVIDIA GPU Operator |
approved |
||||||||||
K8S25 |
min_req |
P0 |
M5 |
Verify provider-default accelerator operators and drivers can be replaced or overridden with tenant-required versions |
pending |
|||||||||||
K8s Versioning & Compliance |
K8S02 |
min_req |
INFO |
M5 |
Verify support for the three most recent minor releases (N-2) |
pending |
||||||||||
K8S02 |
min_req |
INFO |
M5 |
Verify automated control plane security patching with no/minimal downtime |
pending |
|||||||||||
Zero-Downtime Upgrades |
K8S09 |
min_req |
INFO |
M5 |
Perform a minor version control plane upgrade and verify zero application downtime |
pending |
||||||||||
K8S10 |
min_req |
INFO |
M5 |
Perform a rolling node upgrade and verify disruption budgets are respected |
pending |
|||||||||||
HA Control Plane |
K8S11 |
min_req |
INFO |
M5 |
Verify the control plane recovers from a single-node failure |
pending |
||||||||||
Backup & Disaster Recovery |
K8S12 |
min_req |
INFO |
M5 |
Verify backup and recovery within defined RPO/RTO |
pending |
||||||||||
K8s Access Controls |
K8S15 |
kubernetes, min_req |
P0 |
M5 |
Verify the K8s API endpoint has network access controls (firewall/private link) |
pending |
||||||||||
K8s Encryption |
K8S19 |
min_req |
INFO |
M5 |
Verify at-rest encryption is enabled for etcd |
pending |
||||||||||
Certificate Management |
SEC09 |
min_req, security |
P1 |
M5 |
Verify certificates are rotated on a 60-day cycle |
pending |
||||||||||
SEC09 |
min_req, security |
P1 |
M5 |
Verify support for both provider-managed and customer-managed keys |
pending |
|||||||||||
Autoscaling |
K8S36 |
kubernetes, min_req |
P1 |
M5 |
Verify Cluster Autoscaler integration (upstream) |
pending |
||||||||||
Kubernetes Security Response |
K8S04 |
min_req |
INFO |
INFO |
M5 |
Verify NCP participates in the Kubernetes Security Response Committee (SRC) process and can patch disclosed vulnerabilities during embargo |
pending |
|||||||||
Node Pool Lifecycle Management |
K8S06 |
kubernetes, min_req, slow |
P0 |
M5 |
Create a K8s node pool via API/CLI specifying node type (CPU or GPU instance type) |
pending |
||||||||||
K8S06 |
kubernetes, min_req, slow |
P0 |
M5 |
Update a K8s node pool (e.g., scale to a target count) |
pending |
|||||||||||
K8S06 |
kubernetes, min_req, slow |
P0 |
M5 |
Delete a K8s node pool |
pending |
|||||||||||
K8S06 |
min_req |
P0 |
M5 |
Verify default node labels and taints can be specified when a node joins the cluster via node pool |
pending |
|||||||||||
API Server Metrics |
K8S07 |
kubernetes, min_req |
P1 |
M5 |
Verify API Server metrics are available in a Prometheus-scrapable format for SLO measurement |
pending |
||||||||||
Public OIDC Endpoint |
K8S17 |
kubernetes, min_req |
P0 |
M5 |
Verify the cluster exposes a cluster-specific OIDC Issuer endpoint for workload identity federation |
pending |
||||||||||
K8s Control Plane Logging |
K8S20 |
kubernetes, min_req |
P1 |
M5 |
Verify ability to view or export Kubernetes control plane logs (kube-apiserver, kube-controller-manager) |
pending |
||||||||||
K8s Performance |
K8S37 |
min_req |
P0 |
M5 |
Verify the managed K8s service meets standard Kubernetes performance tests to max cluster size |
pending |
||||||||||
Multi-Cluster Support |
K8S26 |
kubernetes, min_req |
P0 |
M5 |
Support multiple clusters in the same tenancy and in the same VPC |
pending |
||||||||||
Power Policy Management |
Control plane to manage power allocation (re-allocation) based on need and availability |
Boost and Flex (DPS) |
POWER01 |
P2 |
MiniCloud |
TODO |
approved |
|||||||||
AI Platform & User Access |
Host API Gateway |
Service control plane "front door" that enforces authn/authz via IAM |
AWS/GCP API Gateway |
APIGW01 |
P2 |
TODO |
approved |
|||||||||
AI Platform 1..N Control Planes |
Placeholder for all AI platform CPs you might run, e.g. Run:ai, NVCF, Red Hat Inference Server |
AICP01 |
P2 |
TODO |
approved |
|||||||||||
Web Management Dashboard |
WEBUI01 |
P2 |
LimitedEnv |
M5 |
TODO |
approved |
||||||||||
Observability |
Observability Collectors |
OBS01 |
P0 |
LimitedEnv |
M0 |
Validate that an OTel endpoint is provided and some data can be pulled from it |
done |
|||||||||
OBS02 |
P0 |
LimitedEnv |
M0 |
Logging information about the control plane is available |
done |
|||||||||||
OBS03 |
P0 |
LimitedEnv |
M0 |
Logging information from instances |
done |
|||||||||||
OBS04 |
P0 |
LimitedEnv |
M0 |
Alerts can be issued by the system in case of problems |
done |
|||||||||||
Telemetry Delivery |
OBS05 |
min_req |
P1 |
M5 |
Verify telemetry latency is no longer than 120 seconds |
pending |
||||||||||
Network Telemetry |
OBS06 |
min_req |
P1 |
M5 |
Verify telemetry is available for North-South (front-end) network |
pending |
||||||||||
OBS07 |
min_req |
P1 |
M5 |
Verify telemetry is available for East-West (back-end / GPU interconnect) network |
pending |
|||||||||||
OBS08 |
min_req |
P1 |
M5 |
Verify telemetry is available for Management network |
pending |
|||||||||||
OBS09 |
min_req |
P1 |
M5 |
Verify telemetry is available for NVSwitch Fabric (GB200+) |
pending |
|||||||||||
OBS10 |
min_req |
P1 |
M5 |
Verify telemetry is available for Host network (NIC-level) |
pending |
|||||||||||
Required Logs |
OBS11 |
min_req |
P1 |
M5 |
Verify Fabric Manager logs are available (where applicable) |
pending |
||||||||||
OBS12 |
min_req |
P1 |
M5 |
Verify Subnet Manager logs are available (where applicable) |
pending |
|||||||||||
OBS13 |
min_req |
P1 |
M5 |
Verify UFM Event logs are available |
pending |
|||||||||||
OBS14 |
min_req |
P1 |
M5 |
Verify general switch logs are available |
pending |
|||||||||||
OBS15 |
min_req |
P1 |
M5 |
Verify switch syslogs are available |
pending |
|||||||||||
OBS16 |
min_req |
P1 |
M5 |
Verify switch kernel logs are available |
pending |
|||||||||||
OBS17 |
min_req, observability, security |
P1 |
M5 |
Verify BMC SEL logs are available |
pending |
|||||||||||
OBS18 |
bare_metal, min_req, observability |
P1 |
M5 |
Verify host syslogs are available |
pending |
|||||||||||
OBS19 |
min_req, network, observability |
P1 |
M5 |
Verify VPC Flow logs (all ingress/egress traffic) are available |
pending |
|||||||||||
Telemetry / Data Lakes |
TELEM01 |
P1 |
LimitedEnv |
TODO |
approved |
|||||||||||
TELEM02 |
min_req |
MinAPI |
P0 |
MiniCloud |
M3 |
Read-only access to a BMaaS system’s serial console |
approved |
|||||||||
TELEM03 |
min_req |
MinAPI |
P0 |
LimitedEnv |
M3 |
Read-only access to a VM serial console |
approved |
|||||||||
BMC/Redfish Telemetry |
TELEM04 |
gpu, min_req, observability, security |
P1 |
M5 |
Verify BMC/Redfish telemetry is accessible via API for GPU metrics not available from the host OS |
pending |
||||||||||
Storage Telemetry |
TELEM05 |
min_req |
P1 |
M5 |
Verify storage resource capacity metrics are available (used/free/total) |
pending |
||||||||||
TELEM06 |
min_req |
P1 |
M5 |
Verify storage performance metrics are available (bandwidth, IOPS, latency) |
pending |
|||||||||||
NVLink Switch Telemetry |
TELEM07 |
min_req |
P1 |
M5 |
Verify NVLink metrics are available from the GPU perspective (per-link counters, bandwidth utilization) |
pending |
||||||||||
TELEM08 |
min_req |
P1 |
M5 |
Verify NVLink metrics are available from the switch perspective (port-level counters, error rates) |
pending |
|||||||||||
Security & Identity |
Network Security |
BMC Security |
SEC12 |
min_req, network, security |
P0 |
M5 |
Verify BMC management is on a dedicated, restricted network (physically separate or VLAN/VRF-isolated) |
pending |
||||||||
SEC12 |
min_req, network, security |
P0 |
M5 |
Verify BMC interfaces are not reachable from tenant networks |
pending |
|||||||||||
CNP10 |
min_req, network, security |
P0 |
M5 |
Verify IPMI is disabled; Redfish over TLS is used with AAA |
pending |
|||||||||||
SEC12 |
min_req, network, security |
P0 |
M5 |
Verify BMC is only accessible via a hardened bastion (jumphost) server; direct public/corporate network access is blocked |
pending |
|||||||||||
Network Traffic Encryption |
SEC13 |
min_req |
P1 |
M5 |
Verify mTLS (or equivalent) for all east-west and north-south traffic |
pending |
||||||||||
Edge Security |
SEC14 |
min_req, network, security |
P0 |
M5 |
Verify no public internet access to API endpoints by default |
pending |
||||||||||
SEC13 |
min_req, network, security |
P1 |
M5 |
Verify insecure protocols (HTTP, SSLv3, TLSv1) are disabled |
pending |
|||||||||||
InfiniBand Security |
SDN04 |
bare_metal, infiniband, min_req, network, security |
P0 |
M5 |
Verify IB tenant isolation — compute dedicated to NVIDIA is isolated from other customers |
pending |
||||||||||
SDN04 |
bare_metal, infiniband, min_req, network, security |
P0 |
M5 |
Verify IB keys are configured: P_Key, Management Key, Aggregation Management Key, VendorSpecific Key, CongestionControl Key, Node2Node Key, Manager2Node Key |
pending |
|||||||||||
Tenancy Isolation |
SEC11 |
iam, min_req, network, security |
P0 |
M5 |
Verify hard physical or logical isolation between tenants for network, data, compute, and storage resources |
pending |
||||||||||
Authentication |
User Authentication |
SEC01 |
iam, min_req, security |
P0 |
M5 |
Verify user authentication via OIDC for platform services |
pending |
|||||||||
In-Cluster Authentication |
SEC06 |
iam, min_req, security |
P0 |
M5 |
Verify workloads and nodes receive short-lived credentials/tokens |
pending |
||||||||||
External Service Authentication |
SEC03 |
iam, min_req, security |
P0 |
M5 |
Verify out-of-cluster service accounts can authenticate with long-lived credentials |
pending |
||||||||||
Admin Interface Security |
SEC07 |
iam, min_req, security |
INFO |
P0 |
M5 |
Verify all administrative interfaces (UI, CLI, API) are protected by Multi-Factor Authentication |
pending |
|||||||||
Authorization |
RBAC |
SEC04 |
iam, min_req, security |
P1 |
M5 |
Assign a minimal role to a user; verify they cannot perform actions outside that role on Compute, Storage, and Network APIs |
pending |
|||||||||
Audit & Logging |
Audit & Logging |
SEC08 |
min_req, security |
P1 |
M5 |
Perform a management API call; verify an audit log entry is generated with correct metadata |
pending |
|||||||||
SEC08 |
min_req, security |
P1 |
M5 |
Verify audit logs are retained for at least 30 days |
pending |
|||||||||||
Hardware Security & Compliance |
Data Sanitization |
SEC21 |
bare_metal, min_req, sanitization, security |
P0 |
M5 |
Verify memory sanitization between tenants |
pending |
|||||||||
SEC21 |
bare_metal, gpu, min_req, sanitization, security |
P0 |
M5 |
Verify SRAM/GPU memory is sanitized between tenants |
pending |
|||||||||||
SEC21, SEC22 |
bare_metal, firmware, min_req, sanitization, security |
P0 |
M5 |
Verify TPM and BIOS are reset during tenant transitions or hardware replacement |
pending |
|||||||||||
At-Rest Encryption |
SEC20 |
min_req |
INFO |
M5 |
Verify all data at rest is encrypted via SED on local NVMe/SSD and network-attached storage |
pending |
||||||||||
Key Management |
SEC09, SEC10 |
min_req, security |
P1 |
M5 |
Verify a centralized KMS is used for all encryption keys and secrets |
pending |
||||||||||
SEC09 |
min_req, security, slow, workload |
P0 |
M5 |
Verify support for Customer Managed Keys (BYOK) |
pending |
|||||||||||
Network Transport |
Backend Switch Fabric API |
Backend Switch Fabric API |
NET01 |
min_req, network |
P0 |
M5 |
Query the API for a compute node and verify it returns backend switch IDs (leaf, spine, core) |
pending |
||||||||
NVLink Domain API |
NVLink Domain API |
NET02 |
min_req, network |
P0 |
M5 |
Query the NVLink domain ID for a compute node supporting NVLink |
pending |
|||||||||
Transport Connectivity |
CorpIT Interconnect |
NET04 |
min_req |
P1 |
M5 |
Verify Private Cloud Interconnect to NVIDIA CorpIT is operational |
pending |
|||||||||
DGXC Storage Interconnect |
NET05 |
min_req |
INFO |
M5 |
Verify high-bandwidth Private Cloud Interconnect to DGXC on-prem object storage |
pending |
||||||||||
NET05 |
min_req |
INFO |
M5 |
Verify end-to-end MACsec encryption (fail-closed) |
pending |
|||||||||||
Cluster Internet Access |
NET06 |
min_req |
INFO |
M5 |
Verify egress NAT IPs are a static pool dedicated to NVIDIA |
pending |
||||||||||
NET06 |
min_req |
INFO |
M5 |
Verify redundant upstream paths for internet connectivity |
pending |
|||||||||||
Capacity & Fleet Management |
Governance Metrics |
Governance Metrics |
CAP01 |
bare_metal, governance, min_req |
P1 |
M5 |
Query and verify governance API returns Delivered, Healthy, Reserved, and Active metrics for nodes/GPUs |
pending |
||||||||
Capacity Reservations |
Capacity Reservations |
CAP04 |
bare_metal, capacity, min_req, security |
P1 |
M5 |
Verify a set of resources can be logically grouped and pinned to an account |
pending |
|||||||||
CAP04 |
bare_metal, capacity, min_req, security |
P1 |
M5 |
Verify atomic allocation of a topology block as a single unit |
pending |
|||||||||||
Unified Health APIs |
Unified Health APIs |
CAP05 |
bare_metal, health, min_req |
P1 |
M5 |
Verify per-host health API returns real-time GPU state, thermal status, memory health |
pending |
|||||||||
CAP05 |
bare_metal, health, min_req |
P1 |
M5 |
Verify primitive-level health aggregation (cluster, nodegroup, or reservation level) |
pending |
|||||||||||
Benchmarking |
Exemplar |
Run a mini-cloud configuration of Exemplar benchmarks |
BM01 |
min_req |
Simulate Exemplar across a 512-GPU cluster using the latest dgxc-benchmarking release |
pending |
||||||||||
BENCH01 |
REVIEW |
NVMesh checks? |
pending |
|||||||||||||
BENCH02 |
REVIEW |
IPv4 and IPv6 checks? |
pending |
|||||||||||||
BM01 |
min_req |
P1 |
M5 |
Verify performance is within 5% of NVIDIA Reference |
pending |