NCX Infra Controller (NICo) -- Kubernetes Deployment
NICo (also known as NCX Infra Controller) is a platform for provisioning, managing, and monitoring bare metal GPU servers, including DGX and HGX systems. This Helm chart deploys NICo services into a Kubernetes cluster as a single umbrella chart with independently toggleable subcharts.
The chart is designed for production environments where NICo manages the full lifecycle of bare metal infrastructure: DHCP/PXE-based OS provisioning, DNS resolution, hardware health monitoring, SSH console access, and a unified REST/gRPC API.
| # | Subchart | Description |
|---|---|---|
| 1 | nico-api | Core API server (gRPC + REST). Manages machines, provisioning, networking, and firmware. Requires PostgreSQL and Vault. |
| 2 | nico-bmc-proxy | Authenticating proxy for connecting to BMCs over HTTPS (Redfish). |
| 3 | nico-dhcp | Kea DHCP server for bare-metal PXE boot and IP assignment. |
| 4 | nico-dns | Authoritative DNS server (StatefulSet) for managed machines and VPCs. |
| 5 | nico-dsx-exchange-consumer | Consumes DSX exchange messages for machine telemetry and state updates. Disabled by default. |
| 6 | nico-flow | Workflow / Temporal-backed orchestration component. Disabled by default. |
| 7 | nico-hardware-health | Collects and reports hardware health metrics from managed machines. |
| 8 | nico-ntp | chrony NTP servers (3-replica StatefulSet, per-pod LoadBalancer VIPs). DPUs and bare-metal hosts sync against these per the kea DHCP ntpServer advertisement. |
| 9 | nico-pxe | PXE boot server (HTTP-based) for OS provisioning workflows. |
| 10 | nico-ssh-console-rs | SSH console proxy for remote access to managed machine BMCs and consoles. |
| 11 | unbound | Recursive DNS resolver. Optional — used to serve the DPU compatibility .forge zone when no external DNS does. Disabled by default. |

- Kubernetes 1.27+
- Helm 3.12+
- cert-manager with a `ClusterIssuer` configured (default issuer name: `vault-nico-issuer`)
- HashiCorp Vault for PKI certificate issuance and secret storage
- PostgreSQL (SSL-enabled) for the `nico-api` database backend
- Prometheus Operator CRDs if you enable `ServiceMonitor` resources
- Required Kubernetes Secrets and ConfigMaps (Vault tokens, database credentials, SSO secrets, etc.)

For the full list of required secrets, ConfigMaps, and infrastructure setup steps, see PREREQUISITES.md.
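
As a rough illustration of the cert-manager prerequisite, a Vault-backed `ClusterIssuer` carrying the default name `vault-nico-issuer` might look like the sketch below. The Vault address, the PKI signing path, and the token Secret are placeholders assumed for illustration, and token authentication is only one of the methods cert-manager supports. PREREQUISITES.md remains the authoritative description of the setup.

```yaml
# Hypothetical sketch: server, path, and the Secret reference are placeholders.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: vault-nico-issuer                        # default issuer name expected by the chart
spec:
  vault:
    server: https://vault.example.internal:8200  # assumed Vault address
    path: pki/sign/example-role                  # assumed PKI signing path
    auth:
      tokenSecretRef:
        name: example-vault-token                # assumed Secret holding a Vault token
        key: token
```
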
Install or upgrade the release with:

    helm upgrade --install nico ./helm \
      --namespace forge-system --create-namespace \
      --set global.image.repository=<your-registry>/nico-core \
      --set global.image.tag=<version>

To verify the deployment:

    kubectl get pods -n forge-system
    kubectl get svc -n forge-system

Top-level `global:` values are automatically passed to all subcharts.

| Parameter | Description | Default |
|---|---|---|
| `global.image.repository` | Container image repository (REQUIRED) | `""` |
| `global.image.tag` | Container image tag (REQUIRED) | `""` |
| `global.image.pullPolicy` | Image pull policy | `IfNotPresent` |
| `global.imagePullSecrets` | Image pull secrets | `[]` |
| `global.certificate.duration` | Certificate validity period | `720h0m0s` |
| `global.certificate.renewBefore` | Renew certificates before expiry | `360h0m0s` |
| `global.certificate.privateKey.algorithm` | Certificate private key algorithm | `ECDSA` |
| `global.certificate.privateKey.size` | Certificate private key size | `384` |
| `global.certificate.issuerRef.name` | cert-manager ClusterIssuer name | `vault-nico-issuer` |
| `global.certificate.issuerRef.kind` | cert-manager issuer kind | `ClusterIssuer` |
| `global.certificate.issuerRef.group` | cert-manager issuer API group | `cert-manager.io` |
| `global.spiffe.trustDomain` | SPIFFE trust domain for mTLS | `nico.local` |
| `global.labels` | Common labels applied to all resources | See `values.yaml` |
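
The `--set` flags used at install time can equally be kept in a values file. The following is a minimal sketch, assuming a file named `values-myenv.yaml`; the registry host, tag, and pull-secret name are placeholders, the `name:` entry format for `imagePullSecrets` is an assumption based on the usual Kubernetes convention, and the certificate settings simply restate the defaults from the table.

```yaml
# values-myenv.yaml (illustrative; all concrete values are placeholders)
global:
  image:
    repository: registry.example.com/nico-core   # placeholder registry path
    tag: "1.2.3"                                  # placeholder version
    pullPolicy: IfNotPresent
  imagePullSecrets:
    - name: example-registry-credentials          # assumed entry format and Secret name
  certificate:
    duration: 720h0m0s                            # chart default
    renewBefore: 360h0m0s                         # chart default
```

Such a file would be passed with `-f values-myenv.yaml` in place of the individual `--set` flags.
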
Each subchart can be independently enabled or disabled. Core NICo services are enabled by default. Optional components (`nico-dsx-exchange-consumer`, `nico-flow`) and infrastructure services that the environment may already provide (`unbound`) are disabled by default.

    nico-api:
      enabled: true     # Core API -- usually always enabled
    nico-dhcp:
      enabled: true     # DHCP for PXE boot
    nico-dns:
      enabled: true     # Authoritative DNS
    nico-dsx-exchange-consumer:
      enabled: false    # DSX exchange telemetry consumer (off by default)
    nico-flow:
      enabled: false    # Temporal-backed workflow orchestrator (off by default)
    nico-hardware-health:
      enabled: true     # Hardware health monitoring
    nico-ntp:
      enabled: true     # chrony NTP servers (required for DPU pre-ingestion)
    nico-pxe:
      enabled: true     # PXE boot server
    nico-ssh-console-rs:
      enabled: true     # SSH console proxy
    unbound:
      enabled: false    # Recursive DNS resolver (disabled by default)

The `global.image.repository` and `global.image.tag` values must be set; they default to empty strings. Most subcharts use the global image reference. The following subcharts use their own separate image references and do not inherit `global.image`:

| Subchart | Image Parameter | Default |
|---|---|---|
| `nico-ssh-console-rs` (log collector) | `nico-ssh-console-rs.lokiLogCollector.image.repository` / `.tag` | `""`; sidecar disabled by default (`lokiLogCollector.enabled: false`); reference image: `ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:0.81.0` |
| `unbound` | `unbound.image.repository` / `.tag` | `""` (must be set) |
| `unbound` (exporter) | `unbound.exporterImage.repository` / `.tag` | `""` (must be set) |

To enable OAuth2 authentication (for example, Azure AD or Okta), configure the `nico-api.extraEnv` values:

    nico-api:
      extraEnv:
        - name: CARBIDE_WEB_AUTH_TYPE
          value: "oauth2"
        - name: CARBIDE_WEB_OAUTH2_AUTH_ENDPOINT
          value: "https://your-idp/authorize"
        - name: CARBIDE_WEB_OAUTH2_TOKEN_ENDPOINT
          value: "https://your-idp/token"
        - name: CARBIDE_WEB_OAUTH2_CLIENT_ID
          value: "your-client-id"
        - name: CARBIDE_WEB_ALLOWED_ACCESS_GROUPS
          value: "group1,group2"
        - name: CARBIDE_WEB_OAUTH2_CLIENT_SECRET
          valueFrom:
            secretKeyRef:
              name: your-sso-secret
              key: client_secret

The `extraEnv` array supports any Kubernetes `env` spec, including `valueFrom` references to Secrets and ConfigMaps.

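
To illustrate the non-Secret forms of `valueFrom`, the sketch below reads one value from a ConfigMap and one from the pod's own metadata. The variable names, the ConfigMap name, and its key are hypothetical and are not settings that NICo is known to read; only the `configMapKeyRef` and `fieldRef` structures are standard Kubernetes.

```yaml
nico-api:
  extraEnv:
    - name: EXAMPLE_SETTING              # hypothetical variable name
      valueFrom:
        configMapKeyRef:
          name: example-config           # hypothetical ConfigMap
          key: example-key               # hypothetical key in that ConfigMap
    - name: EXAMPLE_POD_NAMESPACE        # hypothetical variable name
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace  # namespace the pod runs in
```

Helm replaces list values rather than merging them, so a values file that overrides `extraEnv` needs to carry the complete list, including any OAuth2 entries already in use.
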
Several services support optional external LoadBalancer exposure, typically used with MetalLB on bare metal clusters. Enable and configure them per subchart:

    nico-api:
      externalService:
        enabled: true
        type: LoadBalancer
        externalTrafficPolicy: Local
        annotations:
          metallb.universe.tf/loadBalancerIPs: "10.x.x.x"

Services with external LoadBalancer support: `nico-api`, `nico-dhcp`, `nico-dns`, `nico-ntp`, `nico-pxe`, and `nico-ssh-console-rs`.

For StatefulSet-based services (`nico-dns`, `nico-ntp`), per-pod LoadBalancer IPs can be assigned:

    nico-dns:
      externalService:
        enabled: true
        perPodAnnotations:
          - metallb.universe.tf/loadBalancerIPs: "10.x.x.1"  # pod-0
          - metallb.universe.tf/loadBalancerIPs: "10.x.x.2"  # pod-1

| Subchart | Workload Type | Primary Port(s) | TLS Certificate | Metrics |
|---|---|---|---|---|
| nico-api | Deployment | 1079 (gRPC), 1080 (metrics), 1081 (profiler) | Yes | ServiceMonitor |
| nico-bmc-proxy | Deployment | 1079 (gRPC), 1080 (metrics) | Yes | ServiceMonitor |
| nico-dhcp | Deployment | 67/UDP, 1089 (metrics) | Yes | ServiceMonitor |
| nico-dns | StatefulSet | 53/TCP, 53/UDP | Yes | -- |
| nico-dsx-exchange-consumer | Deployment | 9009 | Yes | ServiceMonitor |
| nico-hardware-health | Deployment | 9009 (/metrics, /telemetry) | Yes | ServiceMonitor; optional telemetry ServiceMonitor (sensor data, off by default) |
| nico-ntp | StatefulSet | 123/UDP | No | -- |
| nico-pxe | Deployment | 8080 | Yes | ServiceMonitor |
| nico-ssh-console-rs | Deployment | 22, 9009 (metrics) | Yes | ServiceMonitor |
| unbound | Deployment | 53 | No | ServiceMonitor |
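
The table lists `nico-ntp` as a StatefulSet, and the chart runs it with three replicas, each with its own LoadBalancer VIP. A sketch of the corresponding values, assuming `nico-ntp` accepts the same `externalService.perPodAnnotations` keys shown for `nico-dns`; the addresses are placeholders.

```yaml
nico-ntp:
  externalService:
    enabled: true
    perPodAnnotations:
      - metallb.universe.tf/loadBalancerIPs: "10.x.x.11"  # pod-0 (placeholder address)
      - metallb.universe.tf/loadBalancerIPs: "10.x.x.12"  # pod-1 (placeholder address)
      - metallb.universe.tf/loadBalancerIPs: "10.x.x.13"  # pod-2 (placeholder address)
```

Because DPUs and bare-metal hosts learn their NTP servers from the kea DHCP advertisement, the addresses chosen here presumably need to match what `nico-dhcp` hands out.
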
The services relate to one another as follows:

                             +------------------+
                             |     nico-api     | <-- PostgreSQL, Vault
                             +--------+---------+
                                      |
          +-------------+-------------+---------------+---------------------+
          |             |             |               |                     |
      nico-dhcp     nico-dns      nico-pxe   nico-ssh-console-rs   unbound (optional)
          |                           |                                     |
          v                           v                                     v
     Bare Metal                  Bare Metal                           Upstream DNS
     (PXE boot)                 (OS install)

All services that communicate with nico-api use mTLS via SPIFFE-based certificates issued by cert-manager and backed by Vault PKI.
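
The trust domain and issuer behind these certificates are controlled by the `global` values. The sketch below overrides both for an environment with its own issuer; the trust domain and issuer name are placeholders, while the remaining fields restate the documented defaults.

```yaml
global:
  spiffe:
    trustDomain: nico.example.com    # placeholder; chart default is nico.local
  certificate:
    issuerRef:
      name: example-vault-issuer     # placeholder; chart default is vault-nico-issuer
      kind: ClusterIssuer
      group: cert-manager.io
    privateKey:
      algorithm: ECDSA               # chart default
      size: 384                      # chart default
```

Since `global` values are passed to every subchart, a single override of this kind should apply to all services that request certificates.
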
For reference configurations, see:

- `examples/values-minimal.yaml` -- Minimal deployment with only the core services
- `examples/values-full.yaml` -- Full deployment with all services and production settings

This Helm chart supersedes the Kustomize-based deployment previously located in deploy/. The mapping is straightforward:
- Each Kustomize component maps to a subchart with the same name.
- Base resources (Deployments, Services, ConfigMaps) are now templated within each subchart.
- Environment-specific configuration that was previously managed through Kustomize overlays should be provided via Helm values overrides (`-f values-myenv.yaml` or `--set` flags).
- ConfigMap generators in Kustomize are replaced by `config:` sections in each subchart's values, with the option to provide external ConfigMaps instead (`config.enabled: false`).

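
As a sketch of what a former Kustomize overlay might become, the values file below toggles a subchart and opts one subchart out of the chart-rendered configuration. Applying it to `nico-dhcp` is an arbitrary choice for illustration, and this README does not say how an externally supplied ConfigMap must be named, so check the subchart's own values before relying on it.

```yaml
# values-myenv.yaml (illustrative replacement for a Kustomize overlay)
nico-flow:
  enabled: true        # overlay previously added this component
nico-dhcp:
  config:
    enabled: false     # do not render the chart's ConfigMap; one is provided externally
```
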
To upgrade an existing release:

    helm upgrade nico ./helm \
      --namespace forge-system \
      -f values-production.yaml

Review changes before applying (this requires the `helm-diff` plugin):

    helm diff upgrade nico ./helm \
      --namespace forge-system \
      -f values-production.yaml

To remove the release:

    helm uninstall nico --namespace forge-system

Note that PersistentVolumeClaims, Secrets, and ConfigMaps created outside of Helm (by operators, Vault, or database controllers) are not removed by `helm uninstall`.

Apache-2.0