A machine's BMC IP should always be whatever machine_interfaces currently says it is. Right now the machine snapshot caches a copy of it in machine_topologies and silently falls back to that copy when the live address is gone -- so when a BMC's dynamic lease is released or its IP changes, the snapshot keeps handing out the old IP, and consumers act on an address that no longer exists.
The mismatch
machine_topologies.topology.bmc_info records the BMC ip/mac at discovery time, and nothing re-syncs it when machine_interfaces / machine_interface_addresses change.
The machine snapshot reads the live address first but falls back to that stale copy: 'ip', COALESCE(host(bmc_addr.address), t.bmc_info->>'ip') (same for mac) -- in both crates/api-db/src/sql/machine_snapshots.sql.template and crates/api-db/src/sql/managed_hosts.sql.template.
After a BMC's DHCP lease expires, the address row is deleted but the interface row survives, so bmc_addr.address is NULL and the snapshot resolves bmc_info.ip to the deleted IP. DPF then registers a DpuDevice and a Kubernetes host-bmc-ip label with a dead IP (machine-controller/src/handler/dpf.rs, dpf.rs), and a host whose BMC interface has no address can't be processed by the state machine.
What this involves
Make machine_interfaces the single source of truth for the BMC ip/mac in the snapshot: drop the t.bmc_info->>'ip' / ->>'mac' fallback in both machine_snapshots.sql.template and managed_hosts.sql.template. bmc_info.ip becomes the live address, or None when there is no current address.
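To make the before/after concrete, here is a minimal Rust model of the two resolution rules. It is only a sketch: the real logic lives in the SQL templates, and the function names bmc_ip_with_fallback and bmc_ip_live_only are hypothetical, not names from the codebase. Option::or stands in for SQL COALESCE.

```rust
use std::net::IpAddr;

/// Models the current snapshot expression
/// COALESCE(host(bmc_addr.address), t.bmc_info->>'ip'):
/// prefer the live address, otherwise fall back to the discovery-time copy.
/// (Hypothetical helper, for illustration only.)
pub fn bmc_ip_with_fallback(
    live: Option<IpAddr>,
    topology_copy: Option<IpAddr>,
) -> Option<IpAddr> {
    live.or(topology_copy)
}

/// Models the proposed expression with the fallback dropped:
/// the live address is the only source, so a released lease yields None.
/// (Hypothetical helper, for illustration only.)
pub fn bmc_ip_live_only(live: Option<IpAddr>) -> Option<IpAddr> {
    live
}
```

With a released lease (live address is None) and a discovery-time copy still present, the first function returns the stale IP while the second returns None, which is the behavior change this issue asks for.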
Repoint the two readers that pull the BMC IP straight from the topology copy so that they read the live source instead (the existing find_machine_bmc_pairs_by_machine_id / find_machine_id_by_bmc_ip already read machine_interfaces correctly):
crates/rack/src/firmware_update.rs:87
crates/api-core/src/handlers/machine_interface.rs:202 (the LookupBy::Serial branch; the Ip branch already reads live)
Leave machine_topologies as-is -- it stays the hardware-inventory snapshot. We are deliberately not adding a sync path (that would mean mirroring live network state into a discovery snapshot forever); we just stop reading the IP from it.
Heads-up: bmc_info.ip will now be honestly None after a release, so callers that error on a missing BMC IP fail loudly instead of acting on a phantom -- the intended, safer behavior. Making the BMC IP never vanish is separate work (retain_ip), as is DPF tolerating an IP change (feat: keep DPU BMC ip in sync for DPF (DPF Integration) #1629).
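As a rough illustration of "fail loudly", a caller could turn a missing address into an explicit error instead of proceeding. This is a sketch under assumptions: BmcInfo here is a stripped-down stand-in, and MissingBmcIp and require_bmc_ip are hypothetical names, not types or functions from the codebase.

```rust
use std::fmt;
use std::net::IpAddr;

/// Stand-in for the snapshot's BMC info; only the field relevant here.
/// (Assumed shape, for illustration only.)
pub struct BmcInfo {
    pub ip: Option<IpAddr>,
}

/// Hypothetical error returned when a machine has no current BMC address.
#[derive(Debug, PartialEq, Eq)]
pub struct MissingBmcIp {
    pub machine_id: String,
}

impl fmt::Display for MissingBmcIp {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "machine {} has no current BMC IP", self.machine_id)
    }
}

impl std::error::Error for MissingBmcIp {}

/// Returns the live BMC IP, or an error naming the machine when there is none.
pub fn require_bmc_ip(machine_id: &str, info: &BmcInfo) -> Result<IpAddr, MissingBmcIp> {
    info.ip.ok_or_else(|| MissingBmcIp {
        machine_id: machine_id.to_owned(),
    })
}
```

A caller written this way surfaces the released lease as an error at the point of use, rather than passing a phantom address on to DPF or the state machine.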
Notes
DPUDevice.bmcIp sync, blocked on DPF). This is the Carbide-side root-cause fix for the machine_interfaces <-> machine_topologies mismatch.
None, not the stale topology IP.