Machine snapshot serves a stale BMC IP after the address changes #2918

Description

@chet

A machine's BMC IP should always be whatever machine_interfaces currently says it is. Right now the machine snapshot caches a copy of it in machine_topologies and silently falls back to that copy when the live address is gone -- so when a BMC's dynamic lease is released or its IP changes, the snapshot keeps handing out the old IP, and consumers act on an address that no longer exists.

The mismatch

  • machine_topologies.topology.bmc_info records the BMC ip/mac at discovery time, and nothing re-syncs it when machine_interfaces / machine_interface_addresses change.
  • The machine snapshot reads the live address first but falls back to that stale copy: 'ip', COALESCE(host(bmc_addr.address), t.bmc_info->>'ip') (same for mac) -- in both crates/api-db/src/sql/machine_snapshots.sql.template and crates/api-db/src/sql/managed_hosts.sql.template.
  • After a BMC's DHCP lease expires the address row is deleted but the interface survives, so bmc_addr.address is NULL and the snapshot resolves bmc_info.ip to the deleted IP. DPF then registers a DpuDevice and a Kubernetes host-bmc-ip label with a dead IP (machine-controller/src/handler/dpf.rs, dpf.rs), and a host whose BMC interface has no address can't be processed by the state machine.
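The fallback described above can be modelled outside SQL. The following is a minimal sketch, not the project's code: `BmcSources`, `resolve_with_fallback`, and `resolve_live_only` are hypothetical names, and the two `Option` fields stand in for `host(bmc_addr.address)` and `t.bmc_info->>'ip'`. `Option::or` plays the role of `COALESCE`.

```rust
/// Hypothetical stand-in for the two places a BMC IP can come from.
struct BmcSources<'a> {
    /// Live address from machine_interfaces / machine_interface_addresses.
    /// None once the lease expires and the address row is deleted.
    live_ip: Option<&'a str>,
    /// Copy recorded in machine_topologies.topology.bmc_info at discovery time.
    /// Nothing re-syncs it, so it can be stale.
    topology_ip: Option<&'a str>,
}

/// Models the current snapshot expression:
/// COALESCE(host(bmc_addr.address), t.bmc_info->>'ip').
fn resolve_with_fallback<'a>(s: &BmcSources<'a>) -> Option<&'a str> {
    s.live_ip.or(s.topology_ip)
}

/// Models the expression with the fallback dropped: live address or nothing.
fn resolve_live_only<'a>(s: &BmcSources<'a>) -> Option<&'a str> {
    s.live_ip
}
```

With `live_ip: None` and `topology_ip: Some("10.0.0.5")` (an illustrative address), the first function still returns the released IP while the second returns `None`, which is the difference this issue is about.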

What this involves

  • Make machine_interfaces the single source of truth for the BMC ip/mac in the snapshot: drop the t.bmc_info->>'ip' / ->>'mac' fallback in both machine_snapshots.sql.template and managed_hosts.sql.template. bmc_info.ip becomes the live address, or None when there is no current address.
  • Repoint the two readers that still pull the BMC IP straight from the topology copy so they read the live source instead (the existing find_machine_bmc_pairs_by_machine_id / find_machine_id_by_bmc_ip already read machine_interfaces correctly):
    • crates/rack/src/firmware_update.rs:87
    • crates/api-core/src/handlers/machine_interface.rs:202 (the LookupBy::Serial branch; the Ip branch already reads live)
  • Leave machine_topologies as-is -- it stays the hardware-inventory snapshot. We are deliberately not adding a sync path (that would mean mirroring live network state into a discovery snapshot forever); we just stop reading the IP from it.
  • Heads-up: bmc_info.ip will now be honestly None after a release, so callers that error on a missing BMC IP fail loudly instead of acting on a phantom address. That is the intended, safer behavior. Making the BMC IP never vanish is separate work (retain_ip), as is DPF tolerating an IP change (#1629, "feat: keep DPU BMC ip in sync for DPF (DPF Integration)").
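As an illustration of the "fail loudly" behavior in the last bullet, a caller could turn a missing address into an explicit error rather than proceeding. This is a sketch under assumptions: `BmcError` and `require_bmc_ip` are hypothetical and do not exist in the codebase, and real callers may already have their own error types.

```rust
use std::net::IpAddr;

/// Hypothetical error type for callers that need a reachable BMC.
#[derive(Debug, PartialEq)]
enum BmcError {
    /// The BMC interface exists but currently has no address
    /// (for example, its DHCP lease was released).
    NoCurrentAddress { machine_id: String },
}

/// Returns the live BMC IP, or an explicit error when the snapshot
/// reports None. No fallback to a discovery-time copy is attempted.
fn require_bmc_ip(machine_id: &str, bmc_ip: Option<IpAddr>) -> Result<IpAddr, BmcError> {
    bmc_ip.ok_or_else(|| BmcError::NoCurrentAddress {
        machine_id: machine_id.to_string(),
    })
}
```

The point of the design is that the error surfaces at the caller that needs the address, instead of a dead IP being passed on to DPF or a Kubernetes label.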

Status
In Progress
