bug: host with health alert 'ContainerExists: crictl JSON output has empty 'containers' array. Is doca-hbn running?' #2843

Description

@david-mateer

Version

v0.9.3-0-gd09a7dd35

Describe the bug.

There is a health alert on the host:

ContainerExists: crictl JSON output has empty 'containers' array. Is doca-hbn running?

It happened in this part of the lifecycle (during instance termination):

2026-06-16T07:49:11.260549Z {"state": "assigned", "instance_state": {"state": "ready"}}
2026-06-16T09:29:52.225331Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "powercycle", "power_on": false, "power_on_retry_count": 0}}}
2026-06-16T09:30:29.246781Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "powercycle", "power_on": true, "power_on_retry_count": 0}}}
2026-06-16T09:32:12.782019Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "powercycle", "power_on": true, "power_on_retry_count": 1}}}
2026-06-16T09:34:33.918079Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "unlockhost", "unlock_host_state": {"state": "disablelockdown"}}}}
2026-06-16T09:34:50.387699Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "checkhostconfig"}}}
2026-06-16T09:38:38.969509Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "configurebios", "retry_count": 0}}}
2026-06-16T09:39:12.886872Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "pollingbiossetup"}}}
2026-06-16T09:39:46.439014Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "setbootorder", "set_boot_order_info": {"retry_count": 0, "set_boot_order_state": {"state": "setbootorder"}}}}}
2026-06-16T09:39:51.596080Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "setbootorder", "set_boot_order_info": {"retry_count": 0, "set_boot_order_state": {"state": "waitforsetbootorderjobscheduled"}}}}}
2026-06-16T09:40:21.746038Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "setbootorder", "set_boot_order_info": {"retry_count": 0, "set_boot_order_state": {"state": "reboothost"}}}}}
2026-06-16T09:40:50.947789Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "setbootorder", "set_boot_order_info": {"retry_count": 0, "set_boot_order_state": {"state": "waitforsetbootorderjobcompletion"}}}}}
2026-06-16T09:41:22.261845Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "setbootorder", "set_boot_order_info": {"retry_count": 0, "set_boot_order_state": {"state": "checkbootorder"}}}}}
2026-06-16T09:41:56.503796Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "lockhost"}}}
2026-06-16T09:41:59.452263Z {"state": "assigned", "instance_state": {"state": "waitingfordpustoup"}}
2026-06-16T09:42:51.194150Z {"state": "assigned", "instance_state": {"retry": {"count": 0}, "state": "bootingwithdiscoveryimage"}}
2026-06-16T09:49:35.199971Z {"state": "assigned", "instance_state": {"state": "switchtoadminnetwork"}}
2026-06-16T09:49:37.733800Z {"state": "assigned", "instance_state": {"state": "waitingfornetworkreconfig"}}

This is a DPU issue where the doca-hbn container comes up but hasn't finished initialising: it is still in state init-sfs and has not yet started as doca-hbn:

# crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                     ATTEMPT             POD ID              POD                                                      NAMESPACE
caee908d9564b       02343f4e83954       8 days ago          Running             init-sfs                 2                   4510badb2e617       doca-hbn-service-XXX-XXX-XXX-XXX.SITE.frg.nvidia.com         default
8de4006211a03       fb13922a56430       8 days ago          Running             doca-telemetry-service   13                  404c066d6fbce       doca-telemetry-service-XXX-XXX-XXX-XXX.SITE.frg.nvidia.com   default
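
To check this state without reading the table by eye, the `crictl ps` output can be filtered for rows whose pod name contains doca-hbn, printing only the container name. This is a minimal sketch, not part of the health check: `hbn_container_names` is a hypothetical helper, and it assumes the column layout shown above. It counts fields from the end of the line so that a CREATED value of varying width does not shift the result.

```shell
# Hypothetical helper: reads `crictl ps` table output on stdin and prints the
# container NAME for every row whose POD column contains "doca-hbn".
# Fields are counted from the right: NAMESPACE, POD, POD ID, ATTEMPT, NAME.
hbn_container_names() {
  awk 'NR > 1 && NF >= 5 && $(NF-1) ~ /doca-hbn/ { print $(NF-4) }'
}

# On the DPU (not run here): crictl ps | hbn_container_names
```

If this prints init-sfs rather than doca-hbn, the pod is still in its init container.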

If you look at the processes in the container, we're still running the init script:

# crictl exec -it $(crictl ps | awk '$0~"doca-hbn" {print $1}') bash
root@doca-hbn-service-XXX-XXX-XXX-XXX:/tmp# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 Jun16 ?        00:00:00 bash -c /bin/bash <<'EOF'  /initbootstrap.sh  # The below script will chroot to host and move the interfaces to # containerd created namespace for this pod.  # First we find the container id and the namespace. POD=$(chroot /host crictl pods --state
root           7       1  0 Jun16 ?        00:09:38 /bin/bash
root     2792433       0  0 12:31 pts/1    00:00:00 bash
root     2792468       7  0 12:31 ?        00:00:00 sleep 1
root     2792469 2792433  0 12:31 pts/1    00:00:00 ps -ef

The initbootstrap.sh script is logging these errors to the container log:

# tail /var/log/pods/default_doca-hbn-service-XXX-XXX-XXX-XXX.SITE.frg.nvidia.com_0b3d3231db11845c4699ba06ed96d757/init-sfs/2.log
2026-06-24T13:45:31.662914885Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:32.674183037Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:33.685377503Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:34.696444112Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:35.707400947Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:36.718380437Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:37.729221332Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:38.739999172Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:39.75099209Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:40.762251099Z stderr F Device "p0_if" does not exist.
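
To see how long the script has been looping on this error, the log can be summarised by counting the matching lines and taking the first and last timestamps. A minimal sketch, assuming the log line format shown above (timestamp, stream, tag, message); `missing_device_summary` is a hypothetical helper name.

```shell
# Hypothetical helper: reads a container log on stdin and prints
# "<count> <first-timestamp> <last-timestamp>" for the
# 'Device "..." does not exist.' errors, or "0" if there are none.
missing_device_summary() {
  awk '/Device ".*" does not exist\./ {
         if (n == 0) first = $1   # timestamp of the first matching line
         last = $1                # timestamp of the most recent match
         n++
       }
       END { if (n > 0) print n, first, last; else print 0 }'
}

# On the DPU (not run here), with the log path from the tail command above:
#   missing_device_summary < "$LOG"
```

Run against the whole file rather than the tail, this gives a rough idea of when the loop started for the current attempt.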

...but from within the container that interface exists:

# crictl exec -it $(crictl ps | awk '$0~"doca-hbn" {print $1}') bash
# ifconfig -a | grep -i flags
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
p0_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
p1_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0dpu1_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0dpu3_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0hpf_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0vf0_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0vf10_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0vf11_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0vf12_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0vf13_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0vf1_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0vf2_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0vf3_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0vf4_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0vf5_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0vf6_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0vf7_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0vf8_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf0vf9_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216
pf1hpf_if: flags=4098<BROADCAST,MULTICAST>  mtu 9216

So my guess would be that at the time the script ran, that interface wasn't there: it was still in the host ArmOS and hadn't yet made its way into the container. But it is there now, so this may be a timing issue.
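
One way to test this guess on an affected DPU is to check which namespace the interface is in, comparing the host ArmOS with the container. The sketch below is only a diagnostic aid under stated assumptions: `has_iface` is a hypothetical helper that reads `ip -o link show` output on stdin, `$ID` stands for the container ID, and it assumes `ip` is available inside the container (the transcripts above only show `ifconfig` there).

```shell
# Hypothetical helper: succeeds if interface $1 appears in `ip -o link show`
# output supplied on stdin. Field 2 looks like "p0_if:" or "eth0@if6:".
has_iface() {
  awk -v want="$1" '
    { name = $2
      sub(/:$/, "", name)    # drop the trailing colon
      sub(/@.*$/, "", name)  # drop any "@peer" suffix
      if (name == want) found = 1 }
    END { exit (found ? 0 : 1) }'
}

# On the DPU (not run here):
#   ip -o link show | has_iface p0_if && echo "host: present"
#   crictl exec "$ID" ip -o link show | has_iface p0_if && echo "container: present"
```

If the interface shows up only in the container while the script keeps reporting it missing, that would be consistent with the script looking for it somewhere it no longer is.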

In fact, this looks exactly the same as NVBug 5824879, which I opened back in January. I think that was closed by Mellanox because the sosreport I attached had a dev misconfiguration, so they must have presumed this issue was always due to misconfiguration.

It isn't: it happens for real, and to me it looks like a timing issue as the doca-hbn container starts. I have collected a sosreport from this DPU in case it helps, but I haven't attached it to this bug report. Let me know where to send it.

BTW a restart of the container seems to fix the problem, which would also make me think this is a timing issue:

# service kubelet@mgmt restart
# crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                     ATTEMPT             POD ID              POD                                                      NAMESPACE
fa82230c90041       02343f4e83954       14 seconds ago      Running             init-sfs                 0                   f07ab14b72aa0       doca-hbn-service-XXX-XXX-XXX-XXX.SITE.frg.nvidia.com         default
8de4006211a03       fb13922a56430       8 days ago          Running             doca-telemetry-service   13                  404c066d6fbce       doca-telemetry-service-XXX-XXX-XXX-XXX.SITE.frg.nvidia.com   default
# crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                     ATTEMPT             POD ID              POD                                                      NAMESPACE
0ef1b7a68432a       02343f4e83954       3 seconds ago       Running             doca-hbn                 0                   f07ab14b72aa0       doca-hbn-service-XXX-XXX-XXX-XXX.SITE.frg.nvidia.com         default
8de4006211a03       fb13922a56430       8 days ago          Running             doca-telemetry-service   13                  404c066d6fbce       doca-telemetry-service-XXX-XXX-XXX-XXX.SITE.frg.nvidia.com   default

You can see that the container now moves to doca-hbn after init-sfs completes. Let me know if you have

Minimum reproducible example

Relevant log output

Other/Misc.

No response

Code of Conduct

  • I agree to follow NVIDIA Infra Controller's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report

Metadata

Assignees

No one assigned

    Labels

    bug: A defect in existing software (deprecated - use issue type, but it's needed for reporting now)
    interest/dsx

    Type

    No fields configured for Bug.

    Projects

    Status
    Triage

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
