Version
v0.9.3-0-gd09a7dd35
Describe the bug.
There is a health alert on the host:
ContainerExists: crictl JSON output has empty 'containers' array. Is doca-hbn running?
It happened in this part of the lifecycle (during instance termination):
2026-06-16T07:49:11.260549Z {"state": "assigned", "instance_state": {"state": "ready"}}
2026-06-16T09:29:52.225331Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "powercycle", "power_on": false, "power_on_retry_count": 0}}}
2026-06-16T09:30:29.246781Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "powercycle", "power_on": true, "power_on_retry_count": 0}}}
2026-06-16T09:32:12.782019Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "powercycle", "power_on": true, "power_on_retry_count": 1}}}
2026-06-16T09:34:33.918079Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "unlockhost", "unlock_host_state": {"state": "disablelockdown"}}}}
2026-06-16T09:34:50.387699Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "checkhostconfig"}}}
2026-06-16T09:38:38.969509Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "configurebios", "retry_count": 0}}}
2026-06-16T09:39:12.886872Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "pollingbiossetup"}}}
2026-06-16T09:39:46.439014Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "setbootorder", "set_boot_order_info": {"retry_count": 0, "set_boot_order_state": {"state": "setbootorder"}}}}}
2026-06-16T09:39:51.596080Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "setbootorder", "set_boot_order_info": {"retry_count": 0, "set_boot_order_state": {"state": "waitforsetbootorderjobscheduled"}}}}}
2026-06-16T09:40:21.746038Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "setbootorder", "set_boot_order_info": {"retry_count": 0, "set_boot_order_state": {"state": "reboothost"}}}}}
2026-06-16T09:40:50.947789Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "setbootorder", "set_boot_order_info": {"retry_count": 0, "set_boot_order_state": {"state": "waitforsetbootorderjobcompletion"}}}}}
2026-06-16T09:41:22.261845Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "setbootorder", "set_boot_order_info": {"retry_count": 0, "set_boot_order_state": {"state": "checkbootorder"}}}}}
2026-06-16T09:41:56.503796Z {"state": "assigned", "instance_state": {"state": "hostplatformconfiguration", "platform_config_state": {"state": "lockhost"}}}
2026-06-16T09:41:59.452263Z {"state": "assigned", "instance_state": {"state": "waitingfordpustoup"}}
2026-06-16T09:42:51.194150Z {"state": "assigned", "instance_state": {"retry": {"count": 0}, "state": "bootingwithdiscoveryimage"}}
2026-06-16T09:49:35.199971Z {"state": "assigned", "instance_state": {"state": "switchtoadminnetwork"}}
2026-06-16T09:49:37.733800Z {"state": "assigned", "instance_state": {"state": "waitingfornetworkreconfig"}}
This is a DPU issue: the doca-hbn pod comes up, but its container hasn't finished initialising and is still running as init-sfs instead of moving on to doca-hbn:
# crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD NAMESPACE
caee908d9564b 02343f4e83954 8 days ago Running init-sfs 2 4510badb2e617 doca-hbn-service-XXX-XXX-XXX-XXX.SITE.frg.nvidia.com default
8de4006211a03 fb13922a56430 8 days ago Running doca-telemetry-service 13 404c066d6fbce doca-telemetry-service-XXX-XXX-XXX-XXX.SITE.frg.nvidia.com default
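The listing above can be reduced to a single value for scripting. The following is a minimal sketch, not part of any existing tooling: `hbn_container_name` is a hypothetical helper, and it assumes the `crictl ps` column layout shown in this report (NAME, ATTEMPT, POD ID, POD, NAMESPACE as the last five columns). It counts columns from the right so that a multi-word CREATED value such as "8 days ago" does not shift the result.

```shell
# Hypothetical helper: reads `crictl ps` table output on stdin and prints the
# NAME column of every container whose POD column matches the given pattern.
# Assumes the last five columns are NAME, ATTEMPT, POD ID, POD, NAMESPACE.
hbn_container_name() {
    awk -v pod="$1" 'NR > 1 && NF >= 5 && $(NF-1) ~ pod { print $(NF-4) }'
}

# Usage on the DPU:
#   crictl ps | hbn_container_name doca-hbn
```

On the DPU in this report the helper would print init-sfs while the pod is stuck, and doca-hbn once initialisation has completed.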
If you look at the processes in the container, it is still running the init script:
# crictl exec -it $(crictl ps | awk '$0~"doca-hbn" {print $1}') bash
root@doca-hbn-service-XXX-XXX-XXX-XXX:/tmp# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Jun16 ? 00:00:00 bash -c /bin/bash <<'EOF' /initbootstrap.sh # The below script will chroot to host and move the interfaces to # containerd created namespace for this pod. # First we find the container id and the namespace. POD=$(chroot /host crictl pods --state
root 7 1 0 Jun16 ? 00:09:38 /bin/bash
root 2792433 0 0 12:31 pts/1 00:00:00 bash
root 2792468 7 0 12:31 ? 00:00:00 sleep 1
root 2792469 2792433 0 12:31 pts/1 00:00:00 ps -ef
The initbootstrap.sh script is logging these errors to the container log:
# tail /var/log/pods/default_doca-hbn-service-XXX-XXX-XXX-XXX.SITE.frg.nvidia.com_0b3d3231db11845c4699ba06ed96d757/init-sfs/2.log
2026-06-24T13:45:31.662914885Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:32.674183037Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:33.685377503Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:34.696444112Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:35.707400947Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:36.718380437Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:37.729221332Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:38.739999172Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:39.75099209Z stderr F Device "p0_if" does not exist.
2026-06-24T13:45:40.762251099Z stderr F Device "p0_if" does not exist.
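To see how long the script has been repeating this error, the log can be summarised rather than tailed. This is a sketch under assumptions: `log_error_span` is a hypothetical helper, and it relies only on the log layout visible above, where the first whitespace-separated field of each line is the timestamp.

```shell
# Hypothetical helper: reads a container log on stdin and prints the number of
# lines containing the given fixed string, followed by the timestamps (first
# field) of the first and last matching lines. Prints 0 if nothing matches.
log_error_span() {
    awk -v pat="$1" '
        index($0, pat) { if (!n++) first = $1; last = $1 }
        END { if (n) print n, first, last; else print 0 }
    '
}

# Usage on the DPU, with LOG set to the init-sfs log path shown above:
#   log_error_span 'does not exist' < "$LOG"
```

Comparing the first timestamp with the pod creation time would show whether the errors started as soon as the container came up.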
...but from within the container that interface exists:
# crictl exec -it $(crictl ps | awk '$0~"doca-hbn" {print $1}') bash
# ifconfig -a | grep -i flags
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
p0_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
p1_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0dpu1_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0dpu3_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0hpf_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0vf0_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0vf10_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0vf11_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0vf12_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0vf13_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0vf1_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0vf2_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0vf3_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0vf4_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0vf5_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0vf6_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0vf7_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0vf8_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf0vf9_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
pf1hpf_if: flags=4098<BROADCAST,MULTICAST> mtu 9216
So my guess is that when the script ran, the interface wasn't there yet: it was still in the host ArmOS and hadn't yet made its way into the container. It is there now, so this may be a timing issue.
In fact, this looks exactly the same as NVBug 5824879, which I opened back in January. I think that was closed by Mellanox because the sosreport I attached had a dev misconfiguration, so they must have presumed this issue was always due to misconfiguration.
It isn't; it happens for real, and to me it looks like a timing issue as the doca-hbn container starts. I have collected a sosreport from this DPU in case it helps, but I haven't attached it to this bug report. Let me know where to send it.
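If this is a race, one way to make it visible is a bounded wait that gives up and fails, instead of retrying once a second indefinitely as the log above suggests. The following is only a sketch of that idea, not the contents of initbootstrap.sh (which this report shows only in truncated form) and not a confirmed fix: `wait_until` and the 60-second timeout are hypothetical, and the probe in the usage line uses the standard iproute2 `ip link show dev` command.

```shell
# Hypothetical helper: run a probe command every INTERVAL seconds until it
# succeeds or about TIMEOUT seconds have elapsed.
# Usage: wait_until TIMEOUT INTERVAL command [args...]
# Returns 0 as soon as the probe succeeds, 1 on timeout.
wait_until() {
    timeout_s=$1
    interval_s=$2
    shift 2
    waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
        waited=$((waited + interval_s))
    done
    return 0
}

# Example: fail after about 60 seconds if the interface never appears.
#   wait_until 60 1 ip link show dev p0_if || exit 1
```

A non-zero exit from an init step would surface as a container restart rather than a pod that sits in init-sfs for days, which would match the observation that a restart clears the condition.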
BTW, a restart of the container seems to fix the problem, which also makes me think this is a timing issue:
# service kubelet@mgmt restart
# crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD NAMESPACE
fa82230c90041 02343f4e83954 14 seconds ago Running init-sfs 0 f07ab14b72aa0 doca-hbn-service-XXX-XXX-XXX-XXX.SITE.frg.nvidia.com default
8de4006211a03 fb13922a56430 8 days ago Running doca-telemetry-service 13 404c066d6fbce doca-telemetry-service-XXX-XXX-XXX-XXX.SITE.frg.nvidia.com default
# crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD NAMESPACE
0ef1b7a68432a 02343f4e83954 3 seconds ago Running doca-hbn 0 f07ab14b72aa0 doca-hbn-service-XXX-XXX-XXX-XXX.SITE.frg.nvidia.com default
8de4006211a03 fb13922a56430 8 days ago Running doca-telemetry-service 13 404c066d6fbce doca-telemetry-service-XXX-XXX-XXX-XXX.SITE.frg.nvidia.com default
You can see that the container now moves to doca-hbn after init-sfs completes. Let me know if you have
Minimum reproducible example
Relevant log output
Other/Misc.
No response
Code of Conduct