Description
problem
Unable to create VMs; the request fails with a 504 error.
The server is recognized as GPU enabled.
Using a regular template with Ubuntu 22.04.
Using a service offering we created; it is the same as for all our servers but has a GPU and HA, and I checked the Video option just to test.
The agent log shows:
2025-12-09 20:33:14,351 INFO [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-5:[]) (logid:3e3d3f80) Trying to fetch storage pool e76f8956-1a81-3e97-aff6-8dc3f199a48a from libvirt
2025-12-09 20:34:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:35:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:36:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:37:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:38:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:39:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:40:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:41:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:42:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:43:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:43:05,155 INFO [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-3:[]) (logid:7f598e7f) Trying to fetch storage pool e76f8956-1a81-3e97-aff6-8dc3f199a48a from libvirt
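For context, those repeated "Ping Interval has gone past 300000" messages are driven by the ping-related global settings on the management server. A minimal sketch of how those values could be checked over the API, assuming the third-party `cs` Python client is available and using placeholder endpoint/credentials (not values from this report):

```python
# Sketch only: inspect the ping-related global settings behind the
# "Ping Interval has gone past ..." agent messages.
# Assumes the third-party `cs` client (pip install cs); endpoint,
# API key and secret below are placeholders.
from cs import CloudStack

api = CloudStack(endpoint="http://MGMT_SERVER:8080/client/api",
                 key="API_KEY",
                 secret="SECRET_KEY")

for name in ("ping.interval", "ping.timeout"):
    response = api.listConfigurations(name=name)
    for cfg in response.get("configuration", []):
        print(cfg["name"], "=", cfg["value"])
```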
On the management server there are no errors or disconnections.
The storage ID for this VM is 30aea531-8b82-478f-85db-e9991bf193f5.
I am able to reach the primary storage from the GPU Host.
Apart from this error, even after 45 minutes the system keeps spinning on creating the VM ("Launch Instance in progress").
Logs from the management server:
2025-12-09 20:46:49,858 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-cc70e228]) (logid:c4510d7b) No inactive management server node found
2025-12-09 20:46:49,858 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-cc70e228]) (logid:c4510d7b) Peer scan is finished. profiler: Done. Duration: 4ms , profilerQueryActiveList: Done. Duration: 1ms, , profilerSyncClusterInfo: Done. Duration: 1ms, profilerInvalidatedNodeList: Done. Duration: 0ms, profilerRemovedList: Done. Duration: 0ms,, profilerNewList: Done. Duration: 0ms, profilerInactiveList: Done. Duration: 1ms
2025-12-09 20:46:51,322 DEBUG [o.a.c.h.H.HAManagerBgPollTask] (BackgroundTaskPollManager-4:[ctx-1aad6e7c]) (logid:829826e7) HA health check task is running...
2025-12-09 20:46:51,358 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-c1ff2c4b]) (logid:07a5943d) No inactive management server node found
2025-12-09 20:46:51,358 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-c1ff2c4b]) (logid:07a5943d) Peer scan is finished. profiler: Done. Duration: 4ms , profilerQueryActiveList: Done. Duration: 1ms, , profilerSyncClusterInfo: Done. Duration: 1ms, profilerInvalidatedNodeList: Done. Duration: 0ms, profilerRemovedList: Done. Duration: 0ms,, profilerNewList: Done. Duration: 0ms, profilerInactiveList: Done. Duration: 1ms
2025-12-09 20:46:51,678 INFO [c.c.a.m.A.MonitorTask] (AgentMonitor-1:[ctx-fe31f2a4]) (logid:825f839a) Found the following agents behind on ping: [75]
2025-12-09 20:46:51,683 DEBUG [c.c.a.m.A.MonitorTask] (AgentMonitor-1:[ctx-fe31f2a4]) (logid:825f839a) Ping timeout for agent Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}, do investigation
2025-12-09 20:46:51,685 INFO [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Investigating why host Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"} has disconnected with event
2025-12-09 20:46:51,687 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Checking if agent (Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}) is alive
2025-12-09 20:46:51,689 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Wait time setting on com.cloud.agent.api.CheckHealthCommand is 50 seconds
2025-12-09 20:46:51,690 DEBUG [c.c.a.m.ClusteredAgentAttache] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Routed from 250977680725600
2025-12-09 20:46:51,690 DEBUG [c.c.a.t.Request] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Sending { Cmd , MgmtId: 250977680725600, via: 75(ggpu), Ver: v1, Flags: 100011,
[{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }
2025-12-09 20:46:51,733 DEBUG [c.c.a.t.Request] (AgentManager-Handler-11:[]) (logid:) Seq 75-1207246175112003675: Processing: { Ans: , MgmtId: 250977680725600, via: 75, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckHealthAnswer":{"result":"true","details":"resource is alive","wait":"0","bypassHostMaintenance":"false"}}] }
2025-12-09 20:46:51,734 DEBUG [c.c.a.t.Request] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Received: { Ans: , MgmtId: 250977680725600, via: 75(ggpu), Ver: v1, Flags: 10, { CheckHealthAnswer } }
2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Details from executing class com.cloud.agent.api.CheckHealthCommand: resource is alive
2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Agent (Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}) responded to checkHealthCommand, reporting that agent is Up
2025-12-09 20:46:51,734 INFO [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) The agent from host Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"} state determined is Up
2025-12-09 20:46:51,734 INFO [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Agent is determined to be up and running
2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) [Resource state = Enabled, Agent event = , Host = Ping]
2025-12-09 20:46:52,121 DEBUG [c.c.a.ApiServlet] (qtp1438988851-251223:[ctx-4df48857]) (logid:9ab16f87) ===START=== SOMEIPADDRESS -- GET jobId=22d89170-20e6-4151-a809-552938d734e9&command=queryAsyncJobResult&response=json&
2025-12-09 20:46:52,121 DEBUG [c.c.a.ApiServlet] (qtp1438988851-251223:[ctx-4df48857]) (logid:9ab16f87) Two factor authentication is already verified for the user 2, so skipping
2025-12-09 20:46:52,134 DEBUG [c.c.a.ApiServer] (qtp1438988851-251223:[ctx-4df48857, ctx-caa819c6]) (logid:9ab16f87) CIDRs from which account 'Account [{"accountName":"admin","id":2,"uuid":"45a1be9e-2c67-11f0-a2e6-9ee6a2dce283"}]' is allowed to perform API calls:
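The last lines above show the UI polling queryAsyncJobResult for the deployment job. A minimal sketch of polling that same job directly to surface the underlying jobresult error instead of the generic 504, again assuming the third-party `cs` Python client and placeholder endpoint/credentials:

```python
# Sketch only: poll the deployment job seen in the log above and print
# its jobresult, which carries the real failure reason.
# Assumes the third-party `cs` client; endpoint and keys are placeholders.
import time
from cs import CloudStack

api = CloudStack(endpoint="http://MGMT_SERVER:8080/client/api",
                 key="API_KEY",
                 secret="SECRET_KEY")

job_id = "22d89170-20e6-4151-a809-552938d734e9"  # jobId from the log above

while True:
    job = api.queryAsyncJobResult(jobid=job_id)
    if job.get("jobstatus", 0) != 0:          # 0 = still in progress
        print("status:", job["jobstatus"])    # 1 = success, 2 = failure
        print("result:", job.get("jobresult"))
        break
    time.sleep(10)
```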
I noticed that the virtual router for the isolated network is on another server; I do not have any host tags configured at the moment.
Final error:
Unable to orchestrate the start of VM instance {"instanceName":"i-2-223-VM","uuid":"a12748a3-7519-4732-8445-05dfa96046b7"}.
versions
The versions of ACS, hypervisors, storage, network etc..
ACS 4.22.0
KVM for the GPU host and the other hosts
Ceph RBD primary storage
NFS secondary storage
VXLAN, configured the same as on the other servers
Ubuntu 22.04 as the hypervisor OS
Ubuntu 22.04 as the template (the same template used for other VMs)
The GPU is recognized by the system with no problems.
The steps to reproduce the bug
- Using a GPU service offering with HA and GPU Display set to true; we have disabled OOB management.
- Add a simple VM on an isolated network, using a GPU offering with 1 GPU.
- Everything starts OK: the VR is created automatically on a regular CPU server, storage is created, and IP addresses are allocated.
- Instance creation fails after 35+ minutes (an API-level sketch of the same reproduction is included below).
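The same reproduction can also be driven through the API so the async job id is captured immediately and can be polled as shown earlier. A minimal sketch, assuming the third-party `cs` Python client; every id below is a placeholder (zone, GPU service offering, Ubuntu 22.04 template, isolated network), not a value from this report:

```python
# Sketch only: deploy the same kind of instance via the API and print
# the async job id for follow-up polling.
# Assumes the third-party `cs` client; all ids and credentials are placeholders.
from cs import CloudStack

api = CloudStack(endpoint="http://MGMT_SERVER:8080/client/api",
                 key="API_KEY",
                 secret="SECRET_KEY")

deploy = api.deployVirtualMachine(
    zoneid="ZONE_ID",
    serviceofferingid="GPU_OFFERING_ID",   # GPU offering with HA enabled
    templateid="UBUNTU_2204_TEMPLATE_ID",
    networkids="ISOLATED_NETWORK_ID",
)
print("async job id:", deploy.get("jobid"))
```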
(One more screenshot was attached here.)
Please guide us on the proper settings.
Thank you.
What to do about it?
No response