
CS4.22 GPU Server Unable to create a VM #12216

@tatay188

Description


Problem

Unable to create VMs; the request fails with a 504 error.

The server is recognized as GPU-enabled.
Using a regular template with Ubuntu 22.04.

Using a service offering we created; it is the same as for all our other servers but includes a GPU and HA, and I checked the video (GPU Display) option just to test.
The agent logs show:

2025-12-09 20:33:14,351 INFO  [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-5:[]) (logid:3e3d3f80) Trying to fetch storage pool e76f8956-1a81-3e97-aff6-8dc3f199a48a from libvirt
2025-12-09 20:34:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:35:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:36:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:37:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:38:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:39:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:40:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:41:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:42:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:43:01,633 ERROR [cloud.agent.Agent] (AgentOutRequest-Handler-6:[]) (logid:) Ping Interval has gone past 300000. Won't reconnect to mgt server, as connection is still alive
2025-12-09 20:43:05,155 INFO  [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-3:[]) (logid:7f598e7f) Trying to fetch storage pool e76f8956-1a81-3e97-aff6-8dc3f199a48a from libvirt
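For reference, a minimal sketch of reading the ping-related global settings (ping.interval and ping.timeout, which govern the agent ping cadence and the management server's behind-on-ping check) so they can be compared with the 300000 ms figure in the log above. It assumes the third-party `cs` Python client (pip install cs); the endpoint URL and keys are placeholders, not values from this environment.

```python
# Minimal sketch: read the ping-related global settings so the 300000 ms
# figure in the agent log can be compared with the configured values.
# Assumes the third-party "cs" client (pip install cs); endpoint and keys
# below are placeholders.
from cs import CloudStack

api = CloudStack(endpoint="http://mgmt.example.com:8080/client/api",  # placeholder
                 key="YOUR_API_KEY",
                 secret="YOUR_SECRET_KEY")

for name in ("ping.interval", "ping.timeout"):
    resp = api.listConfigurations(name=name)
    for cfg in resp.get("configuration", []):
        print(cfg["name"], "=", cfg["value"])
```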

On the server, there are no errors or disconnections.
The storage ID for this VM shows:

ID: 30aea531-8b82-478f-85db-e9991bf193f5

I am able to reach the primary storage from the GPU Host.
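A minimal sketch, run directly on the GPU host, of confirming that libvirt itself can look up and refresh the primary storage pool referenced in the agent log; it assumes the libvirt Python bindings (python3-libvirt) are installed.

```python
# Minimal sketch, run on the GPU host: look up and refresh the storage pool
# from the agent log and print its basic stats. Assumes python3-libvirt.
import libvirt

POOL_UUID = "e76f8956-1a81-3e97-aff6-8dc3f199a48a"  # pool UUID from the agent log

conn = libvirt.open("qemu:///system")
pool = conn.storagePoolLookupByUUIDString(POOL_UUID)
pool.refresh(0)
state, capacity, allocation, available = pool.info()
print(f"pool={pool.name()} active={bool(pool.isActive())} "
      f"capacity={capacity} allocation={allocation} available={available}")
conn.close()
```

If the lookup or refresh stalls here for minutes, that could point to the same slowness suggested by the long gap between the two "Trying to fetch storage pool" lines in the agent log.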

Apart from this error, even after 45 minutes the system keeps spinning on "Launch Instance in progress" while creating the VM.

[Screenshot]

Logs from the management server:

2025-12-09 20:46:49,858 INFO  [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-cc70e228]) (logid:c4510d7b) No inactive management server node found

2025-12-09 20:46:49,858 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-cc70e228]) (logid:c4510d7b) Peer scan is finished. profiler: Done. Duration: 4ms , profilerQueryActiveList: Done. Duration: 1ms, , profilerSyncClusterInfo: Done. Duration: 1ms, profilerInvalidatedNodeList: Done. Duration: 0ms, profilerRemovedList: Done. Duration: 0ms,, profilerNewList: Done. Duration: 0ms, profilerInactiveList: Done. Duration: 1ms

2025-12-09 20:46:51,322 DEBUG [o.a.c.h.H.HAManagerBgPollTask] (BackgroundTaskPollManager-4:[ctx-1aad6e7c]) (logid:829826e7) HA health check task is running...

2025-12-09 20:46:51,358 INFO  [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-c1ff2c4b]) (logid:07a5943d) No inactive management server node found

2025-12-09 20:46:51,358 DEBUG [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-c1ff2c4b]) (logid:07a5943d) Peer scan is finished. profiler: Done. Duration: 4ms , profilerQueryActiveList: Done. Duration: 1ms, , profilerSyncClusterInfo: Done. Duration: 1ms, profilerInvalidatedNodeList: Done. Duration: 0ms, profilerRemovedList: Done. Duration: 0ms,, profilerNewList: Done. Duration: 0ms, profilerInactiveList: Done. Duration: 1ms

2025-12-09 20:46:51,678 INFO  [c.c.a.m.A.MonitorTask] (AgentMonitor-1:[ctx-fe31f2a4]) (logid:825f839a) Found the following agents behind on ping: [75]

2025-12-09 20:46:51,683 DEBUG [c.c.a.m.A.MonitorTask] (AgentMonitor-1:[ctx-fe31f2a4]) (logid:825f839a) Ping timeout for agent Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}, do investigation

2025-12-09 20:46:51,685 INFO  [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Investigating why host Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"} has disconnected with event

2025-12-09 20:46:51,687 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Checking if agent (Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}) is alive

2025-12-09 20:46:51,689 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Wait time setting on com.cloud.agent.api.CheckHealthCommand is 50 seconds

2025-12-09 20:46:51,690 DEBUG [c.c.a.m.ClusteredAgentAttache] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Routed from 250977680725600

2025-12-09 20:46:51,690 DEBUG [c.c.a.t.Request] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Sending  { Cmd , MgmtId: 250977680725600, via: 75(ggpu), Ver: v1, Flags: 100011, 
[{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }

2025-12-09 20:46:51,733 DEBUG [c.c.a.t.Request] (AgentManager-Handler-11:[]) (logid:) Seq 75-1207246175112003675: Processing:  { Ans: , MgmtId: 250977680725600, via: 75, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckHealthAnswer":{"result":"true","details":"resource is alive","wait":"0","bypassHostMaintenance":"false"}}] }

2025-12-09 20:46:51,734 DEBUG [c.c.a.t.Request] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Seq 75-1207246175112003675: Received:  { Ans: , MgmtId: 250977680725600, via: 75(ggpu), Ver: v1, Flags: 10, { CheckHealthAnswer } }

2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Details from executing class com.cloud.agent.api.CheckHealthCommand: resource is alive

2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Agent (Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"}) responded to checkHealthCommand, reporting that agent is Up

2025-12-09 20:46:51,734 INFO  [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) The agent from host Host {"id":75,"name":"ggpu","type":"Routing","uuid":"98715dad-759d-44c7-a633-524e9aa67431"} state determined is Up

2025-12-09 20:46:51,734 INFO  [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) Agent is determined to be up and running

2025-12-09 20:46:51,734 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentTaskPool-6:[ctx-f54bab9b]) (logid:58c1e81b) [Resource state = Enabled, Agent event = , Host = Ping]

2025-12-09 20:46:52,121 DEBUG [c.c.a.ApiServlet] (qtp1438988851-251223:[ctx-4df48857]) (logid:9ab16f87) ===START===  SOMEIPADDRESS  -- GET  jobId=22d89170-20e6-4151-a809-552938d734e9&command=queryAsyncJobResult&response=json&

2025-12-09 20:46:52,121 DEBUG [c.c.a.ApiServlet] (qtp1438988851-251223:[ctx-4df48857]) (logid:9ab16f87) Two factor authentication is already verified for the user 2, so skipping

2025-12-09 20:46:52,134 DEBUG [c.c.a.ApiServer] (qtp1438988851-251223:[ctx-4df48857, ctx-caa819c6]) (logid:9ab16f87) CIDRs from which account 'Account [{"accountName":"admin","id":2,"uuid":"45a1be9e-2c67-11f0-a2e6-9ee6a2dce283"}]' is allowed to perform API calls:

I noticed that the virtual router for the isolated network is on another server; I do not have any server tags set at the moment.
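
A minimal sketch of double-checking host tags on the KVM hosts (same third-party `cs` client and placeholder credentials as above), since a tag mismatch between a GPU service offering and the GPU host is one way a deployment can be filtered away from the intended host.

```python
# Minimal sketch: list routing (hypervisor) hosts and any host tags, to verify
# that no tag on a GPU offering/host is filtering hosts during deployment.
# Same third-party "cs" client; endpoint and keys are placeholders.
from cs import CloudStack

api = CloudStack(endpoint="http://mgmt.example.com:8080/client/api",  # placeholder
                 key="YOUR_API_KEY",
                 secret="YOUR_SECRET_KEY")

for host in api.listHosts(type="Routing").get("host", []):
    print(host["name"], host.get("state"), host.get("hosttags", "<no host tags>"))
```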

Final error:

Error
Unable to orchestrate the start of VM instance {"instanceName":"i-2-223-VM","uuid":"a12748a3-7519-4732-8445-05dfa96046b7"}.
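
A minimal sketch of pulling the full result of the async job the UI was polling (job id taken from the management-server log above); once the job finishes, jobresult usually carries the underlying exception rather than the generic orchestration message. Same `cs` client and placeholder credentials as in the earlier sketches.

```python
# Minimal sketch: fetch the async job the UI was polling (job id from the
# management-server log). The "jobresult" field usually contains the detailed
# error once the job completes. Same "cs" client; credentials are placeholders.
from cs import CloudStack

api = CloudStack(endpoint="http://mgmt.example.com:8080/client/api",  # placeholder
                 key="YOUR_API_KEY",
                 secret="YOUR_SECRET_KEY")

job = api.queryAsyncJobResult(jobid="22d89170-20e6-4151-a809-552938d734e9")
print("jobstatus:", job.get("jobstatus"))   # 0 = in progress, 1 = completed, 2 = failed
print("jobresult:", job.get("jobresult"))
```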

Versions

The versions of ACS, hypervisors, storage, network, etc.:
ACS 4.22.0
KVM for the GPU and other hosts
Ceph RBD primary storage
NFS secondary storage
VXLAN, running the same as on the other servers
Ubuntu 22.04 as the hypervisor OS
Ubuntu 22.04 as the template - the same template used for other VMs

The GPU is recognized by the system with no problems.

The steps to reproduce the bug

  1. Use a GPU service offering with HA and GPU Display set to true - we have disabled OOB management.
  2. Add a simple VM on an isolated network, using a GPU offering with 1 GPU.
  3. Everything starts OK: the VR is created automatically on a regular CPU server, storage is created, and IP addresses are allocated.
  4. Instance creation fails after 35+ minutes.

One more screenshot:

[Screenshot]

Please guide us on the proper settings.

Thank you

What to do about it?

No response
