23/09/27 19:42:59 INFO TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:42:59 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.3 KiB, free 47.8 GiB)
23/09/27 19:42:59 INFO TorrentBroadcast: Reading broadcast variable 3 took 13 ms
23/09/27 19:42:59 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 19.5 KiB, free 47.8 GiB)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 192, boot = -749, init = 941, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 203, boot = -723, init = 926, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 4.0 in stage 4.0 (TID 389). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 36.0 in stage 4.0 (TID 421). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 440
23/09/27 19:43:00 INFO Executor: Running task 55.0 in stage 4.0 (TID 440)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 220, boot = -692, init = 912, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 443
23/09/27 19:43:00 INFO Executor: Running task 58.0 in stage 4.0 (TID 443)
23/09/27 19:43:00 INFO Executor: Finished task 44.0 in stage 4.0 (TID 429). 2004 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 446
23/09/27 19:43:00 INFO Executor: Running task 61.0 in stage 4.0 (TID 446)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 238, boot = -679, init = 917, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = -767, init = 1006, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 12.0 in stage 4.0 (TID 397). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 20.0 in stage 4.0 (TID 405). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 453
23/09/27 19:43:00 INFO Executor: Running task 68.0 in stage 4.0 (TID 453)
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 454
23/09/27 19:43:00 INFO Executor: Running task 69.0 in stage 4.0 (TID 454)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 280, boot = -698, init = 978, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 28.0 in stage 4.0 (TID 413). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 466
23/09/27 19:43:00 INFO Executor: Running task 81.0 in stage 4.0 (TID 466)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 159, boot = -7, init = 166, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 164, boot = -14, init = 178, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 55.0 in stage 4.0 (TID 440). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 58.0 in stage 4.0 (TID 443). 2004 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 473
23/09/27 19:43:00 INFO Executor: Running task 88.0 in stage 4.0 (TID 473)
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 474
23/09/27 19:43:00 INFO Executor: Running task 89.0 in stage 4.0 (TID 474)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 173, boot = -3, init = 176, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 68.0 in stage 4.0 (TID 453). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 479
23/09/27 19:43:00 INFO Executor: Running task 94.0 in stage 4.0 (TID 479)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 244, boot = -4, init = 248, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 61.0 in stage 4.0 (TID 446). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 194, boot = 8, init = 186, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 489
23/09/27 19:43:00 INFO Executor: Finished task 81.0 in stage 4.0 (TID 466). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Running task 104.0 in stage 4.0 (TID 489)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 249, boot = -5, init = 254, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 494
23/09/27 19:43:00 INFO Executor: Running task 109.0 in stage 4.0 (TID 494)
23/09/27 19:43:00 INFO Executor: Finished task 69.0 in stage 4.0 (TID 454). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 499
23/09/27 19:43:00 INFO Executor: Running task 114.0 in stage 4.0 (TID 499)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 215, boot = 1, init = 214, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 89.0 in stage 4.0 (TID 474). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 507
23/09/27 19:43:00 INFO Executor: Running task 122.0 in stage 4.0 (TID 507)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 272, boot = 15, init = 256, finish = 1
23/09/27 19:43:00 INFO Executor: Finished task 88.0 in stage 4.0 (TID 473). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = 6, init = 233, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 515
23/09/27 19:43:00 INFO Executor: Running task 130.0 in stage 4.0 (TID 515)
23/09/27 19:43:00 INFO Executor: Finished task 94.0 in stage 4.0 (TID 479). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 519
23/09/27 19:43:00 INFO Executor: Running task 134.0 in stage 4.0 (TID 519)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 240, boot = -7, init = 247, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 114.0 in stage 4.0 (TID 499). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO PythonRunner: Times: total = 274, boot = 0, init = 274, finish = 0
23/09/27 19:43:00 INFO PythonRunner: Times: total = 259, boot = -7, init = 266, finish = 0
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 536
23/09/27 19:43:00 INFO Executor: Running task 151.0 in stage 4.0 (TID 536)
23/09/27 19:43:00 INFO Executor: Finished task 104.0 in stage 4.0 (TID 489). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO Executor: Finished task 109.0 in stage 4.0 (TID 494). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 537
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 538
23/09/27 19:43:00 INFO Executor: Running task 152.0 in stage 4.0 (TID 537)
23/09/27 19:43:00 INFO Executor: Running task 153.0 in stage 4.0 (TID 538)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 269, boot = 9, init = 260, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 122.0 in stage 4.0 (TID 507). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 547
23/09/27 19:43:00 INFO Executor: Running task 162.0 in stage 4.0 (TID 547)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 246, boot = -10, init = 256, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 134.0 in stage 4.0 (TID 519). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 560
23/09/27 19:43:00 INFO Executor: Running task 175.0 in stage 4.0 (TID 560)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 297, boot = 6, init = 290, finish = 1
23/09/27 19:43:00 INFO Executor: Finished task 130.0 in stage 4.0 (TID 515). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 568
23/09/27 19:43:00 INFO Executor: Running task 183.0 in stage 4.0 (TID 568)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 241, boot = 3, init = 238, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 151.0 in stage 4.0 (TID 536). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 570
23/09/27 19:43:00 INFO Executor: Running task 185.0 in stage 4.0 (TID 570)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 239, boot = 7, init = 232, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 152.0 in stage 4.0 (TID 537). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 571
23/09/27 19:43:00 INFO Executor: Running task 186.0 in stage 4.0 (TID 571)
23/09/27 19:43:00 INFO PythonRunner: Times: total = 258, boot = 14, init = 244, finish = 0
23/09/27 19:43:00 INFO Executor: Finished task 153.0 in stage 4.0 (TID 538). 2047 bytes result sent to driver
23/09/27 19:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 574
23/09/27 19:43:01 INFO Executor: Running task 189.0 in stage 4.0 (TID 574)
23/09/27 19:43:01 INFO PythonRunner: Times: total = 215, boot = 15, init = 200, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 162.0 in stage 4.0 (TID 547). 2004 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 162, boot = -6, init = 168, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 185.0 in stage 4.0 (TID 570). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 230, boot = -5, init = 235, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 175.0 in stage 4.0 (TID 560). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 154, boot = 0, init = 154, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 189.0 in stage 4.0 (TID 574). 2004 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 244, boot = 15, init = 229, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 183.0 in stage 4.0 (TID 568). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO PythonRunner: Times: total = 219, boot = 7, init = 212, finish = 0
23/09/27 19:43:01 INFO Executor: Finished task 186.0 in stage 4.0 (TID 571). 2047 bytes result sent to driver
23/09/27 19:43:01 INFO UCX: UCX context created
23/09/27 19:43:01 INFO UCX: UCX Worker created
23/09/27 19:43:02 INFO UCX: Started UcpListener on /<master_ip>:57306
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Registering executor BlockManagerId(1, <master_ip>, 32805, Some(rapids=57306)) with driver
23/09/27 19:43:02 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(0, <master_ip>, 41505, Some(rapids=62205))
23/09/27 19:43:02 INFO UCX: Creating connection for executorId 0
23/09/27 19:43:02 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=0) started
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:46124
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815552, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46124) for /<master_ip>:46124
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:46148
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46148) for /<master_ip>:46148
23/09/27 19:43:02 INFO UCX: Got UcpListener request from /<master_ip>:46136
23/09/27 19:43:02 INFO UCX: Created ConnectionRequest endpoint UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46136) for /<master_ip>:46136
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Success sending handshake header! java.nio.DirectByteBuffer[pos=0 lim=281 cap=281]
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 3: UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46148)
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 2: UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46136)
23/09/27 19:43:02 INFO UCX: Established endpoint on ConnectionRequest for executor 0: UcpEndpoint(id=139640732815552, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46124)
23/09/27 19:43:02 INFO CoarseGrainedExecutorBackend: Got assigned task 577
23/09/27 19:43:02 INFO Executor: Running task 0.0 in stage 6.0 (TID 577)
23/09/27 19:43:02 INFO MapOutputTrackerWorker: Updating epoch to 4 and clearing cache
23/09/27 19:43:02 INFO TorrentBroadcast: Started reading broadcast variable 4 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:02 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 13.2 KiB, free 47.8 GiB)
23/09/27 19:43:02 INFO TorrentBroadcast: Reading broadcast variable 4 took 7 ms
23/09/27 19:43:02 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 29.0 KiB, free 47.8 GiB)
23/09/27 19:43:04 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 2, fetching them
23/09/27 19:43:04 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:35063)
23/09/27 19:43:04 INFO MapOutputTrackerWorker: Got the map output locations
23/09/27 19:43:04 INFO ShuffleBlockFetcherIterator: Getting 2 (194.0 B) non-empty blocks including 0 (0.0 B) local and 1 (97.0 B) host-local and 0 (0.0 B) push-merged-local and 1 (97.0 B) remote blocks
23/09/27 19:43:04 INFO TransportClientFactory: Successfully created connection to /<worker_ip_from_non_master_node>:38739 after 2 ms (0 ms spent in bootstraps)
23/09/27 19:43:04 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 21 ms
23/09/27 19:43:04 INFO TransportClientFactory: Successfully created connection to /<master_ip>:36401 after 1 ms (0 ms spent in bootstraps)
23/09/27 19:43:04 INFO CodeGenerator: Code generated in 48.927407 ms
23/09/27 19:43:04 INFO CodeGenerator: Code generated in 19.687474 ms
23/09/27 19:43:04 INFO Executor: Finished task 0.0 in stage 6.0 (TID 577). 4021 bytes result sent to driver
23/09/27 19:43:05 INFO CoarseGrainedExecutorBackend: Got assigned task 584
23/09/27 19:43:05 INFO Executor: Running task 6.0 in stage 9.0 (TID 584)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Updating epoch to 5 and clearing cache
23/09/27 19:43:05 INFO TorrentBroadcast: Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB)
23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 28.7 KiB, free 47.8 GiB)
23/09/27 19:43:05 INFO TorrentBroadcast: Reading broadcast variable 5 took 17 ms
23/09/27 19:43:05 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 59.1 KiB, free 47.8 GiB)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@master:35063)
23/09/27 19:43:05 INFO MapOutputTrackerWorker: Got the map output locations
23/09/27 19:43:05 INFO ShuffleBlockFetcherIterator: Getting 0 (0.0 B) non-empty blocks including 0 (0.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
23/09/27 19:43:05 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
23/09/27 19:43:05 INFO CodeGenerator: Code generated in 14.846185 ms
23/09/27 19:43:05 INFO PythonRunner: Times: total = 139, boot = -4325, init = 4464, finish = 0
23/09/27 19:43:05 INFO Executor: Finished task 6.0 in stage 9.0 (TID 584). 6740 bytes result sent to driver
23/09/27 19:43:07 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(3, <master_ip>, 36401, Some(rapids=29068))
23/09/27 19:43:07 INFO UCX: Creating connection for executorId 3
23/09/27 19:43:07 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=3) started
23/09/27 19:43:07 INFO RapidsShuffleHeartbeatEndpoint: Updating shuffle manager for new executor BlockManagerId(2, <master_ip>, 36445, Some(rapids=50893))
23/09/27 19:43:07 INFO UCX: Creating connection for executorId 2
23/09/27 19:43:07 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@331ae0ff, peerExecutorId=2) started
23/09/27 19:46:00 INFO RapidsShuffleInternalManager: Unregistering shuffle 1 from shuffle buffer catalog
23/09/27 19:46:00 WARN ShuffleBufferCatalog: Ignoring unregister of unknown shuffle 1
23/09/27 19:46:32 ERROR UCX: UcpListener detected an error for executorId 2: UCXError(-25,Connection reset by remote peer)
23/09/27 19:46:32 WARN UCX: Removing endpoint UcpEndpoint(id=139640732815680, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46136) for 2
23/09/27 19:46:32 WARN UCX: Removed stale client connection for 2
23/09/27 19:46:32 ERROR UCX: Error while closing ep. Ignoring.
org.openucx.jucx.UcxException: Connection reset by remote peer
    at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingNative(Native Method)
    at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingFlush(UcpEndpoint.java:441)
    at com.nvidia.spark.rapids.shuffle.ucx.UCX$UcpEndpointManager.$anonfun$closeEndpointOnWorkerThread$1(UCX.scala:904)
    at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5(UCX.scala:188)
    at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5$adapted(UCX.scala:182)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$2(UCX.scala:182)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at com.nvidia.spark.rapids.GpuDeviceManager$$anon$1.$anonfun$newThread$1(GpuDeviceManager.scala:490)
    at java.base/java.lang.Thread.run(Thread.java:833)
23/09/27 19:46:32 ERROR UCX: UcpListener detected an error for executorId 3: UCXError(-25,Connection reset by remote peer)
23/09/27 19:46:32 WARN UCX: Removing endpoint UcpEndpoint(id=139640732815616, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,connectionRequest/<master_ip>:46148) for 3
23/09/27 19:46:32 WARN UCX: Removed stale client connection for 3
23/09/27 19:46:32 ERROR UCX: Error while closing ep. Ignoring.
org.openucx.jucx.UcxException: Connection reset by remote peer
    at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingNative(Native Method)
    at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingFlush(UcpEndpoint.java:441)
    at com.nvidia.spark.rapids.shuffle.ucx.UCX$UcpEndpointManager.$anonfun$closeEndpointOnWorkerThread$1(UCX.scala:904)
    at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5(UCX.scala:188)
    at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5$adapted(UCX.scala:182)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$2(UCX.scala:182)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at com.nvidia.spark.rapids.GpuDeviceManager$$anon$1.$anonfun$newThread$1(GpuDeviceManager.scala:490)
    at java.base/java.lang.Thread.run(Thread.java:833)
23/09/27 19:46:32 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
23/09/27 19:46:32 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
I am trying to run the Linear Regression, KMeans, and PCA examples on a cluster of 2 nodes, each with 4 GPUs, but some of the executors always get stuck in the barrier when the cuML function is called. The stage progress stalls at 6+2/8, 4+4/8, and 5+3/8, i.e., 2, 4, and 3 executors are stuck in LinReg, KMeans, and PCA respectively. I also tried running a KMeans application that processes a large amount of data, so I do not think the problem is related to the small dataset.
I checked the logs for an executor that successfully ran the task and for an executor that got stuck. The executor that got stuck initialized cuML. These logs are from running the LinReg example in the Python directory of this repo. The stuck executors have `RUNNING | NODE_LOCAL` as the status, while the successful executors have `SUCCESS | PROCESS_LOCAL`.

I am using Spark RAPIDS ML branch-23.10 (daedfe56edae33c565af5e06179e992cf8fec93e and f651978), Spark 3.5.0 in standalone mode, and Hadoop 3.3.6 on a cluster of 2 nodes, each with 4 Titan-V GPUs.
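When comparing a successful and a stuck executor log, it can help to list the tasks that logged `Running task ...` but never logged `Finished task ...`. The following is only a minimal sketch, assuming the `Executor:` line format seen in the log above; the function name `unfinished_tasks` and the sample input are hypothetical and not part of Spark or Spark RAPIDS ML.

```python
import re

# Patterns assume the "Executor: Running/Finished task X in stage Y (TID N)"
# format that appears in the executor log above.
RUNNING = re.compile(r"Executor: Running task (\S+) in stage (\S+) \(TID (\d+)\)")
FINISHED = re.compile(r"Executor: Finished task (\S+) in stage (\S+) \(TID (\d+)\)")


def unfinished_tasks(log_lines):
    """Return {tid: (task, stage)} for tasks that started but never finished."""
    pending = {}
    for line in log_lines:
        started = RUNNING.search(line)
        if started:
            task, stage, tid = started.groups()
            pending[int(tid)] = (task, stage)
            continue
        finished = FINISHED.search(line)
        if finished:
            # Tasks that started before the excerpt begins are simply ignored.
            pending.pop(int(finished.group(3)), None)
    return pending


# Illustrative sample: the "Finished" line for TID 577 is deliberately
# omitted here to simulate a task that never completes.
sample = [
    "23/09/27 19:43:00 INFO Executor: Running task 55.0 in stage 4.0 (TID 440)",
    "23/09/27 19:43:00 INFO Executor: Finished task 55.0 in stage 4.0 (TID 440). 2047 bytes result sent to driver",
    "23/09/27 19:43:02 INFO Executor: Running task 0.0 in stage 6.0 (TID 577)",
]
print(unfinished_tasks(sample))
```

Running the same function over each executor's full log would show which TIDs were still pending when the driver commanded a shutdown.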
Successful Executor
Killed Executor
Here is the `spark.conf` containing the related options. I tried to disable the options related to UDFs (Scala UDF, UDF compiler, etc.), but it did not do much.
`spark.conf`