Since version 0.7.18, when I run SecretFlow in cluster mode across separate physical machines, I get the following error:
2023-01-18 12:05:03,270 INFO worker.py:1352 -- Connecting to existing Ray cluster at address: 10.146.0.2:20000...
2023-01-18 12:05:03,277 INFO worker.py:1538 -- Connected to Ray cluster.
Traceback (most recent call last):
File "myscript.py", line 73, in <module>
spu_cluster.psi_join_csv('username', input_path, output_path, 'alice', 'alice')
File "<path-to-anaconda>/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/device/spu.py", line 1311, in psi_join_csv
return dispatch(
File "<path-to-anaconda>/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/device/register.py", line 111, in dispatch
return _registrar.dispatch(self.device_type, name, self, *args, **kwargs)
File "<path-to-anaconda>/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/device/register.py", line 80, in dispatch
return self._ops[device_type][name](*args, **kwargs)
File "<path-to-anaconda>/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/kernels/spu.py", line 408, in psi_join_csv
return sfd.get(res)
File "<path-to-anaconda>/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/distributed/primitive.py", line 49, in get
return ray.get(object_refs)
File "<path-to-anaconda>/anaconda3/envs/secretflow/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "<path-to-anaconda>/anaconda3/envs/secretflow/lib/python3.8/site-packages/ray/_private/worker.py", line 2311, in get
raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::SPURuntime.__init__() (pid=249868, ip=10.146.0.4, repr=<secretflow.device.device.spu.SPURuntime object at 0x7ff825417460>)
File "<path-to-anaconda>/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/device/spu.py", line 278, in __init__
self.link = spu_link.create_brpc(desc, rank)
RuntimeError: what:
[external/yacl/yacl/link/transport/channel_brpc.cc:126] brpc server failed start
stacktrace:
#0 yacl::link::FactoryBrpc::CreateContext()+0x7ff8366700c1
#1 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()+0x7ff8354b9510
#2 pybind11::cpp_function::dispatcher()+0x7ff8354a3c5b
#3 PyCFunction_Call+0x4dfd82
(SPURuntime pid=249868, ip=10.146.0.4) 2023-01-18 12:05:04,980 ERROR worker.py:763 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::SPURuntime.__init__() (pid=249868, ip=10.146.0.4, repr=<secretflow.device.device.spu.SPURuntime object at 0x7ff825417460>)
(SPURuntime pid=249868, ip=10.146.0.4) File "<path-to-anaconda>/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/device/spu.py", line 278, in __init__
(SPURuntime pid=249868, ip=10.146.0.4) self.link = spu_link.create_brpc(desc, rank)
(SPURuntime pid=249868, ip=10.146.0.4) RuntimeError: what:
(SPURuntime pid=249868, ip=10.146.0.4) [external/yacl/yacl/link/transport/channel_brpc.cc:126] brpc server failed start
(SPURuntime pid=249868, ip=10.146.0.4) stacktrace:
(SPURuntime pid=249868, ip=10.146.0.4) #0 yacl::link::FactoryBrpc::CreateContext()+0x7ff8366700c1
(SPURuntime pid=249868, ip=10.146.0.4) #1 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()+0x7ff8354b9510
(SPURuntime pid=249868, ip=10.146.0.4) #2 pybind11::cpp_function::dispatcher()+0x7ff8354a3c5b
(SPURuntime pid=249868, ip=10.146.0.4) #3 PyCFunction_Call+0x4dfd82
(SPURuntime pid=249868, ip=10.146.0.4) 2023-01-18 12:05:04.968 [error] [server.cpp:BRPC:969] Fail to listen 10.146.0.6:9261
(SPURuntime pid=239472) 2023-01-18 12:05:05,028 ERROR worker.py:763 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::SPURuntime.__init__() (pid=239472, ip=10.146.0.2, repr=<secretflow.device.device.spu.SPURuntime object at 0x7fb3041b1b80>)
(SPURuntime pid=239472) File "<path-to-anaconda>/anaconda3/envs/secretflow/lib/python3.8/site-packages/secretflow/device/device/spu.py", line 278, in __init__
(SPURuntime pid=239472) self.link = spu_link.create_brpc(desc, rank)
(SPURuntime pid=239472) RuntimeError: what:
(SPURuntime pid=239472) [external/yacl/yacl/link/transport/channel_brpc.cc:126] brpc server failed start
(SPURuntime pid=239472) stacktrace:
(SPURuntime pid=239472) #0 yacl::link::FactoryBrpc::CreateContext()+0x7fb3153a30c1
(SPURuntime pid=239472) #1 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()+0x7fb3141ec510
(SPURuntime pid=239472) #2 pybind11::cpp_function::dispatcher()+0x7fb3141d6c5b
(SPURuntime pid=239472) #3 PyCFunction_Call+0x4dfd82
(SPURuntime pid=239472) 2023-01-18 12:05:05.015 [error] [server.cpp:BRPC:969] Fail to listen 10.146.0.4:9100
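The last log line from each actor shows brpc failing to listen on a specific address. The actor reporting ip=10.146.0.4 tries to listen on 10.146.0.6:9261, which differs from its own reported IP, and the actor on 10.146.0.2 tries to listen on 10.146.0.4:9100. A minimal sketch for checking whether the current machine can bind a given address, using only the Python standard library (the helper name `can_bind` is hypothetical and is not part of SecretFlow, Ray, or brpc):

```python
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if this machine can bind a TCP socket to host:port.

    Binding fails if the IP is not assigned to a local interface
    or if the port is already in use by another socket.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True
```

Running `can_bind("10.146.0.6", 9261)` on the machine that logged the failure would show whether the address itself is the problem. This only approximates what the brpc server does at startup, so treat a `True` result as a hint rather than proof.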
Replies: 1 comment
I had misunderstood RayFed and the new version. Now that I understand, this discussion can be deleted. The cause was that I was running in standalone mode by specifying parties for sf.init.
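For anyone hitting the same error, here is a rough sketch of the per-party configuration that, as far as I understand the SecretFlow deployment documentation, replaces the `parties` argument when each party runs on its own machine. The `cluster_config` layout (`parties`, `address`, `self_party`) is my reading of those docs and should be checked against your installed version. The helper `build_cluster_config`, the party name `bob`, and alice's port are hypothetical.

```python
def build_cluster_config(addresses: dict, self_party: str) -> dict:
    """Build a per-party config dict (layout assumed from the SecretFlow docs).

    addresses maps each party name to the "ip:port" other parties use to reach it.
    self_party names the party running on the current machine.
    """
    if self_party not in addresses:
        raise ValueError(f"unknown party: {self_party!r}")
    return {
        "parties": {name: {"address": addr} for name, addr in addresses.items()},
        "self_party": self_party,
    }


# IPs are taken from the logs above; alice's port and the name "bob" are placeholders.
config = build_cluster_config(
    {"alice": "10.146.0.2:9100", "bob": "10.146.0.4:9100"},
    self_party="alice",
)

# Assumed call shape, to be run separately on each party's machine:
# import secretflow as sf
# sf.init(address="<local Ray head ip:port>", cluster_config=config)
```

Each party would build the same `parties` mapping but pass its own name as `self_party`, and each address must be one that the party's own machine can actually listen on.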