[VPP-1979] VCL app with worker is hang on exit #3441
Comments
Fixed by https://gerrit.fd.io/r/c/vpp/+/31922
Awesome! Thanks for the report!
Hi Florin, working fine, thanks!
Could you try this out https://gerrit.fd.io/r/c/vpp/+/31922 ? The simple vcl app above seems to be working fine for me now.
Hi Vadim, that's unfortunately another type of issue, which has to do with the fact that the app is multi-threaded and not all workers have been stopped. Consequently, vcl asks vpp to clean up the binary api registration for the worker, but doesn't clean it up locally, as it assumes it will soon exit (not the case in your app). Without going too much into the weeds, that's an optimization mainly aimed at supporting sshd + ldp (it has to do with the seccomp sandbox). The good news is that this should not be present if the app socket api is in use (as opposed to the binary api). Still, I'd like to also support your use case, so let me see if I can come up with a solution.
Hi, Florin. With your fix vppcom_worker_unregister works fine, but the original issue is not fixed: the app still hangs on exit. I believe the problem is that Thread 4 (rx_thread_fn) is not joined in vppcom_app_destroy, and vppcom_app_destroy waits for this thread indefinitely.
Starting program: /home/vadim/hybrid-ngfw/sources/pdk/network_interceptor/test_vcl_server/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
VCL<16492>: allocated VCL heap = 0x7fffe7869000, size 268435456 (0x10000000)
VCL<16492>: configured rx_fifo_size 1048576 (0x100000)
VCL<16492>: configured tx_fifo_size 1048576 (0x100000)
VCL<16492>: configured api-socket-name (/run/vpp/api.sock)
VCL<16492>: configured with mq with eventfd
VCL<16492>: configured event_queue_size 524288 (0x80000)
VCL<16492>: completed parsing vppcom config!
[New Thread 0x7fffe7868700 (LWP 16496)]
vcl_bapi_connect_to_vpp:504: vcl<16492:0>: app (test_vcl_server-wrk-0) is connected to VPP!
vcl_bapi_init:607: vcl<16492:0>: sending session enable
vppcom_app_create:1267: vcl<16492:0>: app_name 'test_vcl_server', my_client_index 256 (0x100)
[New Thread 0x7fffe7067700 (LWP 16497)]
[New Thread 0x7fffe6866700 (LWP 16498)]
vcl_bapi_connect_to_vpp:504: vcl<16492:1>: app (test_vcl_server-wrk-1) is connected to VPP!
vl_api_app_worker_add_del_reply_t_handler:227: vcl<0:-1>: worker 1 vpp-worker 1 added
vcl_worker_register_with_vpp:224: vcl<16492:1>: added worker 1
[Thread 0x7fffe7067700 (LWP 16497) exited]
That's definitely an issue! Could you check with the same code on top of this: https://gerrit.fd.io/r/c/vpp/+/31915? Now, regarding the purpose of the cb: it's there because we wanted to proactively clean up worker state in vpp, in case apps forget to clean up their workers or the workers are registered through ldp (so the apps are not aware of it). In particular, the binary api takes up to 30s until it cleans up a client, and that could lead to issues if, for instance, such a worker listens on some ports that need to be reused. Going forward, if apps attach using the explicit socket api (session { use-app-socket-api } in startup.conf), vpp should be able to detect app worker cleanups the moment the attachment socket is closed. So we might be able to get rid of that cb at some point but, for now, the extra checks in the patch above should be enough.
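For reference, the socket-api attachment mentioned above is enabled on the VPP side in startup.conf. The vcl.conf key and socket path below are assumptions based on a typical setup and may differ on your system:

```
# startup.conf (vpp side) -- enable the app socket api
session { use-app-socket-api }

# vcl.conf (app side) -- attach over the app socket instead of the binary api
# (the path is an assumption; match it to your vpp configuration)
app-socket-api /run/vpp/app_ns_sockets/default
```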
Hi Florin, with vppcom_worker_unregister() at the end of func() there is an assert, caused by the vcl_worker_cleanup_cb() callback at thread exit. Incidentally, I think it would be better to remove this callback and let vcl users call vppcom_worker_unregister() explicitly.
Starting program: /home/vadim/test_vcl_server/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
VCL<5712>: allocated VCL heap = 0x7fffe786a000, size 268435456 (0x10000000)
VCL<5712>: configured rx_fifo_size 1048576 (0x100000)
VCL<5712>: configured tx_fifo_size 1048576 (0x100000)
VCL<5712>: configured api-socket-name (/run/vpp/api.sock)
VCL<5712>: configured with mq with eventfd
VCL<5712>: configured event_queue_size 524288 (0x80000)
VCL<5712>: completed parsing vppcom config!
[New Thread 0x7fffe7869700 (LWP 5716)]
vcl_bapi_connect_to_vpp:504: vcl<5712:0>: app (test_vcl_server-wrk-0) is connected to VPP!
vcl_bapi_init:607: vcl<5712:0>: sending session enable
vppcom_app_create:1267: vcl<5712:0>: app_name 'test_vcl_server', my_client_index 0 (0x0)
[New Thread 0x7fffe7068700 (LWP 5717)]
[New Thread 0x7fffe6867700 (LWP 5718)]
vcl_bapi_connect_to_vpp:504: vcl<5712:1>: app (test_vcl_server-wrk-1) is connected to VPP!
vl_api_app_worker_add_del_reply_t_handler:227: vcl<0:-1>: worker 1 vpp-worker 1 added
vcl_worker_register_with_vpp:218: vcl<5712:1>: added worker 1
/home/vadim/vpp/src/vcl/vcl_private.h:650 (vcl_worker_get) assertion `! pool_is_free (vcm->workers, _e)' fails
Hi Vadim! Could you try with vppcom_worker_unregister() at the end of func()?
Description
Compile the following VCL app:
g++ -I /home/vadim/vpp/build-root/install-vpp_debug-native/vpp/include main.cpp -L/home/vadim/vpp/build-root/install-vpp_debug-native/vpp/lib -lpthread -lvppcom
vadim@vadim-lenovo:~/hybrid-ngfw/sources/pdk/network_interceptor/test_vcl_server$ sudo LD_LIBRARY_PATH=/home/vadim/vpp/build-root/install-vpp_debug-native/vpp/lib VCL_CONFIG=/home/vadim/config/vcl.conf gdb a.out
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from a.out...
(No debugging symbols found in a.out)
(gdb) r
Starting program: /home/vadim/hybrid-ngfw/sources/pdk/network_interceptor/test_vcl_server/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
VCL<91647>: allocated VCL heap = 0x7fffe786a000, size 268435456 (0x10000000)
VCL<91647>: configured rx_fifo_size 1048576 (0x100000)
VCL<91647>: configured tx_fifo_size 1048576 (0x100000)
VCL<91647>: configured api-socket-name (/run/vpp/api.sock)
VCL<91647>: configured with mq with eventfd
VCL<91647>: configured event_queue_size 524288 (0x80000)
VCL<91647>: completed parsing vppcom config!
[New Thread 0x7fffe7869700 (LWP 91651)]
vcl_bapi_connect_to_vpp:504: vcl<91647:0>: app (test_vcl_server-wrk-0) is connected to VPP!
vcl_bapi_init:607: vcl<91647:0>: sending session enable
vppcom_app_create:1267: vcl<91647:0>: app_name 'test_vcl_server', my_client_index 512 (0x200)
[New Thread 0x7fffe7068700 (LWP 91652)]
[New Thread 0x7fffe6867700 (LWP 91653)]
vcl_bapi_connect_to_vpp:504: vcl<91647:1>: app (test_vcl_server-wrk-1) is connected to VPP!
vl_api_app_worker_add_del_reply_t_handler:227: vcl<0:-1>: worker 1 vpp-worker 1 added
vcl_worker_register_with_vpp:218: vcl<91647:1>: added worker 1
vcl_worker_cleanup_cb:145: vcl<0:-1>: cleaned up worker 1
[Thread 0x7fffe7068700 (LWP 91652) exited]
[Thread 0x7fffe7869700 (LWP 91651) exited]
^C
Thread 1 "a.out" received signal SIGINT, Interrupt.
__pthread_clockjoin_ex (threadid=140737060960000, thread_return=0x7fffffffe150, clockid=, abstime=, block=) at pthread_join_common.c:145
145 pthread_join_common.c: No such file or directory.
(gdb) thread apply all bt
Thread 4 (Thread 0x7fffe6867700 (LWP 91653)):
#0 futex_wait_cancelable (private=, expected=0, futex_word=0x7f814202c950) at ../sysdeps/nptl/futex-internal.h:183
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f814202c900, cond=0x7f814202c928) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=0x7f814202c928, mutex=0x7f814202c900) at pthread_cond_wait.c:638
#3 0x00007ffff7c5928d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:82
#4 0x4000000000000000 in ?? ()
#5 0x0000000000000000 in ?? ()
Thread 1 (Thread 0x7ffff786a780 (LWP 91647)):
#0 __pthread_clockjoin_ex (threadid=140737060960000, thread_return=0x7fffffffe150, clockid=, abstime=, block=) at pthread_join_common.c:145
#1 0x00007ffff7b197c1 in disconnect_from_vlib_internal (do_unmap=0 '\000') at /home/vadim/vpp/src/vlibmemory/memory_client.c:518
#2 0x00007ffff7b19827 in vl_client_disconnect_from_vlib_no_unmap () at /home/vadim/vpp/src/vlibmemory/memory_client.c:538
#3 0x00007ffff7b20aaa in vl_socket_client_disconnect2 (scm=0x7fffe786ba18) at /home/vadim/vpp/src/vlibmemory/socket_client.c:237
#4 0x00007ffff7f50569 in vcl_bapi_disconnect_from_vpp () at /home/vadim/vpp/src/vcl/vcl_bapi.c:519
#5 0x00007ffff7f39bb8 in vcl_api_detach (wrk=0x7fffe786b8c0) at /home/vadim/vpp/src/vcl/vppcom.c:1228
#6 0x00007ffff7f3a1d4 in vppcom_app_destroy () at /home/vadim/vpp/src/vcl/vppcom.c:1293
#7 0x000055555555534b in main ()
(gdb)
Actual
The app hangs on exit.
Expected
No hang on exit.
Assignee
Florin Coras
Reporter
Vadim Platonov
Original issue: https://jira.fd.io/browse/VPP-1979