- 
                Notifications
    You must be signed in to change notification settings 
- Fork 683
Description
Description
This is not seen in CSIT under usual conditions.
But it somewhat frequently happens with a new (unmerged) CSIT code [0] tested on 2n-icx testbed. (Daily job [2] usually passes there without any failures.)
That code should not do anything specific for AVF, but it switches Python API client to async mode, so messages may be arriving (over socket) somewhat differently (not sure how differently yet).
Either way, VPP should not occasionally crash in avf_process_request [1] just because something is different on API level.
[0] https://gerrit.fd.io/r/c/csit/+/36774
[1] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-icx/35/log.html.gz#s1-s1-s1-s1-s1-t1-k3-k4-k1
[2] https://jenkins.fd.io/view/csit/job/csit-vpp-perf-mrr-daily-master-2n-icx/
Assignee
Unassigned
Reporter
Vratko Polak
Comments
- vrpolak (Mon, 2 Sep 2024 11:38:33 +0000): VPP-2121 was opened later, but contains better insights, so closing this one.
- vrpolak (Fri, 9 Sep 2022 14:17:41 +0000): > One way to fix this would be to use vlib_process_get_events_with_type.
Contributed here [9].
[9] https://gerrit.fd.io/r/c/vpp/+/37083
- vrpolak (Fri, 9 Sep 2022 08:35:58 +0000): After reading about processes and events in the docs [7], I think I understand what is going on.
it seems like the vlib_process_wait_for_event_or_clock returns prematurely,
The return is not premature, because there is an event. But it is not the expected confirmation from the AVF process (event type 0), it is [8] a new SOCKET_READ_EVENT (event type 2). The current code thinks it is a timeout, because it sees the event type is non-zero, but does not check if it is really a timeout (event type ~0).
One way to fix this would be to use vlib_process_get_events_with_type.
[8] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/vpp-csit-verify-device-master-ubuntu2204-x86_64-1n-skx/133/log.html.gz#s1-s1-s1-s1-s1-t1-k3-k3
- ayourtch (Thu, 1 Sep 2022 15:06:50 +0000): I did some debugging, this is preliminary notes:
Part of the backtrace:
#9 0x00007ffff6d73887 in _clib_error (how_to_die=2, function_name=0x7fff769edb0c "avf_process_request", line_number=1285, fmt=0x7fff769edb20 "avf process node failed to reply in 5 seconds") at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/vppinfra/error.c:143
#10 0x00007fff769cc9a1 in avf_process_request (vm=0x7fff76a00740, req=0x7fff766efa58) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/plugins/avf/device.c:1285
#11 0x00007fff769cc614 in avf_flag_change (vnm=0x7ffff7f4c368 <vnet_main>, hw=0x7fff7e508e40, flags=0) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/plugins/avf/device.c:1318
#12 0x00007ffff70d9f8f in ethernet_set_flags (vnm=0x7ffff7f4c368 <vnet_main>, hw_if_index=1, flags=0) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/vnet/ethernet/interface.c:484
#13 0x00007fff769cb9bf in avf_create_if (vm=0x7fff76a00740, args=0x7fff766efcb0) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/plugins/avf/device.c:1733
#14 0x00007fff769e047a in vl_api_avf_create_t_handler (mp=0x7fff7e52c978) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/plugins/avf/avf_api.c:52
#15 0x00007ffff7fa46c7 in msg_handler_internal (am=0x7ffff7fbbd78 <api_global_main>, the_msg=0x7fff7e52c978, msg_len=24, trace_it=0, do_it=1, tfree_it=0) at /w/workspace/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/src/vlibapi/api_shared.c:576
The avf_create handler calls avf_flag_change, which calls avf_process_request; which posts the request to avf_process and suspends the thread for 5 seconds.
the avf_process picks up the request and then calls vf_process_handle_request which calls avf_op_config_promisc_mode and then signals the event by using vlib_process_signal_event. however, avf_op_config_promisc_mode may in turn call vlib_process_suspend in a cycle with doubling timeout, while awaiting for the reply from hardware.
experimenting, it seems like the vlib_process_wait_for_event_or_clock returns prematurely, and thus the panic happens - because the event has not been posted yet...
- vrpolak (Thu, 11 Aug 2022 12:30:18 +0000): > messages may be arriving (over socket) somewhat differently (not sure how differently yet).
I think I identified what is the difference. VPP PAPI sync code is smart, it sends control ping only when the number of replies is not known (for dump/detail pair). Otherwise (request/reply) it just waits for the response without sending control ping. The new CSIT code was always sending control ping (or a subsequent request without waiting for the first reply). When I changed CSIT code to be also (almost) as smart, failures stopped.
Another way to stop failures was to hack the socket so it sends both a request and subsequent control ping as a single binary chunk.
So I think the crash happens when VPP socket API receives another message while processing previous message that relies on avf_flag_change. Possible fix attempts include:
- 
Make sure AVF related processing does not get confused by signals caused by API socket. 
- 
Set API socket to stop sending signals while any AVF (or any API in general) call is running. 
- vrpolak (Thu, 11 Aug 2022 09:43:22 +0000): Here [6] is a backtrace from debug build.
[6] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/7/csit_current/0/log.html.gz#s1-s1-s1-s1-s1-t3-k3-k4-k1
- vrpolak (Wed, 10 Aug 2022 16:52:46 +0000): > VPP stopped responding without crashing.
Looking more closely, I see it does crash [4], but still no idea why it does so without producing core file.
other failures with new CSIT code
I noticed (when there IS a core file) the other failures also involve avf_flag_change [5].
[5] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/vpp-csit-verify-perf-master-ubuntu2004-x86_64-2n-icx/6/csit_current/0/log.html.gz#s1-s1-s1-s1-s2-t2-k3-k4-k1
- vrpolak (Wed, 10 Aug 2022 14:41:20 +0000): There are other failures with new CSIT code which do not happen with old code, for example here [3] VPP stopped responding without crashing. 
I see the word "signal" in the avf_process_request code, so perhaps when something goes wrong with API socket, it creates a signal, and that can be picked up by AVF code?
[3] https://s3-logs.fd.io/vex-yul-rot-jenkins-1/csit-vpp-perf-verify-master-2n-icx/35/log.html.gz#s1-s1-s1-s1-s3-t1-k2-k9-k1-k1-k1-k1-k1-k6-k3
Original issue: https://jira.fd.io/browse/VPP-2033