try to run interop handshakeloss test currently unsupported #953

Closed · Tracked by #944
nhorman opened this issue Dec 3, 2024 · 7 comments
Comments

@nhorman (Contributor) commented Dec 3, 2024

This is likely just an artifact of handshakeloss not being listed in the supported tests within the container's run_endpoint.sh script. Adding it there, with the appropriate client/server executions, should allow it to pass.
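As a minimal sketch of the kind of change meant here, assuming run_endpoint.sh dispatches on the TESTCASE environment variable as the interop runner convention prescribes (the exact structure of the OpenSSL script may differ):

case "$TESTCASE" in
handshake|transfer|multiconnect|handshakeloss)
    # recognized test cases fall through to the client/server invocations below
    ;;
*)
    # interop-runner convention: exit 127 for an unsupported test case
    exit 127
    ;;
esac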

@nhorman changed the title from "handshakeloss test currently unsupported" to "interop handshakeloss test currently unsupported" Dec 3, 2024
@nhorman removed this from Project Board Dec 3, 2024
@github-project-automation bot moved this to Pre-Refinement in Development Board Dec 3, 2024
@vavroch2010 added this to the 3.5.0 milestone Dec 9, 2024
@vavroch2010 changed the title from "interop handshakeloss test currently unsupported" to "try to run interop handshakeloss test currently unsupported" Dec 9, 2024
@vavroch2010 moved this from Pre-Refinement to Todo in Development Board Dec 9, 2024
@andrewkdinh self-assigned this Dec 16, 2024
@andrewkdinh moved this from Todo to In Progress in Development Board Dec 16, 2024
@nhorman (Contributor, Author) commented Jan 3, 2025

logs.zip

OK, in an effort to record some of the work we've been doing here, here's an accounting of what we've tried and the problems we've encountered. The work I have so far, which is still very much in progress, is here:
https://github.com/nhorman/openssl/tree/handshakeloss

Things that have been adjusted and fixed in this branch:

  1. Addition of the multiconnect test to the docker container. This is needed because the handshakeloss test runs the multiconnect test under the covers, and so it requires explicit recognition.
  2. Adjustment of the multiconnect test/client/server in the docker container to allow multithreading. Because of the large amount of traffic loss, other implementations send multiple requests in parallel to avoid running afoul of the 300-second test timeout (occasionally bad packet loss leads to head-of-line blocking on a long-lived connection). As our clients are currently single-threaded, servicing one request at a time, this is managed by spawning a server-side thread for each incoming connection to handle streams in parallel. On the client side, each request is forked as a separate client process, and the run_endpoint.sh script manages reaping of those processes (a sketch of the client side follows this list).
  3. Several memory errors have been addressed in the client. On connection failure, some memory was inadvertently double-freed, leading to crashes; those have been fixed.
  4. Limit ack range trimming to the processing of an ACK frame. Currently we trim our ack ranges and bump our acked watermark on every received packet number; however, according to RFC 9000, Section 13.2.4, we should only trim our ranges when we receive an ACK frame.
  5. Adjust the test to allow retries of failed connections. The test currently checks the number of handshakes produced and ensures it matches the number of generated requests, but that's a bit naive: at high packet loss it is reasonable to assume that in some cases a connection might fail and be retried. I've submitted a PR to the interop maintainer to relax this, and adjusted our test case in run_endpoint.sh to allow for retry on failure (again, see the sketch below this list). The interop-runner PR is here:
    Relax the handshake count check in handshakeloss test quic-interop/quic-interop-runner#415
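As a rough illustration of the client side of points 2 and 5, here is a minimal sketch; the helper and variable names (fetch_one_request, REQUESTS) are hypothetical, and the actual run_endpoint.sh logic differs in detail:

# fork one client process per request; each retries until its transfer succeeds
for req in $REQUESTS; do
    (
        # fetch_one_request: hypothetical wrapper around a single openssl client run
        until fetch_one_request "$req"; do
            echo "request $req failed, retrying" >&2
        done
    ) &
done
# reap every forked client before the test case is declared finished
wait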

With the above changes in place, I am able to consistently pass the handshakeloss test between the openssl client and server. Any failures appear to be the result of the client hitting the advertised max_idle_timeout; an example can be seen in the trace files on udp.port 53805. Those failed connections are now retried and succeed in a subsequent attempt.

Assuming the interop maintainer supports relaxing this test, I think we have some fixes here that are worth converting to PRs.

However, there are other issues that require investigation, given that openssl still fails when testing against a few other implementations (most notably lsquic and quic-go). Investigation there is ongoing.

@Sashan commented Jan 7, 2025

I'm still trying to make sense of all this; the relevant protocol spec is RFC 9002. I'm running my own trials on my notebook, and I'm seeing slightly different results than what's reported here: it looks much worse. Things start to fall apart when just 10% of packets get lost. I'm using these firewall rules on my notebook:

# inspect traffic on the loopback interface too
#set skip on lo

# port 4444 is used by the server; port 4445 is used by the client to
# receive connections back from the server (part of ssl_new_from_listener testing).
# the most important piece is "probability 10%": only 10% of packets
# match each rule.
block quick on lo0 proto udp from any to port 4444 probability 10%
block quick on lo0 proto udp from any to port 4445 probability 10%
# avoid creating state (no state) so every single UDP packet on loopback
# is forced to match the rules.
pass on lo0 proto udp no state
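For reference when reproducing this, the rules are loaded with pf's control utility; this is standard pf usage rather than part of the original setup notes, and "loss.conf" is a hypothetical file name:

pfctl -e            # enable pf (if not already enabled)
pfctl -f loss.conf  # load the rules above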

When those rules are enabled I'm not able to get a single HTTP-like request completed. I can see the server accept the connection, but the client fails. This is what I see on the terminal when the test runs in verbose mode:

# The results of this test will end up in test-runs/test_quic_new_from_listener
1..1
    # Subtest: ../../test/quic_new_from_listener_test
    1..1
    # INFO:  @ test/quic_new_from_listener.c:1060
    # ( Server ) Binding to port 4444
    #
    # INFO:  @ test/quic_new_from_listener.c:516
    # ( Server ) Waiting for connection
    #
    # INFO:  @ test/quic_new_from_listener.c:523
    # ( Server ) Accepted new connection
    #
    # INFO:  @ test/quic_new_from_listener.c:810
    # ( Client )  getting file_1024.txt
    #
    # INFO:  @ test/quic_new_from_listener.c:575
    # ( Client ) reading from stream ...
    #
    # INFO:  @ test/quic_new_from_listener.c:376
    # (Server) Request is GET /file_1024.txt
    #
    #
    # INFO:  @ test/quic_new_from_listener.c:289
    # ( Server ) Serving file_1024.txt
    #
    # ERROR: (ptr) 'ssl_qstream != NULL' failed @ test/quic_new_from_listener.c:536
    # 0x0
    # INFO:  @ test/quic_new_from_listener.c:516
    # ( Server ) Waiting for connection
    #

The output above indicates the server completes the handshake but fails to accept the stream from the client. This matches the call stack at the client:

(gdb) bt
#0  ossl_quic_port_raise_net_error (port=0x181863f86ff95a50, triggering_ch=0x96a6e194700) at ssl/quic/quic_port.c:1017
#1  0x000009679e6b4f8a in ch_tx (ch=0x96a6e194700, notify_other_threads=0x779917ea5108) at ssl/quic/quic_channel.c:2622
#2  0x000009679e6b43ef in ossl_quic_channel_subtick (ch=0x96a6e194700, res=0x779917ea5160, flags=0) at ssl/quic/quic_channel.c:1997
#3  0x000009679e6d1f97 in ossl_quic_port_subtick (port=0x96a6e1a7140, res=0x779917ea51b8, flags=0) at ssl/quic/quic_port.c:525
#4  0x000009679e6cddff in qeng_tick (res=0x779917ea5218, arg=0x969d7751600, flags=0) at ssl/quic/quic_engine.c:191
#5  0x000009679e6d3c79 in ossl_quic_reactor_tick (rtor=0x969d7751628, flags=0) at ssl/quic/quic_reactor.c:146
#6  0x000009679e6d3ec0 in ossl_quic_reactor_block_until_pred (rtor=0x969d7751628, pred=0x9679e6adcf0 <quic_read_again>, pred_arg=0x779917ea5328, flags=0)
    at ssl/quic/quic_reactor.c:477
#7  0x000009679e6a4abf in block_until_pred (ctx=0x779917ea5358, pred=0x9679e6adcf0 <quic_read_again>, pred_arg=0x779917ea5328, flags=0)
    at ssl/quic/quic_impl.c:537
#8  0x000009679e6a6e6e in quic_read (s=0x96a6e18c540, buf=0x779917ea5520, len=1024, bytes_read=0x779917ea5510, peek=0) at ssl/quic/quic_impl.c:3031
#9  0x000009679e6a6c71 in ossl_quic_read (s=0x96a6e18c540, buf=0x779917ea5520, len=1024, bytes_read=0x779917ea5510) at ssl/quic/quic_impl.c:3067
#10 0x000009679e65b58f in ssl_read_internal (s=0x96a6e18c540, buf=0x779917ea5520, num=1024, readbytes=0x779917ea5510) at ssl/ssl_lib.c:2280
#11 0x000009679e65bcbe in SSL_read_ex (s=0x96a6e18c540, buf=0x779917ea5520, num=1024, readbytes=0x779917ea5510) at ssl/ssl_lib.c:2349
#12 0x000009679e652046 in client_stream_transfer (ssl_qstream=0x96a6e18c540, expected=1024, filename=0x9679e530ad0 "file_1024.txt")
    at test/quic_new_from_listener.c:576
#13 0x000009679e651a91 in client_httplike_transfer (ssl_qstream=0x96a6e18c540, filename=0x9679e530ad0 "file_1024.txt") at test/quic_new_from_listener.c:649
#14 0x000009679e651564 in client_run (ssl_qconn=0x96a6e189900, ssl_qconn_listener=0x0) at test/quic_new_from_listener.c:812
#15 0x000009679e650aa7 in client_main (argc=4, argv=0x779917ea5f10) at test/quic_new_from_listener.c:989
#16 0x000009679e65034d in server_main (argc=4, argv=0x779917ea5f10) at test/quic_new_from_listener.c:1050
#17 0x000009679e6501c3 in run_client_server () at test/quic_new_from_listener.c:1093
#18 0x000009679e7523a9 in run_tests (test_prog_name=0x779917ea72f8 "../../test/quic_new_from_listener_test") at test/testutil/driver.c:342
#19 0x000009679e753ce1 in main (argc=4, argv=0x779917ea7028) at test/testutil/main.c:31

(I had put a breakpoint on the ossl_quic_port_raise_net_error function.)

This is basically where I'm at now.

@Sashan commented Jan 7, 2025

quic-10-loss.pcapng.gz
Attaching a packet dump; not much help at the moment.

@Sashan commented Jan 7, 2025

Sorry, after measuring the packet loss using ping, it looks like my results are in line with what we got from interop testing: the "probability 10%" on the rules makes the effective loss around 30%. A single packet visits the firewall twice, and each time has a 10% probability of getting dropped, so there is roughly an 80% chance it reaches its destination; the same applies to the response, so we get something like 0.9^4 ≈ 65% chance of seeing a reply.
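Spelling the arithmetic out (under the "block quick" rules above, each packet crosses the firewall twice, so it survives both passes with probability 0.9 × 0.9):

P(one packet delivered)        = 0.9 × 0.9   = 0.81
P(request and reply delivered) = 0.81 × 0.81 = 0.9^4 ≈ 0.656

so roughly a third of request/response exchanges lose at least one packet, consistent with ~30% effective loss.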

@Sashan commented Jan 7, 2025

The call stack in the comment above is a red herring: it indicates a permanent error on the TX path on the client side, caused by a packet being dropped on the outbound path. Basically, if the firewall drops a locally generated outbound packet, the error is signaled up to the socket, where it is handled by the application (the QUIC library). This is not related to the issue we are hunting down. I had to adjust the firewall rules to set a slightly better trap:

# drop only inbound packets this time, so locally generated outbound packets
# are never rejected at the socket and the client no longer sees a hard TX error
block in quick on lo0 proto udp from any to port 4444 probability 10%
block in quick on lo0 proto udp from any to port 4445 probability 10%
# pass ICMP (used by ping to measure loss) and UDP without state, so every
# packet is evaluated against the rules
pass on lo0 proto icmp no state
pass on lo0 proto udp no state

This way I've started to see more interesting failures.

@Sashan commented Jan 9, 2025

It indeed looks like the QUIC connection handshake is sensitive to packet loss; once a connection is established, the handling of streams feels reliable. To investigate further I need to rearrange the test environment a bit. Currently I let the server and client talk over loopback sockets, which makes the test simpler to run but produces less useful packet dumps: there is no easy way to identify lost packets, because those packets still show up in the capture.

What I've observed, quite often, is that one of the peers gets stuck in an accept operation while the other is already operating on an established connection (it holds an object obtained from a connect/new_stream operation), and then gets stuck reading from the accepted object. The read operation fails after the underlying connection dies on idle timeout (the read from the channel fails because the network/port is in the shutdown state).

I'm using blocking semantics, which might be more prone to this kind of communication stall: because the QUIC engine stops ticking, I suspect any retransmission timers stop firing. This is just a suspicion. I did try to work around it by making sure there is always a tick every second; it helped a bit, but nothing exciting.

I think it's also worth trying Andrew's quicapitest here:
https://github.com/andrewkdinh/openssl/tree/dev-drop-packets
The test allows us to select which packets should be dropped (e.g. the first, second, fourth, etc.).

@andrewkdinh commented

Continued with #971

@github-project-automation bot moved this from New to Done in Project Board Jan 13, 2025
@github-project-automation bot moved this from Blocked to Done in Development Board Jan 13, 2025