try to run interop handshakeloss test currently unsupported #953
Comments
OK, in an effort to record some of the work we've been doing here, here's an accounting of what we've tried to get working and the problems we've encountered. The work I currently have is here, and it is still very much a work in progress: Things that have been adjusted and fixed in this branch:
With the above changes in place, I am able to consistently pass the handshakeloss test between the openssl client and server. Any failures appear to be the result of the client hitting the advertised max_idle_timeout. An example can be seen in the trace files on udp.port 53805. Those failed connections are now retried and succeed on a subsequent attempt. Assuming that the interop maintainer supports relaxing this test, I think we have some fixes here that are worth converting to PRs. However, there are other issues that require investigation, given that openssl still fails when testing against a few other implementations (most notably lsquic and quic-go). Investigation is ongoing there.
I'm still trying to make sense of all this. The relevant protocol spec is RFC-9002.txt. I'm running my own trials on my notebook, and I'm seeing slightly different results from what's reported here. It looks much worse here: things start to fall apart when just 10% of packets get lost. I'm using this firewall rule on my notebook:
When those rules are enabled, I'm not able to get even a single HTTP-like request done.
The output above indicates that the server completes the handshake but fails to accept the stream from the client. This matches the callstack at the client:
(I did put a breakpoint to This is basically where I'm at now.
quic-10-loss.pcapng.gz
Sorry, after measuring the packet loss using ping, it looks like my results are in line with what we got from interop testing. The 10% probability in the rules makes the effective loss around 30%. A single packet visits the firewall twice, and each time it has a 10% probability of being dropped. I suck at math, but I would say there is an 80% chance it will reach its destination. The same applies to the response, so we get something like 0.9^4 ~= 65% chance that we will see a reply.
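To double-check the arithmetic in this comment, here is a small sketch. It assumes each pass through the firewall drops the packet independently with the configured probability, and that a ping request and its reply each cross the firewall twice. The function name `delivery_probability` is made up for illustration.

```python
def delivery_probability(drop_probability, firewall_passes):
    """Chance that a packet survives the given number of independent firewall passes."""
    return (1.0 - drop_probability) ** firewall_passes


# One direction: the packet crosses the firewall twice (assumption from the comment).
one_way = delivery_probability(0.10, 2)

# Request plus reply: four passes in total.
round_trip = delivery_probability(0.10, 4)

print(f"one-way delivery:    {one_way:.4f}")
print(f"round-trip delivery: {round_trip:.4f}")
print(f"effective ping loss: {1.0 - round_trip:.4f}")
```

Under these assumptions the one-way delivery chance is 0.81 and the round-trip chance is about 0.656, so ping sees roughly 34% loss, which is consistent with the "around 30%" measured above.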
The callstack in the comment is a red herring. It indicates a permanent error in the TX path on the client side. It is caused by a packet being dropped on the outbound path. Basically, if the firewall drops a local outbound packet, the error is signaled up to the socket, where it is handled by the application (the QUIC library). This is not related to the issue we are hunting down. I had to adjust the firewall rules to set a slightly better trap:
This way I've started to see more interesting failures.
It indeed looks like the QUIC connection handshake is sensitive to packet loss. Once the connection is established, the handling of streams feels reliable. To investigate further I need to rearrange the test environment a bit. Currently I let the server and client talk over loopback sockets. That makes my test simpler to run, but it produces less useful packet dumps. There is no easy way to identify lost packets because those packets.

What I've often observed is that one of the peers gets stuck in the ACCEPT operation while the other already operates on an established connection (it has an object which comes from a connect/new_stream operation). And then it gets stuck on a read from the accepted object. The read operation fails after the underlying connection dies on idle timeout (the read from the channel fails with network/port is in shutdown state).

I'm using blocking semantics, which might be more prone to this kind of communication stall: because the QUIC engine stops ticking, I suspect any retransmission timers stop firing. This is just a suspicion. I did try to work around that by making sure there will always be a tick every second. It helped a bit, but nothing exciting. I think it's also worth trying andrew's quiapitest here:
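As an aside on the once-per-second tick workaround mentioned in this comment, the idea can be sketched as capping how long the application blocks before it gives the QUIC engine another chance to run its timers. This is only an illustration under assumptions, not the change that was actually tried: `capped_wait`, `next_timer_deadline`, and `max_interval` are hypothetical names and do not correspond to any OpenSSL API.

```python
def capped_wait(next_timer_deadline, now, max_interval=1.0):
    """Return how many seconds to block before ticking the engine again.

    next_timer_deadline: absolute time of the engine's next timer, or None
                         if the engine reports no pending timer (hypothetical).
    now:                 current time on the same clock.
    max_interval:        upper bound on the wait, so a tick happens at least
                         this often even when no timer is reported.
    """
    if next_timer_deadline is None:
        # No timer known: still wake up after max_interval to tick the engine.
        return max_interval
    remaining = next_timer_deadline - now
    # Never wait a negative time, and never longer than max_interval.
    return max(0.0, min(remaining, max_interval))


# Example: a timer due in 0.25 s is honoured, a timer due in 10 s is capped,
# and an already expired timer means "tick immediately".
print(capped_wait(0.25, 0.0))
print(capped_wait(10.0, 0.0))
print(capped_wait(1.0, 5.0))
```

The cap trades a few spurious wake-ups for the guarantee that a stalled blocking call cannot keep retransmission timers from being serviced for longer than `max_interval`.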
Continued with #971
This is likely just an artifact of handshakeloss not being listed in the supported tests within the container's run_endpoint.sh script. Adding it there with the appropriate client/server executions should allow it to pass.