Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart retransmit timer on new data ACK #1018

Merged
merged 1 commit into from
Dec 27, 2024

Conversation

simonborje
Copy link
Contributor

This is an attempt to address the issue described in #949, inspired by the patch and discussion in #953.

I don't want to step on anybody's toes - if you want to continue working on the original PR I don't mind 🙂.

Basically, what I've tried to do is to implement rfc6298 5.3 based on my understanding of that and how the code works. I added a test for the retransmission timer restart but I'm not sure if it is enough.

For testing I simulated a high latency link with 1000ms RTT and wrote a simple smoltcp client that sends a single byte every 100ms with Nagle disabled so that there is always outstanding data. Without the patch retransmissions keep happening, while the patched version only has a few initial retransmissions before they stop occurring. I don't have access to a real setup like the one in the original issue so I can't say for sure how much it helps in such situations.

Copy link

codecov bot commented Dec 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.83%. Comparing base (2ecd77f) to head (13774ff).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1018      +/-   ##
==========================================
+ Coverage   80.81%   80.83%   +0.02%     
==========================================
  Files          81       81              
  Lines       28385    28419      +34     
==========================================
+ Hits        22939    22973      +34     
  Misses       5446     5446              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@HKalbasi
Copy link
Contributor

If you want to test it you can use mininet. It can simulate networks with various host and switches, connected with links with custom delay and bandwidth. I can share my mininet config if you are interested.

@simonborje
Copy link
Contributor Author

If you want to test it you can use mininet. It can simulate networks with various host and switches, connected with links with custom delay and bandwidth. I can share my mininet config if you are interested.

That would be really helpful, thank you :)

@tomDev5
Copy link
Contributor

tomDev5 commented Dec 23, 2024

As @Dirbaio mentioned in the matrix channel, the removal of delay: delay * 2, looks a bit suspicious.
From what I gather, mostly from the RFC-based comments in the code, this part (line 2527) should double the delay:

        if !self.seq_to_transmit(cx) && repr.segment_len() > 0 && !self.timer.is_retransmit() {
            // RFC 6298: (5.1) If we've transmitted all data we could (and there was
            // something at all, data or flag, to transmit, not just an ACK), start the
            // retransmit timer if it is not already running.
            self.timer
                .set_for_retransmit(cx.now(), self.rtte.retransmission_timeout());
        }

While this part (line 1838) should not, as it specifically calls for restarting the timer on retransmission:

            // RFC 6298: (5.2) ACK of all outstanding data turn off the retransmit timer.
            // (5.3) ACK of new data in ESTABLISHED state restart the retransmit timer.
            (State::Established, TcpControl::None) => {
                if ack_all {
                    self.timer.set_for_idle(cx.now(), self.keep_alive);
                } else if ack_len > 0 {
                    self.timer
                        .set_for_retransmit(cx.now(), self.rtte.retransmission_timeout());
                }
            }

If I am correct, it should be an easy fix (maybe an reset_retransmission: bool parameter)
I don't have a setup to test that this still fixes the issue, @simonborje / @HKalbasi if you could post here / create a repository of a test that would be great

@simonborje
Copy link
Contributor Author

I was a bit unsure about the delay: delay * 2 part and initially thought it might be responsible for something like exponential backoff, however when testing it I couldn't see that it actually did anything useful.

I see that there is now a new PR #1020 for fixing exponential backoff that modifies the same parts as this PR, so perhaps I should wait for that to land and then rebase this PR.

@HKalbasi
Copy link
Contributor

Here is my mininet config:

import subprocess

from mininet.cli import CLI
from mininet.link import TCLink
from mininet.net import Mininet
from mininet.node import OVSController
from mininet.topo import Topo

LEFT_HOST_NAME: str = 'left_h'
RIGHT_HOST_NAME: str = 'right_h'

class HostInTheMiddleTopo(Topo):

    def build(self):

        left_host = self.addHost(LEFT_HOST_NAME)
        right_host = self.addHost(RIGHT_HOST_NAME)

        self.addLink(left_host, right_host, bw=100, delay="1ms")


def get_interface(host: str, index: int):
    return f"{host}-eth{index}"


def disable_offload(host, offload: str, index: int):
    host.cmd(f"ethtool -K {get_interface(host.name, index)} {offload} off")

def run():
    net = Mininet(topo=HostInTheMiddleTopo(),
                  controller=OVSController, link=TCLink)

    left_host = net[LEFT_HOST_NAME]
    right_host = net[RIGHT_HOST_NAME]

    for host in [left_host, right_host, middle_host]:
        for name in ["tx", "rx"]:
            disable_offload(host, name, 0)

    left_host.cmd("tcpdump -w left.pcap &")
    right_host.cmd("tcpdump -w right.pcap &")

    # left_host.cmd(f"nc -l 0.0.0.0 8000 > /dev/null &")
    # left_host.cmd(f"yes | nc -l 0.0.0.0 8000 &")
    # right_host.cmd("ifstat -i right_h-eth0 > speedd.txt &")
    # left_host.cmd(f"sudo RUST_LOG=trace ./target/release/smol-speed > out.txt 2> err.txt &")

    net.start()
    CLI(net)
    net.stop()

if __name__ == "__main__":
    run()

You can run it by python3 net.py. It will give you a cli that have access to two machines connected with a link with bandwidth 100 Mbit/s and delay 1ms. You can run bash commands on each machine, for example left_h ip a or right_h ./run_my_program_in_background &. To test smoltcp with it, you should use a smoltcp::phy::RawSocket("left_h-eth0, Medium::Ethernet) and you can connect to it from right host using netcat, both from the python code and in the cli. You can also add more hosts and switches to make more complex network topologies.

@Dirbaio Dirbaio force-pushed the retransmit_timer_patch branch from 546aba9 to 13774ff Compare December 27, 2024 17:56
@Dirbaio Dirbaio enabled auto-merge December 27, 2024 17:56
@Dirbaio Dirbaio added this pull request to the merge queue Dec 27, 2024
Merged via the queue into smoltcp-rs:main with commit 37a6e1c Dec 27, 2024
12 checks passed
@simonborje simonborje deleted the retransmit_timer_patch branch December 27, 2024 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants