Document connection-layer redesign requirements #2079

@sanity

Description

Background

While stabilizing PR #2064 we instrumented the transport/event-loop boundary (peer_connection_listener, priority_select, etc.) and uncovered multiple structural issues. The current model spawns a short-lived peer_connection_listener future for every inbound packet and relies on the main event loop to immediately push that future back into the select-stream. Outbound messages to that peer are queued on the same per-connection tokio::mpsc channel that this listener is responsible for draining, so any delay in re-spawning the listener starves both inbound and outbound traffic.
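
To make that lifecycle concrete, here is a minimal sketch of the spawn-per-packet shape. It is a simplification for illustration only: aside from `peer_connection_listener`, `priority_select`, and the tokio primitives, the types and signatures below are stand-ins, not the real transport code.

```rust
use tokio::sync::mpsc;

struct PeerConnection; // stand-in for the real UDP-backed connection

impl PeerConnection {
    async fn recv(&mut self) -> Vec<u8> {
        Vec::new() // stand-in for a UDP read
    }
    async fn send(&mut self, _bytes: &[u8]) {
        // stand-in for a UDP write
    }
}

/// Handles a single event (one inbound packet or one queued outbound message),
/// then hands the connection back to the event loop, which must immediately
/// push a fresh listener future into the `priority_select` stream. Any delay
/// in that re-spawn means nobody reads the socket or drains the outbound queue.
async fn peer_connection_listener(
    mut conn: PeerConnection,
    mut outbound: mpsc::Receiver<Vec<u8>>,
) -> (PeerConnection, mpsc::Receiver<Vec<u8>>) {
    tokio::select! {
        _packet = conn.recv() => {
            // inbound path: hand the packet to routing, then stop until re-spawned
        }
        Some(bytes) = outbound.recv() => {
            conn.send(&bytes).await;
        }
    }
    (conn, outbound) // the event loop owns re-scheduling
}
```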

Problems Observed

  • Starvation of outbound acknowledgements. In repeated failures of test_three_node_network_connectivity we see the storage node log Sending outbound message… SuccessfulPut, yet the requester never logs the corresponding inbound packet. No errors or connection drops occur—the ack just sits in the channel because the listener was busy waiting on conn.recv() and never drained the queued outbound work. (See /tmp/connectivity_attempt_new_13.log and /tmp/freenet-test-test_failure-20251112-182337/events.md).
  • Listener lifecycle coupled to event-loop scheduling. Because the listener future returns after every packet, any hiccup in priority_select (e.g., task cancellation, delay while processing other sources) means the connection stops reading from the UDP socket entirely. This leads to the “random” PUT timeouts we chased for days.
  • Shared channel for data and control. peer_connection_listener multiplexes outbound NetMessages and control events (DropConnection, ClosedChannel) on the same channel it is supposed to drain promptly. When the listener blocks on conn.recv(), these control signals can also backlog, delaying disconnects and causing misleading diagnostics. (A minimal illustration of this shape follows the list.)
  • Difficult diagnostics/backpressure visibility. Without a persistent per-connection task we lack a stable place to collect metrics (queue depth, dropped packets, last-seen timestamps). The workaround is ad-hoc logging sprinkled in the event loop, which still can’t tell us if the socket send actually happened.
  • Tight coupling to higher-level routing. Transport has to call back into routing for every packet just to get re-polled, increasing the chance of circular dependencies (see Nacho’s warning about deadlocks in p2p_protoc).
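
To illustrate the shared data/control channel from the list above, here is a rough sketch of the combined per-peer message type, assuming for illustration that data and control travel as variants of one enum over a single tokio::mpsc channel; the names are stand-ins, not the actual API.

```rust
use tokio::sync::mpsc;

// Illustrative stand-in for the single per-peer channel: outbound data and
// control signals are variants of one message type. Because the channel is
// FIFO, a control signal enqueued behind pending outbound messages is only
// seen after those are drained, and not at all while the listener is parked
// on `conn.recv()`.
enum PeerTask {
    Outbound(Vec<u8>),   // serialized NetMessage destined for the peer
    DropConnection,      // control: tear the connection down
    ClosedChannel,       // control: the remote side went away
}

type PeerSender = mpsc::Sender<PeerTask>;
type PeerReceiver = mpsc::Receiver<PeerTask>;
```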

Desired Outcome

We need a design for a connection layer where each peer connection is driven by a persistent async task that continuously handles both outbound channel draining and inbound UDP reads, and emits well-defined events back to the rest of the node. The issue should catalog the requirements, failure modes, and observability gaps so we can evaluate redesign options (possibly replacing the spawn-per-packet model described above).
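
As a non-binding sketch of that outcome (illustrative names and signatures only, not a concrete proposal), a persistent per-connection task might select over the socket, the outbound queue, and a dedicated control channel, and report back through an event channel:

```rust
use tokio::sync::mpsc;

struct PeerConnection; // stand-in for the real UDP-backed connection

impl PeerConnection {
    async fn recv(&mut self) -> Vec<u8> {
        Vec::new() // stand-in for a UDP read
    }
    async fn send(&mut self, _bytes: &[u8]) {
        // stand-in for a UDP write
    }
}

enum Control {
    DropConnection, // ask the task to shut the connection down
}

enum ConnEvent {
    Inbound(Vec<u8>), // packet received from the peer
    Closed,           // the task is exiting; the connection is gone
}

/// Persistent task: owns the connection for its whole lifetime, keeps reading
/// the socket and draining the outbound queue concurrently, and emits events
/// instead of relying on the event loop to re-poll it per packet. A long-lived
/// task is also a stable place to hang per-connection metrics (queue depth,
/// dropped packets, last-seen timestamps).
async fn connection_task(
    mut conn: PeerConnection,
    mut outbound: mpsc::Receiver<Vec<u8>>,
    mut control: mpsc::Receiver<Control>,
    events: mpsc::Sender<ConnEvent>,
) {
    loop {
        tokio::select! {
            packet = conn.recv() => {
                if events.send(ConnEvent::Inbound(packet)).await.is_err() {
                    break; // the node side went away
                }
            }
            maybe = outbound.recv() => match maybe {
                Some(bytes) => conn.send(&bytes).await,
                None => break, // all senders dropped; nothing left to send
            },
            _ = control.recv() => break, // DropConnection, or control channel closed
        }
    }
    let _ = events.send(ConnEvent::Closed).await;
}
```

How the event and outbound channels are bounded, and what happens when they fill, is exactly the kind of backpressure question this issue should pin down.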

Related / Overlapping Work

Let’s use this issue to agree on the problem statement and success criteria; concrete design proposals can be follow-ups once we’re aligned on the gaps.

Metadata

Labels

    A-networking (Area: Networking, ring protocol, peer discovery)
    E-hard (Experience needed to fix/implement: Hard / a lot)
    S-needs-design (Status: Needs architectural design or RFC)
    T-enhancement (Type: Improvement to existing functionality)
