Background
While stabilizing PR #2064 we instrumented the transport/event-loop boundary (peer_connection_listener, priority_select, etc.) and uncovered multiple structural issues. The current model spawns a short-lived peer_connection_listener future for every inbound packet and relies on the main event loop to immediately push that future back into the select-stream. Outbound messages to that peer share the same tokio::mpsc channel, so any delay in re-spawning the listener starves both inbound and outbound traffic.
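To make the discussion concrete, here is a deliberately simplified sketch of that spawn-per-packet shape; the type and variant names (`PeerConn`, `ListenerEvent`) are stand-ins, not the actual freenet-core signatures:

```rust
use tokio::sync::mpsc;

/// Hypothetical stand-in for the real transport connection handle.
struct PeerConn;
impl PeerConn {
    async fn recv(&mut self) -> Vec<u8> { Vec::new() }  // read one UDP packet (stub)
    async fn send(&mut self, _pkt: Vec<u8>) {}          // send one UDP packet (stub)
}

enum ListenerEvent {
    Inbound(Vec<u8>),
    Outbound(Vec<u8>),
    ChannelClosed,
}

/// Short-lived listener: resolves after a single event and hands ownership back
/// to the event loop, which must re-push the future into the select stream
/// before this peer's socket or outbound queue is serviced again.
async fn peer_connection_listener(
    mut conn: PeerConn,
    mut outbound: mpsc::Receiver<Vec<u8>>, // same channel also carries control messages
) -> (PeerConn, mpsc::Receiver<Vec<u8>>, ListenerEvent) {
    let event = tokio::select! {
        pkt = conn.recv() => ListenerEvent::Inbound(pkt),
        msg = outbound.recv() => match msg {
            Some(m) => ListenerEvent::Outbound(m),
            None => ListenerEvent::ChannelClosed,
        },
    };
    // The future ends here: any delay before the event loop re-spawns it stalls
    // both inbound reads and outbound sends for this peer.
    (conn, outbound, event)
}
```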
Problems Observed
- Starvation of outbound acknowledgements. In repeated failures of `test_three_node_network_connectivity` we see the storage node log `Sending outbound message… SuccessfulPut`, yet the requester never logs the corresponding inbound packet. No errors or connection drops occur; the ack just sits in the channel because the listener was busy waiting on `conn.recv()` and never drained the queued outbound work (a simplified sketch of this blocking pattern follows the list). See `/tmp/connectivity_attempt_new_13.log` and `/tmp/freenet-test-test_failure-20251112-182337/events.md`.
- Listener lifecycle coupled to event-loop scheduling. Because the listener future returns after every packet, any hiccup in `priority_select` (e.g., task cancellation, or a delay while processing other sources) means the connection stops reading from the UDP socket entirely. This leads to the "random" PUT timeouts we chased for days.
- Shared channel for data and control. `peer_connection_listener` multiplexes outbound NetMessages and control events (`DropConnection`, `ClosedChannel`) on the same channel it is supposed to drain promptly. When the listener blocks on `conn.recv()`, these control signals can also back up, delaying disconnects and producing misleading diagnostics.
- Difficult diagnostics / backpressure visibility. Without a persistent per-connection task we lack a stable place to collect metrics (queue depth, dropped packets, last-seen timestamps). The workaround is ad hoc logging sprinkled through the event loop, which still cannot tell us whether the socket send actually happened.
- Tight coupling to higher-level routing. Transport has to call back into routing for every packet just to get re-polled, increasing the chance of circular dependencies (see Nacho's warning about deadlocks in `p2p_protoc`).
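As referenced in the first bullet, here is a hedged illustration of the starvation mechanism (hypothetical names, not the actual code path): awaiting `conn.recv()` without concurrently draining the outbound channel lets a queued ack sit unsent until the remote peer happens to send something.

```rust
use tokio::sync::mpsc;

struct PeerConn;
impl PeerConn {
    async fn recv(&mut self) -> Vec<u8> { Vec::new() }  // read one UDP packet (stub)
    async fn send(&mut self, _pkt: Vec<u8>) {}          // send one UDP packet (stub)
}

fn handle_inbound(_pkt: Vec<u8>) {}

/// Exaggerated failure mode: everything already queued on the outbound channel
/// (the SuccessfulPut ack, DropConnection, ClosedChannel) only drains after an
/// inbound packet arrives, because recv() is awaited first.
async fn starved_listener(mut conn: PeerConn, mut outbound: mpsc::Receiver<Vec<u8>>) {
    loop {
        let pkt = conn.recv().await;    // blocks for as long as the remote peer is silent
        handle_inbound(pkt);

        while let Ok(msg) = outbound.try_recv() {
            conn.send(msg).await;       // queued acks finally leave here
        }
    }
}
```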
Desired Outcome
We need a design for a connection layer where each peer connection is driven by a persistent async task that continuously drains the outbound channel, reads inbound UDP packets, and emits well-defined events back to the rest of the node. This issue should catalog the requirements, failure modes, and observability gaps so we can evaluate redesign options (possibly replacing the spawn-per-packet model described above).
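A minimal sketch of that direction, assuming hypothetical types (`PeerConn`, `ConnEvent`) and channel shapes rather than the real freenet-core APIs: one long-lived task per peer keeps draining the outbound queue and reading the socket concurrently, and reports inbound packets and lifecycle changes over a dedicated event channel instead of calling back into routing for every packet.

```rust
use tokio::sync::mpsc;

struct PeerConn;
impl PeerConn {
    async fn recv(&mut self) -> Vec<u8> { Vec::new() }  // read one UDP packet (stub)
    async fn send(&mut self, _pkt: Vec<u8>) {}          // send one UDP packet (stub)
}

/// Events the connection task emits back to the node (names are illustrative).
enum ConnEvent {
    Inbound(Vec<u8>),
    Closed { graceful: bool },
}

/// One persistent task per peer: neither direction depends on the main event
/// loop re-polling a future, and there is a single obvious place to track
/// queue depth, dropped packets, and last-seen timestamps.
async fn connection_task(
    mut conn: PeerConn,
    mut outbound: mpsc::Receiver<Vec<u8>>,  // data only; control gets its own path
    events: mpsc::Sender<ConnEvent>,
) {
    loop {
        tokio::select! {
            // Outbound queue is drained even while the peer is silent.
            msg = outbound.recv() => match msg {
                Some(m) => conn.send(m).await,
                None => {
                    let _ = events.send(ConnEvent::Closed { graceful: true }).await;
                    return;
                }
            },
            // Inbound reads never wait on the event loop re-spawning a future.
            pkt = conn.recv() => {
                if events.send(ConnEvent::Inbound(pkt)).await.is_err() {
                    return; // node side dropped the receiver; shut down
                }
            }
        }
    }
}
```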
Related / Overlapping Work
- Make peer_connection_listener persistent (#2078) already identified one symptom; this issue broadens the scope to capture all the architectural problems before we pick a specific solution.
- Decouple local client subscriptions from network subscriptions (#2075, subscription routing concerns) and Fail fast when peer has no ring location (#2069) highlight downstream effects of unreliable connection handling.
Let’s use this issue to agree on the problem statement and success criteria; concrete design proposals can be follow-ups once we’re aligned on the gaps.