Document connection-layer redesign requirements #2079

@sanity

Description

Background

While stabilizing PR #2064 we instrumented the transport/event-loop boundary (peer_connection_listener, priority_select, etc.) and uncovered multiple structural issues. The current model spawns a short-lived peer_connection_listener future for every inbound packet and relies on the main event loop to immediately push that future back into the select-stream. Outbound messages to that peer are queued on the same per-connection tokio::mpsc channel that this listener is responsible for draining, so any delay in re-spawning the listener starves both inbound and outbound traffic.
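
To make that lifecycle concrete, here is a minimal sketch of the spawn-per-packet shape. It is a simplification for illustration only: aside from `peer_connection_listener`, `priority_select`, and the tokio primitives, the types and signatures below are stand-ins, not the real transport code.

```rust
use tokio::sync::mpsc;

struct PeerConnection; // stand-in for the real UDP-backed connection

impl PeerConnection {
    async fn recv(&mut self) -> Vec<u8> {
        Vec::new() // stand-in for a UDP read
    }
    async fn send(&mut self, _bytes: &[u8]) {
        // stand-in for a UDP write
    }
}

/// Handles a single event (one inbound packet or one queued outbound message),
/// then hands the connection back to the event loop, which must immediately
/// push a fresh listener future into the `priority_select` stream. Any delay
/// in that re-spawn means nobody reads the socket or drains the outbound queue.
async fn peer_connection_listener(
    mut conn: PeerConnection,
    mut outbound: mpsc::Receiver<Vec<u8>>,
) -> (PeerConnection, mpsc::Receiver<Vec<u8>>) {
    tokio::select! {
        _packet = conn.recv() => {
            // inbound path: hand the packet to routing, then stop until re-spawned
        }
        Some(bytes) = outbound.recv() => {
            conn.send(&bytes).await;
        }
    }
    (conn, outbound) // the event loop owns re-scheduling
}
```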

Problems Observed

  • Starvation of outbound acknowledgements. In repeated failures of test_three_node_network_connectivity we see the storage node log Sending outbound message… SuccessfulPut, yet the requester never logs the corresponding inbound packet. No errors or connection drops occur—the ack just sits in the channel because the listener was busy waiting on conn.recv() and never drained the queued outbound work. (See /tmp/connectivity_attempt_new_13.log and /tmp/freenet-test-test_failure-20251112-182337/events.md).
  • Listener lifecycle coupled to event-loop scheduling. Because the listener future returns after every packet, any hiccup in priority_select (e.g., task cancellation, delay while processing other sources) means the connection stops reading from the UDP socket entirely. This leads to the “random” PUT timeouts we chased for days.
  • Shared channel for data and control. peer_connection_listener multiplexes outbound NetMessages and control events (DropConnection, ClosedChannel) on the same channel it is supposed to drain promptly. When the listener blocks on conn.recv(), these control signals can also backlog, delaying disconnects and causing misleading diagnostics. (A minimal illustration of this shape follows the list.)
  • Difficult diagnostics/backpressure visibility. Without a persistent per-connection task we lack a stable place to collect metrics (queue depth, dropped packets, last-seen timestamps). The workaround is ad-hoc logging sprinkled in the event loop, which still can’t tell us if the socket send actually happened.
  • Tight coupling to higher-level routing. Transport has to call back into routing for every packet just to get re-polled, increasing the chance of circular dependencies (see Nacho’s warning about deadlocks in p2p_protoc).
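
To illustrate the shared data/control channel from the list above, here is a rough sketch of the combined per-peer message type, assuming for illustration that data and control travel as variants of one enum over a single tokio::mpsc channel; the names are stand-ins, not the actual API.

```rust
use tokio::sync::mpsc;

// Illustrative stand-in for the single per-peer channel: outbound data and
// control signals are variants of one message type. Because the channel is
// FIFO, a control signal enqueued behind pending outbound messages is only
// seen after those are drained, and not at all while the listener is parked
// on `conn.recv()`.
enum PeerTask {
    Outbound(Vec<u8>),   // serialized NetMessage destined for the peer
    DropConnection,      // control: tear the connection down
    ClosedChannel,       // control: the remote side went away
}

type PeerSender = mpsc::Sender<PeerTask>;
type PeerReceiver = mpsc::Receiver<PeerTask>;
```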

Desired Outcome

We need a design for a connection layer where each peer connection is driven by a persistent async task that continuously handles both outbound channel draining and inbound UDP reads, and emits well-defined events back to the rest of the node. The issue should catalog the requirements, failure modes, and observability gaps so we can evaluate redesign options (possibly replacing the spawn-per-packet model described above).
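
As a non-binding sketch of that outcome (illustrative names and signatures only, not a concrete proposal), a persistent per-connection task might select over the socket, the outbound queue, and a dedicated control channel, and report back through an event channel:

```rust
use tokio::sync::mpsc;

struct PeerConnection; // stand-in for the real UDP-backed connection

impl PeerConnection {
    async fn recv(&mut self) -> Vec<u8> {
        Vec::new() // stand-in for a UDP read
    }
    async fn send(&mut self, _bytes: &[u8]) {
        // stand-in for a UDP write
    }
}

enum Control {
    DropConnection, // ask the task to shut the connection down
}

enum ConnEvent {
    Inbound(Vec<u8>), // packet received from the peer
    Closed,           // the task is exiting; the connection is gone
}

/// Persistent task: owns the connection for its whole lifetime, keeps reading
/// the socket and draining the outbound queue concurrently, and emits events
/// instead of relying on the event loop to re-poll it per packet. A long-lived
/// task is also a stable place to hang per-connection metrics (queue depth,
/// dropped packets, last-seen timestamps).
async fn connection_task(
    mut conn: PeerConnection,
    mut outbound: mpsc::Receiver<Vec<u8>>,
    mut control: mpsc::Receiver<Control>,
    events: mpsc::Sender<ConnEvent>,
) {
    loop {
        tokio::select! {
            packet = conn.recv() => {
                if events.send(ConnEvent::Inbound(packet)).await.is_err() {
                    break; // the node side went away
                }
            }
            maybe = outbound.recv() => match maybe {
                Some(bytes) => conn.send(&bytes).await,
                None => break, // all senders dropped; nothing left to send
            },
            _ = control.recv() => break, // DropConnection, or control channel closed
        }
    }
    let _ = events.send(ConnEvent::Closed).await;
}
```

How the event and outbound channels are bounded, and what happens when they fill, is exactly the kind of backpressure question this issue should pin down.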

Related / Overlapping Work

Let’s use this issue to agree on the problem statement and success criteria; concrete design proposals can be follow-ups once we’re aligned on the gaps.

Metadata

Labels

    A-networking (Area: Networking, ring protocol, peer discovery)
    E-hard (Experience needed to fix/implement: Hard / a lot)
    S-needs-design (Status: Needs architectural design or RFC)
    T-enhancement (Type: Improvement to existing functionality)
