fix: preserve connection on read timeout in recv_header #101

Open
sbaldis wants to merge 1 commit into obabec:main from sbaldis:fix/timeout-no-close

Conversation

@sbaldis sbaldis commented Mar 29, 2026

Problem

recv_header() delegates all errors to handle_rx(), which unconditionally calls close_with() — terminating the connection state machine. This includes Network(ErrorKind::TimedOut), even though a read timeout is a recoverable condition.

In std environments, transports commonly use SO_RCVTIMEO (socket read timeout) to implement bounded polling. When the timeout fires, the transport returns ErrorKind::TimedOut. The current behavior kills the connection, requiring a full reconnect cycle (TCP + CONNECT + SUBSCRIBE) even though no actual I/O failure occurred.
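For context, bounded polling over SO_RCVTIMEO with a std transport looks roughly like the sketch below (`poll_once` is a hypothetical helper, not code from this crate). One caveat worth noting: per the `std::net::TcpStream::set_read_timeout` documentation, Unix platforms typically report a read timeout as `ErrorKind::WouldBlock`, while Windows reports `ErrorKind::TimedOut`, so a transport may need to normalize both kinds.

```rust
use std::io::{ErrorKind, Read};
use std::net::{TcpListener, TcpStream};
use std::time::Duration;

// Hypothetical helper, not part of the crate: one bounded poll attempt.
fn poll_once(stream: &mut TcpStream) -> std::io::Result<usize> {
    // SO_RCVTIMEO via the std API: a blocking read fails after 100 ms.
    stream.set_read_timeout(Some(Duration::from_millis(100)))?;
    let mut byte = [0u8; 1];
    match stream.read(&mut byte) {
        Ok(n) => Ok(n),
        // Unix reports WouldBlock, Windows TimedOut, for the same event.
        Err(e) if matches!(e.kind(), ErrorKind::TimedOut | ErrorKind::WouldBlock) => Ok(0),
        Err(e) => Err(e),
    }
}

fn main() {
    // A peer that accepts but never writes, so the read must time out.
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    let mut stream = TcpStream::connect(addr).unwrap();
    let _peer = listener.accept().unwrap();
    assert_eq!(poll_once(&mut stream).unwrap(), 0); // timed out, no I/O failure
}
```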

Fix

Intercept ErrorKind::TimedOut in recv_header() before it reaches handle_rx(). The error is still returned to the caller, but the connection remains in NetState::Ok and can resume receiving on the next call.

This is safe because recv_header() is cancel-safe — HeaderState reads one byte at a time and preserves partial progress across interruptions. A read timeout is semantically equivalent to the cancellation that HeaderState already handles.

The fix is scoped to recv_header() only. recv_body() is explicitly not cancel-safe, so its errors still go through handle_rx() on all error kinds.
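A minimal model of the resulting control flow (identifiers such as `NetState`, `RawError`, and `handle_rx` follow the description above; this is a sketch of the error path, not the crate's actual code):

```rust
use std::io::ErrorKind;

#[derive(Debug, PartialEq)]
enum NetState { Ok, Closed }

#[derive(Debug, PartialEq)]
enum RawError { Network(ErrorKind) }

struct Conn { state: NetState }

impl Conn {
    // Models handle_rx(): unconditionally tears the connection down.
    fn handle_rx(&mut self, e: RawError) -> RawError {
        self.state = NetState::Closed; // close_with()
        e
    }

    // Models the error path of recv_header() after this change.
    fn recv_header_err(&mut self, e: RawError) -> RawError {
        // Intercept read timeouts before they reach handle_rx():
        // the error is still returned, but the connection stays Ok.
        if matches!(e, RawError::Network(ErrorKind::TimedOut)) {
            return e;
        }
        self.handle_rx(e)
    }
}

fn main() {
    let mut c = Conn { state: NetState::Ok };

    let e = c.recv_header_err(RawError::Network(ErrorKind::TimedOut));
    assert_eq!(e, RawError::Network(ErrorKind::TimedOut));
    assert_eq!(c.state, NetState::Ok); // recoverable: still connected

    let e = c.recv_header_err(RawError::Network(ErrorKind::ConnectionReset));
    assert_eq!(e, RawError::Network(ErrorKind::ConnectionReset));
    assert_eq!(c.state, NetState::Closed); // hard errors still close
}
```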

Test

Adds recv_header_timeout_preserves_connection:

  1. Mock transport returns ErrorKind::TimedOut on the first read
  2. Verifies recv_header() returns RawError::Network(TimedOut)
  3. Verifies the connection is still usable — a subsequent recv_header() succeeds

All 104 existing tests continue to pass.
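The shape of the mock can be sketched as follows (a hypothetical `MockTransport` against `std::io::Read` for illustration; the real test targets the crate's own transport trait):

```rust
use std::io::{Error, ErrorKind, Read};

// Hypothetical mock, mirroring the test's shape: the first read
// times out, subsequent reads deliver a header byte.
struct MockTransport { fired: bool }

impl Read for MockTransport {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        if !self.fired {
            self.fired = true;
            return Err(Error::new(ErrorKind::TimedOut, "recv timeout"));
        }
        buf[0] = 0x30; // PUBLISH (QoS 0) fixed-header byte
        Ok(1)
    }
}

fn main() {
    let mut t = MockTransport { fired: false };
    let mut buf = [0u8; 1];

    // 1. The first read surfaces the timeout...
    let err = t.read(&mut buf).unwrap_err();
    assert_eq!(err.kind(), ErrorKind::TimedOut);

    // 2. ...and the transport is still usable afterwards.
    assert_eq!(t.read(&mut buf).unwrap(), 1);
    assert_eq!(buf[0], 0x30);
}
```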

recv_header() is cancel-safe (HeaderState reads one byte at a time and
preserves partial progress). When the transport returns
ErrorKind::TimedOut, the connection can safely resume on the next call.

Previously, all errors from header reads went through handle_rx(),
which unconditionally calls close_with() — terminating the connection
state machine. This meant a simple read timeout (e.g., from
SO_RCVTIMEO) would kill the connection, requiring a full reconnect
cycle (TCP + CONNECT + SUBSCRIBE).

This change intercepts TimedOut errors in recv_header() before they
reach handle_rx(), returning the error without terminating the
connection. The fix is scoped to recv_header only — recv_body is NOT
cancel-safe and still goes through handle_rx() on all errors.

Includes a test verifying that recv_header() returns the timeout
error but the connection remains usable for subsequent reads.
@julian-graf julian-graf (Collaborator) commented

Hey! Thanks for the PR.

Regarding the issue, I have a few questions and I'm not sure whether this is actually a problem:

  • Bounded polling can already be implemented by dropping the poll_header future. For timeouts, you can just poll the client along with a timer future and set the socket timeout to a value greater than the timer. Is this something that doesn't work in your environment? (I do agree that this is probably the easiest way to support bounded polling in a sync implementation, which is something that is planned but I can't say when it will be implemented)
  • According to your implementation, all occurrences of a timeout ErrorKind are treated as recoverable on the I/O side, which is something I can't say for sure about all embedded_io_async implementations.
  • The error handling in poll_body still treats a timeout as a hard error. As a user, I think this is surprising. While this is not something that realistically happens because once a header is received, the rest of the packet should follow in time, I think this violates the principle of least surprise. This could probably be worked around in the low level calls of embedded_io_async::Read though.

Regarding the implementation:

  • When a timeout occurs, the client will still return MqttError::Network, which is unrecoverable in the client's error semantics. Perhaps it is worth introducing a new error variant, or at least adjusting the documentation and the MqttError::is_recoverable implementation.
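The suggested adjustment could be sketched like this (variant names are illustrative only; the crate's actual MqttError has different variants, and the final design is up to the maintainers):

```rust
// Hypothetical sketch of the suggestion: a dedicated recoverable
// variant for read timeouts, reflected in is_recoverable().
#[derive(Debug, PartialEq)]
enum MqttError {
    Network,     // hard I/O failure: reconnect required
    ReadTimeout, // bounded poll elapsed: safe to call recv again
}

impl MqttError {
    fn is_recoverable(&self) -> bool {
        matches!(self, MqttError::ReadTimeout)
    }
}

fn main() {
    assert!(MqttError::ReadTimeout.is_recoverable());
    assert!(!MqttError::Network.is_recoverable());
}
```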
