Protocol desync on undelivered `WIRE_UPDATE_ADD_HTLC`, `WIRE_COMMITMENT_SIGNED` messages leads to channel failure #8040

### On our side (CLN):
1. Transmit `WIRE_UPDATE_ADD_HTLC` with HTLC #*X*.
2. Transmit `WIRE_COMMITMENT_SIGNED` with commitment #*Y*.
3. Transmit `WIRE_UPDATE_ADD_HTLC` with HTLC #[*X*+1].
4. (Can't send another commitment now because we have not yet received acknowledgment of commitment #*Y*.)
5. Get notified of peer connection lost.
6. Peer reconnects inbound.
7. Transmit `WIRE_CHANNEL_REESTABLISH`.
8. Receive `WIRE_CHANNEL_REESTABLISH` with commit=[*Y*−1].
9. Send `WIRE_ERROR`: “bad reestablish commitment_number: [*Y*−1] vs [*Y*+1].”
10. Fail channel and unilaterally close.

### On peer's side:
1. Cleanly stop node and reboot.
2. Restart and reconnect.
3. Transmit `WIRE_CHANNEL_REESTABLISH` with commit=[*Y*−1].
4. Receive `WIRE_CHANNEL_REESTABLISH`.
5. Receive `WIRE_ERROR`: “bad reestablish commitment_number: [*Y*−1] vs [*Y*+1].”

This is a protocol desynchronization: we appear to assume that the peer received every message we transmitted, and we are unforgiving when it turns out that some of our last messages were never delivered. In this scenario we never received any message acknowledging our three messages (steps 1-3 on our side), so we cannot know whether the peer ever received them, and we must be prepared to retransmit them after reconnecting.

A sane protocol design would _always_ retransmit any unacknowledged messages upon reconnection, and the recipient would acknowledge, but otherwise take no action on, any messages that it had already received and processed. This might not be possible with the current LN protocol spec. However, what _should_ be possible in any case is for us to recognize that the peer's reestablish commitment number corresponds to some instant between the last acknowledgement we received from them and the last message we sent to them, and to retransmit all messages that we had transmitted between those two instants, to "catch the peer up" to where we believe it should be. (Note that it would **not** be safe simply to rewind our own state to match the state claimed by the peer, as the peer could be lying about its most recent state in an attempt to steal our funds with a justice transaction.)
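The "catch the peer up" idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not CLN's implementation and not the retransmission rules of the LN spec: the names `SentMessage`, `OutboundLog`, and `messages_to_retransmit` are hypothetical, and the sketch assumes every outgoing message is tagged with the commitment number it feeds into.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SentMessage:
    kind: str        # e.g. "update_add_htlc" or "commitment_signed"
    commitment: int  # commitment number this message feeds into (hypothetical tag)


@dataclass
class OutboundLog:
    last_acked_commitment: int  # highest commitment the peer has acknowledged
    last_sent_commitment: int   # highest commitment we have signed and sent
    unacked: list = field(default_factory=list)  # SentMessage items, in send order

    def messages_to_retransmit(self, peer_next_commitment: int) -> list:
        """Return the unacknowledged messages the peer still needs, in send order."""
        oldest_plausible = self.last_acked_commitment + 1
        newest_plausible = self.last_sent_commitment + 1
        if peer_next_commitment < oldest_plausible:
            # The peer claims a state older than one it already acknowledged.
            # Never rewind our own state to match; treat the claim as invalid.
            raise ValueError("peer claims a state it has already moved past")
        if peer_next_commitment > newest_plausible:
            # The peer claims a commitment we never sent.
            raise ValueError("peer claims a state we never sent")
        return [m for m in self.unacked if m.commitment >= peer_next_commitment]


# Illustrative numbers only: commitment 10 plays the role of #Y.
log = OutboundLog(
    last_acked_commitment=9,
    last_sent_commitment=10,
    unacked=[
        SentMessage("update_add_htlc", 10),    # HTLC #X
        SentMessage("commitment_signed", 10),  # commitment #Y
        SentMessage("update_add_htlc", 11),    # HTLC #X+1, not yet committed
    ],
)

# The peer never saw any of the three messages, so all of them are replayed.
replay = log.messages_to_retransmit(10)
print([m.kind for m in replay])
```

The point of the two range checks is that a claim inside the window between the last acknowledgement and the last send is answered with a replay, while a claim outside that window is rejected without touching our own state.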

It looks like this issue has been the cause of as many as 7 of my unilateral channel closures since I first started running my node, counting only unique channels to which I sent “bad reestablish commitment_number” where the numbers were within 30 (the default `max-concurrent-htlcs`) of each other. It appears that I have been on both sides of the issue, too.

```
2021-05-21T14:52:17.810Z UNUSUAL 03…2c-chan#61718: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 9897 vs 9899
2022-06-08T07:40:00.135Z UNUSUAL 03…e8-chan#248717: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 389 vs 391
2022-06-15T13:34:03.108Z UNUSUAL 02…d2-chan#248606: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 29512 vs 29514
2022-09-14T18:06:47.881Z UNUSUAL 03…7c-chan#249973: Peer permanent failure in CHANNELD_NORMAL: channeld: received ERROR error channel 55…1e: bad reestablish commitment_number: 8210 vs 8209
2022-11-02T08:36:11.905Z UNUSUAL 02…91-chan#250060: Peer permanent failure in CHANNELD_NORMAL: channeld: received ERROR error channel 8b…ba: bad reestablish commitment_number: 16191 vs 16190
2023-08-09T10:56:51.138Z UNUSUAL 02…70-chan#254057: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 92806 vs 92808
2025-01-28T14:37:36.571Z UNUSUAL 03…00-chan#255673: Peer permanent failure in CHANNELD_NORMAL: channeld: sent bad reestablish commitment_number: 17757 vs 17759 (reason=local)
```

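In every case above where I was the sender of the error, the two commitment numbers in the message are exactly 2 apart, with the second number larger. In the two cases where I received the error, they are 1 apart, with the first number larger. This may or may not be significant.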
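For anyone who wants to check their own logs for the same pattern, here is a small sketch that extracts the two numbers from each "bad reestablish commitment_number" line and computes the gap. The regular expression is my own guess at a pattern that covers the three line shapes shown above; it is not taken from CLN, and the helper name `reestablish_gaps` is hypothetical. Only three of the seven lines are included as sample input.

```python
import re

# Matches both "sent ERROR bad ...", "sent bad ..." and "received ERROR ... bad ..."
PATTERN = re.compile(
    r"channeld: (sent|received)\b.*?"
    r"bad reestablish commitment_number: (\d+) vs (\d+)"
)


def reestablish_gaps(lines):
    """Yield (direction, first, second, second - first) for each matching line."""
    for line in lines:
        m = PATTERN.search(line)
        if m:
            first, second = int(m.group(2)), int(m.group(3))
            yield m.group(1), first, second, second - first


SAMPLE = [
    "2021-05-21T14:52:17.810Z UNUSUAL 03…2c-chan#61718: Peer permanent failure in "
    "CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 9897 vs 9899",
    "2022-09-14T18:06:47.881Z UNUSUAL 03…7c-chan#249973: Peer permanent failure in "
    "CHANNELD_NORMAL: channeld: received ERROR error channel 55…1e: "
    "bad reestablish commitment_number: 8210 vs 8209",
    "2025-01-28T14:37:36.571Z UNUSUAL 03…00-chan#255673: Peer permanent failure in "
    "CHANNELD_NORMAL: channeld: sent bad reestablish commitment_number: 17757 vs 17759 "
    "(reason=local)",
]

for direction, first, second, gap in reestablish_gaps(SAMPLE):
    print(f"{direction:8s} {first} vs {second} (gap {gap:+d})")
```

To run it against a real log, pass an open file object instead of `SAMPLE`, since the function only needs an iterable of lines.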
