Protocol desync on undelivered WIRE_UPDATE_ADD_HTLC, WIRE_COMMITMENT_SIGNED messages leads to channel failure #8040

Open
@whitslack

On our side (CLN):

  1. Transmit WIRE_UPDATE_ADD_HTLC with HTLC #X.
  2. Transmit WIRE_COMMITMENT_SIGNED with commitment #Y.
  3. Transmit WIRE_UPDATE_ADD_HTLC with HTLC #[X+1].
  4. (Can't send another commitment now because we have not yet received acknowledgment of commitment #Y.)
  5. Get notified of peer connection lost.
  6. Peer reconnects inbound.
  7. Transmit WIRE_CHANNEL_REESTABLISH.
  8. Receive WIRE_CHANNEL_REESTABLISH with commit=[Y−1].
  9. Send WIRE_ERROR: “bad reestablish commitment_number: [Y−1] vs [Y+1].”
  10. Fail channel and unilaterally close.

On peer's side:

  1. Cleanly stop node and reboot.
  2. Restart and reconnect.
  3. Transmit WIRE_CHANNEL_REESTABLISH with commit=[Y−1].
  4. Receive WIRE_CHANNEL_REESTABLISH.
  5. Receive WIRE_ERROR: “bad reestablish commitment_number: [Y−1] vs [Y+1].”

This is a protocol desynchronization: we apparently assume that the peer receives every message we transmit, and we are unforgiving when some of our last messages turn out never to have been delivered. In this scenario we never received any acknowledgment of the three messages from steps 1-3 on our side, so we cannot know whether the peer ever received them, and we must be prepared to retransmit them after reconnecting.

A sane protocol design would always retransmit any unacknowledged messages upon reconnection, and the recipient would acknowledge but otherwise take no action on any messages that it had already received and processed. This might not be possible with the current LN protocol spec. What should be possible in any case is for us to recognize that the peer's reestablish commitment number corresponds to some instant between the last acknowledgement we received from them and the last message we sent to them, and then to retransmit every message we transmitted between those two instants, to "catch the peer up" to where we believe it should be. (Note that it would not be safe simply to rewind our own state to match the state claimed by the peer, as the peer could be lying about its most recent state in an attempt to steal our funds with a justice transaction.)
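
To make the proposal concrete, here is a minimal sketch of the decision I have in mind. It is not CLN code and does not use the real `channel_reestablish` fields: the function `plan_reestablish`, the `(commitment number, label)` bookkeeping, and the meaning I give to `peer_next` (the commitment number the peer says it expects next) are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical bookkeeping, not CLN's actual data structures:
# each entry is (commitment number the message belongs to, message label).
SentMessage = Tuple[int, str]


def plan_reestablish(last_acked: int, sent: List[SentMessage],
                     peer_next: int) -> Tuple[str, List[str]]:
    """Decide what to do with the peer's claimed position after a reconnect.

    last_acked: highest commitment number the peer has acknowledged (assumed).
    sent:       unacknowledged messages, in the order we sent them.
    peer_next:  commitment number the peer says it expects next (assumed
                meaning; the real channel_reestablish fields may differ).
    """
    # Highest commitment number we have signed and sent so far.
    last_sent = max(
        (n for n, label in sent if label.startswith("commitment_signed")),
        default=last_acked,
    )

    # Peer claims a state older than one it already acknowledged: never rewind.
    if peer_next <= last_acked:
        return ("fail", [])
    # Peer claims a state beyond anything we ever sent.
    if peer_next > last_sent + 1:
        return ("fail", [])
    # Peer is inside the unacknowledged window: replay from its position.
    return ("retransmit", [label for n, label in sent if n >= peer_next])


# The scenario from the steps above, with Y = 100 chosen arbitrarily.
sent = [
    (100, "update_add_htlc #X"),
    (100, "commitment_signed #Y"),
    (101, "update_add_htlc #[X+1]"),
]
print(plan_reestablish(99, sent, 100))
```

With these assumed numbers, a peer that never saw commitment #Y gets all three messages replayed, while a peer claiming a position at or before the last acknowledged commitment is refused rather than followed backwards.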

It looks like this issue has caused as many as 7 of my unilateral channel closures since I first started running my node. That count includes only unique channels that logged “bad reestablish commitment_number” with the two numbers within 30 (the default max-concurrent-htlcs) of each other. It appears that I have been on both sides of the issue, too.

2021-05-21T14:52:17.810Z UNUSUAL 03…2c-chan#61718: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 9897 vs 9899
2022-06-08T07:40:00.135Z UNUSUAL 03…e8-chan#248717: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 389 vs 391
2022-06-15T13:34:03.108Z UNUSUAL 02…d2-chan#248606: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 29512 vs 29514
2022-09-14T18:06:47.881Z UNUSUAL 03…7c-chan#249973: Peer permanent failure in CHANNELD_NORMAL: channeld: received ERROR error channel 55…1e: bad reestablish commitment_number: 8210 vs 8209
2022-11-02T08:36:11.905Z UNUSUAL 02…91-chan#250060: Peer permanent failure in CHANNELD_NORMAL: channeld: received ERROR error channel 8b…ba: bad reestablish commitment_number: 16191 vs 16190
2023-08-09T10:56:51.138Z UNUSUAL 02…70-chan#254057: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 92806 vs 92808
2025-01-28T14:37:36.571Z UNUSUAL 03…00-chan#255673: Peer permanent failure in CHANNELD_NORMAL: channeld: sent bad reestablish commitment_number: 17757 vs 17759 (reason=local)

I would note that in the cases where I was the sender of the error, the commitment numbers in the error message have always been 2 apart. This may or may not be significant.
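
For anyone who wants to check their own logs for the same pattern, here is a small sketch that extracts the direction and the gap between the two numbers. The regular expression is my own guess at a pattern that covers the three message shapes seen in the lines above; it is not derived from CLN's source, and `reestablish_deltas` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# The seven log lines quoted above, copied verbatim.
LOG = """\
2021-05-21T14:52:17.810Z UNUSUAL 03…2c-chan#61718: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 9897 vs 9899
2022-06-08T07:40:00.135Z UNUSUAL 03…e8-chan#248717: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 389 vs 391
2022-06-15T13:34:03.108Z UNUSUAL 02…d2-chan#248606: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 29512 vs 29514
2022-09-14T18:06:47.881Z UNUSUAL 03…7c-chan#249973: Peer permanent failure in CHANNELD_NORMAL: channeld: received ERROR error channel 55…1e: bad reestablish commitment_number: 8210 vs 8209
2022-11-02T08:36:11.905Z UNUSUAL 02…91-chan#250060: Peer permanent failure in CHANNELD_NORMAL: channeld: received ERROR error channel 8b…ba: bad reestablish commitment_number: 16191 vs 16190
2023-08-09T10:56:51.138Z UNUSUAL 02…70-chan#254057: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 92806 vs 92808
2025-01-28T14:37:36.571Z UNUSUAL 03…00-chan#255673: Peer permanent failure in CHANNELD_NORMAL: channeld: sent bad reestablish commitment_number: 17757 vs 17759 (reason=local)
"""

# Assumed pattern: direction word after "channeld:", then the two numbers.
PATTERN = re.compile(
    r"channeld: (sent|received)\b.*?"
    r"bad reestablish commitment_number: (\d+) vs (\d+)"
)


def reestablish_deltas(log_text: str) -> List[Tuple[str, int, int, int]]:
    """Return (direction, first, second, second - first) for each match."""
    rows = []
    for m in PATTERN.finditer(log_text):
        first, second = int(m.group(2)), int(m.group(3))
        rows.append((m.group(1), first, second, second - first))
    return rows


for direction, first, second, delta in reestablish_deltas(LOG):
    print(f"{direction:8s} {first} vs {second} (delta {delta:+d})")
```

On the seven lines above, every "sent" entry has the second number 2 higher than the first, and both "received" entries have the second number 1 lower than the first.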
