Protocol desync on undelivered WIRE_UPDATE_ADD_HTLC, WIRE_COMMITMENT_SIGNED messages leads to channel failure #8040

Open
whitslack opened this issue Jan 29, 2025 · 1 comment

Comments

@whitslack
Collaborator

whitslack commented Jan 29, 2025

On our side (CLN):

  1. Transmit WIRE_UPDATE_ADD_HTLC with HTLC #X.
  2. Transmit WIRE_COMMITMENT_SIGNED with commitment #Y.
  3. Transmit WIRE_UPDATE_ADD_HTLC with HTLC #[X+1].
  4. (Can't send another commitment now because we have not yet received acknowledgment of commitment #Y.)
  5. Get notified of peer connection lost.
  6. Peer reconnects inbound.
  7. Transmit WIRE_CHANNEL_REESTABLISH.
  8. Receive WIRE_CHANNEL_REESTABLISH with commit=[Y−1].
  9. Send WIRE_ERROR: “bad reestablish commitment_number: [Y−1] vs [Y+1].”
  10. Fail channel and unilaterally close.

On peer's side:

  1. Cleanly stop node and reboot.
  2. Restart and reconnect.
  3. Transmit WIRE_CHANNEL_REESTABLISH with commit=[Y−1].
  4. Receive WIRE_CHANNEL_REESTABLISH.
  5. Receive WIRE_ERROR: “bad reestablish commitment_number: [Y−1] vs [Y+1].”

This is a protocol desynchronization: we apparently assume that the peer receives every message we transmit, and we are unforgiving when some of our last messages turn out never to have been delivered. In this scenario we never received any acknowledgment of our three messages (steps 1-3 on our side), so we cannot know whether the peer ever received them, and we must be prepared to retransmit them after reconnecting.
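To make the failure concrete, here is a minimal sketch of a strict reestablish comparison. It is a simplified model written for illustration, not CLN's actual code: the function name `strict_reestablish_check` and the single-equality rule are assumptions, and the value of `Y` is chosen only so that the output matches the most recent log line quoted further down.

```python
def strict_reestablish_check(peer_next_commitment: int, our_next_index: int) -> str:
    """Hypothetical model of a check that tolerates no undelivered messages.

    peer_next_commitment: the commitment number the peer reports on reconnect.
    our_next_index: the number we believe the peer should be reporting,
                    assuming everything we transmitted was delivered.
    """
    if peer_next_commitment == our_next_index:
        return "in sync"
    # Any mismatch is treated as fatal; the channel would be failed here.
    return (
        "bad reestablish commitment_number: "
        f"{peer_next_commitment} vs {our_next_index}"
    )


Y = 17758  # assumed value, for illustration only
print(strict_reestablish_check(Y - 1, Y + 1))
# prints: bad reestablish commitment_number: 17757 vs 17759
```

The point of the sketch is that the check has only two outcomes, so a peer that is merely behind by undelivered messages is indistinguishable from a peer in a genuinely invalid state.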

A sane protocol design would always retransmit any unacknowledged messages upon reconnection, and the recipient would acknowledge but otherwise take no action on any messages that it had already received and processed. That might not be possible with the current LN protocol spec. What should be possible in any case is for us to recognize that the peer's reestablish commitment number corresponds to some instant between the last acknowledgment we received from them and the last message we sent to them, and then to retransmit all messages that we had transmitted between those two instants, to "catch the peer up" to where we believe it should be. (Note that it would not be safe simply to rewind our own state to match the state claimed by the peer, as the peer could be lying about its most recent state in an attempt to steal our funds with a justice transaction.)
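The catch-up idea could look roughly like the following sketch. Everything here is hypothetical: `SentLog`, `plan_reestablish`, and the scheme of tagging each sent message with the commitment number it belongs to are assumptions made for illustration, and none of it reflects how CLN or the LN spec actually tracks state. The sketch only shows the decision: replay if the peer's claimed number falls between the last acknowledgment and the last message sent, fail otherwise, and never rewind local state.

```python
from dataclasses import dataclass, field


@dataclass
class SentLog:
    """Hypothetical record of what we sent since the peer's last acknowledgment."""

    last_acked: int  # highest commitment number the peer has acknowledged
    # (commitment_number, message) pairs sent after last_acked, in send order
    unacked: list = field(default_factory=list)

    def record(self, commitment_number: int, message: str) -> None:
        self.unacked.append((commitment_number, message))

    def last_sent(self) -> int:
        return self.unacked[-1][0] if self.unacked else self.last_acked


def plan_reestablish(log: SentLog, peer_claimed: int):
    """Return ("replay", messages) or ("fail", reason).

    Our own state is never rewound; the only recovery action is resending.
    """
    if peer_claimed < log.last_acked:
        # The peer claims a state older than one it already acknowledged.
        return ("fail", f"peer claims {peer_claimed}, already acked {log.last_acked}")
    if peer_claimed > log.last_sent():
        # The peer claims a state we never sent.
        return ("fail", f"peer claims {peer_claimed}, we sent up to {log.last_sent()}")
    # Resend, in original order, everything past the peer's claimed state.
    replay = [msg for num, msg in log.unacked if num > peer_claimed]
    return ("replay", replay)


# Steps 1-3 from the report, with Y = 10 chosen arbitrarily.
log = SentLog(last_acked=9)
log.record(10, "update_add_htlc X")
log.record(10, "commitment_signed Y")
log.record(11, "update_add_htlc X+1")

print(plan_reestablish(log, 9))   # peer saw none of the three: replay all
print(plan_reestablish(log, 8))   # older than last ack: refuse
```

A claimed state older than the last acknowledgment is rejected rather than accommodated, which is the sketch's way of honoring the warning about a peer lying about its most recent state.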

It looks like this issue has been the cause of as many as 7 of my unilateral channel closures since I first started running my node, counting only unique channels to which I sent “bad reestablish commitment_number” where the numbers were within 30 (the default max-concurrent-htlcs) of each other. It appears that I have been on both sides of the issue, too.

2021-05-21T14:52:17.810Z UNUSUAL 03…2c-chan#61718: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 9897 vs 9899
2022-06-08T07:40:00.135Z UNUSUAL 03…e8-chan#248717: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 389 vs 391
2022-06-15T13:34:03.108Z UNUSUAL 02…d2-chan#248606: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 29512 vs 29514
2022-09-14T18:06:47.881Z UNUSUAL 03…7c-chan#249973: Peer permanent failure in CHANNELD_NORMAL: channeld: received ERROR error channel 55…1e: bad reestablish commitment_number: 8210 vs 8209
2022-11-02T08:36:11.905Z UNUSUAL 02…91-chan#250060: Peer permanent failure in CHANNELD_NORMAL: channeld: received ERROR error channel 8b…ba: bad reestablish commitment_number: 16191 vs 16190
2023-08-09T10:56:51.138Z UNUSUAL 02…70-chan#254057: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 92806 vs 92808
2025-01-28T14:37:36.571Z UNUSUAL 03…00-chan#255673: Peer permanent failure in CHANNELD_NORMAL: channeld: sent bad reestablish commitment_number: 17757 vs 17759 (reason=local)

I would note that in the cases where I was the sender of the error, the commitment numbers in the error message have always been 2 apart, which matches the [Y−1] vs [Y+1] pattern in the steps above. This may or may not be significant.
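For anyone who wants to run the same count against their own logs, here is a rough sketch of the filter described above. The regular expression is an assumption inferred only from the log lines quoted in this issue, and `near_miss_closures` is a hypothetical helper name; other CLN versions may format these lines differently.

```python
import re

# Pattern inferred from the quoted log lines; treat it as an assumption.
PATTERN = re.compile(
    r"chan#(?P<chan>\d+):.*?channeld: (?P<direction>sent|received)\b.*?"
    r"bad reestablish commitment_number: (?P<a>\d+) vs (?P<b>\d+)"
)


def near_miss_closures(lines, max_gap=30):
    """Map channel id -> (direction, gap) for the first matching line per channel.

    Only mismatches where the two numbers are within max_gap of each other
    are kept; 30 mirrors the max-concurrent-htlcs default mentioned above.
    """
    seen = {}
    for line in lines:
        m = PATTERN.search(line)
        if not m:
            continue
        gap = abs(int(m["a"]) - int(m["b"]))
        if gap <= max_gap:
            # setdefault keeps one entry per unique channel.
            seen.setdefault(m["chan"], (m["direction"], gap))
    return seen


sample = [
    "2021-05-21T14:52:17.810Z UNUSUAL 03…2c-chan#61718: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish commitment_number: 9897 vs 9899",
    "2022-09-14T18:06:47.881Z UNUSUAL 03…7c-chan#249973: Peer permanent failure in CHANNELD_NORMAL: channeld: received ERROR error channel 55…1e: bad reestablish commitment_number: 8210 vs 8209",
]
print(near_miss_closures(sample))
# prints: {'61718': ('sent', 2), '249973': ('received', 1)}
```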

@whitslack
Collaborator Author

Implementing graceful shutdown (#4842) might partially mitigate this issue, but it won't be an airtight fix. In general we must always retransmit any messages that we previously transmitted but for which we never received acknowledgement. TCP takes care of this for us within the context of a single socket connection, but we have to do it at the application layer in the cases when the TCP connection breaks.
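As a generic illustration of application-layer retransmission (not the LN wire protocol, and not something the current spec is known to allow), the sketch below numbers each message, keeps it until acknowledged, and resends everything still unacknowledged after a reconnect. The `Sender` and `Receiver` classes, the sequence numbers, and the `deliver` callback are all hypothetical; a `deliver` that returns `None` stands in for a connection that dropped before an acknowledgment came back.

```python
class Receiver:
    def __init__(self):
        self.last_processed = 0
        self.applied = []

    def receive(self, seq, payload):
        """Apply each message at most once; always answer with a cumulative ack."""
        if seq == self.last_processed + 1:
            self.applied.append(payload)
            self.last_processed = seq
        # Duplicates (seq <= last_processed) are acknowledged but not re-applied.
        return self.last_processed


class Sender:
    def __init__(self):
        self.next_seq = 1
        self.unacked = {}  # seq -> payload, kept until acknowledged

    def send(self, payload, deliver):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        self._transmit(seq, deliver)

    def _transmit(self, seq, deliver):
        ack = deliver(seq, self.unacked[seq])
        if ack is not None:
            # A cumulative ack releases every message up to and including it.
            for s in [s for s in self.unacked if s <= ack]:
                del self.unacked[s]

    def on_reconnect(self, deliver):
        """Resend, in order, everything we never saw acknowledged."""
        for seq in sorted(self.unacked):
            if seq in self.unacked:  # may have been released by an earlier ack
                self._transmit(seq, deliver)


rx, tx = Receiver(), Sender()
lost = lambda seq, payload: None        # message never arrives
tx.send("a", rx.receive)                # delivered and acknowledged
tx.send("b", lost)                      # connection breaks here
tx.send("c", lost)
tx.on_reconnect(rx.receive)             # b and c are replayed
print(rx.applied)                       # prints: ['a', 'b', 'c']
```

The receiver's duplicate handling matters as much as the sender's queue: when a message arrived but its acknowledgment was lost, the replay has to be acknowledged again without being applied twice.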
