
Bugfix/htlc flush shutdown #8145

Conversation

@ProofOfKeags (Collaborator) commented Nov 2, 2023:

Change Description

This change fixes an issue where we would fail a channel when a peer sent us a shutdown message while there were still active HTLCs: see #6039

Steps to Test

Steps for reviewers to follow to test the change.

Pull Request Checklist

Testing

  • Your PR passes all CI checks.
  • Tests covering the positive and negative (error) paths are included.
  • Bug fixes contain tests triggering the bug to prevent regressions.

Code Style and Documentation

📝 Please see our Contribution Guidelines for further guidance.

This commit introduces many of the most common functions you will
want to use with the Option type. Not all of them are used
immediately in this PR.
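
Since the commit message references these helpers without listing them, here is a minimal sketch of the kind of combinators such an Option package typically provides. All names and signatures below are illustrative assumptions, not necessarily those of fn/option.go in this PR.

package fn

// Option is a minimal sketch of an optional value.
type Option[A any] struct {
    value  A
    isSome bool
}

// Some wraps a value in a present Option.
func Some[A any](a A) Option[A] {
    return Option[A]{value: a, isSome: true}
}

// None returns the empty Option.
func None[A any]() Option[A] {
    return Option[A]{}
}

// UnwrapOr returns the contained value, or the supplied default if empty.
func (o Option[A]) UnwrapOr(def A) A {
    if o.isSome {
        return o.value
    }
    return def
}

// MapOption lifts f over the Option: it applies f to the contained value,
// if any. (A free function because Go methods cannot add type parameters.)
func MapOption[A, B any](f func(A) B) func(Option[A]) Option[B] {
    return func(o Option[A]) Option[B] {
        if o.isSome {
            return Some(f(o.value))
        }
        return None[B]()
    }
}
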
@ProofOfKeags force-pushed the bugfix/htlc-flush-shutdown branch 7 times, most recently from db1d08a to 4154113 on November 7, 2023 17:23
@@ -6490,94 +6490,6 @@ func TestPendingCommitTicker(t *testing.T) {
}
}

// TestShutdownIfChannelClean tests that a link will exit the htlcManager loop
// if and only if the underlying channel state is clean.
func TestShutdownIfChannelClean(t *testing.T) {
@ProofOfKeags (Collaborator, Author) commented:

TODO: Figure out an alternative test to this, since just deleting this test is unacceptable.

Note for Reviewers: If you have opinions on how I should go about testing this new asynchronous shutdown procedure please reply to this thread with your ideas.

@ProofOfKeags (Collaborator, Author) commented:

Test cases I want to add:

  • Ensure that the flush operation itself works: it must block update_adds while flushing (see the sketch after this list).
  • Ensure that the switch doesn't forward to the link while it is flushing.
  • Ensure that an inbound shutdown results in an immediate outbound shutdown and then puts the link into a flushing state.
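
A hypothetical shape for the first test above; newTestLink and addHTLC are assumed helpers that do not exist in this PR, and link.Flush follows the callback-taking signature discussed later in this conversation.

package htlcswitch

import (
    "testing"

    "github.com/stretchr/testify/require"
)

// TestFlushBlocksAdds sketches the desired behavior: once a flush begins,
// net-new adds are rejected until the post-flush hook has run.
func TestFlushBlocksAdds(t *testing.T) {
    link := newTestLink(t) // assumed helper

    // Begin the flush; the callback fires once the channel is clean.
    flushed := make(chan struct{})
    require.NoError(t, link.Flush(func() { close(flushed) }))

    // While flushing, a net-new add must be bounced, not forwarded.
    require.Error(t, addHTLC(t, link)) // assumed helper

    // Eventually the channel drains and the post-flush hook runs.
    <-flushed
}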

@ProofOfKeags force-pushed the bugfix/htlc-flush-shutdown branch from 7601fef to 5401d4e on November 7, 2023 18:21
@saubyk added this to the v0.18.0 milestone on Nov 7, 2023
@@ -135,7 +135,21 @@ type ChannelUpdateHandler interface {
// ShutdownIfChannelClean shuts the link down if the channel state is
// clean. This can be used with dynamic commitment negotiation or coop
// close negotiation which require a clean channel state.
ShutdownIfChannelClean() error
ShutdownHtlcManager()
@ProofOfKeags (Collaborator, Author) commented:

Note for reviewers: I am having second thoughts about this, as I think having this kind of "unsafe" method on the interface isn't a good idea. It may be better to replace it with a "true" shutdown method that implicitly performs the flush before calling an internal, unsafe shutdownHtlcManager.

@saubyk linked an issue on Nov 9, 2023 that may be closed by this pull request
@ProofOfKeags marked this pull request as ready for review on November 9, 2023 15:26
@saubyk requested review from ellemouton, bhandras and Crypt-iQ and removed the request for ellemouton on November 9, 2023 15:27
@ProofOfKeags force-pushed the bugfix/htlc-flush-shutdown branch 3 times, most recently from 7a5d9ba to 62dc6b6 on November 10, 2023 00:04
@ProofOfKeags force-pushed the bugfix/htlc-flush-shutdown branch from 62dc6b6 to 75dd860 on November 10, 2023 00:10
This commit removes the requirement that the channel state be clean
prior to shutdown. Now we invoke the new flush API and make the
htlcManager quit and remove the link from the switch when the flush
is complete.
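
For orientation, a sketch of one plausible shape for such a flush hook, inferred from the chanLink.Flush(func() {...}) call site shown later in this conversation. The type, field, and error names below are assumptions for the sketch, not the PR's actual code.

package htlcswitch

import "errors"

// ErrLinkShuttingDown is returned when a flush is requested on a link
// that is already stopping. (Name assumed for this sketch.)
var ErrLinkShuttingDown = errors.New("link shutting down")

// flushableLink is a minimal stand-in for the channel link, showing only
// the pieces the flush hook needs.
type flushableLink struct {
    flushRequests chan func() // callbacks to run once the channel is clean
    quit          chan struct{}
}

// Flush asks the link to stop accepting new update_add messages and to
// invoke onFlushed from the main htlcManager loop once the channel state
// is clean on both commitments.
func (l *flushableLink) Flush(onFlushed func()) error {
    select {
    case l.flushRequests <- onFlushed:
        return nil
    case <-l.quit:
        return ErrLinkShuttingDown
    }
}
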
@ProofOfKeags force-pushed the bugfix/htlc-flush-shutdown branch from 75dd860 to e6a55b4 on November 10, 2023 00:12
@bhandras (Collaborator) left a comment:
Only reviewed the MVar for now as I think it needs some changes.

Review threads on fn/option.go (3) and fn/mvar.go (4), all resolved.
@Crypt-iQ (Collaborator) left a comment:
I don't think mvar is necessary here and the same thing can be accomplished in a simpler way.

One thing that we'll want to add is retransmission of Shutdown, which will likely require a database flag to let us know whether we've ever sent Shutdown before.

p.cfg.Switch.RemoveLink(cid)

return nil
return chanLink.Flush(func() {
A collaborator commented:
Some awareness/signal needs to be added to the chancloser or we'll run into spec violations. Consider the case where we're the responder in the coop close flow:

  1. We receive a Shutdown and call fetchActiveChanCloser.
  2. fetchActiveChanCloser calls tryLinkShutdown which calls chanLink.Flush. Flush is non-blocking so this returns immediately. We can't make this blocking as this would prevent other coop close requests from continuing.
  3. ProcessCloseMsg is called on the received Shutdown message. This will make us send a Shutdown to the peer.
  4. It's possible that the ChannelLink sends an HTLC after this in handleDownstreamUpdateAdd. This would be a spec violation.

The case where we are the coop close initiator leads to the same scenario above, but the initiator will actually send a premature ClosingSigned when the chancloser receives the peer's Shutdown.

Shutdown should only be sent if it's not possible, due to concurrency, for the link to send an HTLC after Shutdown.
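
One way to obtain that guarantee is to make the link's own event loop the only goroutine that sends both update_add and shutdown, so ordering is enforced by serialization. A minimal sketch under assumed names; the real channelLink is far richer.

package htlcswitch

// shutdownReq asks the link's main loop to emit Shutdown and begin flushing.
type shutdownReq struct{ done chan struct{} }

// miniLink is a stripped-down stand-in for the channel link, illustrating
// how emitting Shutdown from the same goroutine that sends adds rules out
// the race above. All names here are assumptions for the sketch.
type miniLink struct {
    downstream   chan string // queued outbound updates, e.g. "update_add_htlc"
    shutdownReqs chan shutdownReq
    flushing     bool
    quit         chan struct{}
    sendMsg      func(msg string) // wire send, injected for the sketch
}

func (l *miniLink) htlcManager() {
    for {
        select {
        case msg := <-l.downstream:
            if l.flushing {
                // Our Shutdown is already on the wire, so no update_add
                // may follow it; the add must instead be failed back to
                // the switch.
                continue
            }
            l.sendMsg(msg)

        case req := <-l.shutdownReqs:
            // Sent from the same goroutine that sends adds: nothing
            // already queued behind it can hit the wire after Shutdown.
            l.sendMsg("shutdown")
            l.flushing = true
            close(req.done)

        case <-l.quit:
            return
        }
    }
}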

@ProofOfKeags (Collaborator, Author) commented:

OK, if I understand this correctly, you're saying:

It's not sufficient to handle this in EligibleToForward because the add may have already been placed in the downstream mailbox but not yet processed by the htlcManager. It must be handled in handleDownstreamPkt itself; otherwise an add may get into the link's process queue, then we flush and send shutdown to our peer, and when we later process that downstream add we still send it out because it was already in the queue.

I don't think that automatically implies we need to add something to the ChanCloser, but I do think it means that something needs to change here.

@Roasbeef (Member) commented Nov 15, 2023:

It's not sufficient to handle this in EligibleToForward because it may have already been added to the downstream mailbox but not processed by the htlcManager

My understanding is that the change in EligibleToForward is necessary but not sufficient. We also need a way to cancel back anything that's un-ack'd that might be sitting in the mailbox. We can do that by doing a similar flush check in handleDownstreamUpdateAdd, which'll then allow us to cancel stuff within the mailbox (mailbox.FailAdd) back to the switch.
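
A sketch of the flush check described above, with minimal stand-in types; the real handleDownstreamUpdateAdd lives on channelLink, and FailAdd is the mailbox's existing cancel-back hook.

package htlcswitch

// htlcPacket is a stand-in for the switch's packet type; fields elided.
type htlcPacket struct{}

// addFailer is the one mailbox capability this sketch needs: FailAdd
// cancels an un-acked add back to the switch.
type addFailer interface {
    FailAdd(pkt *htlcPacket)
}

// flushAwareLink sketches the check described above. Names are illustrative.
type flushAwareLink struct {
    mailbox    addFailer
    isFlushing func() bool
}

func (l *flushAwareLink) handleDownstreamUpdateAdd(pkt *htlcPacket) {
    if l.isFlushing() {
        // The add was queued before the flush began: cancel it back to
        // the switch rather than risk sending update_add after our
        // Shutdown hits the wire.
        l.mailbox.FailAdd(pkt)
        return
    }

    // ...the normal forwarding path would continue here...
}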

// net new HTLCs rather than forwarding them. This is the first
// opportunity we have to bounce invalid HTLC adds without
// doing a force-close.
if l.IsFlushing() {
A collaborator commented:

IMO this doesn't need to be caught here; it can be caught earlier in handleUpstreamMsg when the add is first sent across. It should also only apply if we've received the peer's Shutdown.

A collaborator commented:

Additionally, this should also be caught in handleDownstreamMsg or in handleDownstreamUpdateAdd.

@ProofOfKeags (Collaborator, Author) commented:

I talked to @Roasbeef about this. Earlier versions of this did exactly what you suggest, but the issue is how you'd respond when they issue the add. From a code-cleanliness perspective I would love to do this in handleUpstreamMsg, but it would result in more aggressive behavior than is necessary.

A collaborator commented:

I don't think it's safe to fail HTLCs here. If we've previously forwarded the HTLC, it may exist on the outgoing link's commitment tx. If we then start up and call resolveFwdPkgs, it will sequentially call processRemoteAdds for each non-full forwarding package. If we then call Flush at some point during this, we'll cancel one of these HTLCs on the incoming side. Since the outgoing side may still have the HTLC, we'd lose the value of the HTLC.

@ProofOfKeags (Collaborator, Author) commented:

My understanding from my conversation with @Roasbeef yesterday was that this is the function responsible for actually locking in upstream HTLCs and then deciding to forward them after they have been locked in.

My understanding of my change here is that if we were in this flushing state when these irrevocable commitments were made to net-new HTLCs, then we haven't forwarded them yet.

What I am getting from your comment here is that if we are mid-way through this process of dealing with the remote adds, then we shouldn't be able to put things into a flushing state, which I agree with.

That said, I'd like to understand what you mean by "if we've previously forwarded the HTLC..." because as I understand things that shouldn't be possible.

A collaborator commented:

On link startup, resolveFwdPkgs is called, which will reforward adds, settles, and fails to ensure that these messages reach the intended link (this link will be different from the one doing the reforwarding). This is done in processRemoteSettleFails and in processRemoteAdds. We do this because it ensures that if our node crashed or shut down before these messages reached the other link, we're able to recover on startup. So my point above was that if the add reached the outgoing link and then we end up canceling here due to flushing, we're in a bad state. One way to get around this might be to not cancel if the state is FwdStateProcessed (see several lines below), but I'd have to see if there is any funny business with exit hops.
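
The mitigation suggested above can be captured in one predicate. FwdStateProcessed mirrors the real channeldb constant; the helper itself is a sketch of the decision, not the actual processRemoteAdds logic.

package htlcswitch

// FwdState mirrors channeldb's forwarding-package state enough for this
// sketch; the real constants live in channeldb.
type FwdState uint8

const (
    FwdStateLockedIn  FwdState = iota // adds locked in, not yet forwarded
    FwdStateProcessed                 // adds already handed to the outgoing link
)

// shouldCancelOnFlush captures the proposed rule: an add may only be
// failed back because of a flush if its forwarding package was never
// processed, i.e. no matching HTLC can exist on an outgoing commitment.
func shouldCancelOnFlush(flushing bool, state FwdState) bool {
    return flushing && state != FwdStateProcessed
}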

A member commented:

So my point above was that if the add reached the outgoing link and then we end up canceling here due to flushing, we're in a bad state.

If we've reached this point after an internal retransmission, then my understanding is that we haven't written anything in the forwarding packages for that entry, so it's safe to cancel back. Otherwise, the calls to sendHTLCError above would introduce the same issue (cancel back an HTLC when we have something locked in on the outgoing link).

@Crypt-iQ (Collaborator) commented Nov 15, 2023:

The forwarding package is only written to once the response to the add is signed for on the incoming link, so it's possible that an outgoing HTLC exists already. The calls to sendHTLCError above and below should be reproducible, such that if this is a retransmission and it fails, any earlier invocation should have also failed (otherwise we have a bug).

This isn't the case with the flush logic, because an earlier processRemoteAdds may have forwarded the HTLC; now, if we've come back up and are flushing, we could hit this error while the HTLC is outgoing.

// After we are finished processing the event, if the link is
// flushing, we check if the channel is clean and invoke the
// post-flush hook if it is.
if l.IsFlushing() && l.channel.IsChannelClean() {
@Crypt-iQ (Collaborator) commented Nov 10, 2023:

We can only shut down here if we've also received their Shutdown. IsFlushing may be set when we're the initiator, meaning we haven't received their Shutdown. The counterparty is still allowed to add HTLCs in this case according to the spec.

I think what we'll need is one variable to know when we've received their Shutdown and another variable to know when we've sent our Shutdown.
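
A sketch of the two-variable tracking suggested here; the names are assumptions, and the real link may fold this into its flush state instead.

package htlcswitch

import "sync/atomic"

// shutdownState tracks the two conditions distinguished above: whether we
// have sent our Shutdown and whether we have received the peer's. Only
// when both hold (and the channel is clean) may the link exit.
type shutdownState struct {
    sentShutdown     atomic.Bool
    receivedShutdown atomic.Bool
}

// readyToExit reports whether a clean channel may now shut the link down.
func (s *shutdownState) readyToExit(channelClean bool) bool {
    return s.sentShutdown.Load() &&
        s.receivedShutdown.Load() &&
        channelClean
}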

@ProofOfKeags (Collaborator, Author) commented Nov 10, 2023:

I'll take a closer look at the spec and see what we can do. I was trying to avoid duplicating the ChanCloser state logic inside of the link since that was a barrier in the PR that precedes this one. As far as I know the ChanCloser is responsible for the state machine from the "initial state" where no shutdowns have been sent all the way through closing_signeds. Is there a reason you think this has to be duplicated in the link itself?

Also keep in mind that if we had a "flush" operation that wasn't a shutdown (dyncomm execution), then this wouldn't translate properly. Conceptually all this does is execute the continuation at the first clean state after the flush has been initiated. The shutdown use case puts the shutdown logic into that continuation.

A collaborator commented:

Currently, flushing means that Flush has been called once (either we've sent Shutdown or the peer has), but I think there could be a new interface call that means we've both sent and received Shutdown. The issue is that if we've sent Shutdown, the peer can still send HTLCs or update_fee, and if we've exited, the peer will still be waiting for that HTLC or update_fee to get resolved. I'm not sure whether this translates well to dynamic commitments, but ideally we'd find a way to do that as well.

@ProofOfKeags (Collaborator, Author) commented:

So the Flush can only begin once we've both received and sent shutdown, then.

@Roasbeef deleted the branch lightningnetwork:0-18-staging on November 10, 2023 21:31
@Roasbeef closed this on Nov 10, 2023
@Roasbeef (Member) commented:

Deleted the staging branch, but this should go into master.

I can't re-open this for some reason.

@ProofOfKeags (Collaborator, Author) commented:

I don't think mvar is necessary here and the same thing can be accomplished in a simpler way.

@Crypt-iQ, can you hint at what the easier way might be? Maybe it's an atomic.Pointer[func()]? I worry about using that, as it leaves a lot of the synchronization to the caller, and I'm trying to move us toward a world where we aren't actively coding synchronization everywhere in the codebase.
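
For contrast with an atomic.Pointer[func()], which leaves the wait/retry semantics to every caller, an MVar packages the blocking behavior once. A minimal sketch built on a one-slot channel; fn/mvar.go in this PR may differ substantially.

package fn

// MVar is a box that is either empty or holds exactly one value, with
// blocking Put and Take semantics.
type MVar[A any] struct {
    slot chan A
}

// NewEmptyMVar returns an MVar holding no value.
func NewEmptyMVar[A any]() *MVar[A] {
    return &MVar[A]{slot: make(chan A, 1)}
}

// Put blocks until the MVar is empty, then fills it with a.
func (m *MVar[A]) Put(a A) {
    m.slot <- a
}

// Take blocks until the MVar is full, then empties it and returns the value.
func (m *MVar[A]) Take() A {
    return <-m.slot
}

// TryTake empties the MVar without blocking, reporting whether it was full.
func (m *MVar[A]) TryTake() (A, bool) {
    select {
    case a := <-m.slot:
        return a, true
    default:
        var zero A
        return zero, false
    }
}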

@ProofOfKeags mentioned this pull request on Nov 16, 2023
Successfully merging this pull request may close these issues.

Can't mutual close with pending HTLC