Bugfix/htlc flush shutdown #8145
Conversation
This commit introduces many of the most common functions you will want to use with the Option type. Not all of them are used immediately in this PR.
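For context, here is a minimal sketch of what such Option helpers commonly look like with Go generics. The package name, helper set, and signatures below are illustrative assumptions, not necessarily the ones this commit adds.

```go
package option

// Option represents a value that may or may not be present.
type Option[A any] struct {
	value   A
	present bool
}

// Some wraps a concrete value.
func Some[A any](a A) Option[A] {
	return Option[A]{value: a, present: true}
}

// None is the empty Option.
func None[A any]() Option[A] {
	return Option[A]{}
}

// IsSome reports whether a value is present.
func (o Option[A]) IsSome() bool {
	return o.present
}

// UnwrapOr returns the contained value, or def if the Option is empty.
func (o Option[A]) UnwrapOr(def A) A {
	if o.present {
		return o.value
	}
	return def
}

// MapOption applies f to the contained value, if any.
func MapOption[A, B any](o Option[A], f func(A) B) Option[B] {
	if o.present {
		return Some(f(o.value))
	}
	return None[B]()
}
```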
Force-pushed from db1d08a to 4154113.
@@ -6490,94 +6490,6 @@ func TestPendingCommitTicker(t *testing.T) {
}
}

// TestShutdownIfChannelClean tests that a link will exit the htlcManager loop
// if and only if the underlying channel state is clean.
func TestShutdownIfChannelClean(t *testing.T) {
TODO: Figure out an alternative test to this, since just deleting this test is unacceptable.
Note for Reviewers: If you have opinions on how I should go about testing this new asynchronous shutdown procedure please reply to this thread with your ideas.
Test cases I want to add (see the sketch after this list):
- Ensure that the flush operation itself works: it must block update_adds while flushing.
- Ensure that the switch doesn't forward to the link when it is flushing.
- Ensure that an inbound shutdown results in an immediate outbound shutdown and then puts the link into a flushing state.
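A rough sketch of how those three cases might be laid out as subtests follows. The harness needed to actually drive a link into the flushing state is left as hypothetical TODOs rather than real lnd helpers.

```go
package htlcswitch_test

import "testing"

// TestLinkFlushBehaviour is only a skeleton of the cases listed above; the
// setup that would put a link into the flushing state is hypothetical.
func TestLinkFlushBehaviour(t *testing.T) {
	t.Run("flush blocks update_add_htlc", func(t *testing.T) {
		t.Skip("TODO: put the link into the flushing state and assert " +
			"that new adds are bounced rather than forwarded")
	})

	t.Run("switch does not forward to a flushing link", func(t *testing.T) {
		t.Skip("TODO: mark the link as flushing and assert that the " +
			"switch fails adds back instead of delivering them")
	})

	t.Run("inbound shutdown flushes the link", func(t *testing.T) {
		t.Skip("TODO: deliver a remote Shutdown and assert that we " +
			"immediately reply with Shutdown and enter the flushing state")
	})
}
```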
Force-pushed from 7601fef to 5401d4e.
@@ -135,7 +135,21 @@ type ChannelUpdateHandler interface {
// ShutdownIfChannelClean shuts the link down if the channel state is
// clean. This can be used with dynamic commitment negotiation or coop
// close negotiation which require a clean channel state.
ShutdownIfChannelClean() error
ShutdownHtlcManager()
Note for reviewers: I am having second thoughts about this, as I think that having this kind of "unsafe" method on the interface isn't a good idea. It may be better to replace it with a "true" shutdown method that implicitly calls the flush before calling an internal, unsafe shutdownHtlcManager.
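To make that concrete, here is a toy sketch of what such a "true" shutdown method could look like. Everything below (the link struct, flush, shutdownHtlcManager, ShutdownWhenClean) is an illustrative stand-in under that assumption, not the PR's actual API.

```go
package linksketch

// link is a simplified stand-in for the real channel link.
type link struct {
	quit chan struct{}
}

// flush registers a hook to run once the channel state is clean. In this
// toy version the hook fires immediately; a real link would defer it until
// there are no pending updates on the channel.
func (l *link) flush(onClean func()) error {
	onClean()
	return nil
}

// shutdownHtlcManager is the "unsafe" primitive: it stops the htlcManager
// unconditionally, so it stays unexported and off the public interface.
func (l *link) shutdownHtlcManager() {
	close(l.quit)
}

// ShutdownWhenClean is the safe, exported entry point: it implicitly
// flushes and only then stops the htlcManager.
func (l *link) ShutdownWhenClean() error {
	return l.flush(l.shutdownHtlcManager)
}
```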
Force-pushed from 7a5d9ba to 62dc6b6.
Force-pushed from 62dc6b6 to 75dd860.
This commit removes the requirement that the channel state is clean prior to shutdown. Now we invoke the new flush API and make the htlcManager quit and remove the link from the switch when the flush is complete.
Force-pushed from 75dd860 to e6a55b4.
Only reviewed the MVar for now, as I think it needs some changes.
I don't think the MVar is necessary here, and the same thing can be accomplished in a simpler way. One thing that we'll want to add is retransmission of Shutdown, which will likely require a database flag to let us know whether we've ever sent Shutdown before.
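As a rough illustration of that flag idea, the peer could persist a "shutdown sent" bit and consult it on reconnect, since the spec requires retransmitting Shutdown if it was sent before. The storage interface and function below are hypothetical stand-ins, not lnd's channeldb API.

```go
package linksketch

// shutdownStore is a hypothetical persistence interface for the flag; a
// real implementation would live in the channel database.
type shutdownStore interface {
	MarkShutdownSent(chanID [32]byte) error
	ShutdownSent(chanID [32]byte) (bool, error)
}

// maybeRetransmitShutdown re-sends Shutdown on reconnect if we have ever
// sent it for this channel before.
func maybeRetransmitShutdown(db shutdownStore, chanID [32]byte,
	send func() error) error {

	sent, err := db.ShutdownSent(chanID)
	if err != nil {
		return err
	}
	if !sent {
		return nil
	}

	// We sent Shutdown in a previous connection, so we must send it
	// again now that the peer has reconnected.
	return send()
}
```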
p.cfg.Switch.RemoveLink(cid)

return nil
return chanLink.Flush(func() {
Some awareness/signal needs to be added to the chancloser or we'll run into spec violations. Consider the case where we're the responder in the coop close flow:
- We receive a Shutdown and call fetchActiveChanCloser.
- fetchActiveChanCloser calls tryLinkShutdown, which calls chanLink.Flush. Flush is non-blocking so this returns immediately. We can't make this blocking, as that would prevent other coop close requests from continuing.
- ProcessCloseMsg is called on the received Shutdown message. This will make us send a Shutdown to the peer.
- It's possible that the ChannelLink sends an htlc after this in handleDownstreamUpdateAdd. This would be a spec violation.

The case where we are the coop close initiator leads to the same scenario above, but the initiator will actually send a premature ClosingSigned when the chancloser receives the peer's Shutdown.

Shutdown should only be sent if it's not possible for the link to send an htlc after Shutdown due to concurrency.
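One way to get that guarantee, sketched with simplified stand-in types (none of this is the PR's actual code), is to have the link's own event loop be the thing that sends Shutdown, since that same goroutine is the one draining queued adds.

```go
package linksketch

// Simplified stand-ins for this sketch only.
type packet struct{}

type flushingLink struct {
	downstream      chan *packet
	shutdownRequest chan struct{}
	quit            chan struct{}
	flushing        bool
}

func (l *flushingLink) failBack(*packet) {}
func (l *flushingLink) forward(*packet)  {}
func (l *flushingLink) sendShutdown()    {}

// htlcManager sends Shutdown from the same goroutine that drains queued
// adds, so no add that was already queued behind the request can follow
// Shutdown onto the wire.
func (l *flushingLink) htlcManager() {
	for {
		select {
		case pkt := <-l.downstream:
			if l.flushing {
				// Never forward once we are flushing.
				l.failBack(pkt)
				continue
			}
			l.forward(pkt)

		case <-l.shutdownRequest:
			// Flag first, then send Shutdown from this goroutine.
			l.flushing = true
			l.sendShutdown()

		case <-l.quit:
			return
		}
	}
}
```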
OK, if I understand this correctly you're saying: it's not sufficient to handle this in EligibleToForward, because the add may have already been added to the downstream mailbox but not yet processed by the htlcManager. It must be handled in handleDownstreamPkt itself, otherwise something may get into the link's process queue, then we flush and send shutdown to our peer, and then when we process that downstream add we still send it out because it was already in the queue.

I don't think that automatically implies we need to add something to the ChanCloser, but I do think it means that something needs to change here.
> It's not sufficient to handle this in EligibleToForward because it may have already been added to the downstream mailbox but not processed by the htlcManager

My understanding is that the change in EligibleToForward is necessary but not sufficient. We also need a way to cancel back anything that's un-ack'd that might be sitting in the mailbox. We can do that by doing a similar flush check in handleDownstreamUpdateAdd, which'll then allow us to cancel stuff within the mailbox (mailbox.FailAdd) back to the switch.
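A minimal sketch of that second check follows. Only FailAdd, handleDownstreamUpdateAdd, and the flush concept come from the discussion above; the surrounding types are simplified stand-ins.

```go
package linksketch

import "errors"

// errLinkFlushing is returned when an add reaches a link that has already
// started flushing.
var errLinkFlushing = errors.New("link flushing, cannot add htlc")

type htlcPacket struct{}

// mailbox is a stand-in for the link's downstream mailbox.
type mailbox struct{}

// FailAdd cancels an un-acked add back to the switch so the incoming link
// can fail it upstream.
func (m *mailbox) FailAdd(pkt *htlcPacket) {}

type addLink struct {
	mailbox  *mailbox
	flushing bool
}

// handleDownstreamUpdateAdd bounces adds that were already queued in the
// mailbox before the link started flushing, instead of signing for them.
func (l *addLink) handleDownstreamUpdateAdd(pkt *htlcPacket) error {
	if l.flushing {
		l.mailbox.FailAdd(pkt)
		return errLinkFlushing
	}

	// Normal add handling (update the channel, ack the packet) would
	// continue here.
	return nil
}
```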
// net new HTLCs rather than forwarding them. This is the first
// opportunity we have to bounce invalid HTLC adds without
// doing a force-close.
if l.IsFlushing() {
IMO this doesn't need to be caught here; it can be caught earlier in handleUpstreamMsg when the add is first sent across. It should also only apply if we've received the peer's Shutdown.
Additionally, this should also be caught in handleDownstreamMsg or in handleDownstreamUpdateAdd.
I talked to @Roasbeef about this. Earlier versions of this did exactly what you suggest, but the issue is a question of how you'd respond in the case when they issued the add. From a code cleanliness perspective I would love to do this in handleUpstreamMsg, but the issue is that it would result in more aggressive behavior than is necessary.
I don't think it's safe to fail HTLCs here. If we've previously forwarded the HTLC, it may exist on the outgoing link's commitment tx. If we then start up and call resolveFwdPkgs, it will sequentially call processRemoteAdds for each non-full forwarding package. If we then call Flush at some point during this, we'll cancel one of these HTLCs on the incoming side. Since the outgoing side may still have the HTLC, we'd lose the value of the HTLC.
My understanding from my conversation with @Roasbeef yesterday was that this is the function responsible for actually locking in upstream HTLCs and then deciding to forward them after they have been locked in.

My understanding of my change here is that if we were in this flushing state when these irrevocable commitments were made to net-new HTLCs, then we haven't forwarded them yet.

What I am getting from your comment here is that if we are midway through this process of dealing with the remote adds, then we shouldn't be able to put things into a flushing state, which I agree with.

That said, I'd like to understand what you mean by "if we've previously forwarded the HTLC...", because as I understand things that shouldn't be possible.
On link startup, resolveFwdPkgs is called, which will reforward adds, settles, and fails to ensure that these messages reach the intended link (this link will be different from the one doing the reforwarding). This is done in processRemoteSettleFails and in processRemoteAdds. We do this because it ensures that if our node crashed or shut down before these messages reached the other link, we're able to recover on startup. So my point above was that if the add reached the outgoing link and then we end up canceling here due to flushing, we're in a bad state. One way to get around this might be to not cancel if the state is FwdStateProcessed (see several lines below), but I'd have to see if there is any funny business with exit hops.
> So my point above was that if the add reached the outgoing link and then we end up canceling here due to flushing, we're in a bad state.

If we've reached this point after an internal retransmission, then my understanding is that we haven't written anything to the forwarding packages for that entry, so it's safe to cancel back. Otherwise, the calls to sendHTLCError above would introduce the same issue (cancel back an HTLC when we have something locked in on the outgoing link).
The forwarding package is only written to once the response to the add is signed for on the incoming link, so it's possible that an outgoing HTLC already exists. The calls to sendHTLCError above and below should be reproducible, such that if this is a retransmission and it fails, any earlier invocation should have also failed (otherwise we have a bug). This isn't the case with the flush logic, because an earlier processRemoteAdds may have forwarded the HTLC, and now if we've come back up and are flushing, we could hit this error while the HTLC is outgoing.
// After we are finished processing the event, if the link is
// flushing, we check if the channel is clean and invoke the
// post-flush hook if it is.
if l.IsFlushing() && l.channel.IsChannelClean() {
We can only shut down here if we've also received their Shutdown. IsFlushing may be set when we're the initiator, meaning we haven't received their Shutdown. The counterparty is still allowed to add HTLCs in this case according to the spec.

I think what we'll need is one variable to know when we've received their Shutdown and another variable to know when we've sent our Shutdown.
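A tiny sketch of that two-flag idea, with illustrative names rather than lnd's actual fields:

```go
package linksketch

// shutdownFlags tracks both directions of Shutdown separately.
type shutdownFlags struct {
	sentShutdown     bool // we have sent our Shutdown
	receivedShutdown bool // we have received the peer's Shutdown
}

// readyToStop reports whether the link may exit: both sides have exchanged
// Shutdown and nothing is left in flight on the channel.
func (f shutdownFlags) readyToStop(channelClean bool) bool {
	return f.sentShutdown && f.receivedShutdown && channelClean
}
```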
I'll take a closer look at the spec and see what we can do. I was trying to avoid duplicating the ChanCloser state logic inside of the link, since that was a barrier in the PR that precedes this one. As far as I know, the ChanCloser is responsible for the state machine from the "initial state" where no shutdowns have been sent all the way through the closing_signed exchange. Is there a reason you think this has to be duplicated in the link itself?

Also keep in mind that if we had a "flush" operation that wasn't a shutdown (dyncomm execution), then this wouldn't translate properly. Conceptually, all this does is execute the continuation at the first clean state after the flush has been initiated. The shutdown use case puts the shutdown logic into that continuation.
Currently, flushing means that Flush has been called once (either we've sent Shutdown or the peer has), but I think there could be a new interface call that means we've sent and received Shutdown. The issue is that if we've sent Shutdown, the peer can still send HTLCs or update_fee, and if we've exited, the peer will still be waiting for that HTLC or update_fee to get resolved. I'm not sure whether this translates well to dynamic commitments, but ideally we'd find a way to do that as well.
So the Flush can only begin once we've both received and sent shutdown, then.
Deleted the staging branch, but this should go into master. I can't re-open this for some reason.
@Crypt-iQ can you hint at what the easier way might be? Maybe it's an …
Change Description
This change fixes an issue where we would fail a channel when a peer sent us a shutdown message while there were still active HTLCs: see #6039.

Steps to Test
Steps for reviewers to follow to test the change.
Pull Request Checklist
Testing
Code Style and Documentation
[skip ci] in the commit message for small changes.

📝 Please see our Contribution Guidelines for further guidance.