Attribution data (feature 36/37) #1044
I've started implementing it in eclair. Do you have some test vectors so we can check that we are compatible?
I don't have test vectors yet, but I can produce them and will add them to this PR when ready. Capping the max hops at a lower number is fine by me, but do you have a scenario in mind where this would really make a difference? Or is it more generally that everything above 8 is wasteful?
Force-pushed from 4b48481 to 24b10d5
@thomash-acinq added a happy fat error test vector.
Force-pushed from 24b10d5 to 76dbf21
09-features.md (outdated)
@@ -41,6 +41,7 @@ The Context column decodes as follows:
| 20/21 | `option_anchor_outputs` | Anchor outputs | IN | `option_static_remotekey` | [BOLT #3](03-transactions.md) |
| 22/23 | `option_anchors_zero_fee_htlc_tx` | Anchor commitment type with zero fee HTLC transactions | IN | `option_static_remotekey` | [BOLT #3][bolt03-htlc-tx], [lightning-dev][ml-sighash-single-harmful] |
| 26/27 | `option_shutdown_anysegwit` | Future segwit versions allowed in `shutdown` | IN | | [BOLT #2][bolt02-shutdown] |
| 28/29 | `option_fat_error` | Can generate/relay fat errors in `update_fail_htlc` | IN | | [BOLT #4][bolt04-fat-errors] |
I think this big gap in the bits has emerged here because of tentative spec changes that may or may not make it in. I'm not sure why that is necessary; I thought the custom range is supposed to be used for unofficial extensions?
I can see that with unofficial features deployed in the wild, it is easier to keep the same bit when something becomes official. But is that worth creating the gap here? An alternative is to deploy unofficial features in the custom range first, and then later recognize both the official and the unofficial bit. Slightly more code, but the feature list remains clean.
Added fat error signaling to the PR.
Force-pushed from 76dbf21 to 2de919a
I've spent a lot of time trying to make the test vector pass and I've finally found what was wrong: the spec wording implies that we need to concatenate the parts in one order, but your code follows a different order. I think the order message + hop payloads + hmacs is more intuitive, as it matches the order of the fields in the packet.
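A minimal sketch of that ordering, assuming the `hmac` and `sha2` Rust crates; the function name and argument layout are illustrative, not taken from any of the implementations discussed here:

```rust
use hmac::{Hmac, Mac};
use sha2::Sha256;

/// Illustrative only: compute one hop's attribution HMAC over the data in
/// packet-field order: failure message, then hop payloads, then the HMACs
/// added by downstream hops. The result is truncated to 4 bytes, matching
/// the `sha256[..4]` field definition quoted later in this thread.
fn attribution_hmac(key: &[u8], message: &[u8], hop_payloads: &[u8], hmacs: &[u8]) -> [u8; 4] {
    let mut mac = Hmac::<Sha256>::new_from_slice(key).expect("HMAC accepts any key length");
    mac.update(message);      // 1. failure message
    mac.update(hop_payloads); // 2. hop payloads (e.g. hold times)
    mac.update(hmacs);        // 3. downstream HMACs
    let full = mac.finalize().into_bytes();
    let mut truncated = [0u8; 4];
    truncated.copy_from_slice(&full[..4]);
    truncated
}
```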
Oh great catch! Will produce a new vector.
Force-pushed from 2de919a to bcf022b
@thomash-acinq updated the vector.
Updated LND implementation with sender-picked fat error structure parameters: lightningnetwork/lnd#7139
Force-pushed from bcf022b to 6bf0729
I still believe in a future of the lightning network where failures aren't tolerated at all, and all routing nodes deliver excellent service. Then it's just a matter of measuring latency and going for the sub-10 ms ones.
Hold time filtering to avoid users doing things we think they shouldn't. OP_RETURN vibes 😂
I agree, but that's orthogonal to payment latency? I honestly don't understand why it's important to aim for minimal latency: is it for performance's sake, to boast about numbers? A payment that takes 500 ms to complete is almost indistinguishable from an instantaneous one for our brains, so why does it matter that it takes 50 ms instead of 500 ms if you have to sacrifice privacy for it?
For me it is two-fold:
Furthermore, I think that an instant payment with no observable latency at all is simply cool too.
Are we talking about introducing actual time delays here, or just bucketing the reported value of the hold time? If we're talking about further delaying the actual forwarding of the HTLC, then I'm very much against it. Keep in mind this is done per payment attempt, so if I need to try out 20 different routes this will accumulate quite fast, significantly hurting UX.

It seems the concern is that accurate reporting of hold times does not directly ruin the privacy of the nodes reporting them, but will eventually lead to intense competition around latency, contributing to further centralization of the network? If we're talking about hold times leaking information about the routing nodes (fingerprinting?), then this is already too noisy to produce consistent results. Every node has a different hardware and internet connectivity configuration, so I would expect even nodes running the same implementation (all LNDs or all LDKs) to show some entropy in their reports, and there exist better ways to fingerprint anyway.

I don't think we should pollute the protocol with this kind of safeguard/filter against more precise reports. If we assume that a user is willing to get their hands dirty to somehow signal lower hold times, they'll find a way. Also, let's not underestimate the power of implementation defaults: by having implementations default to scoring all hold times <=100ms the same, or reporting buckets of 100ms by default, most of the concerns seem to be eliminated. Of course, if we assume the majority of the network to be running custom software, then what's even the point of this discussion?
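(A trivial sketch of the bucketing default suggested above, assuming hold times are measured in milliseconds; the function name is illustrative:)

```rust
/// Illustrative only: quantize a measured hold time into 100 ms buckets
/// before reporting it, as a privacy-preserving implementation default.
fn bucketed_hold_time_ms(measured_ms: u32) -> u32 {
    (measured_ms / 100) * 100
}
```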
I'm very much against the intentional slowing down of HTLC resolution. Every millisecond an HTLC is unresolved, it poses a risk to the node runner. An HTLC gets stuck when a node goes offline after the HTLC passed that node, so if you intentionally slow down HTLC resolution, you also increase the number of force closures. I would consider the intentional holding of an HTLC for more than 10 ms a jamming attack.
That's not at all a valid conclusion. HTLCs get stuck and channels force-close because of bugs, not because nodes go temporarily offline or introduce batching/delays. That's completely unrelated. I 100% agree with you that all bugs that lead to stuck HTLCs or force-closed channels must be fixed and should be the highest priority for all implementations. But that has absolutely nothing to do with whether or not we should introduce delays or measure relay latency in buckets of 100ms.
I do agree that it is usually a bug, or sometimes negligence, that is the root cause of most force-closes. I also totally agree that there should be an easy way to measure relay latency; in fact, I do exactly this to generate the list you can find here on the “node speeds” tab. But saying that adding a relay delay would not have an impact on the number of force-closes is, in my opinion, wrong. If a channel with an active HTLC goes offline and does not come online again before the HTLC reaches its timeout, this will lead to a force-close. I send out more than 100,000 HTLCs each day.
Wrote up some thoughts on Delving that I think are worth considering when we think about latency measurements and privacy: https://delvingbitcoin.org/t/latency-and-privacy-in-lightning/1723.
I did a rough implementation in LDK of hold times for the success case. It turned out to be straightforward, mostly reusing code from the failure case. Test vectors are in #1261.
Logic all looks good, but I would like to restructure the Returning Errors section to follow the requirements/rationale structure that we have elsewhere in the spec; as is, the two are interspersed.
Also needs to refer to `option_attributable_failures` rather than "attributable failures" in a few places.
09-features.md (outdated)
@@ -46,6 +46,7 @@ The Context column decodes as follows:
| 26/27 | `option_shutdown_anysegwit` | Future segwit versions allowed in `shutdown` | IN | | [BOLT #2][bolt02-shutdown] |
| 28/29 | `option_dual_fund` | Use v2 of channel open, enables dual funding | IN | | [BOLT #2](02-peer-protocol.md) |
| 34/35 | `option_quiesce` | Support for `stfu` message | IN | | [BOLT #2][bolt02-quiescence] |
| 36/37 | `option_attributable_failure` | Can generate/relay attributable failures in `update_fail_htlc` | IN9 | | [BOLT #4][bolt04-attributable-failures] |
Doesn't need to be in invoices (9)?
It is indeed the question whether there are scenarios where senders would refuse a payment that pays to a node that doesn't support attribution data. If the recipient returned random data and their predecessor added attribution data, blame would still land on the final node pair.
Maybe attribution data is unnecessary for the final hop? 🤔 Of course, they'd still want to pass back something so as not to stand out as a recipient.
For hold times, they will probably always report zero anyway.
We don't have any instructions in this PR for how to react to the presence / absence of this feature bit in an invoice, so I think we can take it out?
Seems easy enough to come back and add signaling in invoices (+ handling instructions) if we want it in the future?
04-onion-routing.md (outdated)
@@ -579,6 +579,9 @@ should add a random delay before forwarding the error. Failures are likely to
be probing attempts and message timing may help the attacker infer its distance
to the final recipient.

Note that nodes on the blinded route return failures through `update_fail_malformed_htlc` and therefore do not and can
nit: note that nodes on -> note that nodes in?
Fixed. My language skills were insufficient to flag 'on the route' as incorrect.
04-onion-routing.md (outdated)
Upon receiving a return packet, each hop generates its `ammag`, generates the
pseudo-random byte stream, and applies the result to the return packet before
return-forwarding it.
When supported, the erring node should also populate the `attribution_data` field in `update_fail_htlc`, consisting of the following data:
I think that we can more formally specify this:

- if `option_attributable_failure` is advertised:
  - if `path_key` is not set in the incoming `update_add_htlc`:
    - MUST include `htlc_hold_times` in payload.
    - MUST include …
  - if …
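A minimal sketch of that gating, assuming (from the truncated requirements above) that hold times are only included when the incoming `update_add_htlc` carries no `path_key`; the function name is illustrative:

```rust
/// Illustrative only: a hop attaches `htlc_hold_times` when it advertises
/// `option_attributable_failure` and the incoming `update_add_htlc` had no
/// `path_key` (blinded hops fail via `update_fail_malformed_htlc` instead,
/// as noted in the diff earlier in this thread).
fn should_include_hold_times(attributable_failure: bool, path_key_present: bool) -> bool {
    attributable_failure && !path_key_present
}
```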
Added
04-onion-routing.md (outdated)
1. data:
    * [`20*u32`:`htlc_hold_times`]
    * [`210*sha256[..4]`:`truncated_hmacs`]
Duplicated here and in BOLT02 - perhaps just link to the BOLT02 description?
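For what it's worth, the sizes follow from the 20-hop maximum: one `u32` hold time per possible hop position, and 20 + 19 + … + 1 = 210 truncated HMACs, since each hop commits once for every possible distance to the erring node. A sketch of that layout (names and types are illustrative, not from any implementation):

```rust
/// Illustrative layout of the fixed-size attribution data payload, assuming
/// the 20-hop maximum used in the field definitions above.
const MAX_HOPS: usize = 20;
const TRUNCATED_HMAC_LEN: usize = 4; // sha256[..4]

struct AttributionData {
    /// One self-reported hold time per possible hop position.
    htlc_hold_times: [u32; MAX_HOPS],
    /// 20 + 19 + ... + 1 = 210 truncated HMACs: one per hop per possible
    /// distance to the erring node.
    truncated_hmacs: [[u8; TRUNCATED_HMAC_LEN]; MAX_HOPS * (MAX_HOPS + 1) / 2],
}
```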
In addition, each node locally stores data regarding its own sending peer in the
route, so it knows where to return-forward any eventual return packets.
The node generating the error message (_erring node_) builds a return packet

## Erring node
General comment on this section: I'm missing the requirements/rationale structure that we have elsewhere in the specification.
For example, the section for the erring node notes "The sender can use this information to score nodes on latency". To me this should be in the origin node section, or all contained in a single rationale at the end. Ditto with the game theory of guessing HMACs.
I see what you mean. I've added a commit trying to address this and improve it a little, but I found it difficult to capture this in requirements while still keeping the explanation.
I didn't see the start time and end time of the `htlc_hold_time` defined. I assume these are upon receiving the `update_add_htlc` and before sending the `update_fulfill_htlc`/`update_fail_htlc`, respectively?
Overall, I like the idea of `htlc_hold_time` but suspect it'll be gamed by routing nodes.
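Under that (assumed) interpretation, the measurement itself would be trivial; a sketch with illustrative names:

```rust
use std::time::Instant;

/// Illustrative only: per-HTLC bookkeeping for the self-reported hold time,
/// assuming it spans from receipt of `update_add_htlc` to just before
/// sending `update_fulfill_htlc`/`update_fail_htlc`.
struct HtlcTimer {
    received_at: Instant,
}

impl HtlcTimer {
    /// Stamp the HTLC when `update_add_htlc` is received.
    fn start() -> Self {
        Self { received_at: Instant::now() }
    }

    /// Read the hold time right before sending the resolution message.
    fn hold_time_ms(&self) -> u32 {
        self.received_at.elapsed().as_millis().min(u32::MAX as u128) as u32
    }
}
```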
A motivated sender can always pinpoint the actual forwarding delays by bisecting the route, and they always know how long the entire route took. You are correct though that it's intended mainly to work on a best-effort basis; the overall assumption is that this can be used to allow a "fast" node to distinguish itself and be rewarded for that.
Attribution data is added to both failed and fulfilled HTLCs (lightning/bolts#1044)
Failure attribution is important to properly penalize nodes after a payment failure occurs. The goal of the penalty is to give the next attempt a better chance at succeeding. In the happy failure flow, the sender is able to determine the origin of the failure and penalizes a single node or pair of nodes.
Unfortunately it is possible for nodes on the route to hide themselves. If they return random data as the failure message, the sender won't know where the failure happened.
This PR proposes a new failure message format that lets each node commit to the failure message. If one of the nodes corrupts the failure message, the sender will be able to identify that node.
For more information, see https://lists.linuxfoundation.org/pipermail/lightning-dev/2022-October/003723.html.
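A hedged sketch of that sender-side check, with the per-hop crypto abstracted behind closures (`unwrap_layer` and `verify_hmac` are stand-ins, not real APIs): the sender peels one obfuscation layer per hop and verifies that hop's HMAC commitment, and the first failed check localizes the corruption.

```rust
/// Illustrative only: walk the route from the first hop outward, peeling one
/// obfuscation layer per hop and checking the HMAC that hop committed to.
/// The first failed check blames that hop (or the pair spanning that link).
fn locate_corrupting_hop(
    num_hops: usize,
    mut unwrap_layer: impl FnMut(usize),        // strips hop i's obfuscation layer
    mut verify_hmac: impl FnMut(usize) -> bool, // checks hop i's truncated HMAC
) -> Option<usize> {
    for hop in 0..num_hops {
        unwrap_layer(hop);
        if !verify_hmac(hop) {
            return Some(hop);
        }
    }
    None // all HMACs valid: the failure is attributable as reported
}
```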
Furthermore, the HTLC fail and fulfill flows are extended to convey self-reported HTLC hold times to the sender.
Fail flow implementations
LND implementation: lightningnetwork/lnd#9888
LDK implementation: lightningdevkit/rust-lightning#2256
Eclair implementation: ACINQ/eclair#3065
CLN implementation: ElementsProject/lightning#8291
Fulfill flow implementations
LDK implementation: lightningdevkit/rust-lightning#3801
Eclair implementation: ACINQ/eclair#3100