Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(request-response): Add support for custom dial options when making request to disconnected peer #5692

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nazar-pc
Copy link
Contributor

Description

We discussed #5634 at one of the community calls and the ways to approach it, one of the challenges we discussed was around the argument type and how to handle the case where PeerId is not specified in DialOpts.

I went with a relatively straightforward route: use PeerId when it is specified (same behavior as before this PR) and use ConnectionId otherwise, wrapping both into an enum to use as a key in internal HashMap.

Implementation ended up being fairly simple with the only breaking change being peer field in request_response::Event::OutboundFailure changing from PeerId to Option<PeerId>.

Resolves #5634

Notes & open questions

One thing I do not like too much and only did for backwards compatibility purposes is additional impl From<&PeerId> for DialOpts such that existing calls continue to work. If changing &peer_id to peer_id is an acceptable breaking change then that impl can be removed.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@nazar-pc nazar-pc force-pushed the request-response-with-addresses branch from d8d075f to 05fe824 Compare November 26, 2024 17:33
Copy link
Contributor

@elenaf9 elenaf9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't in the community call where this was discussed, so apologies if this has already been talked through.
But why isn't adding the address of the peer to the swarm (and thus torequest_response::Behavior.addresses) an option?
You mention in #5634 that doing it that way is annoying, but would you mind expanding why?

protocols/request-response/src/lib.rs Outdated Show resolved Hide resolved
Comment on lines 438 to +448
/// > **Note**: In order for such a dialing attempt to succeed,
/// > the `RequestResonse` protocol must either be embedded
/// > in another `NetworkBehaviour` that provides peer and
/// > address discovery, or known addresses of peers must be
/// > managed via [`Behaviour::add_address`] and
/// > the `peer` must be [`DialOpts`] with multiaddresses or
/// > in case of simple [`PeerId`] `RequestResponse` protocol
/// > must either be embedded in another `NetworkBehaviour`
/// > that provides peer and address discovery, or known addresses of
/// > peers must be managed via [`Behaviour::add_address`] and
/// > [`Behaviour::remove_address`].
pub fn send_request(&mut self, peer: &PeerId, request: TCodec::Request) -> OutboundRequestId {
pub fn send_request<Peer>(&mut self, peer: Peer, request: TCodec::Request) -> OutboundRequestId
where
DialOpts: From<Peer>,
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current description and usage of peer is a bit confusing IMO as it can be a peer id,multiaddress, or DialOpts.

What do you think of adding instead a separate function:
pub fn send_request_with_dial_ops(&mut self, dial_ops: DialOpts, ...).

That way no From<&PeerId> for DialOpts or breaking change on send_request would be needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed various options during community call and my understanding was that it is desirable to not have multiple methods and it is also desirable to keep it backwards compatible in terms of API. I do not have strong opinion though and happy to change if necessary.

Copy link
Contributor Author

@nazar-pc nazar-pc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why isn't adding the address of the peer to the swarm (and thus to request_response::Behavior.addresses) an option?
You mention in #5634 that doing it that way is annoying, but would you mind expanding why?

The logic is external to the swarm and implemented using a custom request/response protocol.
We have convenient abstractions that allow us to implement request-response protocols with a single async callback comparing to writing libp2p's behaviors.
It is not needed for anything except a single request, hence no inherent need to store it anywhere in the swarm.

If I was to add a one-time-use peer to the swarm I'd have to:

  1. implement a custom behavior that will actually store it
  2. Make a call into behavior before request to make sure request-response can make a dial if it needs to
  3. Remove address from the behavior right after this request in order to avoid leaking memory

This essentially means a ton of boilerplate that I do not need to do if I have the ability to simply specify the address to dial explicitly.

Comment on lines 438 to +448
/// > **Note**: In order for such a dialing attempt to succeed,
/// > the `RequestResonse` protocol must either be embedded
/// > in another `NetworkBehaviour` that provides peer and
/// > address discovery, or known addresses of peers must be
/// > managed via [`Behaviour::add_address`] and
/// > the `peer` must be [`DialOpts`] with multiaddresses or
/// > in case of simple [`PeerId`] `RequestResponse` protocol
/// > must either be embedded in another `NetworkBehaviour`
/// > that provides peer and address discovery, or known addresses of
/// > peers must be managed via [`Behaviour::add_address`] and
/// > [`Behaviour::remove_address`].
pub fn send_request(&mut self, peer: &PeerId, request: TCodec::Request) -> OutboundRequestId {
pub fn send_request<Peer>(&mut self, peer: Peer, request: TCodec::Request) -> OutboundRequestId
where
DialOpts: From<Peer>,
{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed various options during community call and my understanding was that it is desirable to not have multiple methods and it is also desirable to keep it backwards compatible in terms of API. I do not have strong opinion though and happy to change if necessary.

@elenaf9
Copy link
Contributor

elenaf9 commented Nov 29, 2024

If I was to add a one-time-use peer to the swarm I'd have to:

  1. implement a custom behavior that will actually store it
  2. Make a call into behavior before request to make sure request-response can make a dial if it needs to
  3. Remove address from the behavior right after this request in order to avoid leaking memory

The request response behavior already persist explicitly added addresses. You just need to call Swarm::add_peer_address.

Edit: Actually, in an out-of-band discussion with @jxs just now we noticed that the request-response behavior doesn't handle the SwarmEvent::NewExternalAddrOfPeer. So that should be added in

fn on_swarm_event(&mut self, event: FromSwarm) {
self.addresses.on_swarm_event(&event);
match event {
FromSwarm::ConnectionEstablished(_) => {}
FromSwarm::ConnectionClosed(connection_closed) => {
self.on_connection_closed(connection_closed)
}
FromSwarm::AddressChange(address_change) => self.on_address_change(address_change),
FromSwarm::DialFailure(dial_failure) => self.on_dial_failure(dial_failure),
_ => {}
}
}
.
So there the reported address should then just be added to self.addresses, and when sending a request the address will then be used.

Edit2: The request-response behavior does handle relevant swarm events and updates its addresses based on it:

fn on_swarm_event(&mut self, event: FromSwarm) {
self.addresses.on_swarm_event(&event);
.

@nazar-pc
Copy link
Contributor Author

nazar-pc commented Nov 29, 2024

The request response behavior already persist explicitly added addresses. You just need to call Swarm::add_peer_address.

There are multiple issues with that:

  • Those addresses are stored in PeerAddresses, which is initialized with defaults, meaning LRU with 100 entries (we sometimes dispatch A LOT more requests than that)
  • There is still a need to add addresses there manually (which will also propagate to behaviors that I don't want it to propagate to)
  • There is no way to ever delete peers from that cache once .remove_address() is gone (already deprecated)

All for what? I just want to make a request and I happen to know the address to dial too, it is too much wasted CPU and brain cycled doing useless things comparing to what this PR offers.

Copy link
Member

@jxs jxs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Nazar!

Thanks for the feedback, I wasn't aware of what @elenaf9 referred with Swarm::add_peer_address and #4302.
But as I referred in the call we had, I'd like to follow #4302 and try to address this at the Swarm layer instead of Behaviour layer.
with that in mind and the topics you laid:

Those addresses are stored in PeerAddresses, which is initialized with defaults, meaning LRU with 100 entries (we sometimes dispatch A LOT more requests than that)

that is easily addressable right? We can expose choosing the size for the LruCache inside PeerAddresses via Config

There is still a need to add addresses there manually (which will also propagate to behaviors that I don't want it to propagate to)

this is interesting, can you explain why propagating to other hypothetical behaviours, (and which) would be a disadvantage to you?

There is no way to ever delete peers from that cache once .remove_address() is gone (already deprecated)

Can you also explain why removing would help you? #4302 doesn't mention it, but I don't see an objection to also remove addresses at the Swarm level with a FromSwarm::RemovedExternalAddrOfPeer event, which I also don't think would affect #4103 negatively

Thanks!

@nazar-pc
Copy link
Contributor Author

that is easily addressable right? We can expose choosing the size for the LruCache inside PeerAddresses via Config

It is not exposed right now and generally I see LRU cache as a hack when you can't specify the exact logic explicitly and instead just slap a heuristic number on it and hope it will work out. To be clear, it is a reasonable way to reduce complexity, but it is not optimal either for specialized use cases like this one.

It might work with large enough limit, but such limit will need to be correlated with other parameters elsewhere, increasing swarm configuration complexity and maintenance cost. It will also use more memory than it should and will use unnecessary CPU cycles in the process, not to say the data flow becomes much more convoluted than it needs to be on top of that.

this is interesting, can you explain why propagating to other hypothetical behaviours, (and which) would be a disadvantage to you?

The better question is why would I want to do that to begin with (except the design of request-response that mandates this right now)?

The disadvantages are many:

  • notifying a bunch of behaviors and potentially trigger a bunch of logic in them unnecessarily (meaning each request is more expensive to make than it could have been, which might actually be non-negligible for embedded use cases)
  • waste RAM and CPU on LRU cache and other storage containers other behaviors might have (will need to analyze behaviors unrelated to request-response to understand this, again probably bad for embedded use cases)
  • writing extra code to add/remove addresses that has strictly negative value due to ^ (bad for maintenance)

I simply want for dial to not fail if the address of the peer is not already known to existing behaviors, just dial the explicitly provided address, as simple as that.

I also can't have a shortcut and not add address and not pay the cost of triggering logic in behaviors by checking if established or pending connection is there without writing even more code and spending even more RAM and CPU cycles (I tried that initially and quickly abandoned because I had to store additional hashmaps with non-trivial logic).

Can you also explain why removing would help you? #4302 doesn't mention it, but I don't see an objection to also remove addresses at the Swarm level with a FromSwarm::RemovedExternalAddrOfPeer event, which I also don't think would affect #4103 negatively

As mentioned many times already, these addresses are 100% useless for anything except the request that I want to make. So if I want to keep memory usage under control, I need to clean up the things that I add after they are no longer needed (meaning after request-response request, which is the only purpose here). This gets even more complex though because I need to know if the address wasn't added by something else prior (or else I'll remove it unnecessarily) and handle potential concurrent requests gracefully as well, all of which is very non-trivial.

If this PR goes through none of that is needed and indeed #4302 doesn't inherently require removing anything because it doesn't inherently require adding anything either, only current design of request-response does.


I would even go as far as saying that this cache in request-response is not a feature, but rather a bug, it should be removed without replacement. libp2p users can store addresses anywhere in any way they want (including a simple sidecar behavior) and decide if/how they want to limit number of addresses (LRU, explicit removal, etc.) instead of this storage being forced only everyone unconditionally.

@elenaf9
Copy link
Contributor

elenaf9 commented Dec 2, 2024

Addresses are added to behaviors quite frequently because we generally build on the assumption that more "knowledge" about peer addresses is favorable. Compared to the workload and memory consumption of establishing a whole connection, I'd argue that reporting/ storing/ deleting a single address is rather cheap.

For instance, the identify behavior reports a ToSwarm::NewExternalAddrOfPeer for every listening address that the remote peer sends, which all behaviors then receive. So if you are using identify, the addresses will end up being added and propagated to all behaviors anyway once the connection is established.

Behaviors can also just ignore the reported address, which most actually do.

For request-response specifically, I think the address cache does make sense and should not be removed. IMO network behaviors should be self contained in the sense that it should possible to use a behavior on its own in the swarm, which wouldn't be possible for request-response when there is no way of storing an address.
I do agree with your point that an LRU cache is not optimal for all use-cases. But then again, for your specific case, wouldn't a small LRU cache size already be sufficient, because an added address is only used once, so it doesn't matter if it's evicted from the cache afterwards?

@nazar-pc
Copy link
Contributor Author

nazar-pc commented Dec 3, 2024

Compared to the workload and memory consumption of establishing a whole connection, I'd argue that reporting/ storing/ deleting a single address is rather cheap.

I see your point and agree it is not necessarily particularly expensive, but things that allocate (like LRU) are certainly not free and I would also prefer not to think about that. After working on performance-sensitive code for a while I try to avoid unnecessary overhead whenever possible, so I might be biased.

For request-response specifically, I think the address cache does make sense and should not be removed. IMO network behaviors should be self contained in the sense that it should possible to use a behavior on its own in the swarm, which wouldn't be possible for request-response when there is no way of storing an address.

I see the value in current behavior, so was looking at it from the point of view of extracting that logic into a separate behavior that can be composed together with request-response rather than being forced onto users. It'll also help with debugging in some cases since LRU means connection might succeed in some cases, but not others, which is sometimes tricky to find.

But then again, for your specific case, wouldn't a small LRU cache size already be sufficient, because an added address is only used once, so it doesn't matter if it's evicted from the cache afterwards?

Not really, it depends on how many requests are triggered before the dial is actually happening. If I queue over 100 requests to previously unknown peers, dial will annoyingly fail for some of them. Moreover the cognitive distance between where the address comes from and where it is used is way too large in that case, which hurts debugging.

I already have an address ready to be used, the DialOpts was already conditionally constructed before this PR, just let me construct it explicitly and not think about LRU and other non-determinism 🙏

@elenaf9
Copy link
Contributor

elenaf9 commented Dec 23, 2024

I see the value in current behavior, so was looking at it from the point of view of extracting that logic into a separate behavior that can be composed together with request-response rather than being forced onto users. It'll also help with debugging in some cases since LRU means connection might succeed in some cases, but not others, which is sometimes tricky to find.

An abstract behavior for storing peer addresses (and other info) is currently WIP in #5724.
With this, we could make the address store in request-response opt-out, so that in cases where the LRU cache isn't suitable one can instead not cache addresses at all, or combine it with the new peer-store. Wdyt @jxs @nazar-pc?

I already have an address ready to be used, the DialOpts was already conditionally constructed before this PR, just let me construct it explicitly and not think about LRU and other non-determinism 🙏

I completely understand that for you specific use-case this is the most straightforward solution.
That being said, I (and I think @jxs agrees here) would like to keep our API simple, and only enhance it when really necessary. Adding dial-opts to specific behavior methods mixes swarm-layer logic into behavior-layer logic, as @jxs already mentioned in #5692 (review).

IMO we should only move forward with this PR if either a) the use-case can't be solved with the existing API (which I don't think is the case here, especially with #5724) or b) we have numbers that show that the theoretical performance impact can in fact be measured 🙂

@nazar-pc
Copy link
Contributor Author

With this, we could make the address store in request-response opt-out, so that in cases where the LRU cache isn't suitable one can instead not cache addresses at all, or combine it with the new peer-store. Wdyt @jxs @nazar-pc?

I am more in favor of removing the logic completely. It is trivial to write a behavior that has either LRU or takes advantage of #5724 or something else. Opt-out means more complexity, not less.

IMO we should only move forward with this PR if either a) the use-case can't be solved with the existing API (which I don't think is the case here, especially with #5724) or b) we have numbers that show that the theoretical performance impact can in fact be measured 🙂

IMO #5724 is completely orthogonal to what I need here. I don't need more behaviors and more complex data flow in the app, I need less behavior and logic that is easy to follow.


I don't have anything meaningful to add to this conversation. Not all changes make sense upstream, it is totally fine if this PR is closed without merging, I can understand and respect that, certainly appreciate non-trivial amount of time dedicated to review and thinking about implications regardless of the outcome.

In such case either libp2p fork or fork of request-response protocol will likely be the path forward for us, dealing with caches and additional peer store behaviors triggers way too many edge-cases as alluded to above and will in fact be harder to maintain than just forking libp2p, sadly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support explicit addresses for dialing purposes send_request
3 participants