-
Notifications
You must be signed in to change notification settings - Fork 5
BGP unnumbered #598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trey/mp-bgp
Are you sure you want to change the base?
BGP unnumbered #598
Conversation
3b513d3 to
ad340c9
Compare
18b39ed to
54b29ba
Compare
fa6bc23 to
c423402
Compare
f96b4d7 to
fbbb83f
Compare
b7b85c5 to
0d429d2
Compare
1f0a732 to
f13f187
Compare
0d429d2 to
af529b5
Compare
af529b5 to
d7fd911
Compare
0527ed9 to
80e651a
Compare
ndp/src/packet.rs
Outdated
| lifetime: 0, //indicates this is not a default router | ||
| // XXX arista routers will only peer with neighbors that have a | ||
| // non-zero lifetime. This is extremely bad behavior as a non-zero | ||
| // lifetime indicates the router is advertising itself as a default | ||
| // router. We need to plumb an arista-workaround option for this. | ||
| lifetime: 1800, | ||
| //lifetime: 0, //indicates this is not a default router |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we wanting to add a config option for this in this PR? if we address it in a later PR, we may have to introduce a new dropshot API version to accommodate it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still a TODO for this PR. We should talk live about what we want to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to plumb a user accessible option act-as-a-default-ipv6-router that takes the lifetime as a parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now done.
bgp/src/messages.rs
Outdated
| // Default to supporting v4 over v6 encoding. | ||
| Capability::ExtendedNextHopEncoding { | ||
| elements: vec![ExtendedNexthopElement { | ||
| afi: Afi::Ipv4.into(), | ||
| safi: u8::from(Safi::Unicast).into(), | ||
| nh_afi: Afi::Ipv6.into(), | ||
| }], | ||
| }, | ||
| ], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can use this as a default in Open messages for all peers.
My understanding of RFC 8950 is that we should only advertise this capability if we intend to advertise IPv4 Unicast NLRI with IPv6 next-hops... which is not something we can assume all the time, especially for preexisting numbered IPv4 peers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not immediately seeing how advertising this capability would be harmful even when not used, but happy to talk through the issues this could cause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't recall if RFC 8950 says that ENHE advertises an additional next-hop AFI for the session or if ENHE advertises the only next-hop AFI for the session, but my experience with other routing stacks is that they assume all routes will use the ENHE next-hop AFI when that cap has been negotiated.
My concern is that unconditionally advertising ENHE would mean that we're allowing the peer to dictate which next-hop AFI we should use, rather than explicitly disallowing it when we don't want it.
e.g.
We configure a numbered peer over v4 and want to continue using v4 nexthops. If we enable ENHE unconditionally, it's now up to the peer to dictate whether we should send v4 or v6 next-hops depending on whether they advertise ENHE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this PR we'll advertise extended nexthop for unnumbered sessions only and revisit broader support later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now done.
| ExtendedNextHopEncoding { | ||
| //XXX trying to avoid a version bump on 86 billion data structures | ||
| // right now. | ||
| #[schemars(skip)] | ||
| elements: Vec<ExtendedNexthopElement>, | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want help with the struct version updates, I'm happy to do so. I think it would be best to try and get this into the API when we release ENHE so it can be visible to consumers of the API (e.g. when they poll the neighbor status to see what capabilities have been exchanged).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is still a TODO on this PR. Now that we have the API versioning machinery in place, I think we need to think a bit about having raw BGP messages in the API definitions. In the original message history APIs I had these data structures as opaque objects to avoid all the API pain as we fill in capabilities.
| pub struct UnnumberedNeighborManager { | ||
| routers: Arc<Mutex<BTreeMap<u32, Arc<Router<BgpConnectionTcp>>>>>, | ||
| ndp_mgr: Arc<NdpManager>, | ||
| pending_sessions: Mutex<HashMap<NbrKey, NbrInfo>>, | ||
| db: Db, | ||
| log: Logger, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you put the type def directly above its impl block
(hopefully this comment was only posted once... github does NOT like coffee shop wifi)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent here was to lay out all the structures up front and then get into impl blocks below to try to make the overall structure easier to read.
| tx_thread: Option<JoinHandle<()>>, | ||
| rx_thread: Option<JoinHandle<()>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can reuse ManagedThread here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possibly, i'll look with fresher eyes in the morning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// receiving a advertisement or solicitation packet, the current neighbor | ||
| /// (if any) is checked for expiration. | ||
| pub fn rx_loop(&self, s: Socket) { | ||
| const INTERVAL: Duration = Duration::from_secs(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why are we defining the two interval const's in two different spots? why not just tuck them both under impl InterfaceNdpManagerInner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to contain this to the smallest possible scope. To that end i've moved RA_INTERVAL to be a function local constant in tx_loop.
| Ok((len, src)) => { | ||
| let buf: &[u8] = unsafe { | ||
| std::slice::from_raw_parts(buf.as_ptr().cast(), len) | ||
| }; | ||
| let Some(src) = src.as_socket_ipv6().map(|x| *x.ip()) | ||
| else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to add handling here for an EOF read? i.e. len == 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a datagram socket, so in that case i think we'll just fail to parse the packet as either an advertisement or a solicitation which I think is what we want.
| s.join_multicast_v6(&ALL_NODES_MCAST, index) | ||
| .map_err(E::JoinAllNodesMulticast)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't also have to join the all routers group? Not necessarily suggesting a change, just curious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. We should be. Fixed.
This causes us to peer with ourself, which i need to investigate a bit more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer happening. Could have been a misconfigured setup. Will keep an eye out for it in tests. The code is back to joining all routers multicast.
|
Thanks for all this work, Ry! I've left individual comments above, but overall I'd say it looks good. |
|
Another couple items:
|
|
Re-running proptest -- it failed w/ the tokio "error 0" signature |
|
This logic needs to be updated Lines 7542 to 7561 in 365caf9
|
This PR adds support for BGP unnumbered.
Depends on
The Main Event
Unnumbered peers are added as a new type of peer with their own set of API functions. When an unnumbered peer is added it goes into the pending state. The new peer request is handed over to an unnumbered peer manager where IPv6/NDP router advertisements (RA) are used to discover the address of the peer over the interface provided. When a router is discovered on the interface the pending peer then turns into a real peer. We use the IPv6 link-local address discovered in the RA and create a BGP session using that address. From here the extant BGP machinery works in the same way it currently does.
Neighbor discovery is a one-shot process when the unnumbered peer is created. Once we determine the peer address through an RA, that is what we use. If the peer address changes, the session needs to be deleted and recreated so discovery will happen again. No attempt is made to track neighbor state beyond initial session establishment. This may be revisited in a future release. One shot semantics has the benefit of being simple and stable. I don't believe we want to be overly sensitive to NDP RA dynamics as that would undercut the intent of things like BGP hold time and TCP connection tracking. Having a persistent monitor of NDP dynamics will need to be designed carefully.
Unnumbered support is designed to work in a point to point setting. We do not currently handle situations where there are multiple reachable routers reachable over a broadcast domain that the unnumbered interface is connected to. We simply attempt to establish a session with the first peer we find. This is how things have generally worked for unnumbered in my experience. We could develop something more robust, but this becomes a rather complex state machine where we have to try to establish sessions with every peer we discover. This could lead to doomed BGP sessions iterating through a state machine indefinitely.
Testing
This PR also adds a new testing substrate in the
falcon-labfolder. BGP unnumbered requires multicast ICMPv6/NDP to work. This means we need a testing environment that routers connected over a broadcast domain, and some other unnumbered implementation to peer with to ensure we are not deluding ourselves with self-peering. Using illumos zones would have been nice here, but none of the BGP implementations I'm aware of on illumos besides mgd fully support unnumbered. FRR on the other hand has great support for unnumbered. So peering with a linux-based router puts us in falcon territory.Another really important thing to test for unnumbered is mg-lower interaction. We need test coverage on nexthop interface mapping for data plane route installation. The interop test environment is currently upper-half only. Due to license nonsense, the interop topology machinery is also stuck in purgatory. So in this PR I decided to add a new falcon-based testing substrate that only uses unencumbered peers and the oxide nodes are softnpu vms so we have a full lower half with dendrite, P4 and the whole 9 yards. This also means the testing machinery can actually live in this repository which is a nice win. The best way to get acquainted with this new testing substrate is starting by reading through the first test.