Skip to content

Conversation

@rcgoodfellow
Copy link
Collaborator

@rcgoodfellow rcgoodfellow commented Jan 2, 2026

This PR adds support for BGP unnumbered.

Depends on

The Main Event

Unnumbered peers are added as a new type of peer with their own set of API functions. When an unnumbered peer is added it goes into the pending state. The new peer request is handed over to an unnumbered peer manager where IPv6/NDP router advertisements (RA) are used to discover the address of the peer over the interface provided. When a router is discovered on the interface the pending peer then turns into a real peer. We use the IPv6 link-local address discovered in the RA and create a BGP session using that address. From here the extant BGP machinery works in the same way it currently does.

Neighbor discovery is a one-shot process when the unnumbered peer is created. Once we determine the peer address through an RA, that is what we use. If the peer address changes, the session needs to be deleted and recreated so discovery will happen again. No attempt is made to track neighbor state beyond initial session establishment. This may be revisited in a future release. One shot semantics has the benefit of being simple and stable. I don't believe we want to be overly sensitive to NDP RA dynamics as that would undercut the intent of things like BGP hold time and TCP connection tracking. Having a persistent monitor of NDP dynamics will need to be designed carefully.

Unnumbered support is designed to work in a point to point setting. We do not currently handle situations where there are multiple reachable routers reachable over a broadcast domain that the unnumbered interface is connected to. We simply attempt to establish a session with the first peer we find. This is how things have generally worked for unnumbered in my experience. We could develop something more robust, but this becomes a rather complex state machine where we have to try to establish sessions with every peer we discover. This could lead to doomed BGP sessions iterating through a state machine indefinitely.

Testing

This PR also adds a new testing substrate in the falcon-lab folder. BGP unnumbered requires multicast ICMPv6/NDP to work. This means we need a testing environment that routers connected over a broadcast domain, and some other unnumbered implementation to peer with to ensure we are not deluding ourselves with self-peering. Using illumos zones would have been nice here, but none of the BGP implementations I'm aware of on illumos besides mgd fully support unnumbered. FRR on the other hand has great support for unnumbered. So peering with a linux-based router puts us in falcon territory.

Another really important thing to test for unnumbered is mg-lower interaction. We need test coverage on nexthop interface mapping for data plane route installation. The interop test environment is currently upper-half only. Due to license nonsense, the interop topology machinery is also stuck in purgatory. So in this PR I decided to add a new falcon-based testing substrate that only uses unencumbered peers and the oxide nodes are softnpu vms so we have a full lower half with dendrite, P4 and the whole 9 yards. This also means the testing machinery can actually live in this repository which is a nice win. The best way to get acquainted with this new testing substrate is starting by reading through the first test.

@rcgoodfellow rcgoodfellow added this to the 18 milestone Jan 2, 2026
@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from 3b513d3 to ad340c9 Compare January 2, 2026 16:31
@rcgoodfellow rcgoodfellow changed the title add ndp router manager BGP unnumbered Jan 3, 2026
@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from 18b39ed to 54b29ba Compare January 3, 2026 08:35
@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from fa6bc23 to c423402 Compare January 4, 2026 05:07
@rcgoodfellow rcgoodfellow marked this pull request as ready for review January 5, 2026 22:09
@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch 5 times, most recently from f96b4d7 to fbbb83f Compare January 8, 2026 09:14
@rcgoodfellow rcgoodfellow changed the base branch from main to trey/mp-bgp January 8, 2026 09:14
@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from b7b85c5 to 0d429d2 Compare January 9, 2026 01:47
@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from 0d429d2 to af529b5 Compare January 10, 2026 03:39
@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from af529b5 to d7fd911 Compare January 10, 2026 03:40
@rcgoodfellow rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from 0527ed9 to 80e651a Compare January 12, 2026 08:07
Comment on lines 77 to 82
lifetime: 0, //indicates this is not a default router
// XXX arista routers will only peer with neighbors that have a
// non-zero lifetime. This is extremely bad behavior as a non-zero
// lifetime indicates the router is advertising itself as a default
// router. We need to plumb an arista-workaround option for this.
lifetime: 1800,
//lifetime: 0, //indicates this is not a default router
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we wanting to add a config option for this in this PR? if we address it in a later PR, we may have to introduce a new dropshot API version to accommodate it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a TODO for this PR. We should talk live about what we want to do.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to plumb a user accessible option act-as-a-default-ipv6-router that takes the lifetime as a parameter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now done.

Comment on lines 571 to 579
// Default to supporting v4 over v6 encoding.
Capability::ExtendedNextHopEncoding {
elements: vec![ExtendedNexthopElement {
afi: Afi::Ipv4.into(),
safi: u8::from(Safi::Unicast).into(),
nh_afi: Afi::Ipv6.into(),
}],
},
],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can use this as a default in Open messages for all peers.
My understanding of RFC 8950 is that we should only advertise this capability if we intend to advertise IPv4 Unicast NLRI with IPv6 next-hops... which is not something we can assume all the time, especially for preexisting numbered IPv4 peers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not immediately seeing how advertising this capability would be harmful even when not used, but happy to talk through the issues this could cause.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall if RFC 8950 says that ENHE advertises an additional next-hop AFI for the session or if ENHE advertises the only next-hop AFI for the session, but my experience with other routing stacks is that they assume all routes will use the ENHE next-hop AFI when that cap has been negotiated.

My concern is that unconditionally advertising ENHE would mean that we're allowing the peer to dictate which next-hop AFI we should use, rather than explicitly disallowing it when we don't want it.
e.g.
We configure a numbered peer over v4 and want to continue using v4 nexthops. If we enable ENHE unconditionally, it's now up to the peer to dictate whether we should send v4 or v6 next-hops depending on whether they advertise ENHE.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this PR we'll advertise extended nexthop for unnumbered sessions only and revisit broader support later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now done.

Comment on lines +3977 to +3982
ExtendedNextHopEncoding {
//XXX trying to avoid a version bump on 86 billion data structures
// right now.
#[schemars(skip)]
elements: Vec<ExtendedNexthopElement>,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want help with the struct version updates, I'm happy to do so. I think it would be best to try and get this into the API when we release ENHE so it can be visible to consumers of the API (e.g. when they poll the neighbor status to see what capabilities have been exchanged).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is still a TODO on this PR. Now that we have the API versioning machinery in place, I think we need to think a bit about having raw BGP messages in the API definitions. In the original message history APIs I had these data structures as opaque objects to avoid all the API pain as we fill in capabilities.

Comment on lines +22 to +28
pub struct UnnumberedNeighborManager {
routers: Arc<Mutex<BTreeMap<u32, Arc<Router<BgpConnectionTcp>>>>>,
ndp_mgr: Arc<NdpManager>,
pending_sessions: Mutex<HashMap<NbrKey, NbrInfo>>,
db: Db,
log: Logger,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you put the type def directly above its impl block

(hopefully this comment was only posted once... github does NOT like coffee shop wifi)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent here was to lay out all the structures up front and then get into impl blocks below to try to make the overall structure easier to read.

Comment on lines +90 to +91
tx_thread: Option<JoinHandle<()>>,
rx_thread: Option<JoinHandle<()>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can reuse ManagedThread here too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly, i'll look with fresher eyes in the morning

Copy link
Collaborator Author

@rcgoodfellow rcgoodfellow Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// receiving a advertisement or solicitation packet, the current neighbor
/// (if any) is checked for expiration.
pub fn rx_loop(&self, s: Socket) {
const INTERVAL: Duration = Duration::from_secs(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why are we defining the two interval const's in two different spots? why not just tuck them both under impl InterfaceNdpManagerInner?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to contain this to the smallest possible scope. To that end i've moved RA_INTERVAL to be a function local constant in tx_loop.

Comment on lines +173 to +178
Ok((len, src)) => {
let buf: &[u8] = unsafe {
std::slice::from_raw_parts(buf.as_ptr().cast(), len)
};
let Some(src) = src.as_socket_ipv6().map(|x| *x.ip())
else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add handling here for an EOF read? i.e. len == 0

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a datagram socket, so in that case i think we'll just fail to parse the packet as either an advertisement or a solicitation which I think is what we want.

Comment on lines +184 to +185
s.join_multicast_v6(&ALL_NODES_MCAST, index)
.map_err(E::JoinAllNodesMulticast)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't also have to join the all routers group? Not necessarily suggesting a change, just curious

Copy link
Collaborator Author

@rcgoodfellow rcgoodfellow Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. We should be. Fixed.

This causes us to peer with ourself, which i need to investigate a bit more.

Copy link
Collaborator Author

@rcgoodfellow rcgoodfellow Jan 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer happening. Could have been a misconfigured setup. Will keep an eye out for it in tests. The code is back to joining all routers multicast.

@taspelund
Copy link
Contributor

Thanks for all this work, Ry!

I've left individual comments above, but overall I'd say it looks good.
The only places I have some actual concerns are in enabling ENHE capability by default (I want to make sure we don't break anything for numbered peers), and in the HashMap that rdb is using to store its unnumbered next-hops (I think we should map interface -> address rather than address -> interface, that way we can handle something like fe80::10%qsfp0 and fe80::10%qsfp1 at the same time).

@taspelund
Copy link
Contributor

Another couple items:

  1. There aren't any cargo tests for any of the rust code. I'd really like to see some unit tests for the unnumbered manager, and possibly some integration tests (although, it understandably seems difficult to integrate unnumbered into bgp/src/test.rs)
  2. I'd like to see some docs update for this. I've made an effort to get BGP documented (docs/bgp-architecture.md), so it would be nice to have the BGP-related changes added to that doc + to have the unnumbered manager documented too (perhaps in its own markdown file), even if it's just LLM-generated docs that have been quickly reviewed. This isn't a blocker, but a nice to have.

@taspelund
Copy link
Contributor

Re-running proptest -- it failed w/ the tokio "error 0" signature

@rcgoodfellow
Copy link
Collaborator Author

This logic needs to be updated

maghemite/bgp/src/session.rs

Lines 7542 to 7561 in 365caf9

RouteUpdate::V4(RouteUpdate4::Announce(nlri)) => {
let nh4 = match self.derive_nexthop(Afi::Ipv4, pc)? {
BgpNexthop::Ipv4(addr) => addr,
_ => {
return Err(Error::InvalidAddress(
"IPv4 routes require IPv4 next-hop".into(),
));
}
};
let mut path_attributes = self.router.base_attributes();
path_attributes.push(PathAttributeValue::NextHop(nh4).into());
UpdateMessage {
withdrawn: vec![],
path_attributes,
nlri,
..Default::default()
}
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants