BGP unnumbered #598

rcgoodfellow · 2026-01-02T09:08:56Z

This PR adds support for BGP unnumbered.

Depends on

The Main Event

Unnumbered peers are added as a new type of peer with their own set of API functions. When an unnumbered peer is added, it goes into the pending state. The new peer request is handed to an unnumbered peer manager, which uses IPv6 NDP router advertisements (RAs) to discover the address of the peer on the provided interface. When a router is discovered on the interface, the pending peer turns into a real peer: we take the IPv6 link-local address discovered in the RA and create a BGP session using that address. From there the existing BGP machinery works the same way it currently does.

Neighbor discovery is a one-shot process that runs when the unnumbered peer is created. Once we determine the peer address through an RA, that is the address we use. If the peer address changes, the session needs to be deleted and recreated so that discovery happens again. No attempt is made to track neighbor state beyond initial session establishment. This may be revisited in a future release. One-shot semantics have the benefit of being simple and stable. I don't believe we want to be overly sensitive to NDP RA dynamics, as that would undercut the intent of things like BGP hold time and TCP connection tracking. A persistent monitor of NDP dynamics would need to be designed carefully.

Unnumbered support is designed to work in a point-to-point setting. We do not currently handle situations where multiple routers are reachable over a broadcast domain that the unnumbered interface is connected to. We simply attempt to establish a session with the first peer we find. This is how things have generally worked for unnumbered in my experience. We could develop something more robust, but that becomes a rather complex state machine in which we have to try to establish sessions with every peer we discover. This could lead to doomed BGP sessions iterating through a state machine indefinitely.
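For readers who want the lifecycle above in code form, here is a minimal sketch of the pending-to-resolved transition with one-shot, first-router-wins semantics. The `UnnumberedPeer` type, its variants, and `on_router_advertisement` are hypothetical names made up for illustration; they are not the types used in `mgd/src/unnumbered_manager.rs`.

```rust
use std::net::Ipv6Addr;

/// Hypothetical state of an unnumbered peer on one interface.
#[derive(Debug, Clone, PartialEq, Eq)]
enum UnnumberedPeer {
    /// Added through the API; no router discovered yet.
    Pending { interface: String },
    /// A router was discovered; the BGP session targets this address.
    Resolved { interface: String, peer: Ipv6Addr },
}

/// True for addresses in fe80::/10.
fn is_link_local(addr: &Ipv6Addr) -> bool {
    addr.segments()[0] & 0xffc0 == 0xfe80
}

impl UnnumberedPeer {
    fn new(interface: &str) -> Self {
        Self::Pending { interface: interface.to_string() }
    }

    /// Feed the source address of an RA seen on this peer's interface.
    /// Returns true only when this RA resolved a pending peer.
    fn on_router_advertisement(&mut self, src: Ipv6Addr) -> bool {
        if !is_link_local(&src) {
            return false;
        }
        match self {
            Self::Pending { interface } => {
                // First router seen wins.
                let interface = std::mem::take(interface);
                *self = Self::Resolved { interface, peer: src };
                true
            }
            // One-shot: later RAs never change an already resolved peer.
            Self::Resolved { .. } => false,
        }
    }
}
```

In this sketch, picking up a changed peer address means dropping the value and building a new `Pending` one, which mirrors the delete-and-recreate step described above.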

Testing

This PR also adds a new testing substrate in the falcon-lab folder. BGP unnumbered requires multicast ICMPv6/NDP to work. This means we need a testing environment that has routers connected over a broadcast domain, and some other unnumbered implementation to peer with to ensure we are not deluding ourselves with self-peering. Using illumos zones would have been nice here, but none of the BGP implementations I'm aware of on illumos besides mgd fully support unnumbered. FRR, on the other hand, has great support for unnumbered. So peering with a Linux-based router puts us in falcon territory.

Another really important thing to test for unnumbered is mg-lower interaction. We need test coverage on nexthop interface mapping for data plane route installation. The interop test environment is currently upper-half only. Due to license nonsense, the interop topology machinery is also stuck in purgatory. So in this PR I decided to add a new falcon-based testing substrate that only uses unencumbered peers. The oxide nodes are softnpu VMs, so we have a full lower half with dendrite, P4 and the whole 9 yards. This also means the testing machinery can actually live in this repository, which is a nice win. The best way to get acquainted with this new testing substrate is to start by reading through the first test.

falcon-lab/src/util.rs

falcon-lab/src/test.rs

taspelund · 2026-01-12T15:10:46Z

ndp/src/packet.rs

-            lifetime: 0, //indicates this is not a default router
+            // XXX arista routers will only peer with neighbors that have a
+            // non-zero lifetime. This is extremely bad behavior as a non-zero
+            // lifetime indicates the router is advertising itself as a default
+            // router. We need to plumb an arista-workaround option for this.
+            lifetime: 1800,
+            //lifetime: 0, //indicates this is not a default router


Are we wanting to add a config option for this in this PR? If we address it in a later PR, we may have to introduce a new dropshot API version to accommodate it.

This is still a TODO for this PR. We should talk live about what we want to do.

Going to plumb a user-accessible option, act-as-a-default-ipv6-router, that takes the lifetime as a parameter.

This is now done.
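A rough sketch of how such an option could map onto the RA Router Lifetime field. The `RaConfig` struct, its field name, and `ra_header` are assumptions made up for illustration, not the code that landed. The header layout (type 134, Router Lifetime as a big-endian `u16` at bytes 6..8, zero meaning "not a default router") follows RFC 4861.

```rust
/// Hypothetical config mirroring the `act-as-a-default-ipv6-router` option.
/// `None` means the option is off; `Some(secs)` is the lifetime to advertise.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
struct RaConfig {
    act_as_default_router: Option<u16>,
}

impl RaConfig {
    /// Router Lifetime in seconds. Zero tells receivers that the sender is
    /// not offering itself as a default router.
    fn router_lifetime(&self) -> u16 {
        self.act_as_default_router.unwrap_or(0)
    }
}

/// Fixed 16-byte RA header without options. The checksum is left as zero
/// in this sketch.
fn ra_header(router_lifetime: u16) -> [u8; 16] {
    let mut hdr = [0u8; 16];
    hdr[0] = 134; // ICMPv6 type: Router Advertisement
    hdr[6..8].copy_from_slice(&router_lifetime.to_be_bytes());
    hdr
}
```

With the option unset the header carries a lifetime of 0, which matches the original behavior in the diff; setting it to 1800 reproduces the value that made peering with Arista work.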

.github/buildomat/jobs/build.sh

.github/buildomat/jobs/falcon-lab.sh

taspelund · 2026-01-12T15:19:53Z

bgp/src/messages.rs

+                    // Default to supporting v4 over v6 encoding.
+                    Capability::ExtendedNextHopEncoding {
+                        elements: vec![ExtendedNexthopElement {
+                            afi: Afi::Ipv4.into(),
+                            safi: u8::from(Safi::Unicast).into(),
+                            nh_afi: Afi::Ipv6.into(),
+                        }],
+                    },
+                ],


I don't think we can use this as a default in Open messages for all peers.
My understanding of RFC 8950 is that we should only advertise this capability if we intend to advertise IPv4 Unicast NLRI with IPv6 next-hops... which is not something we can assume all the time, especially for preexisting numbered IPv4 peers.

I'm not immediately seeing how advertising this capability would be harmful even when not used, but happy to talk through the issues this could cause.

I don't recall if RFC 8950 says that ENHE advertises an additional next-hop AFI for the session or if ENHE advertises the only next-hop AFI for the session, but my experience with other routing stacks is that they assume all routes will use the ENHE next-hop AFI when that cap has been negotiated.

My concern is that unconditionally advertising ENHE would mean that we're allowing the peer to dictate which next-hop AFI we should use, rather than explicitly disallowing it when we don't want it.
e.g.
We configure a numbered peer over v4 and want to continue using v4 nexthops. If we enable ENHE unconditionally, it's now up to the peer to dictate whether we should send v4 or v6 next-hops depending on whether they advertise ENHE.

For this PR we'll advertise extended nexthop for unnumbered sessions only and revisit broader support later.

This is now done.
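As a sketch of the outcome agreed above, the capability list can be gated on the session kind. Everything here is a simplified stand-in: `SessionKind`, `open_capabilities`, and the raw-integer `ExtendedNexthopElement` are hypothetical, and `RouteRefresh` is only a placeholder for whatever base capabilities are always sent. The six-octet tuple encoding and the AFI/SAFI values follow RFC 8950 and the IANA registries.

```rust
/// Simplified stand-in for the element type in the diff above, using raw
/// integers instead of AFI/SAFI enums.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ExtendedNexthopElement {
    afi: u16,
    safi: u16,
    nh_afi: u16,
}

impl ExtendedNexthopElement {
    /// RFC 8950 tuple: NLRI AFI, NLRI SAFI, next-hop AFI, two octets each.
    fn encode(&self) -> [u8; 6] {
        let mut out = [0u8; 6];
        out[0..2].copy_from_slice(&self.afi.to_be_bytes());
        out[2..4].copy_from_slice(&self.safi.to_be_bytes());
        out[4..6].copy_from_slice(&self.nh_afi.to_be_bytes());
        out
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
enum Capability {
    /// Placeholder for capabilities sent to every peer.
    RouteRefresh,
    ExtendedNextHopEncoding { elements: Vec<ExtendedNexthopElement> },
}

/// Hypothetical discriminator between numbered and unnumbered sessions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SessionKind {
    Numbered,
    Unnumbered,
}

const AFI_IPV4: u16 = 1;
const AFI_IPV6: u16 = 2;
const SAFI_UNICAST: u16 = 1;

/// Capabilities to place in the Open message for a session.
fn open_capabilities(kind: SessionKind) -> Vec<Capability> {
    let mut caps = vec![Capability::RouteRefresh];
    // Only unnumbered sessions offer IPv4 unicast NLRI with IPv6 next hops.
    if kind == SessionKind::Unnumbered {
        caps.push(Capability::ExtendedNextHopEncoding {
            elements: vec![ExtendedNexthopElement {
                afi: AFI_IPV4,
                safi: SAFI_UNICAST,
                nh_afi: AFI_IPV6,
            }],
        });
    }
    caps
}
```

Keeping the decision in one function means numbered IPv4 peers never see ENHE in the Open message, so the peer cannot steer them toward v6 next hops.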

bgp/src/messages.rs

taspelund · 2026-01-12T15:23:32Z

bgp/src/messages.rs

+    ExtendedNextHopEncoding {
+        //XXX trying to avoid a version bump on 86 billion data structures
+        // right now.
+        #[schemars(skip)]
+        elements: Vec<ExtendedNexthopElement>,
+    },


If you want help with the struct version updates, I'm happy to do so. I think it would be best to try and get this into the API when we release ENHE so it can be visible to consumers of the API (e.g. when they poll the neighbor status to see what capabilities have been exchanged).

Yeah, this is still a TODO on this PR. Now that we have the API versioning machinery in place, I think we need to think a bit about having raw BGP messages in the API definitions. In the original message history APIs I had these data structures as opaque objects to avoid all the API pain as we fill in capabilities.

bgp/src/messages.rs

mgd/src/unnumbered_manager.rs

taspelund · 2026-01-12T18:56:14Z

mgd/src/unnumbered_manager.rs

+pub struct UnnumberedNeighborManager {
+    routers: Arc<Mutex<BTreeMap<u32, Arc<Router<BgpConnectionTcp>>>>>,
+    ndp_mgr: Arc<NdpManager>,
+    pending_sessions: Mutex<HashMap<NbrKey, NbrInfo>>,
+    db: Db,
+    log: Logger,
+}


nit: can you put the type def directly above its impl block

(hopefully this comment was only posted once... github does NOT like coffee shop wifi)

The intent here was to lay out all the structures up front and then get into impl blocks below to try to make the overall structure easier to read.

ndp/src/manager.rs

taspelund · 2026-01-12T19:09:12Z

ndp/src/manager.rs

+    tx_thread: Option<JoinHandle<()>>,
+    rx_thread: Option<JoinHandle<()>>,


Maybe we can reuse ManagedThread here too?

possibly, i'll look with fresher eyes in the morning

Added TODO for

Want comprehensive framework for observing/report event/thread statistics #609

taspelund · 2026-01-12T19:14:03Z

ndp/src/manager.rs

+    /// receiving a advertisement or solicitation packet, the current neighbor
+    /// (if any) is checked for expiration.
+    pub fn rx_loop(&self, s: Socket) {
+        const INTERVAL: Duration = Duration::from_secs(1);


nit: why are we defining the two interval consts in two different spots? why not just tuck them both under impl InterfaceNdpManagerInner?

I was trying to contain this to the smallest possible scope. To that end i've moved RA_INTERVAL to be a function local constant in tx_loop.

taspelund · 2026-01-12T19:16:14Z

ndp/src/manager.rs

+                Ok((len, src)) => {
+                    let buf: &[u8] = unsafe {
+                        std::slice::from_raw_parts(buf.as_ptr().cast(), len)
+                    };
+                    let Some(src) = src.as_socket_ipv6().map(|x| *x.ip())
+                    else {


Do we need to add handling here for an EOF read? i.e. len == 0

This is a datagram socket, so in that case i think we'll just fail to parse the packet as either an advertisement or a solicitation which I think is what we want.
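To illustrate why a zero-length read needs no special case, here is a minimal classifier sketch. The `NdpKind` enum and `classify` function are hypothetical and much simpler than the parsing in the `ndp` crate; the type codes (133 and 134) and minimum message sizes come from RFC 4861.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NdpKind {
    RouterSolicitation,
    RouterAdvertisement,
}

const ICMP6_ROUTER_SOLICITATION: u8 = 133;
const ICMP6_ROUTER_ADVERTISEMENT: u8 = 134;
/// Type, code, checksum, reserved.
const RS_MIN_LEN: usize = 8;
/// Type, code, checksum, hop limit, flags, lifetime, two 4-byte timers.
const RA_MIN_LEN: usize = 16;

/// Classify a received datagram. An empty or truncated buffer is simply
/// not a valid message, so it falls out as `None` like any other junk.
fn classify(buf: &[u8]) -> Option<NdpKind> {
    let kind = *buf.first()?;
    match kind {
        ICMP6_ROUTER_SOLICITATION if buf.len() >= RS_MIN_LEN => {
            Some(NdpKind::RouterSolicitation)
        }
        ICMP6_ROUTER_ADVERTISEMENT if buf.len() >= RA_MIN_LEN => {
            Some(NdpKind::RouterAdvertisement)
        }
        _ => None,
    }
}
```

On a datagram socket a zero-length read is just an empty packet rather than end-of-stream, so treating it as a parse failure keeps the receive loop running.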

ndp/src/manager.rs

ndp/src/util.rs

taspelund · 2026-01-12T19:53:54Z

ndp/src/util.rs

+    s.join_multicast_v6(&ALL_NODES_MCAST, index)
+        .map_err(E::JoinAllNodesMulticast)?;


We don't also have to join the all routers group? Not necessarily suggesting a change, just curious

~~Good catch. We should be. Fixed.~~

This causes us to peer with ourselves, which I need to investigate a bit more.

This is no longer happening. Could have been a misconfigured setup. Will keep an eye out for it in tests. The code is back to joining all routers multicast.
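For reference, a small sketch of joining both groups. It uses `std::net::UdpSocket` as a stand-in for the ICMPv6 socket used in `ndp/src/util.rs`, and `join_ndp_groups` and `is_own_advertisement` are hypothetical helpers. The source-address guard is one defensive option against self-peering, not something this thread says the PR does.

```rust
use std::io;
use std::net::{Ipv6Addr, UdpSocket};

/// ff02::1, all nodes on the link.
const ALL_NODES_MCAST: Ipv6Addr = Ipv6Addr::new(0xff02, 0, 0, 0, 0, 0, 0, 1);
/// ff02::2, all routers on the link.
const ALL_ROUTERS_MCAST: Ipv6Addr = Ipv6Addr::new(0xff02, 0, 0, 0, 0, 0, 0, 2);

/// Join both NDP-relevant groups on the interface with index `ifindex`.
fn join_ndp_groups(sock: &UdpSocket, ifindex: u32) -> io::Result<()> {
    sock.join_multicast_v6(&ALL_NODES_MCAST, ifindex)?;
    sock.join_multicast_v6(&ALL_ROUTERS_MCAST, ifindex)?;
    Ok(())
}

/// Hypothetical guard: true when a received packet's source address is our
/// own link-local address on that interface, so the caller can skip it.
fn is_own_advertisement(src: &Ipv6Addr, local: &Ipv6Addr) -> bool {
    src == local
}
```

Router solicitations are sent to the all-routers group and advertisements to all-nodes, which is why a router that wants to see both joins both.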

rdb/src/db.rs

taspelund · 2026-01-12T20:38:31Z

Thanks for all this work, Ry!

I've left individual comments above, but overall I'd say it looks good.
The only places I have some actual concerns are in enabling ENHE capability by default (I want to make sure we don't break anything for numbered peers), and in the HashMap that rdb is using to store its unnumbered next-hops (I think we should map interface -> address rather than address -> interface, that way we can handle something like fe80::10%qsfp0 and fe80::10%qsfp1 at the same time).
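A quick sketch of the keying concern, using plain `HashMap`s rather than the actual rdb types. The two helper functions are hypothetical and exist only to show the collision.

```rust
use std::collections::HashMap;
use std::net::Ipv6Addr;

/// address -> interface: the same link-local address seen on two
/// interfaces collapses into one entry, and the last insert wins.
fn index_by_address(nbrs: &[(&str, Ipv6Addr)]) -> HashMap<Ipv6Addr, String> {
    nbrs.iter()
        .map(|(ifx, addr)| (*addr, ifx.to_string()))
        .collect()
}

/// interface -> address: one entry per interface, so identical link-local
/// addresses on different links can coexist.
fn index_by_interface(nbrs: &[(&str, Ipv6Addr)]) -> HashMap<String, Ipv6Addr> {
    nbrs.iter()
        .map(|(ifx, addr)| (ifx.to_string(), *addr))
        .collect()
}
```

Link-local addresses are only unique per link, so with fe80::10 on both qsfp0 and qsfp1 the address-keyed map keeps a single entry while the interface-keyed map keeps both.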

ndp/src/manager.rs

taspelund · 2026-01-12T21:04:51Z

Another couple items:

There aren't any cargo tests for any of the rust code. I'd really like to see some unit tests for the unnumbered manager, and possibly some integration tests (although, it understandably seems difficult to integrate unnumbered into bgp/src/test.rs)
I'd like to see some doc updates for this. I've made an effort to get BGP documented (docs/bgp-architecture.md), so it would be nice to have the BGP-related changes added to that doc and to have the unnumbered manager documented too (perhaps in its own markdown file), even if it's just LLM-generated docs that have been quickly reviewed. This isn't a blocker, but a nice-to-have.

taspelund · 2026-01-13T16:28:13Z

Re-running proptest -- it failed w/ the tokio "error 0" signature

rcgoodfellow · 2026-01-14T20:55:04Z

This logic needs to be updated

maghemite/bgp/src/session.rs

Lines 7542 to 7561 in 365caf9

    
    RouteUpdate::V4(RouteUpdate4::Announce(nlri)) => {
        let nh4 = match self.derive_nexthop(Afi::Ipv4, pc)? {
            BgpNexthop::Ipv4(addr) => addr,
            _ => {
                return Err(Error::InvalidAddress(
                    "IPv4 routes require IPv4 next-hop".into(),
                ));
            }
        };
        let mut path_attributes = self.router.base_attributes();
        path_attributes.push(PathAttributeValue::NextHop(nh4).into());
        UpdateMessage {
            withdrawn: vec![],
            path_attributes,
            nlri,
            ..Default::default()
        }
    }

rcgoodfellow added this to the 18 milestone Jan 2, 2026

rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from 3b513d3 to ad340c9 Compare January 2, 2026 16:31

rcgoodfellow changed the title ~~add ndp router manager~~ BGP unnumbered Jan 3, 2026

rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from 18b39ed to 54b29ba Compare January 3, 2026 08:35

rcgoodfellow mentioned this pull request Jan 3, 2026

ipv6 e2e integration oxidecomputer/omicron#9570

Draft

rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from fa6bc23 to c423402 Compare January 4, 2026 05:07

rcgoodfellow marked this pull request as ready for review January 5, 2026 22:09

rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch 5 times, most recently from f96b4d7 to fbbb83f Compare January 8, 2026 09:14

rcgoodfellow changed the base branch from main to trey/mp-bgp January 8, 2026 09:14

rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from b7b85c5 to 0d429d2 Compare January 9, 2026 01:47

rcgoodfellow force-pushed the trey/mp-bgp branch from 1f0a732 to f13f187 Compare January 10, 2026 01:26

rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from 0d429d2 to af529b5 Compare January 10, 2026 03:39

bgp unnumbered

d7fd911

rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from af529b5 to d7fd911 Compare January 10, 2026 03:40

rcgoodfellow added 2 commits January 10, 2026 07:08

falcon-lab improvements

7533d80

peering with arista/eos working

80e651a

rcgoodfellow force-pushed the ry/mgd-v6-router-disco branch from 0527ed9 to 80e651a Compare January 12, 2026 08:07