Three Dimensions of BGP Address Family Nerd Knobs

Got into an interesting BGP discussion a few days ago, resulting in a wild chase through recent SRv6 and BGP drafts and RFCs. You might find the results mildly interesting ;)

BGP has three dimensions of address family configurability:

  • Transport sessions. Most vendors implement BGP over TCP over IPv4 and IPv6. I’m sure there’s someone out there running BGP over CLNS [1], and there are already drafts proposing running BGP over QUIC [2].
  • Address families enabled on individual transport sessions, more precisely a combination of Address Family Identifier (AFI) and Subsequent Address Family Identifier (SAFI).
  • Next-hop address family for each enabled address family.

Let’s start with a few well-known AFI/SAFI combinations:

  • 1/1 stands for IPv4 (1) unicast (1)
  • 1/2 is IPv4 (1) multicast (2)
  • 2/1 is IPv6 (2) unicast (1)
  • 2/2 is … ;)
  • 1/4 is IPv4 (1) with MPLS labels (4 = MPLS-labeled prefix)
  • 1/128 is IPv4 MPLS VPN (128 = MPLS-labeled VPN address)
  • 2/128 is IPv6 MPLS VPN
  • 25/70 is EVPN (25 = L2VPN, 70 = EVPN)
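The list above maps naturally onto a pair of lookup tables. The following is a minimal sketch that only knows the code points mentioned in this post; the `describe` helper and the table names are made up for illustration.

```python
# AFI and SAFI code points taken from the list above (not a complete registry).
AFI_NAMES = {1: "IPv4", 2: "IPv6", 25: "L2VPN"}
SAFI_NAMES = {
    1: "unicast",
    2: "multicast",
    4: "MPLS-labeled prefix",
    70: "EVPN",
    128: "MPLS-labeled VPN address",
}


def describe(afi: int, safi: int) -> str:
    """Return a human-readable name for an AFI/SAFI pair."""
    afi_name = AFI_NAMES.get(afi, f"unknown AFI {afi}")
    safi_name = SAFI_NAMES.get(safi, f"unknown SAFI {safi}")
    return f"{afi}/{safi}: {afi_name} {safi_name}"


for pair in [(1, 1), (1, 128), (25, 70)]:
    print(describe(*pair))
```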

AFI/SAFI combinations are negotiated between adjacent BGP neighbors in the BGP capabilities exchange when the session is established. Changing the AFI/SAFI set on a session therefore brings down the BGP transport session, unless your vendor implemented Multisession BGP, which opens a new transport session for every AFI/SAFI combination.
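To make the negotiation a bit more tangible, here is a rough sketch of how a single AFI/SAFI pair is carried in the Multiprotocol Extensions capability from RFC 4760 (capability code 1, with a two-octet AFI, a reserved octet, and a one-octet SAFI). The function names are hypothetical, and the `negotiated` helper is a simplification that treats the usable set as the intersection of what both neighbors advertised.

```python
import struct

CAP_MULTIPROTOCOL = 1  # capability code for Multiprotocol Extensions (RFC 4760)


def mp_capability(afi: int, safi: int) -> bytes:
    """Encode one Multiprotocol Extensions capability as code, length, value."""
    value = struct.pack("!HBB", afi, 0, safi)  # AFI, reserved octet, SAFI
    return struct.pack("!BB", CAP_MULTIPROTOCOL, len(value)) + value


def negotiated(local, remote) -> set:
    """Simplified model: only AFI/SAFI pairs advertised by both sides are usable."""
    return set(local) & set(remote)


local = {(1, 1), (2, 1), (1, 128)}
remote = {(1, 1), (1, 128), (25, 70)}
print(sorted(negotiated(local, remote)))
print(mp_capability(1, 128).hex())
```

Because the capabilities travel in the OPEN message, adding a pair to `local` only takes effect once a new OPEN exchange happens, which is why changing the set usually means resetting the session.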

In Ye Olde Days they made a sane [3] assumption that the next hop for an address family would belong to the same address family (AFI):

  • IPv4 unicast and multicast had IPv4 next hops
  • IPv6 unicast and multicast had IPv6 next hops

It was perfectly possible from the day Multiprotocol Extensions for BGP (RFC 4760) were implemented to run the IPv4 unicast (1/1) address family over an IPv6 session (or vice versa). After all, BGP updates are just application data, and you could theoretically use avian carriers to transport them around. There’s just the tiny little detail of next hop processing: whenever a router doesn’t have a usable next hop handy, it takes its transport address as the next hop.

The results of running the IPv4 AF over an IPv6 transport session or vice versa are usually hilarious; you need some serious route-map-fu to make it work (set the next hop by hand unless it’s already set), and the results are usually not worth the effort [4].
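The mismatch is easy to reproduce in a toy model. The sketch below is not how any particular BGP implementation picks next hops; it only mimics the behavior described above (fall back to the local transport address unless a next hop was set by hand), and both function names are invented for this example.

```python
from ipaddress import ip_address, ip_network
from typing import Optional


def pick_next_hop(local_session_address: str, configured: Optional[str] = None):
    """Use the hand-configured next hop if present, else the local transport address."""
    return ip_address(configured or local_session_address)


def family_mismatch(prefix: str, next_hop) -> bool:
    """True when the prefix and the next hop belong to different IP versions."""
    return ip_network(prefix).version != next_hop.version


# IPv4 prefix advertised over an IPv6 session, no next hop configured
fallback = pick_next_hop("2001:db8::1")
print(fallback, family_mismatch("192.0.2.0/24", fallback))

# Same session, next hop set by hand (the route-map-fu mentioned above)
fixed = pick_next_hop("2001:db8::1", configured="198.51.100.1")
print(fixed, family_mismatch("192.0.2.0/24", fixed))
```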

The “prefixes and next hops belong to the same address family” assumption started to break down the moment someone decided to use BGP to propagate VPNv4 prefix information. As is so often the case, they ended up with a Quick-and-Dirty Solution (QDS):

  • VPNv4 (MPLS VPN) had VPNv4 next hops that were really IPv4 next hops encoded as VPNv4 addresses with the RD set to zero.
  • VPNv6 could have IPv4 or IPv6 next hops – IPv6 next hops are VPNv6 addresses with RD set to zero, IPv4 next hops are specified as IPv4-mapped IPv6 addresses with RD set to zero.
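The two encodings above boil down to "eight zero bytes of RD, then the address". A minimal sketch of that byte layout follows; the helper names are made up, and only the layout described in the bullets is modeled (no attribute headers or length fields).

```python
from ipaddress import IPv6Address, ip_address

ZERO_RD = bytes(8)  # route distinguisher set to zero


def vpnv4_next_hop(ipv4: str) -> bytes:
    """VPNv4 next hop: zero RD followed by the 4-byte IPv4 address (12 bytes)."""
    addr = ip_address(ipv4)
    if addr.version != 4:
        raise ValueError("VPNv4 next hop must be an IPv4 address")
    return ZERO_RD + addr.packed


def vpnv6_next_hop(address: str) -> bytes:
    """VPNv6 next hop: zero RD followed by a 16-byte IPv6 address (24 bytes).

    An IPv4 next hop is first converted into an IPv4-mapped IPv6 address.
    """
    addr = ip_address(address)
    if addr.version == 4:
        # ::ffff:a.b.c.d is ten zero bytes, two 0xff bytes, then the IPv4 address
        addr = IPv6Address(bytes(10) + b"\xff\xff" + addr.packed)
    return ZERO_RD + addr.packed


print(vpnv4_next_hop("192.0.2.1").hex())
print(vpnv6_next_hop("192.0.2.1").hex())
print(vpnv6_next_hop("2001:db8::1").hex())
```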

Labeled unicast (BGP-LU – RFC 8277) is another tweak: let’s pretend the prefix (NLRI) is a bit longer than usual and contains the label, but the next hop still belongs to the same address family.
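As a rough illustration of the "slightly longer prefix" idea, the sketch below builds a single-label NLRI the way RFC 8277 lays it out: a length octet counting bits (24 bits of label field plus the prefix length), three octets holding the 20-bit label, and then the prefix itself. The function name is hypothetical, and setting the bottom-of-stack bit is an assumption based on the single-label case in the RFC, so check the specification before relying on that detail.

```python
from ipaddress import ip_network


def labeled_nlri(prefix: str, label: int) -> bytes:
    """Encode one labeled-unicast NLRI carrying a single MPLS label."""
    if not 0 <= label < 2**20:
        raise ValueError("MPLS labels are 20 bits wide")
    net = ip_network(prefix)
    prefix_octets = (net.prefixlen + 7) // 8
    # 20-bit label, 3 reserved bits left at zero, S bit set (assumption, see above)
    label_field = (label << 4) | 0x1
    return (
        bytes([24 + net.prefixlen])  # length in bits: label field + prefix
        + label_field.to_bytes(3, "big")
        + net.network_address.packed[:prefix_octets]
    )


print(labeled_nlri("10.1.0.0/16", 100).hex())
```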

Then there’s EVPN. EVPN with MPLS can use IPv4 or IPv6 next hops (because we just need a label toward the next hop anyway), while EVPN with VXLAN usually needs an IPv4 next hop. I don’t have enough time to dig out how that works; the details are left as an exercise for an overzealous reader.

The list goes on and on and on and on… and every time you open another RFC you realize how much you don’t know.

But wait, there’s more. RFC 5549 (used to implement the Cumulus unnumbered EBGP feature) and its successor RFC 8950 describe how to use IPv6 next hops for IPv4 prefixes. That obviously doesn’t make much sense until you throw some serious ARP glue at the problem (this RIPE presentation does a decent job of explaining the details)… unless you believe in the Power of SRv6.

SRv6 can be used to implement numerous BGP-based overlay services, including global IPv4 and IPv6, and VPN IPv4 and IPv6 – SRv6 proponents reinvented most everything we did in the MPLS world, the only difference being much higher overhead and more complex hardware required by SRv6 [5].

Anyways, if you want to run IPv4 over SRv6, you have to send IPv4 traffic to an SRv6 SID, which is an IPv6 address. Thus the need for IPv6 next hops in the IPv4 unicast BGP address family, which is usually transported over an IPv6 TCP session.

A careful reader might wonder how we’re going to negotiate the plethora of options. Welcome to the Extended Next Hop Encoding Capability defined in RFC 8950. The capability value is a set of triplets specifying AFI/SAFI/Nexthop AFI, enabling you to tell your neighbors “I want to run IPv4 (AFI) VPN Unicast (SAFI) with IPv6 next hops (NH-AFI)”. Cluedo anyone?
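The triplets are simple enough to encode by hand. Here is a minimal sketch of the capability (code 5), where each triplet takes six octets: NLRI AFI, NLRI SAFI, and Nexthop AFI, two octets each. The function name and the example triplets are chosen for illustration only.

```python
import struct

CAP_EXTENDED_NEXT_HOP = 5  # Extended Next Hop Encoding capability code


def extended_next_hop_capability(triplets) -> bytes:
    """Encode (NLRI AFI, NLRI SAFI, Nexthop AFI) triplets as code, length, value."""
    value = b"".join(
        struct.pack("!HHH", afi, safi, nh_afi) for afi, safi, nh_afi in triplets
    )
    return struct.pack("!BB", CAP_EXTENDED_NEXT_HOP, len(value)) + value


# "IPv4 unicast and IPv4 VPN unicast, both with IPv6 next hops"
wanted = [(1, 1, 2), (1, 128, 2)]
print(extended_next_hop_capability(wanted).hex())
```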

Finally: is it possible to run the IPv4 AF with IPv6 next hops over an IPv4 transport session? Sure, I just hope no one implemented it.


  1. Cisco implemented the CLNS address family, but it has to run over an IPv4 transport session.

  2. A while ago, everything was better with Bluetooth. These days, everything is better with QUIC.

  3. The assumption is still sane; it’s the networking industry that got insane turning BGP into an eventually consistent policy-aware kitchen sink.

  4. Not sure whether I wrote a blog post about this; I know it’s covered in the Building Large IPv6 Networks webinar.

  5. … which is awesome for everyone involved but the customers.


2 comments:

  1. What do you mean by "serious ARP glue"? I found no clue about that in your referenced RIPE presentation.

    Replies
    1. I find this paragraph vague, too.

    2. You're right -- I remember figuring that out during the RIPE presentation. Maybe it was handled during the questions, will write a follow-up blog post.

  2. Hi,

    are you aware of any BGP over CLNS implementation?

    thank you

    antonio
