Almost a decade ago I described a scenario in which a perfectly valid IBGP topology could result in a permanent routing loop. While one wouldn’t expect to see such a scenario in a well designed network, it’s been known for ages1 that using BGP route reflectors could result in suboptimal forwarding.
Here’s a simple description of how that could happen:
We all know that you have to use an AS number between 64512 and 65535 for private BGP autonomous systems, right? Well, we’re all wrong – the high end of the range is 65534, and Chris Parker wrote a nice blog post explaining the reasons behind that change.
One of my readers sent me this interesting question:
I understand that an SDN controller needs network topology information to build traffic engineering paths with PCE/PCEP… but why would we use BGP-LS to extract the network topology information? Why can’t we run OSPF with controller by simulating a software based OSPF instance in every area to get topology view?
There are several reasons to use BGP-LS:
It started with an interesting question tweeted by @pilgrimdave81
I’ve seen on Cisco NX-OS that it’s preferring a (ospf->bgp) locally redistributed route over a learned EBGP route, until/unless you clear the route, then it correctly prefers the learned BGP one. Seems to be just ooo but don’t remember this being an issue?
Ignoring the “why would you get the same route over OSPF and EBGP, and why would you redistribute an alternate copy of a route you’re getting over EBGP into BGP” aspect, Peter Palúch wrote a detailed explanation of what’s going on and allowed me to copy into a blog post to make it more permanent:
Henk Smit left numerous questions in a comment referring to the Rethinking BGP in the Data Center presentation by Russ White:
In Russ White’s presentation, he listed a few requirements to compare BGP, IS-IS and OSPF. Prefix distribution, filtering, TE, tagging, vendor-support, autoconfig and topology visibility. The one thing I was missing was: scalability.
I noticed the same thing. We kept hearing how BGP scales better than link-state protocols (no doubt about that) and how you couldn’t possibly build a large data center fabric with a link-state protocol… and yet this aspect wasn’t even mentioned.
In the previous blog post in this series, I described why it’s (almost) impossible to implement unequal-cost multipathing for anycast services (multiple servers advertising the same IP address or range) with OSPF. Now let’s see how easy it is to solve the same challenge with BGP DMZ Link Bandwidth attribute.
I didn’t want to listen to the fan noise generated by my measly Intel NUC when simulating a full leaf-and-spine fabric, so I decided to implement a slightly smaller network:
Here’s one of the major differences between Facebook and Google: one of them publishes research papers with helpful and actionable information, the other uses publications as recruitment drive full of we’re so awesome but you have to trust us – we’re not sharing the crucial details.
Just in case you haven’t realized: Petr Lapukhov of the RFC 7938 fame moved from Microsoft to Facebook a few years ago. Coincidence? I think not.
In the previous blog posts in this series, we explored whether we need addresses on point-to-point links (TL&DR: no), whether it’s better to have interface or node addresses (TL&DR: it depends), and why we got unnumbered IPv4 interfaces. Now let’s see how IP routing works over unnumbered interfaces.
A cursory look at an IP routing table (or at CCNA-level materials) tells you that the IP routing table contains prefixes and next hops, and that the next hops are IP addresses. How should that work over unnumbered interfaces, and what should we use for the next-hop IP address in that case?
Ever since draft-lapukhov was first published almost a decade ago, we all knew BGP was the only routing protocol suitable for data center networking… or at least Thought Leaders and vendor marketers seem to be of that persuasion.
After I created the Segment Routing lab to test the relationship between Node Segment ID (SID) and MPLS labels (and added support for IS-IS, SR-MPLS, and BGP to netsim-tools), I was just a minor step away from testing BGP-free core with SR-MPLS.
I added two nodes to my lab setup, this time using IOSv as those nodes need nothing more than EBGP support (and IOSv is tiny compared to IOS XE on CSR):
What is Katacoda? An awesome environment that allows content authors to create scenarios running on Linux VMs accessible through a web browser. I can only hope they’ll fix the quirks and keep going – I have so many ideas what could be done with it.
Why FRR? Not too long ago Jeroen van Bemmel sent me a link to a simple Katacoda scenario he created to demonstrate how to set up netsim-tools and containerlab. His scenario got the tools installed and set up, but couldn’t create a running network as there are almost no usable Network OS images on Docker Hub (that is accessible from within Katacoda) – the only image I could find was FRR.
TL&DR: If you want to test BGP, OSPF, IS-IS, or SR-MPLS in a virtual lab, you might build the lab faster with netsim-tools release 0.6.
In the netsim-tools release 0.6 I focused on adding routing protocol functionality:
- IS-IS on Cisco IOS/IOS XE, Cisco NX-OS, Arista EOS, FRR, and Junos.
- BGP on the same set of platforms, including support for multiple autonomous systems, EBGP, IBGP full mesh, IBGP with route reflectors, next-hop-self control, and BGP/IGP interaction.
- Segment Routing with MPLS on Cisco IOS XE and Arista EOS.
You’ll also get:
When I’d first seen BGP-LS I immediately thought: “it would be cool to use this to fetch link state topology data from the network and build a graph out of it”. In those days the only open-source way I could find to do it involved Open DayLight controller’s BGP-LS-to-REST-API converter, and that felt like deploying an aircraft carrier to fly a kite.
Things have improved dramatically since then. In Visualizing BGP-LS Tables, HB described how he solved the challenge with GoBGP, gRPC interface to GoBGP, and some Python code to parse the data and draw the topology graph with NetworkX. Enjoy!
This podcast introduction was written by Nick Buraglio, the host of today’s podcast.
In today’s evolving landscape of whitebox, brightbox, and software routing, a small but incredibly comprehensive routing platform called FreeRTR has quietly been evolving out of a research and education service provider network in Hungary.
Corey Quinn mentioned me in a tweet linking to AWS announcement that they are the biggest user of BGP RPKI (by the size of signed address space) worldwide. Good for them – I’m sure it got their marketing excited. It’s also trivial to do once you have the infrastructure in place. Just saying…
On a more serious front: how important is RPKI and what misuses can it stop?
If you’ve never heard of RPKI, the AWS blog post is not too bad, Nick Matthews wrote a “look grandma, this is how it works” version in 280-character installments, and you should definitely spend some time exploring MANRS resources. Here’s a short version for differently-attentive ;))