Category: IP routing
Early bridges implemented a single bridging domain across all ports. Within a few years, we got multiple bridging domains within a single device (including bridging implementation in Cisco IOS). The capability to have multiple bridging domains stretched across several devices was still missing… until the modern-day Pandora opened the VLAN box and forever swamped us in the complexities of large-scale bridging.
Network terminology was easy in the 1980s: bridges forwarded frames between Ethernet segments based on MAC addresses, and routers forwarded network layer packets between network segments. That nirvana couldn’t last long; eventually, a big-enough customer told Cisco: “I don’t want to buy another box if I already have your too-expensive router. I want your router to be a bridge.”
Turning a router into a bridge is easier than going the other way round1: add MAC table and dynamic MAC learning, and spend an evening implementing STP.
When I started implementing the netlab VLAN module, I encountered (at least) three different ways of configuring physical interfaces and bridging domains even though the underlying packet forwarding operations (and sometimes even the forwarding hardware) are the same. That confusopoly is guaranteed to make your head spin for years, and the only way to figure out what’s going on behind the scenes is to go back to the fundamentals.
A friend of mine working for a mid-sized networking vendor sent me an intriguing question:
We have a product using an old ASIC that has 12K forwarding entries, and would like to extend its lifetime. I know you were mentioning some useful tricks, would you happen to remember what they were?
This challenge has no perfect solution, but there are at least three tricks I’ve encountered so far (as always, comments are most welcome):
Most large content providers use some sort of egress traffic engineering on edge web proxy/caching servers to optimize the end-user experience (avoid congested transit autonomous systems) and link utilization on egress links.
I was planning to write a blog post about the tricks they use for ages, and never found time to do it… but if you don’t mind watching a video, the Source Routing on the Edge presentation Oliver Herms had at iNOG::14v does a pretty good job explaining the concepts and a particular implementation.
Every now and then someone tells me how much better the global Internet would be if only we were using recursive layers (RINA) and hierarchical addresses. I always answer “that’s a business problem, not a technical one, and you cannot solve business problems by throwing technology at them”, but of course that has never persuaded anyone who hasn’t been running a large-enough business for long enough.
Syed Khalid Ali left the following question on an old blog post describing the use of IBGP and EBGP in an enterprise network:
From an enterprise customer perspective, should I run iBGP or iBGP+IGP (OSPF/ISIS/EIGRP) or IGP while doing mutual redistribution on the edge routers. I was hoping if you could share some thoughtful insight on when to select one over the another?
We covered tons of relevant details in the January 2022 Design Clinic, here’s the CliffNotes version. Keep in mind that the road to hell (and broken designs) is paved with great recipes and best practices, and that I’m presenting a black-and-white picture because I don’t feel like transcribing the discussion we had into an oversized blog post. People wrote books on this topic; I’m pretty sure you can search for Russ White and find a few of them.
Finally, there’s no good substitute for understanding how things work (which brings me to another webinar ;).
One of my readers sent me an intriguing challenge based on the following design:
- He has a data center with two core switches (C1 and C2) and two Cisco Nexus edge switches (E1 and E2).
- He’s using static default routing from core to edge switches with HSRP on the edge switches.
- E1 is the active HSRP gateway connected to the primary WAN link.
The following picture shows the simplified network diagram:
The multi-threaded routing daemons blog post generated numerous in-depth comments here and on LinkedIn. As always, thanks a million for keeping me honest and providing more details or additional perspectives. Here are some of the best bits.
Jeff Tantsura provided the first dose of reality:
All modern routing protocols implementations are multi-threaded, with a minimum separation of adjacency handling, route calculations and update generation. Note - writing multi-threaded code for complex tasks is a non trivial exercise (you could search for thread safety and similar artifacts and what happens when not implemented correctly). Moving to a multi-threaded code in early 2010s resulted in a multi-release (year) effort and 100s of related bugs all around.
Dr. Tony Przygienda added his hands-on experience (he’s been developing routing protocol software for ages):
More than a decade ago (before SD-WAN was even a thing) I wrote an article describing how easy it is to route different applications onto different links (MPLS/VPN versus IPsec tunnels) using a distance vector routing protocol (preferably BGP, although even RIP would work).
You might find it interesting that it’s possible to solve tough problems with good network design instead of proprietary unicorn dust, so I salvaged the article from some dusty archive, cleaned it up, polished it, and published it on ipSpace.net.
I got into an interesting debate after I published the Anycast Works Just Fine with MPLS/LDP blog post, and after a while it turned out we have a slightly different understanding what anycast means. Time to fall back to a Wikipedia definition:
Anycast is a network addressing and routing methodology in which a single destination IP address is shared by devices (generally servers) in multiple locations. Routers direct packets addressed to this destination to the location nearest the sender, using their normal decision-making algorithms, typically the lowest number of BGP network hops.
Based on that definition, any transport technology that allows the same IP address or prefix to be announced from several locations supports anycast. To make it a bit more challenging, I would add “and if there are multiple paths to the anycast destination that could be used for multipath forwarding1, they should all be used”.
When I wrote the Why Does Internet Keep Breaking? blog post a few weeks ago, I claimed that FRR still uses single-threaded routing daemons (after a too-cursory read of their documentation).
Donald Sharp and Quentin Young politely told me
I was an idiot I should get my facts straight, I removed the offending part of the blog post, promised to write another one going into the details, and Quentin improved the documentation in the meantime, so here we are…
The PowerPoint-level description of this idea sounds fantastic:
- A device runs two active copies of its control plane.
- There is no cold/warm start of the backup control plane. The failover is almost instantaneous.
- The state of all control plane protocols is continuously synchronized between the two control plane instances. If one of them fails, the other one continues running.
- A failure of a control plane instance is thus invisible from the outside.
If this sounds an awful lot like VMware Fault Tolerance, you’re not too far off the mark.
We have school holidays this week, so I’m reposting wonderful comments that would otherwise be lost somewhere in the page margins. Today: Dmitry Perets on the interactions between BFD and GR.
Well, assuming that the C-bit is set honestly (will be funny if not) and assuming that the Helper is using this bit correctly (and I think it’s pretty well defined what “correctly” means - see section 4.3 in RFC 5882), the answer is pretty clear.
The whole High Availability Switching series started with a question along the lines of “does it make sense to run BFD together with Graceful Restart”. After Non-Stop Forwarding 101, Graceful Restart 101, and Graceful Restart and Convergence Speed we finally have enough information to answer that question.
TL&DR: Most probably not.
A more nuanced answer depends (as always) on a gazillion implementation details.