… updated on Friday, November 20, 2020 15:10 UTC
How Fast Can We Detect a Network Failure?
In the introductory fast failover blog post I mentioned the challenge of fast link- and node failure detection, and how it makes little sense to waste your efforts on fast failover tricks if the routing protocol convergence time has the same order of magnitude as failure detection time.
Now let’s focus on realistic failure detection mechanisms and detection times. Imagine a system connecting a hardware switching platform (example: data center switch or a high-end router) with a software switching platform (midrange router):
Self-promotion Disguised as Research Paper
From AI is wrestling with a replication crisis (HT: Drew Conry-Murray)
Last month Nature published a damning response written by 31 scientists to a study from Google Health that had appeared in the journal earlier this year. Google was describing successful trials of an AI that looked for signs of breast cancer in medical images. But according to its critics, the Google team provided so little information about its code and how it was tested that the study amounted to nothing more than a promotion of proprietary tech (emphasis mine).
No surprise there, we’ve seen it before (not to mention the “look how awesome we are, but we can’t tell you the details” Jupiter Rising article).
Video: Getting a Packet Across a Network
After (hopefully) agreeing on what routing, bridging, and switching are, let’s focus on the first important topic in this area: how do we get a packet across the network? Yet again, there are three fundamentally different technologies:
- Source node knows the full path (source routing)
- Source node opens a path (virtual circuit) to the destination node and uses that path to send traffic
- The network performs hop-by-hop destination-address-based packet forwarding.
More details in the Getting Packets Across the Network video.
Worth Reading: Protocol Options Rusted Shut
A long while ago I found a great article explaining TLS 1.3 and its migration woes on CloudFlare blog. While I would strongly recommend you read it just to get familiar with TLS 1.3, the real fun starts when the author discusses migration problems, kludges you have to use trying to fix them, less-than-compliant implementations breaking those kludges, and options that were supposed to be dynamic, but turn out to be static (rusted shut) due to middleboxes that implemented protocols as-seen-in-the-wild not as-described-in-RFCs.
Change a few TLAs and you could be reading about TCP, IP stack, IPv6, BGP… I addressed those aspects in the ossification and centralization part of Upcoming Internet Challenges webinar.
Fast Failover: The Challenge
Sometimes you’re asked to design a network that will reroute around a failure in milliseconds. Is that feasible? Maybe. Is it simple? Absolutely not.
In this series of blog posts we’ll start with the basics, explore the technologies that you can use to reach that goal, and discover one or two unexpected rabbit holes.
New Content: VMware NSX-T 3.0 Update
When VMware NSX-T 3.0 came out, I planned to do an update session of the VMware NSX Technical Deep Dive webinar along the lines of what I did for AWS Networking a few weeks ago. However, it turned out that most of the new features didn’t take more than a bullet or two on an existing slide, or at most a new slide.
Covering them in a live session and then slicing-and-dicing the resulting recording simply didn’t make sense, so I updated the videos in summer 2020 (the last batch was published in early August).
Worth Reading: The Trap of The Premature Senior
Here’s another riff on the “when you’re the smartest person in the room, change the room” theme: The Trap of The Premature Senior by inimitable Charity Majors. Enjoy!
Video: NetQ and Cumulus Linux Data Models
In the last part of his Cumulus Linux 4.0 Update Pete Lumbis talked about using NetQ to capture streaming telemetry and increase network observability, and the new model-driven configuration approach (including all the usual buzzwords like NETCONF, RPC, YAML, JSON, and OpenConfig) coming in 2020.
Renumbering Public Cloud Address Space
Got this question from one of the networking engineers “blessed” with rampant clueless-rush-to-the-cloud.
I plan to peer multiple VNet from different regions. The problem is that there is not any consistent deployment in regards to the private IP subnets used on each VNet to the point I found several of them using public IP blocks as private IP ranges. As far as I recall, in Azure we can’t re-ip the VNets as the resource will be deleted so I don’t see any other option than use NAT from offending VNet subnets to use my internal RFC1918 IPv4 range. Do you have a better idea?
The way I understand Azure, while you COULD have any address range configured as VNet CIDR block, you MUST have non-overlapping address ranges for VNet peering.
Do We Need LFA or FRR for Fast Failover in ECMP Designs?
One of my readers sent me a question along these lines:
Imagine you have a router with four equal-cost paths to prefix X, two toward upstream-1 and two toward upstream-2. Now let’s suppose that one of those links goes down and you want to have link protection. Do I really need Loop-Free Alternate (LFA) or MPLS Fast Reroute (FRR) to get fast (= immediate) failover or could I rely on multiple equal-cost paths to get the job done? I’m getting different answers from different vendors…
Please note that we’re talking about a very specific question of whether in scenarios with equal-cost layer-3 paths the hardware forwarding data structures get adjusted automatically on link failure (without CPU reprogramming them), and whether LFA needs to be configured to make the adjustment happen.