Fast Failover: The Challenge
Sometimes you’re asked to design a network that will reroute around a failure in milliseconds. Is that feasible? Maybe. Is it simple? Absolutely not.
In this series of blog posts we’ll start with the basics, explore the technologies that you can use to reach that goal, and discover one or two unexpected rabbit holes.
New Content: VMware NSX-T 3.0 Update
When VMware NSX-T 3.0 came out, I planned to do an update session of the VMware NSX Technical Deep Dive webinar along the lines of what I did for AWS Networking a few weeks ago. However, it turned out that most of the new features didn’t take more than a bullet or two on an existing slide, or at most a new slide.
Covering them in a live session and then slicing-and-dicing the resulting recording simply didn’t make sense, so I updated the videos in summer 2020 (the last batch was published in early August).
Worth Reading: The Trap of The Premature Senior
Here’s another riff on the “when you’re the smartest person in the room, change the room” theme: The Trap of The Premature Senior by the inimitable Charity Majors. Enjoy!
Video: NetQ and Cumulus Linux Data Models
In the last part of his Cumulus Linux 4.0 Update, Pete Lumbis talked about using NetQ to capture streaming telemetry and increase network observability, and about the new model-driven configuration approach (including all the usual buzzwords: NETCONF, RPC, YAML, JSON, and OpenConfig) coming in 2020.
Renumbering Public Cloud Address Space
Got this question from one of the networking engineers “blessed” with a rampant clueless rush to the cloud:
I plan to peer multiple VNets from different regions. The problem is that there was no consistent approach to the private IP subnets used in each VNet, to the point that I found several of them using public IP blocks as private IP ranges. As far as I recall, in Azure we can’t re-IP the VNets, as the resources would be deleted, so I don’t see any option other than NATing the offending VNet subnets into my internal RFC 1918 IPv4 range. Do you have a better idea?
The way I understand Azure, while you COULD have any address range configured as a VNet CIDR block, you MUST have non-overlapping address ranges for VNet peering.
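If you need to figure out how bad the overlap situation is before planning the NAT workaround, a few lines of Python can audit the CIDR blocks. Here’s a minimal sketch using the standard ipaddress module; the VNet names and address ranges are hypothetical examples, not anything pulled from the reader’s environment:

```python
# Minimal sketch: check a set of VNet CIDR blocks for overlaps before
# attempting VNet peering. All names and ranges below are hypothetical.
from ipaddress import ip_network
from itertools import combinations

vnets = {
    "vnet-eastus":    ip_network("10.1.0.0/16"),
    "vnet-westeu":    ip_network("172.217.0.0/16"),   # public block used as private space
    "vnet-centralus": ip_network("10.1.128.0/17"),    # overlaps vnet-eastus
}

for (name_a, net_a), (name_b, net_b) in combinations(vnets.items(), 2):
    if net_a.overlaps(net_b):
        print(f"{name_a} ({net_a}) overlaps {name_b} ({net_b}) -- peering will fail")
```

Any pair flagged by a check like this has to be renumbered (effectively: rebuild the VNet) or hidden behind NAT before peering will work.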
Do We Need LFA or FRR for Fast Failover in ECMP Designs?
One of my readers sent me a question along these lines:
Imagine you have a router with four equal-cost paths to prefix X, two toward upstream-1 and two toward upstream-2. Now let’s suppose that one of those links goes down and you want to have link protection. Do I really need Loop-Free Alternate (LFA) or MPLS Fast Reroute (FRR) to get fast (= immediate) failover or could I rely on multiple equal-cost paths to get the job done? I’m getting different answers from different vendors…
Please note that we’re talking about a very specific question: in scenarios with equal-cost layer-3 paths, are the hardware forwarding data structures adjusted automatically on link failure (without the CPU reprogramming them), and does LFA have to be configured to make that adjustment happen?
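To make the distinction concrete, here’s a toy Python model of an ECMP group as a shared next-hop structure. It only loosely mimics what forwarding hardware does (no real ASIC works like this), but it illustrates why removing a failed member from the group repairs every prefix pointing at it in one step, with no per-prefix FIB rewrite:

```python
# Toy model of an ECMP group: flows are hashed onto the currently
# active member links. Link names and the flow tuple are made up.
import zlib

class EcmpGroup:
    def __init__(self, members):
        self.members = list(members)       # active next-hop links

    def next_hop(self, flow):
        # Hash the flow identifier onto whatever members are still up
        h = zlib.crc32(flow.encode())
        return self.members[h % len(self.members)]

    def link_down(self, member):
        # Local repair: shrink the member list. Every prefix sharing
        # this group is fixed at once -- no per-prefix reprogramming.
        self.members.remove(member)

group = EcmpGroup(["up1-a", "up1-b", "up2-a", "up2-b"])
flow = "10.0.0.1,10.8.0.9,6,49152,179"     # hypothetical 5-tuple
print("before failure:", group.next_hop(flow))
group.link_down("up1-a")
print("after failure: ", group.next_hop(flow))
```

Whether a given platform performs that shrink autonomously in hardware on link-down, or waits for the CPU to do it, is exactly the vendor-specific detail behind the conflicting answers the reader was getting.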
MUST READ: How to troubleshoot routing protocols session flaps
Did you ever experience an out-of-the-blue BGP session flap after you were running that peering for months? As Dmytro Shypovalov explains in his latest blog post, it’s always MTU (just kidding, of course it’s always DNS, but MTU blackholes nonetheless result in some crazy behavior).
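If you want to poke at path MTU yourself, here’s a Linux-only Python sketch (assuming the kernel exposes the usual IP_MTU_DISCOVER / IP_MTU socket options; the peer address is a placeholder). Note what it cannot do: in a genuine MTU blackhole the ICMP “fragmentation needed” message never comes back, the send succeeds, and the oversized packets silently disappear, which is precisely why these failures are so maddening to troubleshoot:

```python
# Linux-only sketch: force the DF bit on a UDP socket and inspect the
# kernel's cached path-MTU estimate. Peer address is a placeholder.
import socket

IP_MTU = getattr(socket, "IP_MTU", 14)     # socket option number on Linux

def probe_path_mtu(host, port=33434, payload=8972):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect((host, port))
    # Refuse kernel fragmentation: outgoing packets carry the DF bit
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MTU_DISCOVER,
                 socket.IP_PMTUDISC_DO)
    try:
        s.send(b"\x00" * payload)          # 9000-byte IP packet incl. headers
    except OSError as err:
        print(f"send refused locally: {err}")
    # The kernel caches its current path-MTU estimate on connected sockets
    print("kernel path-MTU estimate:", s.getsockopt(socket.IPPROTO_IP, IP_MTU))
    s.close()

probe_path_mtu("192.0.2.18")               # placeholder (TEST-NET-2) address
```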
New Ansible for Networking Engineers Content
When restructuring our online courses, we decided to make the video content that was previously part of the Ansible online course available with the Standard ipSpace.net Subscription.
If you haven’t enrolled in our automation online course (which always included the extra bits), you’ll find the following additional content in our Ansible for Networking Engineers webinar:
Video: Cisco SD-WAN Onboarding Process
After describing Cisco SD-WAN architecture and routing capabilities, David Penaloza focused on the onboarding process and the tasks performed by the Cisco SD-WAN solution (encryption, tunnel establishment, and device onboarding) in its so-called Orchestration Plane.
Don't Lift-and-Shift Your Enterprise Spaghetti into a Public Cloud
Jon Kadis spent most of his life working on enterprise networks, and sadly found out that even changing jobs and moving into a public cloud environment can’t save you from people trying to lift-and-shift enterprise IT kludges into a greenfield environment.
Here’s what he sent me:
Building Secure Layer-2 Data Center Fabric with Cisco Nexus Switches
One of my readers is designing a layer-2-only data center fabric (no SVI interfaces on switches) with stringent security requirements using Cisco Nexus switches, and he wondered whether a host connected to such a fabric could attack a switch, and whether it would be possible to reach the management network in that way.
Do you think it’s possible to reach the MANAGEMENT PLANE from the DATA PLANE? Is it valid to think there’s a potential attack vector someone could exploit to source traffic from the front of the device (ASIC), through the PCI bus to the CPU, across the PCI bus to the Platform Controller Hub, and through the I/O card out the management port onto that out-of-band network?
My initial answer was “of course there’s always a conduit from the switching ASIC to the CPU, how would you handle STP/CDP/LLDP otherwise”. I also asked Lukas Krattiger for more details; here’s what he sent me:
Worth Reading: The Shared Irresponsibility Model in the Cloud
A long while ago I wrote a blog post along the lines of “it’s ridiculous to allow developers to deploy directly to a public cloud while burdening them with all sorts of crazy barriers when deploying to an on-premises infrastructure,” effectively arguing for a self-service approach to on-premises deployments.
Not surprisingly, the reality is grimmer than I expected (I’m appalled at how optimistic my predictions are even though I always come across as a die-hard grumpy pessimist), as explained in The Shared Irresponsibility Model in the Cloud by Dan Hubbard.
For more technical details, watch cloud-focused ipSpace.net webinars, in particular the Cloud Security one.
Yet Again: CLI or API
Carl Montanari recently published an interesting blog post on the punditry of network APIs (including the hilarious fact that “SNMP is also an API”), and the person who sent me a link to that post commented that “it reminds me of a few blog posts you wrote a while ago.”
Speaking of those blog posts… last July I was getting bored and put together a list of interesting blog posts I published on that topic. Enjoy!
Podcast: State of Multi-Cloud Networking
In mid-September Ethan Banks invited me to chat about multi-cloud networking in the Day Two Cloud podcast. It was just a few weeks after Corey Quinn published a fantastic Multi-Cloud is the Worst Practice rant, which perfectly matched my observations, so I came well prepared ;)
Weird: Wrong Subnet Mask Causing Unicast Flooding
When I still cared about CCIE certification, I was always tripped up by the weird scenario combining (A) mismatched ARP and MAC address table timeouts with (B) a default gateway outside of the forwarding path. When everything lines up just right, you get persistent unicast flooding; I’ve met someone who reported average unicast flooding of ~1 Gbps in his data center fabric.
One would hope we wouldn’t experience similar problems in modern leaf-and-spine fabrics, but one of my readers managed to reproduce the problem within a single subnet of a FabricPath fabric with anycast gateways on the spine switches after someone misconfigured a subnet mask on one of the servers.
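If you want to see why the two timers interact so badly, here’s a back-of-the-envelope Python timeline using typical defaults (4-hour ARP timeout on the router, 300-second MAC aging on the switch) and assuming return traffic bypasses the switch, so the MAC entry is never refreshed:

```python
# Toy timeline: the ARP entry outlives the MAC table entry, and because
# return traffic takes a different path, the switch never re-learns the
# MAC. Timer values are common defaults; the traffic pattern is made up.
ARP_TIMEOUT = 4 * 3600   # router ARP entry, typical default (seconds)
MAC_AGING   = 300        # switch MAC table aging, typical default (seconds)

# The switch learned the server's MAC at t=0 and never hears from it again.
for t in (0, 60, 299, 300, 3600, ARP_TIMEOUT):
    arp_valid = t < ARP_TIMEOUT
    mac_known = t < MAC_AGING
    if arp_valid and not mac_known:
        state = "ARP valid, MAC aged out -> every frame is FLOODED"
    elif arp_valid:
        state = "ARP valid, MAC known -> unicast forwarding"
    else:
        state = "ARP expired -> router re-ARPs, switch re-learns the MAC"
    print(f"t={t:5d}s: {state}")
```

Between MAC aging and ARP expiry, every frame toward the server is flooded out all ports in the VLAN, potentially for hours.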