Disaster Recovery Faking, Take Two
An anonymous (for reasons that will be obvious pretty soon) commenter left a gem on my Disaster Recovery Test Faking blog post that is way too valuable to be left hidden and unannotated.
Here’s what he did:
Once I was tasked to do a DR test before handing over the solution to the customer. To simulate the loss of a data center I suggested to physically shutdown all core switches in the active data center.
That’s the right first step: simulate the simplest scenario (total DC failure) in a way that is easy to recover (power on the switches). It worked…
How Did We End With 1500-Byte MTU?
A subscriber sent me this intriguing question:
Is it not theoretically possible for Ethernet frames to be 64k long if ASIC vendors simply bothered or decided to design/make chipsets that supported it? How did we end up in the 1.5k neighborhood? In whose best interest did this happen?
Remember that Ethernet started as a shared-cable 10 Mbps technology. Transmitting a 64k frame on that technology would take approximately 50 msec (or as long as getting from the East Coast to the West Coast). Also, Ethernet had no tight media access control like Token Ring, so it would be possible for a single host to transmit multiple frames without anyone else getting airtime, resulting in unacceptable delays.
How Do You Provision a 500-Switch Network in a Few Days?
TL&DR: You automate the whole process. What else do you expect?
During the Tech Field Day Extra @ Cisco Live Europe 2019 we were taken on a behind-the-stage tour that included a chat with people who built the Cisco Live network, and of course I had to ask how they automated the whole thing. They said “well, we have the guy that wrote the whole system onsite and he’ll be able to tell you more”. Turns out the guy was my good friend Andrew Yourtchenko who graciously showed the system they built and explained the behind-the-scenes details.
Must read: Shades of Lock-in
Gregor Hohpe published an excellent series on Martin Fowler’s web site focusing on various aspects of lock-in. If nothing else, you SHOULD read the shades of lock-in part, and combine it with my thoughts on lock-in in data center networking.
Video: Retransmissions and Flow Control in Computer Networks
Grouping the features needed in a networking stack in a bunch of layered modules is a great idea. Unfortunately, you could place several essential features like error recovery, retransmission, and flow control in several different layers, from the data link layer dealing with individual network segments, to the transport layer dealing with reliable end-to-end transmissions.
Where should we put those modules? As always, the correct answer is it depends, in this particular case, on transmission reliability, latency, and bandwidth cost. You’ll find more details in the Retransmissions and Flow Control part of How Networks Really Work webinar.
Automation Solution: Network Health State Report
How nice would it be to have a fabric health dashboard displaying a summary of numerous parameters you’re interested in (number of operational uplinks, number of BGP sessions…) for every switch in your fabric.
I’m positive you could hack something together using the customization capabilities of your favorite network management system… or you could write a simple data gathering solution like Stephen Harding did while attending the Building Network Automation Solutions online course.
VMware NSX Killed My EVPN Fabric
I had an interesting discussion with someone running VMware NSX on top of VXLAN+EVPN fabric a while ago. That’s a pretty common scenario considering:
- NSX’s insistence on having all VXLAN uplink from the same server in the same subnet;
- Data center switching vendors being on a lemming-like run praising EVPN+VXLAN;
- The reluctance of non-FAANG environments to connect a server to a single switch.
Apart from the weird times when someone started tons of new VMs, his fabric was running well.
The Cost of Disruptiveness and Guerrilla Marketing
A Docker networking rant coming from my good friend Marko Milivojević triggered a severe case of Deja-Moo, resulting in a flood of unpleasant memories caused by too-successful “disruptive” IT vendors.
Imagine you’re working for a startup creating a cool new product in the IT infrastructure space (if you have an oversized ego you would call yourself “disruptive thought leader” on your LinkedIn profile) but nobody is taking you seriously. How about some guerrilla warfare: advertising your product to people who hate the IT operations (today we’d call that Shadow IT).
Optimizing Environment Setup in Ansible Playbooks
Have you ever seen an Ansible playbook where 90% of the code prepares the environment, and then all the work is done in a few template and assemble modules? Here’s an alternative way of getting that done. Is it better? You tell me ;)
Worth Reading: Anycast DNS in Enterprise Networks
Anycast (advertising the same IP address from multiple servers/locations) has long been used to implement scale-out public DNS services (the whole root DNS system runs on massive anycast), but it’s not as common in enterprise networks.
The blog posts written by Tom Bowles should get you there. He started with the idea and described his implementation using Infoblox DNS.
Want to know even more? I covered numerous load balancing mechanisms including anycast in Data Centers Infrastructure for Networking Engineers webinar.
Redundant BGP Connectivity on a Single ISP Connection
A while ago Johannes Weber tweeted about an interesting challenge:
We want to advertise our AS and PI space over a single ISP connection. How would a setup look like with 2 Cisco routers, using them for hardware redundancy? Is this possible with only 1 neighboring to the ISP?
Hmm, so you have one cable and two router ports that you want to connect to that cable. There’s something wrong with this picture ;)
Network Automation Beyond Configuration Templating
Remember Nicky Davey describing how he got large DMVPN deployment back on track with configuration templating? In his own words…:
Configuration templating is still as big win a win for us as it was a year ago. We have since expanded the automation solution, and reading the old blog post makes me realise how far we have come. I began working with this particular customer in May 2017, so 2 years now. At that time the new WAN project was on the horizon and the approach to network configuration was entirely manual.
Here’s how far he got in the meantime:
New Content: Azure Networking and Automation Source-of-Truth
Last week I covered network security groups, application security groups and user-defined routes in the second live session of Azure Networking webinar.
We also had a great guest speaker on the Network Automation course: Damien Garros explained how he used central source-of-truth based on NetBox and Git to set up a network automation stack from the grounds up.
Recordings are already online; you’ll need Standard ipSpace.net Subscription to access the Azure Networking webinar, and Expert ipSpace.net Subscription to access Damien’s presentation. Azure Networking webinar is also part of our new Networking in Public Clouds online course.
Changing Cisco IOS BGP Policies Based on IP SLA Measurements
This is a guest blog post by Philippe Jounin, Senior Network Architect at Orange Business Services.
You could use track objects in Cisco IOS to track route reachability or metric, the status of an interface, or IP SLA compliance for a long time. Initially you could use them to implement reliable static routing (or even shut down a BGP session) or trigger EEM scripts. With a bit more work (and a few more EEM scripts) you could use object tracking to create time-dependent static routes.
Cisco IOS 15 has introduced Enhanced Object Tracking that allows first-hop router protocols like VRRP or HSRP to use tracking state to modify their behavior.
Networking in Public Clouds - New ipSpace.net Online Course
I have exciting news I’d love to share with you: we’re launching a new online course focused on networking in public clouds starting in February 2020 (I’ve been mulling over this idea and polishing the concept for almost 18 months, and finally it all came together ;)
With Go To The Cloud becoming the answer to all questions (regardless of what the question is), you can find tons of materials describing various aspects of public clouds, so you might wonder why I decided to enter the fray. The answer is simple: with everyone being focused on developers, there’s not much that an infrastructure engineer could use to help him survive when the developers move on and he’s left to manage whatever they put in place.