Category: design
You're Responsible for Resiliency of Your Public Cloud Deployment
Enterprise environments usually implement “mission-critical” applications by pushing high-availability requirements down the stack until they hit networking… and then blame the networking team when the whole house of cards collapses.
Most public cloud providers are not willing to play the same stupid blame-shifting game - they live or die by their reputation, and maintaining a stable service is their highest priority. They will do their best to implement a robust and resilient infrastructure, but will not do anything that could impact its stability or scalability… including the snake oil the virtualization and networking vendors love to sell to their gullible customers. When you deploy your application workloads into a public cloud, you become responsible for the resiliency of your own application, and there’s no magic button that could allow you to push the problems down the stack.
Public Cloud Cannot Change the Laws of Physics
Listening to public cloud evangelists and marketing departments of vendors selling over-the-cloud networking solutions or multi-cloud orchestration systems, you could start to believe that migrating your workload to a public cloud would solve all your problems… and if you’re gullible enough to listen to them, you’ll get the results you deserve.
Unfortunately, nothing can change the fundamental laws of physics, networking, or application architectures:
Do We Need Math in Networking?
I should have known better, but a long while ago I couldn’t resist being pulled into a Twitter spat around the question of whether networking engineers need to know something about math.
Before going into the details, let’s start with the Wikipedia definition: “Engineering is the use of scientific principles to design and build machines, structures, and other things” including “specific emphasis on particular areas of applied mathematics, applied science, and types of application”.
So feel free to believe that you don’t need any math or other science in your job (because there’s very little science behind what we do in networking), in which case you might want to stop reading… but then at least please think twice about your job title.
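To make this a bit more concrete, here’s one example of the kind of math that keeps showing up in resilient network design (my illustration, not something from that Twitter thread): elementary probability applied to the availability of serial and parallel components.

```python
# Elementary availability math: serial chains multiply availabilities,
# parallel (redundant) designs multiply the residual failure probabilities.

def serial_availability(*availabilities: float) -> float:
    # Everything in the chain must work at the same time.
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel_availability(*availabilities: float) -> float:
    # The service is down only when every redundant element is down at once
    # (assuming independent failures, which is rarely entirely true).
    down = 1.0
    for a in availabilities:
        down *= (1 - a)
    return 1 - down

# Two 99.9%-available boxes (illustrative numbers):
print(f"in series:   {serial_availability(0.999, 0.999):.6f}")    # 0.998001
print(f"in parallel: {parallel_availability(0.999, 0.999):.6f}")  # 0.999999
```

Two decent boxes chained in series are less available than either of them alone; put them in parallel and the unavailability figures multiply instead. That’s all the math you need to start questioning some of the long device chains we keep building.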
Figure Out What Problem You’re Trying to Solve
A long while ago I got into a hilarious Tweetfest (note to self: don’t… not that I would ever listen) starting with:
Which feature and which Cisco router for layer2 extension over internet 100Mbps with 1500 Bytes MTU
The knee-jerk reaction was obvious: OMG, not again. The ugly ghost of BRouters (or is it RBridges or WAN Extenders?) has awoken. The best reply in this category was definitely:
I cannot fathom the conversation where this was a legitimate design option. May the odds forever be in your favor.
A dozen “this is a dumpster fire” tweets later the problem was rephrased as:
You Don't Need IP Renumbering for Disaster Recovery
This is a common objection I get when trying to persuade network architects they don’t need stretched VLANs (and IP subnets) to implement data center disaster recovery:
Changing IP addresses when activating DR is hard. You’d have to weigh the manageability of stretching L2 and protecting it, with the added complexity of breaking the two sites into separate domains [and subnets]. We all have apps with hardcoded IP’s, outdated IPAM’s, Firewall rules that need updating, etc.
Let’s get one thing straight: when you’re doing disaster recovery there are no live subnets, IP addresses or anything else along those lines. The disaster has struck, and your data center infrastructure is gone.
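That also means there’s nothing left to conflict with: once the workloads are recovered at the DR site, the same subnets can simply be advertised from there. Here’s a minimal sketch of that idea (my illustration, using documentation prefixes and a hypothetical probe address - not a recipe from the original blog post):

```python
#!/usr/bin/env python3
# Conceptual sketch: during a real disaster nothing else is announcing the
# original subnets, so the DR site can take them over unchanged.
import socket

PRIMARY_SITE_PROBE = ("203.0.113.1", 443)                 # hypothetical probe target in the primary DC
RECOVERED_PREFIXES = ["192.0.2.0/24", "198.51.100.0/24"]  # example subnets being recovered

def primary_site_reachable(timeout: float = 3.0) -> bool:
    # Crude reachability check; a real deployment would use proper monitoring
    # (and a human decision) before declaring a disaster.
    try:
        with socket.create_connection(PRIMARY_SITE_PROBE, timeout=timeout):
            return True
    except OSError:
        return False

if not primary_site_reachable():
    # The disaster has struck: the original subnets are gone, so the DR site
    # can re-advertise them as-is -- no renumbering, no stretched VLAN.
    for prefix in RECOVERED_PREFIXES:
        # Hand these to whatever routing stack the DR site uses
        # (shown as plain text to keep the sketch vendor-neutral).
        print(f"announce route {prefix} next-hop self")
```

A real deployment would obviously plug this into proper monitoring and the DR site’s routing stack; the point is merely that re-announcing the original prefixes replaces renumbering.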
Disaster Recovery and Failure Domains
One of the responses to my Disaster Recovery Faking blog post focused on failure domains:
What is the difference between supporting L2 stretched between two pods in your DC (which everyone does for seamless vMotion), and having a 30ms link between these two pods because they happen to be in different buildings?
I hope you agree that a single broadcast domain is a single failure domain. If not, let’s agree to disagree and move on - my life is too short to argue about obvious stuff.
Tuning BGP Convergence in High-Availability Firewall Cluster Design
Two weeks ago Nicola Modena explained how to design BGP routing to implement resilient high-availability network services architecture. The next step to tackle was obvious: how do you fine-tune convergence times, and how does BGP convergence compare to the more traditional FHRP-based design.
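To get a feel for the numbers involved, here’s some back-of-the-envelope failure-detection arithmetic (my illustration, not taken from Nicola’s article) - the worst-case time to detect a dead neighbor is what ultimately drives convergence in these designs:

```python
# Back-of-the-envelope failure-detection arithmetic (illustration only).

def bgp_hold_detection(hold_time_s: float) -> float:
    # BGP declares a neighbor dead when the hold timer expires
    # (common defaults: 60 s keepalive, 180 s hold time).
    return hold_time_s

def bfd_detection(tx_interval_ms: float, multiplier: int) -> float:
    # BFD declares the session down after `multiplier` consecutive missed packets.
    return tx_interval_ms * multiplier / 1000.0

def vrrp_master_down(advertisement_s: float, priority: int) -> float:
    # RFC 3768: Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time,
    # where Skew_Time = (256 - Priority) / 256 seconds.
    return 3 * advertisement_s + (256 - priority) / 256.0

print(f"BGP, default timers (180 s hold):   {bgp_hold_detection(180):6.1f} s")
print(f"BGP, tuned timers (9 s hold):       {bgp_hold_detection(9):6.1f} s")
print(f"BFD, 300 ms x 3:                    {bfd_detection(300, 3):6.1f} s")
print(f"VRRP, 1 s advertisements, prio 100: {vrrp_master_down(1, 100):6.1f} s")
```

With default BGP timers you could wait three minutes before the backup firewall takes over; tuned timers or BFD bring that down to seconds or less, which is exactly the kind of fine-tuning the follow-up discusses.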
Worth Reading: Resilience Engineering
Starting with my faking disaster recovery tests blog post, Terry Slattery wrote a great article delving into the intricacies of DR testing, types of expected disasters, and resilience engineering. As always, a fantastic read from Terry.
Using BGP for Firewall High Availability: Design and Software Upgrades
Remember the “BGP as High Availability Protocol” article Nicola Modena wrote a few months ago? He finally found time to extend it with BGP design considerations and a description of a seamless-and-safe firewall software upgrade procedure.
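The underlying idea is the usual drain-before-upgrade sequence; here’s a rough conceptual outline (my sketch with placeholder function names - not necessarily the exact procedure Nicola describes):

```python
# Rough outline of a drain-before-upgrade sequence for a BGP-based
# firewall pair. All function bodies are placeholders, not a real API.
import time

def drain(firewall: str) -> None:
    # Stop advertising the service routes from this firewall (or make them
    # less preferred) so new sessions land on its peer.
    print(f"{firewall}: withdraw service routes, wait for traffic to drain")

def upgrade(firewall: str) -> None:
    print(f"{firewall}: install the new software image and reload")

def restore(firewall: str) -> None:
    # Re-advertise the routes only after the firewall passes its own checks.
    print(f"{firewall}: health checks passed, re-announce service routes")

for firewall in ("fw-a", "fw-b"):   # one box at a time keeps the service up
    drain(firewall)
    time.sleep(1)                   # stand-in for "wait until sessions have moved"
    upgrade(firewall)
    restore(firewall)
```

The whole trick is that shifting the routes moves traffic to the other firewall before you touch the software, so the upgrade never takes the service down.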
Stretched VLANs and Failing Firewall Clusters
After publishing the Disaster Recovery Faking, Take Two blog post (you might want to read that one before proceeding) I was severely reprimanded by several people with ties to virtualization vendors for blaming virtualization consultants when it was obvious the firewall clusters stretched across two data centers caused the total data center meltdown.
Let’s chase that elephant out of the room first. When you drive too fast on an icy road and crash into a tree, who do you blame?
- The person who told you it’s perfectly OK to do so;
- The tire manufacturer who advertised how safe their tires were;
- The tires, for failing to ignore the laws of physics;
- Yourself, for listening to bad advice?
For whatever reason some people love to blame the tires ;)
Worth Reading: How Complex Systems Fail
I don’t remember who pointed me to the excellent How Complex Systems Fail document. It’s almost like RFC1925 – I could quote it all day long, and anyone dealing with large mission-critical distributed systems (hint: networks) should read it once a day ;))
Enjoy!
Disaster Recovery Faking, Take Two
An anonymous (for reasons that will be obvious pretty soon) commenter left a gem on my Disaster Recovery Test Faking blog post that is way too valuable to be left hidden and unannotated.
Here’s what he did:
Once I was tasked to do a DR test before handing over the solution to the customer. To simulate the loss of a data center I suggested to physically shutdown all core switches in the active data center.
That’s the right first step: simulate the simplest scenario (total DC failure) in a way that’s easy to recover from (power the switches back on). It worked…
Must read: Shades of Lock-in
Gregor Hohpe published an excellent series on Martin Fowler’s web site focusing on various aspects of lock-in. If nothing else, you SHOULD read the shades of lock-in part, and combine it with my thoughts on lock-in in data center networking.
Worth Reading: Anycast DNS in Enterprise Networks
Anycast (advertising the same IP address from multiple servers/locations) has long been used to implement scale-out public DNS services (the whole root DNS system runs on massive anycast), but it’s not as common in enterprise networks.
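The recipe is conceptually simple: every DNS server advertises the shared service address into the network, but only while its local resolver is healthy. Here’s a minimal sketch of that health-check loop (my illustration with documentation-prefix addresses and a deliberately crude liveness check - not code from Tom’s posts):

```python
#!/usr/bin/env python3
# Conceptual anycast-DNS health check: advertise the shared address only
# while the local DNS service responds. Addresses are documentation examples.
import socket
import time

ANYCAST_PREFIX = "192.0.2.53/32"   # shared service address (example value)
LOCAL_DNS = ("127.0.0.1", 53)      # the resolver running on this server

def dns_listening(timeout: float = 1.0) -> bool:
    # Crude liveness check (TCP connect to port 53); a production probe
    # would send a real query for a known record instead.
    try:
        with socket.create_connection(LOCAL_DNS, timeout=timeout):
            return True
    except OSError:
        return False

announced = False
while True:
    healthy = dns_listening()
    if healthy and not announced:
        # Hand this to the local routing daemon; shown as plain text
        # to keep the sketch vendor-neutral.
        print(f"announce route {ANYCAST_PREFIX} next-hop self")
        announced = True
    elif not healthy and announced:
        print(f"withdraw route {ANYCAST_PREFIX} next-hop self")
        announced = False
    time.sleep(2)
```

A failed server simply drops out of the routing table, and clients transparently end up at the nearest surviving instance.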
The blog posts written by Tom Bowles should get you there. He started with the idea and described his implementation using Infoblox DNS.
Want to know even more? I covered numerous load balancing mechanisms, including anycast, in the Data Centers Infrastructure for Networking Engineers webinar.
Redundant BGP Connectivity on a Single ISP Connection
A while ago Johannes Weber tweeted about an interesting challenge:
We want to advertise our AS and PI space over a single ISP connection. How would a setup look like with 2 Cisco routers, using them for hardware redundancy? Is this possible with only 1 neighboring to the ISP?
Hmm, so you have one cable and two router ports that you want to connect to that cable. There’s something wrong with this picture ;)