Category: high availability
Considering VMware’s enrapturement with vMotion the following news (reported by Salman Naqvi in a comment to my blog post) was clearly inevitable:
I was surprised to learn that LIVE vMotion is supported between on-premise and Vmware on AWS Cloud
What’s more interesting is how did they manage to do it?
Enterprise environments usually implement “mission-critical” applications by pushing high-availability requirements down the stack until they hit networking… and then blame the networking team when the whole house of cards collapses.
Most public cloud providers are not willing to play the same stupid blame-shifting game - they live or die by their reputation, and maintaining a stable service is their highest priority. They will do their best to implement a robust and resilient infrastructure, but will not do anything that could impact its stability or scalability… including the snake oil the virtualization and networking vendors love to sell to their gullible customers. When you deploy your application workloads into a public cloud, you become responsible for the resiliency of your own application, and there’s no magic button that could allow you to push the problems down the stack.
Listening to public cloud evangelists and marketing departments of vendors selling over-the-cloud networking solutions or multi-cloud orchestration systems, you could start to believe that migrating your workload to a public cloud would solve all your problems… and if you’re gullible enough to listen to them, you’ll get the results you deserve.
Unfortunately, nothing can change the fundamental laws of physics, networking, or application architectures:
This is a common objection I get when trying to persuade network architects they don’t need stretched VLANs (and IP subnets) to implement data center disaster recovery:
Changing IP addresses when activating DR is hard. You’d have to weigh the manageability of stretching L2 and protecting it, with the added complexity of breaking the two sites into separate domains [and subnets]. We all have apps with hardcoded IP’s, outdated IPAM’s, Firewall rules that need updating, etc.
Let’s get one thing straight: when you’re doing disaster recovery there are no live subnets, IP addresses or anything else along those lines. The disaster has struck, and your data center infrastructure is gone.
One of the responses to my Disaster Recovery Faking blog post focused on failure domains:
What is the difference between supporting L2 stretched between two pods in your DC (which everyone does for seamless vMotion), and having a 30ms link between these two pods because they happen to be in different buildings?
I hope you agree that a single broadcast domain is a single failure domain. If not, let agree to disagree and move on - my life is too short to argue about obvious stuff.
Two weeks ago Nicola Modena explained how to design BGP routing to implement resilient high-availability network services architecture. The next step to tackle was obvious: how do you fine-tune convergence times, and how does BGP convergence compare to the more traditional FHRP-based design.
Remember the “BGP as High Availability Protocol” article Nicola Modena wrote a few months ago? He finally found time to extend it with BGP design considerations and a description of a seamless-and-safe firewall software upgrade procedure.
After publishing the Disaster Recovery Faking, Take Two blog post (you might want to read that one before proceeding) I was severely reprimanded by several people with ties to virtualization vendors for blaming virtualization consultants when it was obvious the firewall clusters stretched across two data centers caused the total data center meltdown.
Let’s chase that elephant out of the room first. When you drive too fast on an icy road and crash into a tree who do you blame?
- The person who told you it’s perfectly OK to do so;
- The tire manufacturer who advertised how safe their tires were?
- The tires for failing to ignore the laws of physics;
- Yourself for listening to bad advice
For whatever reason some people love to blame the tires ;)
An anonymous (for reasons that will be obvious pretty soon) commenter left a gem on my Disaster Recovery Test Faking blog post that is way too valuable to be left hidden and unannotated.
Here’s what he did:
Once I was tasked to do a DR test before handing over the solution to the customer. To simulate the loss of a data center I suggested to physically shutdown all core switches in the active data center.
That’s the right first step: simulate the simplest scenario (total DC failure) in a way that is easy to recover (power on the switches). It worked…
The March 2019 Packet Pushers Virtual Design Clinic had to deal with an interesting question:
Our server team is nervous about full-scale DR testing. So they have asked us to stretch L2 between sites. Is this a good idea?
The design clinic participants were a bit more diplomatic (watch the video) than my TL&DR answer which would be: **** NO!
Let’s step back and try to understand what’s really going on:
Christoph Jaggi sent me this observation during one of our SD-WAN discussions:
The centralized controller is another shortcoming of SD-WAN that hasn’t been really addressed yet. In a global WAN it can and does happen that a region might be cut off due to a cut cable or an attack. Without connection to the central SD-WAN controller the part that is cut off cannot even communicate within itself as there is no control plane…
A controller (or management/provisioning) system is obviously the central point of failure in any network, but we have to go beyond that and ask a simple question: “What happens when the controller cluster fails and/or when nodes lose connectivity to the controller?”
A good friend of mine who prefers to stay A. Nonymous for obvious reasons sent me his “how I lost my data center to a broadcast storm” story. Enjoy!
Small-ish data center with several hundred racks. Row of racks supported by an end-of-row stack. Each stack with 2 x L2 EtherChannels, one EC to each of 2 core switches. The inter-switch link details don’t matter other than to highlight “sprawling L2 domains."
VLAN pruning was used to limit L2 scope, but a few VLANs went everywhere, including the management VLAN.
We all know about catastrophic headline-generating failures like AWS East-1 region falling apart or a major provider being down for a day or two. Then there are failures known only to those who care, like losing a major exchange point. However, I’m becoming more and more certain that the known failures are not even the tip of the iceberg - they seem to be the climber at the iceberg summit.
Every time I’m running a data center-related workshop I inevitably get pulled into stretched VLAN and stretched clusters discussion. While I always tell the attendees what the right way of doing this is, and explain the challenges of stretched VLANs from all perspectives (application, database, storage, routing, and broadcast domains) the sad truth is that sometimes there’s nothing you can do.
In those sad cases, I can give the workshop attendees only one advice: face the reality, and figure out how badly you might fail. It’s useless pretending that you won’t get into a split-brain scenario - redundant equipment just makes it less likely unless you over-complicated it in which case adding redundancy reduces availability. It’s also useless pretending you won’t be facing a forwarding loop.
One of the attendees of the Building Next-Generation Data Center online course solved the build small data center fabric challenge with Virtual Chassis Fabric (VCF). I pointed out that I would prefer not to use VCF as it uses centralized control plane and is thus a single failure domain.
Here are his arguments for using VCF: