Building network automation solutions

9 module online course

Start now!

Category: high availability

You're Responsible for Resiliency of Your Public Cloud Deployment

Enterprise environments usually implement “mission-critical” applications by pushing high-availability requirements down the stack until they hit networking… and then blame the networking team when the whole house of cards collapses.

Most public cloud providers are not willing to play the same stupid blame-shifting game - they live or die by their reputation, and maintaining a stable service is their highest priority. They will do their best to implement a robust and resilient infrastructure, but will not do anything that could impact its stability or scalability… including the snake oil the virtualization and networking vendors love to sell to their gullible customers. When you deploy your application workloads into a public cloud, you become responsible for the resiliency of your own application, and there’s no magic button that could allow you to push the problems down the stack.

read more see 1 comments

Public Cloud Cannot Change the Laws of Physics

Listening to public cloud evangelists and marketing departments of vendors selling over-the-cloud networking solutions or multi-cloud orchestration systems, you could start to believe that migrating your workload to a public cloud would solve all your problems… and if you’re gullible enough to listen to them, you’ll get the results you deserve.

Unfortunately, nothing can change the fundamental laws of physics, networking, or application architectures:

read more add comment

You Don't Need IP Renumbering for Disaster Recovery

This is a common objection I get when trying to persuade network architects they don’t need stretched VLANs (and IP subnets) to implement data center disaster recovery:

Changing IP addresses when activating DR is hard. You’d have to weigh the manageability of stretching L2 and protecting it, with the added complexity of breaking the two sites into separate domains [and subnets]. We all have apps with hardcoded IP’s, outdated IPAM’s, Firewall rules that need updating, etc.

Let’s get one thing straight: when you’re doing disaster recovery there are no live subnets, IP addresses or anything else along those lines. The disaster has struck, and your data center infrastructure is gone.

read more add comment

Disaster Recovery and Failure Domains

One of the responses to my Disaster Recovery Faking blog post focused on failure domains:

What is the difference between supporting L2 stretched between two pods in your DC (which everyone does for seamless vMotion), and having a 30ms link between these two pods because they happen to be in different buildings?

I hope you agree that a single broadcast domain is a single failure domain. If not, let agree to disagree and move on - my life is too short to argue about obvious stuff.

read more add comment

Stretched VLANs and Failing Firewall Clusters

After publishing the Disaster Recovery Faking, Take Two blog post (you might want to read that one before proceeding) I was severely reprimanded by several people with ties to virtualization vendors for blaming virtualization consultants when it was obvious the firewall clusters stretched across two data centers caused the total data center meltdown.

Let’s chase that elephant out of the room first. When you drive too fast on an icy road and crash into a tree who do you blame?

  • The person who told you it’s perfectly OK to do so;
  • The tire manufacturer who advertised how safe their tires were?
  • The tires for failing to ignore the laws of physics;
  • Yourself for listening to bad advice

For whatever reason some people love to blame the tires ;)

read more see 10 comments

Disaster Recovery Faking, Take Two

An anonymous (for reasons that will be obvious pretty soon) commenter left a gem on my Disaster Recovery Test Faking blog post that is way too valuable to be left hidden and unannotated.

Here’s what he did:

Once I was tasked to do a DR test before handing over the solution to the customer. To simulate the loss of a data center I suggested to physically shutdown all core switches in the active data center.

That’s the right first step: simulate the simplest scenario (total DC failure) in a way that is easy to recover (power on the switches). It worked…

read more see 11 comments

Disaster Recovery Test Faking: Another Use Case for Stretched VLANs

The March 2019 Packet Pushers Virtual Design Clinic had to deal with an interesting question:

Our server team is nervous about full-scale DR testing. So they have asked us to stretch L2 between sites. Is this a good idea?

The design clinic participants were a bit more diplomatic (watch the video) than my TL&DR answer which would be: **** NO!

Let’s step back and try to understand what’s really going on:

read more see 5 comments

Impact of Controller Failures in Software-Defined Networks

Christoph Jaggi sent me this observation during one of our SD-WAN discussions:

The centralized controller is another shortcoming of SD-WAN that hasn’t been really addressed yet. In a global WAN it can and does happen that a region might be cut off due to a cut cable or an attack. Without connection to the central SD-WAN controller the part that is cut off cannot even communicate within itself as there is no control plane…

A controller (or management/provisioning) system is obviously the central point of failure in any network, but we have to go beyond that and ask a simple question: “What happens when the controller cluster fails and/or when nodes lose connectivity to the controller?”

read more see 4 comments

Real-Life Data Center Meltdown

A good friend of mine who prefers to stay A. Nonymous for obvious reasons sent me his “how I lost my data center to a broadcast storm” story. Enjoy!


Small-ish data center with several hundred racks. Row of racks supported by an end-of-row stack. Each stack with 2 x L2 EtherChannels, one EC to each of 2 core switches. The inter-switch link details don’t matter other than to highlight “sprawling L2 domains."

VLAN pruning was used to limit L2 scope, but a few VLANs went everywhere, including the management VLAN.

read more see 3 comments

How Common Are Data Center Meltdowns?

We all know about catastrophic headline-generating failures like AWS East-1 region falling apart or a major provider being down for a day or two. Then there are failures known only to those who care, like losing a major exchange point. However, I’m becoming more and more certain that the known failures are not even the tip of the iceberg - they seem to be the climber at the iceberg summit.

read more see 10 comments

Decide How Badly You Want to Fail

Every time I’m running a data center-related workshop I inevitably get pulled into stretched VLAN and stretched clusters discussion. While I always tell the attendees what the right way of doing this is, and explain the challenges of stretched VLANs from all perspectives (application, database, storage, routing, and broadcast domains) the sad truth is that sometimes there’s nothing you can do.

You’ll find a generic version of that explanation in Building Active-Active and Disaster Recovery Data Centers webinar. Every few months I might be available for an onsite version of that same discussion, or you could engage one of the other ExpertExpress consultants.

In those sad cases, I can give the workshop attendees only one advice: face the reality, and figure out how badly you might fail. It’s useless pretending that you won’t get into a split-brain scenario - redundant equipment just makes it less likely unless you over-complicated it in which case adding redundancy reduces availability. It’s also useless pretending you won’t be facing a forwarding loop.

read more see 2 comments

To Centralize or not to Centralize, That’s the Question

One of the attendees of the Building Next-Generation Data Center online course solved the build small data center fabric challenge with Virtual Chassis Fabric (VCF). I pointed out that I would prefer not to use VCF as it uses centralized control plane and is thus a single failure domain.

In case you’re interested in data center fabric architecture options, check out this section in the Data Center Fabric Architectures webinar.

Here are his arguments for using VCF:

read more see 4 comments
Sidebar