How Hard Is It to Think about Failures?

Mr. A. Anonymous, a frequent contributor to my blog posts, left this bit of wisdom in a comment on the VMware NSX Update blog post:

I don't understand the statement that "whole NSX domain remains a single failure domain" because the 3 NSX controllers are deployed in the site with primary NSX manager.

I admit I was a bit imprecise (it wasn’t the first time), but is it really that hard to ask oneself “what happens if the DCI link fails?”

It’s amazing how many people continue to believe in the infallibility of redundant architectures years after they stop believing in Santa Claus or the Tooth Fairy.

Having redundant links (or routers or switches) doesn’t mean that your setup cannot fail, it only means that you might have reduced the probability of the failure. In practice, you might have reduced the reliability of your system because you made it more complex and thus harder to understand, configure and monitor.
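Here is a back-of-the-envelope sketch of that trade-off. It assumes independent component failures plus a hypothetical common-mode term (misconfiguration, software bug, operator error) that takes out both devices at once; the function names and all probabilities are made-up illustration values, not measurements.

```python
def redundant_pair_failure_probability(p_component: float, p_common_mode: float) -> float:
    """Probability that a redundant pair is down.

    The pair fails if both components fail independently OR if a shared
    (common-mode) event hits both. The two causes are assumed independent.
    """
    both_fail_independently = p_component ** 2
    return 1 - (1 - both_fail_independently) * (1 - p_common_mode)


if __name__ == "__main__":
    p_single = 0.01  # assumed failure probability of one non-redundant device
    for p_common in (0.0, 0.001, 0.02):  # assumed cost of the added complexity
        p_pair = redundant_pair_failure_probability(p_single, p_common)
        verdict = "better" if p_pair < p_single else "worse"
        print(f"common-mode={p_common}: pair={p_pair:.6f} ({verdict} than a single device)")
```

As soon as the assumed common-mode probability exceeds the failure probability of a single device, the redundant pair comes out worse than having no redundancy at all.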

Terry Slattery is an unending source of war stories about people who thought they had a redundant system but didn’t… like running HSRP on only one of the redundant routers.

Anyway, assuming that we agree there’s a non-zero chance of a DCI link failure, let’s consider the next question: what happens in the distributed NSX deployment described above when the DCI link fails? The ESXi hosts in the remote data center lose connectivity to the controller cluster, which means that they can no longer adapt to any topology change, VM adds or moves being the most trivial ones.
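The "what happens if this link fails?" question can be asked mechanically. The sketch below uses a hypothetical two-site topology (the node names and links are invented for illustration, not taken from any real deployment), removes one link, and lists the hosts that can no longer reach the controller cluster.

```python
from collections import deque

# Hypothetical topology: controllers live in site A, one DCI link joins the sites.
DCI_LINK = ("fabric-a", "fabric-b")
LINKS = [
    ("controllers", "fabric-a"),
    ("esxi-a1", "fabric-a"),
    ("esxi-b1", "fabric-b"),
    ("esxi-b2", "fabric-b"),
    DCI_LINK,
]
HOSTS = ["esxi-a1", "esxi-b1", "esxi-b2"]


def reachable_from(links, start):
    """Return every node reachable from `start` over undirected links (BFS)."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def hosts_cut_off(links, failed_link, hosts, controller="controllers"):
    """Hosts that lose controller connectivity when `failed_link` goes down."""
    surviving = [link for link in links if link != failed_link]
    still_connected = reachable_from(surviving, controller)
    return sorted(h for h in hosts if h not in still_connected)


if __name__ == "__main__":
    print(hosts_cut_off(LINKS, DCI_LINK, HOSTS))
```

Running the same check for every link in a design is a cheap way to find the failures nobody thought about.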

Behind the Scenes

If you still want to move forward with this design, it makes sense to understand the particulars (keeping in mind they might change across software releases, potentially making your design implodable-on-failure).

This is the feedback I got from Dmitri Kalintsev:

In a single vCenter deployment, all DRS and VM start operations for the isolated DC would also cease (but the existing VMs would continue to run), so you should be fine from the overlay network connectivity perspective. You're still exposed in case of host failures with subsequent HA restarts.

ARP suppression should not be a factor, since connections from ESXi hosts to NSX controllers are TCP, and hosts would not be trying to consult the controller for ARP suppression if the TCP connection is down.

With a site-local vCenter, DRS / VM start operations are possible, and if executed will likely lead to connectivity problems when the link to the site with the Controllers is down.

This could be taken care of (by setting DRS to something other than full auto, and starting VMs only via something like vRealize Automation which lives in the primary DC), but it's something that will need to be thought out beforehand.

Distributed logical routers (DLR) may experience problems at the site disconnected from the Controller Cluster. In some cases DLR-originated (ARP) and DLR-routed traffic may fail to reach destinations due to DLR's VNI join failure caused by loss of Controller connectivity.

DLR instance on an ESXi host has to join the VNI (VXLAN segment) of the target VM if it wants to forward the traffic to the destination IP address, and that process is traffic- and not topology-driven. If you want to understand the underlying problems, read this blog post (and follow all the links); packet walks from the Overlay Virtual Networking webinar would also be useful.

Also, any routing topology changes learned dynamically via BGP or OSPF by the isolated site's DLR Control VM won't be sent to the DLR kernel module on the hosts, since this process relies on the Controllers.
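The traffic-driven VNI join described above can be illustrated with a toy model. This is not NSX code and mirrors no NSX API; the class, its attributes, and the VNI numbers are hypothetical, and the only behavior it encodes is the one described in the feedback: a join needs the controller, and it is triggered by the first packet toward a segment, not by a topology change.

```python
class ToyDlrHost:
    """Toy model of a DLR instance on one host (hypothetical, not an NSX API)."""

    def __init__(self):
        self.controller_reachable = True
        self.joined_vnis = set()

    def _join(self, vni: int) -> bool:
        # Assumption from the text: joining a VNI requires controller connectivity.
        if not self.controller_reachable:
            return False
        self.joined_vnis.add(vni)
        return True

    def route_to(self, dst_vni: int) -> bool:
        """Return True if traffic toward `dst_vni` can be forwarded."""
        if dst_vni in self.joined_vnis:
            return True  # joined earlier, keeps working while isolated
        return self._join(dst_vni)  # first packet triggers the join


if __name__ == "__main__":
    host = ToyDlrHost()
    host.route_to(5001)                # join happens while the controller is reachable
    host.controller_reachable = False  # DCI link fails
    print(host.route_to(5001))         # segment joined before the failure
    print(host.route_to(5002))         # segment first used after the failure
```

In this model, the segments a host happened to use before the failure keep working, and the ones it first needs afterwards do not, which is why the impact is hard to predict in advance.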


  1. Ivan:

    I wanted to add a similar story, not related to DC. For years and years (and even still) most people deploying firewalls do so in a redundant fashion. They set up high-speed state sync between two boxes. If you were to ask any of them why they do this, they almost invariably say, "Because redundancy."

    This is the truth, though. Most firewall failures have been the result of high availability features on the firewall. If one of the firewalls vomits all over the place, it almost invariably screws up the high availability features. When this is discussed with firewall folk or the vendor, we eventually get around to a statement like this: "These features are really about failures that happen around the firewall, not for failures in the firewalls themselves." That is, if a switch link, switch, or router dies, then traffic will find its way to the opposite firewall and everybody will be happy.

    The issue with this is that the firewall failures that do happen because of these features tend to be things like configuration changes to the firewalls, port scans on a common subnet between them, routing events when the firewalls are routing, etc.

    It has been my recommendation for some time to just not use these features on firewalls. Build a redundant infrastructure in the dedicated forwarding stuff around the firewalls. Ensure that traffic will tend to find its way back to the same firewall as much as possible. Without these HA features enabled, firewalls have much lower failure rates. Overall everyone will be happier.

    Personally, I put firewall state sync features in the same bucket as ISSU features: Wishful thinking at best, destructive nonsense at worst.

    aka @Cloudtoad
    1. Amen to this, I've long argued the point that HA on firewalls is an exercise in fate sharing, not a fine example of how we should be managing failover between devices.

      Yes, we need redundancy; no, we don't need our backup device sharing the same fate as our primary because user error or some undocumented feature (bug) wrote rubbish all over our config and (as you aptly put it) vomited all over itself or trashed the state table.

      I'd much rather wear the risk of having a pair of independent firewalls, manage the rulebase independently (SDN use case here to orchestrate?), and rely on L3 to deal with my redundancy. Sure, I'll lose state and have a momentary interruption, but I'll have isolated the failure domain, and if the business accepts the risk, where's the harm in having an outage for a few seconds as L3 reconverges, as opposed to trying to fix broken/misconfigured firewalls?
    2. I can agree with a lot of this. You do create problems elsewhere though. Maintaining consistent rule-sets across independent firewalls can be a small nightmare in itself. It really just depends on what level of redundancy you're designing for here. The same applies to say... load balancers.
    3. "Maintaining consistent rule-sets across independent firewalls can be a small nightmare"

      That's true as long as you configure rulesets manually. You don't have that problem if you generate them from a template, and deploy them automatically.
    4. Some firewall vendors allow state sync between independent firewalls, which can be centrally managed so that the firewall policy is consistent. You can then allow your surrounding devices to determine which firewall path to take, and the only thing to worry about is the state sync.
    5. And what about "next gen" FW features which make decisions not just on transit traffic properties, but also on data extrapolated from an external source like LDAP server and correlated to the packets transiting the FW? Seems like the same shared fate issue exists, maybe even worse.
  2. Ivan, a read-only network is not really a failed network, especially given that you are not a big fan of VMotion. :) Applications will still be working, giving you time to change the NSX mode or redeploy the controller cluster.

    Another discussion is what a failure domain is. Is it when the impact of a failure in one part of a network or Data Center is propagated to another part? Actually, you have a non-zero failure probability in any solution under a common administration. Even in a BGP-only solution someone can inject an inappropriate subnet by mistake, causing an outage in all Data Centers. You can say that NSX can cause an outage by design, not because of a human mistake. IMHO not really. Under a common administration one mistake in a prefix policy can cause fate sharing even in BGP-based Data Centers (without SDN). So in BGP a mistake is also propagated, unless there are different admins and policies, which just decreases the probability. So there is better control in BGP, but does it mean that your setup cannot fail? No, it still can. Does it mean that the BGP-based DC is a single failure domain? Partially yes. The YouTube prefix hijack case proves that in the end the Internet is also a single failure domain. Of course the failure probability of BGP is lower than that of an L2 VLAN extension or an SDN solution, but it is still there. What do you think?

    Thank you for your interesting posts!
    Kind regards,

    1. Piotr,

      Thanks for a very elaborate answer. I think Dmitri provided enough in-depth information on what might fail in what scenario for the readers to form their own opinion.

      And yes, there is only one absolute guarantee in life (as we know it so far); everything else has a non-zero failure rate. However, you _could_ protect yourself against certain failures (even though you don't) and you _can't_ protect yourself against certain other failures. In the BGP case you mention, it was not a BGP failure, but negligence on the part of the upstream provider who had no BGP filters toward their customers, so you really can't compare the two. Even the best tool can fail when used improperly.

  3. To add to my previous post, maybe you should consider solutions in two categories of a single failure domain:
    1. The level of impact of a data plane domain.
    2. The level of impact of a control plane domain.

    In the first category there are L2 extensions: VLANs, VPLS, OTV, VXLAN, etc. In the second category there are the NSX controller cluster, BGP, etc. Every technology has its own failure probability. So a failure domain in VXLAN is much smaller and more segmented compared to natively extending a VLAN as DCI. In the control plane category BGP has a lower failure probability than the others.