Layer-2 Network Is a Single Failure Domain

This topic has been on my to-write list for over a year. Its working title was phrased as a question, but all the horror stories you’ve shared with me over the last year or so (some of them published on my blog) have persuaded me that there’s no question: it’s a fact.

If you think I’m rephrasing the same topic ad nauseam, you’re right, but every month or so I get an external trigger that pushes me back to the same discussion, this time an interesting comment thread on Massimo Re Ferre’s blog.

There are numerous reasons why we’re experiencing problems with transparently bridged Ethernet networks, ranging from challenges inherent in the design of Spanning Tree Protocol (STP) and suboptimal implementations of STP, to flooding behavior inherent in transparent bridging.

You can solve some of these issues with novel technologies like SPB (802.1aq) or TRILL, but you can’t change the basic fact: once you get a loop in a bridged network, and a broadcast packet caught in that loop (and flooded all over the network every time it’s forwarded by a switch), you’re toast.
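To see why a loop is so deadly, here’s a minimal sketch (not a model of any real switch) of transparent-bridging flooding: every switch forwards a broadcast out of every inter-switch port except the one it arrived on, and an Ethernet frame carries no hop count that would ever retire it. The `flood_steps` function and the toy topologies are made up for illustration; MAC learning, host ports and STP are deliberately left out.

```python
from collections import Counter


def flood_steps(links, source, steps):
    """Return how many copies of one broadcast are in flight after each forwarding step."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each in-flight copy is keyed by (sending switch, receiving switch).
    # The original frame comes from a host attached to `source`, hence sender None.
    in_flight = Counter({(None, source): 1})
    history = []
    for _ in range(steps):
        next_round = Counter()
        for (sender, receiver), copies in in_flight.items():
            for peer in neighbors[receiver]:
                if peer != sender:  # flood everywhere except the ingress port
                    next_round[(receiver, peer)] += copies
        in_flight = next_round
        history.append(sum(in_flight.values()))
    return history


TREE = [("A", "B"), ("B", "C")]                # loop-free chain
TRIANGLE = [("A", "B"), ("B", "C"), ("C", "A")]  # three switches, one loop
FULL_MESH = [("A", "B"), ("A", "C"), ("A", "D"),
             ("B", "C"), ("B", "D"), ("C", "D")]

print(flood_steps(TREE, "A", 4))       # [1, 1, 0, 0]
print(flood_steps(TRIANGLE, "A", 4))   # [2, 2, 2, 2]
print(flood_steps(FULL_MESH, "A", 4))  # [3, 6, 12, 24]
```

On the loop-free chain the frame dies out at the edge. In the three-switch triangle two copies circulate forever (one in each direction), and in the four-switch full mesh the number of copies doubles with every hop.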

SPB aficionados will tell me loops cannot happen in SPB networks because of RPF checks. Just wait till you hit the first interesting software bug or an IS-IS race condition.

Yes, there’s storm control, and you can deploy it on every link in your network, but the single circulating broadcast packet (and its infinite copies) will trigger storm control on all switches, and prevent other valid broadcasts (for example, ARP requests) from being propagated, effectively causing a DoS attack on the whole layer-2 domain. Furthermore, the never-ending copies of the same broadcast packets delivered to the CPU of every single switch in the layer-2 domain will eventually start interfering with the control-plane protocols, causing further problems.
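The starvation effect is easy to illustrate if you assume storm control behaves like a simple per-interval counter: forward broadcasts until the budget for the current interval is used up, then drop everything until the next interval starts. That model is an assumption (real implementations differ between platforms), and `apply_storm_control`, `traffic` and the numbers below are hypothetical.

```python
def apply_storm_control(frames, budget_per_interval):
    """Forward at most `budget_per_interval` broadcasts per interval, drop the rest.

    `frames` is an ordered list of (interval, label) tuples.
    """
    used = {}
    forwarded = []
    for interval, label in frames:
        if used.get(interval, 0) < budget_per_interval:
            used[interval] = used.get(interval, 0) + 1
            forwarded.append((interval, label))
    return forwarded


def traffic(intervals, looped_copies):
    """One ARP request per interval, arriving in the middle of the looped copies."""
    frames = []
    for interval in range(intervals):
        half = looped_copies // 2
        frames += [(interval, "loop")] * half
        frames.append((interval, "arp"))
        frames += [(interval, "loop")] * (looped_copies - half)
    return frames


def count(forwarded, label):
    """Count forwarded frames carrying the given label."""
    return sum(1 for _, seen in forwarded if seen == label)


healthy = apply_storm_control(traffic(3, 0), budget_per_interval=100)
looped = apply_storm_control(traffic(3, 1000), budget_per_interval=100)

print(count(healthy, "arp"))                        # 3
print(count(looped, "arp"), count(looped, "loop"))  # 0 300
```

Without the loop, every ARP request gets through. With the loop, the circulating copies eat the whole budget in every interval and the ARP requests never make it.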

The obvious conclusion: a transparently bridged network (aka layer-2 network or VLAN) is a single failure domain.

Why am I telling you this (again and again)?

Some people think that you experience bridging-related problems only if you’re big enough, and that everything is going to be fine if you have fewer than a thousand VMs, fewer than a hundred servers, fewer than ten switches … or whatever other number you come up with to pretend you’re safe. That’s simply not true – I’ve seen a total network meltdown in a (pretty small) data center with three (3) switches.

The only difference between a small(er) and a big(ger) data center is that you might not care if your small data center goes offline for an hour or so. If you do care, then you simply have to split it up into multiple layer-2 domains connected through layer-3 switches (or load balancers or firewalls if you so desire).

If you’re serious about the claims that you have mission-critical applications that require high availability (and everyone claims they have them), then you simply have to create multiple availability zones in your network, and spread multiple copies of the same application across them. As Amazon proved, even multiple availability zones might not be good enough, but having them is infinitely better than having a single failure domain.

The usual counterarguments

This is what I usually hear after presenting the above sad facts to data center engineers: “there’s nothing we can do”, “but our users require unlimited VM mobility”, “our applications won’t work otherwise” and a few similar ones. These are all valid claims, but as always in life, you have to face the harsh reality: either you do it right (and everyone accepts the limitations of doing it right), or you’ll pay for it in the future.

Other options?

As always in the IT world, there’s a third way: use MAC-over-IP network virtualization (in the form of VXLAN, NVGRE or STT). Once these technologies get widely adopted and implemented in firewalls and load balancers (or we decide to migrate from physical to virtual appliances), they’ll be an excellent option. In the meantime, you have to choose the lesser evil (whatever you decide it is).
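If “MAC-over-IP” sounds abstract: the bridged frame simply becomes the payload of a UDP/IP packet, so the transport network routes it like any other IP traffic. Here’s a minimal sketch of the 8-byte VXLAN header as I understand the VXLAN specification (8 flag bits with the VNI-valid bit set, 24 reserved bits, a 24-bit VNI, 8 reserved bits). The function names are made up for this example, and the outer UDP/IP headers are left out.

```python
import struct

VNI_VALID_FLAG = 0x08  # the "I" bit in the VXLAN flags byte


def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend a VXLAN header to a complete inner Ethernet frame."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits); word 2: VNI (24 bits) + reserved (8 bits).
    header = struct.pack("!II", VNI_VALID_FLAG << 24, vni << 8)
    return header + inner_frame


def vxlan_decapsulate(payload: bytes):
    """Return (vni, inner_frame) from the UDP payload of a VXLAN packet."""
    if len(payload) < 8:
        raise ValueError("payload shorter than a VXLAN header")
    word1, word2 = struct.unpack("!II", payload[:8])
    if not (word1 >> 24) & VNI_VALID_FLAG:
        raise ValueError("VNI-valid flag not set")
    return word2 >> 8, payload[8:]


packet = vxlan_encapsulate(b"inner Ethernet frame", vni=5000)
print(packet[:8].hex())           # 0800000000138800
print(vxlan_decapsulate(packet))  # (5000, b'inner Ethernet frame')
```

The encapsulation does nothing about flooding inside the virtual segment; it only keeps the bridged frames out of the forwarding logic of the transport network.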

More information

You probably know you’ll find a lot more information in my data center and virtualization webinars, but there’s also a book I would highly recommend to anyone considering more than just how to wire a bunch of switches together: Scalability Rules is an awesome collection of common-sense, field-tested scalability rules (including a number of networking-related recommendations not very dissimilar from what I’m always telling you). Finally, if you’d like to have my opinion on your data center design, check out the ExpertExpress service.

27 comments:

  1. ...and, because it's a widely recognized problem, work is being done on putting up some kludges to make it better. Some examples from ALU land:

    STP Loop guard: http://www.alcatelunleashed.com/viewtopic.php?f=190&t=18462&start=20#p66200

    MAC Move: http://lucent-info.com/html/93-0107-08-04-H/7450%20Services%20Guide/wwhelp/wwhimpl/common/html/wwhelp.htm#href=services_con_vpls.12.20.html&single=true

    I bet other vendors have something similar. ;)

    1. Sure they do and a lot of people bet such features always work flawlessly ;)

    2. Everything is a compromise - you don't have to solve the problem to solve the problem. You only need to get it to where the risk is acceptable.

      Sometimes, when your faith in your Vendor's new features is somewhat shaky, that can mean "just run spanning tree and be cautious". ;)

  2. I wonder if we'll ever be able to modify Ethernet to make hosts register their MAC/IP entries and just remove broadcast.

    http://www.cs.cmu.edu/~acm/papers/myers-hotnetsIII.pdf

    Does IPv6 with IGMP snooping switches require flooding or could it be removed?

    1. IGMP snooping (actually, MLD) reduces the flooding scope for multicast destination MAC addresses. Broadcast and unknown unicast flooding is not affected.

    2. Yes, but my point was: does an IPv6-only network with MLD need broadcast and unknown unicast flooding enabled at all? Or could those functions be disabled?

    3. There might still be applications using broadcast. Assuming those eventually become extinct and you're running an IPv6-only network, you could still experience TCAM overflows that would then require unicast flooding to ensure end-station reachability. Also, you'd have to ensure the IPv6 ND timeouts are lower than the MAC aging timeouts ... but in theory, it could be done.

  3. Sounds familiar ;-)

    That picture is right up there with the Swiss Army knife!

  4. Ivan,

    As usual, a quality post, but regarding your statement, "...then you simply have to split it up into multiple layer-2 domains connected through layer-3 switches", would you care to elaborate? I work in data centers, but on the facilities side, and I have been burned many times by industrial devices that have a poor or limited TCP/IP stack or, in some cases, cannot route back to their server, leaving me with having to span layer 2 across a couple of switches. I have implemented storm control but, as you mentioned, that may not be enough to stop a meltdown. I'm curious how I can overcome that hurdle while following your recommendation to split the layer-2 domain with layer-3 switches.

    1. The answer is that there is no way to overcome the hurdle. Create several interfaces on the server, each on its own VLAN.

  5. Well, I don't quite get something. So we state that L2 network is a single failure domain. Alright. But now we get the same L2 network wrapped and tunneled over IP, and it's no longer a single failure domain? :) Have we magically agreed on fixing the flooding behavior, which is the actual root cause of L2 scalability limitation, in any of these standards?

    1. My thoughts exactly ... :)

    2. L2 network is still a single failure domain, even if it's wrapped in IP (that's why using VXLAN or NVGRE for long-distance stretched clusters makes no sense), but at least the underlying transport is not.

      As for "fixing the flooding behavior", Nicira got pretty far (VXLAN and NVGRE have just inserted another layer of abstraction and resurrected IP MC) and can do either headend replication or replication in dedicated nodes.

      The only one that decided to go all the way and kill flooding was Amazon; everyone else is too concerned about precious enterprise craplications that rely on L2 flooding in one stupid way or another.

    3. I can see how VXLAN/NVGRE may *narrow* flooding, but can you really kill it? e.g. VM instantiation still involves G-ARP AFAIK...

    4. VXLAN or NVGRE cannot kill flooding because they have no control plane (although Dell did announce an ARP helper appliance @ Interop, so who knows what will happen to NVGRE).

      Nicira's NVP is a totally different story. They might not be totally there yet, but the architecture does allow that.

  6. As a reminder, routing is not a magic solution. An IGP is also a single fault domain; just wait until you hit a bug in an IS-IS implementation or have bad luck with packet corruption. Been there, done that.

    1. The difference is that routing control plane is, well, more controlled with regards to information flooding :) So in theory, you could reduce the disruption risks, if designed and operated properly.

      This being said, control-plane failure examples are always epic. Especially at the Internet scale :)

  7. I'm not sure I agree anymore. I run a single tenant data center for a huge company (60K plus users)

    I can have a huge network that is broken up into a huge amount of small L2 broadcast domains all connected to the same core switch pair (or aggregate if you're a Cisco guy)

    If one of those tiny L2 broadcast domains loops, then your core switches lock up, and your whole network goes down. "single circulating broadcast packet (and its infinite copies) will trigger storm control on ALL SWITCHES, and prevent other valid broadcasts"

    I've tested various loop scenarios in a large-scale network (300+ ToR switches and a pair of Cisco 7Ks). I've found storm control doesn't work in 10Gb networks with Cisco FEX. I've found port-security does work well, although it increases trouble tickets (opex). I've found that default CoPP works awesomely in keeping your Nexus 7Ks alive so you can find the loop. I've found my best bet is to configure the network to prevent loops and not try to configure around loops. And screw the ideals of preventing loops by telling your cabling crew to cable properly!! That will work for 6 months or a year until they forget again.

    So who cares how big your L2 domains are? And if you have the same aggregate switch pair (everyone does), then it doesn't matter how many load balancers or firewall instances you have. I'd say your chances of taking out your data center are the same. In fact... if you cable differently for smaller L2 domains, then I'd say your chances go up! But you do lose mobility and scalability the smaller you make your L2 domains.

    I don't even want to talk about MAC-in-MAC or MAC-in-multicast. No one is there yet.

    I don't even want to talk about STP replacement. No one is there yet.

    I'd perhaps consider multiple aggregate layer switches (if the 7Ks had the capacity for more than four VDCs). That limitation makes VDCs useless except for your development instance.

    Ivan, btw, will you be in San Diego this month?

    1. Would a cheat-to-win solution to the cable guy problem be to leave unused ports disabled?

    2. Will,

      Perhaps if you use different core switch pairs for each L2 domain, you'll be able to avoid fate-sharing in this case. It's more expensive, and not as fancy as VDCs, but it should meet your requirements.

      I personally don't have the budget to implement multiple N7K pairs in my DC, so I can't take this advice either :(

      Jeremy

  8. PBB-TE / PBT avoid these problems by turning off broadcasts (all, including unknown MAC flooding) and forcing the switches to use an outboard control plane to provision all MAC forwarding tables.

    I'm sure OpenFlow has a similar magic unicorn fart type answer to this problem....

    Now if only someone could solve the problem in practice as well as theory.

    1. If I understood this right, it is not about Broadcast alone.

      It's about errors (software bugs, manual mistakes, or just race conditions) that can cause loops.

      -Vishwas

  9. Ivan,

    Thanks for calling out storm control as the preemptive DoS attack that it is.

    That feature is just a turd in my opinion:

    1) The granularity is terrible - it's not a throttle, it just counts bytes over a 1 second (I think?) interval, and then throws data away for the remainder of the second once the threshold has been hit. Awful, unless you set it to 10% with the expectation that your server ports will be deaf to some protocols for 90% of every second...

    2) Who am I (as the network admin) to decide when a server has sent "too many" broadcast or multicast frames? I've never seen anything along these lines codified in an SLA given to the user/server/application community. Accordingly, if a craplousy business application is built to work exclusively with broadcast frames, and it runs into storm control, guess what feature is going to be switched off?

    1. 2) when they take out a data center

  10. Ivan,

    I agree with most of what you say.

    TRILL does have loop prevention by adding a Hop Count, as the last measure of breaking loops.

    Tunneling can cause loops too BTW. This is something that has been raised with IPv6. Have a look at:
    http://tools.ietf.org/html/rfc6324

    -Vishwas

    1. Exactly my thinking.

      TRILL is actually closer to routing than to bridging; it's an L2 protocol that incorporates most of the benefits of L3 protocols. TTLs prevent infinite loops, and in Cisco's FabricPath you get conversational MAC learning, which eases the burden of having to learn all addresses in your L2 domain.

      Still, I wonder how is it that SAN admins got away with deploying two independent and separate networks, and LAN admins did not. It is evident that the first hop L2 connection is a single failure domain, no matter how resilient it is.

      It's time Operating Systems and applications start supporting dual LAN designs so we can effectively protect this single failure domain.

    2. @Pablo: You can have dual LAN design any time you want - just configure a loopback interface on the server and run a routing protocol with the network ... or configure the load balancer with both server addresses in the server pool.

      There's no technological reason you can't do it ... apart from the limitations of broken socket API and missing session layer in TCP stack ;)

Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.