Layer-2 Network Is a Single Failure Domain
This topic has been on my to-write list for over a year and its working title was phrased as a question, but all the horror stories you’ve shared with me over the last year or so (some of them published in my blog) have persuaded me that there’s no question – it’s a fact.
If you think I’m rephrasing the same topic ad nauseam, you’re right, but every month or so I get an external trigger that pushes me back to the same discussion, this time an interesting comment thread on Massimo Re Ferre’s blog.
There are numerous reasons why we’re experiencing problems with transparently bridged Ethernet networks, ranging from challenges inherent in the design of Spanning Tree Protocol (STP) and suboptimal implementations of STP, to flooding behavior inherent in transparent bridging.
You can solve some of these issues with novel technologies like SPB (802.1aq) or TRILL, but you can’t change the basic fact: once you get a loop in a bridged network, and a broadcast packet caught in that loop (and flooded all over the network every time it’s forwarded by a switch), you’re toast.
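To illustrate why a single looped broadcast is so devastating, here’s a back-of-the-envelope Python sketch (definitely not a real simulator, and the full-mesh topology is purely hypothetical): every switch floods the broadcast out of all ports except the one it arrived on, and because Ethernet frames carry no hop count, the number of in-flight copies keeps growing once STP fails to block the redundant links.

```python
# Toy model of broadcast flooding in a bridged network with redundant links
# (hypothetical full mesh of four switches). Each switch forwards every copy
# of the broadcast out of all ports except the ingress port; nothing ever
# removes a copy, because there is no hop count in the Ethernet header.
links = {
    "S1": ["S2", "S3", "S4"],
    "S2": ["S1", "S3", "S4"],
    "S3": ["S1", "S2", "S4"],
    "S4": ["S1", "S2", "S3"],
}

def flood_rounds(start_switch, rounds):
    in_flight = [(start_switch, None)]         # (current switch, switch it came from)
    for r in range(1, rounds + 1):
        next_copies = []
        for switch, came_from in in_flight:
            for neighbor in links[switch]:
                if neighbor != came_from:       # flood out every port except the ingress port
                    next_copies.append((neighbor, switch))
        in_flight = next_copies
        print(f"after {r} forwarding rounds: {len(in_flight)} copies of one broadcast")

flood_rounds("S1", 6)   # 3, 6, 12, 24, 48, 96 ... and it never stops
```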
SPB aficionados will tell me loops cannot happen in SPB networks because of RPF checks. Just wait till you hit the first interesting software bug or an IS-IS race condition.
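For readers wondering what an RPF check actually buys you, here’s a minimal Python sketch (the per-switch table is purely hypothetical; in SPB or TRILL it would come from the IS-IS shortest-path computation). A switch accepts multi-destination traffic from a given source only if it arrives on the expected interface – which works exactly as long as that table agrees with where frames really arrive.

```python
# A minimal sketch of a reverse-path-forwarding (RPF) check for flooded traffic,
# using a hypothetical table mapping each source switch to the interface frames
# from that source are expected to arrive on.
expected_ingress = {
    "S1": "port1",   # frames originated by S1 should arrive on port1
    "S2": "port2",
}

def rpf_accept(source_switch, arrival_port):
    """Accept a flooded frame only if it arrived on the expected interface."""
    return expected_ingress.get(source_switch) == arrival_port

print(rpf_accept("S1", "port1"))  # True  - frame follows the computed tree
print(rpf_accept("S1", "port2"))  # False - dropped ... unless the table is stale or two switches disagree
```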
Yes, there’s storm control, and you can deploy it on every link in your network, but a single circulating broadcast packet (and its infinite copies) will trigger storm control on all switches, and prevent other valid broadcasts (for example, ARP requests) from being propagated, effectively causing a DoS attack on the whole layer-2 domain. Furthermore, the never-ending copies of the same broadcast packet delivered to the CPU of every single switch in the layer-2 domain will eventually start interfering with the control-plane protocols, causing further problems.
The obvious conclusion: a transparently bridged network (aka layer-2 network or VLAN) is a single failure domain.
Why am I telling you this (again and again)?
Some people think that you experience bridging-related problems only if you’re big enough, but that everything is going to be fine if you have fewer than a thousand VMs, fewer than a hundred servers, fewer than ten switches … or whatever other number you come up with to pretend you’re safe. That’s simply not true – I’ve seen a total network meltdown in a (pretty small) data center with three (3) switches.
The only difference between a small(er) and a big(ger) data center is that you might not care if your small data center goes offline for an hour or so, but if you do, then you simply have to split it up into multiple layer-2 domains connected through layer-3 switches (or load balancers or firewalls if you so desire).
If you’re serious about the claims that you have mission-critical applications that require high availability (and everyone claims they have them), then you simply have to create multiple availability zones in your network, and spread multiple copies of the same application across them. As Amazon proved, even multiple availability zones might not be good enough, but having them is infinitely better than having a single failure domain.
The usual counterarguments
This is what I usually hear after presenting the above sad facts to data center engineers: “there’s nothing we can do”, “but our users require unlimited VM mobility”, “our applications won’t work otherwise” and a few similar ones. These are all valid claims, but as always in life, you have to face the harsh reality: either you do it right (and everyone accepts the limitations of doing it right), or you’ll pay for it in the future.
Other options?
As always in the IT world, there’s a third way: use MAC-over-IP network virtualization (in the form of VXLAN, NVGRE or STT). Once these technologies get widely adopted and implemented in firewalls and load balancers (or we decide to migrate from physical to virtual appliances), they’ll be an excellent option. In the meantime, you have to choose the lesser evil (whatever you decide it is).
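If you’re wondering what MAC-over-IP encapsulation actually looks like, here’s a minimal Python sketch of the VXLAN flavor (RFC 7348): the original Ethernet frame travels as the payload of a UDP datagram (destination port 4789) sent between tunnel endpoints, so the physical network only needs IP reachability. The inner frame contents below are made up purely for illustration.

```python
# Minimal sketch of VXLAN encapsulation: prepend the 8-byte VXLAN header to an
# Ethernet frame; the result becomes the payload of a UDP datagram sent to the
# remote VTEP on port 4789.
import struct

VXLAN_PORT = 4789            # IANA-assigned UDP port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field carries a valid segment ID

def vxlan_encapsulate(inner_ethernet_frame: bytes, vni: int) -> bytes:
    # Header layout: flags (1 byte), reserved (3 bytes), 24-bit VNI, reserved (1 byte)
    header = struct.pack("!B3xI", VXLAN_FLAG_VNI_VALID, vni << 8)
    return header + inner_ethernet_frame

# hypothetical inner frame: broadcast dst MAC, made-up src MAC, ARP EtherType, payload
inner = bytes.fromhex("ffffffffffff") + bytes.fromhex("02005e000101") + b"\x08\x06" + b"ARP..."
print(len(vxlan_encapsulate(inner, vni=5001)))  # inner frame plus the 8-byte VXLAN header
```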
More information
You probably know you’ll find a lot more information in my data center and virtualization webinars, but there’s also a book I would highly recommend to anyone considering more than just how to wire a bunch of switches together – Scalability Rules is an awesome collection of common-sense and field-tested scalability rules (including a number of networking-related pieces of advice not very dissimilar from what I’m always telling you). Finally, if you’d like to have my opinion on your data center design, check out the ExpertExpress service.
STP Loop guard: http://www.alcatelunleashed.com/viewtopic.php?f=190&t=18462&start=20#p66200
MAC Move: http://lucent-info.com/html/93-0107-08-04-H/7450%20Services%20Guide/wwhelp/wwhimpl/common/html/wwhelp.htm#href=services_con_vpls.12.20.html&single=true
I bet other vendors have something similar. ;)
Sometimes, when your faith in your vendor's new features is somewhat shaky, that can mean "just run spanning tree and be cautious". ;)
http://www.cs.cmu.edu/~acm/papers/myers-hotnetsIII.pdf
Does IPv6 with MLD-snooping switches still require flooding, or could flooding be removed?
That picture is right up there with the Swiss army knife!
As usual, a quality post, but regarding your statement, "...then you simply have to split it up into multiple layer-2 domains connected through layer-3 switches", would you care to elaborate? I work in data centers, but on the facilities side, and I have been burned many times by industrial devices with poor or limited TCP/IP stacks, or in some cases devices unable to route back to their server, leaving me having to span layer 2 across a couple of switches. I have implemented storm control, but as you mentioned, that may not be enough to stop a meltdown. I'm curious how I can overcome that hurdle while following your recommendation to split the layer-2 domain with layer-3 switches.
As for "fixing the flooding behavior", Nicira got pretty far (VXLAN and NVGRE have just inserted another layer of abstraction and resurrected IP MC) and can do either headend replication or replication in dedicated nodes.
The only one that decided to go all the way and kill flooding was Amazon; everyone else is too concerned about precious enterprise craplications that rely on L2 flooding in one stupid way or another.
Nicira's NVP is a totally different story. They might not be totally there yet, but the architecture does allow that.
This being said, control-plane failure examples are always epic. Especially at the Internet scale :)
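Here's a minimal sketch of the head-end replication idea mentioned above, assuming a hypothetical control plane that already tells every hypervisor which remote tunnel endpoints (VTEPs) host VMs in a given segment; broadcast/unknown-unicast frames become a handful of unicast tunnel packets at the source instead of being flooded by the physical network.

```python
# Head-end replication sketch: one BUM frame is replicated as unicast tunnel
# packets, one per remote VTEP in the segment. The mapping below is hypothetical
# state pushed down by a controller.
segment_vteps = {
    5001: ["10.1.1.11", "10.1.1.12", "10.1.1.13"],
}

def headend_replicate(segment_id, frame, send_unicast_tunnel_packet):
    """Send one unicast tunnel copy of the frame to every remote VTEP in the segment."""
    for vtep_ip in segment_vteps.get(segment_id, []):
        send_unicast_tunnel_packet(vtep_ip, segment_id, frame)

# usage with a stand-in transport function
headend_replicate(5001, b"ARP request ...",
                  lambda ip, seg, f: print(f"unicast copy to VTEP {ip}, segment {seg}"))
```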
I can have a huge network that is broken up into a huge number of small L2 broadcast domains, all connected to the same core switch pair (or aggregate if you're a Cisco guy).
If one of those tiny L2 broadcast domains loops, then your core switches lock up and your whole network goes down: "single circulating broadcast packet (and its infinite copies) will trigger storm control on ALL SWITCHES, and prevent other valid broadcasts"
I've tested various loop scenarios in a large-scale network (300+ ToR switches and a pair of Cisco 7Ks). I've found storm control doesn't work in 10Gb networks with Cisco FEX. I've found port security does work well, although it increases trouble tickets (opex). I've found that default CoPP works awesomely in keeping your Nexus 7Ks alive so you can find the loop. I've found my best bet is to configure the network to prevent loops and not try to configure around loops. And screw the ideal of preventing loops by telling your cabling crew to cable properly!! That will work for six months or a year until they forget again.
So who cares how big your L2 domains are? If you have the same aggregate switch pair (everyone does), then it doesn't matter how many load balancers or firewall instances you have; I'd say your chances of taking out your data center are exactly the same. In fact, if you cable differently for smaller L2 domains, I'd say your chances go up! But you do lose mobility and scalability the smaller you make your L2 domains.
I don't even want to talk about MAC-in-MAC or MAC-in-multicast. No one is there yet.
I don't even want to talk about STP replacements. No one is there yet.
I'd perhaps consider multiple aggregate-layer switches (if the 7Ks had the capacity for more than four VDCs). That limitation makes VDCs useless except for your development instance.
Ivan, btw, will you be in San Diego this month?
Perhaps if you use different core switch pairs for each L2 domain, you'll be able to avoid fate-sharing in this case. It's more expensive, and not as fancy as VDCs, but it should meet your requirements.
I personally don't have the budget to implement multiple N7K pairs in my DC, so I can't take this advice either :(
Jeremy
I'm sure OpenFlow has a similar magic-unicorn-fart type of answer to this problem...
Now if only someone could solve the problem in practice as well as theory.
It's about errors that may occur (software bugs, manual mistakes, or just race conditions) and can cause loops.
-Vishwas
Thanks for calling out storm control as the preemptive DoS attack that it is.
That feature is just a turd in my opinion:
1) The granularity is terrible: it's not a throttle, it just counts bytes over a one-second (I think?) interval and then throws data away for the remainder of the second once the threshold has been hit (a rough sketch of that behavior follows after point 2). Awful, unless you set it to 10% with the expectation that your server ports will be deaf to some protocols for 90% of every second...
2) Who am I (as the network admin) to decide when a server has sent "too many" broadcast or multicast frames? I've never seen anything along these lines codified in an SLA given to the user/server/application community. Accordingly, if a craplousy business application is built to work exclusively with broadcast frames, and it runs into storm control, guess what feature is going to be switched off?
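To make point 1 concrete, here's a rough Python sketch of the behavior described above (interval-based counting, not shaping). Exact storm-control mechanics vary by platform, so treat the one-second window and the byte budget as assumptions.

```python
# Interval-based "storm control" model: once the per-interval byte budget is
# spent, everything else in that interval is dropped, so a port can go deaf
# for most of every second instead of being smoothly rate-limited.
import time

class IntervalStormControl:
    def __init__(self, bytes_per_interval, interval_seconds=1.0):
        self.budget = bytes_per_interval
        self.interval = interval_seconds
        self.window_start = time.monotonic()
        self.counted = 0

    def admit(self, frame_len):
        now = time.monotonic()
        if now - self.window_start >= self.interval:
            self.window_start = now     # new interval: reset the counter
            self.counted = 0
        self.counted += frame_len
        return self.counted <= self.budget   # False = drop for the rest of the interval

policer = IntervalStormControl(bytes_per_interval=125_000)  # roughly 1 Mbps of broadcasts
print(policer.admit(1500))   # True while under budget, False once the budget is gone
```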
I agree with most of what you say.
TRILL does have loop mitigation: it adds a hop count as a last-resort measure for breaking loops (a quick sketch follows this comment).
Tunneling can cause loops too, BTW. This is something that has been raised with IPv6. Have a look at:
http://tools.ietf.org/html/rfc6324
-Vishwas
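A minimal sketch of the hop-count mechanism mentioned in the comment above (the values are made up): every hop decrements the count and discards the frame when it runs out, so even a transient forwarding loop circulates a frame a bounded number of times instead of forever.

```python
# Hop-count-based loop mitigation: each forwarding hop decrements the count;
# the frame is discarded when the count is exhausted.
INITIAL_HOP_COUNT = 16   # hypothetical value; real implementations size it to the topology

def forward(frame_hop_count):
    """Return the decremented hop count, or None if the frame must be discarded."""
    if frame_hop_count <= 1:
        return None           # hop count exhausted: drop instead of looping forever
    return frame_hop_count - 1

hops = INITIAL_HOP_COUNT
steps = 0
while hops is not None:       # simulate a frame stuck in a forwarding loop
    hops = forward(hops)
    steps += 1
print(f"frame discarded after {steps} hops")   # bounded, unlike classic Ethernet flooding
```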
TRILL is actually closer to routing than to bridging, so it's an L2 protocol that incorporates most of the benefits of L3 protocols. TTLs prevent infinite loops, and in Cisco's FabricPath you get conversational MAC learning, which eases the burden of having to learn every address in your L2 domain.
Still, I wonder how it is that SAN admins got away with deploying two independent, separate networks, and LAN admins did not. It is evident that the first-hop L2 connection is a single failure domain, no matter how resilient it is.
It's time operating systems and applications started supporting dual-LAN designs so we can effectively protect this single failure domain.
There's no technological reason you can't do it ... apart from the limitations of the broken socket API and the missing session layer in the TCP stack ;)