VLANs and Failure Domains Revisited
My friend Christoph Jaggi, the author of fantastic Metro Ethernet and Carrier Ethernet Encryptors documents, sent me this question when we were discussing the Data Center Fabrics Overview workshop I’ll run in Zurich in a few weeks:
When you are talking about large-scale VLAN-based fabrics I assume that you are pointing towards highly populated VLANs, such as VLANs containing 1000+ Ethernet addresses. Could you provide a tipping point between reasonably-sized VLANs and large-scale VLANs?
What matters is not the number of hosts in the VLAN but the span of the bridging domain (VLAN or otherwise).
Before We Start
Please note that I'm looking at the problem purely from the data center perspective - transport VLANs offered by Metro Ethernet or VPLS service providers are a totally different story for several reasons:
- There's a difference between providing transport service and being responsible for the whole infrastructure (and all stupidities people do on top of it);
- Sensible people don't bet their whole IT infrastructure on a single service provider (those that do eventually get the results they deserve). Failure in the transport network is thus not as critical as a data center failure;
- Sensible people isolate their internal networks from transport network failures by using routing functionality between their LAN and WAN networks. Some data center architects, on the other hand, happily extend a single (V)LAN across a WAN network.
What Could Possibly Go Wrong?
There are two failure scenarios I often see when people come to me after experiencing a data center meltdown:
- bridging loop caused by a host (or VM);
- bridging loop on the fabric edge - from something as stupid as technicians plugging a TX fiber into an RX port or connecting two ports to see if the fiber is OK (sometimes coupled with device misconfiguration) to software bugs in MLAG implementations.
The first one is annoying, the second one is catastrophic, because ToR switches can easily flood packets at wire speed.
Back to the Failure Domains
In any case, everyone who's part of the same VLAN gets affected, and if someone (in their infinite wisdom) configured all VLANs on all server-facing ports because that's easier than actually talking with the server/virtualization team or deploying a VLAN provisioning solution, every server gets impacted.
Furthermore, every link that the affected VLAN crosses has to carry the unnecessary traffic. Not a big deal if you have a 10Gbps bridging loop at the network edge and 40Gbps fabric links, but a major disaster if your bridging domain includes 1Gbps or 10Gbps links between data centers.
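The arithmetic behind that comparison can be sketched in a few lines of Python. This is a deliberately simplified model: it assumes the looped traffic is flooded onto each link at the full loop rate, and the function name and link list are made up for the illustration.

```python
def loop_load(loop_gbps: float, link_gbps: float) -> float:
    """Fraction of a link's capacity consumed by looped traffic (capped at 1.0)."""
    return min(loop_gbps / link_gbps, 1.0)

# Link speeds are the ones mentioned in the text; the labels are illustrative.
links = [
    ("40Gbps fabric link", 40.0),
    ("10Gbps DCI link", 10.0),
    ("1Gbps DCI link", 1.0),
]

for name, capacity in links:
    share = loop_load(10.0, capacity)
    print(f"{name}: {share:.0%} of capacity consumed by a 10Gbps loop")
```

Under these assumptions the fabric link still has three quarters of its capacity left, while either inter-DC link is completely saturated.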
It’s Not a Numbers Game
Christoph continued his question with two examples:
1 VLAN = 1 failure domain
100 VLANs = 100 separate failure domains
Not necessarily. The 100 VLANs touch all the core links, so there’s at least partial overlap between them (a bridging loop in a VLAN that saturates a DCI link will kill all other VLANs traversing that link), and if the networking team configured every VLAN on every server-facing port to make their lives easier, the whole data center fabric becomes a single failure domain.
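One way to make the overlap argument concrete is to treat VLANs that share any link as belonging to the same failure domain and count the resulting groups. The sketch below does that with a small union-find; the function name, the link names, and the "any shared link couples two VLANs" rule are assumptions made for the illustration (a pessimistic model, since a shared link only hurts once it is saturated).

```python
from collections import defaultdict

def failure_domains(vlan_links: dict[int, set[str]]) -> list[set[int]]:
    """Group VLANs that share at least one link into a common failure domain."""
    parent = {vlan: vlan for vlan in vlan_links}

    def find(v: int) -> int:
        # Walk up to the representative VLAN, shortening the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    first_vlan_on_link: dict[str, int] = {}
    for vlan, links in vlan_links.items():
        for link in links:
            if link in first_vlan_on_link:
                # Two VLANs on the same link: merge their domains.
                parent[find(vlan)] = find(first_vlan_on_link[link])
            else:
                first_vlan_on_link[link] = vlan

    groups: dict[int, set[int]] = defaultdict(set)
    for vlan in vlan_links:
        groups[find(vlan)].add(vlan)
    return list(groups.values())

# 100 VLANs, each confined to its own (hypothetical) edge link.
isolated = {v: {f"edge-{v}"} for v in range(1, 101)}
# The same 100 VLANs, all also stretched across one (hypothetical) DCI link.
stretched = {v: {f"edge-{v}", "dci-1"} for v in range(1, 101)}

print(len(failure_domains(isolated)))
print(len(failure_domains(stretched)))
```

In this model the isolated VLANs form 100 separate domains, while stretching all of them across a single shared link collapses them into one.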
1 VLAN = 100 MAC addresses = I am OK
1 VLAN = 1000 MAC addresses = I am close to the limit
1 VLAN = 2000 MAC addresses = I am looking for trouble
1 VLAN = 3000+ MAC addresses = I will get into trouble
That’s the traditional view of the problem: bridging doesn’t scale because hosts are chatty and more hosts in a VLAN result in more flooding overhead.
In reality, you will get into trouble with many IP nodes (hosts or routers) in the same VLAN (that’s why some large layer-2 fabrics use ARP sponges), but the most common cause of the network meltdowns that I see is a bridging loop.
1) how switch vendors implement the mac table in cam, and what happens when a table row overflows, can be an issue. one vendor implements the cam table as a list of buckets with a fixed number of entries per bucket. macs are hashed into buckets, and the switch experiences a fault when a bucket gets full. this can cause an outage for the affected macs, which cannot be reliably learned. (not all vendors' switches malfunction this way, but it is something one should be aware of.)
2) switch vendors do not engineer switches for extreme scale. make sure to ask vendors for the gritty details of how their mac and arp tables are implemented before relying on them in your environment. some vendors' next hop tables are inefficient when utilizing lag (port-channel), for example, and the stated maximums cannot be hit. some vendors support next hop table reprogramming for shared next hops. this can be helpful if you have many ips per mac, but is of little use if you have many macs.
3) as has been commented here and on packet-pushers, nerd knobs aren't always the best idea. resist the temptation to turn on ill-advised knobs that allow for engineering a larger layer 2; things like BUM disable. it makes life difficult when new hosts join the network or after a switch reboot.
4) mac acls in relation to item 3 do not scale easily unless you have automation going for you.
5) network meltdown isn't always limited to overloading dci or uplink connections; one could easily put in a pps limit for BUM traffic. in the face of network loops, if arp replies are looped through the network, the mac address of the destination host might get programmed on incorrect ports in the switching gear. game over.
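The bucket overflow described in point 1 can be illustrated with a toy simulation. This is a sketch under loud assumptions: the table geometry (1024 buckets of 4 entries), the modulo "hash", and the function name are all invented for the example and do not describe any particular vendor's hardware.

```python
import random

def learn_macs(num_macs: int, num_buckets: int, bucket_depth: int, seed: int = 1) -> int:
    """Return how many MACs could not be installed in a bucketed hash table."""
    rng = random.Random(seed)
    fill = [0] * num_buckets        # entries currently used in each bucket
    failed = 0
    for _ in range(num_macs):
        mac = rng.getrandbits(48)   # random 48-bit address
        bucket = mac % num_buckets  # stand-in for the (unknown) hardware hash
        if fill[bucket] < bucket_depth:
            fill[bucket] += 1
        else:
            failed += 1             # bucket full: this MAC cannot be learned
    return failed

# Hypothetical table: 1024 buckets x 4 entries = 4096 nominal entries.
capacity = 1024 * 4
for load in (0.25, 0.50, 0.75):
    macs = int(capacity * load)
    print(f"{macs} MACs ({load:.0%} of nominal capacity): "
          f"{learn_macs(macs, 1024, 4)} not learned")
```

Because addresses are not spread perfectly evenly across buckets, some buckets fill up while others still have room, so a table like this can start rejecting MACs well before it reaches its advertised size. That is one reason the stated maximums deserve the scrutiny recommended in point 2.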