Transparent Bridging (aka L2 Switching) Scalability Issues
Stephen Hauser sent me an interesting question after the Data Center fabric webinar I did with Abner Germanow from Juniper:
A common theme in your talks is that L2 does not scale. Do you mean that Transparent (Learning) Bridging does not scale due to its flooding? Or is there something else that does not scale?
As is often the case, I'm not precise enough in my statements, so let's fix that first:
There are numerous layer-2 protocols, but when I talk about layer-2 (L2) scalability in a data center context, I always mean Ethernet bridging (also known under its marketing name, switching); more precisely, transparent bridging, which floods broadcast, unknown unicast, and multicast frames (I love the BUM acronym) to compensate for the lack of host-to-switch and routing (MAC reachability distribution) protocols.
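To make this more tangible, here's a minimal Python sketch of the transparent bridging forwarding logic; the class, port numbers, and MAC addresses are purely illustrative (real switches do this in hardware, per VLAN):

```python
# Minimal sketch of transparent (learning) bridge forwarding logic.
# Names and values are illustrative, not any vendor's implementation.

BROADCAST = "ff:ff:ff:ff:ff:ff"

class LearningBridge:
    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}          # MAC address -> port, learned from source MACs

    def handle_frame(self, in_port, src_mac, dst_mac):
        # Learn: remember which port the source MAC was seen on
        self.mac_table[src_mac] = in_port

        # Broadcast/multicast, or unicast with an unknown destination (BUM):
        # flood the frame out of every port except the one it arrived on.
        is_multicast = int(dst_mac.split(":")[0], 16) & 1
        if is_multicast or dst_mac not in self.mac_table:
            return self.ports - {in_port}

        # Known unicast: forward out of a single port
        return {self.mac_table[dst_mac]}

bridge = LearningBridge(ports=[1, 2, 3, 4])
print(bridge.handle_frame(1, "00:00:5e:00:53:01", BROADCAST))            # flooded to ports 2, 3, 4
print(bridge.handle_frame(2, "00:00:5e:00:53:02", "00:00:5e:00:53:01"))  # known unicast -> port 1
```

Note that there is no control-plane protocol telling the bridge where a MAC address lives; reachability is discovered solely by flooding and learning from the frames themselves.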
Large transparently bridged Ethernet networks face three layers of scalability challenges:
A dismal control-plane protocol (Spanning Tree Protocol in its myriad incarnations), combined with broken implementations of various STP kludges. The forward-before-you-think behavior of Cisco's PortFast and the lack of CPU protection on some switches immediately come to mind.
TRILL (or a proprietary TRILL-like implementation such as FabricPath) would solve most of the STP-related issues once implemented properly (ignoring STP does not count as a properly scalable implementation in my personal opinion). However, we still have limited operational experience, and some vendors implementing TRILL might still face a steep learning curve before all the loop detection/prevention and STP integration features work as expected.
Flooding of BUM frames is an inherent part of transparent bridging; it cannot be disabled if you want to retain the transparent LAN behavior that existing (and sometimes broken) software implementations rely on.
Every broadcast frame flooded throughout an L2 domain must be processed by every host participating in that domain (where an L2 domain means a transparently bridged Ethernet VLAN or its equivalent). Ethernet NICs do perform some sort of multicast filtering, but it's usually hash-based and far from ideal (for more information, read the multicast-related blog posts written by Chris Marget).
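As a rough illustration of why hash-based filtering is imperfect, the sketch below hashes multicast MAC addresses into a small bit vector, conceptually similar to what many NICs do; the 64-entry table and the CRC32 hash are assumptions made for illustration, not any specific NIC's design. Any group that collides with a subscribed bucket still reaches the host CPU:

```python
# Sketch of an imperfect hash-based multicast filter. The 64-bucket table
# and CRC32 hash are assumed values, not a specific NIC's implementation.
import zlib

TABLE_BITS = 64

def bucket(mac):
    # Hash the destination MAC into one of 64 buckets
    return zlib.crc32(bytes.fromhex(mac.replace(":", ""))) % TABLE_BITS

class MulticastFilter:
    def __init__(self):
        self.table = set()           # buckets the host has subscribed to

    def join(self, group_mac):
        self.table.add(bucket(group_mac))

    def accepts(self, dst_mac):
        # The NIC only knows the bucket, so any group hashing into a
        # subscribed bucket is passed up to the host (a false positive).
        return bucket(dst_mac) in self.table

f = MulticastFilter()
f.join("01:00:5e:00:00:05")          # the host joins a single multicast group
unwanted = sum(f.accepts(f"01:00:5e:01:02:{i:02x}") for i in range(256))
print(f"{unwanted} of 256 unrelated multicast groups still reach the host CPU")
```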
Finally, while Ethernet NICs usually ignore flooded unicast frames (those frames still eat bandwidth on every single link in the L2 domain, including host-to-switch links), servers running hypervisor software are not that fortunate. The hypervisor requirements (the number of unicast MAC addresses within a single physical host) typically exceed the NIC capabilities, forcing hypervisors to put physical NICs into promiscuous mode. Every hypervisor host thus has to receive, process, and often ignore every flooded frame. Some of those frames have to be propagated to one or more VMs running in that hypervisor and processed further by them (assuming the frame belongs to the proper VLAN).
In a typical every-VLAN-on-every-access-port design, every hypervisor host has to process every BUM frame generated anywhere in the L2 domain (regardless of whether its VMs belong to the VLAN in which the flooding takes place).
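Here's a minimal sketch of the per-frame work a hypervisor soft switch ends up doing once the physical NIC runs in promiscuous mode; the class and method names are mine, not any particular hypervisor's API:

```python
# Sketch of the work a hypervisor soft switch does for every frame received
# on a promiscuous-mode uplink. Names and structure are illustrative only.

class SoftSwitch:
    def __init__(self):
        self.vnics = {}              # (vlan, mac) -> VM name

    def attach_vm(self, name, vlan, mac):
        self.vnics[(vlan, mac)] = name

    def receive_from_uplink(self, vlan, dst_mac):
        # The promiscuous physical NIC delivers *every* frame flooded in the
        # L2 domain; the hypervisor burns CPU cycles deciding what to do with it.
        if dst_mac == "ff:ff:ff:ff:ff:ff":
            # Broadcast: deliver a copy to every local VM in that VLAN
            return [vm for (v, _), vm in self.vnics.items() if v == vlan]
        vm = self.vnics.get((vlan, dst_mac))
        return [vm] if vm else []    # most flooded frames end up here: work done, frame dropped

sw = SoftSwitch()
sw.attach_vm("web-01", vlan=10, mac="00:00:5e:00:53:0a")
print(sw.receive_from_uplink(10, "ff:ff:ff:ff:ff:ff"))   # ['web-01']
print(sw.receive_from_uplink(20, "ff:ff:ff:ff:ff:ff"))   # [] -- CPU spent, nothing delivered
print(sw.receive_from_uplink(10, "00:00:5e:00:53:0b"))   # [] -- flooded unknown unicast, dropped by the host
```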
You might be able to make bridging scale better if you implemented a fully IP-aware L2 solution. Such a solution would have to include an ARP proxy (or central ARP servers), IGMP snooping, and a total ban on other BUM traffic. TRILL as initially envisioned by Radia Perlman was moving in that direction, but it got thoroughly crippled and force-fit into the ECMP bridging rathole by the IETF working group.
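To make the IP-aware idea more concrete, here's a sketch of an ARP proxy at the L2 edge answering ARP requests from a central IP-to-MAC directory instead of flooding them; the directory, its contents, and the way it gets populated are hypothetical:

```python
# Sketch of an ARP proxy at the L2 edge: ARP requests are answered from a
# central IP-to-MAC directory instead of being flooded. The directory is a
# hypothetical component, not any specific product or protocol.

directory = {                        # populated by a hypothetical central controller
    "192.0.2.10": "00:00:5e:00:53:0a",
    "192.0.2.11": "00:00:5e:00:53:0b",
}

def handle_arp_request(sender_ip, target_ip):
    mac = directory.get(target_ip)
    if mac:
        # Reply locally; the request never gets flooded through the fabric
        return {"op": "reply", "target_ip": target_ip, "target_mac": mac}
    # Unknown target: drop it or punt it to the controller, but never flood it
    return None

print(handle_arp_request("192.0.2.11", "192.0.2.10"))
print(handle_arp_request("192.0.2.11", "192.0.2.99"))
```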
Lack of addressing hierarchy is the final stumbling block. Modern data center switches (most of them using the same hardware) support up to 100K MAC addresses, so other problems will probably kill you way before you reach this milestone.
Finally, every L2 domain (VLAN) is a single failure domain (primarily due to BUM flooding). There are numerous knobs you can try to tweak (storm control, for example), but you cannot change two basic facts:
- A software glitch in a switch that causes a forwarding (and thus flooding) loop involving core links will inevitably cause a network-wide meltdown (due to the lack of a TTL field in L2 headers; see the sketch after this list);
- A software glitch (or a virus, malware, or what have you), or uncontrolled flooding started by any host or VM attached to a VLAN, will impact all other hosts (and VMs) attached to the same VLAN, as well as all core links. A bug resulting in broadcasts will also hit the CPU of every layer-3 (IP) switch with an IP address configured in that VLAN.
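Here's a toy simulation of the first point, assuming STP has failed open in a small full mesh of four switches and every link is forwarding; because Ethernet frames carry no TTL, nothing ever removes the copies of a single broadcast frame, and their number grows without bound:

```python
# Toy simulation: one broadcast frame entering a full mesh of four switches
# after a forwarding loop has formed (e.g., STP failure). No TTL means no
# copy ever expires.

links = {                            # switch -> directly connected neighbors
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A", "B", "C"],
}

# Each in-flight copy is (switch it arrived at, switch it came from)
in_flight = [("A", None)]            # a single broadcast frame enters at switch A

for step in range(8):
    next_copies = []
    for switch, came_from in in_flight:
        # Flood out of every port except the one the frame arrived on
        for neighbor in links[switch]:
            if neighbor != came_from:
                next_copies.append((neighbor, switch))
    in_flight = next_copies
    print(f"after hop {step + 1}: {len(in_flight)} copies of the frame in flight")
```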
You can use storm control to reduce the impact of an individual VM, but even the market leader might have a problem or two with this feature.
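Conceptually, storm control is just a per-port policer on BUM traffic. The sketch below models it as a token bucket with made-up numbers; real switches typically measure BUM traffic as a percentage of link bandwidth over a fixed interval, and may shut down (err-disable) the port instead of dropping frames:

```python
# Sketch of per-port storm control modeled as a token-bucket policer on BUM
# traffic. All parameters are illustrative.

class StormControl:
    def __init__(self, rate_pps, burst):
        self.rate = rate_pps         # tokens (frames) added per second
        self.tokens = burst
        self.burst = burst
        self.last = 0.0

    def allow(self, now):
        # Refill the bucket based on elapsed time, then spend one token per frame
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                 # frame dropped (or the port could be err-disabled)

sc = StormControl(rate_pps=1000, burst=100)
t = 0.0
dropped = 0
for _ in range(10000):               # a host starts blasting 100,000 broadcasts per second
    if not sc.allow(t):
        dropped += 1
    t += 1 / 100000
print(f"dropped {dropped} of 10000 broadcast frames in {t:.2f} seconds")
```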
More information
If you got this far, you really should read two more blog posts: Bridging and Routing – Is There a Difference? and Bridging and Routing – Part II. You might also be interested in the history of the switching buzzword.
As always, I have a few data center webinars that can help you: Cloud Computing Networking (particularly its Will It Scale part), Data Center Fabric Architectures, and Data Center 3.0 for Networking Engineers. All three webinars are part of the yearly subscription.
A side note: to get the full benefits of IPv6 ND (which relies on multicast instead of broadcast), you need switches with MLD snooping.