Bridges: a Kludge that Shouldn't Exist
Over the last few weeks I tried hard to sort out my thoughts on routing and bridging; specifically, what’s the difference between them and why you should use routing and not bridging in any large-scale network (regardless of whether it happens to be crammed into a single building called a data center).
My vague understanding of layer 2 (Data Link layer) of the OSI model was simple: it was supposed to provide frame transport between neighbors (a neighbor is someone who is on the same physical medium as you are); layer 3 (Network layer) was supposed to provide forwarding between distant end nodes. Somehow the bridges did not fit this nice picture.
As I was struggling with this ethereally geeky version of a much older angels-on-a-pin problem, Greg Ferro of EtherealMind.com (what a coincidence, isn’t it) shared a link to a GoogleTalk given by Radia Perlman, the author of the Spanning Tree Protocol and co-author of TRILL. And guess what – in her opening minutes she said “Bridges don’t make sense. If you do packet forwarding, you should do it on layer 3”. That’s so good to hear; I’m not crazy after all.
Radia listed several shortcomings of layer-2 forwarding (@ 8:30 in her talk, long explanations are mine; if you think they are wrong, don’t blame her):
- Layer-3 addresses have topological significance. The layer-3 addressing space is hierarchical, so you can summarize and create addressing and routing hierarchies. Layer-2 addressing space is flat; from the perspective of a forwarding device, the layer-2 addresses are spread randomly across the network.
- Layer-2 protocols don’t have a hop count, so you can’t detect a forwarding loop. You have to invent all sorts of fixes (like STP) to prevent the loops from ever happening.
- Layer-2 protocols don’t support fragmentation or path MTU discovery, so it’s impossible to do forwarding across segments with mismatched MTUs ... but then the Data Link layer was never designed to handle that task.
- Layer-2 protocols lack router-to-host error messages (like ICMP). Yet again, it was never a layer-2 task, but you need them if you want to have a routable protocol.
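The first two points in the list above can be illustrated with a small Python sketch. It is a toy model, not taken from Radia's talk: the 10.1.x.0/24 prefixes, the MAC addresses and the `forward_in_loop` helper are all made up for illustration.

```python
import ipaddress

# Hypothetical addressing plan: four /24 subnets that all sit behind one router.
subnets = [ipaddress.ip_network(f"10.1.{i}.0/24") for i in range(4)]

# Hierarchical layer-3 addresses collapse into a single summary route.
summary = list(ipaddress.collapse_addresses(subnets))
print(summary)  # [IPv4Network('10.1.0.0/22')]

# Flat layer-2 addresses (made-up MACs) cannot be summarized:
# the forwarding table needs one entry per host.
macs = ["00:1b:21:3a:4f:01", "f4:ce:46:9c:00:17", "3c:07:54:aa:10:02"]
l2_table = {mac: "port1" for mac in macs}
print(len(l2_table), "entries for", len(macs), "hosts")


def forward_in_loop(ttl=None, max_steps=1000):
    """Count how often a packet is forwarded around a forwarding loop.

    ttl=None models a layer-2 frame, which carries no hop count.
    max_steps only exists so that the simulation itself terminates.
    """
    hops = 0
    while hops < max_steps:
        if ttl is not None:
            ttl -= 1          # every router decrements the hop count ...
            if ttl == 0:
                return hops   # ... and drops the packet when it reaches zero
        hops += 1             # otherwise the packet goes around once more
    return hops


print(forward_in_loop(ttl=8))    # the loop is survivable: the packet dies
print(forward_in_loop(ttl=None)) # the frame circulates until we give up
```

The point of the second half: with a hop count, a loop is a nuisance that cleans up after itself; without one, the only safe option is to make sure the loop never forms.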
I would add one more very significant drawback: layer-3 forwarding is deterministic; you know where the destination is supposed to be. Layer-2 forwarding is guesswork; if you don’t know where the destination is, you just send a copy of the frame in all directions, just to make sure it will eventually hit the target.
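To make the "guesswork" concrete, here is a minimal sketch of the flood-and-learn behavior of a transparent bridge. The `LearningBridge` class and its port numbers are hypothetical, and the model ignores broadcast/multicast addresses, entry aging and VLANs.

```python
class LearningBridge:
    """Toy transparent bridge: learns source MACs, floods unknown destinations."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # source MAC -> port it was last seen on

    def receive(self, in_port, src, dst):
        """Process one frame and return the set of ports it is sent out of."""
        self.mac_table[src] = in_port       # learn where the sender lives
        out_port = self.mac_table.get(dst)
        if out_port is None:
            return self.ports - {in_port}   # unknown destination: flood
        if out_port == in_port:
            return set()                    # destination is on the same segment
        return {out_port}                   # known destination: forward


bridge = LearningBridge(ports=[1, 2, 3])
first = bridge.receive(1, "A", "B")   # B has never been seen: flooded
reply = bridge.receive(2, "B", "A")   # A was learned from the first frame
second = bridge.receive(1, "A", "B")  # B was learned from the reply
print(first, reply, second)
```

Note that the bridge never learns a destination until that destination has sent something itself; until then, every frame toward it is copied to all other ports.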
OK, now we know bridges shouldn’t exist, yet they do. What went wrong? Radia explains that as well.
Approximately 30 years ago, when Ethernet was just starting to appear, Radia was working on DECnet, one of the first truly well-designed networking protocols. DECnet was used to connect minicomputers manufactured by Digital Equipment Corporation (DEC) over WAN links and later over Ethernet (DECnet Phase IV). In those days, there were no PCs and users were using text terminals connected to computers with RS-232 cables. Using regular cables, RS-232 connections could be only a few meters long (and you had a serial port in your minicomputer for every user attached to it).
DEC solved this problem by developing the first terminal servers. Obviously they had to be low(er) cost (nobody would buy another minicomputer just to connect terminals to it) with very limited RAM and CPU capabilities. Radia suggested using DECnet (@ 7:20), but the development team decided to create their own protocol (LAT) which had no layer 3 because “their customer would never want to talk between Ethernet segments” (@ 7:40).
I can understand the need for a separate protocol. DECnet was a behemoth. I was “fortunate” enough to have it running on MS-DOS as Pathworks; it consumed half of the then-available 640K of RAM (that’s kilobytes ... for those of you who think your laptop needs 4GB of RAM). What I cannot understand is the very limited perspective of the LAT developers ... but then our industry is filled with relics and wrong decisions that we still have to live with decades later (my favorites: the lack of a session layer in TCP/IP and IP multihoming).
Anyhow, LAT was implemented and it was not routable ... and then all of a sudden the customers wanted to have terminal servers on one Ethernet segment talking to computers connected to another Ethernet segment. Instead of telling the LAT people to get lost (not an option, LAT-based terminal servers represented 10% of DEC’s revenue) or redesign their broken protocol, the problem was rephrased as “we need a magic box that can connect two Ethernet segments” (@ 10:00).
The answer could be any of these:
- You need a router and a routable protocol (member of an ISO/OSI standard committee);
- You can’t route non-routable protocols; I can write a paper proving that (academic researcher with a life-long tenure);
- Told you so (Network Zen Master);
- Let’s build a workaround (MacGyver answer worthy of a true networking engineer).
The rest is history: they surged ahead, built the bridge, and Radia designed the STP. A masterpiece of engineering, but still a kludge fixing a problem that shouldn’t have existed in the first place. The networking landscape was changed forever… and not necessarily for the better.
Even now, in the days when UPnP and Zeroconf have been available for a while (though never widely accepted), using Ethernet "bridging" (defined as single-link extension) is the simplest and cheapest way of hooking up network devices with minimum (initial) headache. At least without hiring a networking expert :) And in *some* situations this works just fine.
The main problem (and benefit at the same time!) of any ad-hoc technology is low manageability. I would like to stress that it IS possible to develop a scalable multi-access link technology (though TRILL is not an example :). There are Ethernet "modifications" that scale perfectly in the sense that they may accommodate tens of thousands of nodes on a single virtual link without any problems. However, it's almost impossible to properly manage an unstructured ad-hoc network; nor is it possible to implement a complex security policy.
Is there a solution? Well, yes: create another overlay structure on top of the ad-hoc topology (e.g. require all nodes to register, authenticate and join a community/group) and use it for management. It works, but it removes the one benefit that ad-hoc bridging had! :)
The simplicity of Ethernet connection provisioning is still an important feature. That is why service providers prefer offering L2 connectivity. Provisioning can be easy, automated and quick, even across multiple service provider sections. Metro Ethernet Forum has some specifications for this... Provisioning L3 connectivity between a LAN and a WAN is more complicated and consumes more time and resources.
It is possible to design a resilient and scalable L2 network. It takes care, discipline and foresight, just like any other engineering task.
^ Agreed. The original 802.1D was embarrassingly slow and limited. It took the better part of a decade to evolve into something half-decent.
Next, do not forget that the bridges' hardware at that time was severely limited. It's easy to be smart and creative when you have 4GB of RAM and a CPU that's faster than yesterday's mainframes. When all you have is 64KB of RAM and a 4MHz CPU (or something equivalent), your options are "somewhat" limited. Designing a protocol that has very low CPU+RAM requirements and almost-constant complexity is good engineering.
Last but not least, don't blame STP for the monstrosity of transparent bridging. STP is just a protocol that finds a loop-free topology. If it's too slow for you, you can always fine-tune the timers. "Original" OSPF, IS-IS or BGP are also embarrassingly slow. Just imagine: 40 seconds just to discover your neighbor is down (or 3 minutes in the BGP case) ... unbelievable. What were those idiots thinking? Oh, maybe, just maybe, they were forced to run those protocols on 64 kbps links instead of 10 Gbps ones.
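As a rough illustration of what "finds a loop-free topology" means, here is a simplified Python sketch that computes the end result centrally. It is not the actual 802.1D algorithm (no BPDUs, timers, port states or link costs): it assumes a connected network with equal-cost links, elects the lowest bridge ID as root, and breaks ties in favor of lower IDs. The `spanning_tree` function and the four-bridge topology are hypothetical.

```python
from collections import deque


def spanning_tree(links):
    """Return (root, active_links, blocked_links) for a connected bridged network.

    links is a collection of (bridge_a, bridge_b) pairs with comparable IDs.
    """
    links = list(links)
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    root = min(neighbors)        # lowest bridge ID wins the root election
    parent = {root: None}
    queue = deque([root])
    while queue:                 # breadth-first search = shortest path to root
        bridge = queue.popleft()
        for n in sorted(neighbors[bridge]):   # lower ID breaks ties
            if n not in parent:
                parent[n] = bridge
                queue.append(n)

    active = {frozenset((b, p)) for b, p in parent.items() if p is not None}
    blocked = {frozenset(link) for link in links} - active
    return root, active, blocked


# Hypothetical topology: a triangle of bridges 1-2-3 plus bridge 4 behind 3.
links = [(1, 2), (2, 3), (3, 1), (3, 4)]
root, active, blocked = spanning_tree(links)
print("root:", root)
print("blocked:", [sorted(link) for link in blocked])
```

Every link that is not on some bridge's shortest path to the root ends up blocked, which is exactly why a bridged network cannot use all of its links for forwarding.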
The limitations are not inherent in the protocol, but in the way people implement it.
One example is a ~400-site network that a couple of my ex-colleagues designed, built (around 2002) and operated for about 6 years for one of the key customers. Two "head offices" and ~400 branches, running between 10 and 100 Mbit/s each. HO sites were connected using 8-link Gigabit Ethernet channels. Purely switched network, using only MST+ and LAG, nothing else. Equipment-wise: a bunch of 4500s at larger sites, 3550s serving groups of closely located branches, and media converters at the branches themselves. *Not one* major outage (and yes, the network was fully alarmed for connectivity *and* performance, using IP SLA). Maybe lucky.
That is probably the most prominent example. Since then we've all gone VPLS, so no more true blue switched Ethernet. That said, the local incumbent still operates their Carrier Ethernet platform in the true blue fashion. 7600s, 3550s, 3750s, etc. They also didn't have too many misfortunes. Until about a couple of weeks ago, when they did. Apparently due to a failed card somewhere. Yes, many, *many* angry and some *very* angry customers, as yes, these networks are not too easy to troubleshoot when they "go", so it took some hours to get the network back. :)