Bridges: a kludge that shouldn't exist

Over the last few weeks I tried hard to sort out my thoughts on routing and bridging; specifically, what’s the difference between them and why you should use routing and not bridging in any large-scale network (regardless of whether it happens to be crammed into a single building called a Data Center).

My vague understanding of layer 2 (Data Link layer) of the OSI model was simple: it was supposed to provide frame transport between neighbors (a neighbor is someone who is on the same physical medium as you are); layer 3 (Network layer) was supposed to provide forwarding between distant end nodes. Somehow the bridges did not fit this nice picture.

As I was struggling with this ethereally geeky version of a much older angels-on-a-pin problem, Greg Ferro of EtherealMind.com (what a coincidence, isn’t it) shared a link to a GoogleTalk given by Radia Perlman, the author of the Spanning Tree Protocol and co-author of TRILL. And guess what – in her opening minutes she said “Bridges don’t make sense. If you do packet forwarding, you should do it on layer 3”. That’s so good to hear; I’m not crazy after all.

Radia listed several shortcomings of layer-2 forwarding (@ 8:30 in her talk, long explanations are mine; if you think they are wrong, don’t blame her):

  • Layer-3 addresses have topological significance. The layer-3 addressing space is hierarchical, so you can summarize and create addressing and routing hierarchies. Layer-2 addressing space is flat; from the perspective of a forwarding device, the layer-2 addresses are spread randomly across the network.
  • Layer-2 protocols don’t have a hop count, so you can’t detect a forwarding loop. You have to invent all sorts of fixes (like STP) to prevent the loops from ever happening.
  • Layer-2 protocols don’t support fragmentation or path MTU discovery, so it’s impossible to do forwarding across segments with mismatched MTUs ... but then Data Link layer was never designed to handle that task.
  • Layer-2 protocols lack router-to-host error messages (like ICMP). Yet again, it was never a layer-2 task, but you need them if you want to have a routable protocol.
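
The first point (summarization) is easy to illustrate with a few lines of Python. This is my own minimal sketch, not something from the talk: the subnets, MAC addresses and port name are made up, and the only library used is the standard ipaddress module.

```python
import ipaddress

# Hypothetical example: four adjacent /24 subnets sitting behind one router.
subnets = [ipaddress.ip_network(f"10.1.{i}.0/24") for i in range(4)]

# Hierarchical layer-3 addresses collapse into a single summary route.
summary = list(ipaddress.collapse_addresses(subnets))
print(summary)  # [IPv4Network('10.1.0.0/22')]

# Made-up MAC addresses of four hosts behind the same bridge port.
# MAC addresses are assigned by the NIC vendor, not by network location.
macs = [
    "00:1b:21:3a:4f:10",
    "3c:52:82:9e:01:77",
    "a4:5e:60:c2:18:0d",
    "f0:de:f1:44:ab:5c",
]

# A bridge has nothing to aggregate: one forwarding entry per MAC address.
mac_table = {mac: "port1" for mac in macs}
print(len(mac_table))  # 4
```

Four subnets need one routing entry; four hosts need four bridging entries, and the same holds for four thousand.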

I would add one more very significant drawback: layer-3 forwarding is deterministic; you know where the destination is supposed to be. Layer-2 forwarding is guesswork; if you don’t know where the destination is, you just send a copy of the frame in all directions, just to make sure it will eventually hit the target.
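
The guesswork is visible in even the simplest model of a transparent bridge. The sketch below is a toy illustration under my own assumptions (the LearningBridge class, the port numbers and the MAC addresses are hypothetical, and there are no aging timers or VLANs); it shows only the learn-and-flood behavior described above.

```python
class LearningBridge:
    """Toy transparent bridge: learns source MACs, floods unknown destinations."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port where it was last seen

    def forward(self, src, dst, in_port):
        """Return the list of ports the frame is sent out of."""
        # Learn (or refresh) the location of the sender.
        self.mac_table[src] = in_port

        out_port = self.mac_table.get(dst)
        if out_port is None:
            # Unknown destination: send a copy everywhere except the ingress port.
            return sorted(self.ports - {in_port})
        if out_port == in_port:
            # Destination is on the segment the frame came from: filter it.
            return []
        return [out_port]


bridge = LearningBridge(ports=[1, 2, 3])
host_a = "00:00:00:00:00:0a"  # made-up MAC addresses
host_b = "00:00:00:00:00:0b"

first = bridge.forward(host_a, host_b, in_port=1)   # B unknown: flooded
reply = bridge.forward(host_b, host_a, in_port=2)   # A already learned
second = bridge.forward(host_a, host_b, in_port=1)  # B learned from its reply

print(first, reply, second)  # [2, 3] [1] [2]
```

The bridge finds out where host B lives only after B sends something; until then, every frame for B goes everywhere.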

OK, now we know bridges shouldn’t exist, yet they do. What went wrong? Radia explains that as well.

Approximately 30 years ago, when Ethernet was just starting to appear, Radia was working on DECnet, one of the first truly well-designed networking protocols. DECnet was used to connect minicomputers manufactured by Digital Equipment Corporation (DEC) over WAN links and later over Ethernet (DECnet Phase IV). In those days, there were no PCs and users were using text terminals connected to computers with RS-232 cables. Using regular cables, RS-232 connections could be only a few meters long (and you had a serial port in your minicomputer for every user attached to it).

DEC solved this problem by developing the first terminal servers. Obviously they had to be low(er) cost (nobody would buy another minicomputer just to connect terminals to it) with very limited RAM and CPU capabilities. Radia suggested using DECnet (@ 7:20), but the development team decided to create their own protocol (LAT) which had no layer 3 because “their customer would never want to talk between Ethernet segments” (@ 7:40).

I can understand the need for a separate protocol. DECnet was a behemoth. I was “fortunate” enough to have it running on MS-DOS as Pathworks; it consumed half of the then-available 640K of RAM (that’s kilobytes ... for those of you who think your laptop needs 4GB of RAM). What I cannot understand is the very limited perspective of LAT developers ... but then our industry is filled with relics and wrong decisions that we still have to live with decades later (my favorites: the lack of a session layer in TCP/IP and IP multihoming).

Anyhow, LAT was implemented and it was not routable ... and then all of a sudden the customers wanted to have terminal servers on one Ethernet segment talking to computers connected to another Ethernet segment. Instead of telling the LAT people to get lost (not an option, LAT-based terminal servers represented 10% of DEC’s revenue) or redesign their broken protocol, the problem was rephrased as “we need a magic box that can connect two Ethernet segments” (@ 10:00).

The answer could be any of these:

  • You need a router and a routable protocol (member of an ISO/OSI standard committee);
  • You can’t route non-routable protocols; I can write a paper proving that (academic researcher with a life-long tenure);
  • Told you so (Network Zen Master);
  • Let’s build a workaround (a true geek answer).

The rest is history: they surged ahead and built the bridge, and Radia designed STP. A masterpiece of engineering, but still a kludge fixing a problem that shouldn’t have existed in the first place. The networking landscape was changed forever ... and not necessarily for the better.

If you’d like to get a more comprehensive overview of the roles of routing and bridging in modern data centers and Service Provider networks, register for my Data Center 3.0 for Networking Engineers or Market trends in Service Provider networks webinars or buy their recordings.

12 comments:

  1. Petr Lapukhov, 12 July, 2010 08:22

    A few words in defense of bridging :) I would not say that bridging, interpreted as multiaccess link "extension", is *all* bad (what method is used for such extension is not important at the moment). "Bridging" defined as above has one huge benefit - simple ad-hoc deployment, which was so important in the 80s and 90s.

    Even now, in the days when UPnP and Zeroconf have been available for a while (though never widely accepted), using Ethernet "bridging" (defined as single-link extension) is the simplest and cheapest way of hooking up network devices with minimum (initial) headache. At least without hiring a networking expert :) And in *some* situations this works just fine.

    The main problem (and benefit at the same time!) of any ad-hoc technology is low manageability. I would like to stress that it IS possible to develop a scalable multi-access link technology (though TRILL is not an example :). There are Ethernet "modifications" that scale perfectly in the sense that they may accommodate tens of thousands of nodes on a single virtual link without any problems. However, it's almost impossible to properly manage an unstructured ad-hoc network; nor is it possible to implement a complex security policy.

    Is there a solution? Well yes, create another overlay structure on top of the ad-hoc topology (e.g. require all nodes to register, authenticate and join a community/group) and use it for management. Works, but removes the only single benefit that ad-hoc bridging had! :)

  2. Dmitri Kalintsev, 12 July, 2010 10:10

    STP is *not* a masterpiece of engineering, by a long shot.

    It is possible to design a resilient and scalable L2 network. It takes care, discipline and foresight, just like any other engineering task.

  3. "STP is *not* a masterpiece of engineering, by a long shot."

    ^ Agreed. The original 802.1D was embarrassingly slow and limited. It took the better part of a decade to evolve into something half-decent.

  4. Ivan Pepelnjak, 12 July, 2010 18:15

    Transparent bridging is obviously a failure (more about that in another post). In my opinion, it's impossible to design a resilient and scalable L2 network without introducing so many features that it's in the end indistinguishable from an L3 network. If you believe otherwise, please show me an example.

    Next, do not forget that the bridges' hardware at that time was severely limited. It's easy to be smart and creative when you have 4GB of RAM and a CPU that's faster than yesterday's mainframes. When all you have is 64KB of RAM and a 4MHz CPU (or something equivalent), your options are "somewhat" limited. Designing a protocol that has very low CPU+RAM requirements and almost-constant complexity is good engineering.

    Last but not least, don't blame STP for the monstrosity of transparent bridging. STP is just a protocol that finds a loop-free topology. If it's too slow for you, you can always fine-tune the timers. "Original" OSPF, IS-IS or BGP are also embarrassingly slow. Just imagine: 40 seconds just to discover your neighbor is down (or 3 minutes in BGP case) ... unbelievable. What were those idiots thinking? Oh, maybe, just maybe, they were forced to run those protocols on 64 kbps links instead of 10Gbps ones.

  5. I disagree. STP is an elegant and highly functional protocol that is a masterpiece of design. What is also true is that it was designed for Z80 CPUs and very low memory and hasn't been upgraded like many other protocols.

    The limitations are not inherent in the protocol, but in the way people implement it.

  6. Petr Lapukhov, 12 July, 2010 20:08

    Speaking of bridging I wonder why no one mentioned PBB/PBT and connection-oriented Ethernet in general :)

  7. Ivan Pepelnjak, 12 July, 2010 20:38

    OK, I'll pick up that challenge. But first we need some baseline stuff ... coming in a few days.

  8. Blake Erickson, 13 July, 2010 03:59

    Thanks for this post and for the link to the Radia video. I'll definitely watch it. I'm currently studying for my SWITCH exam and it was nice to hear some positive things about STP.

  9. Dmitri Kalintsev, 14 July, 2010 13:36

    > please show me an example

    A ~400 site network a couple of my ex-colleagues designed, built (around 2002) and operated for about 6 years for one of key customers. Two "head offices" and ~400 branches, running between 10 and 100 Mbit/s each. HO sites were connected using 8-link Gigabit Ethernet channels. Purely switched network, using only MST+ and LAG, nothing else. Equipment-wise - a bunch of 4500s at larger sites, 3550s serving groups of closely located branches and media converters at the branches themselves. *Not one* major outage (and yes, the network was fully alarmed for connectivity *and* performance, using IP SLA). Maybe lucky.

    That is probably the most prominent example. Since then we've all gone VPLS, so no more true blue switched Ethernet. That said, the local incumbent still operates their Carrier Ethernet platform in the true blue fashion. 7600s, 3550s, 3750s, etc. They also didn't have too many misfortunes. Until about a couple of weeks ago, when they did. Apparently due to a failed card somewhere. Yes, many, *many* angry and some *very* angry customers, as yes, these networks are not too easy to troubleshoot when they "go", so it took some hours to get the network back. :)

  10. Ivan Pepelnjak, 14 July, 2010 15:49

    @Dmitri: I must admit, I'm totally impressed. Would you be willing to share more information with me?

  11. Dmitri Kalintsev, 14 July, 2010 23:36

    Yes, of course. Could we take it offline? I assume you can see my email address, to which the notifications from this blog go?

  12. Dmitri Kalintsev, 14 July, 2010 23:43

    We probably have different definitions of elegance and function. For a bit of perspective, mine includes the ability to fail in a safe manner and the ability to facilitate fault finding with reasonable ease. Judging by the real-life examples I have seen to date, STP is none of that, irrespective of how much bad poetry is written to describe its workings. :)



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.