What Is Layer-2 and Why Do We Need It?

I’m constantly ranting against large layer-2 domains, recently going as far as saying “we don’t really need all that stuff.” Unfortunately, the IP+Ethernet mentality is so deeply ingrained in every networking engineer’s mind that we rarely stop to question its validity.

Let’s fix that and start with the fundamental question: What is Layer-2?

I don’t know whether they still teach the OSI model in baseline networking courses (they should), but if you’re lucky enough to have heard about it, this is probably the picture you’ve seen:

Bottom layers of the OSI stack


Layer-1 (physical layer) is easy to explain: it defines the connectors, voltages and encoding scheme needed to pass a bunch of zeroes and ones between adjacent devices.

Networking devices working at layer-1 convert zeroes and ones between different voltages or transmission media. A few examples: modems (the traditional ones, not the box that connects your WiFi to the cable network; that one is at least a bridge, if not a router), media converters (fiber-to-copper converters are still reasonably popular), hubs (you have to be pretty old to have real-life experience with them) and Media Access Units (Token Ring anyone?).

Layer-2 is where things get complex. It was initially defined as the layer that allows adjacent network devices to exchange frames. Every layer-2 technology has to define at least these components:

  • How do you group zeroes and ones provided by layer-1 into frames (proper layer-2 terminology for packets);
  • Start-of-frame indication (the receiver has to know a new frame is coming), sometimes also known as frame synchronization;
  • End-of-frame indication, which can be either a special bit sequence or frame length encoded somewhere else in the frame;
  • Error detection (and sometimes correction) mechanism in case the physical layer cannot guarantee error-free transmission of zeroes and ones (it usually can’t).
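To make the list above more concrete, here is a minimal Python sketch of a made-up framing scheme. Nothing in it comes from a real protocol: the 0x7E start-of-frame byte, the 2-byte length field (used as the end-of-frame indication) and the CRC-32 trailer are assumptions chosen for illustration, and the function names are hypothetical. It only detects errors; a real layer-2 technology would also have to deal with the marker byte appearing in the noise before a frame (byte or bit stuffing), which this sketch ignores.

```python
import struct
import zlib

# Hypothetical start-of-frame marker (an assumption for this sketch).
FLAG = 0x7E


def encode_frame(payload: bytes) -> bytes:
    """Build: flag (1 byte) | length (2 bytes) | payload | CRC-32 (4 bytes)."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too long for a 2-byte length field")
    header = struct.pack("!BH", FLAG, len(payload))
    trailer = struct.pack("!I", zlib.crc32(payload))
    return header + payload + trailer


def decode_frame(stream: bytes) -> bytes:
    """Locate the first frame in a byte stream and return its payload.

    Raises ValueError if no frame is found, the frame is truncated,
    or the CRC check fails.
    """
    # Start-of-frame indication: scan for the marker byte.
    start = stream.find(bytes([FLAG]))
    if start < 0:
        raise ValueError("no start-of-frame marker found")
    if len(stream) < start + 3:
        raise ValueError("truncated frame header")

    # End-of-frame indication: the length field tells us where the frame ends.
    (length,) = struct.unpack_from("!H", stream, start + 1)
    body_start = start + 3
    body_end = body_start + length
    if len(stream) < body_end + 4:
        raise ValueError("truncated frame")

    payload = stream[body_start:body_end]

    # Error detection: recompute the checksum and compare with the trailer.
    (received_crc,) = struct.unpack_from("!I", stream, body_end)
    if zlib.crc32(payload) != received_crc:
        raise ValueError("CRC mismatch, frame corrupted")
    return payload
```

A receiver would call decode_frame() on whatever layer-1 hands it and drop (or ask for retransmission of) anything that raises an error. Notice that the frame carries no addresses at all, which is fine as long as there are only two devices on the link.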

Have you noticed that I haven’t mentioned layer-2 addresses (known as MAC addresses in Ethernet)? There’s a simple reason for that: sometimes you don’t need them. You only need layer-2 addresses when you have more than two devices attached to the same physical network, like we used to have in the old cable-based Ethernet networks:

Emulating coax cable with Ethernet gear


Use of point-to-point links is the primary reason for lack of layer-2 addresses in Fibre Channel networks (regardless of the violent disagreement I get every time I mention this).

The first time you truly need unique addresses is layer-3, which should provide end-to-end packet delivery across the network.

Now let’s answer some interesting questions:

Why do we have MAC addresses in Ethernet frames? Because the original Ethernet used a coax cable with numerous devices attached to the same physical medium.
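The difference between a shared medium and a point-to-point link can be sketched in a few lines of Python. This is a toy model, not a description of any real Ethernet implementation: the Station, SharedCable and PointToPointLink classes and the single-letter addresses are all hypothetical names made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    dst: str        # layer-2 destination address (toy labels, not real MACs)
    payload: bytes


@dataclass
class Station:
    address: str
    received: list = field(default_factory=list)

    def on_frame(self, frame: Frame) -> None:
        # On a shared cable every station sees every frame,
        # so each one has to filter on the destination address.
        if frame.dst == self.address:
            self.received.append(frame.payload)


class SharedCable:
    """Multi-access medium: a transmitted frame reaches all attached stations."""

    def __init__(self, stations):
        self.stations = list(stations)

    def transmit(self, sender: Station, frame: Frame) -> None:
        for station in self.stations:
            if station is not sender:
                station.on_frame(frame)


class PointToPointLink:
    """Exactly two endpoints: whatever one side sends is meant for the other."""

    def __init__(self, a: Station, b: Station):
        self.ends = (a, b)

    def transmit(self, sender: Station, payload: bytes) -> None:
        # No destination address needed; there is only one possible receiver.
        receiver = self.ends[1] if sender is self.ends[0] else self.ends[0]
        receiver.received.append(payload)
```

With three stations on a SharedCable, a frame sent from A to C is seen by B as well, and B has to look at the destination address to discard it. On a PointToPointLink the payload is delivered without any address in sight.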

Why do we still use MAC addresses in Ethernet frames? Because IEEE always wanted to keep everything backward compatible with the original Ethernet. 100 Mbps Ethernet still supported hubs (think of them as cable extenders), and 10GE is the first Ethernet technology that doesn’t have half-duplex support.

Only a single sender can transmit over a shared medium (cable, WiFi frequency) at the same time – we call that half-duplex transmission. Both end nodes can transmit at the same time on a bidirectional point-to-point link with two unidirectional paths – a full-duplex transmission.

Do we still need layer-2? In many cases, the answer is no. Every device that uses software-based forwarding can act as a layer-3 forwarding device (properly known as a router but called a layer-3 switch by almost everyone). Hardware in many high-speed forwarding devices (particularly switches deployed in data centers) already supports layer-3 forwarding.

Why are we still using layer-2? Because every vendor (apart from Amazon and the initial heroic attempts by the Hyper-V Network Virtualization team) thinks they need to support really bad practices that originated in the thick yellow coax cable environment, like protocols without layer-3 (and thus no usable end-to-end addresses) or solutions that misuse the properties of a shared medium in ways nobody ever envisioned.

Finally, why is everyone using the frame format from a 40-year-old technology? Because nobody wants to change the device drivers in every host deployed in a data center (or in the global Internet).

Want to Know More?

Read these blog posts and watch the How Networks Really Work webinar

23 comments:

  1. You've been in this field for too long, Ivan. You're seeing the flaws in everything.
    That being said, a very interesting point. I never looked at it that way, and now I understand it better.
  2. Connect your servers with a /31 to the local L3 device and either use the /31 local address as source or advertise a loopback and source from there. Need to move a server? No problem. You can even anycast like this. Nice and scalable
  3. One concern (historically) was the burn of IPv4 addresses. Even /30s or /31s in conjunction with RFC1918 weren't enough for truly large environments. IP Unnumbered helps, when/where supported ...

    Additionally, legacy applications (think old school MS SLB) sometimes *required* broadcast reachability. Bad decisions from the get-go, but sometimes reality fails to achieve the Best, or even a Good, answer(s).

    IPv6 really makes this easy - Link Local addresses are already there, and GUAs are (effectively) limitless as well.

    And moving forward, methinks IP fabrics will take everything else over (at least in the data center space).


    /TJ
    Replies
    1. 10/8 gives you 16.7 million IPs. Halve this for /31s and you can still address 8.4 million servers.

      However I agree with your v6 statement
  4. You're saying what I have thought dozens of times. I just haven't been courageous enough to try to design something better than Ethernet.
  5. Pretend you are a device receiving a stream of bits. After you receive some inter-frame spacing bits, whatever comes next is the 2nd layer; whether that is Ethernet, native IP, CLNS/CLNP, whatever. Perhaps the question should be "Are we using the right layer 2 protocol?" rather than "Why are we still using layer 2?"
  6. Let's ask another question: do we really need Layer 3 as seen in TCP/IP model?

    Named Data Networking:
    http://named-data.net/project/archoverview/
  7. It seems that Cisco is interested in NDN (a replacement for TCP/IP):

    http://www.networkworld.com/article/2602109/lan-wan/ucla-cisco-more-join-forces-to-replace-tcpip.html
  8. First of all, I'm with you! But as someone who works to reduce layer 2 sprawl with customers daily, I would say the security discussion is missing. Microsegmentation can replace (and improve on) the isolation of VLANs, but that means you need a microsegmentation architecture to reduce layer 2. Microsegmentation (for now) is difficult to implement without virtualization or a similar abstraction of compute. Also, even if you are super progressive with Docker, nesting and traditional virtualization everywhere, there will be some edge network devices and hypervisors that need to be placed on *some* kind of layer 2 segment. This does not mean we have to use the legacy protocols of the past. Rather, I think it means we need to revisit how to implement a progressive, reduced layer 2 footprint where appropriate and eliminate it wherever it is not necessary.

    On the flip side, I would ask why layer 2 overlays (VXLAN et al.) are necessary at all, as given this argument it seems we are bringing some of the crutches of the past into the virtualized world. Why not rely on microsegmentation and be done with it?
    Replies
    1. Speaking of IPv6 microsegmentation ;) https://www.youtube.com/watch?v=2zvrzgGzyYw
    2. Layer 2 is the best way to connect MPLS to SDH, DWDM or microwave because it makes a line for troubleshooting
  9. I recall reading about IP directly over DWDM at one point back in early 00s.
    Replies
    1. Yeah, there was that... and I remember IP-over-SONET (POS?).
  10. l2 is faster? every l3 hop adds delay?
    faster for storage is better?
    Replies
    1. L2 lookup is almost exactly as fast as L3 lookup these days... at least on decent hardware. OK, maybe there's a few nsec difference if your hardware supports cut-through switching.
  11. oh.. ok, thank you.
  12. Seems to me that for both servers and clients, particularly running legacy operating systems, the hardened stack runs over L2 using an Ethernet NIC. While developing and hardening a new stack for Linux is feasible, for the legacy OS's I'd argue any new stack is just not going to happen.

    The last two alternate Layer 2's for the data center that I was involved with (however tangentially), Fibre Channel and InfiniBand, did not change the outcome of the competition of networks in the 1980s and early 1990s, which was that Ethernet (by which I mean IEEE 802.1, 3, 11) won.

    Concur that the L3 boundary is moving to the first hop Access switch (the ToR) in some leading edge applications. Note that this adds a lot of participants to the Layer 3 routing protocols, leading to at least some centralized control (regardless of whether it's called SDN, or a route reflector, or some other name) not unlike the centralized control which would have been needed for large Layer 2 installations.

    There is room, particularly in specialized applications like supercomputers, for innovation at Layer 2. I don't mean arbitrarily framing packet headers differently, I mean actually making a big difference for some class of application, or taking a lot of cost out of the solution. A compelling enough design could be mainstreamed, but it's tough to dislodge Ethernet.
  13. Is IP mobility that big a deal really? Are there enough use cases for maintaining TCP sessions with VM mobility?
    Even if it were the case, if you look at it, you would typically need it for VMs acting as servers (client/server context). Can we not solve this with a good service discovery solution or even a load balancer, in which case really the load balancer alone needs to know how to get to the service?
    Replies
    1. We need to think about this problem not as network people, but as application users. I have a user somewhere out on the Internet, using an application in my data center. I have to create the illusion that my application never goes down, particularly when my application is serving ads and I don't get paid if they don't show up.

      This translates to a set of interesting -- I almost might say "grand challenge" -- problems for networking. VM migration is useful; keeping a TCP session alive is sometimes useful; being able to migrate a (block of) IPs from one data center to another live is useful. But these are tools in a toolbox, not solutions in their own right.

      The "solution" is that my gmail window never hangs; my google maps application never hangs; my netflix movie never hangs and has to be restarted; Amazon never goes away, or wipes my shopping cart, or hangs in a way that my web browser or mobile application needs to be closed and restarted.
    2. Completely agree on the "solution" aspect. I want to understand how this translates to IP mobility at the back end. Invariably the application that serves the user requests is behind a load balancer and IP mobility might not be that big deal for this (North/South traffic). I suspect it is more relevant for East-West traffic. Is that true?
    3. Remember, I'm an old server and storage guy with only a few years' experience in networking: take the network side of this answer with a grain of salt.

      North/South: once the client has resolved the server to a single IP address, that IP address needs to keep responding to the client, regardless of particular servers / load balancers / routers / electrical transformers finding themselves engulfed in flames or otherwise offline.

      If the protocol is stateless (meaning the application was written in the last decade rather than in the 1990s) then it doesn't matter who's home, just somebody has to be.

      From an East/West perspective, if a collection of services which depend on each other is being live migrated, say to new hardware (which could be in a different data center), then keeping both IP and MAC addresses intact will be important. My guess is Google is able to open new capacity and then shut down old capacity, so doesn't do such migrations. Again, traditional Enterprise is very different: there will be a single instance of an application, with resources spent to make it reliable: very difficult to shut down the SAP back end to migrate it.

      It could be that the key east-west case is consolidated storage. Remember a couple of years ago when a network partition in a storage chunk of an Amazon data center caused a bunch of storage servers to simultaneously think they were the single surviving copy of data, and start emergency replication, bringing not just that storage chunk but the AWS instances depending on that storage to their knees? Yes, relative to people who'd done big disk arrays that was a novice specification oversight in an exception path in a piece of software. Back to the point: with that as backdrop, 5 years from now, with AWS live, migrate all of the contents of that chunk of storage to new hardware so the obsolete (not to mention worn-out) disks can be junked. Regardless of whether IP addresses are preserved, the East-West server to storage traffic has to stay live during the entire migration.
  14. Besides, while advertising host routes (including /31s for that matter) is technically feasible, all the routers would need to know all the routes which might not be practical in a large datacenter because of TCAM limitations. One of the reasons network virtualization might be a good approach.
    Replies
    1. Watch my IPv6 Microsegmentation presentation for an explanation of why that's not a problem: http://blog.ipspace.net/2015/04/video-ipv6-microsegmentation.html

      See also this blog post on host routes and ARP: http://blog.ipspace.net/2014/02/this-is-not-host-route-youre-looking-for.html

      Finally, here's a follow-up article to this one: http://blog.ipspace.net/2015/04/rearchitecting-l3-only-networks.html

      Hope this helps,
      Ivan