Why Do Internet Exchanges Need Layer-2?

My tweet about the latest proof of my layer-2 = single failure domain claim has raised numerous questions about the use of bridging (aka switching) within Internet Exchange Points (IXP). Let’s see why most IXPs use L2 switching and why L2 switching is the simplest solution to the problem they’re solving.

What is an Internet Exchange Point?

This section is a gross oversimplification intended for readers who have never been exposed to this topic. Please listen to the Packet Pushers Show 24 for a more in-depth IXP discussion.

Quick summary: an IXP is a place where ISPs (members of the IXP) exchange traffic.

Only a few very large transit providers are considered Tier 1 networks (see Renesys blog for yearly updates); everyone else has to buy transit to the rest of the Internet from one or more of the bigger fish in the pond.

The smaller providers are thus interested in minimizing the amount of transit traffic, and peering agreements are usually a good mechanism for doing so. However, there are usually tens or hundreds of ISPs operating in a given geographical area, and private peering between them would result in an N-square full mesh problem. It’s thus in the interest of almost everyone to meet in a common place, connect to a shared infrastructure – Internet Exchange Point – and exchange traffic.
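To put rough numbers on the full-mesh problem, here is a back-of-the-envelope sketch in Python. The function names and the sample ISP counts (10, 100, 500) are illustrative assumptions, not figures taken from any particular exchange:

```python
def full_mesh_links(n: int) -> int:
    """Private interconnects needed if each of n ISPs links directly to every other."""
    return n * (n - 1) // 2


def shared_fabric_ports(n: int) -> int:
    """Connections needed if each of n ISPs plugs once into a shared IXP fabric."""
    return n


# Hypothetical member counts, chosen only to show how fast the mesh grows.
for n in (10, 100, 500):
    print(f"{n} ISPs: {full_mesh_links(n)} private links "
          f"vs {shared_fabric_ports(n)} IXP ports")
```

The shared infrastructure cuts the number of physical connections from N(N-1)/2 to N; the number of BGP sessions still depends on how many members each ISP decides to peer with.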

How does an IXP work?

To keep things simple, let’s gloss over the details, and assume that every ISP participating in an IXP brings its own router to premises owned by the IXP, and connects it to a shared network infrastructure.

Each ISP has its own AS number and uses BGP to exchange routes with other ISPs. ISPs might decide to peer with everyone, or with a select set of peers, and accept all routes or just a few routes from their peers.

An ISP can also decide to implement local transit agreements across an IXP infrastructure, or prefer routes from one of the peering partners over routes received from another peering partner.

In the example in the following diagram, AS 3 receives two paths toward AS X, one from AS 2, one from AS 4. It might prefer the route through AS 4, whereas AS 1 cannot use that route, since it’s not peering with AS 4 (unless AS 2 is willing to provide transit services).

To summarize: each ISP participating in an IXP might have its own BGP routing policy, resulting in an individualized view of the local parts of the Internet.
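The individualized view can be illustrated with a toy route-selection model. This is a minimal sketch, not real BGP: it only compares local preference, the tie-breaker is a stand-in for the rest of the BGP decision process, and the peering sets and preference values are assumptions modeled on the AS 1 to AS 4 example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    neighbor_as: int      # the IXP peer that advertised the route
    local_pref: int = 100


def best_route(routes):
    """Highest local preference wins; ties go to the lowest neighbor AS
    (a simplification standing in for the remaining BGP tie-breakers)."""
    return max(routes, key=lambda r: (r.local_pref, -r.neighbor_as))


PREFIX = "prefix-in-AS-X"            # hypothetical destination prefix
ADVERTISERS = (2, 4)                 # AS 2 and AS 4 both offer a path to AS X
PEERINGS = {1: {2}, 3: {2, 4}}       # assumed: AS 1 peers only with AS 2
LOCAL_PREF = {3: {4: 200}}           # assumed: AS 3 prefers routes from AS 4


def view(asn: int) -> Route:
    """Best route toward AS X as seen by the given AS."""
    candidates = [
        Route(PREFIX, nbr, LOCAL_PREF.get(asn, {}).get(nbr, 100))
        for nbr in ADVERTISERS
        if nbr in PEERINGS[asn]      # routes arrive only over existing peerings
    ]
    return best_route(candidates)


print("AS 1 reaches AS X via AS", view(1).neighbor_as)
print("AS 3 reaches AS X via AS", view(3).neighbor_as)
```

In this model AS 1 and AS 3 sit on the same exchange yet end up with different best paths to the same destination, which is exactly the per-member view the IXP infrastructure has to support.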

IP- or MAC-based forwarding in an IXP?

In the previous diagrams, the IXP infrastructure was drawn as a symbolic Ethernet cable, and some very early IXPs were actually implemented that way, using either thick coax or Ethernet hubs.

Today we could use L2 or L3 switches to implement the IXP infrastructure. Ethernet-based IXP design is obvious and simple (while we’re still glossing over details): all ISP routers connect to a switched LAN.

We all know bridging doesn’t scale, so one might want to implement IP-based IXP infrastructure – all ISP routers would be connected to an IXP router, exchange BGP routes with it, and potentially still run BGP between themselves to support various routing policies.

This scenario might work as long as all ISPs share the same routing policy. IP uses hop-by-hop destination-only forwarding (tunnels are obvious exceptions triggering the scholastic “Is MPLS Tunneling?” problem), and thus it’s impossible for the IXP router to forward packets from different ISPs based on their preferred routing policy.

Going back to our example: if the IXP router decides to prefer the route to AS X going through AS 2, it’s impossible for AS 3 to forward the packets toward AS X through AS 4. While the router in AS 3 might decide to prefer the path advertised by AS 4, once the IP packets leave it and arrive at the IXP router, the IXP router will make its own independent forwarding decision and send the packets to AS 2.
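The difference between the two designs can be sketched as a tiny hop-by-hop forwarding simulation. Everything here is a simplified assumption: the node names, the one-entry forwarding tables, and the idea that each node looks up nothing but the destination. In the L3 design the only next hop AS 3 can use is the IXP router; in the L2 design the switches are transparent, so AS 3 can point straight at the AS 4 router.

```python
def trace(src, dst, fibs, max_hops=8):
    """Follow destination-only lookups from src until the packet reaches dst."""
    path = [src]
    node = src
    while node != dst:
        if len(path) > max_hops:
            raise RuntimeError("forwarding loop or missing route")
        node = fibs[node][dst]   # each hop decides on the destination alone
        path.append(node)
    return path


# L3 IXP: AS 3 would like to use AS 4, but must hand packets to the IXP router,
# which has independently picked AS 2 as its best path toward AS X.
l3_fibs = {
    "AS3": {"ASX": "IXP-router"},
    "IXP-router": {"ASX": "AS2"},
    "AS2": {"ASX": "ASX"},
    "AS4": {"ASX": "ASX"},
}

# L2 IXP: the exchange is a transparent switched LAN, so the next hop chosen
# by AS 3 (the AS 4 router) is reached directly and AS 3's policy is honored.
l2_fibs = {
    "AS3": {"ASX": "AS4"},
    "AS2": {"ASX": "ASX"},
    "AS4": {"ASX": "ASX"},
}

print("L3 IXP:", " -> ".join(trace("AS3", "ASX", l3_fibs)))
print("L2 IXP:", " -> ".join(trace("AS3", "ASX", l2_fibs)))
```

With the routed exchange the packet from AS 3 ends up in AS 2 regardless of what AS 3 prefers; with the bridged exchange it follows the path AS 3 selected.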

Conclusion: Internet Exchange Points are one of the rare scenarios where large L2 domains actually make sense, and once they grow and get distributed across multiple locations (example: AMS-IX, LINX), they get exposed to the same set of problems all large L2 networks face, including occasional meltdowns.

26 comments:

  1. It is probably also worth mentioning that IXPs try to do it as cheaply as possible, while having to deal with large numbers of high-speed links.

    And we all know how much high-speed router interfaces cost.

    Just my 2c.

    Replies
    1. But the modern high-end data center switches all support L3 switching and all of them have at least 16K IPv4 routes and BGP support. Oh, wait ... ;))

  2. Reuben Farrelly 12 July, 2012 13:39

    Yeah - but then even a 2921 with a bit of extra DRAM and an IPBASE featureset can handle 200 MBit/sec or so of traffic (remember there's no NAT or anything - just raw routing/CPU required for this function). A router dedicated to a peering link can also be run without a default route, which adds somewhat to the security and makes it harder for other peering participants to use you as a default gateway to the Internet.

    In terms of routes, yes this would be too much for an entry level floor switch like a 3560, but hardly a big job for a modern software based router like the ISR G2's.

    As a point of reference the ISP I operate pulls about 1/3 of our total traffic from the PIPE Sydney Peering here in Australia, and we get just a shade over 10,000 IPv4 routes.

    The savings we make from that particular peering link, compared with purchasing more upstream capacity, cover the cost of probably half a dozen 2900s each month alone. So it's really a no-brainer, and if I had to put a dedicated router in to do this function there would be no issue justifying the capital spend.

    If you aren't saving enough money from peering to at least fund the very modest hardware required, then it's probably marginal whether it's even worth your while setting up in the first place.

    Replies
    1. Reuben, the PIPE peering network is unusual in that it is a Multi Lateral Peering Agreement IXP. Even though that could in theory allow them to run a pure L3 exchange, it is still a layer 2 network running over a Brocade MLX-e core. The routers PIPE runs do not get in the traffic path.

  3. Ah, just throw a fabric over it and be done with it (sarcasm)

    Replies
    1. Thank you for your comment ;-)

  4. L2VPN is the best solution for an IXP.
    Routing is simplified, as all they need to do is bridge a few MAC addresses (routers) over a large L2 domain,
    and they still have TE to handle their huge traffic volume.

    Replies
    1. Except that the transport (assuming you are talking about MPLS L2VPN) won't be as solid as a pure L2 Ethernet switch, and you will have extra overhead (2+ MPLS labels + MAC).

    2. I believe the Juniper LAN at LINX mentioned in that Register article is an L2VPN infrastructure. It was built using Juniper MX and PTX equipment from the presentations I've seen.

  5. Ivan, as always you raise great points. I do feel however that certain facets need pointing out: with all its shortcomings, a bridged Ethernet environment imo yields the best ratio of price/ease of deployment/market penetration. Combined with a carefully thought-out architecture (VPLS + tools that minimize BUM floods, for instance arpsponge, L2 ACLs etc.) the faults do not outweigh the merits. 497 connected members and counting can attest to that :-) I may work at AMS-IX but honest to $deity I'm not trying to self-praise as a recommendation.

    Also, if I read you correctly, the IP-based IXP you mention sounds pretty close to a route server service. The difference is that it still uses the "unscalable" L2 domain, but it *does* provide server-based filtering for members to peruse, usually either via BGP communities or IRRdb expressions, so they can pretty much apply distinct routing policies.

    Replies
    1. I know you run a tight (and well-oiled) shop, so the drawbacks of L2 don't hurt that much. Also, I would assume large IXPs carry a significant number of routes, so you could no longer use DC L3 switches (even if you wanted to) and would need way more expensive carrier-grade routers.

      The real difference between IP-based IXP and route server service is that the route server is optional. I can decide not to use a route server, use it as a backup mechanism, or augment it with direct peerings, whereas IP-based IXP enforces a common routing policy in the forwarding plane.

    2. Yes, so even then, capex/opex comes into play. The simpler the hardware, the cheaper it is (at least most of the time) and the easier it is to troubleshoot and maintain.

    3. Ivan, daydreamer - great discussion. I have wondered why this problem isn't solved by having the peers connect to each other using eBGP over GRE tunnels. That way:
      1. The IXP can provide an "L3 mesh" which is a lot more stable
      2. The IP prefix scale requirements on the IXP router would be pretty small (would mimic the IGP scale which is well within the bounds supported by most DC switches)

      Now perhaps the availability of a robust BGP stack and GRE support on DC switches does not leave the IXPs with a lot of vendor choice, but surely the option is worth exploring.

    4. GRE wouldn't help. Not all high-speed boxes support GRE in hardware, and you'd end up with tons of tunnels that would have to be configured on all routers participating in the IX. Definitely not as scalable as bridging ;)

  6. Why don't IXPs just use VLANs for each peer? Say ISP A wants to peer with ISP C, so the IXP says OK, that interconnection will be VLAN 8. That way they don't have large L2 domains. Or is that a bad idea also?

    Replies
    1. You just created an N-square problem. How many VLANs would you need in AMS-IX with ~500 members?

    2. Well, that's what some of the IXPs on this side of the Atlantic do. Might have something to do with the business model though.

    3. Ivan, there's also the expectation of physical availability that can be played with. Who says cloud number 7 needs to be available in room 8? How many clouds with 4k VLANs do you need in a single room? :)

    4. @liiwi: What you're describing sounds more like a set of colocated private peerings than a public peering. Interesting.

  7. As I remember, AMS-IX is migrating or has migrated to VPLS. Maxim from AMS-IX said that at an ENOG meeting: http://www.enog.org/presentations/enog-3/25-future_ix_eng.pdf

    Kirill

    Replies
    1. I believe LINX in London completed such a re-engineering project prior to the Olympics... http://www.nanog.org/meetings/nanog54/presentations/Wednesday/Cobb.pdf

    2. Kirill, thank you for the link to my presentation. ;)

      It wasn't mentioned in the presentation, but AMS-IX has been using MPLS/VPLS for the last 3 years.

  8. How about an IX-local IGP with EBGP multihop on top?
    That could be brought to scale, right?

    Replies
    1. How would the interim (IXP-owned) router know where to send the traffic?

      Remember: in an IP network, each router performs an independent L3 lookup on the destination IP address.

  9. Neither AMS-IX nor LINX uses "classical" bridging: internally they use VPLS over MPLS with TE tunnels. Of course that doesn't fix all issues of a "big bridging domain", but it adds enormous scalability without using buzzwords like TRILL/SDN/OpenFlow.



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.