Could We Build an IXP on Top of VXLAN Infrastructure?

Wednesday, March 28, 2018 11:19 +0200


Andy sent me this question:

I'm currently playing around with BGP & VXLANs and wondering: is there anything preventing us from building a virtual IXP with VXLAN? This would then be a large layer-2 network, but why has nobody built this so far, and why don't Internet exchanges provide it?

There was at least one IXP that was running on top of VXLAN. I wanted to do a podcast about it with people who helped them build it in early 2015 but one of them got a gag order.

In the meantime, several IXPs deployed VXLAN in production including:

INEX (they also open-sourced their management software) – pointer provided by Anonymous, more information from Nick Hilliard in the comments;
LONAP – pointer provided by Blake, more information from Will Hargrave in the comments;
Equinix in several metro fabrics.

Want to know why you need an L2 network to run an IXP? I wrote about that in 2012.

This leads me to another topic: IXPs are mostly local; nobody has yet spanned a single layer-2 VLAN across the whole of America or Europe. I've tried finding some information, but I don't know what I am missing. What prevents somebody from building such a large layer-2 network?

Point-to-point layer-2 networks spanning continents have been a reality since (at least) Frame Relay days, and there’s at least one SP offering L2-over-VXLAN across the US; they might be using EVPN as the control plane. The trick to making these things work is to keep the L2 domain small and to minimize the impact of potential stupidities or bad hair days on either the customer network or the transport infrastructure.

Large L2 domains spanning continents or countries? It has been tried many times before, and failed miserably every single time. I’m positive someone will try to do it again now that you can move VMs across the continent.

Of course, latency may be an issue, but if you have a quite flat design STP should not be your problem ...

How about the fact that a single endpoint could bring down the whole network with a broadcast storm? All it takes is a broken NIC.

Keep in mind that even the regular broadcast caused by ARP gets so damaging in large L2 domains that people like AMS-IX had to deploy ARP Sponge to limit its damage.
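
The idea behind an ARP sponge can be sketched in a few lines. The toy model below is not the AMS-IX implementation; the class name, the threshold, and the MAC address are assumptions made for illustration. It counts ARP requests for an address that has been silent, and once the count passes a threshold it starts replying itself, so the requesters cache an answer and stop flooding the L2 domain with repeated broadcasts.

```python
from collections import defaultdict


class ArpSpongeSketch:
    """Toy model of the ARP-sponge idea (hypothetical, simplified)."""

    def __init__(self, sponge_mac: str, threshold: int = 3):
        self.sponge_mac = sponge_mac        # MAC the sponge answers with (assumed)
        self.threshold = threshold          # requests tolerated before sponging (assumed)
        self.pending = defaultdict(int)     # requests seen per silent target IP
        self.sponged = set()                # IPs the sponge currently answers for

    def on_arp_request(self, target_ip: str):
        """Return the MAC to reply with, or None to stay silent."""
        if target_ip in self.sponged:
            return self.sponge_mac
        self.pending[target_ip] += 1
        if self.pending[target_ip] > self.threshold:
            # The target looks dead: start answering on its behalf.
            self.sponged.add(target_ip)
            return self.sponge_mac
        return None

    def on_traffic_from(self, source_ip: str):
        """The real owner is alive: stop sponging and reset its counter."""
        self.sponged.discard(source_ip)
        self.pending.pop(source_ip, None)


sponge = ArpSpongeSketch("02:00:00:00:00:01", threshold=2)
replies = [sponge.on_arp_request("192.0.2.10") for _ in range(3)]
```

With a threshold of 2, the first two requests go unanswered and the third gets the sponge's MAC. As soon as traffic from the real owner shows up, `on_traffic_from` makes the sponge step aside again.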

Long story short: Friends don’t let friends build large layer-2 domains, more so if the said domain spans more than a single site. Or as Ethan Banks said once, nuked earth is not a nice sight.

Want to know more?

Lukas Krattiger and I will talk about multi-site and multi-pod data center fabrics (and how to build them in a relatively sane way) in another live session of the Leaf-and-Spine Fabric Architectures webinar on March 29th;
You’ll find even more information about data center fabrics in the Designing and Building Data Center Fabrics online course;
Dinesh Dutt will talk about EVPN-with-VXLAN details in the second part of the EVPN Technical Deep Dive webinar on April 5th.

11 comments:

Anonymous 28 March 2018 12:56

Whether it's VXLAN (probably BGP EVPN) or VPLS (probably Kompella) or PBB or what have you, you still suffer the full mesh scale problem. In my opinion you can't fix that with a route server because it would become the central chokepoint. Especially with 1 Terabit/s links.

Anonymous 28 March 2018 16:50

The premise here is not correct: most large IXP operators have already deployed VXLAN or VPLS for their Layer-2 domain. EVPN may be in use in newer IXPs as well, but I haven't seen any specific cases of this myself.

Replies

Ivan Pepelnjak 28 March 2018 17:12

VPLS - agreed, many.

VXLAN - would love to hear about more of them. Any pointers you can share?

Anonymous 28 March 2018 18:29

So it's time to create some value for the community. I just did a 5 minutes Google research. INEX migrated to VXLAN, see here: http://docs.ixpmanager.org/features/layer2-addresses/ . They also created the so-called IXP Manager with Saltstack (NAPALM), and it also has a nice web GUI: https://www.ixpmanager.org/media/2017/201709-nlnog2017-automation-with-ixp-manager.pdf . So we can now create our own IXP.

Ivan Pepelnjak 28 March 2018 18:32

"I just did a 5 minutes Google research." << touche ;)

Love the ixpmanager you found. Thanks a million!

Blake 29 March 2018 17:32

It's well known that the LONAP IXP fabric is VXLAN on Arista R-series (but provisioned via automation, not BGP EVPN), e.g.:
http://www.trefor.net/2016/05/25/lonap-ripe/

Will Hargrave 30 March 2018 10:52

LONAP CTO here!

We're a mid-sized IXP (~200 networks connected, ~3Tbit connected capacity) who have been running VXLAN on Arista for around two years now, with great results. This is in a 'flood and learn' config with Head-End-Replication (HER), i.e. we are replicating BUM at the edge to all other edge nodes. We are doing some testing with EVPN-on-VXLAN, although it is worth noting that it lacks some of the compelling advantages for an IXP that it has for an L2/L3 datacentre network.

Our friends over at INEX are in a similar setup for their primary LAN, which I think they deployed during 2017.

We started down this road in mid-2015 after a failed deployment of VPLS/MPLS with another vendor, and it swiftly became clear that a 'datacentre class' leaf-spine architecture with something like VXLAN was the way to go for a growing IXP of our size. ECMP has let us scale easily from n*10G to n*100G in the core, and with VXLAN, the entropy encoded in the source UDP port means intermediate network elements can load-balance the traffic effectively.
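
A minimal sketch of the source-port entropy mentioned above, assuming a CRC32 hash over the inner flow fields as a stand-in for whatever hash the switch silicon really uses. The function name and the choice of fields are hypothetical; the port range follows the RFC 7348 recommendation to pick the outer source port from the dynamic/private range 49152-65535.

```python
import zlib

# Dynamic/private port range recommended by RFC 7348 for the outer source port.
PORT_MIN, PORT_MAX = 49152, 65535


def vxlan_source_port(src_ip: str, dst_ip: str, proto: int,
                      src_port: int, dst_port: int) -> int:
    """Derive the outer UDP source port from the inner flow (illustrative hash)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    flow_hash = zlib.crc32(key)  # stand-in for the hardware hash function
    return PORT_MIN + flow_hash % (PORT_MAX - PORT_MIN + 1)


# Packets of one inner flow always get the same outer source port, so transit
# switches hashing on the outer 5-tuple keep the flow on a single ECMP path.
port_a = vxlan_source_port("192.0.2.1", "198.51.100.7", 6, 40000, 179)
port_b = vxlan_source_port("192.0.2.1", "198.51.100.7", 6, 40000, 179)
```

Because the outer destination port and the VTEP addresses are identical for all traffic between two edge switches, the source port is the only field that lets the core spread different inner flows across parallel links.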

As regards the topic of large L2 networks 'spanning the globe', I think we need to take a step back from technology and look at human and commercial factors. By far the most popular model for IXP charging is a low flat-rate per-port model. It is more difficult to keep control of costs if you have expensive leased capacity there, which is why successful IXPs keep to the metro, where they can scale easily and avoid competing with their own members.

Moreover, there is an expectation among network operators that the endpoint of their BGP session across the fabric is relatively nearby. Long-stretched L2 domains are unpopular among many as they break this assumption and cause hairpinning, and thus a bad end-user experience. There is a role for the stretched IXP model, i.e. 'reseller' programs and the like, under controlled conditions. But most operators prefer to meet over a fast, local fabric in the metro.

LONAP in her 21 years of existence has seen many such 'global IX' operators come and go. :)

Replies

Ivan Pepelnjak 30 March 2018 11:19

Thanks a million for your feedback - and the business perspective of why long-distance IXPs make little sense.

Nick Hilliard 30 March 2018 10:53

vxlan generally works better for ixps than vpls. There are some major wins in terms of complexity reduction, including no requirement for lsp management and out-of-the-box load-balancing over parallel links in the ixp core. This is messy to handle in vpls because most silicon out there won't inspect L4 headers when creating n-tuple hashes for load-balancing. So you end up either having to use expensive hardware or else building multiple lsps between each pair of PEs, with traffic load-balanced by the PE on a per-LSP basis. This doesn't scale well because you end up with O(PE^2) LSPs on your network.
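
To put numbers on the O(PE^2) growth, here is a back-of-the-envelope sketch. It assumes a full mesh of unidirectional LSPs between PEs with a fixed number of parallel LSPs per direction; the function name and the example figures are made up for illustration.

```python
def full_mesh_lsp_count(pe_count: int, parallel_lsps: int = 1) -> int:
    """Unidirectional LSPs needed for a full mesh between PEs.

    Every PE needs `parallel_lsps` LSPs towards each of the other PEs.
    """
    return pe_count * (pe_count - 1) * parallel_lsps


small = full_mesh_lsp_count(20, parallel_lsps=4)   # 20 * 19 * 4 = 1520
large = full_mesh_lsp_count(40, parallel_lsps=4)   # 40 * 39 * 4 = 6240
```

Doubling the number of PEs roughly quadruples the number of LSPs to provision and monitor.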

EVPN control plane for vxlan isn't ready for production networks yet. Hopefully soon.

Replies

Ivan Pepelnjak 30 March 2018 11:21

Hi Nick! So nice to hear from you after a long long time.

Thanks for the feedback (and yet again: I love your software).

Marian Ďurkovič 15 April 2018 11:27

Exactly those reasons led us to build SIX.SK on top of TRILL back in 2014. Simple, straightforward and stable for almost 4 years. One bonus point: in TRILL, BUM traffic is distributed natively, without the need for central replication.
