Layer-3-Only Data Center Networks with Cumulus Linux on Software Gone Wild
With the advent of layer-3 leaf-and-spine data center fabrics, it became (almost) possible to build pure layer-3-only data center networks… if only the networking vendors would do the very last step and make every server-to-ToR interface a layer-3 interface. Cumulus decided to do just that.
Episode 38 of Software Gone Wild with Dinesh Dutt from Cumulus is a deep dive into that functionality, covering everything from the concepts to load balancing on Linux and vSphere, routing requirements of Linux hosts, IP multicast issues, history lessons from the days of Mobile ARP, uRPF challenges in Cumulus Linux, and even the hardware offloads being planned for the Linux kernel.
A few more blog posts you might want to read to understand what we were talking about:
- Introducing Redistribute Neighbor (Cumulus Networks blog)
- Redistribute Neighbor (Cumulus Networks knowledge base article)
- Re-architecting layer-3-only networks
- ARP processing in layer-3-only networks
- Reinventing CLNS with layer-3-only forwarding
- This is not the host route you’re looking for
- VRRP, Anycast, Fabrics and Optimal Forwarding
- Arista EOS VARP behind the Scenes
- Mobile ARP in Enterprise Networks
- So You Need ISSU on Your ToR Switch?
If you’re still lost after listening to the podcast and reading these blog posts (yes, it’s a heavy topic), please write a comment (or send me an email), and don’t forget to include the approximate time in the podcast where we lost you.
Finally, just in case you don’t know who Tony Li is, here’s a slightly longer profile.
Therefore, ToR 1 only has the MAC address of host interface 1 (the one connected to ToR 1) in its ARP cache, and ToR 2 only has the MAC address of host interface 2 (the one connected to ToR 2). Inbound traffic that reaches ToR 1 (for that particular host) is always forwarded to host interface 1, and traffic that reaches ToR 2 is always forwarded to host interface 2. Traffic is actually load-balanced between both ToRs with ECMP on the spine switches (when the ToRs are the leaves, or on the leaf switches when we have a 3-tier Clos: spine, leaf, ToR), which means it's also load-balanced between both host interfaces.
Outbound traffic is, of course, load-balanced on the host itself between both interfaces with ECMP (see the sketch below).
Correct me if I am wrong :)
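To illustrate the outbound half of that description, here's a minimal sketch (Python wrapping iproute2) of the host-side ECMP default route. The interface names and next-hop addresses are made up, and it assumes each next hop is reachable on the corresponding ToR-facing link.

```python
# Minimal sketch: install an ECMP default route across both ToR-facing
# interfaces using iproute2 multipath syntax. Requires root; interface
# names and next-hop addresses are hypothetical.
import subprocess

uplinks = [("10.1.0.1", "eth0"), ("10.2.0.1", "eth1")]  # (ToR next hop, uplink)

cmd = ["ip", "route", "replace", "default"]
for nexthop, dev in uplinks:
    cmd += ["nexthop", "via", nexthop, "dev", dev]

# Equivalent CLI:
#   ip route replace default nexthop via 10.1.0.1 dev eth0 \
#                            nexthop via 10.2.0.1 dev eth1
subprocess.run(cmd, check=True)
```

With that route in place, the kernel load-balances outbound traffic across both uplinks, which matches the behavior described in the comment above.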
A) MLAG (or equivalent)
B) VXLAN traffic pinned to one uplink
So you could build an L3-only network, assuming you're OK with NSX using only one of the uplinks at any given time.
BTW, I covered the vDS uplink options in great detail in the vSphere 6 Networking Deep Dive webinar:
http://www.ipspace.net/VSphere_6_Networking_Deep_Dive
Ivan, what's your beef with VMware's "notify" feature using RARP? It doesn't seem to me that it matters what they put in those post-vMotion MAC table helpers, does it?
The hypervisor is not in a position to send a gratuitous ARP, because it doesn't necessarily know what IPs are in use by the guest OS. Heck, it's only *barely* in a position to send the RARP, because all it knows is what unicast MACs are loaded in the vNIC interest registers. If the guest OS were something that leverages promiscuous mode (like ESX does), then even the RARP would be impossible!
Anyhow, if someone sends a gratuitous ARP when the VM is moved, the adjacent physical switch could create a host route for the moved IP address (assuming that's what you want to do). With RARP there's nothing you can do.
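To make the difference concrete: a gratuitous ARP carries the full IP-to-MAC binding an adjacent L3 switch would need, while the RARP notify frame carries only the MAC. Here's a minimal Scapy sketch of the gratuitous ARP a guest OS could send after a move; the MAC address, IP address, and interface name are all hypothetical.

```python
# Minimal sketch (Scapy): a gratuitous ARP announcing a hypothetical
# IP-to-MAC binding. Requires root (raw sockets) to actually send.
from scapy.all import ARP, Ether, sendp

vm_mac, vm_ip = "00:50:56:aa:bb:cc", "10.1.1.10"   # hypothetical VM MAC and IP

garp = Ether(src=vm_mac, dst="ff:ff:ff:ff:ff:ff") / ARP(
    op=1,                                   # ARP request
    hwsrc=vm_mac, psrc=vm_ip,               # sender = the VM itself
    hwdst="00:00:00:00:00:00", pdst=vm_ip,  # target IP == sender IP => gratuitous
)

sendp(garp, iface="eth0")                   # interface name is an assumption
```

A switch that sees this frame learns both the MAC and the IP; from a RARP it learns only the MAC.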
Is your beef with the RARP that it doesn't help in the scenario covered in the show, because it doesn't populate the ARP table?
I'm thinking that this whole scheme only works because of a curious Linux optimization: A Linux host (or switch!) answering an ARP query *adds* an ARP entry for the requestor at the same time as answering the query. See here (lines 728-733):
http://lxr.free-electrons.com/source/net/ipv4/arp.c
When the host speaks, it ARPs for the gateway; thanks to the optimization, that query populates the switch's ARP table, and the resulting entry can then be redistributed.
If, on the other hand, the host never speaks (say, it's a server), then nobody can *ever* talk to it, because there's no route to it anywhere in the L3 domain.
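For the curious, here's roughly what "redistribute the ARP table" means in practice: a minimal sketch (an illustration of the idea, not the actual Cumulus implementation) that turns the switch's IPv4 neighbor entries on server-facing ports into /32 host routes a routing daemon could advertise. The port names are assumptions.

```python
# Minimal sketch of the redistribute-neighbor idea: derive /32 host routes
# from the IPv4 neighbor (ARP) table on server-facing ports. Illustration
# only, not the actual Cumulus daemon.
import subprocess

SERVER_FACING_PORTS = {"swp1", "swp2"}      # assumption: ports facing servers

def neighbor_host_routes():
    out = subprocess.run(["ip", "-4", "neigh", "show"],
                         capture_output=True, text=True, check=True).stdout
    routes = []
    for line in out.splitlines():
        # typical line: "10.1.1.10 dev swp1 lladdr 00:11:22:33:44:55 REACHABLE"
        fields = line.split()
        if "lladdr" not in fields:          # skip FAILED/INCOMPLETE entries
            continue
        ip_addr = fields[0]
        dev = fields[fields.index("dev") + 1]
        if dev in SERVER_FACING_PORTS:
            routes.append((ip_addr + "/32", dev))
    return routes

for prefix, dev in neighbor_host_routes():
    print("host route", prefix, "via", dev)  # hand these to the routing protocol
```

The sketch also makes the silent-host problem obvious: no ARP entry means no /32 route, and therefore no reachability.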
Firstly, thanks for an awesome episode and for getting right down into the nitty-gritty technical detail. This was the most fun I've had with a technical podcast for quite a while. :-)
Now to some questions:
1. Given that host-level modifications are required anyway, wouldn't it be a simpler and more easily accessible solution to run OSPF/BGP + BFD on the host? (I've pinged Dinesh about this on Twitter, but would be happy to hear from anyone who has the Cumulus BFD implementation working on a server-side distro.)
2. Ivan, what was your issue with the hardware offload's use of netlink? You seemed to imply that it's a bit hacky, but netlink is just an API used to talk between the kernel and userspace networking.
3. Could you solve the multicast issue by using dual link-local multicast networks alongside the globally addressable IP on the same physical links?
Thanks in advance,
Paul
#1 - For whatever weird reason, neither server admins nor network admins fancy running a routing protocol between a server and a ToR switch. That was SOP 20 years ago (every IBM mainframe connected to an IP network was running RIP or OSPF in those days), but times have changed.
#2 - If I understood the architecture correctly, a userspace process (Quagga, for example) tells the Linux kernel to modify the forwarding table, and the Cumulus daemon listens in on that conversation and implements the same change in hardware (or not).
I would prefer a more explicit architecture (which I understand is not available at the moment) where the Cumulus daemon would have a say in the process (like: I can't do that, please roll back the change).
Or maybe I got it all wrong...
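For anyone wondering what "listening to that conversation" looks like, here's a minimal sketch of a process subscribing to the kernel's rtnetlink route notifications. It illustrates the mechanism only; it's not how the Cumulus daemon is actually implemented, and it merely prints a line per notification.

```python
# Minimal sketch: observe IPv4 route changes in the kernel FIB via an
# rtnetlink multicast group, the same notification channel a
# hardware-programming daemon could listen to. Linux only.
import socket
import sys

RTMGRP_IPV4_ROUTE = 0x40        # rtnetlink multicast group for IPv4 routes
RTM_NEWROUTE, RTM_DELROUTE = 24, 25

s = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE)
s.bind((0, RTMGRP_IPV4_ROUTE))  # (pid, groups); pid 0 lets the kernel pick one

while True:
    data = s.recv(65535)
    # nlmsghdr: u32 length, u16 type, u16 flags, u32 seq, u32 pid (host byte order).
    # Simplification: only the first message in each datagram is inspected.
    msg_type = int.from_bytes(data[4:6], sys.byteorder)
    if msg_type == RTM_NEWROUTE:
        print("route added or changed in the kernel FIB")
    elif msg_type == RTM_DELROUTE:
        print("route deleted from the kernel FIB")
```

Run it in one terminal and add or delete a route in another (for example, ip route add 192.0.2.0/24 dev lo) to watch the notifications arrive.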
#3 - You'd have to ask someone who actually knows something about multicast ;)