Layer-3-Only Data Center Networks with Cumulus Linux on Software Gone Wild

With the advent of layer-3 leaf-and-spine data center fabrics, it became (almost) possible to build pure layer-3-only data center networks… if only the networking vendors would do the very last step and make every server-to-ToR interface a layer-3 interface. Cumulus decided to do just that.

Episode 38 of Software Gone Wild with Dinesh Dutt from Cumulus is a deep dive into that functionality, covering everything from the concepts to load balancing on Linux and vSphere, routing requirements of Linux hosts, IP multicast issues, history lessons learned in the times of Mobile ARP, uRPF challenges of Cumulus Linux, and even hardware offloads being planned for the Linux kernel.

A few more blog posts you want to read to understand what we were talking about include:

If you’re still lost after listening to the podcast and reading these blog posts (yes, it’s a heavy topic), please write a comment (or send me an email), and don’t forget to include the approximate time in the podcast where we lost you.

Finally, just in case you don’t know who Tony Li is – here’s a bit longer profile.



  1. Hi Ivan, firstly, thanks a lot for the very informative articles. I have benefited immensely from you sharing your knowledge. Secondly, a different request. Would it be possible to provide a link to a *printable* form of the post? Say, a PDF version or something?
    1. What's wrong with print-as-PDF from the browser?
  2. If the same IP address is assigned to both interfaces, then how does inbound traffic to the host work? The ToR will ARP for the host IP address and one of the interfaces wins? So inbound traffic is limited to one interface and outbound uses both interfaces?
    1. If I understand it right, each host interface is in a separate L2 domain because every connection between a ToR and a host interface is a different L2 domain (this is the point of an L3-only network, right?).

      Therefore ToR 1 only has the MAC address of host interface 1 (the one connected to ToR 1) in its ARP cache, and ToR 2 only has the MAC address of host interface 2 (the one connected to ToR 2). Inbound traffic that gets to ToR 1 (for that particular host) is always forwarded to host interface 1, and traffic that gets to ToR 2 is always forwarded to host interface 2. Traffic is actually load balanced between both ToRs with ECMP on the spine switches (when the ToRs are leafs) or on the leaf switches (when we have a 3-tier Clos: spine, leaf, ToR), which means that it is also load balanced between both host interfaces.

      Outbound traffic is of course load balanced on the host itself between both interfaces with ECMP.

      Correct me if I am wrong :)
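The inbound path described in this reply can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `tor1`, `tor2`, `host-eth0`, `host-eth1`, the addresses, and the SHA-256-based flow hash are all hypothetical, and real switches use their own hardware hash functions.

```python
import hashlib
from collections import Counter

def ecmp_pick(flow, next_hops):
    """Pick one next hop per flow by hashing its 5-tuple (illustrative hash only)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]

# Spine view: the host address is reachable through both ToR switches.
SPINE_NEXT_HOPS = ["tor1", "tor2"]

# Each ToR has an ARP entry only for the host interface on its own link.
TOR_TO_HOST_IF = {"tor1": "host-eth0", "tor2": "host-eth1"}

def inbound_host_interface(flow):
    """Return the host interface on which a given inbound flow arrives."""
    tor = ecmp_pick(flow, SPINE_NEXT_HOPS)
    return TOR_TO_HOST_IF[tor]

# 1000 hypothetical TCP flows (src IP, dst IP, protocol, src port, dst port).
flows = [("10.0.0.%d" % (i % 250 + 1), "192.0.2.10", 6, 1024 + i, 443)
         for i in range(1000)]
usage = Counter(inbound_host_interface(f) for f in flows)
print(usage)
```

Every packet of a given flow hashes to the same ToR and therefore arrives on the same host interface, while different flows are spread across both links.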
    2. @Jan: You absolutely got it. Thanks for answering this one!
  3. Is it possible to use this in a vSphere+NSX scenario? I think it would make sense to have a pure L3 fabric with NSX, but I am not familiar with all the details of NSX. If I understand correctly, you still need L2 within a rack and MLAG on the ToRs if you want dual-homed hosts, because VMware does not allow you to configure unnumbered interfaces with a default route through both of them the way you can on Linux. Or is there a way to do it?
    1. It depends. NSX for vSphere rides on top of vDS, and so you have two options:

      A) MLAG (or equivalent)
      B) VXLAN traffic pinned to one uplink

      So you could build an L3-only network assuming you're OK with NSX using only one of the uplinks at any time.

      BTW, I covered the vDS uplink options in great detail in the VMware Deep Dive webinar:
  4. I wonder how they handle failure modes where link stays up, but traffic doesn't pass in one direction. Without regular protocol keepalives, it seems like a BFD-like feature is appropriate here. Of course, "mode=1" bonding has the same problem, so maybe it's not a priority.

    Ivan, what's your beef with VMware's "notify" feature using RARP? It doesn't seem to me that it matters what they put in those post-vMotion MAC table helpers, does it?

    The hypervisor is not in a position to send a gratuitous ARP, because it doesn't necessarily know what IPs are in use by the OS. Heck, it's only *barely* in a position to send the RARP, because it knows what unicast MACs are loaded in the vNIC interest registers. If the guest OS were something that leverages promiscuous mode (like ESX does), then even the RARP would be impossible!
    1. Hyper-V and KVM (supposedly?) send ARPs. The only thing they have to do is to sniff packets sent from the VM to find its IP address. Obviously that doesn't work quite so well for some network services VMs.

      Anyhow, if someone sends a gratuitous ARP when the VM is moved, the adjacent physical switch could create a host route for the moved IP address (assuming that's what you want to do). With RARP there's nothing you can do.
    2. I think it's naive to believe that a hypervisor can *really know* what's going on IP-wise inside a VM it's hosting, and pretending that it can gets dangerous. What if the VM has lots of IP addresses?

      Is your beef with the RARP that it doesn't help in the scenario covered in the show, because it doesn't populate the ARP table?

      I'm thinking that this whole scheme only works because of a curious Linux optimization: A Linux host (or switch!) answering an ARP query *adds* an ARP entry for the requestor at the same time as answering the query. See here (lines 728-733):

      When the host speaks, it queries for the gateway, which populates the switch ARP table because of the optimization, and can then be redistributed.

      If, on the other hand, the host never speaks (say, it's a server), then nobody can *ever* talk to it, because there's no route to it anywhere in the L3 domain.
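The behavior this comment describes can be modeled with a toy Python class. This is a sketch of the commenter's description, not of the actual kernel code: the `Switch` class, its method names, the port names, and the addresses are all hypothetical.

```python
class Switch:
    """Toy ToR switch that learns ARP entries while answering ARP requests."""

    def __init__(self, gateway_ip):
        self.gateway_ip = gateway_ip
        self.arp_table = {}  # sender IP -> (sender MAC, ingress port)

    def receive_arp_request(self, sender_ip, sender_mac, target_ip, port):
        """Answer requests for the gateway address and learn the requester."""
        if target_ip != self.gateway_ip:
            return None
        # The optimization described above: add an entry for the requester
        # at the same time as answering the query.
        self.arp_table[sender_ip] = (sender_mac, port)
        return ("arp-reply", self.gateway_ip)

    def host_routes(self):
        """Host routes that could be redistributed into the routing protocol."""
        return {ip + "/32": port for ip, (_mac, port) in self.arp_table.items()}


tor = Switch(gateway_ip="169.254.0.1")

# A silent host never sends an ARP query, so the switch knows nothing about it.
print(tor.host_routes())

# A host that speaks first resolves its gateway and becomes reachable.
tor.receive_arp_request("192.0.2.10", "52:54:00:aa:bb:cc", "169.254.0.1", "swp1")
print(tor.host_routes())
```

In this model a host route exists only after the host has sent at least one ARP query for its gateway, which is exactly the concern about servers that never speak first.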
  5. Hi Ivan, Bob, & Dinesh,

    Firstly, thanks for an awesome episode and for getting right down into the nitty gritty technical detail. This was the most fun I've had with a technical podcast for quite a while. :-)

    Now to some questions:

    1. Given that there are host-level modifications required, wouldn't it be a simpler and more easily-accessible solution to run OSPF/BGP + BFD on the host? (I've pinged Dinesh about this on Twitter, but would be happy to hear from anyone who has the Cumulus BFD implementation working on a server-side distro.)

    2. Ivan, what was your issue with the hardware offload and its use of netlink? You seemed to imply that it's a bit hacky, but netlink is just an API used for communication between kernel and userspace networking.

    3. Could you solve the multicast issue by using dual link-local multicast networks alongside the globally-addressable IP on the same physical links?

    Thanks in advance,
    1. Hi Paul,

      #1 - For whatever weird reason neither server admins nor network admins fancy running a routing protocol between a server and a ToR switch. That was SOP 20 years ago (every IBM mainframe connected to an IP network was running RIP or OSPF in those days), but times have changed.

      #2 - If I understood the architecture correctly, a userspace process (Quagga, for example) tells the Linux kernel to modify (for example) a forwarding table, and the Cumulus daemon listens to that conversation and implements the same change in hardware (or not).

      I would prefer a more explicit architecture (which I understand is not available at the moment) where the Cumulus daemon would have a say in the process (like: I can't do that, please rollback the change).

      Or maybe I got it all wrong...

      #3 - you'd have to ask someone that actually knows something about Multicast ;)
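The difference between the two architectures discussed in #2 can be shown with a small Python model. Everything here is hypothetical (the `Kernel` and `HardwareDaemon` classes, the capacity limit, the function names); it does not use the real netlink API and only illustrates a passive listener versus a daemon that can refuse a change.

```python
class Kernel:
    """Toy kernel forwarding table that notifies listeners after each change."""

    def __init__(self):
        self.routes = {}
        self.listeners = []

    def subscribe(self, callback):
        self.listeners.append(callback)

    def add_route(self, prefix, next_hop):
        self.routes[prefix] = next_hop        # the change is already committed
        for callback in self.listeners:       # listeners only hear about it
            callback(prefix, next_hop)


class HardwareDaemon:
    """Toy daemon mirroring routes into a hardware table of limited size."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.hw_table = {}
        self.failed = []

    def try_program(self, prefix, next_hop):
        if prefix not in self.hw_table and len(self.hw_table) >= self.capacity:
            return False
        self.hw_table[prefix] = next_hop
        return True

    def on_route_change(self, prefix, next_hop):
        # Passive model: the daemon cannot veto, it can only record the failure.
        if not self.try_program(prefix, next_hop):
            self.failed.append(prefix)


def add_route_explicit(kernel, daemon, prefix, next_hop):
    """Hypothetical explicit model: commit to the kernel only if hardware accepts."""
    if not daemon.try_program(prefix, next_hop):
        return False                          # the caller learns about the refusal
    kernel.add_route(prefix, next_hop)
    return True


# Passive listener: kernel and hardware silently diverge when hardware is full.
kernel, daemon = Kernel(), HardwareDaemon(capacity=1)
kernel.subscribe(daemon.on_route_change)
kernel.add_route("10.1.0.0/16", "swp1")
kernel.add_route("10.2.0.0/16", "swp2")
print(len(kernel.routes), len(daemon.hw_table), daemon.failed)
```

In the passive model the routing process never finds out that the second route did not reach the hardware, while `add_route_explicit` would return `False` and leave the kernel table unchanged.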