Do We Need Complex Data Center Switches for VMware NSX Underlay

Got this question from one of my subscribers:

Do we really need those intelligent datacenter switches for underlay now that we have NSX in our datacenter? Now that we have taken a lot of the intelligence out of our underlying network, what must the underlying network really provide?

If you read the marketing white papers, the answer would be "IP connectivity"… but keep in mind that building your infrastructure based on information from vendor white papers usually gives you the results your gullibility deserves.

The Basics

If you’re building a new VMware NSX-based infrastructure, you’d usually go for a leaf-and-spine fabric and connect hypervisor hosts (acting as compute, network, or management nodes) to the leaf switches.

VMware NSX deployments are limited to ~1000 hypervisor hosts (not sure I would push the limits, but that’s a different story), or ~20 leaf switches - a perfect fit for a simple leaf-and-spine fabric.
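
A rough back-of-the-envelope check of those numbers, as a sketch only: the 48 host-facing ports per leaf switch and the function name are illustrative assumptions, not NSX limits.

```python
import math


def leaf_switches_needed(hosts: int, ports_per_leaf: int, uplinks_per_host: int = 1) -> int:
    """Return how many leaf switches are needed to terminate all host uplinks."""
    total_host_ports = hosts * uplinks_per_host
    return math.ceil(total_host_ports / ports_per_leaf)


# Assumption: 48 host-facing ports per leaf switch (illustrative only).
single_homed = leaf_switches_needed(1000, 48)
dual_homed = leaf_switches_needed(1000, 48, uplinks_per_host=2)
print(single_homed, dual_homed)
```

Under these assumptions the single-homed case lands close to the ~20 leaf switches mentioned above, while connecting every host to two switches roughly doubles the number of ports needed.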

VMware NSX-V and NSX-T are completely different products, but their scalability limits and underlay connectivity requirements are almost the same. For more details, watch the VMware NSX Technical Deep Dive webinar.

Reading the white papers claiming that you don’t need more than IP connectivity, you might go for a very simple design:

  • Layer-3-only fabric running a single routing protocol (BGP if you want to be hip, even though OSPF or IS-IS would do just fine).
  • Single IP subnet per leaf switch, resulting in an extremely simple and robust network.
You’ll find more details in the Leaf-and-Spine Fabric Architectures webinar.
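
To show how little addressing such a design needs, here is a minimal sketch that carves one subnet per leaf switch out of a single fabric prefix. The 10.0.0.0/16 prefix, the /24 subnet size, and the leaf names are assumptions made up for the example.

```python
import ipaddress


def subnets_per_leaf(fabric_prefix: str, leaf_count: int, new_prefix: int = 24) -> dict:
    """Assign one subnet of the fabric prefix to each leaf switch."""
    fabric = ipaddress.ip_network(fabric_prefix)
    plan = {}
    # zip() stops at the shorter input, so check afterwards that every leaf got a subnet.
    for index, subnet in zip(range(1, leaf_count + 1), fabric.subnets(new_prefix=new_prefix)):
        plan[f"leaf-{index:02d}"] = subnet
    if len(plan) < leaf_count:
        raise ValueError("fabric prefix is too small for the requested number of leaf switches")
    return plan


# Hypothetical addressing plan for a 20-leaf fabric.
plan = subnets_per_leaf("10.0.0.0/16", 20)
for leaf, subnet in plan.items():
    print(leaf, subnet)
```

Each leaf switch then advertises a single prefix into the routing protocol, which is what keeps this design so simple.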

And once you start congratulating yourself on coming up with such a lovely design, the ugly reality intervenes.

What We Really Need

NSX design guides recommend having three or four isolated forwarding domains in your data center fabric, providing complete isolation between management, kernel (vMotion), storage, VXLAN/Geneve, and user traffic. Welcome to the complex world of VLANs, VRFs, or ACLs.

It’s always wise to keep user traffic separate from management or storage traffic… but the separation inevitably results in a more complex fabric design. As always, it’s all about tradeoffs.

With a few hundred hypervisor hosts you can’t afford to lose all hypervisor hosts connected to a leaf switch, so you’d almost always go for a redundant design, connecting each hypervisor host to two ToR switches.

In the ideal world your life would be simple:

  • Each hypervisor host would have a loopback interface, and would send and receive overlay (VXLAN/Geneve) traffic from that IP address;
  • Having a loopback interface as the source of overlay traffic would make the IP addresses on the physical uplinks irrelevant (from the overlay traffic forwarding perspective);
  • Hosts would run a routing protocol with the network and advertise the loopback IP address. Add BFD to the mix and you have a simple, stable, and fast-converging solution that uses nothing more than IP routing.
I know a few OpenStack deployments using this design and they work like a charm (no surprise there).
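
The key property of the ideal design can be sketched in a few lines: the overlay endpoint address never changes, only the set of next hops toward it does. This is a toy model and not a routing protocol implementation; the class name, the leaf names, and the 192.0.2.11/32 loopback address are all hypothetical.

```python
class LoopbackReachability:
    """Toy model: the leaf switches through which a host loopback is reachable."""

    def __init__(self, loopback: str, leaves):
        self.loopback = loopback      # overlay source address, never changes
        self.next_hops = set(leaves)  # leaf switches with a working routing session

    def session_down(self, leaf: str) -> None:
        """A routing or BFD session to one leaf failed: stop using that path."""
        self.next_hops.discard(leaf)

    def session_up(self, leaf: str) -> None:
        """The session to a leaf came back: start using that path again."""
        self.next_hops.add(leaf)

    def is_reachable(self) -> bool:
        return bool(self.next_hops)


host = LoopbackReachability("192.0.2.11/32", ["leaf-01", "leaf-02"])
host.session_down("leaf-01")
print(host.loopback, sorted(host.next_hops), host.is_reachable())
```

Losing an uplink removes one path, but no IP address moves, so the fabric needs nothing beyond ordinary IP routing.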

The ideal world described above rests on two assumptions:

  • The networking and the server teams work together and cooperate on the infrastructure design;
  • The solution you’re using was created by someone who considered the overall complexity of the whole system.

The first assumption might be true in some environments. If you decided to use VMware NSX, the second one unfortunately isn’t.

It Gets Worse and Worse

VMware decided to use the same old bag of tricks from the days of their guerrilla marketing to implement NSX underlay connectivity:

  • VXLAN (or Geneve) traffic is sent from VMkernel interfaces;
  • VMkernel interfaces are tied to port groups which can be associated with multiple physical uplinks to implement redundant connectivity;
  • ESXi hosts can change the active port group uplink at any time, resulting in an IP address move from one physical uplink to another. The best you could hope for (from the networking perspective) is to get a Gratuitous ARP message when the move is made.
I described ESXi port groups and uplinks in the vSphere 6 Networking Deep Dive webinar, and NSX connectivity requirements in the VMware NSX Technical Deep Dive webinar.

End result: if you want to have ESXi hypervisors running NSX redundantly connected to leaf switches, you have to support fast IP address mobility, and there are exactly two ways to do that:

  • Stretch a VLAN across leaf (ToR) switches;
  • Create host routes based on ARP/GARP messages, and redistribute them into a routing protocol… while making sure you don’t get duplicate routes after an IP address move (the details are left as an exercise for the reader).
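
The bookkeeping behind the second option can be sketched as follows. This is a deliberately simplified, centralized model with hypothetical names; in a real fabric the state is distributed across the leaf switches, which is exactly where the duplicate-route problem comes from.

```python
from dataclasses import dataclass, field


@dataclass
class HostRouteTable:
    """Tracks which leaf switch currently owns each host IP (one /32 route per IP)."""

    owner: dict = field(default_factory=dict)

    def on_garp(self, ip: str, leaf: str) -> list:
        """Process an ARP/GARP sighting and return the routing updates to send."""
        previous = self.owner.get(ip)
        if previous == leaf:
            return []  # no move, nothing to advertise
        updates = []
        if previous is not None:
            # Withdraw the stale route first so the IP never has two owners.
            updates.append(("withdraw", f"{ip}/32", previous))
        updates.append(("advertise", f"{ip}/32", leaf))
        self.owner[ip] = leaf
        return updates


table = HostRouteTable()
first = table.on_garp("10.0.1.11", "leaf-01")   # host seen for the first time
moved = table.on_garp("10.0.1.11", "leaf-02")   # ESXi switched the active uplink
repeat = table.on_garp("10.0.1.11", "leaf-02")  # repeated GARP, no change
print(first, moved, repeat)
```

The sketch only works because a single table sees every event in order; getting the same result from independent switches is the part left as an exercise above.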

The currently fashionable way to implement IP address mobility in a data center environment, and to support several VRFs in the data center fabric, is to use VXLAN overlays (to transport Ethernet frames belonging to a single VLAN across the underlay IP fabric) combined with an EVPN control plane (because why not).

You’ll find everything you want to know about EVPN in the EVPN Technical Deep Dive webinar.

Some environments might add traditional MLAG or EVPN-based multihoming to the mix… and we’re back to square one - the data center fabric remains as complex as it ever was, we just added another layer of abstraction and complexity on top of that. Great job!
