VXLAN scalability challenges
VXLAN, one of the first MAC-over-IP (overlay) virtual networking solutions, is definitely a major improvement over traditional VLAN-based virtual networking technologies … but it's not without its own scalability limitations.
VXLAN was first implemented in the Nexus 1000V, which presents itself as a Virtual Distributed Switch (vDS) to VMware vCenter. A single Nexus 1000V instance cannot have more than 64 VEMs (vSphere kernel modules), limiting the Nexus 1000V domain to 64 hosts (or approximately two racks of UCS blade servers).
It’s definitely possible to configure the same VXLAN VNI and IP multicast address on different Nexus 1000V switches (either manually or with vShield Manager), but you cannot vMotion a VM out of the vDS that a Nexus 1000V presents to vCenter.
VXLAN on Nexus 1000V is thus a great technology if you want to implement HA/DRS clusters spread across multiple racks or rows (you can do it without configuring end-to-end bridging), but it falls way short of the “deploy any VM anywhere in the data center” holy grail.
VXLAN is also available in VMware’s vDS ... but it can only be managed through vShield Manager. A vDS can span 500 hosts (making the vMotion domain ~8 times bigger than with the Nexus 1000V), and vShield Manager supposedly configures VXLAN segments across multiple vDS instances (using the same VXLAN VNI and IP multicast address on all of them).
IP multicast scalability issues
VXLAN floods layer-2 frames using IP multicast (Cisco has demonstrated unicast-only VXLAN, but there’s nothing I could touch on their web site yet). You can either manually associate an IP multicast address with a VXLAN segment, or let vShield Manager do it automatically (using IP multicast addresses from a single configurable pool).
The number of IP multicast groups (together with the size of the network) obviously influences the overall VXLAN scalability. Here are a few examples:
One or a few multicast groups for a single Nexus 1000V instance. Acceptable if you don’t need more than 64 hosts. Flooding wouldn’t be too bad (not many people would put more than a few thousand VMs on 64 hosts) and the core network would have a reasonably small number of (S,G) or (*,G) entries (even with source-based trees the number of entries would be linearly proportional to the number of vSphere hosts).
Many virtual segments in a large network with a few multicast groups. This would make VXLAN as “scalable” as vCDNI. Numerous virtual segments (and consequently numerous virtual machines) would map into a single IP multicast address (vShield Manager uses a simple wrap-around IP multicast address allocation mechanism), and vSphere hosts would receive flooded packets for irrelevant segments.
Use a per-VNI multicast group. This approach would result in minimal excessive flooding but generate a large number of (S,G) entries in the network.
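The wrap-around allocation behavior described above is easy to illustrate with a short sketch. Everything here (pool size, addresses, function names) is made up for illustration; it's my reading of how a vShield-Manager-style allocator behaves, not its actual code:

```python
# Sketch of a wrap-around multicast address allocator: segments are
# assigned addresses from a fixed pool in round-robin order (regardless
# of the VNI value), so once there are more segments than addresses,
# several unrelated segments share a group and hosts receive flooded
# traffic for segments they don't care about.
import ipaddress

def wraparound_allocator(pool_start, pool_size):
    base = int(ipaddress.IPv4Address(pool_start))
    next_index = 0
    def allocate(vni):
        nonlocal next_index
        addr = ipaddress.IPv4Address(base + (next_index % pool_size))
        next_index += 1
        return str(addr)
    return allocate

allocate = wraparound_allocator("239.1.1.0", 4)   # tiny pool for illustration
mapping = {vni: allocate(vni) for vni in range(5000, 5010)}
# 10 segments over 4 addresses: every group carries flooded traffic
# for 2-3 unrelated segments (e.g. VNI 5000 and 5004 share a group).
```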
The size of the multicast routing table obviously depends on the number of hosts, the number of VXLAN segments, and the PIM configuration (do you use shared trees, or switch to source trees as soon as possible?) … and keep in mind that the Nexus 7000 doesn’t support more than 32,000 multicast entries and Arista’s 7500 cannot have more than 4,000 multicast routes per linecard.
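A back-of-envelope sketch of that dependency (my own simplification, not a vendor formula): with source-based trees every host sending into a group contributes an (S,G) entry, so the worst case grows roughly as groups times senders per group, while shared trees need only one (*,G) entry per group:

```python
# Rough multicast state estimate (illustrative assumptions, not a
# datasheet formula): source trees create one (S,G) entry per sending
# host per group; shared trees create one (*,G) entry per group.

def multicast_state(hosts, groups, hosts_per_group, source_trees=True):
    if source_trees:
        # each group has at most min(hosts, hosts_per_group) senders
        return groups * min(hosts, hosts_per_group)
    return groups   # one (*,G) entry per group

# Example: 500 hosts, per-VNI groups for 1000 segments, ~50 hosts
# active per segment.
sg = multicast_state(500, 1000, 50)                           # 50,000 (S,G)
star_g = multicast_state(500, 1000, 50, source_trees=False)   # 1,000 (*,G)
# 50,000 (S,G) entries would already blow past the ~32,000-entry
# Nexus 7000 limit mentioned above; shared trees stay tiny.
```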
VXLAN has no flooding reduction/suppression mechanisms, so the rules of thumb from RFC 5556 still apply: a single broadcast domain should have around 1,000 end hosts. In VXLAN terms, that’s around 1,000 VMs per IP multicast address.
However, it might be simpler to take another approach: use shared multicast trees (and hope the amount of flooded traffic is negligible ;), and assign anywhere between 75% and 90% of the (lowest) IP multicast table size on your data center switches to VXLAN. Due to vShield Manager’s wrap-around multicast address allocation policy, the multicast traffic should be well distributed across the whole allocated address range.
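Combining the two rules of thumb above into a quick calculation (my arithmetic, using the table sizes mentioned earlier; function names are made up):

```python
# Size the VXLAN multicast pool from the smallest multicast table in
# the data center, then sanity-check it against the ~1,000-VMs-per-
# group flooding guideline. Illustrative sketch, not a vendor tool.

def vxlan_pool_size(table_sizes, share=0.8):
    """Dedicate a share (75-90%) of the smallest table to VXLAN."""
    return int(min(table_sizes.values()) * share)

def groups_needed(total_vms, vms_per_group=1000):
    """Around 1,000 end hosts per broadcast domain (RFC 5556)."""
    return -(-total_vms // vms_per_group)   # ceiling division

tables = {"Nexus 7000": 32000, "Arista 7500 linecard": 4000}
pool = vxlan_pool_size(tables)   # 80% of 4,000 = 3,200 groups for VXLAN
need = groups_needed(50000)      # 50 groups keep 50,000 VMs under the limit
assert need <= pool              # plenty of headroom in this example
```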
VXLAN is also mentioned in the Introduction to Virtual Networking webinar and described in detail in the VXLAN Technical Deep Dive webinar. You’ll find some VXLAN use cases in the Cloud Computing Networking webinar. All three webinars are available with the yearly subscription… and if you need design help/review or a second opinion, check out the ExpertExpress service.
Many customers limit vMotion to 32 physical hosts, since that is the largest cluster size. It has been validated that you can vMotion between clusters under a given set of conditions, but do [large] customers do this? I wonder myself about the holy grail. What's your take? What are you seeing?
As Cisco continues to increase scale for the 1000V to catch up to VMware, is it needed for *most* customers? Rather, if vMotion is contained to cluster size, would it not be advantageous to maintain 2, 3, or even 4 VSMs on the 1000V to reduce single points of failure for the virtual network? One could argue either way, but what is your take?
v2 of this post - I'm anxiously waiting to see what Cisco did for non-multicast VXLAN ;)
Larger-scale vMotion - while vMotion outside of a DRS cluster is not automatic (you have to trigger it manually), people use it for coarse-grained resource allocation (if a cluster becomes overloaded, it's pretty easy to move a whole app stack somewhere else) or prior to large-scale maintenance activities ... and then there's the long-distance unicorn-riding variant ;)
As for "what most customers need" - 80+% of them are probably fine with a single cluster or two. That's 60 servers; if you buy high-end gear, you could pack few thousand VMs onto them. More than enough in many cases, unless you're going down the full-VDI route.
Multiple NX1KV instances per DC are obviously a good idea, but keep in mind that
A) You cannot vMotion a running VM across them;
B) Configuration changes made in one vDS are not propagated to another one, so you need an automation layer on top (could be of the cut-and-paste variant :D ).
Talking to Cisco a few weeks back, they were actually recommending DCNM and, interestingly enough, a master VSM. Per them: "Create Master VSM with all the needed profiles and network configuration. Use the running config to create exact config across all other Nexus 1000V VSMs. Changes made to master VSM can be replicated to all other VSMs."
Without DCNM, "replicated" might mean manual scripting, i.e. what you call the cut-and-paste variant.
Besides the 60-host Nexus 1000V limitation, the other painful limit is 2048 ports per switch, which means fewer than 34 VMs per host... hardly cloud scale.
vMotion across distributed virtual switches is (currently) not possible, but you also need shared storage, which you most likely will not have across many clusters. That means you can deploy a VM (but not migrate it) anywhere in the data center.
Which I believe is an even more interesting use case for large customers with free resources scattered here and there but very inflexible VLAN assignments to those islands ("oh, that cluster is pretty idle, why don't you deploy your new VM there?", "yeah, but the VLAN I need is only available in this overloaded cluster")-ish.