NVGRE – because one standard just wouldn’t be enough
Two weeks after VXLAN (backed by VMware, Cisco, Citrix and Red Hat) was launched at VMworld, Microsoft, Intel, HP and Dell published the NVGRE draft (Arista and Broadcom are cleverly hedging their bets by backing both), which solves the same problem in a slightly different way.
If you’re still wondering why we need VXLAN and NVGRE, read my VXLAN post (and the one describing how VXLAN, OTV and LISP fit together).
It’s obvious the NVGRE draft was a rushed affair. Its only significant and original contribution is the idea of using the lower 24 bits of the GRE Key field to carry the Tenant Network Identifier (but then, lesser ideas have been patented time and again).
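To put that contribution in perspective, here’s more or less the entire thing, sketched in Python (field layout per my reading of the draft; the constants are standard GRE values, not something NVGRE invented):

```python
import struct

def nvgre_header(tni: int) -> bytes:
    """A minimal sketch of the GRE header as the NVGRE draft describes it:
    K bit set, protocol type 0x6558 (Transparent Ethernet Bridging),
    Tenant Network Identifier in the lower 24 bits of the Key field."""
    flags_version = 0x2000          # K bit: the Key field is present
    protocol_type = 0x6558          # inner payload is an Ethernet frame
    key = tni & 0xFFFFFF            # TNI in the lower 24 bits
    return struct.pack("!HHI", flags_version, protocol_type, key)
```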
Like with VXLAN, most of the real problems are handwaved to other or future drafts. Start with MAC-to-IP mapping: per section 3.1 of the draft, the way to obtain the remote-VM-MAC-to-physical-IP mapping will be covered in a different draft. VXLAN, by contrast, specifies the use of IP multicast to flood within a virtual segment and relies on dynamic MAC learning.
The NVGRE approach is actually more scalable than the VXLAN one because it does not mandate flooding-based MAC address learning. Better yet, NVGRE acknowledges that some virtual L2 networks might not use flooding at all (like Amazon EC2).
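A hypothetical control-plane directory makes the difference obvious: if every endpoint could simply ask “where does this MAC live?”, there would be no need to flood at all. The names and structure below are mine, not the draft’s; populating the table is exactly the part NVGRE defers to a different draft.

```python
# Hypothetical endpoint directory: (tni, vm_mac) -> physical IP address
# of the hypervisor hosting that VM.
directory: dict[tuple[int, str], str] = {
    (0x123456, "00:50:56:01:02:03"): "192.0.2.11",
}

def endpoint_for(tni: int, vm_mac: str) -> str | None:
    """Look up the physical endpoint for a tenant MAC address.
    None means 'unknown': a VXLAN-style network would now flood the
    frame over IP multicast; an EC2-style network would simply drop it."""
    return directory.get((tni, vm_mac))
```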
Next: the mapping between TNI and IP multicast addresses will be specified in a future version of the draft. VXLAN “solves” the same problem by delegating it to the management layer.
Then there’s IP fragmentation caused by oversized NVGRE/VXLAN frames. The NVGRE draft at least acknowledges the problem and indicates that a future version might use Path MTU Discovery to detect the end-to-end path MTU and reduce the intra-virtual-network MTU for IP packets accordingly.
VXLAN ignores the problem and relies on jumbo frames. That might be a reasonable approach if VXLAN stayed within a single data center (keep dreaming: the vendors involved in VXLAN are already peddling long-distance VXLAN snake oil).
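Back-of-the-envelope arithmetic shows why this matters; the overhead figures below are mine, computed from the usual header sizes:

```python
# Encapsulation overhead on top of the tenant IP packet:
# inner Ethernet header (14) + outer IP header (20) + tunnel header(s).
OVERHEAD = {
    "nvgre": 14 + 20 + 8,        # GRE header with Key field -> 42 bytes
    "vxlan": 14 + 20 + 8 + 8,    # UDP (8) + VXLAN (8) headers -> 50 bytes
}

def tenant_ip_mtu(physical_mtu: int, encap: str) -> int:
    """Largest tenant IP packet that fits without fragmentation."""
    return physical_mtu - OVERHEAD[encap]

print(tenant_ip_mtu(1500, "vxlan"))  # 1450: tenant MTU has to shrink
print(tenant_ip_mtu(9000, "nvgre"))  # 8958: jumbo frames hide the issue
```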
ECMP-based load balancing is the only difference between NVGRE and VXLAN worth mentioning. VXLAN uses UDP encapsulation with a pseudo-random UDP source port (computed by hashing parts of the inner MAC frame), resulting in automatic equal-cost load balancing on every device that uses the 5-tuple to load-balance traffic.
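Here’s the trick in a nutshell (a sketch; the draft leaves the exact hash function to the implementation, so the one below is just an illustration):

```python
import zlib

def vxlan_udp_source_port(inner_frame: bytes) -> int:
    """Derive a pseudo-random but flow-stable UDP source port by hashing
    the inner Ethernet addresses (destination + source MAC = first 12
    bytes of the frame). Every packet of a flow gets the same port,
    different flows spread across different ports, and any 5-tuple ECMP
    hash then spreads them across equal-cost paths."""
    h = zlib.crc32(inner_frame[:12])
    return 49152 + (h % 16384)      # stay in the dynamic/ephemeral range
```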
GRE is much harder to load-balance, so the NVGRE draft proposes an interim solution (multiple IP addresses per endpoint, i.e., per hypervisor host) with no details on how individual inter-VM flows get mapped to endpoint IP addresses. The final solution?
“The Key field may provide additional entropy to the switches to exploit path diversity inside the network. One such example could be to use the upper 8 bits of the Key field to add flow based entropy and tag all the packets from a flow with an entropy label.”
OK, might even work. But do the switches support it? Oh, don’t worry ...
“A diverse ecosystem play is expected to emerge as more and more devices become multitenancy aware.”
I know they had to do something different from VXLAN (creating another UDP-based scheme and swapping two header fields would have been a too-obvious me-too attempt), but wishful thinking like this usually belongs in a different type of RFC.
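For the record, the proposed scheme would be trivial to implement on the encapsulating endpoint; whether the switches along the path actually hash on the GRE Key field is a different question entirely. A sketch, assuming the 8-bit-entropy/24-bit-TNI split from the quote above (the flow hash is my hypothetical choice):

```python
import struct
import zlib

def nvgre_header_with_entropy(tni: int, inner_frame: bytes) -> bytes:
    """Same GRE header as before, with a per-flow entropy label in the
    upper 8 bits of the Key field, derived here by hashing the inner
    MAC addresses so all packets of a flow carry the same label."""
    entropy = zlib.crc32(inner_frame[:12]) & 0xFF
    key = (entropy << 24) | (tni & 0xFFFFFF)
    return struct.pack("!HHI", 0x2000, 0x6558, key)
```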
Summary
Two (or more) standards solving a single problem seem to be the industry norm these days, and I’m sick and tired of the obvious me-too/I’m-different/look-who’s-innovating ploys. Making matters worse, both VXLAN and NVGRE are half-baked affairs today.
VXLAN has no control plane; it relies on IP multicast and flooding to solve the MAC address learning problem, which makes it less suitable for very large scale or inter-DC deployments.
NVGRE has the potential to become a truly scalable solution: it acknowledges the need for networks that don’t use flooding and at least mentions the MTU issues. It has a long way to go, though; in its current state it’s worse than VXLAN simply because it’s even more underspecified.