Could MPLS-over-IP replace VXLAN or NVGRE?

A lot of engineers are concerned about what seems to be the frivolous creation of new encapsulation formats supporting virtual networks. While STT makes technical sense (it allows soft switches to use existing NIC TCP offload functionality), it’s harder to see the benefits of VXLAN and NVGRE. Scott Lowe recently wrote a great blog post in which he asked a very valid question: “Couldn’t we use MPLS over GRE or IP?” We could, but we wouldn’t gain anything by doing so.

RFC 4023 specifies two methods of MPLS-in-IP encapsulation: an MPLS label stack on top of IP (using IP protocol number 137) and an MPLS label stack on top of GRE (using the MPLS protocol type in the GRE header). We could use either one, with either the traditional MPLS semantics or the MPLS label misused as a virtual network identifier (VNI). Let’s analyze both options.
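
The two encapsulations can be sketched in a few lines. The constants (IP protocol 137, GRE protocol type 0x8847) come from RFC 4023 and the IANA registries; the helper functions are just an illustration, not a full packet builder:

```python
import struct

MPLS_IN_IP_PROTO = 137     # IANA IP protocol number for MPLS-in-IP (RFC 4023)
GRE_PROTO_MPLS = 0x8847    # GRE protocol type for MPLS unicast

def mpls_lse(label: int, tc: int = 0, bottom: bool = True, ttl: int = 64) -> bytes:
    """Pack one 32-bit MPLS label stack entry:
    20-bit label, 3-bit TC, 1-bit bottom-of-stack, 8-bit TTL."""
    assert 0 <= label < 2 ** 20
    return struct.pack("!I", (label << 12) | (tc << 9) | (int(bottom) << 8) | ttl)

def mpls_over_gre(label_stack: bytes, payload: bytes) -> bytes:
    """Minimal GRE header (no checksum/key/sequence) carrying an MPLS stack."""
    gre = struct.pack("!HH", 0, GRE_PROTO_MPLS)  # flags/version = 0, protocol = MPLS
    return gre + label_stack + payload

# MPLS-over-IP needs no shim header at all: the outer IP header simply
# carries protocol number 137 and the MPLS label stack follows directly.
stack = mpls_lse(0x12345, bottom=True)
pkt = mpls_over_gre(stack, b"inner frame")
```

Either way, the only per-packet virtual-network information available is whatever you put in the label stack.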

Misusing MPLS label as VNI

In theory, one could use MPLS-over-IP or MPLS-over-GRE instead of VXLAN (or NVGRE) and use the first MPLS label as the VNI. While this might work (after all, NVGRE reuses the GRE key as the VNI), it would not gain us anything: the existing equipment would not recognize this “creative” use of MPLS labels, we still wouldn’t have a control plane, and we’d have to rely on IP multicast to emulate virtual network L2 flooding.

The MPLS-label-as-VNI approach would be totally incompatible with existing MPLS stacks and would thus require new software in virtual-to-physical gateways. It would also go against the grain of MPLS: labels should have local significance (whereas a VNI has network-wide significance) and should be assigned independently by individual MPLS nodes (egress PE routers in the MPLS/VPN case).
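
As a side note, the field sizes don’t even line up: a VXLAN/NVGRE VNI is 24 bits wide, while an MPLS label is only 20 bits. A tiny sketch of that mismatch (my illustration, not part of any spec):

```python
# A 24-bit VXLAN/NVGRE-style VNI does not fit into the 20-bit MPLS label
# field, so "label = VNI" already loses 4 bits of the virtual-network space.
MPLS_LABEL_BITS = 20
VXLAN_VNI_BITS = 24

def vni_fits_in_label(vni: int) -> bool:
    """Can this VNI be carried verbatim in a single MPLS label?"""
    return 0 <= vni < 2 ** MPLS_LABEL_BITS

assert vni_fits_in_label(2 ** MPLS_LABEL_BITS - 1)   # largest 20-bit value: OK
assert not vni_fits_in_label(2 ** VXLAN_VNI_BITS - 1)  # largest 24-bit VNI: too big
```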

It’s also questionable whether existing hardware would be able to process MAC-in-MPLS-in-GRE-in-IP packets, which would be the only potential benefit of this approach. I know that some (expensive) linecards in the Catalyst 6500 can process IP-in-MPLS-in-GRE packets (as can some switches from Juniper and HP), but can they process MAC-in-MPLS-in-GRE? Who knows.

Finally, like NVGRE, MPLS-over-GRE or MPLS-over-IP framing with the MPLS label used as the VNI lacks the entropy that could be used for load-balancing purposes; existing switches would not be able to load-balance traffic between two hypervisor hosts unless each hypervisor host used multiple IP addresses.
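
For comparison, VXLAN gets its entropy by hashing the inner flow into the outer UDP source port, something a fixed GRE or MPLS header cannot offer. A sketch of that trick (the CRC32 hash and the port range are illustrative; real implementations pick their own hash):

```python
import zlib

def vxlan_source_port(inner_five_tuple: tuple) -> int:
    """Derive the outer UDP source port from a hash of the inner flow --
    the trick VXLAN uses to give transit switches per-flow entropy.
    CRC32 and the 49152-65535 range are illustrative choices only."""
    h = zlib.crc32(repr(inner_five_tuple).encode())
    return 49152 + (h % 16384)

flow_a = ("10.0.0.1", "10.0.0.2", 6, 51515, 80)
flow_b = ("10.0.0.1", "10.0.0.2", 6, 51516, 80)
# Different inner flows between the SAME two hypervisors usually map to
# different outer source ports, so ECMP/LAG hashing on the outer 5-tuple
# can spread them across links. With MPLS-over-GRE the outer headers
# would be identical for both flows.
```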

Reusing existing MPLS protocol stack

Reusing the MPLS label as a VNI buys us nothing; we’re thus better off using STT or VXLAN (at least equal-cost load balancing works decently well with them). How about using MPLS-over-GRE the way it was intended to be used, as part of the MPLS protocol stack? Here we stumble across several major roadblocks:

  • No hypervisor vendor is willing to stop supporting L2 virtual networks, because they just might be required for “mission-critical” craplications running over Microsoft’s Network Load Balancing, so we can’t use L3 MPLS VPN.
  • There’s no usable Ethernet-over-MPLS standard. VPLS is a kludge (= full mesh of pseudowires), and alternative approaches (draft-raggarwa-mac-vpn and draft-ietf-l2vpn-evpn) are still on the drawing board.
  • MPLS-based VPNs require a decent control plane, including control-plane protocols like BGP, and that would require some real work on hypervisor soft switches. Implementing an ad-hoc solution like VXLAN based on a doing-more-with-less approach (= let’s push the problem into someone else’s lap and require IP multicast in the network core) is cheaper and faster.
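
The difference between the two control-plane models in the last bullet can be sketched with a toy lookup table (purely illustrative class names, no real protocol machinery):

```python
class FloodAndLearnVtep:
    """VXLAN-style: unknown MACs are flooded to the virtual network's IP
    multicast group; mappings are learned from returning traffic."""
    def __init__(self):
        self.mac_table = {}

    def lookup(self, mac):
        # A miss triggers unknown-unicast flooding over IP multicast.
        return self.mac_table.get(mac, "flood-to-multicast-group")

    def learn(self, mac, vtep_ip):
        self.mac_table[mac] = vtep_ip


class ControlPlaneVtep:
    """BGP-style: every MAC-to-tunnel-endpoint binding is advertised by a
    control plane before traffic flows, so flooding is never needed."""
    def __init__(self, advertised_routes):
        self.mac_table = dict(advertised_routes)

    def lookup(self, mac):
        # A miss here is a control-plane error, not a reason to flood.
        return self.mac_table[mac]


v = FloodAndLearnVtep()
v.lookup("00:11:22:33:44:55")       # unknown -> would be flooded
v.learn("00:11:22:33:44:55", "192.0.2.10")
v.lookup("00:11:22:33:44:55")       # now resolves to the remote VTEP
```

The second model is what an MPLS/BGP-based solution would force hypervisor vendors to build; the first is what VXLAN ships with today.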

Summary

Using MPLS-over-IP/GRE to implement virtual networks makes marginal sense, does not solve the load balancing problems NVGRE is facing, and requires significant investment in the hypervisor-side control plane if you want to do it right. I don’t expect to see it implemented any time soon (although Nicira could do it pretty quickly should they find a customer who would be willing to pay for it).

16 comments:

  1. I haven't deep-dived into this topic, so I'm probably just throwing something out here, but couldn't MPLS-TP be used to emulate a VNI? Also, mVPN technology is evolving: if mLDP is used, an MP2MP LSP can be set up and no IP multicast is required for L2 flooding. Could PBB-EVPN also solve the multihoming?

  2. Why not just do plain MPLS?

    D

    1. See the "scalability" paragraph in this blog post: http://blog.ioshints.info/2012/03/mplsvpn-in-data-center-maybe-not-in.html

    2. Thank you for redirecting

  3. Hi Dirk,
    Interesting points. I agree mLDP could potentially perform a similar role with an L3 VPN setup. The VMkernel in a host can act as a PE router, with the IP network running LDP. However, it would also require the VMkernel to run BGP. I think even PBB-VPLS would be a viable option; we are running PBB-VPLS in a service provider environment.

  4. The benefit of doing MPLS is obvious: Immediate unification of virtual compute and networking using a standard, mature and fairly well understood protocol.

    Also, you don't have to run LDP or VPNv4 on the hypervisor, a controller could do that.

    1. Totally agree. As I wrote "Nicira could do it pretty quickly should they find a customer who would be willing to pay for it" ... maybe I was not explicit enough ;)

  5. Nice job, Ivan, extracting the benefits and challenges of the RFC 4023 solutions that exist and are being deployed today, which I am a fan of, especially in the WAN for branch back-haul.

    A very good add-on discussion to the post would be looking at LISP and the abstracted control plane it offers. Once multicast is supported, there is no reason why L2-over-IP could not be leveraged. It natively uses IP-in-UDP-in-IP and has a control plane built to scale (analogous to DNS). An interesting topic for sure as LISP evolves.

  6. As far as I understand, VXLAN, NVGRE and any tunneling protocol that uses a global ID in the data plane cannot support PVLAN functionality. I'm told the way to solve this is to use an edge virtual firewall (like iptables or vGW), which can be difficult if the subnet address space isn't carved up into maskable blocks (or ranges) by user groups.

    Furthermore, if edge virtual firewalls are inevitably required to do basic network-level segmentation, then I see no reason why a private cloud with no overlapping address space needs any more than a single all spanning virtual network with edge virtual firewalls to implement all segmentation (network and application level granularity).

    1. Ouch. Good one. Let me mull this one over.

  7. Another avenue to pursue regarding virtualized resources: OTV.
    It's a rather interesting time, watching the progression and maturity of network technology creating new solutions for virtualized data centers.

    http://blog.ine.com/2012/08/17/otv-decoded-a-fancy-gre-tunnel/

    ..."From a high level overview, OTV is basically a layer 2 over layer 3 tunneling protocol. In essence OTV accomplishes the same goal as other L2 tunneling protocols such as L2TPv3, Any Transport over MPLS (AToM), or Virtual Private LAN Services (VPLS). For OTV specifically this goal is to take Ethernet frames from an end station, like a virtual machine, encapsulate them inside IPv4, transport them over the Data Center Interconnect (DCI) network, decapsulate them on the other side, and out pops your original Ethernet frame."

    1. OTV is interconnecting two L2 networks, not providing virtual networks like VXLAN or NVGRE. It does not fit into the same picture.

  8. I prefer to:
    1/ offload the network function from the hypervisor to the physical access switch, as VM-FEX does;
    2/ have the physical access switch support VXLAN;
    3/ decouple the control-plane layer that manages VXLAN.

    It'll handle VM-to-VM, VM-to-physical, and physical-to-physical traffic, work around the VLAN limitations of the whole data center, and decrease the number of VXLAN switches.

    Cooper Wu/http://www.linkedin.com/pub/cooper-wu/4b/79a/bb

    1. While this makes sense from a network architecture perspective, it tightly couples the hypervisor and ToR switches and makes deployment/orchestration far more complex, so I don't expect to see this architecture widely deployed.

    3. Thank you, Ivan.

      From a cloud perspective you'll disagree with my opinion, I believe: you prefer putting all services within the virtualization framework, with nothing to do with the physical network infrastructure.

      Network admins will go mad, since all they can see are tunneled packets; they are of no help when there is a problem.

      It's true that at the current stage there are limitations, and it's complex to orchestrate and tightly couple the hypervisor with the ToR. :)

      If combined with OpenFlow, the hypervisor tightly couples with control-plane servers/clusters, not the ToR switch; that would be more reasonable.



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.