MPLS/VPN in the Data Center? Maybe not in the hypervisors

A while ago I wrote that the hypervisor vendors should consider turning the virtual switches into PE-routers. We all know that’s never going to happen due to religious objections from everyone who thinks VLANs are the greatest thing ever invented and MP-BGP is pure evil, but there are at least two good technical reasons why putting MPLS/VPN (as we know it today) in the hypervisors might not be the best idea in very large data centers.

Please remember that we’re talking about huge data centers. If you have a few hundred physical servers, bridging (be it with VLANs or vCDNI) will work just fine.

This blog post was triggered by an interesting discussion I had with Igor Gashinsky during the OpenFlow Symposium. Igor, thank you for your insight!

Brief recap of the initial idea (you should also read the original blog post): hypervisor hosts should become PE-routers and use MP-BGP to propagate IP- or MAC-address reachability information. Hypervisor hosts implementing L3 forwarding could use RFC 4364 (with host routes for VMs); L2 hypervisor switches could use BGP MPLS Based MAC VPN.

And here are the challenges:

Scalability. MPLS/VPN requires Label Switched Paths (LSPs) between PE-routers. These paths could be signaled with LDP, in which case host routes to all PE-routers must be propagated throughout the network, or with MPLS-TE, in which case you have a full mesh (N-squared) of tunnels and way too much state in the network core.

MPLS/VPN could also use IP or GRE+IP transport as defined in RFC 4023, in which case the scalability argument is gone.
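To get a feeling for the numbers behind the LDP and MPLS-TE options, here is a back-of-the-envelope sketch. It assumes one PE-router per hypervisor host and unidirectional TE tunnels; the function names and the host counts are made up for illustration.

```python
def ldp_host_routes(pe_count: int) -> int:
    """With LDP, one host route per PE-router has to be known throughout the network."""
    return pe_count


def te_tunnel_count(pe_count: int) -> int:
    """With MPLS-TE, every PE-router needs a unidirectional tunnel to every other PE-router."""
    return pe_count * (pe_count - 1)


# Hypothetical data center sizes (assumption: one PE-router per hypervisor host)
for hosts in (100, 1_000, 10_000):
    print(f"{hosts:>6} hosts: {ldp_host_routes(hosts):>6} host routes, "
          f"{te_tunnel_count(hosts):>11,} TE tunnels")
```

The host-route count grows linearly with the number of hypervisor hosts, while the tunnel count grows with its square, which is where the "way too much state" comes from.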


MPLS/VPN requires flat address space, IP offers self-similar aggregation capabilities (source: Wikipedia)

Eventual consistency of BGP. BGP was designed to carry humongous amounts of routing information (the Internet IPv4 routing table has more than 400000 routes), but it’s not the fastest-converging beast on this planet, and it has no transactional consistency. That might be fine if you’re starting and shutting down VMs (the amount of change is limited, and eventual consistency doesn’t matter for a VM going through the OS boot process), but not if you’re moving thousands of them in order to evacuate racks scheduled for maintenance.
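The difference between the two models can be illustrated with a toy simulation. This is only a sketch: the `Hypervisor` class, the two `move_vm_*` functions and the `reached` parameter are hypothetical, and the returned acknowledgment merely stands in for whatever barrier mechanism a real controller would use. In the transactional case the move is reported as complete only after every hypervisor has installed the new VM location; in the eventually-consistent case the move goes ahead while some hypervisors still point to the old one.

```python
from dataclasses import dataclass, field


@dataclass
class Hypervisor:
    name: str
    vm_location: dict = field(default_factory=dict)  # VM name -> hosting hypervisor

    def install(self, vm: str, host: str) -> bool:
        """Install the new VM location and acknowledge it (stand-in for a barrier reply)."""
        self.vm_location[vm] = host
        return True


def move_vm_eventual(vm, new_host, hypervisors, reached):
    """Only the first `reached` hypervisors have processed the update so far.

    Returns the names of hypervisors still sending traffic to the old location.
    """
    for hv in hypervisors[:reached]:
        hv.install(vm, new_host)
    return [hv.name for hv in hypervisors if hv.vm_location.get(vm) != new_host]


def move_vm_transactional(vm, new_host, hypervisors):
    """Report the move as complete only after every hypervisor has acknowledged the update."""
    acks = [hv.install(vm, new_host) for hv in hypervisors]
    return all(acks)


hosts = [Hypervisor(f"hv{i}", {"vm1": "hv0"}) for i in range(4)]

stale = move_vm_eventual("vm1", "hv3", hosts, reached=2)
print("stale after partial update:", stale)

done = move_vm_transactional("vm1", "hv3", hosts)
print("move complete:", done)
```

With thousands of VMs moving at once, the list of stale hypervisors in the first case is exactly the traffic that gets sent to the wrong place.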

Summary: MPLS/VPN was designed for an environment with a large number of routes and a limited amount of routing information churn. Large-scale data centers offering “TCP clouds” (because some customers think that might result in high availability) just might be too dynamic for that.

Do we still need MPLS/VPN in the Data Center?

Sure we do, but not in the hypervisors. In many cases, we have to provide path isolation to the applications that don’t actually need L2 connectivity because they were written by people who understood how IP works (example: you might want to keep MySQL database servers strictly isolated from web servers).

Read the excellent blog posts written by Derick Winkworth to see how far you can push the MPLS/VPN or VRF designs.

MPLS/VPN is a perfect solution for that problem (Easy Virtual Networking might also work), but many engineers still use VLANs (even though L2 connectivity is not required) and risk the stability of their network because they’re not familiar with MPLS/VPN or because the gear they use doesn’t support it.

The usual final paragraph

If you’re a longtime reader, you know what’s coming next: one of my webinars describes typical enterprise MPLS/VPN use cases, another one data center technologies, and a third one the concepts used in virtualized networks (and it makes sense to get access to them all with the yearly subscription).

Finally, if you need help in your MPLS/VPN designs, do consider the ExpertExpress – I would love to be faced with a good MPLS/VPN design challenge or two.


… and if you're interested in self-referential topics, this book is a must-read!

13 comments:

  1. "...L2 hypervisor switches could use BGP MPLS Based MAC VPN."
    you're almost there :)
    but this one is better "draft-sajassi-l2vpn-pbb-evpn"

  2. What if you used a trunk to the TOR switch and applied a VPLS xconnect to a sub-interface? You'd be limited to 4095 guests per server, but I think there are some other limiting factors in that case. That way you could put different servers in VPLS groups based on VLAN assignment by the vSwitch, but the TOR switch applies the labels and creates the LSPs for the entire rack.

    The only way BGP convergence would hurt you is if you're using a true MPLS VPN where routes are actually redistributed into BGP for each VRF. If you're using VPLS then LDP will just distribute the corresponding labels to their destination and BGP isn't required.

    Just a thought. Not sure how well it would scale.

  3. Ivan Pepelnjak, 20 March 2012 08:57

    At least two problems:

    * If you needed more than 4K virtual segments, you would have overlapping VLAN address spaces, which would be a nightmare to manage;
    * Automatic provisioning of such a solution doesn't exist. It would require tight coupling between hypervisors and ToR switches and although there are solutions along those lines, none of them is easily adaptable to new topologies.

    On the other hand, MPLS scaling would be an order of magnitude easier to achieve (as you need an LSP per ToR switch, not per hypervisor host), but you'd still be without a control plane and rely on flooding to figure out where the VMs are.

  4. What is different about SDN controllers that would suggest they will be better at handling the high churn or a significant disruptive event in large-scale data centers than a good BGP/MPLS implementation (using RRs, RT-constrain, etc)?

  5. Ivan Pepelnjak, 01 April 2012 03:22

    They can ensure transactional consistency (should one so desire) whereas BGP has eventual consistency (unless the number of updates is too high, see also: Internet).

    A disruptive event shouldn't really impact the hypervisors if they use MAC-over-IP encapsulation. Even if you lose tons of servers in one go, you won't restart those VMs on another server in a second (if at all).

  6. Transactional consistency in SDN is not what I'm understanding from Casado's blog (ex: http://networkheresy.wordpress.com/2011/08/09/what-might-an-sdn-controller-api-look-like-and-should-we-standardize-it).

  7. Agree with "MPLS/VPN could also use IP or GRE+IP transport as defined in RFC 4023, in which case the scalability argument is gone." Furthermore, E-VPN (http://tools.ietf.org/html/draft-ietf-l2vpn-evpn-00) can be used to populate the vswitch tables with MACs (among other things) and enable highly flexible topologies using RT combinations.

  8. with pbb-evpn you are limited to B-MACs (~ vswitch tenant instance) being the leaves of your vpn topologies versus individual machines. hence topologies such as "private vlan" are not possible. also we're back to data plane mac learning across sites with pbb-evpn. the scaling advantages don't come free.

  9. Trying to understand the transactional consistency requirement, and I can only see a need in a security context (make sure no traffic leaks out). But this is only needed if your isolation depends on ACLs as is the case in Amazon. Why is transactional consistency so important in an MPLS model, where the penalty of inconsistency is just some lost packets? Is this such a big deal when you are rebooting servers?

    Replies
    1. You need transactional consistency when you move VMs. You wouldn't want to rely on a best-effort, eventually-consistent model like BGP in that case (particularly if you move a large number of VMs at once).

    2. What about BGP is best effort?

  10. Pardon my ignorance, but what do you mean by "transactional consistency" in this article's context? Could you please explain a little bit?

    Replies
    1. When a VM is moved, every hypervisor participating in that virtual network should be updated before the move is complete, so that no traffic is sent to the VM's old attachment point.

      BGP cannot enforce that, as it has no transactional semantics (or barriers like OpenFlow).



Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.