MPLS/VPN in the Data Center? Maybe not in the hypervisors
A while ago I wrote that the hypervisor vendors should consider turning the virtual switches into PE-routers. We all know that’s never going to happen due to religious objections from everyone who thinks VLANs are the greatest thing ever invented and MP-BGP is pure evil, but there are at least two good technical reasons why putting MPLS/VPN (as we know it today) in the hypervisors might not be the best idea in very large data centers.
Please remember that we’re talking about huge data centers. If you have a few hundred physical servers, bridging (be it with VLANs or vCDNI) will work just fine.
This blog post was triggered by an interesting discussion I had with Igor Gashinsky during the OpenFlow Symposium. Igor, thank you for your insight!
Brief recap of the initial idea (you should also read the original blog post): hypervisor hosts should become PE-routers and use MP-BGP to propagate IP- or MAC-address reachability information. Hypervisor hosts implementing L3 forwarding could use RFC 4364 (with host routes for VMs), while L2 hypervisor switches could use BGP MPLS Based MAC VPN.
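To make the idea a bit more concrete, here's a minimal Python sketch (purely illustrative, not any real BGP implementation; all names and values are mine) of the kind of VPNv4 host route a hypervisor PE-router would originate for each local VM:

```python
from dataclasses import dataclass

# Hypothetical, simplified model of a VPNv4 host route a hypervisor PE-router
# would advertise via MP-BGP for each local VM (field names are illustrative only).
@dataclass
class VpnRoute:
    route_distinguisher: str   # makes overlapping tenant prefixes unique
    prefix: str                # /32 host route pointing at the VM
    route_target: str          # extended community controlling VRF import
    vpn_label: int             # MPLS label identifying the tenant VRF
    next_hop: str              # address of the hypervisor host acting as PE-router

# One advertisement per VM: the size of the BGP table grows with the VM count.
vm_route = VpnRoute(
    route_distinguisher="65000:42",
    prefix="10.1.1.15/32",
    route_target="65000:42",
    vpn_label=21874,
    next_hop="192.0.2.11",
)
print(vm_route)
```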
And here are the challenges:
Scalability. MPLS/VPN requires Label Switched Paths (LSPs) between PE-routers. These paths could be signaled with LDP, in which case host routes to all PE-routers must be propagated throughout the network, or with MPLS-TE, in which case you have a full mesh (N-squared) of tunnels and way too much state in the network core.
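A back-of-the-envelope calculation (the PE-router count is an assumption; anything in the thousands tells the same story) shows why a full mesh of MPLS-TE tunnels doesn't fly when every hypervisor host is a PE-router:

```python
def full_mesh_tunnels(pe_count: int) -> int:
    """Unidirectional MPLS-TE tunnels needed for a full mesh of PE-routers."""
    return pe_count * (pe_count - 1)

# Assumed size of a large data center: 10,000 hypervisor hosts acting as PE-routers.
print(full_mesh_tunnels(10_000))   # 99,990,000 tunnels' worth of state in the core
```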
MPLS/VPN could also use MPLS-over-IP or MPLS-over-GRE transport as defined in RFC 4023, in which case the scalability argument no longer applies.
MPLS/VPN requires a flat address space; IP offers self-similar aggregation capabilities (source: Wikipedia)
Eventual consistency of BGP. BGP was designed to carry humongous amounts of routing information (the Internet IPv4 routing table has more than 400,000 routes), but it’s not the fastest-converging beast on this planet, and it has no transactional consistency. That might be fine if you’re starting and shutting down VMs (the amount of change is limited, and eventual consistency doesn’t matter for a VM going through the OS boot process), but not if you’re moving thousands of them in order to evacuate racks scheduled for maintenance.
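For a sense of scale, here's a rough churn estimate for a rack-evacuation event (every number below is an assumption, not a measurement):

```python
# Rough churn estimate for evacuating racks ahead of maintenance (assumed numbers).
racks_to_evacuate = 20
vms_per_rack = 1_000

# Every moved VM changes its egress PE-router, so its host (or MAC) route has to be
# withdrawn from the old PE and re-advertised from the new one -- two updates per VM,
# each of which must propagate to every other PE-router before traffic flows correctly.
bgp_updates = racks_to_evacuate * vms_per_rack * 2
print(bgp_updates)   # 40,000 updates triggered by a single maintenance event
```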
Summary: MPLS/VPN was designed for an environment with a large number of routes and a limited amount of routing-information churn. Large-scale data centers offering “TCP clouds” (because some customers think that might result in high availability) just might be too dynamic for that.
Do we still need MPLS/VPN in the Data Center?
Sure we do, but not in the hypervisors. In many cases, we have to provide path isolation to the applications that don’t actually need L2 connectivity because they were written by people who understood how IP works (example: you might want to keep MySQL database servers strictly isolated from web servers).
Read the excellent blog posts written by Derick Winkworth to see how far you can push MPLS/VPN or VRF designs.
MPLS/VPN is a perfect solution for that problem (Easy Virtual Networking might also work), but many engineers still use VLANs (even though L2 connectivity is not required) and risk the stability of their network because they’re not familiar with MPLS/VPN or because the gear they use doesn’t support it.
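As a purely conceptual illustration (a toy model in Python, not any vendor API), route-target filtering is what gives you that isolation: a VRF imports only the routes carrying a route target it’s configured to import, so the web and database segments never see each other’s routes unless you deliberately leak them:

```python
# Toy model of route-target based path isolation (illustrative values only).
vrfs = {
    "web": {"export_rt": "65000:10", "import_rts": {"65000:10"}},
    "db":  {"export_rt": "65000:20", "import_rts": {"65000:20"}},
}

def can_reach(src: str, dst: str) -> bool:
    """True if the source VRF imports routes exported by the destination VRF."""
    return vrfs[dst]["export_rt"] in vrfs[src]["import_rts"]

print(can_reach("web", "db"))    # False -- web servers have no route to the MySQL VRF
```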
you're almost there :)
but this one is better "draft-sajassi-l2vpn-pbb-evpn"
The only way BGP convergence would hurt you is if you're using a true MPLS VPN where routes are actually redistributed into BGP for each VRF. If you're using VPLS then LDP will just distribute the corresponding labels to their destination and BGP isn't required.
Just a thought. Not sure how well it would scale.
* If you needed more than 4K virtual segments, you would have overlapping VLAN address spaces, which would be a nightmare to manage;
* Automatic provisioning of such a solution doesn't exist. It would require tight coupling between hypervisors and ToR switches, and although there are solutions along those lines, none of them is easily adaptable to new topologies.
On the other hand, MPLS scaling would be an order of magnitude easier to achieve (as you'd need an LSP per ToR switch, not per hypervisor host), but you'd still be without a control plane and would have to rely on flooding to figure out where the VMs are.
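To put a (hypothetical) number on that: assuming 10,000 servers with 40 servers per rack, moving the PE function from the hypervisors to the ToR switches shrinks the full-mesh LSP state dramatically:

```python
# Assumed data-center size; different numbers only change the size of the gap.
servers = 10_000
servers_per_rack = 40
tor_switches = servers // servers_per_rack      # 250 ToR switches

def full_mesh(n: int) -> int:
    """Unidirectional LSPs needed for a full mesh of n PE-routers."""
    return n * (n - 1)

print(full_mesh(servers))        # 99,990,000 LSPs with PE-routers in the hypervisors
print(full_mesh(tor_switches))   # 62,250 LSPs with PE-routers in the ToR switches
```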
A disruptive event shouldn't really impact the hypervisors if they use MAC-over-IP encapsulation. Even if you lose tons of servers in one go, you won't restart those VMs on another server within a second (if at all).
BGP cannot enforce that, as it has no transactional semantics (or barriers like OpenFlow).