Whenever I write about vCloud Director Networking Infrastructure (vCDNI), be it a rant or a more technical post, I get comments along the lines of “What are the network guys going to do once the infrastructure has been provisioned? With vCDNI there is no need to keep network admins full time.”
Once we have a scalable solution that will be able to stand on its own in a large data center, most smart network admins will be more than happy to get away from provisioning VLANs and focus on other problems. After all, most companies have other networking problems beyond data center switching. As for disappearing work, we've seen the demise of DECnet, IPX, SNA, DLSw and multi-protocol networks (which are coming back with IPv6) without our jobs getting any simpler, so I'm not worried about the jobless network admin. I am worried, however, about the stability of the networks we are building, and that’s the only reason I’m ranting about the emerging flat-earth architectures.
In 2002, the IETF published an interesting RFC, Some Internet Architectural Guidelines and Philosophy (RFC 3439), that should be mandatory reading for anyone claiming to be an architect of solutions that involve networking (you know who you are). In the End-to-End Argument and Simplicity section the RFC clearly states: “In short, the complexity of the Internet belongs at the edges, and the IP layer of the Internet should remain as simple as possible.” We should use the same approach when dealing with virtualized networking: the complexity belongs to the edges (hypervisor switches), with the intervening network providing the minimum set of required services. I don’t care if the networking infrastructure uses layer-2 (MAC) addresses or layer-3 (IP) addresses as long as it scales. Bridging does not scale, as it emulates a logical thick coax cable. Either get rid of most bridging properties (like packet flooding) and implement proper MAC-address-based routing without flooding, or use IP as the transport. I truly don’t care.
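To see why flooding is the scalability killer, here’s a toy learning-bridge model in Python (my own illustration, not any vendor’s code): any frame whose destination MAC hasn’t been learned yet is sent out every port, which is exactly the thick-coax behavior I’m complaining about.

```python
class LearningBridge:
    """Toy model of an 802.1D-style learning bridge (illustration only).

    Shows why bridging emulates a thick coax cable: every broadcast
    and every frame with an unlearned destination MAC is flooded to
    all ports, so it reaches every host in the L2 domain.
    """

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}                  # MAC address -> port

    def receive(self, frame, in_port):
        """Return the list of output ports for a frame."""
        self.mac_table[frame["src"]] = in_port   # learn the source MAC
        if frame["dst"] in self.mac_table:       # known unicast: one port
            return [self.mac_table[frame["dst"]]]
        # Unknown unicast (or broadcast): flood to every port but the
        # ingress one -- this is the part that doesn't scale.
        return [p for p in self.ports if p != in_port]
```

The more hosts you attach, the more of them are hit by every flooded frame, and the bigger the failure domain gets.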
Reading RFC 3439 a bit further, the next section explains Non-Linearity and Network Complexity. To quote the RFC: “In particular, the largest networks exhibit, both in theory and in practice, architecture, design, and engineering non-linearities which are not exhibited at smaller scale.” Allow me to paraphrase this for some vendors out there: “just because it works in your lab does not mean it will work at Amazon or Google scale.”
The current state of affairs is just the opposite of what a reasonable architecture would be: VMware has a barebones layer-2 switch (although it does have a few interesting features) with another non-scalable layer (vCDNI) on top of (or below) it. The networking vendors are inventing all sorts of kludges of increasing complexity to cope with that, from VN-Link/port extenders and EVB/VEPA to large-scale L2 solutions like TRILL, FabricPath, VCS Fabric or 802.1aq, and L2 data center interconnects based on VPLS, OTV or BGP MAC VPN.
I don’t expect the situation to change on its own. VMware knows server virtualization is just a stepping stone and is already investing in PaaS solutions; the networking vendors are more than happy to sell you all the extra proprietary features you need just because VMware never implemented a more scalable solution, increasing their revenues and lock-in. It almost feels like the more “network is in my way” complaints we hear, the happier everyone is: virtualization vendors because the blame is landing somewhere else, the networking industry because these complaints give them a door opener to sell their next-generation magic (this time using a term borrowed from the textile industry).
Imagine for a second that VMware or Citrix actually implemented a virtualized networking solution using IP transport between hypervisor hosts. The need for fancy new boxes supporting TRILL or 802.1aq would be gone; all you would need in your data center would be simple high-speed L2/L3 switches. Clearly not a rosy scenario for the flat-fabric-promoting networking vendors, is it?
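Here’s a back-of-the-napkin sketch of what such a MAC-over-IP edge encapsulation could look like. Everything in it — the mapping table, the port number, the 4-byte tenant header — is purely hypothetical; the point is that the hypervisor looks up the destination MAC in a directory instead of flooding, and the core only ever routes IP packets.

```python
import struct

# Hypothetical control-plane mapping: which hypervisor hosts which VM MAC.
# A real solution would distribute this (directory service, control
# protocol, ...); a static dict is enough for the sketch.
MAC_TO_HYPERVISOR = {
    "02:00:00:00:00:01": "192.0.2.10",
    "02:00:00:00:00:02": "192.0.2.20",
}

ENCAP_PORT = 4789  # UDP port picked arbitrarily for this illustration


def encapsulate(tenant_id, frame_bytes, dst_mac):
    """Wrap a tenant's L2 frame in a payload addressed to the hypervisor
    hosting dst_mac. Returns (payload, (remote_ip, udp_port)).

    The lookup either succeeds or raises -- there is no flooding path,
    which is precisely what makes the transport network simple.
    """
    remote_ip = MAC_TO_HYPERVISOR[dst_mac]        # lookup, never flood
    header = struct.pack("!I", tenant_id)         # 4-byte tenant ID prefix
    return header + frame_bytes, (remote_ip, ENCAP_PORT)
```

The payload would then go out over an ordinary UDP socket; the data center core never sees a tenant MAC address.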
Is there anything you can do? Probably not much, but at least you can try. Sit down with the virtualization engineers, discuss the challenges and figure out the best way to solve problems both teams are facing. Engage the application teams. If you can persuade them to start writing scale-out applications that can use proper load balancing, most of the issues bothering you will disappear on their own: there will be no need for large stretched VLANs and no need for L2 data center interconnects. After all, if you have a scale-out application behind a load balancer, nobody cares if you have to shut down a VM and start it in a new IP subnet.
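To illustrate that last point, here’s a toy load-balancer pool (the class and the addresses are made up for this example): because the members are plain IP addresses, you can drain a VM and bring its replacement up in a completely different subnet, with no stretched VLAN anywhere in sight.

```python
class LoadBalancerPool:
    """Toy round-robin server pool (illustration, not a real LB API)."""

    def __init__(self, members):
        self.members = list(members)   # plain IP addresses
        self._next = 0

    def pick(self):
        """Round-robin selection of the next pool member."""
        member = self.members[self._next % len(self.members)]
        self._next += 1
        return member

    def drain(self, member):
        """Remove a server from rotation (e.g. before shutting its VM down)."""
        self.members.remove(member)

    def add(self, member):
        """Add a server -- any reachable IP works, subnet is irrelevant."""
        self.members.append(member)
```

Shut down 10.1.0.5, start the replacement as 10.2.0.7 in another subnet, and the clients behind the load balancer never notice.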