Building large-scale VLANs to support IaaS services is every data center designer’s nightmare, and the low number of VLANs supported by some data center gear is not helping anyone. However, as Anonymous Coward pointed out in a comment to my Building a Greenfield Data Center post, service providers have been building very large (and somewhat stable) layer-2 transport networks for years. It does seem like someone is trying to reinvent the wheel (and/or sell us more gear).
A few disclaimers and caveats first:
The service providers don’t care about the end-to-end stability of your network. They provide the (hopefully stable) L2 transport you asked for and limit your flooding bandwidth (broadcasts, multicasts and unknown unicasts alike). If you’re smart and connect routers to the L2 transport network, you’ll have a working solution (or not – just ask Greg Ferro about VPLS services). If you bridge your sites across an L2 transport network, you’ll eventually get a total network meltdown. In the data center, we don’t have the luxury of ignoring how well the servers or applications work.
Stable large L2 networks are hard to engineer. I’ve been talking with a great engineer who actually designed and built a large L2 service provider network. It took them quite a while to get all the details right (remember: STP was the only game in town) and make the network rock solid.
Connectivity is the service providers’ core business, so it gets the appropriate focus and design/implementation budget. Networking in a data center is usually considered to be a minor (and always way too expensive) component.
However, regardless of the differences between service provider transport networks and data center networks, what we’re trying to do in every data center that offers IaaS services relying on dumb layer-2 hypervisor switches has been done numerous times in another industry. I know that learning from others never equals the thrill of figuring it all out on your own, and that networking vendors need a continuous stream of reasons to sell you more (and different) boxes ... but maybe we should stop every now and then, look around, figure out whether someone else has already solved the problem that’s bothering us, and benefit from their experience.
Kurt Bales did exactly that a few days ago – trying to solve multi-tenancy issues that exceeded the VLAN limitations of the Nexus 5000, he decided to use service provider gear in his data center network. I know he was cheating – he has a service provider background – but you should read his excellent post (several times). If you agree with his approach, start looking around: explore what the service providers are doing and what the SP networking gear is capable of doing, and start talking to the vendors that were traditionally strong in the L2 service provider market ... or you might decide to wait a few months for L3-encapsulating hypervisor switches (and as soon as Martin Casado is willing to admit what they’re doing I’ll be more than happy to blog about it).
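For a rough sense of why VLAN limits bite in a multi-tenant design, here is a back-of-the-envelope sketch. The only hard fact in it is the 12-bit VLAN ID field in the 802.1Q tag (IDs 0 and 4095 are reserved, leaving 4094). The per-switch limit of 512 active VLANs and the three VLANs per tenant are hypothetical numbers picked for illustration, not figures from any vendor data sheet or from Kurt's post.

```python
# Back-of-the-envelope VLAN budget for a VLAN-per-tenant IaaS design.
# 802.1Q carries the VLAN ID in a 12-bit field; IDs 0 and 4095 are reserved.
VLAN_ID_BITS = 12
USABLE_VLAN_IDS = 2 ** VLAN_ID_BITS - 2  # 4094


def max_tenants(active_vlan_limit: int, vlans_per_tenant: int) -> int:
    """Return how many tenants fit when each tenant gets dedicated VLANs.

    active_vlan_limit is whatever the hardware supports; it can never
    buy you more than the 802.1Q ID space itself.
    """
    if active_vlan_limit < 0 or vlans_per_tenant <= 0:
        raise ValueError("limit must be >= 0 and vlans_per_tenant must be > 0")
    usable = min(active_vlan_limit, USABLE_VLAN_IDS)
    return usable // vlans_per_tenant


# Hypothetical inputs (assumptions, not vendor figures).
HYPOTHETICAL_SWITCH_LIMIT = 512
HYPOTHETICAL_VLANS_PER_TENANT = 3  # e.g. web, app and database segments

if __name__ == "__main__":
    # Limited by the (assumed) hardware: 512 // 3 = 170 tenants.
    print(max_tenants(HYPOTHETICAL_SWITCH_LIMIT, HYPOTHETICAL_VLANS_PER_TENANT))
    # Limited by the 802.1Q ID space alone: 4094 // 3 = 1364 tenants.
    print(max_tenants(USABLE_VLAN_IDS, HYPOTHETICAL_VLANS_PER_TENANT))
```

Even with gear that supports the full ID space, a few VLANs per tenant caps you at a bit over a thousand tenants per layer-2 domain, which is why it pays to look at how service providers scale their L2 transport.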