Stretched VLANs: What Problem Are You Trying to Solve?

One of my subscribers sent me this interesting question:

I am the network administrator of a small data center network that spans 2 buildings. The main building has a pair of L2/L3 10G core switches. The second building has a stack of access switches connected to the main building with 10G uplinks. This secondary datacenter has got some ESX hosts and NAS for remote backup and some VM for development and testing, but all the Internet connection, firewall and server are in the main building.

There is no routing in the secondary building and most of the VLANs are stretched. Do you think I must change that (bringing routing to the secondary datacenter), or keep it simple like it is now?

As always, it depends, this time on what problem you are trying to solve.

There are numerous valid reasons why someone would want to have their data center resources in multiple locations:

  • They ran out of space and had to expand into a different location [1]
  • They want to have a backup location in case something bad happens to the primary location
  • They want to build an active-active architecture [2]

Assuming our subscriber simply needs more space, the current approach works reasonably well as long as there’s enough bandwidth between the two locations. The two locations are effectively a single data center, so we shouldn’t be too concerned about stretched VLANs. The setup might even be good enough to fool an incompetent auditor into ticking off the “redundant location” box. Mission accomplished.

For whatever reason, most organizations want to have a second data center location for resiliency/redundancy purposes, in which case you should start thinking about failure scenarios – you wouldn’t want to be in a situation where a disaster happens, and the IT team discovers their backup plans have no chance of working. The first question to ask is thus: what do you plan to protect against?

Fire, flood, power outage caused by roadworks… the second location will save the day. One of the buildings can burn down, and as long as the other one is not affected, you’re good to go… assuming you have sufficient infrastructure in the second building. Unfortunately, in the specific case mentioned above, all Internet connections and firewalls are in the first building, so the second building makes absolutely no sense from the resiliency perspective.
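The "what survives if one building burns down" question can be turned into a quick sanity check. The following is a minimal sketch, not a real audit tool: the site names, the function labels, and the list of required functions are all assumptions loosely based on the reader's description.

```python
# Hypothetical inventory: site name -> functions available at that site.
# The entries are assumptions loosely based on the reader's description.
INVENTORY = {
    "main": {"internet", "firewall", "compute"},
    "secondary": {"compute", "storage"},
}

# Functions assumed to be needed to keep services reachable (an assumption).
REQUIRED = {"internet", "firewall", "compute"}


def missing_after_loss(failed_site, inventory=INVENTORY, required=REQUIRED):
    """Return the required functions that no surviving site can provide."""
    surviving = set()
    for site, functions in inventory.items():
        if site != failed_site:
            surviving |= set(functions)
    return sorted(set(required) - surviving)


if __name__ == "__main__":
    for site in INVENTORY:
        print(site, "lost ->", missing_after_loss(site) or "nothing missing")
```

With these assumed inputs, losing the main building leaves the survivors without Internet connectivity and firewalls, while losing the secondary building leaves nothing on the required list missing. That is the asymmetry described above.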

However, I’ve seen more network meltdowns than burned buildings [3], and a single VLAN is always a single packet forwarding failure domain. There are no safeguards like TTL checks or routing tables within a VLAN. A single endpoint going crazy can bring down the whole VLAN, and if you stretch a single VLAN across multiple locations, a single problem can bring down both locations. You could use modern technologies like VXLAN (and maybe EVPN) to eliminate some of the sore points, but not all – both locations will still be a single availability zone.
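To make the "single availability zone" point concrete, here is a small sketch that groups sites into zones: two sites end up in the same zone whenever they share a VLAN, directly or through a chain of other stretched VLANs. The VLAN and site names are hypothetical, and the model deliberately ignores everything except VLAN membership.

```python
from collections import defaultdict


def availability_zones(vlan_sites):
    """Group sites that share a VLAN (directly or transitively) into zones.

    vlan_sites maps a VLAN name to the set of sites the VLAN spans.
    Returns a sorted list of zones, each a sorted list of site names.
    """
    parent = {}

    def find(site):
        # Union-find lookup with path halving.
        parent.setdefault(site, site)
        while parent[site] != site:
            parent[site] = parent[parent[site]]
            site = parent[site]
        return site

    for sites in vlan_sites.values():
        ordered = sorted(sites)
        for site in ordered:
            find(site)  # register every site, even on single-site VLANs
        for other in ordered[1:]:
            root_a, root_b = find(ordered[0]), find(other)
            if root_a != root_b:
                parent[root_b] = root_a

    zones = defaultdict(set)
    for site in list(parent):
        zones[find(site)].add(site)
    return sorted(sorted(zone) for zone in zones.values())


if __name__ == "__main__":
    # Hypothetical VLAN layouts, for illustration only.
    stretched = {"vlan10": {"main", "secondary"}, "vlan20": {"secondary"}}
    routed = {"vlan10": {"main"}, "vlan20": {"secondary"}}
    print(availability_zones(stretched))
    print(availability_zones(routed))
```

In this toy model a single stretched VLAN is enough to merge both buildings into one zone; only when every VLAN stays within one building do the two sites show up as separate zones.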

Want to do it right? I described the concepts in the Designing Active-Active and Disaster Recovery Data Centers webinar, and focused on interesting details in numerous blog posts.

  1. I’ve met a customer that had two sites literally across the street, with several large fiber cables (with dozens of strands) connecting them.

  2. Hopefully one that will work outside the PowerPoint sandbox.

  3. That might have something to do with me being a network engineer and not a firefighter, but I’m digressing.
