Large-scale L2 DCI: a true story

Friday, December 16, 2011 14:33 +0100

After a week of oversized articles, I’ll try to keep this one short. This is a true story someone recently shared with me (for obvious reasons I can’t tell you where it happened ... and no, I’m not making it up). Enjoy!

Unfortunately I was forced to implement L2 DCI for a "virtual DC" supporting between 4000 and 5000 physical servers with a 25:1 to 30:1 virtualization ratio (so a huge environment). Needless to say, less than 12 months after bringing it online the whole thing melted down in a broadcast storm, yet my server team peers still can't understand why it's a bad idea :)
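
To put those numbers in perspective, here is a rough back-of-the-envelope sketch. The server counts and virtualization ratios come from the story; the per-VM broadcast rate (`ARP_PER_VM_PER_SEC`) is a purely illustrative assumption, as is the idea that all VMs share one flat broadcast domain.

```python
# Back-of-the-envelope sizing of the "virtual DC" described above.
# Server counts and virtualization ratios come from the story.
# ARP_PER_VM_PER_SEC is an assumed, illustrative value, not a measurement.

SERVERS = (4000, 5000)        # physical servers (low, high)
VMS_PER_SERVER = (25, 30)     # virtualization ratio (low, high)
ARP_PER_VM_PER_SEC = 0.1      # assumption: one broadcast per VM every 10 s


def vm_range(servers, ratio):
    """Return the (low, high) number of VMs in the environment."""
    return servers[0] * ratio[0], servers[1] * ratio[1]


def broadcast_pps(vm_count, per_vm_rate):
    """Broadcast packets per second seen by every host,
    assuming all VMs sit in a single flat L2 broadcast domain."""
    return vm_count * per_vm_rate


low, high = vm_range(SERVERS, VMS_PER_SERVER)
print(f"VMs: {low:,} to {high:,}")
print(f"Background broadcasts: {broadcast_pps(low, ARP_PER_VM_PER_SEC):,.0f} "
      f"to {broadcast_pps(high, ARP_PER_VM_PER_SEC):,.0f} pps")
```

Even with that modest assumed rate, every host in a single stretched broadcast domain would have to process the background chatter of well over a hundred thousand VMs before anything goes wrong.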

Do I have to add that there are usually no good reasons to build an L2 data center interconnect, and that there are plenty of time-tested layer-3 solutions, including local and global load balancing and path separation with MPLS/VPN?

If you have a war story you’d like to share, send it over – I’ll keep all your personal details confidential (no, I’m not trying to turn this web site into NetworkLeaks, but reading about other people’s experiences can’t hurt).

15 comments:

federic0 16 December 2011 15:27

c'moooonnnn! we need at least some more details! is it a single broadcast domain? :) do they do vmotion over the DCI?

Ivan Pepelnjak 16 December 2011 15:29

As far as I understood they had a single broadcast domain =-X

vMotion over DCI? That's usually one of the major requirements, but then nobody uses it once they figure out how slow it is.

David Klebanov 16 December 2011 19:20

Hi Ivan,

Thanks for sharing. I know you don't favor L2 DCI and I would agree with you on some of the points; however, it has also been proven to be an excellent method for DC migration in a hot-cold (active-standby) manner, which eliminates the need for re-IPing and all the associated nightmares that go along with it. Due to VMware's live vMotion characteristics and limitations, doing such migrations in an active-active manner is indeed tough.

I salute you for not disclosing the real customer name; however, I hope they know what they are doing. There are methods to provide L2 DCI while isolating each data center from an STP perspective, and the L3 to go along with it. If they had a broadcast storm that took down the entire environment, it is their specific design flaw rather than a flawed concept :-)

TRILL/FabricPath, OTV and BPDU filtering, along with global DNS LB, host route injection or LISP, can provide a much safer L2 DCI than the uncontrolled port channels or wrongly deployed pseudowires that I suspect this customer might have been using.

My .02

Thanks,
David

nosx 17 December 2011 06:33

Have you looked at the draft from Huawei regarding their proxyarp+mpls l3vpn idea? It solves the broadcast, spanning-tree, etc issues I could think of.

Ivan Pepelnjak 17 December 2011 07:57

Not yet, thanks for the pointer. Any links?

BTW, the way you describe it, it seems to be LAM over MPLS/VPN. I hope I'm wrong :-E

nosx 17 December 2011 20:18

I wish I still had the PDF of that PowerPoint deck from them, it was quite a hack. The gist of it was using proxy ARP for local host reachability, spanning the same subnet across 3+ data centers, while at the same time hacking proxy ARP into BGP to advertise the individual /32s for hosts between the sites to intelligently deliver traffic to the right side (and proxy ARP at site 2 for a host in site 1, etc.)...

Anyway, I'll keep digging to see if I can unearth a copy. The nice part was that it used existing protocols and mechanisms. The downside was that it exploded the routing table with a bunch of worthless /32 entries. Instead of doing MAC routing like other vendors, they wanted to do host-level IP routing.

nosx 17 December 2011 20:23

Here we go, i think this might be some bits of it:
http://www.nanog.org/meetings/nanog52/abstracts.php?pt=MTc2MiZuYW5vZzUy&nm=nanog52

I think I saw it in the context of an IETF presentation, but the high-level concept is spelled out fairly well. I'd love to see a blog post analyzing this so we can argue about that as well ;P

Compared to some of the L2 scaling technologies, a constrained /32 routing strategy with proxyarping might just be the lesser of many evils in same-subnet DCI space.

Abstract:
Virtual Subnet (VS) provides a scalable IP-only L2VPN service by reusing the proven BGP/MPLS IP VPN [RFC4364] and ARP proxy [RFC925][RFC1027] technologies. VS could be used for interconnecting geographically dispersed data center locations at scale. In contrast with the existing VPLS solution, VS alleviates the broadcast storm impact on the network performance to a great extent by splitting the otherwise whole ARP broadcast and unknown unicast flooding domain associated with an IP subnet that has been extended across the MPLS/IP backbone, into multiple isolated parts per data center location, besides, the MAC table capacity demand on CE switches is greatly reduced due to the usage of ARP proxy.

nosx 17 December 2011 20:25

Also, the standards track doc http://tools.ietf.org/html/draft-xu-virtual-subnet-06

Anonymous-DC-Guy 19 December 2011 02:44

David,

I am the person that provided this story to Ivan, so I will attempt to provide some more detail.

Firstly, the technical network staff involved are extremely competent and the "cream of the crop" in the region this story happened in. So no issues there, however they were overruled by a higher authority when advising not to implement L2 DCI (the server guys had done a great sales job about how great it would be to vMotion crap all over the place).

TRILL/OTV/LISP etc. were not available when the network was built, so port channels were built over pseudowires to do all the bundling. STP isolation between DCs would not have helped the situation.

What caused the loop and storm? Misconfiguration in VMware on an "edge" switch port - yes I am aware vSwitch etc has forms of split horizon but the server guys still managed to make a monumental stuff up that brought both DC's crashing down (all switch CPU's went over 99%) through some creative back end bridging.

There was nothing more we could have done to mitigate against this type of problem, especially considering the loop was in an end host that doesn't run STP. Storm control could have contained it better, but as an example look at the latest generation of blades for the HP C7000 - with heavy virtualization you could have theoretically 700+ VM's running on a pair of uplinks - storm control causes its own problems for every host on that interface. Nexus 1000v fixes this to an extent of course by providing per vNIC storm control capabilities.

So with current technologies available, I agree with Ivan that the concept has some huge technical flaws. Yes the incident was caused by misconfiguration, but we are humans and mistakes WILL happen :)

Ivan Pepelnjak 19 December 2011 19:40

David,

Once you add together FabricPath (or TRILL), OTV, BPDU filtering, HSRP filtering and LISP/RHI, the solution becomes "quite" complex and "somewhat" challenging to debug at 2AM on Sunday morning when everything breaks down.

I'm not saying it can't be done, I'm just saying it might not be the best way to deal with the problem at hand. Sometimes you simply have to tell the Apps people to fix their problems.

My €0.0002
Ivan

Ivan Pepelnjak 19 December 2011 19:41

Thank you! Feedback in early January, I'm wrapping up and disappearing in a few days =-X

federic0 21 December 2011 16:10

I would like to strike a blow for re-IPing: it is easier than expected, most often DNS changes make it transparent to clients/users, and it makes your new DC infrastructure very clean and summarizable :-D
It's even a lot easier when the customer takes care of changing the IPs on their servers 8-)

(btw, this is fine as long as you haven't hard-coded IPs into some strange/exotic apps...)

xiaohu 17 January 2012 03:25

I'm the author of that draft. Hoping to see more comments on that draft.

Mike Brown 09 April 2012 06:33

Hi Ivan,

Forgive me for the simple question, but I'm a virtualization dude and I can't seem to find a definitive answer. There're a couple data centers with a point-to-point fiber connection between Nexus 5020s. The distance is about one or two kilometers between data centers. This connection is only used for VMware's Site Recovery Manager (and thus storage replication with NetApp SnapMirror) for planned and unplanned migrations between the sites. The local staff call this connection a "stretched VLAN," but I'm not convinced they're using this term correctly. Only a single VLAN is trunked over this connection for replication traffic.

Reading your material, stretched VLANs include a handful of advanced networking technologies, not a simple P2P, L2 fiber connection between 5020s.

If this setup is not a stretched VLAN, then what is it?

BTW, as a virtualization guy (with a networking foundation), your site is ripe with good info. I recommend ioshints.info and PacketPushers regularly to folks. Thanks for what you do.

All the best,

Mike

Ivan Pepelnjak 09 April 2012 11:06

Hi Mike,

A stretched VLAN could be as simple as what you describe. The technology you use is not so important; the crucial question is "are we bridging or routing?" As long as you're bridging (some people would incorrectly call it "switching") you're vulnerable to broadcast storms and (now this DOES depend on the technology you use) STP topology changes or device bugs.

If you're using the inter-DC link just for storage replication, I don't understand why it has to be a VLAN (most storage replication technologies work over IP, so you could use a routed solution), but it's definitely far better than tons of VLANs stretched across both data centers.

Hope this helps; if not, please use the "Contact me" link to send me an email
Ivan
