Stretched Layer-2 Subnets – The Server Engineer Perspective

A long while ago I got a very interesting e-mail from Dmitriy Samovskiy, the author of VPN-Cubed, in which he politely asked me why networking engineers find stretched layer-2 subnets so important. As we might get lucky and spot a few unicorns merrily dancing over stretched layer-2 rainbows while attending the Networking Tech Field Day, I decided to share the e-mail contents with you (obviously after getting an OK from Dmitriy).

I was wondering if you could spare a couple of minutes to explain something to me. You see, when I was designing VPN-Cubed, I specifically targeted an L3 interconnect. It is my understanding this is how it's been done forever. Be it a case of connecting to a vendor, partner or big customer - didn't people by default use IPsec forever for this? I worked at an [bigco] where we had a link to [big supplier]. It was used occasionally so didn't make sense to waste money on a dedicated circuit. So we did a VPN - regular IPsec, layer 3. Everyone was happy.

That’s exactly what most networking engineers are trying to do every time they get asked about sensible design options before the project is less than a week away from its completion deadline. However, quite often the need for stretched layer-2 segments comes to us in the form of a “mandatory requirement” from apps/server teams.


Unicorn and rainbow fabric - an ideal material to use in your stretched fabric designs

Interestingly, Dmitriy (being a server/application engineer and a programmer) thinks along exactly the same lines as I do:

My thinking was that fooling hosts into thinking that they are on the same eth segment or VLAN when they are milliseconds of latency (+ jitter) apart was pointless - apps written to rely on the fact that hosts are on the same eth segment for speed won't work anyway; apps written without such assumptions usually operate on top of IP, not L2.

Exactly. An SNA circuit between two IBM mainframes might be an exception.

VPN-Cubed does L3 only, doesn't touch or attempt to virtualize L2 at all - but then I see folks focusing on L2 and I don't get it. What's the point of focusing on an L2 segment spread over WAN (in general case) - what kind of an app needs that sort of setup?

Apart from stretched clusters and other mythical beasts, I have yet to find an application where a stretched L2 segment would add value beyond being a design-failure-avoidance kludge.

Is it Ethernet broadcasts for DHCP? Is it the fact that when geo-distributed hosts are under the same administrative domain (my hosts in dc and my hosts in cloud, vs my hosts in dc and my partner's hosts in the cloud) that's driving this?

The only sensible explanation I got so far is the inability to change IP addresses while moving cold virtual machines from one data center to another (example: disaster recovery). While we might be able to solve that problem with routing protocols (and that’s how it’s been done forever ... but those skills got lost in the mists of time), it’s easier to request a stretched layer-2 segment and push the problem into another team’s lap.

Also I am wondering why geo-distributed L2 segments are so important to network engineers, which is a conclusion I made from your posts.

The only reason they’re important to us is that we get asked to implement them, even though we know they will eventually fail ... badly.
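
For readers who haven't seen the routing-based alternative in action, here's a minimal Python sketch of the idea: the moved server keeps its IP address, and the network learns a host route that is more specific than the data center aggregate. The prefixes, the data center names and the `lookup` helper are all made up for illustration; a real network would learn the host route through a routing protocol, not a dictionary update.

```python
import ipaddress


def lookup(routes, destination):
    """Return the next hop of the longest-prefix match, or None if nothing matches."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes.items() if dest in net]
    if not matches:
        return None
    # The most specific prefix (largest prefix length) wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Hypothetical addressing: each data center announces one aggregate prefix.
routes = {
    ipaddress.ip_network("10.1.0.0/16"): "dc-a",
    ipaddress.ip_network("10.2.0.0/16"): "dc-b",
}

vm = "10.1.20.5"
before = lookup(routes, vm)

# The VM is restarted in DC B with its address unchanged; DC B now
# announces a /32 host route for it (simulated here by adding an entry).
routes[ipaddress.ip_network("10.1.20.5/32")] = "dc-b"
after = lookup(routes, vm)

# A neighbor that stayed behind still follows the DC A aggregate.
neighbor = lookup(routes, "10.1.20.6")

print(before, after, neighbor)
```

Only the moved address is attracted to the second data center; everything else still follows the aggregate. The price is one extra host route per moved server.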

6 comments:

  1. we have a link to our remote l2 subnets over l2tpv3 for l2 ids (the tco of dedicated sensors would've been higher).
    that's not the kind of application example the two of you meant, but sometimes a single reason is enough... and the documentation battle has another hill to capture. =)

  2. There are many reasons why layer 2 is so overplayed and overstretched. Layer 3 involves routing and routing protocols, which a lot of people don't understand, so they avoid it as much as they can. In the metro space, customers are deploying layer 2 Metro switches instead of layer 3, because layer 3 routers have the features but not wire-speed packet throughput (think about Cisco ISR routers). Today I see a lot of folks trying to solve problems that were solved so elegantly by routing a long time ago.

  3. Wait...solve the problem of changing IPs when moving VMs across an L3 connection between data centers using...routing protocols? I'm not saying you're wrong but I can't understand how that would be done, except to maybe use NAT and/or tunneling? What do you mean that this problem can be solved by using routing protocols?

  4. Ivan Pepelnjak, 20 April, 2012 20:55

    IP addresses on server loopback interfaces and servers running routing protocols with the first-hop switches. That's how server interface redundancy has been done before we forgot how to do it properly and decided L2 tricks make more sense.

    Replies
    1. That's flippin' brilliant! I'd never thought of that as even an option. Makes me want to go try it now though. Would each server advertise a /32 or /128? Wouldn't this result in a rather large routing table?

  5. Let me add a few logs to the fire ;-)

    First off, it's even worse than many realize in many ways. The reason you need the ipaddr to stay the same is that you don't want to terminate connections/sessions. The goal of all of this is to be able to move an application without the users connected to that application noticing that anything has happened. This may mean a web page reload, but state is maintained and transactional records are kept and continue.

    Application Affinity is also required. You can't move a single VM that has any local dependencies. You are moving 6, 10, 20 VMs at once. You have to move the web servers, the app servers, the databases and whatever support services it depends on. Apps must be moved either as a whole, or in pre-established pods or clusters that are wholly contained.

    When you consider that a single VM moving can eat a 10G link, can you imagine the bandwidth needed to move 20? Good thing 100GbE is so economical ;-)

    This is NOT for every application. Most applications can handle (from an SLA point of view) a few minutes of downtime. For those apps, you don't need to do this. You want to do snapshot-based, point-in-time, array-based replication. In this model you just bring up clones of the VMs at another site that are maybe 5-50 minutes old. So you lose some data. This model works today, we have evidence of it working very well, especially in cases where warning can be given of the event. And it's much more within the budget.

    The main gain of this method is that complexity is reduced, cost is reduced, and latency limitations go from 10ms to basically 350ms. And you don't really need to stretch anything. The clones can be brought up in a "quarantined" environment and their ipaddr/netmask/gateway changed before they are "promoted" and allowed to accept connections. A GSLB can then add that server to the pool and direct connections to the new location.

    For the top end of applications, and in the beginning just for that top end of companies/agencies the ability to move a pack of VMs from one location to another is not only possible, it's happening. I personally have clients doing it. It's not easy. It's VERY expensive. And it's pretty damn cool.

    But for most applications and for most customers a point-in-time recovery solution is just fine and MUCH easier.

    The whole reason behind much of this is business continuity. It's about having applications that can just be migrated from data center to data center allowing that hardware to be worked on while the application migrates on to greener pastures. This is the Ferrari solution. Initially this is being done by large government agencies and large corporations where for them, the cost and complexity are justified.

    However, while everyone wants a Ferrari, a nice VW gets you to work and back perfectly fine.

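
To put rough numbers on the bandwidth remark in the last comment, here's a back-of-the-envelope Python sketch. The amount of data per VM (64 GB) and the 80% usable link efficiency are assumptions picked purely for illustration, and the `transfer_seconds` helper is hypothetical. The estimate also ignores memory pages that get dirtied and re-sent during a live move, so real transfers would take longer.

```python
def transfer_seconds(total_gigabytes, link_gbps, efficiency=0.8):
    """Estimate the time to push total_gigabytes over a link of link_gbps.

    efficiency is the assumed fraction of the nominal link speed that is
    actually usable for the transfer.
    """
    bits = total_gigabytes * 8e9              # gigabytes -> bits
    usable_bps = link_gbps * 1e9 * efficiency
    return bits / usable_bps


vm_count = 20    # the "pack" of VMs from the comment
gb_per_vm = 64   # assumption: data to copy per VM

for link_gbps in (10, 100):
    seconds = transfer_seconds(vm_count * gb_per_vm, link_gbps)
    print(f"{link_gbps}G link: about {seconds / 60:.1f} minutes for {vm_count} VMs")
```

Under these assumptions the move takes roughly 21 minutes on a 10G link and about 2 minutes on a 100G link, which is why moving a whole application pod at once gets expensive quickly.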


Ivan Pepelnjak, CCIE#1354, is the chief technology advisor for NIL Data Communications. He has been designing and implementing large-scale data communications networks as well as teaching and writing books about advanced technologies since 1990. See his full profile, contact him or follow @ioshints on Twitter.