Layer-3 gurus: asleep at the wheel

I just read a great article by Kurt (the Network Janitor) Bales eloquently describing how a series of stupid decisions led to the current situation, where everyone (except the people who actually work with the networking infrastructure) thinks stretched layer-2 domains are the mandatory stepping stone toward cloudy nirvana.

It’s easy to shift the blame to everyone else, including storage vendors (for their love of FC and FCoE) and VMware (for the broken vSwitch design), but let’s face reality: the rigid mindset of layer-3 gurus probably has as much to do with the whole mess as anything else.

We should start with the business basics. Server virtualization became a viable solution a few years ago, saving the enterprises that embraced it tons of money (and countless server installation hours). The ability to move running virtual machines between physical servers (vMotion or Live Migration) is a huge bonus in a high-availability or changing-load environment. Is it so strange that the server admins want to use those features?

Now, imagine a purely fictional dialog between a server admin trying to deploy vMotion and a L3-leaning networking guru:

SA: You know what – VMware just introduced a fantastic new feature. vMotion allows me to move running virtual machines between physical servers.

NG: So?

SA: The only problem I have is that I have to keep the application sessions up and running.

NG: So?

SA: But they break down if I move the VM across the network.

NG: Do they?

SA: Yeah, supposedly it has to do with the destination server being in a different IP subnet.

NG: You want to move a VM between IP subnets? How is that supposed to work? Obviously you have to change its IP address if it lands in a different IP subnet. Don’t you know how IP routing works?

SA: But wouldn’t that kill all application sessions?

NG: I guess it would. Is that a problem?

SA: Is there anything else we could do?

NG: Sorry, pal, that’s how routing works. Shall I explain it to you?

SA: But VMware claims that if I have both servers in the same VLAN, things work.

NG: Yeah, that could work. So you want me to bridge between the servers? (Feel free to add alternate endings in the comments ;)

The sad part of the whole story is that we had an L3 solution available for decades – Cisco’s Local Area Mobility (LAM), which created host routes based on ARP requests, worked quite well in the 1990s. IP mobility could be another option, but it would obviously require modifications to the guest operating system, which is usually considered taboo.
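To make the LAM idea concrete, here's a toy Python sketch of the core mechanism (illustrative only, not actual IOS behavior or configuration): when the router hears an ARP request sourced from an IP address that doesn't belong to the receiving interface's subnet, it installs a /32 host route pointing at that interface. The subnet, addresses, and interface name below are made up.

```python
import ipaddress

# Toy model of Cisco Local Area Mobility (illustrative only):
# an ARP request sourced from an off-subnet address triggers
# installation of a /32 host route on the receiving interface.

LOCAL_SUBNET = ipaddress.ip_network("10.1.1.0/24")  # subnet configured on Gi0/1
routing_table = {}  # prefix string -> outgoing interface

def process_arp_request(sender_ip: str, interface: str) -> None:
    """Install a host route for a 'roaming' host heard via ARP."""
    if ipaddress.ip_address(sender_ip) not in LOCAL_SUBNET:
        routing_table[f"{sender_ip}/32"] = interface

# A VM whose home subnet is 10.2.2.0/24 shows up behind Gi0/1
# after a migration and sends an ARP request:
process_arp_request("10.2.2.5", "Gi0/1")
print(routing_table)  # {'10.2.2.5/32': 'Gi0/1'}
```

The host route would then be redistributed into the routing protocol, so the rest of the network reaches the moved VM at its original address – no bridging required.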

It’s obvious that the programmers working on vSwitch/vMotion code (or Microsoft Network Load Balancing) lacked networking knowledge and rarely considered anything beyond their lab environment, declaring the product ready-to-deploy as soon as they could get two or three boxes working over a $9.99 switch. It’s also quite obvious that the networking vendors (including Cisco, Juniper, HP, and everyone else) did absolutely nothing to solve the problem. The worst offender (in my opinion) is Nexus 1000V. It could have been a great L3 platform, solving most of the architectural problems we’re endlessly debating, but instead Cisco decided to launch a product with a minimum feature set needed to get a reasonable foothold in the ESX space.

It’s infinitely sad to watch all the networking vendors running around like headless chickens trying to promote yet another routing-at-layer-2 fabric solution instead of stepping back and solving the problem where it should have been solved: within layer 3.


  1. I'm always interested in a better network design. What would you recommend that still achieves server portability?
  2. As I wrote, there is nothing, because nobody has worked on this problem for the last 5+ years.

    LISP in Nexus 1000V might be the answer, but I don't like the extra layer of encapsulation it introduces.
  3. I must admit that the scale I'm thinking of is in the range of dozens of servers at most, but what is wrong with virtualizing the first-hop router? For every workload that needs mobility beyond a single Ethernet domain, virtualize the default gateway.

    Your gateway would need VLAN interfaces for the VMs it routes for, plus an interface in some kind of 'OSPF adjacency' VLAN in each datacenter. As your workload migrates from one datacenter (or cluster) to another, the virtualized router migrates with it; once the migration completes, OSPF adjacencies are formed on the appropriate OSPF VLAN, and new routes to the workload are propagated to the network. Keep in mind that no traffic flows while your router is in one cluster or DC and your workload is in the other.

    How do you keep your virtual router from advertising the routes for datacenter OSPF networks it is not connected to? How does this scale beyond a workload that needs 1 Gbps of network throughput? How do you get access to the storage? Does VMware SRM take care of the latter two?
  4. Then I'm not sure I understand the issue. If there are no alternatives, then we're doing the best we can with the technologies available to implement desired capabilities that have immediate benefits. Should we do nothing simply because it's not optimal, or because at some future time it will no longer scale under a set of particular assumptions?
  5. Solutions do exist: you can use load balancers or (even better) a more optimized application architecture.

    But if all you're willing to do is "move this VM to the other end of the world", then we have a problem ;) LISP can solve it, but (as I said) it introduces yet another layer of encapsulation.
  6. Load balancers require more integration with the application install and configuration. That is much harder and more time consuming, and increases the ongoing operational maintenance activities (more servers = more work). And not all applications can support it. Again, we're dealing with the capabilities available now, not what we wish we had.
  7. It also increases costs because you have to license the additional server OS and application, as well as the load balancer.
  8. Ivan, the problems of clusters, security and other interesting stuff can almost always be solved with a good design in the application layer, and a little help from the operating system, network & storage.

    Nowadays, most applications have bad designs and demand *a lot* of help from the operating system, network & storage.

    The fact that most engineers in the last 10+ years do not have a broad view of the above areas has led to the problems you describe.

    It's the application (& protocol) design that needs to be fixed!

    P.S. Tell the server admin to configure the application to do all its networking from a loopback address, and you can take care of the rest easily.
  9. I always thought that LAM failed because IOS wasn't able to hold enough routes in memory. The idea that /32 routes could exist in large volumes and be constantly updated meant the Cisco hardware was incapable of scaling to sufficient size (think TCAM, memory, and undersized CPUs).

    Is that your view as well ?
  10. Nicely timed post from Brad Hedlund...
  11. > P.S. Tell the server admin to configure the application to do all its networking
    > from a loopback address, and you can take care of the rest easily.

    I'll second that. So often network engineering is an effort to solve problems of crappy application development and/or server deployment practices that could be easily fixed if app/server people could think just a little bit outside of their domain.
  12. Absolutely. Love it. Did you notice how both proposals he finds sensible modify existing L2 or L3 behavior? Proves my point: we've been too complacent for too long.
  13. How does Cisco's OTV play into this scenario?
  14. Of course it wasn't *that* long ago that server OSes shipped with a routing daemon running by default (Solaris & gated anyone?) and could advertise a /32 route (my employer still has that for one legacy "cluster").

    Still doesn't scale nicely of course! :)
  15. IMHO, the whole premise behind what vMotion is trying to solve is suboptimal. If a service is so critical that one needs very high availability, put a load balancer in front of those servers. Requiring the network to be overly complicated, or potentially unstable across an entire datacenter, seems pretty crazy.

    Believe me when I say that I do not love load balancers. They are temperamental and expensive black boxes. They do provide a much more scalable solution than vMotion. My biggest problem with vMotion is that it allows application owners to be lazy. They will develop their software assuming that the network will always allow them to shift their services around. It won't challenge them to think about how to scale their service by orders of magnitude.

    There are all sorts of "great" protocols out there, like OTV, that allow network engineers to come up with "creative solutions" to suboptimal service requirements. IMHO
  16. LAM sucks. That's why it didn't succeed. How can something based upon ARP/RARP be considered a viable solution?

    The meta-issue here isn't layer-2 vs. layer-3, it's a) the overloading of IPv4 (and now IPv6) addresses with locator/EID information, b) the overloading of IPv4 (and now IPv6) addresses with policy information via ACLs, firewall rules, et al., and c) the continued worst practice of application developers further overloading IPv4 (and now IPv6) addresses by hardcoding IP addresses directly into their applications/platforms/services, instead of abstracting this away via a naming service (e.g., DNS, at least for now).
  17. OTV is a slightly better bridging. Still doesn't scale (although at least they got rid of unknown unicast flooding).
  18. vMotion is not necessarily solving high-availability issues. It also provides load distribution / adjustment capabilities.

    As for the appdev laziness, I couldn't agree more with you ;)
  19. I absolutely agree with everything you wrote. However, you're missing an important point: live VM migration for load distribution purposes. It would be tough to implement persistent connections in that scenario ... unless we would have a robust session layer that would survive transport layer failures (and a very quick failure detection mechanism).
  20. We're not waiting anymore. We bought L2 service (VPLS) from a pair of national service providers, with large MTUs (4400 bytes) at all data centers, and we are moving forward with it. It's there, and it's vendor-interoperable. It can be a pain in the ass, but it works. We already have MPLS in our WAN core, so it seems like the natural thing to do.
  21. Will you do L2 or L3 DCI? If you go for L2, what technology will you use? Just bridging with STP over (provider-delivered) VPLS or something fancier?
  22. Sorry for being ambiguous.

    We're buying large-MTU, CoS-enabled multipoint VPLS from two different providers (for redundancy). On top of this service, we run our own MPLS infrastructure and terminate the access circuits on P nodes that we manage. Looking at the header of a packet inside one of the two service providers' networks, you would see something like ETH|MPLS|MPLS|ETH|MPLS|MPLS|ETH: the rightmost Ethernet header belongs to one of our customers (internal or external), the two MPLS labels to its left are our own VPLS and transport labels, the Ethernet header to the left of those belongs to one of our own P nodes, and everything to the left of that belongs to one of the two service providers (and is therefore invisible to us).

    So we are running our own L3VPN and L2VPN/VPLS services for our many internal L2 and L3 networks. Think many lines of business, each with its own web tier, app tier, storage tier, etc. Some components are shared, many are not. There are multiple network teams with a fair degree of autonomy (because business units themselves are marketable things that can be broken off and sold whole). Even some of the web tiers within individual lines of business are so large they are broken into multiple logically isolated networks.

    We have piles of L2 and L3 DCI requirements. We have multiple vendors in the network and server spaces, so we looked at the problem as if we were a service provider ourselves and decided to turn our network into a service, rather than what it was: many poorly utilized parallel circuits, owned by different groups, costing us outrageous amounts of money in total.

    Which is more info than you were looking for. From my team's perspective, OTV would be something that one or more of these lines of business might buy into, and we would provide a multipoint VPLS service in support of it. I think we are a pretty good case study on why it's just wrong to compare OTV and VPLS as if they were competing technologies.
  23. Thanks for the extensive answer. You're doing exactly what I would recommend (which is nice to see; seems I'm not too far off the mark ;)

    I was trying to figure out how someone would use SP-delivered VPLS service for L2 DCI and the only viable use I could see was to turn it into an IP(+MPLS) subnet, which is what you did.
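The loopback-address trick suggested in a couple of the comments above can be sketched in a few lines of Python: the application binds explicitly to an address assigned to the loopback interface (e.g. `ip addr add 192.0.2.10/32 dev lo` on Linux) while a routing daemon advertises that address as a /32 host route, so the service's identity survives a move between subnets. This is a minimal sketch, not a complete recipe; 127.0.0.1 stands in for the real loopback-assigned service address so the snippet runs anywhere, and the port choice is illustrative.

```python
import socket

# Sketch of the "bind to a loopback-assigned address" approach from the
# comments. On a real server you would assign the service address to the
# loopback interface (e.g. "ip addr add 192.0.2.10/32 dev lo" on Linux)
# and let a routing daemon advertise it as a /32 host route.
SERVICE_ADDR = "127.0.0.1"  # stand-in for the loopback-assigned address
SERVICE_PORT = 0            # 0 = let the OS pick a free port for this demo

# Binding explicitly to the loopback-assigned address decouples the
# application's identity from any physical interface or subnet.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind((SERVICE_ADDR, SERVICE_PORT))
srv.listen(5)
addr = srv.getsockname()
print(addr)
srv.close()
```

When the VM moves, the /32 advertisement follows it to the new subnet, and the application never notices – the same idea the Solaris/gated comment describes, minus the per-server routing daemon scaling pain.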