Layer-3 gurus: asleep at the wheel
I just read a great article by Kurt (the Network Janitor) Bales eloquently describing how a series of stupid decisions led to the current situation, where everyone (except the people who actually work with the networking infrastructure) thinks stretched layer-2 domains are the mandatory stepping stone toward the cloudy nirvana.
It’s easy to shift the blame to everyone else, including storage vendors (for their love of FC and FCoE) and VMware (for the broken vSwitch design), but let’s face reality: the rigid mindset of layer-3 gurus probably has as much to do with the whole mess as anything else.
We should start with the business basics. Server virtualization became a viable solution a few years ago, saving the enterprises that embraced it tons of money (and countless server installation hours). The ability to move running virtual machines between physical servers (vMotion or Live Migration) is a huge bonus in a high-availability or changing-load environment. Is it so strange that the server admins want to use those features?
Now, imagine a purely fictional dialog between a server admin trying to deploy vMotion and a L3-leaning networking guru:
SA: You know what – VMware just introduced a fantastic new feature. vMotion allows me to move running virtual machines between physical servers.
NG: So?
SA: The only problem I have is that I have to keep the application sessions up and running.
NG: So?
SA: But they break down if I move the VM across the network.
NG: Do they?
SA: Yeah, supposedly it has to do with the destination server being in a different IP subnet.
NG: You want to move a VM between IP subnets? How is that supposed to work? Obviously you have to change its IP address if it lands in a different IP subnet. Don’t you know how IP routing works?
SA: But wouldn’t that kill all application sessions?
NG: I guess it would. Is that a problem?
SA: Is there anything else we could do?
NG: Sorry, pal, that’s how routing works. Shall I explain it to you?
SA: But VMware claims that if I have both servers in the same VLAN, things work.
NG: Yeah, that could work. So you want me to bridge between the servers? (Feel free to add alternate endings in the comments ;)
The sad part of the whole story is that we’ve had an L3 solution available for decades – Cisco’s Local Area Mobility (LAM), which created host routes based on ARP requests, worked quite well back in the 1990s. IP mobility could be another option, but it would obviously require modifications to the guest operating system, which is usually considered taboo.
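For readers who have never seen LAM in action, the core idea fits in a few lines of Python. Treat this as a minimal sketch of the mechanism, not of Cisco's implementation; the interface names, prefixes and data structures are invented for illustration:

```python
# Sketch of the Local Area Mobility idea: when a router hears an ARP request
# from a host whose source IP does not belong to the subnet configured on the
# receiving interface, it installs a /32 host route and lets the IGP carry it.
from ipaddress import ip_address, ip_network

class LamRouter:
    def __init__(self):
        # interface name -> locally configured subnet (illustrative values)
        self.interfaces = {"Gi0/1": ip_network("10.1.1.0/24")}
        # destination prefix -> outgoing interface
        self.routing_table = {}

    def on_arp_request(self, interface, sender_ip):
        sender = ip_address(sender_ip)
        if sender not in self.interfaces[interface]:
            # "Mobile" host: keep L3 forwarding working with a host route
            self.routing_table[f"{sender}/32"] = interface
            self.redistribute_into_igp(f"{sender}/32")

    def redistribute_into_igp(self, prefix):
        print(f"advertising host route {prefix} into the IGP")

r = LamRouter()
r.on_arp_request("Gi0/1", "10.2.2.25")   # VM that moved in from another subnet
print(r.routing_table)                   # {'10.2.2.25/32': 'Gi0/1'}
```

A real implementation would obviously also have to age out stale host routes and filter what gets redistributed into the IGP.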
It’s obvious that the programmers working on the vSwitch/vMotion code (or Microsoft Network Load Balancing) lacked networking knowledge and rarely considered anything beyond their lab environment, declaring the product ready to deploy as soon as they could get two or three boxes working over a $9.99 switch. It’s also quite obvious that the networking vendors (including Cisco, Juniper, HP and everyone else) did absolutely nothing to solve the problem. The worst offender (in my opinion) is the Nexus 1000V: it could have been a great L3 platform, solving most of the architectural problems we’re endlessly debating, but instead Cisco decided to launch a product with the minimum feature set needed to gain a reasonable foothold in the ESX space.
It’s infinitely sad to watch all the networking vendors running around like headless chickens, trying to promote yet another routing-at-layer-2 fabric solution instead of stepping back and solving the problem where it should have been solved: at layer 3.
LISP in Nexus 1000V might be the answer, but I don't like the extra layer of encapsulation it introduces.
Your gateway would need VLAN interfaces for the VMs it routes for, plus an interface in some kind of 'OSPF adjacency' VLAN in each datacenter. As your workload migrates from one datacenter (or cluster) to another, the virtualized router is migrated along with it; once it comes up, OSPF adjacencies form on the appropriate OSPF VLAN and new routes to the workload are propagated into the network (there's a rough sketch of this after the questions below). Keep in mind that while the router is in one cluster or DC and the workload is in the other, no traffic flows.
Problems:
How do you keep your virtual router from advertising routes for datacenter OSPF networks it is not connected to? How does this scale beyond a workload that needs 1 Gbps of network throughput? How do you get access to the storage? Does VMware SRM take care of the last two?
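If it helps, here is a back-of-the-envelope model of the route-follows-router idea in Python. It is pure illustration (no OSPF, no timers), and every name and prefix in it is made up:

```python
# When the virtual router migrates, the routes for its workload disappear from
# the old datacenter's routing domain and show up in the new one.
from dataclasses import dataclass, field

@dataclass
class Datacenter:
    name: str
    routes: set = field(default_factory=set)   # prefixes learned in this DC

@dataclass
class VirtualRouter:
    workload_prefixes: set
    location: Datacenter

    def migrate(self, new_dc: Datacenter):
        # adjacency on the old DC's OSPF VLAN goes down, routes are withdrawn
        self.location.routes -= self.workload_prefixes
        # adjacency forms on the new DC's OSPF VLAN, routes are re-advertised
        new_dc.routes |= self.workload_prefixes
        self.location = new_dc

dc_a, dc_b = Datacenter("DC-A"), Datacenter("DC-B")
vr = VirtualRouter({"10.99.0.0/24"}, dc_a)
dc_a.routes |= vr.workload_prefixes      # initial advertisement in DC-A
vr.migrate(dc_b)
print(dc_a.routes, dc_b.routes)          # set() {'10.99.0.0/24'}
```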
But if all you're willing to do is "move this VM to the other end of the world", then we have a problem ;) LISP can solve it, but (as I said) it introduces yet another layer of encapsulation.
Nowadays most applications are badly designed and expect *a lot* of help from the operating system, network & storage.
The fact that most engineers in the last 10+ years do not have a broad view of these areas has led to the problems you describe.
It's the application (& protocol) design that needs to be fixed!
PS: Tell the server admin to configure the application to do all its network stuff from a loopback address and you can take care of the rest easily.
Is that your view as well?
http://bradhedlund.com/2011/02/09/emergence-of-the-massively-scalable-data-center/
> stuff from a loopback address and you can take care of the rest easily.
I'll second that. So often, network engineering is an effort to solve problems created by crappy application development and/or server deployment practices that could easily be fixed if app/server people could think just a little bit outside their domain.
Still doesn't scale nicely of course! :)
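For what it's worth, the loopback trick mentioned above really is that simple on the application side. A minimal sketch; the service address is an assumption and would have to be configured on a loopback interface and advertised by the IGP for this to work:

```python
# Bind the service to a /32 that lives on a loopback interface and is carried
# by the routing protocol, so the physical subnet the server sits in stops
# mattering. The address and port below are placeholders.
import socket

SERVICE_IP = "192.0.2.10"   # /32 on a loopback interface, advertised via the IGP
SERVICE_PORT = 8080

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind((SERVICE_IP, SERVICE_PORT))    # bind to the loopback /32, not the NIC address
srv.listen(16)
print(f"service reachable at {SERVICE_IP}:{SERVICE_PORT}, wherever the VM runs")
```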
Believe me when I say that I do not love load balancers. They are temperamental and expensive black boxes, but they do provide a much more scalable solution than vMotion. My biggest problem with vMotion is that it allows application owners to be lazy: they will develop their software assuming the network will always let them shift their services around, and it won't challenge them to think about how to scale their service by orders of magnitude.
There are all sorts of "great" protocols out there, like OTV, that allow network engineers to come up with "creative solutions" to suboptimal service requirements. IMHO.
The meta-issue here isn't layer-2 vs. layer-3; it's a) the overloading of IPv4 (and now IPv6) addresses with locator/EID information, b) the overloading of IPv4 (and now IPv6) addresses with policy information via ACLs, firewall rules, et al., and c) the continued worst practice of application developers further overloading IPv4 (and now IPv6) addresses by hardcoding them directly into their applications/platforms/services instead of abstracting this away via a naming service (e.g., DNS, at least for now).
As for the appdev laziness, I couldn't agree more with you ;)
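On point (c) above, the application-side fix is almost embarrassingly small: resolve a name at connection time instead of baking the address into the code. The hostname below is just a placeholder:

```python
import socket

# Worst practice: the locator is hardcoded into the application
# conn = socket.create_connection(("192.0.2.42", 443))

# Let the naming service supply whatever address is currently correct
conn = socket.create_connection(("api.example.com", 443))
```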
We're buying large-MTU, CoS-enabled multipoint VPLS from two different providers (for redundancy). On top of this service we run our own MPLS infrastructure, and we terminate the access circuits on P nodes that we manage. Looking at the header of a packet inside one of the two service providers' networks, you would see something like ETH|MPLS|MPLS|ETH|MPLS|MPLS|ETH: the rightmost Ethernet header belongs to one of our customers (internal or external), the two MPLS labels to its left are our own transport and VPLS labels, the Ethernet header in the middle belongs to one of our own P nodes, and everything to the left of that belongs to one of the two service providers (and is therefore invisible to us).
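To make that label stack concrete, here is roughly how it would look rebuilt in Scapy (assuming the contrib MPLS layer; all labels and MAC addresses are invented):

```python
from scapy.all import Ether
from scapy.contrib.mpls import MPLS

pkt = (
    Ether(src="00:00:00:0a:0a:0a", dst="00:00:00:0b:0b:0b")     # SP Ethernet
    / MPLS(label=1001, s=0)                                      # SP transport label
    / MPLS(label=2001, s=1)                                      # SP VPLS service label
    / Ether(src="00:00:00:1a:1a:1a", dst="00:00:00:1b:1b:1b")   # our P-node Ethernet
    / MPLS(label=3001, s=0)                                      # our transport label
    / MPLS(label=4001, s=1)                                      # our VPLS/L3VPN service label
    / Ether(src="00:00:00:2a:2a:2a", dst="00:00:00:2b:2b:2b")   # customer frame
)
pkt.show()
```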
So we run our own L3VPN and L2VPN/VPLS services for the many internal L2 and L3 networks we have. Think many lines of business, each with its own web tier, app tier, storage tier, etc. Some components are shared, many are not. There are multiple network teams with a fair degree of autonomy (because the business units themselves are marketable things that can be broken off and sold whole). Even some of the web tiers within individual lines of business are so large that they are broken into multiple logically isolated networks.
We have piles of L2 and L3 DCI requirements and multiple vendors in both the network and server spaces, so we looked at the problem as if we were a service provider ourselves and decided to turn our network into a service, rather than what it was: many poorly utilized parallel circuits, owned by different groups and, in total, costing us outrageous amounts of money.
Which is more info than you were looking for. From my team's perspective, OTV would be something that one or more of these lines of business might buy into, and we would provide a multipoint VPLS service in support of it. I think we are a pretty good case study on why it's just wrong to compare OTV and VPLS as if they were competing technologies.
I was trying to figure out how someone would use an SP-delivered VPLS service for L2 DCI, and the only viable use I could see was to turn it into an IP(+MPLS) subnet, which is exactly what you did.