The Grumpy Old Network Architects and Facebook

Nuno wrote an interesting comment to my Stretched Firewalls across L3 DCI blog post:

You're an old-school, disciplined networking leader who architects networks based on rock-solid, time-tested designs. But it seems that the prevailing fashion in network design and availability goes against your traditional design principles: inter-site firewall clustering, inter-site vMotion, DCI, etc.

Not so fast, my young padawan.

Let’s define prevailing fashion first. You might define it as Kool-Aid peddled by snake-oil salesmen, or as cool network designs by people who know what they’re doing. If we stick with the first definition, you’re absolutely right.

Now let’s look at the second camp: how people who know what they’re doing build their networks (Amazon VPC, Microsoft Azure or Bing, Google, Facebook, and a number of other large-scale networks). You’ll find L3 down to the ToR switch (or even the virtual switch), and absolutely no inter-site vMotion or clustering – because they don’t want to bet their service, ads or likes on the whims of a technology that was designed to emulate thick yellow cable.
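
To make the “L3 down to the ToR” idea a bit more concrete, here is a minimal sketch (my illustration, not anything published by the companies mentioned above) of the usual approach: every rack gets its own prefix, the ToR switch advertises that prefix upstream over eBGP, and failures or workload moves are handled by routing instead of by stretching a VLAN. The supernet, the /24-per-rack assumption, the private AS numbers and the FRR-flavored syntax are all made up for the example.

```python
# Illustrative sketch only: per-rack prefixes and an eBGP stanza per ToR.
# All values (supernet, prefix length, ASNs, syntax) are assumptions.
import ipaddress

POD_SUPERNET = ipaddress.ip_network("10.16.0.0/16")  # assumed pod-wide range
RACK_PREFIX_LEN = 24                                  # assumed /24 per rack
BASE_AS = 65000                                       # assumed private ASN base

def tor_bgp_stanza(rack_id: int) -> str:
    """Return an illustrative, FRR-flavored eBGP stanza for one ToR switch."""
    rack_prefix = list(POD_SUPERNET.subnets(new_prefix=RACK_PREFIX_LEN))[rack_id]
    asn = BASE_AS + rack_id
    return "\n".join([
        f"router bgp {asn}",
        f" bgp router-id 192.0.2.{rack_id + 1}",   # documentation-range router ID
        " neighbor SPINES peer-group",
        " neighbor SPINES remote-as external",
        " address-family ipv4 unicast",
        f"  network {rack_prefix}",                # advertise the rack-local prefix
        " exit-address-family",
    ])

if __name__ == "__main__":
    # Print illustrative configs for the first three racks.
    for rack in range(3):
        print(tor_bgp_stanza(rack))
        print()
```

Running the script prints one stanza per rack; the point is simply that the rack is a routing boundary, so nothing at layer 2 ever has to leave it, let alone cross a DCI link.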

Want to know how to design an application to work over a stable network? Watch my Designing Active-Active and Disaster Recovery Data Centers webinar.

This isn't the first time that readers have asked you about these technologies, and it won't be the last. Vendors will continue to market them despite their shortcomings, and customers will continue to eat them up.

As long as there is someone willing to believe in fairy tales and Santa Claus, there will be someone dressed in a red coat and a fake beard yelling “Ho, Ho, Ho!”

Enterprise IT managers sometimes act like small kids. They don’t want to hear that they have people and process problems, and love to believe that the next magical bit of technology will solve whatever it is that bothers them. Vendors obviously love to exploit these cravings and sell them ever-more-complex solutions.

I'd like to think that vendors will also continue to work out the kinks and over time the technology will become rock solid and time-tested.

I am positive you can make any technology almost-rock-solid. You can also make pigs fly (see RFC 1925, section 2, truth 3). However, have you included the fuel costs in your TCO?

Also, the more complex a technology is, the likelier it is to collapse like a house of cards, and you’ll be left with an incomprehensible mix of bits and pieces that will be impossible to put back together (see also: You can’t reformat your data center).

Nuno concluded his comment with a question:

Are you too stuck on past, traditional designs and not open to new ways of building IT? I get that IT is very cyclical, and these new trends may die in the future...or thrive, and the customers may either fail...or succeed.

I am very open to new ways of building IT. I preach the need for meaningful SDN (not the centralized control plane crap), network automation, and proper application architecture. I just refuse to believe in fairy tales, or in solving non-technical problems with technology.

Finally…

Looking for more red pills? Explore my SDN webinars, Designing Active/Active Data Centers webinar, and vMotion-related blog posts.

8 comments:

  1. All I could think when reading this was how naive these suggestions are from a business perspective. Nobody in their right mind does "fashionable" things when dealing with infrastructures that are required to be solid, dependable and robust.

    Nobody wants to be "that guy" who bought into vendor hot air and implemented some sort of control-plane abstraction or stretched infrastructure that one day goes wrong or melts down, while the vendor looks at you with a blank stare.

    The reason the infrastructures we have are the way they are is because they are stable and known quantities.

    You'd have to be a sociopath to sit and think "hey, I'll do this fashionable new thing and look like a networking rockstar; if the infrastructure comes crashing down, loses the company money, and costs the jobs of the people working at the coalface, it'll be fine, at least I did the fashionable thing..."

    Or would you rather be the guy who did everything with proven technology and best practices from proven solutions, keeping the design as simple as possible so that the documentation actually gets completed (which isn't unusual for the majority of us), and so that some obscene vendor lock-in doesn't happen when you have to scale up and your current vendor is putting the thumbscrews on you?

    We are here to support the business and bring value, not to make it our playground for fancy new unproven things. If those technologies mature into something that isn't going to make anyone nervous, then great: they'll get used by people who aren't startups or corner cases. I'm certainly not going to be sat writing an RCA on how this new technology somehow made my load balancers melt...
  2. Ivan, any chance you want to chime in here: https://www.reddit.com/r/sysadmin/comments/3vx6wh/stretched_datacenter_topology/
  3. "Enterprise IT managers sometimes act like small kids. They don’t want to hear that they have people- and process problems, and love to believe that the next magical bit of technology will solve whatever it is that bothers them. Vendors obviously love to explore these cravings and sell them ever-more-complex solutions."

    Me IRL. I just went down the OTV, LISP, F5, vMotion rabbit hole. Thankfully, once the quotes came back, that solution was cost-prohibitive.
  4. everybody has a fashion
  5. The point is that managers would prefer to put a patch in the network (say, an F5 iRule) rather than fix the root problem. Same with stretching L2: people believe they will save piles of $$$ on servers by doing some network magic, and they know everyone will forget who demanded the magic when it breaks.
  6. Ivan, surely you realize that "Amazon VPC, Microsoft Azure or Bing, Google, Facebook" are not the norm. Most of us work with enterprise, academia, and government, where apps are built fast and sloppy and IT is left to provide scale and fault tolerance.

    Frankly, I've found L2 extension to be extremely helpful for short-term network upgrades and migrations, and long-term DR/SRM networks. Just like a load-balancer, L2 extension can create a great app environment for a fraction of the cost of building the apps "right."

    I'll admit that Live Migration at distance is a dumb idea (cost-to-value is much too high). But L2 extension as a network tool is here to stay -- at least until our customers start using dockers instead of wearing them!
    Replies
    1. I think there are probably plenty of people out there who use some sort of layer 2 circuit to facilitate a migration.

      DR/SRM via stretched layer 2 can be avoided these days by building virtual routers into the deployment. VMware NSX/SRM looks like it handles it on paper.

      Doing a DIY job with Quagga or VyOS etc. is probably also an option.
  7. Totally agree with you. An inter-site firewall cluster is a marketing driver for selling DCI solutions; it's not the best way to handle a DR situation. Most DR cases are concerned with catastrophic scenarios, so the better approach is to have a few separate firewall clusters and pretty simple failover based on classic routing schemes. Other disadvantages of a stretched cluster: a single failure domain (in most cases the control plane), architectural and hardware bottlenecks (for example, shared resources for handling active state, sessions, configuration, etc.), and operations (software upgrades, configuration, hardware, etc.). Sometimes cluster technologies can themselves be the root cause of network instability (flapping, split brain, vendor software bugs, etc.).

    "Resiliency isn’t just about having multiple components; it’s also about isolation of failure domains." - Russ White