Some People Don’t Get It: It Will Eventually Fail

Mark Baker left this comment on my Stretched Firewalls across Layer-3 DCI blog post:

Strange how inter-DC clustering failure is considered a certainty in this blog.

Call it experience or exposure to a larger dataset. Anything you build will eventually fail; not having experienced the failure yet doesn’t mean the system will never fail, only that you’ve been lucky so far.

Let me use a trivial example from the real world to illustrate the point. When I was a kid, we didn’t use seat belts, because everyone knew that a traffic accident couldn’t possibly happen to them (or their parents). When I was a teenager, I was fortunate enough to use a seat belt (even though it wasn’t common), or I wouldn’t be writing this blog post.

The whole stretched-whatever debate is really a question of risk management: balancing convenience (in Mark’s case, the management burden) against the inevitable crash. The two real questions to consider are “How often does that happen?” and “What happens after the failure?” (or “How much will a failure cost me?”).
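To make those two questions concrete, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the failure rate and cost are made-up numbers, and failures are assumed to arrive independently at a constant rate (a Poisson model), which real networks may not follow.

```python
import math


def expected_annual_loss(failures_per_year: float, cost_per_failure: float) -> float:
    """'How often does that happen?' times 'How much will a failure cost me?'"""
    return failures_per_year * cost_per_failure


def prob_at_least_one_failure(failures_per_year: float, years: float) -> float:
    """Chance of seeing at least one failure within `years`.

    Assumes failures are independent and arrive at a constant rate
    (Poisson process); this is a modelling assumption, not measured data.
    """
    return 1.0 - math.exp(-failures_per_year * years)


if __name__ == "__main__":
    rate = 0.1        # hypothetical: one major failure every ten years
    cost = 500_000.0  # hypothetical: cost of a single failure

    print(f"Expected annual loss: {expected_annual_loss(rate, cost):.0f}")
    for years in (1, 5, 10, 30):
        p = prob_at_least_one_failure(rate, years)
        print(f"P(at least one failure in {years:>2} years) = {p:.2f}")
```

Even with a small yearly rate, the probability of at least one failure keeps climbing toward 1 as the time horizon grows, which is the "you were lucky so far" argument in numbers. The hard part is that nobody has good values to plug in for the rate.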

Unfortunately, many people promoting the next-big-thingy don’t consider the risks involved and never go through the exercise of identifying all possible failure scenarios and their consequences. All I’m trying to do is to point out that there’s non-zero risk and that the consequences could be fatal.

However, once you’ve gone through the above exercise and understand all the implications of what you’re doing, go ahead and choose the option that makes the most sense to you; we’ll explore quite a few of them during the design sessions in the Building Next-Generation Data Center online course.

8 comments:

  1. I guess the problem is we don't have any data about how frequently a modern stretched L2 design causes catastrophic failure. We all have an anecdote or two, mostly from older designs, but of course the plural of anecdote is not data.

    If there is clear business / technical benefit on one side versus what essentially amounts to FUD on the other side, then you can see why stretched L2 continues.
    Replies
    1. I'm confused by the following statement:

      "If there is clear business / technical benefit on one side versus what essentially amounts to FUD on the other side, then you can see why stretched L2 continues"

      This seems to indicate that you are saying the continued existence of L2 DCI is predicated on clear business and technical benefits. You are also stating that the other side is playing a shady game of FUD. Yet the very sentence above that one states that everyone has at least 1 story of a catastrophic failure in L2 DCI. However you are also saying that everyone having a known incident when a L2 DCI failed isn't enough data?

      Can we at least agree that should a L2 DCI fail it doesn't bode well for the network in question?

      All Ivan is trying to say is that "there’s non-zero risk and that the consequences could be fatal."

      Isn't it our job as an architect/designer/engineer to be aware of these issues before they happen? Shouldn't we be thinking about mitigation techniques at the architecture, design and implementation levels to prevent that failure from becoming catastrophic up to and including not using L2 DCI? Shouldn't we inform our counterparts in the business and application areas and help them understand these risks so that as a company we can make the right decision?

      Or is all that just more FUD?
    2. There's definitely non-zero risk of L2 DCI. And yes I'm saying that everyone having a known incident when a L2 DCI failed isn't enough data.

      Everyone knows an airplane has crashed, but because we collect actual data, we know air travel is incredibly safe.

      We have no data on L2 DCI failure, so we don't really know the relative level of safety - hence FUD.

      We do know there are applications that benefit from stretched L2 design. Should those applications be re-written? In a perfect world yes, but that costs money to do so.

      You definitely should inform your counterparts in the business and application areas of the risks of stretched L2 - my point is that no one has any idea of the risks of stretched L2. It wouldn't be prudent to tell your business that no one should fly on a plane because the plane could crash right?
    3. Dear Anonymous,

      Isn't it amazing how whatever you believe in makes perfect sense, whereas evidence presented by the other side is FUD? How about looking up confirmation bias?

      As for relative level of safety and airplane crashes - I would love to see the networking industry as heavily focused on safety and being monitored and regulated as much as the airline industry. This would definitely stop most of the craziness promoted by startups and major vendors, and eventually result in well-behaving networks.

      Today all we have is anecdata (apart from post-mortems from web-scale companies) because nobody wants to go public with a statement like "I was stupid enough to risk my network believing a vendor whitepaper".
    4. Anonymous - I think I understand what you are saying now: the type of risk you are talking about is how often a technology fails, and we have no concrete evidence to decide either way. However, I'm not aware of any data about any type of DCI and its failure rate. Looking further into your analogy, the data tells us that flying is safe, but every time we get on a plane we are told what to do in the event the plane were to actually crash. A better question: should pilots not be trained to handle an airplane that has malfunctioned, since they don't crash that often?
    5. Nope, a better question would be "Should airplane engineers not think about possible failures when they design an airplane, as they don't crash that often and therefore don't fail that often?"
      Substitute "airplane" with "network"...

      Regards
      Christoph
    6. Thanks Christoph that's exactly what I was trying to say.
  2. Sure airline engineers should think about possible failure scenarios - but to torture the analogy the solution can't be "well then don't fly".

    Likewise we should consider the possible failure scenarios associated with stretched L2, but the answer shouldn't simply be "well then don't stretch L2"

    And just to be clear Ivan, there's a shocking lack of data on both sides of the equation. I know many customers stretching L2 in some capacity or another and achieving technical and business benefit by doing it. I know a handful of customers who have been burned by a stretched L2 design. I can't usefully quantify either of those sides of the equation.

    I think it's imprudent to suggest that stretched L2 is always a house of cards with no merit (the same point I think Mark Baker was making), and especially mistaken to frame up that suggestion as a result of "exposure to a large dataset". Us vendors aren't all in the business of causing massive customer data center outages to sell a small amount of capability like L2 extension.

    Anyways - long time blog reader (and webinar attender), appreciate all your work and effort. Basically I think we're saying the same thing anyways when I read your last paragraph - think through the pros and cons and do what makes sense for the business you're supporting. There's no factual basis for religion either for or against L2 extension.