Fast Failover in SD-WAN Networks

It’s amazing how quickly you get “must have feature Y or it should not be called X” comments coming from vendor engineers the moment you mention something vaguely-defined like SD-WAN.

Here are just two of the claims I got in response to the “BGP with IP-SLA is SD-WAN” trolling I started on LinkedIn based on this blog post:

Key missing features [of your solution]:

  • real time circuit failover (100ms is not real-time)
  • traffic steering (again, 100ms is not real-time)

Let’s get the facts straight: it seems Cisco IOS evaluates route-map statements that reference track objects during the periodic BGP table scan, so the failover time is on the order of 30 seconds plus however long it takes IP SLA to detect the decreased link quality.
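To make the estimate above concrete, here is a minimal back-of-the-envelope sketch. It assumes the detection time is simply the probe interval multiplied by the number of consecutive failed probes needed to trip the track object, and that the failure lands at a random point within the BGP scan cycle. The function name and all the numbers are illustrative assumptions, not measured values; the 60-second scan interval was picked only because it yields an average wait of 30 seconds, in line with the figure quoted above.

```python
def failover_time(probe_interval_s, failed_probes, scan_interval_s):
    """Return (average, worst-case) failover time in seconds.

    Simplified model (an assumption, not vendor-documented behavior):
      detection  = probe interval * consecutive failed probes
      scan wait  = anywhere between 0 and one full scan interval
    """
    detection = probe_interval_s * failed_probes
    average = detection + scan_interval_s / 2   # failure lands mid-cycle on average
    worst = detection + scan_interval_s         # failure lands right after a scan
    return average, worst


# Illustrative values only: 1-second probes, 3 misses, 60-second scan cycle
avg, worst = failover_time(probe_interval_s=1.0, failed_probes=3, scan_interval_s=60.0)
print(f"average failover: {avg:.0f} s, worst case: {worst:.0f} s")
```

The point of the sketch is that the scan wait dominates: tuning the probes aggressively barely changes the total as long as the route-map is re-evaluated only once per scan cycle.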

Now for the crux of this blog post: does it matter?

It’s obvious that 100 msec failover time looks way better than 30 second failover time in any PowerPoint slide deck (unless you’re dealing with someone with the capability to distort the local reality field), so we should always buy products with better failover times, right? Of course… assuming that’s the most critical parameter your business has to deal with.

If you’re a hospital doing remote surgery, or a drone operator flying a combat mission, or you’re participating in the final round of Fortnite World Cup qualifications, then 100 msec failover time matters a lot… but then I hope you use something better than IPsec-over-Internet for your WAN links (guess not if you’re playing Fortnite … but now you’ve been warned ;).

If you’re running a non-real-time business like most of us do, then you probably don’t care too much - after all, your remote users probably experience higher failover times when switching between cell phone towers (or maybe I’m just unlucky).

The next data point to consider is the frequency of failover events. If you have links with extremely inconsistent and quickly-deteriorating quality, then it’s crucial to have fast detection and quick failover. If you’re dealing with “typical” links that provide good-enough quality (even for voice traffic) most of the time, you might not care whether the once-a-year failover happens in 100 msec or in 30 seconds.
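A quick way to put numbers on that trade-off is to multiply the failover time by the number of failover events per year. The sketch below does just that; the event frequencies are made-up examples, and the helper name is hypothetical.

```python
def annual_disruption_seconds(failovers_per_year, failover_time_s):
    """Total time per year spent waiting for failover, in seconds."""
    return failovers_per_year * failover_time_s


# Hypothetical link profiles: yearly, monthly, and daily failover events
for events in (1, 12, 365):
    fast = annual_disruption_seconds(events, 0.1)    # 100 msec failover
    slow = annual_disruption_seconds(events, 30.0)   # 30 second failover
    print(f"{events:>3} events/year: {fast:7.1f} s vs {slow:8.1f} s of disruption")
```

With one event per year the difference is about 30 seconds annually; with a link that flaps daily it grows to roughly three hours, which is where fast failover starts to look like a real requirement rather than a slide-deck number.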

As always, you should NEVER base your decisions on $vendor selling points regardless of how compelling or vital they seem. Start with “what problem am I trying to solve”, continue with “what do I REALLY need to solve that problem” (aka “what would be a good-enough solution”) as opposed to “let’s merge all the features any vendor ever mentioned into our RFP just to be on the safe side”, and select a vendor that solves your challenge in the simplest and cleanest way (and don’t forget to look behind the covers to figure out the hidden complexity).

I covered a few of these considerations in the Business Aspects of Networking Technologies webinar and if you’re new to SD-WAN, check out our free SD-WAN Overview webinar.


  1. 30 seconds is acceptable as a network level failover but it is noticeable. I have been doing SDWAN installations in co-working spaces where there is a high reliance on interactive collaboration tools. Any failed sessions are noticed and reported.
    1. "30 seconds is acceptable as a network level failover but it is noticeable." << absolutely.

      But does the business care enough if that happens once a year (or once a month) that they'd be willing to invest in a more expensive solution? Or would you be willing to support a more complex solution (which could break in many other ways) just to get around this?

      Obviously there's no right answer, I'm just saying we should always consider the true business impact (and asking "can I get more money to get this done" is a good indicator of whether the business really cares).
    2. In our neck of the woods, fat finger and fat ass problems happen on a more regular basis than yearly or monthly!!!!
      The first answer to "will I spend more money" is always no. The next step is that when it happens, can I bully the ISP to fix it immediately? The next step, when they discover the ISP small print and that there really isn't the SLA that they thought they had, is to put in the solution because by now they have worked out how much money is being lost.
      The fundamental problem is they don't know what it costs until it is experienced. Secondly, no amount of subjective argument is going to fix the problem.
      I haven't met the business that doesn't care about an outage. Would love the examples.