Lack of Fast Convergence in SD-WAN Products

Thursday, March 1, 2018 09:21 +0100

One of my readers sent me this question:

I'm in the process of researching SD-WAN solutions and have hit upon what I believe is a consistent deficiency across most of the current SD-WAN/SDx offerings. The standard "best practice" seems to be 60/180 BGP timers between the SD-WAN hub and the network core or WAN edge.

Needless to say, he wasn’t able to find BFD in these products either.

Does that matter? My reader thinks it does:

When we consider that many companies will use these SD-WAN solutions to transport voice and real-time traffic, why is there a lack of focus on mechanisms which enable fast convergence at critical aggregation points?

He definitely has a point, even without voice and real-time traffic requirements. Waiting three minutes to figure out that an adjacent box is down definitely feels like a call from the ’90s.
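To put rough numbers on that: with BGP alone, a silent neighbor is declared dead only when the hold timer expires, while BFD declares the session down after a configured number of consecutive missed packets. The following is a minimal sketch of the arithmetic. The 60/180 timers come from the reader's question; the BFD values (300 ms interval, multiplier 3) are illustrative assumptions, not taken from any product.

```python
def bgp_detection_seconds(hold_time_s: float) -> float:
    """Worst-case time to notice a silent BGP neighbor: the hold timer."""
    return hold_time_s


def bfd_detection_seconds(interval_ms: float, multiplier: int) -> float:
    """Approximate BFD detection time: interval times detect multiplier."""
    return interval_ms * multiplier / 1000.0


if __name__ == "__main__":
    # 60/180 keepalive/hold timers quoted as the "best practice".
    print(f"BGP 60/180:     {bgp_detection_seconds(180):.1f} s")
    # Aggressively tuned BGP timers (1 s keepalive, 3 s hold).
    print(f"BGP 1/3:        {bgp_detection_seconds(3):.1f} s")
    # Assumed BFD settings, for comparison only.
    print(f"BFD 300 ms x 3: {bfd_detection_seconds(300, 3):.1f} s")
```

Even with aggressively tuned BGP timers, the detection time stays in the range of seconds, while BFD with these assumed settings lands below one second.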

Or maybe we’re missing something. The whole idea of SD-WAN was to take over the core WAN routing in your network. Tom Hollingsworth has an interesting story in one of his blog posts:

We turned off OSPF and now have a /16 route for the internal network and a default route to the Internet where a lot of our workloads were moved into the cloud.

We’ve seen similar claims in the past. Remember MPLS/VPN and the nirvana you’ll reach once the Service Provider takes over your core routing? Yeah, it didn’t turn out exactly that way, did it?

Anyway, assuming we give a proprietary black-box system total control of our core network, should we care about the routing protocols at the edge of the black haze?

If you’re deploying a single SD-WAN appliance per site, and it controls all uplinks, the answer is obviously NO. There’s a single exit point (SD-WAN appliance) and it’s either working or not.

If you’re deploying redundant SD-WAN appliances on a site that needs higher availability and these appliances work as a cluster, control all the uplinks, and use layer-2 tricks (usually VRRP) to give the impression of a single next-hop router to the site network, the answer is still NO. We’re doing the same thing with static default routes pointing to a shared IP address of a firewall cluster today. I’m not saying it’s a good idea, but some people would probably claim it’s the best practice.

Unfortunately, some SD-WAN vendors believe they can use the same approach no matter what. Their appliances avoid routing protocols like the black plague and use dirty tricks like VRRP between the SD-WAN appliance and the on-site router, combined with static default routing, to get the appearance of primary/backup behavior. How many times do we have to repeat the same mistakes?

Site using a router in parallel with an SD-WAN appliance

However, if you do run a routing protocol between SD-WAN appliances and other routers on your site for whatever reason (failover, alternate uplinks…), then I guess we deserve a decent implementation of said routing protocol. It’s not like you couldn’t get a routing protocol suite these days, either as an open-source project or as a commercial product.

Site using a routing protocol to implement internal layer-3 redundancy

Or maybe the SD-WAN vendors don’t have to care because they’re never asked about such mundane details. There must be plenty of customers out there believing in the magic powers of PowerPoint…

Want to know more?

I covered the basics of SD-WAN in Choose the Optimal VPN Service and SDN Use Cases webinars.

You might also want to watch the free SDWAN 101 and Cisco SD-WAN Foundations and Design Aspects webinars.

24 comments:

Daniel Dib 01 March 2018 13:12

Hi Ivan,

I can only answer for IWAN and Viptela as those are the solutions I have experience with. Those are the two with the strongest routing stack naturally since Cisco has done routing for some time now... And Viptela was founded by ex Cisco employees with strong background in routing and large scale networking.

Viptela runs BFD on the overlay while IWAN doesn't. However, you're not dependent on BGP to fail over traffic when there is a failure. With IWAN there are channels per transport, destination site, and DSCP that measure performance. If a channel becomes unreachable, that transport is not used any longer. This normally takes 4s but can be tuned to 1s if needed, for example for voice traffic. "Soft" failures such as packet loss and increased latency are normally acted upon after about 30s.

I know some of the SD-WAN vendors use another approach to make sure voice packets arrive. They send the packets out both transports with some form of sequence number and then put the stream back together at the other end. Deja vu to LFI? At least there is some resemblance. In theory this doesn't sound like the best idea, but I've heard of people having positive results with it, especially on "wonky" circuits where you expect latency, packet loss, etc. (think satellite links).
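A rough sketch of the receiving side of such a packet-duplication scheme could look like this. It is purely illustrative: the `(sequence number, payload)` packet format and the `deduplicate` helper are hypothetical, and no vendor's actual implementation is implied.

```python
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical packet format: (sequence number, payload).
Packet = Tuple[int, bytes]


def deduplicate(arrivals: Iterable[Packet]) -> Iterator[Packet]:
    """Yield the first copy of each sequence number, drop later copies."""
    seen: Set[int] = set()
    for seq, payload in arrivals:
        if seq in seen:
            continue  # this packet already arrived over the other transport
        seen.add(seq)
        yield seq, payload


if __name__ == "__main__":
    # Copies from both transports interleaved in arrival order; one
    # transport lost packet 2, so only a single copy of it shows up.
    arrivals = [(1, b"a"), (1, b"a"), (3, b"c"), (2, b"b"), (3, b"c")]
    print(list(deduplicate(arrivals)))
```

The sketch delivers packets in arrival order and remembers every sequence number it has seen; a real implementation would need a bounded window and probably a small reordering buffer.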

Any SD-WAN vendor should have a good routing stack, though. Some of the companies came from the WAN acceleration market, so they have no strong background in routing.

Good post as always!

Replies

Ivan Pepelnjak 01 March 2018 14:15

There's convergence _WITHIN_ the SD-WAN cloud and convergence at the _EDGE_ of the SD-WAN cloud (with external devices). This blog post discusses the latter.

And yes, I'm pretty sure Viptela has something at the edge, but as they're still hiding their documentation I don't care much about what they're doing.

IWAN definitely does have decent routing protocols at the edge (after all, it's just Cisco IOS), but is it really SD-WAN?

Daniel Dib 01 March 2018 17:40

I see. I thought your reader was discussing BGP for the overlay.

Defining SD-WAN is probably as meaningful as defining SDN. IWAN does have more "intelligence" than traditional routing protocols. Whether it ticks enough check boxes to be SD-WAN is up to each person, IMO.

Syl 01 March 2018 20:18

Just as info, Viptela supports OSPF and BGP at the edge.

Unknown 01 May 2018 14:18

Not exactly sure since when, but SD-WAN/Viptela documentation can be reached on https://docs.viptela.com without any registration.
The relevant section: https://docs.viptela.com/Product_Documentation/Software_Features/Release_18.1/03Routing/03Configuring_Unicast_Overlay_Routing

PacketsLoveCoffee 01 March 2018 15:15

Just FYI, Cisco SD-WAN does use BFD and supports routing protocols for HA. Please make sure you don't paint with such a broad brush and give credit where credit is due.

Replies

Ivan Pepelnjak 01 March 2018 15:18

Dear Unknown,

As I already mentioned, Cisco SD-WAN (assuming you mean Viptela) has no public documentation, so I can't consider what it may or may not do. The moment their documentation is made public you're most welcome to add another comment telling me so.

PacketsLoveCoffee 01 March 2018 15:53

If I'm reading your blog and looking for insight into SD-WAN, I would expect you to have some real world experience with the technology and the limitations of the top 2-3 solutions before writing a blog post on the subject. Viptela is one of the lead SD-WAN vendors in the market. To ignore the distinct advantages of that solution seems strange...

Ivan Pepelnjak 01 March 2018 15:57

I made a very conscious decision years ago to ignore anyone who's not capable of making their documentation publicly available, and I don't care what the magic quadrants claim.

You don't like it? It's really easy to solve: publish the docs.

Salman Naqvi 01 March 2018 19:52

"I made a very conscious decision years ago to ignore anyone who's not capable of making their documentation publicly available"

Ivan, I love this policy. Vendors who choose to hide their documentation are really annoying. They do a disservice to their own products by doing so.

Iain 02 March 2018 19:13

"Unknown", please double-check your information:
1.) iWAN is PfR with lipstick (not SD-WAN)
2.) Viptela supports BFD within the SD-WAN cloud but does NOT support BFD on the LAN facing side (edge of the SD-WAN cloud). Big difference.
3.) Since Viptela is spending a bunch of their dev resources porting their product to non-x86 legacy ISR platforms it may be quite some time before this is resolved.

PacketsLoveCoffee 02 March 2018 19:48

1. iWAN is absolutely the most feature rich "SD-WAN", it just doesn't have a decent GUI or automation. DMVPN, PfR, WAAS are all very solid technologies. Implementation can be CCIE level, but it works as advertised.

2. Agreed. Fast convergence on the lan side is achieved through native routing protocol timers. BFD is used on the overlay to identify brownout and blackout conditions.

3. Haters gonna hate :) https://www.youtube.com/watch?v=nfWlot6h_JM

Iain 02 March 2018 21:10

@PacketsLoveCoffee

First of all, love the username!

"1. iWAN is absolutely the most feature rich "SD-WAN", it just doesn't have a decent GUI or automation. DMVPN, PfR, WAAS are all very solid technologies. Implementation can be CCIE level, but it works as advertised."

My two cents is that IWAN "basic" implementation IS inherently CCIE level, but most concerning of all ongoing operation is TAC level dependent - unsustainable levels of complexity.

"2. Agreed. Fast convergence on the lan side is achieved through native routing protocol timers. BFD is used on the overlay to identify brownout and blackout conditions."

Probably worth mentioning that this is currently an issue for the other vendors as well (Meraki, VeloCloud, Silver Peak, etc)

Regarding 3., Viptela looks cool (certainly not a hater), but porting Viptela to ISR seems like an expensive way to pacify existing IWAN customers. Why would anyone want a greenfield Viptela on ISR deployment if x86 is 1/3 or 1/4 the cost? What if Cisco instead atoned for IWAN by offering current IWAN customers free/cheap x86 branch hardware while Viptela engineering resources were laser focused on improving their SD-WAN capabilities on native x86?

PacketsLoveCoffee 02 March 2018 21:27

You would be surprised how many folks cannot get Ethernet service without spending a ton of money for build-out. Legacy connectivity and interface modularity are still a requirement. You would also be surprised how many people are still reliant on voice services on the router.

I believe all of the excitement around SD-WAN made people forget the reasons for the INTEGRATED Services Router. Porting Viptela capabilities to IOS-XE and the existing ISR 4K user base is a no brainer. Gotta bring the Cisco family that has invested in the 4K hardware along for the ride!

PacketsLoveCoffee 02 March 2018 22:38

"My two cents is that IWAN "basic" implementation IS inherently CCIE level, but most concerning of all ongoing operation is TAC level dependent - unsustainable levels of complexity. "

I ran a DMVPN network for half a decade. I think the complexity is overblown and is mostly fear and FUD. But let's move on. You want a GUI, you got it.

Replies

Iain 02 March 2018 23:00

DMVPN ≠ IWAN. Don't forget all the other components (especially PfR). Opinions aside, IWAN wasn't a tenable solution and this is why Cisco is essentially replacing it with Viptela.

PacketsLoveCoffee 02 March 2018 23:09

It really wasn't as bad as it is painted. If you have a decent QOS policy, PfRv3 is pretty straightforward. People just hate change and this did not move the needle far enough, IMHO.

Romain Jourdan 03 March 2018 23:06

Hi Ivan,
I am working for Riverbed Technology. Our documentation is public and accessible on our support website. SDWAN user guide is here http://rvbd.ly/2tlpIXV and deployment guide is here: http://rvbd.ly/2oEA8Nm
Your comments are valid. Since we are updating our documentation, we will take your input into account. Thanks for helping us to become better!

Unknown 04 March 2018 16:08

FYI, Silver Peak's training is also online and it's free. https://training.silver-peak.com/

One of the things you will learn is that the HA bonding policy is capable of sub-second failover and, because of Forward Error Correction, will not drop a single packet. So your voice calls will remain crisp and clear.

Replies

Iain 05 March 2018 14:13

However, Silver Peak does not support BFD on the LAN side where the Edge Connect is peering with the WAN edge (eBGP). This deficiency means that even with tight timers (1/3?) in certain failure or upgrade scenarios there will be a noticeable outage.

PacketRancher 05 March 2018 02:34

VMware NSX SD-WAN/VeloCloud embeds probe packets within overlay paths sent every 100ms and will select an alternate path if required in 300-500ms. It can leverage tight timers on the LAN side with BGP and OSPF if required. Also uses FEC and NACK to overcome lossy paths. HA failover and clustering are options between CPE. We have dozens of customers with hundreds of sites deployed and I can definitively state, we have designed much more fault tolerant and faster converging designs than we were ever able to with traditional routers and routing protocols.
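Taken at face value, a 100 ms probe interval and a 300-500 ms switchover correspond to roughly three to five consecutive missed probes. Here is a minimal sketch of that kind of detection logic, assuming a simple consecutive-loss threshold; the function names and the threshold rule are hypothetical and not a description of any vendor's algorithm.

```python
from typing import Optional, Sequence


def detection_time_ms(probe_interval_ms: int, missed_probes: int) -> int:
    """Time needed to see `missed_probes` consecutive probe losses."""
    return probe_interval_ms * missed_probes


def declared_down_at(probe_results: Sequence[bool],
                     threshold: int) -> Optional[int]:
    """Return the index of the probe at which the path is declared down,
    or None if the threshold of consecutive losses is never reached."""
    missed = 0
    for index, received in enumerate(probe_results):
        # Any received probe resets the consecutive-loss counter.
        missed = 0 if received else missed + 1
        if missed >= threshold:
            return index
    return None


if __name__ == "__main__":
    print(detection_time_ms(100, 3), "ms")
    print(detection_time_ms(100, 5), "ms")
    print(declared_down_at([True, True, False, False, False, True], 3))
```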

Publicly publishing technical docs would be great but in the extremely competitive SD-WAN market, all of the players are keeping their "secret sauce" hidden away.

Replies

Iain 05 March 2018 14:18

However, VeloCloud does not support BFD on the LAN side where the appliance is peering with the WAN edge (eBGP). This deficiency means that even with tight timers (1/3?) in certain failure or upgrade scenarios there will be a noticeable outage.

Regarding the hidden documentation, do you really believe that a competitor is incapable of registering with a personal/fake email address and gaining access to all the docs? If this isn't an effective solution for preventing competitors from snooping, then stop using it! It does more harm for potential customers than anything else.

Ivan Pepelnjak 05 March 2018 14:46

@Iain: the "we can't publish documentation in extremely competitive market" is total bullshit as illustrated by tons of public documentation available from established vendors in other competitive markets... or by Riverbed and Silver Peak publishing their SD-WAN documentation.

Furthermore, configuration and design guides leak no secret sauce, but document the shortcomings of the platform - which is the true reason these startups don't want to publish what they're doing.

Finally, I've seen some so-called documentation from other startups (not from any vendor mentioned on this page) and totally understand why nobody would want to make that public :))

A Network Artist 07 March 2018 07:11

Did they tell you Viptela couldn't even do basic Static NAT :)

HTH...
Evil CCIE
