Lack of Fast Convergence in SD-WAN Products
One of my readers sent me this question:
I'm in the process of researching SD-WAN solutions and have hit upon what I believe is a consistent deficiency across most of the current SD-WAN/SDx offerings. The standard "best practice" seems to be 60/180 BGP timers between the SD-WAN hub and the network core or WAN edge.
Needless to say, he wasn’t able to find BFD in these products either.
Does that matter? My reader thinks it does:
When we consider that many companies will use these SD-WAN solutions to transport voice and real-time traffic, why is there a lack of focus on mechanisms which enable fast convergence at critical aggregation points?
He definitely has a point, even without voice and real-time traffic requirements. Waiting three minutes (the 180-second hold timer) to figure out that an adjacent box is down feels like a call from the ’90s.
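To put rough numbers on that gap, here is a minimal Python sketch comparing worst-case failure detection with the 60/180 BGP timers against BFD. The BFD transmit interval and multiplier (300 ms and 3) are illustrative assumptions, not values taken from any SD-WAN product, and the function names are made up for this example.

```python
def bgp_worst_case_detection(hold_time_s: float) -> float:
    """Worst case without BFD: the peer dies right after its last keepalive
    arrived, so the session is torn down only when the hold timer expires."""
    return hold_time_s


def bfd_detection_time(tx_interval_ms: float, multiplier: int) -> float:
    """BFD declares the neighbor down after `multiplier` consecutive
    missed packets. Returns the detection time in seconds."""
    return tx_interval_ms * multiplier / 1000.0


bgp_s = bgp_worst_case_detection(180)   # 60/180 timers from the reader's question
bfd_s = bfd_detection_time(300, 3)      # assumed BFD settings (hypothetical)

print(f"BGP hold timer: {bgp_s:.0f} s, BFD: {bfd_s:.1f} s")
```

With these assumed BFD settings the failure is detected in under a second instead of three minutes.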
Or maybe we’re missing something. The whole idea of SD-WAN was to take over the core WAN routing in your network. Tom Hollingsworth has an interesting story in one of his blog posts:
We turned off OSPF and now have a /16 route for the internal network and a default route to the Internet where a lot of our workloads were moved into the cloud.
We’ve seen similar claims in the past. Remember MPLS/VPN and the nirvana you’ll reach once the Service Provider takes over your core routing? Yeah, it didn’t turn out exactly that way, did it?
Anyway, assuming we give a proprietary black-box system total control of our core network, should we care about the routing protocols at the edge of the black haze?
If you’re deploying a single SD-WAN appliance per site, and it controls all uplinks, the answer is obviously NO. There’s a single exit point (SD-WAN appliance) and it’s either working or not.
If you’re deploying redundant SD-WAN appliances on a site that needs higher availability and these appliances work as a cluster, control all the uplinks, and use layer-2 tricks (usually VRRP) to give the impression of a single next-hop router to the site network, the answer is still NO. We’re doing the same thing with static default routes pointing to a shared IP address of a firewall cluster today. I’m not saying it’s a good idea, but some people would probably claim it’s the best practice.
Unfortunately, some SD-WAN vendors believe they can use the same approach no matter what. Their appliances avoid routing protocols like the black plague and use dirty tricks like VRRP between the SD-WAN appliance and the on-site router, combined with static default routing, to get the appearance of primary/backup behavior. How many times do we have to repeat the same mistakes?
However, if you do run a routing protocol between SD-WAN appliances and other routers on your site for whatever reason (failover, alternate uplinks…), then I guess we deserve a decent implementation of said routing protocol. It’s not like you couldn’t get a routing protocol suite these days, either as an open-source project or as a commercial product.
Or maybe the SD-WAN vendors don’t have to care because they’re never asked about such mundane details. There must be plenty of customers believing in the magic powers of PowerPoint out there…
Want to know more?
I covered the basics of SD-WAN in Choose the Optimal VPN Service and SDN Use Cases webinars.
You might also want to watch the free SDWAN 101 and Cisco SD-WAN Foundations and Design Aspects webinars.
I can only answer for IWAN and Viptela as those are the solutions I have experience with. Naturally, those are the two with the strongest routing stacks, since Cisco has done routing for some time now... And Viptela was founded by ex-Cisco employees with a strong background in routing and large-scale networking.
Viptela runs BFD on the overlay while IWAN doesn't. However, you're not dependent on BGP to fail over traffic when there is a failure. With IWAN there are channels per transport, destination site and DSCP that measure performance. If a channel becomes unreachable, that transport is no longer used. This takes 4s normally but can be tuned to 1s if needed, for example for voice traffic. "Soft" failures such as packet loss and increased latency are normally acted upon after about 30s.
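A rough Python sketch of that per-channel reachability check, just to make the idea concrete. The data structure, the field names and the way "last seen" is tracked are all assumptions made for illustration; this is not how PfR is actually implemented. Only the 4s default and the 1s tuned value come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    transport: str      # e.g. "MPLS" or "INET" (hypothetical labels)
    site: str           # destination site
    dscp: int           # DSCP value carried by this channel
    last_seen: float    # time (in seconds) something was last heard on the channel


def usable_transports(channels, site, dscp, now, unreachable_after=4.0):
    """Return the transports whose channel to (site, dscp) is still reachable,
    i.e. was heard from within the last `unreachable_after` seconds."""
    return [
        ch.transport
        for ch in channels
        if ch.site == site
        and ch.dscp == dscp
        and now - ch.last_seen < unreachable_after
    ]


channels = [
    Channel("MPLS", "branch1", 46, last_seen=10.5),
    Channel("INET", "branch1", 46, last_seen=7.5),
]

default_timer = usable_transports(channels, "branch1", 46, now=11.0)
voice_timer = usable_transports(channels, "branch1", 46, now=11.0, unreachable_after=1.0)
print(default_timer, voice_timer)
```

In this made-up example the INET channel has been silent for 3.5s, so it is still considered usable with the 4s timer but is dropped with the 1s timer.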
I know some of the SD-WAN vendors use another approach to make sure voice packets arrive. They send the packets out both transports with some form of sequence number and then put the stream back together at the other end. Deja vu to LFI? At least there is some resemblance. In theory this doesn't sound like the best idea, but I've heard of people having positive results with it, especially on "wonky" circuits where you expect latency, packet loss etc. (think satellite links).
Any SD-WAN vendor should have a good routing stack, though. Some of the companies came from the WAN acceleration market, so they have no strong background in routing.
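A minimal Python sketch of the duplicate-and-reassemble idea, assuming every packet is sent over both transports with a shared sequence number. The class name and the strictly in-order delivery are my own assumptions; I have not seen any vendor's actual implementation.

```python
class DedupReceiver:
    """Receives copies of one packet stream arriving over two transports and
    delivers each sequence number exactly once, in order."""

    def __init__(self):
        self.next_seq = 0    # next sequence number to deliver
        self.pending = {}    # out-of-order packets waiting for a gap to be filled

    def receive(self, seq, payload):
        """Accept one packet copy; return the payloads that became deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []        # copy already seen via the other transport: drop it
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered


rx = DedupReceiver()
# (transport, seq, payload) in arrival order; packet 1 is lost on transport A
arrivals = [
    ("A", 0, "p0"), ("B", 0, "p0"),
    ("A", 2, "p2"), ("B", 1, "p1"), ("B", 2, "p2"),
]
stream = []
for _transport, seq, payload in arrivals:
    stream.extend(rx.receive(seq, payload))
print(stream)
```

The sketch waits forever for a packet lost on both transports. A real-time implementation would need a short timeout to skip such gaps, which is left out here.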
Good post as always!
And yes, I'm pretty sure Viptela has something at the edge, but as they're still hiding their documentation I don't care much about what they're doing.
IWAN definitely does have decent routing protocols at the edge (after all, it's just Cisco IOS), but is it really SD-WAN?
Defining SD-WAN is probably as meaningful as defining SDN. IWAN does have more "intelligence" than traditional routing protocols. Whether it ticks one's check boxes to be SD-WAN is up to each person imo.
The relevant section: https://docs.viptela.com/Product_Documentation/Software_Features/Release_18.1/03Routing/03Configuring_Unicast_Overlay_Routing
As I already mentioned, Cisco SD-WAN (assuming you mean Viptela) has no public documentation, so I can't consider what it may or may not do. The moment their documentation is made public you're most welcome to add another comment telling me so.
You don't like it? It's really easy to solve: publish the docs.
Ivan, I love this policy. Vendors who choose to hide their documentation are really annoying. They do a disservice to their own products by doing so.
1.) iWAN is PfR with lipstick (not SD-WAN)
2.) Viptela supports BFD within the SD-WAN cloud but does NOT support BFD on the LAN facing side (edge of the SD-WAN cloud). Big difference.
3.) Since Viptela is spending a bunch of their dev resources porting their product to non-x86 legacy ISR platforms it may be quite some time before this is resolved.
2. Agreed. Fast convergence on the lan side is achieved through native routing protocol timers. BFD is used on the overlay to identify brownout and blackout conditions.
3. Haters gonna hate :) https://www.youtube.com/watch?v=nfWlot6h_JM
First of all, love the username!
"1. iWAN is absolutely the most feature rich "SD-WAN", it just doesn't have a decent GUI or automation. DMVPN, PfR, WAAS are all very solid technologies. Implementation can be CCIE level, but it works as advertised."
My two cents: even a "basic" IWAN implementation IS inherently CCIE level but, most concerning of all, ongoing operation is TAC-level dependent. Those are unsustainable levels of complexity.
"2. Agreed. Fast convergence on the lan side is achieved through native routing protocol timers. BFD is used on the overlay to identify brownout and blackout conditions."
Probably worth mentioning that this is currently an issue for the other vendors as well (Meraki, VeloCloud, Silver Peak, etc)
Regarding 3., Viptela looks cool (certainly not a hater), but porting Viptela to ISR seems like an expensive way to pacify existing IWAN customers. Why would anyone want a greenfield Viptela on ISR deployment if x86 is 1/3 or 1/4 the cost? What if Cisco instead atoned for IWAN by offering current IWAN customers free/cheap x86 branch hardware while Viptela engineering resources were laser focused on improving their SD-WAN capabilities on native x86?
I believe all of the excitement around SD-WAN made people forget the reasons for the INTEGRATED Services Router. Porting Viptela capabilities to IOS-XE and the existing ISR 4K user base is a no brainer. Gotta bring the Cisco family that has invested in the 4K hardware along for the ride!
I ran a DMVPN network for half a decade. I think the complexity is overblown and is mostly fear and FUD. But let's move on. You want a GUI, you got it.
I am working for Riverbed Technology. Our documentation is public and accessible on our support website. SDWAN user guide is here http://rvbd.ly/2tlpIXV and deployment guide is here: http://rvbd.ly/2oEA8Nm
Your comments are valid. Since we are updating our documentation, we will take your input into account. Thanks for helping us to become better!
One of the things you will learn is that the HA bonding policy is capable of sub-second failover and, because of the Forward Error Correction, will not drop a single packet. So your voice calls will remain crisp and clear.
Publicly publishing technical docs would be great but in the extremely competitive SD-WAN market, all of the players are keeping their "secret sauce" hidden away.
Regarding the hidden documentation, do you really believe that a competitor is incapable of registering with a personal/fake email address and gaining access to all the docs? If this isn't an effective solution for preventing competitors from snooping, then stop using it! It does more harm for potential customers than anything else.
Furthermore, configuration and design guides leak no secret sauce, but document the shortcomings of the platform - which is the true reason these startups don't want to publish what they're doing.
Finally, I've seen some so-called documentation from other startups (not from any vendor mentioned on this page) and totally understand why nobody would want to make that public :))