One of my readers sent me this question:
I'm in the process of researching SD-WAN solutions and have hit upon what I believe is a consistent deficiency across most of the current SD-WAN/SDx offerings. The standard "best practice" seems to be 60/180 BGP timers between the SD-WAN hub and the network core or WAN edge.
Needless to say, he wasn’t able to find BFD in these products either.
Does that matter? My reader thinks it does:
When we consider that many companies will use these SD-WAN solutions to transport voice and real-time traffic, why is there a lack of focus on mechanisms which enable fast convergence at critical aggregation points?
He definitely has a point, even without voice and real-time traffic requirements. Waiting up to three minutes to figure out that an adjacent box is down feels like a call from the ’90s.
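To put rough numbers on that complaint, here is a minimal back-of-the-envelope sketch. It assumes keepalives are the only messages exchanged with a silent peer, so the hold timer expires somewhere between (hold time minus keepalive interval) and the full hold time after the failure. It also uses the usual BFD asynchronous-mode rule of thumb (negotiated interval times detect multiplier). The 3/9 BGP timers and the 300 ms x 3 BFD values are illustrative assumptions, not settings taken from any SD-WAN product.

```python
# Back-of-the-envelope failure detection times. All timer values below are
# illustrative assumptions, not recommendations or vendor defaults.

def bgp_detection_range(keepalive_s: float, hold_s: float) -> tuple[float, float]:
    """Return (best, worst) seconds needed to notice a silently dead BGP peer.

    The hold timer restarts on every received keepalive, so the worst case is
    a peer dying right after sending one (full hold time), and the best case
    is a peer dying just before its next keepalive was due.
    """
    return (hold_s - keepalive_s, hold_s)

def bfd_detection_seconds(interval_ms: float, multiplier: int) -> float:
    """BFD asynchronous mode: detection time = negotiated interval x multiplier."""
    return interval_ms * multiplier / 1000.0

if __name__ == "__main__":
    for label, (keepalive, hold) in {
        "BGP 60/180": (60, 180),
        "BGP 3/9 (assumed aggressive timers)": (3, 9),
    }.items():
        best, worst = bgp_detection_range(keepalive, hold)
        print(f"{label}: {best:g} to {worst:g} seconds")

    bfd = bfd_detection_seconds(interval_ms=300, multiplier=3)
    print(f"BFD 300 ms x 3 (assumed): {bfd:g} seconds")
```

Even with aggressive BGP timers the gap to BFD is close to an order of magnitude, and with 60/180 it is more than two.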
Or maybe we’re missing something. The whole idea of SD-WAN was to take over the core WAN routing in your network. Tom Hollingsworth has an interesting story in one of his blog posts:
We turned off OSPF and now have a /16 route for the internal network and a default route to the Internet where a lot of our workloads were moved into the cloud.
We’ve seen similar claims in the past. Remember MPLS/VPN and the nirvana you’ll reach once the Service Provider takes over your core routing? Yeah, it didn’t turn out exactly that way, did it?
Anyway, assuming we give a proprietary black-box system with no publicly-accessible documentation total control of our core network, should we care about the routing protocols at the edge of the black haze?
If you’re deploying a single SD-WAN appliance per site, and it controls all uplinks, the answer is obviously NO. There’s a single exit point (SD-WAN appliance) and it’s either working or not.
If you’re deploying redundant SD-WAN appliances on a site that needs higher availability and these appliances work as a cluster, control all the uplinks, and use layer-2 tricks (usually VRRP) to give the impression of a single next-hop router to the site network, the answer is still NO. We’re doing the same thing with static default routes pointing to a shared IP address of a firewall cluster today. I’m not saying it’s a good idea, but some people would probably claim it’s the best practice.
Unfortunately, some SD-WAN vendors believe they can use the same approach no matter what. Their appliances avoid routing protocols like the plague and use dirty tricks like VRRP between the SD-WAN appliance and the on-site router, combined with static default routing, to get the appearance of primary/backup behavior. How many times do we have to repeat the same mistakes?
However, if you do run a routing protocol between SD-WAN appliances and other routers on your site for whatever reason (failover, alternate uplinks…), then I guess you deserve a decent implementation of that routing protocol. It’s not like the vendors couldn’t get a routing protocol suite these days, either as an open-source project or as a commercial product.
Or maybe the SD-WAN vendors don’t have to care because they’re never asked about such mundane details. There must be plenty of customers out there who believe in the magic powers of PowerPoint…
2018-06-18: Added link to Viptela's online documentation
Want to know more?
I covered the basics of SD-WAN in the Choose the Optimal VPN Service and SDN Use Cases webinars, and we'll do an SDWAN 101 webinar in September 2018. All three webinars are included in the ipSpace.net Webinar Subscription.