Know Thy Environment Before Redesigning It

A while ago I had an interesting consulting engagement: a multinational organization wanted to migrate from a global Carrier Ethernet VPN (with routers at the edges) to MPLS/VPN.

While that sounds like the right thing to do (after all, L3 must be better than L2, right?), in this particular case they wanted to combine the provider VPN with an Internet-based IPsec VPN… and doing that in parallel with MPLS/VPN quickly becomes an interesting exercise in “how convoluted can I make my design before I give up and migrate to BGP”.

As we analyzed their options and potential designs, I became convinced that it makes no sense to give the keys to the kingdom (core routing protocols) to a third party… but still wondered whether they were dealing with a particularly bad service provider (in which case switching the provider would make sense).

They mentioned frequent outages, so I tried to put that claim in perspective. We eventually figured out that they experienced roughly one link failure (detected as routing protocol adjacency loss) per week… in a network with approximately 100 sites. I know my conclusions would probably make Rachel Traylor extremely upset, but using a rule of thumb I converted that into “an individual link fails on average once every 100 weeks… or roughly once in two years”. Not exactly stellar performance, but not catastrophic either.
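For those who want to check my math, here’s the back-of-the-envelope conversion as a few lines of Python. The one-link-per-site assumption is mine, not the customer’s; adjust it if your sites are dual-homed.

```python
# Back-of-the-envelope conversion from network-wide to per-link failure rate.
# The failure rate and site count come from the article; links_per_site is an
# assumption (a single provider link per site).

failures_per_week = 1      # observed: ~1 adjacency loss per week network-wide
sites = 100                # ~100 sites
links_per_site = 1         # assumption: one Carrier Ethernet link per site

total_links = sites * links_per_site
weeks_between_failures_per_link = total_links / failures_per_week

print(f"Mean time between failures per link: "
      f"{weeks_between_failures_per_link:.0f} weeks "
      f"(~{weeks_between_failures_per_link / 52:.1f} years)")
# -> ~100 weeks, or roughly two years
```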

Figuring out that links don’t fail that often was interesting, but they could still be dealing with gray failures, so I asked them to deploy BFD and track BFD errors over a period of a few weeks. End result: tons of BFD errors, so maybe the Service Provider was the root cause of their problems.
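If you want to run the same exercise, counting session flaps per week is a few lines of Python. The event timestamps below are made up; in practice you’d extract them from syslog or whatever your network management system exports.

```python
from collections import Counter
from datetime import datetime

# Hypothetical BFD session-down timestamps (replace with data pulled from
# syslog or your NMS).
bfd_down_events = [
    datetime(2024, 3, 4, 9, 15),
    datetime(2024, 3, 4, 14, 2),
    datetime(2024, 3, 7, 11, 40),
    datetime(2024, 3, 12, 8, 5),
]

# Count events per ISO calendar week to see the trend over the observation period.
per_week = Counter(ts.isocalendar()[:2] for ts in bfd_down_events)
for (year, week), count in sorted(per_week.items()):
    print(f"{year}-W{week:02d}: {count} BFD session flaps")
```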

Another quick check: BFD timers. They had set really aggressive timers, and it seemed that BFD packets were getting stuck behind large bursts of user traffic in output queues of Service Provider switches. After increasing the BFD timers to 300 msec the BFD errors disappeared almost completely, proving that (A) the links were pretty reliable, but (B) they were also experiencing periods of significant congestion.
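To put the timer change in perspective: BFD declares a session down after roughly interval × multiplier without received packets. The sketch below assumes the usual multiplier of 3, and uses 50 msec as a stand-in for their original “aggressive” setting (the actual value wasn’t that important); it shows how little headroom a congested queue gets before BFD reports a false failure.

```python
# BFD failure detection time is (roughly) the negotiated interval multiplied
# by the detection multiplier. Values below are illustrative: 50 ms stands in
# for an "aggressive" setting, 300 ms is what the customer ended up with.

def bfd_detection_time_ms(interval_ms: int, multiplier: int = 3) -> int:
    """Milliseconds without BFD packets before the session is declared down."""
    return interval_ms * multiplier

for interval in (50, 300):
    print(f"{interval} ms interval x 3 -> session declared down after "
          f"{bfd_detection_time_ms(interval)} ms without BFD packets")

# Any burst of user traffic that delays BFD packets for longer than the
# detection time produces a false session-down event -- exactly what the
# aggressive timers were generating on congested provider links.
```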

In the end, the customer made no significant changes apart from minor cleanups of their core routing configuration… but at least they understood the network behavior much better than they did in the past, and had data to back up their decisions.

In case you want to know more

I did a webinar describing various VPN services from the architecture and technology perspectives, and if you need a second opinion about your chosen design or service, we might be able to help you.

4 comments:

  1. What business needs did they have (convergence etc.)?
    Replies
    1. Here's something to consider: assuming one outage every two years (per site), what kind of application would you have to run to care about the difference between 50 msec convergence and 5 sec convergence?
    2. Mission Critical Voice is the app. See Public Safety requirements: 1.5 sec total reconvergence based on a 300 ms BFD timer (3 packets lost) plus some time for IGP reconvergence.
  2. So according to your rhetorical question, there wasn't a business need for subsecond convergence. Either way, for subsecond convergence you need more than just fast failure detection. Btw, you can't detect gray failures with BFD; it only checks forwarding (data plane).