Know Thy Environment Before Redesigning It

A while ago I had an interesting consulting engagement: a multinational organization wanted to migrate from a global Carrier Ethernet VPN (with routers at the edges) to MPLS/VPN.

While that sounds like the right thing to do (after all, L3 must be better than L2, right?), in this particular case they wanted to combine the provider VPN with an Internet-based IPsec VPN… and doing that in parallel with MPLS/VPN quickly becomes an interesting exercise in “how convoluted can I make my design before I give up and migrate to BGP”.

As we analyzed their options and potential designs, I became convinced that it makes no sense to give the keys to the kingdom (core routing protocols) to a third party… but still wondered whether they were dealing with a particularly bad service provider (in which case switching the provider would make sense).

They mentioned frequent outages, so I tried to put that claim in perspective. We eventually figured out that they experienced roughly one link failure (detected as routing protocol adjacency loss) per week… in a network with approximately 100 sites. I know my conclusions would probably make Rachel Traylor extremely upset, but using a rule of thumb I converted that into “an individual link fails on average once every 100 weeks… or roughly once in two years”. Not exactly stellar performance, but not catastrophic either.
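For those who want to check my math, here’s the back-of-the-envelope conversion as a few lines of Python. The one-link-per-site assumption is mine, not the customer’s; adjust it if your sites are dual-homed.

```python
# Back-of-the-envelope conversion from network-wide to per-link failure rate.
# The failure rate and site count come from the article; links_per_site is an
# assumption (a single provider link per site).

failures_per_week = 1      # observed: ~1 adjacency loss per week network-wide
sites = 100                # ~100 sites
links_per_site = 1         # assumption: one Carrier Ethernet link per site

total_links = sites * links_per_site
weeks_between_failures_per_link = total_links / failures_per_week

print(f"Mean time between failures per link: "
      f"{weeks_between_failures_per_link:.0f} weeks "
      f"(~{weeks_between_failures_per_link / 52:.1f} years)")
# -> ~100 weeks, or roughly two years
```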

Figuring out that links don’t fail that often was interesting, but they could still be dealing with gray failures, so I asked them to deploy BFD and track BFD errors over a period of a few weeks. End result: tons of BFD errors, so maybe the Service Provider was the root cause of their problems.
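If you want to run the same exercise, counting session flaps per week is a few lines of Python. The event timestamps below are made up; in practice you’d extract them from syslog or whatever your network management system exports.

```python
from collections import Counter
from datetime import datetime

# Hypothetical BFD session-down timestamps (replace with data pulled from
# syslog or your NMS).
bfd_down_events = [
    datetime(2024, 3, 4, 9, 15),
    datetime(2024, 3, 4, 14, 2),
    datetime(2024, 3, 7, 11, 40),
    datetime(2024, 3, 12, 8, 5),
]

# Count events per ISO calendar week to see the trend over the observation period.
per_week = Counter(ts.isocalendar()[:2] for ts in bfd_down_events)
for (year, week), count in sorted(per_week.items()):
    print(f"{year}-W{week:02d}: {count} BFD session flaps")
```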

Another quick check: BFD timers. They had set really aggressive timers, and it seemed that BFD packets were getting stuck behind large bursts of user traffic in output queues of Service Provider switches. After increasing the BFD timers to 300 msec the BFD errors disappeared almost completely, proving that (A) the links were pretty reliable, but (B) they were also experiencing periods of significant congestion.
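To put the timer change in perspective: BFD declares a session down after roughly interval × multiplier without received packets. The sketch below assumes the usual multiplier of 3, and uses 50 msec as a stand-in for their original “aggressive” setting (the actual value wasn’t that important); it shows how little headroom a congested queue gets before BFD reports a false failure.

```python
# BFD failure detection time is (roughly) the negotiated interval multiplied
# by the detection multiplier. Values below are illustrative: 50 ms stands in
# for an "aggressive" setting, 300 ms is what the customer ended up with.

def bfd_detection_time_ms(interval_ms: int, multiplier: int = 3) -> int:
    """Milliseconds without BFD packets before the session is declared down."""
    return interval_ms * multiplier

for interval in (50, 300):
    print(f"{interval} ms interval x 3 -> session declared down after "
          f"{bfd_detection_time_ms(interval)} ms without BFD packets")

# Any burst of user traffic that delays BFD packets for longer than the
# detection time produces a false session-down event -- exactly what the
# aggressive timers were generating on congested provider links.
```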

In the end, the customer made no significant changes apart from minor cleanups of their core routing configuration… but at least they understood the network behavior much better than they did in the past, and had data to back up their decisions.

In case you want to know more

I did a webinar describing various VPN services from the architecture and technology perspectives, and if you need a second opinion about your chosen design or service, we might be able to help you.

4 comments:

  1. What business needs did they have (convergence etc.)?
    Replies
    1. Here's something to consider: assuming one outage every two years (per site), what kind of application would you have to run to care about the difference between 50 msec convergence and 5 sec convergence?
    2. Mission Critical Voice is the app. See Public Safety requirements: 1.5 sec total reconvergence based on a 300 ms BFD timer (3 packets lost) plus some time for IGP reconvergence.
  2. So according to your rhetorical question, there wasn't a business need for subsecond convergence. Either way, for subsecond convergence you need more than just fast failure detection. Btw, you can't detect gray failures with BFD; it only checks forwarding (data plane).