One of my readers sent me an interesting challenge:
We have two MPLS providers sending us default routes and it seems like whenever we have problem with SP1 our failover is not happening properly and actually we have to go in manually and influence our traffic to forward via another path.
Welcome to the wondrous world of byzantine routing failures ;)
Let’s recap what the problem is:
- The customer uses two service providers (ISPs, MPLS/VPN providers – doesn’t really matter);
- Both SPs announce aggregated prefixes (example: default route);
- The aggregated prefix is not revoked when the SP has a bad-hair-day.
The primary provider might be clueless enough to originate the default route on the PE-router without considering the state of the rest of its network (because they never read this blog post). In that case the problem cannot be solved with BGP: localized default route origination breaks the “shared fate” property of BGP, so the default route stays in place even when the network behind the PE-router is unreachable.
In any case, there are two commonly used solutions when one cannot trust the routing provided by the SP (no surprise there):
- Running your own routing protocol between your sites across an overlay network (example: DMVPN, potentially without the encryption part if you’re running DMVPN across an MPLS/VPN infrastructure, or something way more complex).
- Locally generated default route based on IP SLA measurements (for example, pinging Google DNS or one or more root nameservers).
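The decision logic behind the second option can be sketched in a few lines of Python. This is only an illustration of the idea, not a router configuration: the `Uplink` class, the `select_default_route` function, the priorities and the `min_success` threshold are all hypothetical names and values, and the probe results are assumed to come from whatever reachability measurement you run through each provider. Probing more than one target per uplink keeps a single unreachable target from triggering a failover.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Uplink:
    """One service provider connection (hypothetical model)."""
    name: str      # illustrative label, for example "SP1"
    priority: int  # lower value means more preferred


def uplink_is_healthy(probe_results: List[bool], min_success: int = 1) -> bool:
    """An uplink counts as healthy when at least min_success probes succeeded."""
    return sum(probe_results) >= min_success


def select_default_route(
    uplinks: List[Uplink],
    probes: Dict[str, List[bool]],
    min_success: int = 1,
) -> Optional[str]:
    """Return the name of the most preferred healthy uplink.

    probes maps an uplink name to the results of the reachability tests
    sent through that uplink (True means the target answered). An uplink
    with no probe results is treated as unhealthy. Returns None when no
    uplink is healthy, in which case no default route should be installed.
    """
    healthy = [
        uplink
        for uplink in uplinks
        if uplink_is_healthy(probes.get(uplink.name, []), min_success)
    ]
    if not healthy:
        return None
    return min(healthy, key=lambda uplink: uplink.priority).name


if __name__ == "__main__":
    uplinks = [Uplink("SP1", priority=10), Uplink("SP2", priority=20)]
    # SP1 still announces its default route, but nothing behind it answers.
    results = {"SP1": [False, False], "SP2": [True, True]}
    print(select_default_route(uplinks, results))
```

The point of the sketch is that the default route follows measured reachability instead of whatever the provider announces, which is exactly what you need when the announced default route is not revoked during a failure.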
You’ll find plenty of information on the overlay approach in the DMVPN webinars, and I covered some aspects of multi-provider connectivity in the Data Center Design Case Studies book. I also blogged extensively about individual components of the second solution, but I don’t have a comprehensive case study addressing it yet.
You might want to augment the DMVPN solution with PFRv3 (the combo known under the marketing name Intelligent WAN), and orchestrate it with Gluware Orchestration Engine, or consider one of the startups working in this space: Border6 if you need a BGP-only solution or Viptela if you need end-to-end private WAN load balanced across both SPs.
Last but definitely not least, I’m always available for short online consulting sessions.
And finally – don’t forget to read Radia Perlman’s take on routing with byzantine robustness.