Do Not EVER Run OSPF or IS-IS With Your Internet Customers

Someone started an interesting discussion on the NANOG mailing list. He inherited a network that extended its internal OSPF to its multihomed customers and wondered whether he should leave the network as it is, change OSPF to IS-IS, or deploy BGP. Here are a few thoughts from my reply.

Please remember that we were discussing running global OSPF with the customer routers. Running OSPF in a VRF is a different story, as the customer cannot impact another customer’s routing (they can only burn your CPU cycles).

Do not ever run an SPF routing protocol (OSPF or IS-IS) with your customer. They can insert anything they want into it, be it due to configuration mistakes, malicious intent, or third-party hijacking, and your whole network (or at least the other customers) will be affected.

Just to give you a few examples:

  • They could hijack the host route to your DNS server and spoof every other customer that uses your DNS (I haven’t seen this one yet, but it’s feasible).
  • They could hijack the host route to your POP3 server and collect the usernames and passwords of your residential users (I’ve seen this in a production network, but the attack vector was not OSPF but another routing protocol).
  • Company A could hijack the host route to Company B’s web server.
  • They could insert a better default route than you do, and at least some of your routers will listen to them (I’ve seen this done with OSPF).
  • If they ever make a total mess and start flapping their LSAs, your whole network will be affected, and all your routers will burn the CPU cycles running the SPF algorithm.
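
The host-route hijacks in the list above all rely on longest-prefix matching: a /32 injected into the IGP beats the covering prefix you originate yourself. Here is a minimal Python sketch of that lookup, not a model of any real router. The addresses come from documentation ranges, and the next-hop names (`isp-server-lan`, `customer-router`) are hypothetical.

```python
import ipaddress

def best_route(table, destination):
    """Return the (prefix, next_hop) entry with the longest matching prefix."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, next_hop) for net, next_hop in table if dest in net]
    if not matches:
        return None
    # Longest-prefix match: the most specific prefix wins.
    return max(matches, key=lambda entry: entry[0].prefixlen)

# Hypothetical routing table: the ISP originates the server subnet.
table = [(ipaddress.ip_network("192.0.2.0/24"), "isp-server-lan")]
DNS_SERVER = "192.0.2.53"  # assumed address of the ISP's DNS server

before = best_route(table, DNS_SERVER)

# A customer router injects a host route for the DNS server into the shared IGP.
table.append((ipaddress.ip_network("192.0.2.53/32"), "customer-router"))
after = best_route(table, DNS_SERVER)

print(before[1], "->", after[1])
```

Nothing in a shared link-state database distinguishes the customer's /32 from a legitimate one, so every router that receives it prefers it.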

If you absolutely insist on not using BGP (but then BGP is the only currently available routing protocol designed to handle routing in scenarios where the two parties don’t necessarily trust each other), use RIP. It’s safer than OSPF; at least you can filter the incoming updates.

I’ve also seen a Service Provider running RIP with their customer … but they were not using any filters when redistributing RIP routes into their IGP.
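
The inbound filtering mentioned above boils down to accepting only the prefixes you assigned to that customer. The sketch below shows the check in Python under that assumption; it is not the syntax of any router. The function name `filter_updates` and all prefixes are made up for illustration.

```python
import ipaddress

def filter_updates(advertised, allowed):
    """Split advertised prefixes into (accepted, rejected).

    A prefix is accepted only if it falls inside one of the
    prefixes assigned to the customer.
    """
    allowed_nets = [ipaddress.ip_network(p) for p in allowed]
    accepted, rejected = [], []
    for prefix in advertised:
        net = ipaddress.ip_network(prefix)
        if any(net.subnet_of(a) for a in allowed_nets):
            accepted.append(prefix)
        else:
            rejected.append(prefix)
    return accepted, rejected

# Hypothetical customer assignment and a mix of good and bad updates.
assigned = ["198.51.100.0/24"]
updates = ["198.51.100.0/24", "0.0.0.0/0", "192.0.2.53/32", "198.51.100.128/25"]

accepted, rejected = filter_updates(updates, assigned)
print("accepted:", accepted)
print("rejected:", rejected)
```

The default route and the host route to someone else's server are dropped before they reach the routing table. The same filter has to be applied wherever customer routes are redistributed into the IGP or BGP.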

Numerous other respondents shared my feelings, and Steve Bertrand provided the best summary: “If in the same sentence you read ‘my network’ and ‘customer network,’ use BGP.”

6 comments:

  1. Another solution is to run a brand new OSPF/ISIS process and redistribute it into the legacy IGP. The customers won't see the change (except a brief connection loss) and you will be able to filter the updates in order to protect the "inner" network.
  2. Many of the small ISP networks I've seen are either running an IGP with customers, or redistributing customer static routes into their IGP (instead of BGP). Then they redistribute the IGP into BGP (usually with no filters) and hope for the best. I've had numerous conversations trying to explain the nightmare waiting to happen, but as far as I know none of them has ever changed the practice.
  3. Considering the amount of duct tape used to patch together the Internet in various odd places, it's a wonder we don't get more BGP-related incidents.

    But, as you've said, some people try really hard not to learn.
  4. This is absolutely better than the "original" idea, but still has a few drawbacks. Unless you deploy an OSPF process per customer, other customers in the same OSPF process could be impacted (things could get a bit better if you run each customer in a separate area).

    Additionally, you have to use "distribute-list in" in customer OSPF processes on edge routers to prevent invalid OSPF routes from entering the IP routing table.
  5. The best option (with IS-IS) is of course to flood all LSPs with all bits set in the sequence number and with invalid information that makes you the closest router, in terms of metric, to everyone. You'd break the entire network.
    Reloading boxes one by one wouldn't help a bit: as long as some box is up, it'll reflood the broken data.

    A few ways to recover:
    1) reload all boxes at the same time
    2) wait for the LSPs to time out; many networks have the LSP lifetime maxed out at 18h
    3) change the NET address of each box
  6. My own horror story. At one point we had OSPF running happily in the core and were asked (for the first time) to deliver an active/active dual link to a customer site in quick order. We took the easy way out and simply extended the core OSPF out to the CPE. Unfortunately the method stuck for a limited number of such connections before a more robust solution was used. Wind forward two years and we start seeing horrendous churn in the core OSPF. No route shows up as being stable for more than 30 minutes. After a considerable amount of debugging, mostly in the small hours when other changes were minimal, we identified that all OSPF routes were being flushed milliseconds before they should have been refreshed. The flush instantly triggered a refresh, but every route was disappearing for a few seconds every 30 minutes.

    The flush was coming from one of the CPEs deployed above... whose clock was running almost exactly twice as fast as normal. So it was seeing the routes hit the 60-minute age limit (with no refresh) and flushing them. A few cycles slower and we'd probably never have had a problem. I found it with a scripted "sh clock" that checked the router time over a fixed period, and then spent several minutes staring at the errant CPE's CLI (a 1720, I recall!). Disabled OSPF on that CPE and... the whole network returned to graceful stability.
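
The arithmetic behind that last story can be checked with a short Python sketch. It assumes the standard OSPF constants (MaxAge of 60 minutes, LSRefreshTime of 30 minutes) and a deliberately simplified model in which the faulty router ages every LSA by its own clock; the function names are hypothetical.

```python
MAX_AGE = 3600          # OSPF MaxAge in seconds (60 minutes)
LS_REFRESH_TIME = 1800  # OSPF LSRefreshTime in seconds (30 minutes)

def real_seconds_until_maxage(clock_speed):
    """Real time after which a router whose clock runs `clock_speed`
    times too fast believes an LSA has reached MaxAge."""
    return MAX_AGE / clock_speed

def flushes_before_refresh(clock_speed):
    """True if the fast router ages out an LSA before its originator
    refreshes it (simplified: ignores transit delay and timer jitter)."""
    return real_seconds_until_maxage(clock_speed) < LS_REFRESH_TIME

for speed in (1.0, 1.99, 2.001):
    print(speed, real_seconds_until_maxage(speed), flushes_before_refresh(speed))
```

In this model a clock running slightly more than twice as fast flushes each LSA just before its 30-minute refresh, while one running slightly less than twice as fast never does, which matches the "a few cycles slower" remark.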