Do Not EVER Run OSPF or IS-IS With Your Internet Customers
Someone started an interesting discussion on the NANOG mailing list. He inherited a network that extended its internal OSPF to its multihomed customers and wondered whether he should leave the network as is, change OSPF to IS-IS, or deploy BGP. Here are a few thoughts from my reply.
Do not ever run an SPF routing protocol (OSPF or IS-IS) with your customer. They can insert anything they want into it, be it due to configuration mistakes, malicious intent, or third-party hijacking, and your whole network (or at least the other customers) will be affected.
Just to give you a few examples:
- They could hijack the host route to your DNS server and spoof every other customer that uses your DNS (I haven’t seen this one yet, but it’s feasible).
- They could hijack the host route to your POP3 server and collect the usernames and passwords of your residential users (I’ve seen this in a production network, but the attack vector was not OSPF but another routing protocol).
- Company A could hijack the host route to Company B’s web server.
- They could insert a better default route than you do, and at least some of your routers will listen to them (I’ve seen this done with OSPF).
- If they ever make a total mess and start flapping their LSAs, your whole network will be affected, and all your routers will burn CPU cycles rerunning the SPF algorithm.
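The host-route hijacks in the list above all rely on longest-prefix matching: a /32 injected by a customer beats the covering prefix you originate yourself. Here is a minimal sketch of that effect. The addresses come from documentation ranges, and the next-hop names (`isp-core`, `customer-cpe`) and the `best_route` helper are purely illustrative assumptions, not anything taken from a real router.

```python
import ipaddress

def best_route(routes, destination):
    """Return the (network, next_hop) pair with the longest matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, next_hop in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network, next_hop))
    if not matches:
        return None
    # Longest prefix wins, regardless of who advertised it.
    return max(matches, key=lambda match: match[0].prefixlen)

# Hypothetical routing table: the ISP originates the subnet holding its DNS server.
routes = [("192.0.2.0/24", "isp-core")]
dns_server = "192.0.2.53"
before = best_route(routes, dns_server)

# A customer inside the same link-state domain injects a host route.
routes.append(("192.0.2.53/32", "customer-cpe"))
after = best_route(routes, dns_server)

print(before[1], "->", after[1])
```

Nothing in a link-state protocol stops the second advertisement from reaching every router in the area, which is why the hijack works network-wide.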
If you absolutely insist on not using BGP (but then BGP is the only currently available routing protocol designed to handle routing in scenarios where the two parties don’t necessarily trust each other), use RIP. It’s safer than OSPF; at least you can filter the incoming updates.
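What "filter the incoming updates" buys you can be sketched in a few lines: accept a customer's advertisement only if it falls inside the address space assigned to that customer. The sketch below assumes a hypothetical per-customer allow list and a made-up `filter_updates` helper; it models the policy, not any particular router's filtering syntax.

```python
import ipaddress

def filter_updates(advertised, allowed):
    """Split advertised prefixes into (accepted, rejected) using an allow list."""
    allowed_nets = [ipaddress.ip_network(a) for a in allowed]
    accepted, rejected = [], []
    for prefix in advertised:
        net = ipaddress.ip_network(prefix)
        # Accept only prefixes that sit entirely inside an assigned block.
        ok = any(
            net.version == block.version and net.subnet_of(block)
            for block in allowed_nets
        )
        (accepted if ok else rejected).append(prefix)
    return accepted, rejected

# Hypothetical customer assignment (documentation range).
customer_blocks = ["203.0.113.0/24"]

# What the customer actually sends: one legitimate prefix, a default
# route, and a host route for somebody else's server.
received = ["203.0.113.128/25", "0.0.0.0/0", "192.0.2.53/32"]

accepted, rejected = filter_updates(received, customer_blocks)
print("accepted:", accepted)
print("rejected:", rejected)
```

With a distance-vector protocol or BGP you can apply this check at the edge before the route goes anywhere else; inside a single OSPF or IS-IS flooding domain there is no equivalent place to do it.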
Numerous other respondents shared my feelings, and Steve Bertrand provided the best summary: “If in the same sentence you read ‘my network’ and ‘customer network,’ use BGP.”
But, as you've said, some people try really hard not to learn.
Additionally, you have to use "distribute-list in" in customer OSPF processes on edge routers to prevent invalid OSPF routes from entering the IP routing table.
Reloading individual boxes wouldn't help a bit: as long as any box stays up, it will reflood the broken data.
A few ways to recover:
1) Reload all boxes at the same time.
2) Wait for the LSPs to time out; many networks have the LSP lifetime maxed out at about 18 hours.
3) Change the NET address of each box.
The flush was coming from one of the CPEs deployed above, whose clock was running almost exactly twice as fast as normal. So it was seeing the routes hit the 60-minute mark (no refresh) and flushing them. A few cycles slower and we'd probably never have had a problem. It took me several minutes of staring at the errant CPE's CLI (a 1720, as I recall!) before I found it by scripting "sh clock" to check the router time over a fixed period. I disabled OSPF on that CPE and the whole network returned to graceful stability.
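The scripted "sh clock" check boils down to comparing how much time passes on the device with how much passes on a trusted reference over the same interval. A rough sketch of that comparison follows, assuming the timestamps have already been collected and converted to seconds; the collection step, the function names, and the 1% tolerance are all illustrative assumptions. For context, OSPF refreshes LSAs every 30 minutes and flushes them at a MaxAge of 60 minutes, so a box aging LSAs at twice the normal rate hits MaxAge right around the time the refresh is due.

```python
def clock_rate(samples):
    """Return how fast the device clock runs relative to the reference clock.

    samples is a list of (reference_seconds, device_seconds) pairs taken
    over a fixed period; only the first and last pair are used.
    """
    (ref_start, dev_start), (ref_end, dev_end) = samples[0], samples[-1]
    if ref_end <= ref_start:
        raise ValueError("reference clock must advance between samples")
    return (dev_end - dev_start) / (ref_end - ref_start)

def is_suspicious(rate, tolerance=0.01):
    """Flag a clock whose rate deviates from 1.0 by more than the tolerance."""
    return abs(rate - 1.0) > tolerance

# Hypothetical samples: 10 minutes pass on the reference clock while the
# device clock advances by 20 minutes.
samples = [(0, 1000), (600, 2200)]
rate = clock_rate(samples)

# Apparent age on the fast box of an LSA that is really 30 minutes old.
apparent_age_minutes = 30 * rate

print(rate, is_suspicious(rate), apparent_age_minutes)
```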