He’s obviously right, but I was talking about failure domains, not interconnected domains (yeah, I know, you could argue they are the same, but do read on).
BGP routers are clearly interconnected. They are only loosely coupled, yet a bogus BGP update generated by a BGP router (or any other BGP speaker) can still bring down other BGP routers.
Sometimes an unexpected update triggers the loss of a BGP session (as was the case with long AS paths a while ago). Sometimes a weird transitive attribute that some BGP implementations pass on transparently crashes other implementations, and can thus take down a BGP speaker several hops away. Hijacking attacks are also nothing new, so it might seem like BGP fares no better than the new centralized controller architectures.
However, as I explain in more detail in the SDN Architectures and Deployment Considerations webinar, the crucial questions to consider are the following (Colin made approximately the same points in follow-up tweets; read the whole thread for his view):
- What happens when a control plane (or controller) fails?
- What is the size of the failure domain?
- What can be done to protect the controller/control plane?
On all three counts, BGP performs substantially better than the architectures with a centralized control plane heavily promoted by hard-core SDN aficionados.
What happens when a BGP router fails? Best case, a single BGP peering session is lost. Worst case, you lose a single router.
In theory, a BGP router might propagate a poisoned update before it falls over; based on anecdata it’s as likely as a round square (but of course do prove me wrong!).
Losing a major peering session is not exactly fun (and sometimes the ripples can be felt throughout the Internet), but it might be a bit better than losing a whole controller-managed network.
In this respect, a controller-based network is similar to an OSPF routing domain: a faulty LSA is propagated across the whole area (because the flooding algorithm is not tied in any way to the SPF algorithm) and might trigger a problem in all routers participating in that OSPF area. That’s one of the reasons I’d never run OSPF with untrusted third parties (like ISP customers).
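To make the flooding argument concrete, here is a toy sketch (not a real OSPF implementation). It assumes a four-router chain in one area and a hypothetical `poisoned` flag standing in for whatever makes an LSA trip a bug during SPF. The class and field names are made up for illustration. The only point is that each router floods the LSA before it tries to compute anything with it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class OspfRouter:
    name: str
    # Exclude neighbors from repr/eq: adjacencies are bidirectional (cyclic).
    neighbors: list = field(default_factory=list, repr=False, compare=False)
    lsdb: dict = field(default_factory=dict)
    failed: bool = False

    def receive(self, lsa: dict, sender: OspfRouter | None = None) -> None:
        # Ignore LSAs we already hold, so flooding terminates.
        if self.failed or lsa["id"] in self.lsdb:
            return
        self.lsdb[lsa["id"]] = lsa
        # Flooding comes first and does not depend on the SPF result.
        for neighbor in self.neighbors:
            if neighbor is not sender:
                neighbor.receive(lsa, sender=self)
        self.run_spf()

    def run_spf(self) -> None:
        # Hypothetical bug: any "poisoned" LSA in the LSDB breaks SPF.
        if any(entry.get("poisoned") for entry in self.lsdb.values()):
            self.failed = True


def connect(a: OspfRouter, b: OspfRouter) -> None:
    a.neighbors.append(b)
    b.neighbors.append(a)


area = [OspfRouter(f"r{i}") for i in range(1, 5)]
for left, right in zip(area, area[1:]):
    connect(left, right)

# The faulty LSA enters the area at r1.
area[0].receive({"id": "lsa-1", "poisoned": True})

ospf_failed = [r.name for r in area if r.failed]
print(ospf_failed)
```

In this model every router in the area ends up holding the faulty LSA and tripping over it, which is the failure-domain behavior described above.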
What is the size of a failure domain? When a BGP router receives an update with an attribute that causes it to hiccup (drop BGP session, crash, or do something else along these same lines), the update is not propagated beyond that router. The worst-case failure domain in a BGP network (or blast radius, as Jeremy Schulman would call it) is thus a single device.
Remember the sequence of events that happens within a BGP router: receive the update, insert it into the RIB, select the best paths, and finally send the best paths. If a router encounters an error anywhere along this sequence, the update that triggered the error is not propagated to the BGP neighbors.
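The receive-process-send sequence can be sketched as a toy model (again, not a real BGP implementation). It assumes a chain of four routers and a hypothetical `chokes_on` set listing the attributes a given implementation cannot handle. `attr-x` is a made-up attribute name. Routers that don't care about the attribute pass it along; the first one that chokes on it fails before it gets to the send step.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class UpdateError(Exception):
    """Stand-in for a session reset, a crash, or similar."""


@dataclass
class BgpRouter:
    name: str
    # Hypothetical: attribute names this implementation cannot handle.
    chokes_on: frozenset = frozenset()
    neighbors: list = field(default_factory=list, repr=False, compare=False)
    rib: dict = field(default_factory=dict)
    failed: bool = False

    def receive(self, update: dict) -> None:
        if self.failed:
            return
        try:
            self.insert(update)                       # insert into the RIB
            best = self.select_best(update["prefix"])  # select the best path
        except UpdateError:
            # The error happens before the send step, so nothing is propagated.
            self.failed = True
            return
        for neighbor in self.neighbors:               # send the best path
            neighbor.receive(best)

    def insert(self, update: dict) -> None:
        bad = self.chokes_on & set(update["attrs"])
        if bad:
            raise UpdateError(f"{self.name} cannot handle {sorted(bad)}")
        self.rib[update["prefix"]] = update

    def select_best(self, prefix: str) -> dict:
        # Single path per prefix in this sketch, so it is trivially the best.
        return self.rib[prefix]


chain = [
    BgpRouter("r1"),
    BgpRouter("r2", chokes_on=frozenset({"attr-x"})),
    BgpRouter("r3"),
    BgpRouter("r4"),
]
for upstream, downstream in zip(chain, chain[1:]):
    upstream.neighbors.append(downstream)

chain[0].receive({"prefix": "192.0.2.0/24", "attrs": {"attr-x": b"\x00"}})

bgp_failed = [r.name for r in chain if r.failed]
bgp_reached = [r.name for r in chain if "192.0.2.0/24" in r.rib]
print(bgp_failed, bgp_reached)
```

Under these assumptions only `r2` fails, and `r3` and `r4` never see the update: the blast radius is a single device.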
On the other hand, if you manage to hit a bug in an OpenFlow controller that causes the controller to crash after receiving a crafted packet, you’ll easily bring down all controllers in the cluster.
What can be done to protect the controller (or control plane)? It’s pretty easy to protect a BGP router – there are tons of security-related tools and knobs available in BGP (for more details, read the BGP Operations and Security RFC, RFC 7454) – and the receive-process-send mechanism explained in the previous section easily protects the network core from potential exploits received by edge routers.
A controller-based network is like a single device. You need a single exploit and it’s game over.