Sysadmins Shouldn’t Be Involved with Routing
I had a great chat with Enno Rey the morning before Troopers 2016 started, in which he made an interesting remark:
I disagree with your idea of running BGP on servers because I think sysadmins shouldn’t be involved with routing.
As (almost) always, it turned out that we were still in violent agreement ;)
We quickly agreed that running OSPF on servers is a patently bad idea, and expecting hosts to act as peers in network path calculations is another one.
Then there’s the gray area of hypervisor connectivity. Like it or not, hypervisors really are the new network edge, and you can connect them to the physical network in one of three ways:
- You pretend they aren’t there, and give them simple IP connectivity which they can use to build whatever-over-IP tunnels (aka overlay virtual networking);
- You allow them to dump whatever **** they have into the network and deal with the consequences (aka VLAN-based virtual networking);
- You accept them as the new network edge and start treating them as PE-routers (the Project Calico way). Unfortunately, this approach works well only when you can enforce the residential ISP mentality in your service offering (Here’s your IP address, take it and stop complaining. And no, you cannot move it); otherwise you’re quickly stuck in a quagmire of host routes or end-to-end paths (VLANs, tunnels or LSPs).
However, coming back to the original question: Should we run a routing protocol on a regular (application) server? As I said, I don’t think we should… and yet I’m advocating running BGP on those same servers. I must be confused, right?
Not really. BGP (at least in that particular use case) is not a routing protocol (as in a tool for finding the best end-to-end path) but a service advertisement protocol – a host advertises its service (encoded in an IP address, because some people still can’t spell DNS) and receives a default route (or not even that) in return. While doing that, you’re also solving the host multihoming problem (more about that in another blog post).
Assuming we can’t fix the application code and are therefore stuck with the “IP address = service” paradigm, we could use a variety of tools to get the job done. BGP just happens to be a convenient one:
- It fulfills the requirements (admittedly you’re using a cannon to kill a fly, but then virtual cannons are cheap);
- It’s available in many ToR switches (excluding the greedy vendors who want to slap an SP tax on everyone using BGP);
- It’s available in every Linux distribution (not sure about Windows Server, comments most welcome).
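Here’s roughly what that looks like on the host side – a minimal sketch of a BIRD configuration (BIRD 1.x syntax; the ASNs, addresses, and the assumption that the service address sits on the loopback are illustrative, not taken from any particular deployment):

```
# /etc/bird/bird.conf -- minimal host-side sketch (BIRD 1.x syntax, made-up values)
router id 192.0.2.11;

protocol device { }

# Pick up the service address configured on the loopback (e.g. 192.0.2.11/32 on lo)
protocol direct {
  interface "lo";
}

# Install the BGP-learned default route into the Linux routing table
protocol kernel {
  export where source = RTS_BGP;
}

# Advertise the service /32 to the ToR switch, accept nothing but a default route
protocol bgp tor {
  local as 65001;                    # private ASN assigned to this server or rack
  neighbor 10.1.1.1 as 65000;        # ToR switch
  import where net = 0.0.0.0/0;
  export where net = 192.0.2.11/32;
}
```

The host never learns anything beyond the default route, and the fabric never hears anything from the host except its service address – which is exactly the service-advertisement (not routing) behavior described above.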
Finally, if you want to know how the whole thing works, watch the Leaf-and-Spine Fabric Designs webinar; guest star Dinesh Dutt covered numerous implementation details in his part of the session.
And don't forget other use cases like anycast services or routing to a hypervisor. Think how simple your network might turn out to be if you could get rid of overlay protocols and just use BGP to ensure end-to-end connectivity and application mobility.
The real solution is always the same: solve the problems with simple networking at the application layer, perhaps leaving server route injection for the small number of critical clusters that really benefit from anycast and multihoming.
P.S. The server/networking silo dilemma can be solved with some work on standardizing BGP server configurations; BIRD is well suited for that. The server team ends up configuring nothing more than the host /32 addresses and the gateways (BGP neighbors) in a standalone text file.
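For example (a hypothetical layout, assuming BIRD 1.x and its include directive; file names and values are made up), the network team could own a common template while the server team owns a tiny per-host file:

```
# /etc/bird/bird.conf -- common template owned by the network team
protocol device { }
protocol direct { interface "lo"; }                  # service /32s configured on the loopback
protocol kernel { export where source = RTS_BGP; }   # install the learned default route

include "/etc/bird/host.conf";                       # the only file the server team edits
```

```
# /etc/bird/host.conf -- per-host values: service address and gateway (BGP neighbor)
router id 192.0.2.11;

protocol bgp tor {
  local as 65001;
  neighbor 10.1.1.1 as 65000;        # first-hop ToR switch
  import where net = 0.0.0.0/0;
  export where net = 192.0.2.11/32;
}
```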
The FIB impact is a valid concern, but it's not a problem if we limit server route injection to the endpoints that really need it. Front-end web servers behind a load balancer don't need anything more than today's simple routing, which leaves the FIB capacity for the limited number of back-end clusters that profit from multihoming and anycast RHI.
That's not how things are done, and you're _NOT_ supposed to leak the full Internet table into your data center or your host routes into the Internet.
Anyway, if you have more than ~50K host routes (which means you're BIG) _and_ you plan to propagate them further than a single DC (which means you have the Enterprise Craplication mindset), you're doing something fundamentally wrong.
But my point was that host routing could be feasible for just the minority of endpoints that justify it, not for every server. That might mean a few hundred or a few thousand host routes, with little FIB scalability impact. It's not an all-or-nothing decision: it could be as small as two or three nodes of a back-end cluster doing anycast, a few dozen, and so on.
Coming back to the original topic: edge BGP for service announcement on a host can be standardized (with as much security and protection as you want on the network side) to the point where the server team doesn't have to touch the configuration beyond the loopback and interface addresses and the gateways, exactly as they do today.
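On the network side, the matching protection can be equally boring: accept nothing but /32s from the agreed service range, cap the number of prefixes, and send nothing but a default route. Here's a sketch of the ToR-side session, written in BIRD syntax purely for consistency with the examples above (a real switch would typically use its own BGP implementation), with made-up addresses, ranges and limits:

```
# ToR-side session towards one server (BIRD syntax used for illustration only)
protocol bgp server11 {
  local as 65000;
  neighbor 10.1.1.11 as 65001;
  # Accept only host routes from the agreed service range, and not too many of them
  import where net ~ [ 192.0.2.0/24{32,32} ];
  import limit 10 action block;
  # Send nothing but the default route
  export where net = 0.0.0.0/0;
}
```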