Sysadmins Shouldn’t Be Involved with Routing

I had a great chat with Enno Rey the morning before Troopers 2016 started in which he made an interesting remark:

I disagree with your idea of running BGP on servers because I think sysadmins shouldn’t be involved with routing.

As (almost) always, it turned out that we were still in violent agreement ;)

We quickly agreed that running OSPF on servers is a patently bad idea, and expecting hosts to act as peers in network path calculations is another one.

Then there’s the gray area of hypervisor connectivity. Like it or not, hypervisors are really the new network edge, and you can link them with the physical networks in one of three ways:

  • You pretend they aren’t there, and give them simple IP connectivity which they can use to build whatever-over-IP tunnels (aka overlay virtual networking);
  • You allow them to dump whatever **** they have into the network and deal with the consequences (aka VLAN-based virtual networking);
  • You accept them as the new network edge and start treating them as PE-routers (the Project Calico way). Unfortunately, this approach works well only when you can enforce the residential ISP mentality in your service offering (Here’s your IP address, take it and stop complaining. And no, you cannot move it), otherwise you’re quickly stuck in a quagmire of host routes or end-to-end paths (VLANs, tunnels or LSPs).

However, coming back to the original question: Should we run a routing protocol on a regular (application) server? As I said, I don’t think we should… and yet I’m advocating running BGP on those same servers. I must be confused, right?

Not really. BGP (at least in this particular use case) is not a routing protocol (as in a tool for figuring out the best end-to-end path) but a service advertisement protocol: a host advertises its service (encoded in an IP address, because some people still can’t spell DNS) and receives a default route (or not even that) in return. While doing that you’re also solving the host multihoming problem (more about that in another blog post).
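As an illustration, the host side of this “service advertisement” pattern might look like the following FRR-style sketch (all addresses and AS numbers are made up for the example):

```
! Hypothetical FRR configuration on an application server:
! advertise the service address (a /32 on the loopback) and
! accept nothing but a default route in return.
ip prefix-list DEFAULT-ONLY seq 5 permit 0.0.0.0/0
!
router bgp 65001
 neighbor 192.0.2.1 remote-as 65000             ! ToR switch (example)
 !
 address-family ipv4 unicast
  network 198.51.100.10/32                      ! service IP on the loopback
  neighbor 192.0.2.1 prefix-list DEFAULT-ONLY in
 exit-address-family
```

The server never learns the network topology; it only says “the service lives here” and gets (at most) a way out.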

Assuming we can’t fix the application code, we’re stuck with the “IP address = service” paradigm, and we could use a variety of tools to get the job done. BGP just happens to be a convenient one:

  • It fulfills the requirements (although you’re admittedly using a cannon to kill a fly, but then virtual cannons are cheap);
  • It’s available in many ToR switches (excluding greedy vendors who want to slap the SP tax on everyone using BGP);
  • It’s available in every Linux distribution (not sure about Windows Server, comments most welcome).

Finally, if you want to know how the whole thing works, watch the Leaf-and-Spine Fabric Designs webinar; guest star Dinesh Dutt covered numerous implementation details in his part of the session.

This blog post is part of the BGP in Data Center Fabrics series.

Comments:

  1. Running BGP on an application server has a lot of advantages and it's very easy to do. For example, you could achieve [legacy] application mobility by just configuring RFC1918 addresses between your ToR and your servers, assigning an application IP to a loopback interface and announcing it. You want to move the application to another rack/pod/dc? Just start announcing that IP address from another host. And the application guys don't even have to know that BGP is involved in there. That's what puppet classes are for. Just ask them to tag their host with the required class that will take care of the loopback interface and the BGP configuration and off you go ;)

    And don't forget other use cases like anycast services or routing to a hypervisor. Think how simple your network might turn out to be: you could get rid of overlay protocols and just use BGP to ensure end-to-end connectivity and app mobility.
    1. There are a lot of great things like this that can be done; they're easy for people with some networking experience, but can be a disaster in the wrong hands. Although the title talks about sysadmins, I guess it should say "people with no networking knowledge". I'm willing to bet this setup will end up with a host becoming a transit AS, because someone will peer with different ToRs using different ASNs :).
    2. Well, of course that would happen. That's why you'd be running BGP, not OSPF: so you can filter on AS path on the ToR switches and only accept /32 (or /128) prefixes with an empty AS path.
    3. That's the point of your article; you know it, I know it, and everybody who has used BGP before knows it, but devs and sysadmins who haven't used BGP before don't. (PS: this was a real case)
    4. There are simple safety mechanisms you can implement, as Ivan mentioned: reserve a /24 for the data center and accept only /32s from that prefix. However, the point here is that you should provide a way for developers to consume the network as a service. They want to announce their service; they couldn't care less whether that's done by BGP or voodoo. So build an abstraction they can consume and keep control of the "how" for yourself.
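Combining Ivan's AS-path filter with the reserved-prefix idea, the ToR-side ingress policy might look something like this (FRR-style syntax; the prefixes, ASNs, and neighbor address are examples):

```
! Hypothetical ToR ingress policy: accept only /32 host routes from a
! prefix range reserved for service addresses, and only routes the
! server originated itself (AS path contains just the server's AS,
! blocking accidental transit).
ip prefix-list SERVICE-HOSTS seq 5 permit 198.51.100.0/24 ge 32
!
bgp as-path access-list ORIGINATED-ONLY permit ^65001$
!
route-map FROM-SERVERS permit 10
 match ip address prefix-list SERVICE-HOSTS
 match as-path ORIGINATED-ONLY
!
router bgp 65000
 address-family ipv4 unicast
  neighbor 192.0.2.10 route-map FROM-SERVERS in
```

Anything else the server tries to announce (a shorter prefix, or a route it re-learned from another ToR) is silently dropped at the edge.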
  2. Windows Server 2012 R2 supports BGP and it's easy to configure via powershell as BGP Route Health Injection via loopback address.
    1. Has anyone actually had the stones to use it?
    2. In lab only :) But seems a good-enough BGP implementation...
    3. I would say Hyper-V in general is a viable server/network virtualization solution. As Ivan pointed out several years ago, Hyper-V Server 2012 is when Microsoft really started to tackle network virtualization and scalability issues. 2012 R2 added some much-needed BGP features accessible via PowerShell. I believe OSPF support has also been removed, as it most likely was never really used. Server 2016 (Nano) added even more nice features, namely access to a programmable network controller/switch, VXLAN support and software load balancing. I've been running Hyper-V in production since the 2008 release without any issues, and I'm really looking forward to Hyper-V 2016 and beyond. I would still not want to sprawl any VMs across to another DC; the same rules apply, keep the failure domains small. There's a nice write-up on Daniel's blog.
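The PowerShell configuration the commenters mention comes from the RemoteAccess module; a rough sketch of the route-health-injection setup might look like this (all addresses and ASNs are examples, and the routing role is assumed to be installed already):

```powershell
# Hypothetical BGP RHI setup on Windows Server 2012 R2 / 2016
# (RemoteAccess module; all values are examples)
Add-BgpRouter -BgpIdentifier 192.0.2.10 -LocalASN 65001

# Peer with the ToR switch
Add-BgpPeer -Name ToR1 -LocalIPAddress 192.0.2.10 `
    -PeerIPAddress 192.0.2.1 -LocalASN 65001 -PeerASN 65000

# Inject the service address configured on the loopback
Add-BgpCustomRoute -Network 198.51.100.10/32
```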
  3. Running BGP on servers is feasible, but challenging at big scale. Moving from /24 subnets to /32 host routes is difficult for recent ToR switches with small FIBs, unless you aggregate on a leaf pair (which kills awesome value-add services such as distant anycast RHI or mobility DHCP+RHI). Fortunately, moving from ARP on a /24 connected subnet to hundreds of BGP peerings can be solved via intermediary route servers (IXP-like), especially if we're talking about virtualized environments, but /32 host routing is definitely a scalability limitation in large deployments.

    The real solution is always the same: solve the problems with simple networking at the application layer. Maybe leaving Server Route Injection for a small number of critical clusters that really benefit from Anycasting and Multihoming.

    P.S. The server/networking silo dilemma can be solved with some work on standardizing BGP server configurations. BIRD is well suited for that. The server team ends up just configuring host /32 addresses and gateways (BGP neighbors) in a standalone text file.
    1. What do you mean by "at scale"? Any ToR can do at least a few tens of thousands of host routes, and recent Broadcom ASICs can do several hundred thousand.
    2. Depends on the use case. What if we're talking about first hops belonging to a national BGP backbone with no aggregation?

      The FIB impact is huge, but it's not a problem if we limit server route injection to the endpoints that really need it. Front-end web servers behind a load balancer just don't need more than today's simple routing. That leaves the extra FIB capacity for a limited number of back-end clusters that profit from multihoming and anycast RHI.
    3. We talk about interesting stuff and the next moment someone says "let's solve IP address mobility across data centers with this" ;))

      That's not how things are done, and you're _NOT_ supposed to leak full Internet table into your data center or your host routes into the Internet.

      Anyway, if you have more than ~50K host routes (which means you're BIG) _and_ you plan to propagate them further than a single DC (which means you have the Enterprise Craplication mindset), you're doing something fundamentally wrong.
    4. In that use case, the first L3 hop in the DC consisted of high-end PEs of a national BGP/MPLS backbone, so even in the case of intra-DC multihoming or anycasting between infrastructures (L2 failure domains), the /32 routes were visible nationally. Not a very cost-effective architecture, but with some advantages...

      But my point was that host routing could be feasible for just the minority of endpoints that justify it, not for every server. So maybe a few hundred or a few thousand routes, with little FIB scalability impact. It's not an all-or-nothing decision: it could be as small as 2-3 nodes of a back-end cluster anycasting, a few dozen, etc.

      Regarding the original topic, edge BGP for service announcement on a host can be so standardized (with as much security and protection as wanted on the network side) that the server team just doesn't have to touch the configurations beyond configuring the loopback and interface addresses and the gateways, as they do today.
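The standardized per-server BIRD configuration described in this thread could be as small as the following sketch (BIRD 1.x syntax; the addresses and ASNs are placeholders the server team would fill in):

```
# Hypothetical standalone BIRD config: the server team only touches
# the router id, the service route, and the BGP neighbor (gateway).
router id 192.0.2.10;

protocol static {
  route 198.51.100.10/32 via "lo";    # service address on the loopback
}

protocol bgp tor1 {
  local as 65001;
  neighbor 192.0.2.1 as 65000;        # first-hop gateway / ToR
  import where net = 0.0.0.0/0;       # accept only the default route
  export where source = RTS_STATIC;   # announce only the service route
}
```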
  4. I'll just leave this here...
    1. Petr's draft doesn't touch on edge BGP/IP mobility at all, but it's still an excellent read.
  5. When using BGP on a VM for mobility, what is the best way to establish a peer relationship with a new ToR switch after a live migration? The VM won't inherently know the peer address or the ASN. Cumulus Quagga has BGP unnumbered and remote-as external, but what about other vendors? Is live VM mobility with BGP on servers only possible with Cumulus? Maybe derive the peer address and ASN from LLDP or CDP?
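For the Cumulus/FRR case the comment mentions, BGP unnumbered sidesteps the problem: the session runs over IPv6 link-local addresses discovered via router advertisements, and "remote-as external" accepts whatever ASN the new ToR presents, so no peer address or ASN has to be known in advance. A minimal sketch (example interface, ASN, and prefix):

```
! Hypothetical FRR/Cumulus-style BGP unnumbered configuration:
! no peer IP, no peer ASN -- both are discovered dynamically.
router bgp 65001
 neighbor eth0 interface remote-as external
 !
 address-family ipv4 unicast
  network 198.51.100.10/32            ! service IP on the loopback
```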