Running BGP on Servers

Mr. A. Anonymous left this comment on my BGP in the data centers blog post:

BGP is starting to penetrate into servers as well. What are your thoughts on having BGP running from the servers themselves?

Finally some people got it. Also, welcome back to the '90s (see also RFC 1925 section 2.11).

Running a routing protocol on servers (or IBM mainframes) is nothing new – we were already doing that 30 years ago, using either RIP or OSPFv2 – and it’s one of the best ways to achieve path redundancy.

I’ve also heard of a network design that was one link failure away from an IBM mainframe becoming the core router. If you run routing protocols on servers, make sure the servers cannot become transit nodes.

Later it became unfashionable to have any communication between the server silo and the network silo, resulting in the unhealthy mess we have today where everyone expects the other team to solve the problem. Unfortunately, the brown substance tends to flow down the stack.

However, even though mainstream best practices focused on link bonding, MLAG, and similar kludges, I know people who have been running BGP on their servers (with good results) for years, if not decades.

The old ideas resurfaced in mainstream networking as a means of connecting the virtual (overlay) world with the physical one, first with routing protocol support on the VMware NSX Edge Services Router (ESR), later with BGP support in Hyper-V gateways… and I was really glad VMware decided to implement BGP on the ESR, because BGP establishes a clean separation between the two administrative domains (virtual and physical).

Juniper Contrail and Nuage VSP use BGP to connect to the outside world, but not from end-hosts, so they’re out of the scope of this article.

Lately, I’ve seen very smart full-stack engineers (read: sysadmins who understand networking) use FRR to run BGP across unnumbered links between servers and ToR switches, greatly simplifying both the BGP configurations and the deployment procedures (not to mention turning the whole fabric into a pure L3 fabric with no VLANs on the ToR switches).
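For illustration only, here is a minimal sketch of what the server side of such a setup might look like in FRR syntax; the ASN, interface names, and loopback address are assumptions made up for this example:

  ! /etc/frr/frr.conf on the server (illustrative values only)
  router bgp 65101
   bgp router-id 10.1.1.1
   ! BGP unnumbered sessions toward the two ToR switches
   neighbor eth0 interface remote-as external
   neighbor eth1 interface remote-as external
   !
   address-family ipv4 unicast
    ! advertise only the server loopback (10.1.1.1/32 configured on lo)
    network 10.1.1.1/32
   exit-address-family

The ToR switches would use matching unnumbered neighbor statements on their server-facing ports, plus whatever inbound filters the network team considers prudent.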

Want to know more? Dinesh Dutt described the idea in the Leaf-and-Spine Fabric Architectures webinar.


35 comments:

  1. As with many design decisions, it's not automatically bad all the time. It's about the overall design goals, and *your ability to validate the design.*
  2. From my experience (being one of the principal network engineers at six data centers), unless you control the server and understand the software that runs BGP, OSPF or another routing protocol, DO NOT run routing protocols directly on end hosts. My prior counterparts thought running OSPF on mainframes was a good idea. Then we had a routing blackhole due to misconfiguration on the server. Twice! The main issue was the mainframe admins' lack of networking/OSPF knowledge. In reality, there was no requirement that couldn't be met with a simple secondary route. We didn't even need anything special like vPC or MLAG. In short, stay away unless you control the server and know what you are doing.
    Replies
    1. I think you're taking a specific failure and turning it into too general a rule to fix it.

      OSPF has essentially no routing policy control (within an area, anyway; only on area or AS boundaries), which means it's hard to stop an end node from becoming a transit link, and that increases the danger of blackholing failures such as the one you apparently experienced. You also end up with the server holding a full set of OSPF routes in its RIB/FIB, which is... inelegant.

      We started with OSPF, but learned our lesson and switched to BGP, which does provide the routing policy needed to prevent exactly these sorts of failures: only pass a default route down to the server, and restrict which routes the network will accept from the server (see the configuration sketch after these replies). That solves both problems and works phenomenally well.

      There are all sorts of other capabilities that come along with a good setup for running routing protocols on servers, but those are beyond the scope of this discussion.

      So I would caution against concluding that because you had failures using OSPF, you shouldn't run *any* routing protocols on servers. Switch to BGP and be happy.
    2. Salman, as JMcA pointed out (and I wrote in one of the blog posts I linked to), running OSPF on nodes not under common administration is not a good idea. BGP (or heavily filtered RIP) is the only routing protocol I would run between servers and the network.
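      To make the "default route down, filtered routes up" approach described above concrete, here is a minimal sketch of the ToR-side configuration in FRR/Quagga syntax; the ASNs, neighbor address, and prefix-list contents are illustrative assumptions:

      ip prefix-list DEFAULT-ONLY seq 10 permit 0.0.0.0/0
      ip prefix-list SERVER-LOOPBACK seq 10 permit 10.1.1.1/32
      !
      router bgp 65001
       neighbor 192.168.10.2 remote-as 65101
       address-family ipv4 unicast
        ! send the server a default route and nothing else...
        neighbor 192.168.10.2 default-originate
        neighbor 192.168.10.2 prefix-list DEFAULT-ONLY out
        ! ...and accept only the server's own loopback coming back
        neighbor 192.168.10.2 prefix-list SERVER-LOOPBACK in
       exit-address-family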
  3. I've never really understood why sysadmins don't run OSPF/RIP on their servers as stubs and source their traffic from a static /32 on a loopback interface (the OSPF/RIP-speaking interfaces that connect to the outside environment could even run DHCP).

    DC disaster recovery scenarios would be much simpler: move the VM, wait for OSPF/RIP to reconverge, and you're done. You wouldn't need L2 extensions to get it to work, and you could implement it with far less expensive equipment (it would also work across vendors with relative ease); all you'd need would be DHCP and RIP/OSPF/IS-IS.

    Obviously you'd have the challenge of a large adjacency design, as you'd have to allow anything in the DHCP range to form an adjacency, and you'd have to be careful that a host never ended up as a transit node, but I still think it's an interesting design.

    I'm sure that if I've got this appallingly wrong, Ivan is going to shoot my argument full of holes, or some Windows person is going to say that sourcing the IP address from the loopback will cause Windows to disintegrate...
    Replies
    1. No need to punch holes in your argument - we're perfectly aligned, apart from a "minor" detail - don't EVER use a link-state routing protocol between devices that are not under common administration. In this case, use BGP.
    2. What sucks about BGP in this regard is that the protocol (or maybe it's not the protocol, just every single implementation I've seen) requires every single neighbour to be explicitly configured. That fits badly with "cloudy" environments where a customer might all of a sudden decide to spin up a dozen or two new VMs from which they want to immediately begin advertising a set of service addresses. Having to wait for the network admin to configure BGP sessions to each of those new VMs is a non-starter.

      OSPF is also a non-starter due to the inability to filter out accidental/malicious advertisements, so we're using RIP, which works well enough. (Of course, we don't have a RIP *topology*, whatever the router picks up from RIP just gets exported to a more sensible routing protocol.)

      I've always wondered why it's not possible to have an unspecified peer address in BGP (0.0.0.0/::). That would have solved the problem - I'd just ask the customer to establish BGP sessions to the default router address on their server LAN. Do you know, Ivan?
    3. What you are talking about is BGP Dynamic Peers or Cumulus's BGP unnumbered, where you can configure "neighbor eth0 remote-as external" and the session just establishes.

      Disclaimer: I work for Cumulus
    4. Dynamic BGP peers are a generic solution (they work even with a single VLAN/subnet per ToR switch); I'm not sure how many ToR switches have that feature, though.

      BGP Unnumbered is a cool solution but requires a compatible BGP daemon on the other end (yeah, I know it's all based on RFCs and you can make it work with Junos and NX-OS, but it gets kludgy) and L3 interfaces toward servers instead of a single VLAN-per-ToR.

      So, how about adding dynamic BGP neighbors to Quagga? ;))
    5. We do have it:

      bgp listen range address/mask peer-group

      The documentation is missing; I'll get it added.

      Obligatory Disclaimer: I work for Cumulus :)
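      For readers wondering how that fits together, here is a minimal sketch of a dynamic-neighbor configuration on the ToR in Quagga/FRR syntax; the subnet, ASN, peer-group, and prefix-list names are illustrative assumptions:

      ip prefix-list SERVER-LOOPBACKS seq 10 permit 10.1.0.0/16 ge 32
      !
      router bgp 65001
       neighbor SERVERS peer-group
       neighbor SERVERS remote-as 65101
       ! any server in the rack subnet can open a session without
       ! being configured individually
       bgp listen range 192.168.10.0/24 peer-group SERVERS
       !
       address-family ipv4 unicast
        neighbor SERVERS default-originate
        ! accept only /32 loopbacks from the server address space
        neighbor SERVERS prefix-list SERVER-LOOPBACKS in
       exit-address-family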
  4. If the hosts are stubs, a default gateway is all that is really needed. A routing protocol is required for (alternate) path selection.

    Routing on the host went away some time ago because we could fix the dual-attachment/loop-avoidance/RSTP problem with your favorite flavor of MLAG, and decrease convergence time, SPF hiccups, etc. by having fewer routers in the area/domain and keeping the hosts as "leafs" in the design.

    I'm not saying you couldn't have a rock-steady DC with 510 routers, for example (500 hosts + 10 routers/VRFs), but one thing is for sure: you improve your odds of stability by lowering the complexity and managing just the 10 routers/VRFs.

    One of the main reasons for the "BGP on the host" discussion lately is egress path selection between DCs. PBR, TE, LISP, etc. all catch the traffic once it is out of the host and rely on some form of "marker" to apply the proper forwarding.

    Now if you have overlay networks originating directly from the host, "maybe" you want to provide the information to the host so that the end to end connection is established over the appropriate egress point in the DC and over the WAN link of choice for this tenant or application.

    This makes "some" sense if you own both the DCs and control / own the WAN.

    Past history has proven, however, a number of times, that implementing more granular controls is not for everyone... actually, few ever do it (QoS has been around forever and I'm still surprised how little of it is actually implemented).

    BTW, BGP has been in the DLR & ESG from the beginning in NSX-v.

    My 2 cents :)

    Replies
    1. I have a perfect counter-argument, but it wouldn't fit into a comment ;)) Just kidding, time to write another blog post.

      Thanks for the comment!
      Ivan
    2. It will be a pleasure to read it, as always.
    3. You're assuming mLAG is a good solution. If you don't need IP mobility, you can simply advertise a /24 per rack (or whatever appropriately sized aggregate) and a default route to the servers (see the sketch after these replies). This makes maintenance much easier, with no risk of mLAG failover failures bringing down all attached hosts.
    4. I am not assuming anything, nor am I saying that implementing MLAG (or not) is a good thing. Neither am I saying that running BGP on the host for picking an egress path is a good thing. It was just a recap of what we did in the last 15+ years.

      Sometimes we need to stop drinking our own Kool-Aid ;)

      BGP as a means to solve world hunger has been a recurring topic over the years, because we are networking people and we see the world through our very limited lens. For instance, it is funny how peer-to-peer networking, which handles millions of endpoints with attributes (files, songs, etc.), was designed without the IETF and widely implemented without our help...
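      As an aside, here is a minimal sketch of the "aggregate per rack plus default gateway" alternative mentioned in reply 3 above, in FRR syntax with purely illustrative values; in this design the servers run no routing protocol at all and simply point a static default route at the ToR:

      ! ToR switch owning 10.10.20.0/24 for its rack; the servers use the
      ! ToR SVI address in that subnet as their static default gateway
      router bgp 65020
       ! unnumbered uplinks toward the spine layer
       neighbor swp51 interface remote-as external
       neighbor swp52 interface remote-as external
       !
       address-family ipv4 unicast
        ! advertise only the rack aggregate into the fabric
        network 10.10.20.0/24
       exit-address-family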
  5. To get to the presentation go here:
    https://cumulusnetworks.com/webinars/

    Click on 'On Demand Webinars'
    Click on 'Watch Now' under the Demystifying Networking heading.
    Fill in some info and click on 'Watch Now'

    Full Disclosure: I work for Cumulus and am going to see if we can make this easier to find.
    Replies
    1. Added the link Cumulus marketing sent me. Thank you!
  6. Thanks for writing this, Ivan. I've had countless conversations with customers about routing on the host, using the Linux package, Docker containers, and vRouters. The fact that this seems revolutionary is, I believe, a result of vendors selling L3 licenses (which increase the cost of the data center) and of the fact that they have no reason to encourage good design, since they can't monetize the server endpoint. Cisco's CSR1000v has taken a step in the right direction, but it's a resource hog and too expensive, so customers avoid it. I hope more vendors see the value of an L3-only data center and build products to enable those customers.

    Disclaimer: I work at Cumulus
    Replies
    1. > The fact this is revolutionary, I believe, is a result of vendors selling L3 licenses

      Exactly. We were trying OSPF on servers for L3 failover instead of the usual L2, but the idea died for several reasons; one of them was that you can't really control link-state protocol announcements all that well. We had some unfortunate more-specific announcements, too, and occasional, unexpected anycast after failover. It wasn't pretty.

      I thought about using BGP from the start, but it would have meant that every ToR switch would cost much more due to the standard "ISP tax" license for BGP/IS-IS/MPLS (thank you, Juniper, I know you do that because you love us).

      The other reason was that the server team hated it because they didn't understand it, so the network team had to do everything network-related themselves, creating even bigger administrative overhead instead of lowering it.

      Now it's back to stretched L2 between data centers. The brown substance flows as it's supposed to and the world is back to normal.

      End of story.


      Funny thing: I actually used that scheme years ago, with Solaris servers and Juniper M-series routers, with huge success, but back then server admins actually understood how networks work instead of just googling quick-and-dirty solutions.
  7. Currently we run both BGP and OSPF on servers (for different parts of the infrastructure: customer DNS servers and shared cache/web blocking).

    Since the server teams have zero understanding of this, we (the network team) run them.
    We employ Quagga and BIRD for diversity, roughly 50/50.
    Not a single failure, and patching is really fast... no dozen teams to be involved, no offshoring, no sudo, and no useless paperwork for a reboot. I patched all of them (roughly 30) today for CVE-2015-7547 in about two hours with no service interruption.

    But... it's hard to cope with this stuff if you're the average point-and-click GUI guy. Even Unix teams are not the way they used to be 10 years ago; nobody knows what load average means anymore. How could you expect them to understand a non-transit routing protocol configuration?
  8. Another interesting article Ivan.

    Looks like there is no black & white answer, but the decision to run any routing protocol on servers would probably be based on a combination of many factors.

    1) Operational issues: the extent of the silo between the server & network teams. Not sure if just having a good sysadmin with strong networking knowledge will help for large deployments.

    2) Complexity: the number of sessions, as pointed out in one of the comments. Is there enough data to prove that more sessions means more complexity? On a slightly different note, these days there are also gossip protocols on hosts.

    3) Scale of deployment: if the scale of deployment is small (2-4 switches), then both the operational and complexity aspects can be managed better.

    4) Rate of change: the number of network updates is another factor. Constant network updates are not desirable. In VM-based deployments, the number of updates is far lower compared to microservices, which have very short lifetimes.

    5) Where the SDN sits: is it on the host or on the ToR? Marking and the like seems fine from the host, but there is growing complexity around policy. Traditional systems usually had these policies at the L2/L3 boundary, which was simple and easy to manage.
  9. >one link failure away from IBM mainframe becoming the core router

    There's a great BSD Now episode that talks about a similar scenario: episode 103, at the 34:30 mark. Great story. The link is below.

    Thanks,
    Phil

    https://youtu.be/l6XQUciI-Sc?t=2072
  10. Perhaps secondary to the discussion, but aren't anycast services also leveraging routing daemons on servers? The point being, certain sysadmins were well aware of networking concepts long before the "full-stack" term became hip.
  11. Some anycast designs use that method, yes. I've also seen others that use some kind of load balancer in front of the end servers; it takes care of health monitoring and the NAT tricks so the servers stay vanilla servers. A lot of this comes down to the skill set on the server side and who manages the network portion of the servers. We've diverged into silos over time.

    I like BGP, but running it to the servers just to get dynamic redundancy is a bit of an overkill unless, for instance, it's a virtual router handling lots of VMs behind it. It's just not scalable in a large data center to run BGP on all the nodes. That's why Contrail, for instance, uses XMPP for that exchange and translates it into BGP on a centralized node.
  12. Any clever way of running BGP on the hosts in an environment where I want to advertise just a loopback IP of a VM on a host? It would be highly suboptimal to have every VM (let's say there are 30) on a given host run BGP just to advertise its own /32 loopback IP... Static routing at the host level and redistributing it into BGP obviously wouldn't scale or accommodate VM migrations... any ideas on that?
    Replies
    1. Not exactly what you're looking for, but Project Calico does some of what you want to achieve.
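      For illustration, the Calico-style approach runs a single BGP daemon on the host and has it advertise the /32 host routes of the local workloads (which the virtualization or orchestration layer installs into the kernel routing table), so the VMs themselves never speak BGP. A rough FRR-flavored sketch with purely illustrative addresses and names:

      ! one BGP speaker per host; VM /32s are picked up from the
      ! kernel routing table and announced upstream
      ip prefix-list VM-HOST-ROUTES seq 10 permit 10.240.0.0/16 ge 32
      !
      route-map EXPORT-VMS permit 10
       match ip address prefix-list VM-HOST-ROUTES
      !
      router bgp 65101
       neighbor eth0 interface remote-as external
       address-family ipv4 unicast
        redistribute kernel route-map EXPORT-VMS
       exit-address-family

      A VM migration then becomes a /32 withdrawal on the old host and an announcement on the new one; no per-VM BGP sessions are needed.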
  13. Popular VM load balancers run BGP from servers. If that is okay, why would a routing BGP stack on servers be a problem?
    Replies
    1. It's not a technology, but a people/skills problem. The load balancers are usually configured by people who understand (some) networking.
  14. I've been building an internal project, with Docker overlay over L3VPN (EVPN later), where every veth pair is represented as a PE-CE link. The control plane can be run locally, as an additional container, or centrally, programming the forwarding logic into OVS.
    Every Docker host is represented by a single /32 (the PE loopback) used as the next hop in BGP updates. The data plane is MPLS over VXLAN; going forward I envision using the EVPN VXLAN control plane (draft-ietf-bess-evpn-overlay).
    So far we have been redistributing connected routes within the VRF, so the remote end of the /31 is the container itself. We have also built libnetwork adapters.

    Obviously there's some magic around IP assignment, mapping into internal APIs, etc.

    At some point I will provide more details and perhaps a demo
  15. Great blog, great comments. Check out osrg/gobgp on GitHub. BGP as a distributed data store has some interesting properties. The management plane needs a datastore anyway; the only difference is whether you want to consolidate things or stack up multiple clustering types. Opaque BGP from our friend Petr Lapukhov at Facebook is pretty awesome, IMO. I like it much more than forced data structs that I don't need. Let me set a TLV and encode pictures of cats if I want to. See you around, guys, and great convo.
    Replies
    1. Thanks for the info, but how is this relevant to the topic here?
    2. Hi Anonymous, well, we are talking about BGP on servers, right?
    3. Brent - communities :)
    4. @Jeff - alas, using communities (32-bit or 64-bit) to encode arbitrary data is pretty awkward - tried that. Not to mention these are just attributes that need to go with an NLRI (semantic overloading). Not sure why there was no generic opaque payload in BGP from day 2 (MP-BGP); it seemed pretty natural.
  16. @Petr - really?
    I recall there's draft-lapukhov-bgp-opaque-signaling :)

    See you tomorrow at Networking @Scale