Running BGP on Servers
Mr. A. Anonymous left this comment on my BGP in the data centers blog post:
BGP is starting to penetrate into servers as well. What are your thoughts on having BGP running from the servers themselves?
Finally, some people got it. Also, welcome back to the '90s (see also RFC 1925 section 2.11).
Running a routing protocol on servers (or IBM mainframes) is nothing new – we were doing that 30 years ago, using either RIP or OSPFv2 – and it’s one of the best ways to achieve path redundancy.
Later it became unfashionable to have any communication between the server silo and the network silo, resulting in the unhealthy mess we have today where everyone expects the other team to solve the problem. Unfortunately, the brown substance tends to flow down the stack.
However, even though mainstream best practices focused on link bonding, MLAG and similar kludges, I know people who have been running BGP on their servers (with good results) for years if not decades.
The old ideas resurfaced in mainstream networking as a means of connecting the virtual (overlay) world with the physical world, first with routing protocol support on VMware NSX Edge Services Router (ESR), later with BGP support in Hyper-V gateways… and I was really glad VMware decided to implement BGP on ESR, because BGP establishes a clean separation between the two administrative domains (virtual and physical).
Lately, I’ve seen very smart full-stack engineers (read: sysadmins who understand networking) use FRR to run BGP across unnumbered links between servers and ToR switches, significantly simplifying both the BGP configuration and the deployment procedures (not to mention turning the whole fabric into a pure L3 fabric with no VLANs on the ToR switches).
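To give you a rough idea of what that looks like, here's a minimal FRR-style sketch of the server side of such a setup (the ASN, interface names, and loopback address are made up for illustration):

router bgp 65101
 bgp router-id 192.0.2.11
 ! BGP unnumbered: peer over the physical uplinks, no per-link addressing required
 neighbor eth0 interface remote-as external
 neighbor eth1 interface remote-as external
 address-family ipv4 unicast
  ! advertise nothing but the server loopback
  network 192.0.2.11/32

The ToR end of each link is configured the same way (neighbor <uplink> interface remote-as external), which is what makes the configuration identical on every link and trivially repeatable across racks.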
Want to know more? Dinesh Dutt described the idea in the Leaf-and-Spine Fabric Architectures webinar.
OSPF has essentially no routing policy control (within an area, anyway; you only get policy at area or AS boundaries). That means it's hard to stop an end node from becoming a transit link, and it increases the danger of blackholing failures such as the one you apparently experienced. You also end up with the server holding a full set of OSPF routes in its RIB/FIB, which is... inelegant.
We started with OSPF, but learned our lesson and switched to BGP, which does provide the routing policy needed to prevent exactly these sorts of failures. Pass only a default route down to the server, and restrict what routes the network will accept from the server. That solves both problems and works phenomenally well.
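For illustration, the ToR side of that policy could look something like this in FRR/Quagga syntax (the ASNs, peer address, and prefix-list names are hypothetical):

ip prefix-list HOST-ROUTES-ONLY seq 10 permit 0.0.0.0/0 ge 32
ip prefix-list DEFAULT-ONLY seq 10 permit 0.0.0.0/0
!
router bgp 65000
 neighbor 10.1.1.2 remote-as 65101
 address-family ipv4 unicast
  ! send just a default route down to the server...
  neighbor 10.1.1.2 default-originate
  neighbor 10.1.1.2 prefix-list DEFAULT-ONLY out
  ! ...and accept nothing but /32 host routes from it
  neighbor 10.1.1.2 prefix-list HOST-ROUTES-ONLY in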
There are all sorts of other capabilities that come along with a good setup for running routing protocols on servers, but those are beyond the scope of this discussion.
So I would caution against concluding that because you had failures with OSPF, you shouldn't run *any* routing protocol on servers. Switch to BGP and be happy.
DC disaster recovery scenarios would be much simpler: move the VM, wait for OSPF/RIP to reconverge, and you're done. You wouldn't need L2 extensions to get it to work, and you could implement it with far less expensive equipment (it would also work cross-vendor with relative ease). All you'd need would be DHCP and RIP/OSPF/IS-IS.
Obviously you'd have the challenge of a large adjacency design, as you'd have to allow anything in the DHCP range to form an adjacency, and you'd have to be careful that a host never ended up as a transit node, but I still think it's an interesting design.
I'm sure that if I've got this appallingly wrong, Ivan is going to shoot my argument full of holes, or some Windows person is going to say sourcing the IP address from the loopback will cause Windows to disintegrate...
OSPF is also a non-starter due to the inability to filter out accidental/malicious advertisements, so we're using RIP, which works well enough. (Of course, we don't have a RIP *topology*; whatever the router picks up from RIP just gets exported into a more sensible routing protocol.)
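As a purely hypothetical illustration of that export, a router running FRR/Quagga could take whatever it learns via RIP and redistribute only the expected host routes into BGP (the prefix range, names, and ASN are invented):

ip prefix-list SERVER-LOOPBACKS seq 10 permit 192.0.2.0/24 ge 32
!
route-map RIP-TO-BGP permit 10
 match ip address prefix-list SERVER-LOOPBACKS
!
router bgp 65000
 address-family ipv4 unicast
  ! export RIP-learned host routes into BGP, dropping anything unexpected
  redistribute rip route-map RIP-TO-BGP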
I've always wondered why it's not possible to have an unspecified peer address in BGP (0.0.0.0/::). That would have solved the problem - I'd just ask the customer to establish BGP sessions to the default router address on their server LAN. Do you know, Ivan?
Disclaimer: I work for Cumulus
BGP Unnumbered is a cool solution but requires a compatible BGP daemon on the other end (yeah, I know it's all based on RFCs and you can make it work with Junos and NX-OS, but it gets kludgy) and L3 interfaces toward the servers instead of a single VLAN per ToR.
So, how about adding dynamic BGP neighbors to Quagga? ;))
bgp listen range address/mask peer-group
The documentation is missing, I'll get it added.
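For anyone wondering what that looks like in practice, a configuration along these lines should do the trick (the subnet, ASN, and peer-group name are just placeholders):

router bgp 65000
 neighbor SERVERS peer-group
 neighbor SERVERS remote-as 65101
 ! any host in 10.1.1.0/24 may open a session; it inherits the SERVERS peer-group settings
 bgp listen range 10.1.1.0/24 peer-group SERVERS
 ! optional safety valve on the number of dynamic sessions
 bgp listen limit 100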
Obligatory Disclaimer: I work for Cumulus :)
Routing on the host went away some time ago because we could fix the dual-attachment/loop-avoidance/RSTP problem with your favorite flavor of MLAG, and decrease convergence time, SPF hiccups, etc. by having fewer routers in your area/domain and keeping the hosts as "leaves" in the design.
I'm not saying you could not have a rock-steady DC with 510 routers, for example (500 hosts + 10 routers/VRFs), but one thing is for sure: you improve your odds of stability by lowering the complexity and managing just the 10 routers/VRFs.
One of the main reasons for the "BGP on the host" discussion lately is egress path selection between DCs. PBR, TE, LISP, etc. all catch the traffic once it is out of the host and rely on some form of "marker" to apply the proper forwarding.
Now if you have overlay networks originating directly from the host, "maybe" you want to provide the information to the host so that the end-to-end connection is established over the appropriate egress point in the DC and over the WAN link of choice for that tenant or application.
This makes "some" sense if you own both the DCs and control / own the WAN.
History has proven, however, a number of times, that implementing more granular controls is not for everyone... actually, few ever do it (QoS has been around forever and I'm still surprised how little of it is actually implemented).
BTW, BGP has been in DLR & ESG from the beginning in NSX-v.
My 2 cents :)
Thanks for the comment!
Ivan
Sometimes we need to stop drinking our own Kool-Aid ;)
BGP as a means to solve world hunger has been a recurring topic over the years because we are networking people and we see the world through our very limited lens. For instance, it is funny how peer-to-peer networking, which handles millions of endpoints with attributes (files, songs, etc.), was designed without the IETF and widely implemented without our help...
https://cumulusnetworks.com/webinars/
Click on 'On Demand Webinars'
Click on the 'Watch Now' under the Demystifying Networking heading.
Fill in some info and click on 'Watch Now'
Full Disclosure: I work for Cumulus and am going to see if we can make this easier to find.
Disclaimer: I work at Cumulus
Exactly. We were trying OSPF on servers for L3 failover instead of the usual L2, but the idea died for several reasons; one of them was that you can't really control link-state protocol announcements all that well. We had some unfortunate more-specific announcements too, and occasional unexpected anycast after failover. It wasn't pretty.
I thought about using BGP from the start, but it would have meant that every ToR switch would cost much more due to the standard "ISP tax" license for BGP/IS-IS/MPLS (thank you, Juniper, I know you do that because you love us).
The other reason was that the server team hated it because they didn't understand it, so the network team had to do everything network-related by themselves, creating even bigger administrative overhead instead of lowering it.
Now it's back to stretched L2 between data centers. The brown substance flows as it's supposed to, and the world is back to normal.
End of story.
Funny thing: I actually used that scheme years ago with Solaris servers and Juniper M-series routers with huge success, but back then server admins actually understood how the network works instead of just googling quick-and-dirty solutions.
Since the server teams have zero understanding of this, we (the network team) run them.
We employ Quagga and BIRD for diversity, roughly a 50/50 split.
Not a single failure, patching is really fast... no dozen teams to be involved, no offshoring, no sudo requests, no useless paperwork for a reboot. I patched all of them (roughly 30) today for CVE-2015-7547 in almost two hours with no service interruption.
But it's hard to cope with this stuff if you're the average point-and-click GUI guy. Even Unix teams are not what they used to be 10 years ago; nobody knows what load average means anymore. How could you expect them to understand a non-transit routing protocol configuration?
Looks like there is no black-and-white answer; the decision to run any routing protocol on servers would probably be based on a combination of many factors:
1) Operational issues: the extent of the silo between the server and network teams. Not sure if just having a good sysadmin with strong networking knowledge will help for large deployments.
2) Complexity: the number of sessions, as pointed out in one of the comments. Is there enough data to prove that more sessions mean more complexity? On a slightly different note, hosts these days often run gossip protocols as well.
3) Scale of deployment: if the deployment is small (2-4 switches), both the operational and complexity aspects can be managed better.
4) Rate of change: the number of network updates is another factor. Constant network updates are not desirable. In VM-based deployments the number of updates is far lower compared to microservices with very short lifetimes.
5) Where the SDN control sits: is it on the host or on the ToR? Marking and the like seems fine from the host, but there is growing complexity around policy. Traditional systems usually had these policies at the L2/L3 boundary, which was simple and easy to manage.
There's a great BSD Now episode that talks about a similar scenario: episode 103, at the 34:30 mark. Great story. Link is below.
Thanks,
Phil
https://youtu.be/l6XQUciI-Sc?t=2072
I like BGP, but running it all the way to the servers in order to get dynamic redundancy is a bit of an overkill unless, for instance, it's a virtual router handling lots of VMs behind it. It's just not scalable in a large data center to run BGP on the nodes. That's why Contrail, for instance, uses XMPP for that exchange and translates it to BGP on a centralized node.
Every Docker host is represented by a single /32 (the PE loopback) used as the next hop in BGP updates. The data plane is MPLS over VXLAN; going forward, I envision using an EVPN VXLAN control plane (draft-ietf-bess-evpn-overlay).
So far we have been redistributing connected routes within the VRF, so the remote end of the /31 is the container itself. We have also built libnetwork adapters.
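As a very rough, generic sketch of that particular piece (the VRF name, ASN, and addresses are invented, and the real deployment is obviously more involved):

! underlay loopback used as the PE next hop
interface lo
 ip address 192.0.2.11/32
!
! per-tenant BGP instance on the Docker host: advertise the connected /31s toward the containers
router bgp 65101 vrf tenant-red
 address-family ipv4 unicast
  redistribute connected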
Obviously there's some magic around IP assignment, mapping into internal APIs, etc.
At some point I will provide more details and perhaps a demo
I recall there's draft-lapukhov-bgp-opaque-signaling :)
See you tomorrow at Networking @Scale