Cumulus Linux in Real Life on Software Gone Wild
A year ago Matthew Stone first heard about Cumulus Linux when I ranted about it on a Packet Pushers podcast (which only proves that any publicity is good publicity even though some people thought otherwise at that time), and when his cloud service provider company started selecting ToR switches he considered Cumulus together with Cisco and Arista… and chose Cumulus.
I firmly believe hands-on experience in a production environment always beats vendor marketing, so I was really keen to hear how he feels after running his whole data center network on Cumulus-powered switches for a few months… and thus Episode 13 of Software Gone Wild was recorded.
It’s obvious Matt’s a true believer, but he was also very open about the glitches he found in Cumulus Linux, so I’m positive you’ll enjoy our hour-long chat. Here are a few highlights.
Why did they go with Cumulus?
- They were selecting ToR switches and had to choose between Cisco, Arista or Cumulus.
- Cisco lacked programmability features, and Arista was (obviously) more expensive than a whitebox switch + Cumulus Linux.
On the command line interface and glitches:
- Most network features are handled by open-source Linux daemons like Quagga;
- You don’t have a unified CLI, you’re using Linux commands to configure Linux networking;
- Various daemons have inconsistent interfaces (for example, you have to telnet to the Quagga VTY to configure it), but Cumulus is working on fixing those inconsistencies.
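To make the "Linux commands instead of a unified CLI" point concrete, here is a minimal Python sketch that builds a non-interactive Quagga command line. It assumes the `vtysh` shell that ships with Quagga is installed on the switch (the podcast only mentions the telnet VTY); the helper name `vtysh_argv`, the AS numbers and the neighbor address are made up for illustration.

```python
import shlex

def vtysh_argv(config_lines):
    """Build an argv list that passes each config line to vtysh with -c.

    Hypothetical helper: vtysh runs the -c commands in order, so we
    prepend "configure terminal" to enter configuration mode first.
    """
    argv = ["vtysh"]
    for line in ["configure terminal", *config_lines]:
        argv += ["-c", line]
    return argv

# Illustrative values only (not taken from the podcast).
bgp_lines = [
    "router bgp 65000",
    "neighbor 10.0.0.1 remote-as 65001",
]

argv = vtysh_argv(bgp_lines)

# Print the shell-quoted command for review instead of executing it.
print(shlex.join(argv))
```

On a real switch you could hand `argv` to `subprocess.run(argv, check=True)`. Printing it first keeps the sketch safe to run anywhere.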
Does it make sense?
Matt claims that Cumulus might be an ideal solution for large shops that have the resources to develop things themselves; for everyone else, it opens the possibility of third-party applications running on top of it. In his own words, “you could switch the operating system if you don’t like the one you use today” (let’s ignore that there’s approximately one at the moment).
Enjoy the podcast and don’t forget to subscribe to the Software Gone Wild feed.
We are approaching a point where we need to replace a lot of very old infrastructure while at the same time our business is seeing the benefits of public cloud. The current roadmap is to build a private/hybrid converged infrastructure adjacent to the aging infrastructure and migrate.
However... do we go with a large vendor product, or build on white box hardware and maintain it ourselves (like Cumulus)? There are huge pros and cons with either model. I would love to hear how the various vendors were selected, what was compared, and how the decision was arrived at. Also, since it's been in operation for a while, have there been any major challenges? Regrets? Would they do it again?
On the flipside, some engineers have attended ACI roadshows and classes. Going with that model definitely seems to guarantee that network engineers will still have jobs. It is very hardware centric, I'm told, with MP-BGP, route-reflectors and MPLS running underneath.
And finally there is the do-it-all-yourself model with whitebox hardware, an overlay, an orchestrator and probably a lot of operational learning curve pain (technical debt). I was curious whether the people in this article provided any insight into how much pain there was in going this route.
I'd like to clarify one point: Arista EOS is based on a standard Linux distribution (Fedora), so you can rpm install (or pip install once you've installed the Python tools) standard Linux tools on an EOS box just as you would apt-get on a Debian-based switch OS. Great podcast!
However, you still configure EOS using a session-based CLI (or REST API) and extract information with SHOW commands (although that's getting better now with eAPI), whereas on Linux-based networking you can do most things from the command line (which is easier to automate).
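As a rough illustration of why plain Linux commands are easy to automate, the sketch below parses the one-line output format of `ip -o -4 addr show` into a dictionary. The sample text is hand-written to resemble that output (the `swp1` interface and the addresses are invented), and `parse_ip_oneline` is a hypothetical helper, not part of any switch OS.

```python
def parse_ip_oneline(text):
    """Map interface name -> list of IPv4 prefixes from `ip -o -4 addr show`.

    Assumes each line looks like "<index>: <ifname> inet <prefix> ...",
    which is how the -o (oneline) format is laid out.
    """
    addrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "inet":
            addrs.setdefault(fields[1], []).append(fields[3])
    return addrs

# Hand-written sample resembling real output; values are illustrative.
SAMPLE = """\
1: lo    inet 127.0.0.1/8 scope host lo\\       valid_lft forever preferred_lft forever
2: swp1    inet 10.0.0.2/31 scope global swp1\\       valid_lft forever preferred_lft forever
"""

addresses = parse_ip_oneline(SAMPLE)
print(addresses)
```

On a live box the same function could be fed the output of `subprocess.run(["ip", "-o", "-4", "addr", "show"], capture_output=True, text=True).stdout`.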
I don't mind doing this stuff on a test box or two I have in the lab, but on my prod network gear I'll take a good unified interface over fragmented utilities on my network hardware.
Well, there's Cumulus Linux, and there's ONL for some platforms (e.g. Accton/Edge-Core). And most white-boxen have a "non-white-box option" where they bundle a proprietary/licensed OS, should you become so frustrated that you want to go back to that model (or for the day when SDN has become tired and someone invents "HDN!!!111").