Cumulus Linux in Real Life on Software Gone Wild

A year ago, Matthew Stone first heard about Cumulus Linux when I ranted about it on a Packet Pushers podcast (which only proves that any publicity is good publicity, even though some people thought otherwise at the time). When his cloud service provider company started selecting ToR switches, he considered Cumulus alongside Cisco and Arista… and chose Cumulus.

I firmly believe hands-on experience in a production environment always beats vendor marketing, so I was really keen to hear how he feels after running his whole data center network on Cumulus-powered switches for a few months… and thus Episode 13 of Software Gone Wild was recorded.

It’s obvious Matt’s a true believer, but he was also very open about the glitches he found in Cumulus Linux, so I’m positive you’ll enjoy our hour-long chat. Here are a few highlights.

Why did they go with Cumulus?

  • They were selecting ToR switches and had to choose among Cisco, Arista and Cumulus.
  • Cisco lacked programmability features, and Arista was (obviously) more expensive than a whitebox switch plus Cumulus Linux.

On the command line interface and glitches:

  • Most network features are handled by open-source Linux daemons like Quagga;
  • You don’t have a unified CLI; you’re using Linux commands to configure Linux networking (see the short sketch after this list);
  • Various daemons have inconsistent interfaces (for example, you have to telnet to the Quagga VTY to configure it), but Cumulus is working on fixing those inconsistencies.
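
To give a feel for what that looks like in practice, here is a minimal sketch (the interface name, address and BGP setup are made up for illustration): interfaces live in the ifupdown2 configuration file (Cumulus extends Debian's ifupdown), and routing runs in Quagga, which you reach through its VTY ports or the integrated vtysh shell.

    # /etc/network/interfaces -- ifupdown2 syntax
    auto swp1
    iface swp1
        address 10.0.0.1/31

    $ sudo ifreload -a                      # apply the change (Cumulus ifupdown2)

    # Routing protocols run as Quagga daemons; either telnet to a daemon's VTY...
    $ telnet localhost 2605                 # 2605 is the default bgpd VTY port
    # ...or use the integrated vtysh shell
    $ sudo vtysh -c "show ip bgp summary"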

Does it make sense?

Matt claims that Cumulus might be an ideal solution for large shops that have the resources to develop things themselves; for everyone else, it opens the possibility of third-party applications running on top of it. In his own words, “you could switch the operating system if you don’t like the one you use today” (let’s ignore that there’s approximately one at the moment).

Enjoy the podcast and don’t forget to subscribe to the Software Gone Wild feed.

11 comments:

  1. http://puppetlabs.com/presentations/managing-cisco-devices-using-puppet
  2. Would love to see more on this. Did management OK the deviation from big vendors up front, or did IT have to sell them on it? How were the comparisons made? Strictly capacity and costs?

    We are approaching a point where we need to replace a lot of very old infrastructure while at the same time our business is seeing the benefits of public cloud. The current roadmap is to build a private/hybrid converged infrastructure adjacent to the aging infrastructure and migrate.

    However... go with a large vendor product or build white box and maintain it ourselves (like Cumulus)? There are huge pros and cons with either model. I would love to hear how the various vendors were selected, what was compared, and how the decision was arrived at. Also, since it's been in operation for a while, have there been any major challenges? Regrets? Would they do it again?
    Replies
    1. If all you need are two switches (and you'd have to be pretty big to need more, see also http://blog.ipspace.net/2014/10/all-you-need-are-two-top-of-rack.html) you shouldn't bother. The potential savings you might get by switching vendors will be more than offset by increased OpEx (not to mention longer time-to-production).
    2. Ivan, I read that article and we've actually had these conversations internally. Some environments have adopted the full VMware suite of tools and the traditional networking is all now in the hypervisor. There are physical switches that terminate internet/WAN/other-edge, but that's about it. In those cases, we agree that next gen networks could be dramatically smaller. Good for the customers, but makes the network engineers nervous.

      On the flipside, some engineers have attended ACI roadshows and classes. Going with that model definitely seems to guarantee that network engineers will still have jobs. It is very hardware centric, I'm told, with MP-BGP, route-reflectors and MPLS running underneath.

      And finally there is the do-it-all-yourself model with whitebox hardware, an overlay, an orchestrator and probably a lot of operational learning-curve pain (technical debt). I was curious whether the people in this article provided any insight into how much pain was involved in going this route.

  3. Disclaimer: I work for Arista
    I'd like to clarify one point: Arista EOS is based on a standard Linux distribution (Fedora), so you can rpm-install (or pip-install, once you've installed the Python tools) standard Linux tools on an EOS box just as you would apt-get them on a Debian-based switch OS. Great podcast!
    Replies
    1. Perfectly valid point (and I like Arista's approach to "let's use Linux as much as possible").

      However, you still configure EOS using a session-based CLI (or REST API) and extract information with show commands (although that's getting better now with eAPI), whereas on Linux-based networking you can do most things from the command line, which is easier to automate.
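
      A quick sketch of what that means for automation (the switch names are made up): with a Linux-based NOS you can collect state over SSH with the same tools you'd use on any server, with no screen-scraping of show command output.

        # count interfaces in "state UP" on a few hypothetical Cumulus switches
        for sw in leaf1 leaf2 spine1 spine2; do
          echo "== $sw =="
          ssh cumulus@$sw 'ip -o link show | grep -c "state UP"'
        done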
  4. No one ever talks about debugging issues on these things. Are we typing things like ip addr show, brctl show and sysctl -a (and more) to try to piece together what's going on in the box?

    I don't mind doing this stuff on a test box or two I have in the lab, but on my prod network gear I'll take a good unified interface over fragmented utilities on my network hardware.
    Replies
    1. Of course, what else would you use ;) Happy Linux troubleshooting :D
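
      To be fair, a typical troubleshooting session on a Linux-based switch is just a walk through the standard utilities; a minimal sketch (the bridge and interface names are made up):

        $ ip addr show swp1                 # interface state and addresses
        $ brctl show br0                    # bridge membership (or: bridge link show)
        $ ip route show                     # kernel routing table
        $ sysctl net.ipv4.ip_forward        # query one knob instead of grepping sysctl -a
        $ sudo vtysh -c "show ip route"     # what Quagga thinks, for comparison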
  5. “you could switch the operating system if you don’t like the one you use today” (let’s ignore that there’s approximately one at the moment)

    Well, there's Cumulus Linux, and there's ONL for some platforms (e.g. Accton/Edge-Core). And most white-boxen have a "non-white-box option" where they bundle a proprietary/licensed OS, should you become so frustrated that you want to go back to that model (or for the day when SDN has become tired and someone invents "HDN!!!111").
  6. We are removing Cumulus OS from our network; almost 50 switches in the production core. The system interface and feature management were too inconsistent, and the protocol setup is also quite cluttered. But now we have a bigger problem: what else can we install on these white-box switches? Is there something better than Cumulus OS for a data center with the most common L2/L3 switching features?
    Replies
    1. If you want to have a traditional architecture, Cumulus OS seems to be by far the best alternative. You could try Big Cloud Fabric next for a totally different approach.