So You Need ISSU on Your ToR switch? Really?

During the Cumulus Linux presentation Dinesh Dutt gave at the Data Center Fabrics webinar, someone asked an unexpected question: “Do you have In-Service Software Upgrade (ISSU) on Cumulus Linux?” and we both went like “What? Why?”

Dinesh is an honest engineer and answered “No, we don’t do it” with absolutely no hesitation, but we both kept wondering, “Why exactly would you want to do that?”

Back-channel conversation with the attendee brought up interesting facts:

  • He was asking about ISSU on ToR switches (not on MLAG core, where it might potentially make sense… or not);
  • Supposedly he’s getting requests from service providers who build their cloud infrastructure with single-homed servers and then hammer on the network equipment vendors to implement ISSU on the ToR switches.

There’s something radically wrong with this picture.

From my biased perspective, you have exactly two options:

  • You’re big enough to afford losing dozens of servers after a ToR switch failure (regardless of whether you’re building a scale-out web farm, Hadoop cluster, or public cloud infrastructure);
  • You’re not big enough (acceptable unit of loss is less than a ToR switch and all attached servers), in which case you dual-home your servers to two ToR switches, and stop caring about a ToR switch failure.

There is no middle ground or fifty shades of ToR redundancy – you either have redundancy, or you don’t. Forcing equipment manufacturers to do backflips with a mortar tied to their back because you botched your design is (at least) counterproductive.

Also, when was “Keep It Simple, Stupid” replaced with “Let’s throw more spaghetti at the wall”?

If you’re still not persuaded, consider all possible failure scenarios:

  • Switch or server hardware failure (unlikely);
  • Transceiver failure or cable fault (not-so-unlikely);
  • Server or switch software crash;
  • Server or switch software upgrade.

Assuming hardware failures are unlikely (you might disagree with that, in which case you should change your supplier), will the switch software upgrade really be the most disruptive operation that will happen in your network, or will you experience switch software crashes more often than you do software upgrades (in which case ISSU doesn’t buy you much)? Also, how often are you planning to do software upgrades anyway? Are you solving a major problem or complicating everyone’s life to address a small minority of potential outages?
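One way to answer those questions for your own network is a back-of-the-envelope calculation. The Python sketch below is only an illustration: every frequency and outage duration in it is a made-up placeholder (not measured data), server-side failures are left out, and the names `OutageCause` and `issu_share` are hypothetical helpers. It adds up the expected yearly outage minutes per single-homed ToR switch for each switch-side failure scenario and reports what fraction of that total ISSU could remove at best.

```python
from dataclasses import dataclass


@dataclass
class OutageCause:
    """One failure scenario as seen by single-homed servers behind a ToR switch."""
    name: str
    events_per_year: float    # ASSUMPTION: placeholder frequency per switch
    minutes_per_event: float  # ASSUMPTION: placeholder outage duration
    fixed_by_issu: bool       # True only for outages ISSU could avoid

    @property
    def minutes_per_year(self) -> float:
        return self.events_per_year * self.minutes_per_event


def issu_share(causes) -> float:
    """Fraction of expected yearly outage minutes that ISSU could remove."""
    total = sum(c.minutes_per_year for c in causes)
    saved = sum(c.minutes_per_year for c in causes if c.fixed_by_issu)
    return saved / total if total else 0.0


# Placeholder numbers; replace them with figures from your own operations.
CAUSES = [
    OutageCause("switch hardware failure", 0.02, 240, False),
    OutageCause("transceiver or cable fault", 0.1, 60, False),
    OutageCause("switch software crash", 0.5, 10, False),
    OutageCause("switch software upgrade", 2, 5, True),
]

if __name__ == "__main__":
    for cause in CAUSES:
        print(f"{cause.name:28s} {cause.minutes_per_year:6.1f} min/year")
    print(f"Best-case share of outage time removed by ISSU: {issu_share(CAUSES):.0%}")
```

The calculation assumes a perfect ISSU implementation; the extra crashes and hangs caused by the added complexity would push the crash row up and make the ISSU share smaller.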

Also, do keep in mind that ISSU (and associated Graceful Restart and Non-Stop Forwarding) vastly increases the device complexity, resulting in higher costs, more subtle bugs, and more opportunities for weird hangs and crashes.

Want to know more?

Interested in data center switches and fabrics? Check out the Data Center Fabrics webinar.


9 comments:

  1. Change management processes/rules are the biggest drivers for having ISSU implemented. :)
    Replies
    1. You can do a software update without an approved change if you have ISSU?
    2. You can and no one will notice. At least that's the hope here. :)
  2. If it's hitless, then the SLAs typically allow you to schedule them more quickly. One of those, "if it fails, you can ask forgiveness," but if you know for sure it will fail, then you need all those lead times.
  3. Imagine just how simple the network could be if we just didn't care so much about layer 2 all over the place and other weird solutions, and went straight to a routed IPv6 network without looking back. No NAT either, just simple robust easy to understand routing without any special hacks or protocols.

    Maybe the big public cloud providers can finally "save us". They don't seem too interested in implementing all crazy shit (that doesn't scale) :)

    But then again, why dream. Let's add complexity because it's more fun this way (it creates bureaucracy and network engineers jobs)...
  4. Unfortunately, things often aren't as black and white in real life.... Yes, it seems that ideally ISSU wouldn't be needed on a top-of-rack switch. In practice, there are always complications and corner cases where you would rather have it (or, possibly, a "fast reboot" where it fits as an alternative).

    And last, but not least, things get much more involved when you build your own SW and roll the updates continuously to the fleet :)
    Replies
    1. Agreed. In a perfect world you won't need ISSU on ToR switches. But when you have single-NIC servers and thousands of ToR switches, coordinating maintenance windows for software upgrades becomes a nightmare, especially when the applications were not designed to fail. ISSU in this case makes your life easier. Several vendors today support ISSU in their ToRs.
  5. The TOR is no longer an "access" switch. Today's high-density, high-speed pizza-box/non-modular switches deployed at the TOR serve as aggregation switches (with access relegated to the soft-switch inside the virtualized server). If you think HA is critical in the legacy aggregation/core switch, then this new class of aggregation points in current data centers will need it too in enterprises still running legacy apps (now virtualized, of course). Juniper, Cisco and Brocade all support this on the high-density fixed-form-factor TOR switches they target for enterprise DCs.
  6. Just to add to my previous comment. Imagine the servers being blade servers, then you will see the importance and the critical new role the current TORs play as aggregation points (think N5K aggregating some 4K VMs, you want to add some resiliency - yes, traditional thinking but that's the way customers have been programmed to think - good or bad).