Can You Avoid Networking Software Bugs?

One of my readers sent me an interesting reliability design question. It all started with a catastrophic WAN failure:

Once a particular volume of encrypted traffic was reached, the data center WAN edge router crashed, and then the backup router took over, which also crashed. The traffic then failed over to the second DC, and you can guess what happened then...

Obviously they’re now trying to redesign the network to avoid such failures.

All kinds of random things are being suggested, such as deliberately maintaining different software revisions, having a different vendor in the second DC, etc.

I don’t see what one could do to avoid this type of failure apart from using different equipment in parallel, be it equipment from different vendors or sufficiently different boxes from the same vendor.

Obviously you can only do that if you do simple IP routing and not something more creative like DMVPN, OTV or VPLS.

The multi-vendor approach avoids box-specific catastrophes, but brings its own set of complexities and inter-vendor interoperability challenges… unless you split the network into multiple independent availability zones linked with simple IP transport, and use a different set of vendors in each availability zone.

However, I’m not sure whether it would make more sense to deal with the ongoing complexity or with a once-in-a-blue-moon crash (assuming it really does happen only once in a blue moon).

I know that the traditional DMZ design guidelines suggested using equipment from different vendors, but I always had a problem with that approach.

Did you ever have to solve such a challenge? What did you do? What would you suggest? Oh, and keep in mind that SDN controllers might not be the right answer (see also this blog post).

4 comments:

  1. I know of some financial environments that look much like the left-side-Cisco/right-side-HP topologies that appear in some Spirent (look what we tested!) presentations.

    As long as you stick with standards-based stuff (and using proprietary features is not a decision that should be undertaken lightly), this approach should be mostly okay.

    Still, when you get down into the weeds, it can be tough to find feature parity when looking at things like per-member BFD for LACP links, etc...
  2. It's as you said: once in a blue moon... I once had a campus core stack crash, and it was a single point of failure for that part of the LAN. All the managers screamed that I should split it up into single switches. That cooled down once it became clear it would require a redesign, more IP addresses, more routing, and spanning-tree blocked links. It never happened again.
    Waiting for a second event before taking drastic measures would be best here, along with trying to stick to safer code revisions from the vendor.
  3. Recently we purchased a Cisco VPN ISM module and submitted it to an internal team that is responsible for blessing everything from hardware to software. They found a potential bug reported on the vendor website: a memory leak could occur on our IOS version when combined with the VPN ISM. In summary: IPsec re-keys the Security Association every hour or after a certain amount of data (whichever is reached first), and that is when the memory leak occurs. With every re-keying we would lose around 176 bytes of memory out of the 64 MB available on the module. This means that with only one tunnel re-keying every hour it would take about 43 years for the ISM module to run out of memory and for the router to crash, but it can happen much sooner depending on how many tunnels you have configured and the amount of data (a back-of-the-envelope sketch of this calculation follows the comments).

    What I would ask myself:
    1) Is my team qualified to support a new vendor?
    2) Assuming we have a well-documented network, does my team have time to study and write standards that can help when configuring or troubleshooting QoS, for example?
    3) Are we implementing procedures to validate whether a solution will perform as expected prior to deployment?

    In addition, even in a multi-vendor environment you might face issues in corner-case situations. A real case: we have a branch with two border routers working in a hot-standby fashion. The secondary (standby) router had been replaced a while ago because it was end-of-life, and one beautiful day the primary link failed and the secondary did not take over. The reason? It was running a basic IOS version that supports only EIGRP stub routing, and the router sat in a transit path, so the entire branch went down. Even if we had had a Juniper as the primary and a Cisco as the secondary, this problem would have happened anyway.

    All said, personally I'd rather wait for Halley's Comet than spend too much energy trying to avoid it, only to have a new one (Hale-Bopp) suddenly appear!
  4. My first instinct is to do nothing, if the event is truly an aberration. I would really want to make sure it was truly a once-in-a-blue-moon event. I have a few questions: do we know what caused the volume of traffic to increase? Will that cause continue to generate a large volume of traffic?
    If after investigation I found out that the issue wasn't going to be a blue-moon event, I would ask, "Do we know that a specific traffic volume is causing this issue?" If, for example, we know that 500 MB of aggregate encrypted traffic triggers this software issue and the router runs smoothly below that amount, could we consider using the backup WAN router in a load-sharing design?
    If we absolutely have to solve this issue I would look into load sharing across the WAN routers. It adds complexity, but it's a known set of complexity problems and a known set of bugs. Adding a second vendor presents a whole new set of bugs to discover and unforeseen complexity.
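
For what it's worth, the 43-year figure from comment #3 is easy to sanity-check. Here's a minimal sketch using only the numbers quoted in the comment (176 bytes leaked per re-key, 64 MB of module memory, one re-key per tunnel per hour); the tunnel counts are illustrative assumptions, not measured values:

```python
# Back-of-the-envelope check of the "43 years" figure from comment #3.
# Inputs quoted in the comment: 176 bytes leaked per IPsec re-key,
# 64 MB of memory on the VPN ISM module, one re-key per tunnel per hour.
# The tunnel counts below are illustrative assumptions.

LEAK_PER_REKEY_BYTES = 176
MODULE_MEMORY_BYTES = 64 * 1024 * 1024
REKEYS_PER_TUNNEL_PER_HOUR = 1
HOURS_PER_YEAR = 24 * 365


def years_until_crash(tunnels: int) -> float:
    """Years until the cumulative leak exhausts the module memory."""
    leaked_per_hour = tunnels * REKEYS_PER_TUNNEL_PER_HOUR * LEAK_PER_REKEY_BYTES
    return MODULE_MEMORY_BYTES / leaked_per_hour / HOURS_PER_YEAR


for tunnels in (1, 10, 100, 1000):
    print(f"{tunnels:5d} tunnel(s): ~{years_until_crash(tunnels):.1f} years to memory exhaustion")
```

With a single tunnel you do get roughly 43 years, but a few hundred tunnels re-keying hourly would exhaust the same 64 MB in months, which is why the per-tunnel math matters more than the headline number.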