And this is why you need automation

I stumbled upon a great description of how you can go bankrupt in 45 minutes due to a manual deployment process. The most relevant part of it:

Any time your deployment process relies on humans reading and following instructions you are exposing yourself to risk. Humans make mistakes. The mistakes could be in the instructions, in the interpretation of the instructions, or in the execution of the instructions.

And no, it's not just application deployment. A similar disaster could happen in your network.

7 comments:

  1. Deployment error? I'd say protocol design error.

    Their system consisted of 8 servers. These servers worked together as one system. So I am sure there was network communication between them. Thus there was a protocol. A proprietary protocol at the application layer.

    They changed their protocol. They changed the meaning of one bit (the power peg flag). But supposedly their 8 servers were not able to figure out that one of them was talking an old version of their proprietary protocol. This was the root of their problem.

    They could have had a protocol version in their packet header. They could have used TLV encoding instead of fixed-size, fixed-location, fixed-meaning encoding. They could have done version negotiation during connection establishment. They could have done something. But it seems they did nothing at their own protocol level to prevent these incompatibilities.
  2. As always happens in such cases, many small mistakes led to one big disaster. It is not about automation, but about proper design and supervision at all stages. Even if they had installed the 8th server correctly, they would still have been sitting on a time bomb waiting for the right circumstances to explode.
  3. What if the mistake was embedded into the automation process/tool (designed by humans) in the first place? Now you have a video series titled "Automation Gone Wild".

  4. Automation is a powerful tool just like redundancy.
    By now most network engineers have learned that redundancy has its own challenges and is not automatically equal to robustness. Similarly, automation does not equate to correctness.

    I can't agree with Ivan on this one. More automation would not have helped at all in this case. More thinking was needed at all levels: design, planning, testing, monitoring, system integration, etc.
    Replies
    1. I think both Hank and yourself missed the point. Yes, there are many things that could have been made more resilient (I totally agree with Hank's remarks), and who knows how many skeletons were hidden in those closets, BUT expecting a human to repeat a long and convoluted process flawlessly eight times in a row borders on insanity.

      Listen to this podcast http://blog.ipspace.net/2014/11/flipit-cloud-orchestrating-it-as.html if you want to hear more details from people who had to solve a similar problem.
    2. Well yes, I completely agree with the part involving humans repeating complicated processes.

      I was thinking about processes at all levels. For guys managing billions of dollars, it looks like they skimped almost everywhere:
      -operational: check-lists (did we do all the servers?), verification by a second technician, supervision by a software engineer, monitoring of error messages.
      -change management: you have a backout plan, right?
      -system design: transaction anomaly monitoring with safeguards (the networking equivalents are BPDU guard, unidirectional link detection, etc.)
      -management oversight seems lacking

      The point I was trying to make is that automating an incorrect procedure will not help and that automation, as a software activity, has all the problems associated with software development.

  5. It's summer, so have a look at this most recent trend in factory automation:

    youtu.be/ewNLCkA0oBk?t=110
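
The first comment's suggestion (a protocol version in the packet header, checked on receipt) can be sketched roughly as follows. This is only an illustration: the header layout, the `PROTOCOL_VERSION` value, and the `encode`/`decode` helpers are all hypothetical and have nothing to do with the actual system discussed in the linked story.

```python
import struct
from typing import Tuple

# Hypothetical wire format (an assumption for this sketch):
# 1-byte protocol version, 1-byte flags, 2-byte payload length, then the payload.
HEADER = struct.Struct("!BBH")
PROTOCOL_VERSION = 2  # version this node speaks (illustrative value)


class VersionMismatch(Exception):
    """Raised when a peer sends a message in a different protocol version."""


def encode(flags: int, payload: bytes, version: int = PROTOCOL_VERSION) -> bytes:
    # Prefix every message with the sender's protocol version.
    return HEADER.pack(version, flags, len(payload)) + payload


def decode(message: bytes) -> Tuple[int, bytes]:
    version, flags, length = HEADER.unpack_from(message)
    # Refuse to interpret flags whose meaning may have changed between versions.
    if version != PROTOCOL_VERSION:
        raise VersionMismatch(
            f"peer speaks version {version}, expected {PROTOCOL_VERSION}"
        )
    payload = message[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated message")
    return flags, payload
```

With a check like this, a server still running old code would presumably be rejected by its peers instead of having a repurposed flag bit silently reinterpreted.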