And this is why you need automation
I stumbled upon a great description of how you can go bankrupt in 45 minutes due to a manual deployment process. The most relevant part of it:
Any time your deployment process relies on humans reading and following instructions you are exposing yourself to risk. Humans make mistakes. The mistakes could be in the instructions, in the interpretation of the instructions, or in the execution of the instructions.
And no, it's not just application deployment. A similar disaster could happen in your network.
Their system consisted of 8 servers. These servers worked together as one system. So I am sure there was network communication between them. Thus there was a protocol. A proprietary protocol at the application layer.
They changed their protocol. They changed the meaning of one bit (the power peg flag). But supposedly their 8 servers were not able to figure out that one of them was still speaking an old version of their proprietary protocol. This was the root of their problem.
They could have had a protocol version in their packet header. They could have used TLV encoding instead of fixed-size, fixed-location, fixed-meaning encoding. They could have done version negotiation during connection establishment. They could have done something. But it seems they did nothing at their own protocol level to prevent these incompatibilities.
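To make the first two ideas concrete, here is a minimal sketch of a message format with a version byte in the header and TLV-encoded fields. Everything in it is an assumption made up for illustration: the field layout, the type codes, the version numbers and the function names. It says nothing about what their real protocol looked like.

```python
import struct

# Hypothetical wire format: 1-byte protocol version + 2-byte payload length,
# followed by TLV fields (1-byte type, 2-byte length, value). Network byte order.
HEADER = struct.Struct("!BH")
TLV_HEADER = struct.Struct("!BH")

PROTOCOL_VERSION = 2   # hypothetical current version
T_ORDER_ID = 1         # hypothetical field type
T_NEW_FLAG = 2         # hypothetical field type introduced in version 2


class VersionMismatch(Exception):
    """Raised when a peer speaks a protocol version we do not support."""


def encode_message(fields, version=PROTOCOL_VERSION):
    """Encode a list of (type, value-bytes) pairs into one message."""
    payload = b"".join(TLV_HEADER.pack(t, len(v)) + v for t, v in fields)
    return HEADER.pack(version, len(payload)) + payload


def decode_message(data, supported=frozenset({PROTOCOL_VERSION})):
    """Decode a message, refusing versions outside the supported set."""
    version, length = HEADER.unpack_from(data, 0)
    if version not in supported:
        raise VersionMismatch(f"peer speaks version {version}")
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated message")
    fields = {}
    offset = 0
    while offset < len(payload):
        if offset + TLV_HEADER.size > len(payload):
            raise ValueError("truncated TLV header")
        field_type, field_len = TLV_HEADER.unpack_from(payload, offset)
        offset += TLV_HEADER.size
        value = payload[offset:offset + field_len]
        if len(value) != field_len:
            raise ValueError("truncated TLV value")
        fields[field_type] = value
        offset += field_len
    return fields
```

With something like this, a server still running the old code would either announce its old version and be rejected, or reject the new messages itself, instead of silently reading a repurposed bit with its old meaning. Version negotiation at connection setup would move that check even earlier.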
By now most network engineers have learned that redundancy has its own challenges and is not automatically equal to robustness. Similarly, automation does not equate to correctness.
I can't agree with Ivan on this one. More automation would not have helped at all in this case. More thinking was needed at all levels: design, planning, testing, monitoring, system integration, etc.
Listen to this podcast http://blog.ipspace.net/2014/11/flipit-cloud-orchestrating-it-as.html if you want to hear more details from people who had to solve a similar problem.
I was thinking about processes at all levels. For guys managing billions of dollars, it looks like they skimped almost everywhere:
-operational: check-lists (did we do all the servers?), verification by a second technician, supervision by a software engineer, monitoring of error messages.
-change management: you have a backout plan, right?
-system design: transaction anomaly monitoring with safeguards (in networking, this is BPDU guard, unidirectional link detection, etc.)
-management oversight seems lacking
The point I was trying to make is that automating an incorrect procedure will not help and that automation, as a software activity, has all the problems associated with software development.
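As a rough illustration of the "did we do all the servers?" check from the list above, here is a small sketch. The server names, the version strings and the `verify_rollout` function are all hypothetical, and how you would actually collect the running version from each server is left out. Note that a check like this only verifies that the procedure was followed, not that the procedure itself was right.

```python
def verify_rollout(all_servers, reported_versions, expected_version):
    """Compare what each server reports against the version we meant to deploy.

    all_servers:       iterable of server names that should have been upgraded
    reported_versions: dict of server name -> version that server reports
    expected_version:  the version the change was supposed to install

    Returns (missing, stale): servers that reported nothing, and servers
    that reported a version other than the expected one.
    """
    missing = sorted(set(all_servers) - set(reported_versions))
    stale = sorted(
        name
        for name, version in reported_versions.items()
        if name in set(all_servers) and version != expected_version
    )
    return missing, stale


# Hypothetical example: eight servers, one of them left on the old code.
servers = [f"srv{i}" for i in range(1, 9)]
reported = {name: "2.0" for name in servers}
reported["srv8"] = "1.9"

missing, stale = verify_rollout(servers, reported, "2.0")
if missing or stale:
    print("Do not proceed. Missing:", missing, "Stale:", stale)
```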
youtu.be/ewNLCkA0oBk?t=110