Worth Reading: How Complex Systems Fail
I don’t remember who pointed me to the excellent How Complex Systems Fail document. It’s almost like RFC 1925 – I could quote it all day long, and anyone dealing with large mission-critical distributed systems (hint: networks) should read it once a day ;))
The other part is much more interesting to me: how to troubleshoot such complex failures. We all know, for example, that complex issues often arise from the coincidence of a few sub-issues, but finding the root cause isn't easy.
Maybe this could be a part of the NEW System Troubleshooting Blog Series, Ivan?
Why? Because I believe that troubleshooting is a crucial part of our job, if not the most important one. You need to fail to learn (at least at the beginning).