How GitHub Learned How Hard Distributed Systems Are
Anne Baretta found a great video describing the October 2018 GitHub failure. Here’s the TL&DW:
- The failure was caused by a short (~1 minute) disconnect of the primary data center.
- The database replicas failed over to the secondary data center, but that failover had never been tested and, of course, some stuff didn’t work.
- In the meantime, batch jobs modified data in the primary data center, leaving the two replicas out of sync.
- It took them over 24 hours to clean up the mess.
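
To make the out-of-sync part concrete, here's a minimal toy sketch in Python. It is not GitHub's actual setup: the `Replica` class, the `diverged` helper, and the write names are all hypothetical, and it assumes the promoted secondary also accepted writes of its own during the disconnect (the summary above doesn't say so explicitly). Once each side holds writes the other has never seen, plain replication can't reconcile them and somebody has to clean up by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy database replica: just an ordered log of applied writes."""
    name: str
    log: list = field(default_factory=list)

    def write(self, entry: str) -> None:
        self.log.append(entry)


def common_prefix_len(a: list, b: list) -> int:
    """Number of leading log entries the two logs share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def diverged(a: Replica, b: Replica) -> bool:
    """True when each replica holds writes the other has never seen."""
    n = common_prefix_len(a.log, b.log)
    return len(a.log) > n and len(b.log) > n


primary = Replica("primary")
secondary = Replica("secondary")

# Normal operation: every write reaches both data centers.
for entry in ("w1", "w2"):
    primary.write(entry)
    secondary.write(entry)

# Disconnect: batch jobs keep writing to the old primary, while the
# promoted secondary (assumption) accepts new writes of its own.
primary.write("batch-job-update")
secondary.write("user-update")

print(diverged(primary, secondary))  # True
```

A replica that is merely lagging (its log is a prefix of the other one) is not flagged as diverged; it can simply catch up. The painful case is the one where both logs have unique tails.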
 
You REALLY SHOULD watch the video – it nicely proves two points I’ve been making for ages (not that anyone would listen):
- Distributed systems are hard. Making them highly available is even harder.
- A Disaster Recovery Plan is just wishful thinking until it has been thoroughly tested under realistic conditions.