How GitHub Learned How Hard Distributed Systems Are
Anne Baretta found a great video describing the October 2018 GitHub failure. Here’s the TL&DW:
- The failure was caused by a short (~ 1 minute) disconnect of the primary data center
- The database replicas failed over to the secondary data center, but that failover was never tested and of course some stuff didn’t work.
- In the meantime, batch jobs modified data in the primary data center, making the two replicas out-of-sync.
- It took them over 24 hours to clean up the mess.
You REALLY SHOULD watch the video – it nicely proves two points I’ve been making for ages (not that anyone would listen):
- Distributed systems are hard. Making them highly available is even harder.
- A Disaster Recovery Plan is just wishful thinking until it has been thoroughly tested under realistic conditions.