How GitHub Learned How Hard Distributed Systems Are « ipSpace.net blog

Wednesday, August 23, 2023 05:55 UTC


Anne Baretta found a great video describing the October 2018 GitHub failure. Here’s the TL&DW:

The failure was caused by a short (~1 minute) disconnect of the primary data center.
The database replicas failed over to the secondary data center, but that failover was never tested and of course some stuff didn’t work.
In the meantime, batch jobs modified data in the primary data center, leaving the two replicas out of sync.
It took them over 24 hours to clean up the mess.
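
To make the out-of-sync part concrete, here is a toy sketch of how a brief disconnect plus a failover can leave two sites with writes the other one has never seen. It is not GitHub's actual setup: the `Site` class, the `replicate` helper, and the write names are all made up for illustration, and replication is reduced to copying a set.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """A toy database site: a name and the set of write IDs it holds."""
    name: str
    writes: set[str] = field(default_factory=set)


def replicate(source: Site, target: Site) -> None:
    """Asynchronous replication, reduced to copying the missing writes."""
    target.writes |= source.writes


primary = Site("primary-dc")
secondary = Site("secondary-dc")

# Normal operation: a write lands in the primary and is replicated.
primary.writes.add("w1")
replicate(primary, secondary)

# The link drops. Hypothetical batch jobs keep writing in the primary...
primary.writes.add("batch-1")
# ...while the failover promotes the secondary, which accepts new writes.
secondary.writes.add("user-1")

# Link restored: each site holds writes the other has never seen.
only_primary = primary.writes - secondary.writes
only_secondary = secondary.writes - primary.writes
diverged = bool(only_primary) and bool(only_secondary)
print(sorted(only_primary), sorted(only_secondary), diverged)
```

Once both sets are non-empty, simply resuming replication in one direction is no longer enough; somebody has to reconcile the two histories, which is where the long cleanup comes from.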

You REALLY SHOULD watch the video – it nicely proves two points I’ve been making for ages (not that anyone would listen):

Distributed systems are hard. Making them highly available is even harder.
A Disaster Recovery Plan is just wishful thinking until it has been thoroughly tested under realistic conditions.
