Cloudflare experienced a significant outage in early November 2023 and published a detailed post-mortem report. You should read the whole report; here are my CliffsNotes:
- Regardless of how much redundancy you have, sometimes all systems will fail at once. Having redundant systems decreases the probability of total failure but does not reduce it to zero.
- As your systems grow, they gather hidden and circular dependencies.
- You won’t uncover those dependencies unless you run a full-blown disaster recovery test (not a fake one).
- If you don’t test your disaster recovery plan, it probably won’t work when needed.
Also (unrelated to the Cloudflare outage):
- Even Cloudflare can get an outage. Don’t expect your stretched VLAN fairyland to survive the encounter with reality.
- Keep your design as simple as possible.
- Don’t rely on vendor-supplied miracles.
- Unless you can stress-test your ideas, leave the high-level decisions (for example, when to fail over) to humans.
- Automate the low-level operations as much as you can.