Worth Reading: Cloudflare Control Plane Outage
Cloudflare experienced a significant outage in early November 2023 and published a detailed post-mortem report. You should read the whole report; here are my CliffsNotes:
- Regardless of how much redundancy you have, sometimes all systems will fail at once. Having redundant systems decreases the probability of total failure but does not reduce it to zero (a back-of-the-envelope sketch follows this list).
- As your systems grow, they gather hidden and circular dependencies.
- You won’t uncover those dependencies unless you run a full-blown disaster recovery test (not a fake one).
- If you don’t test your disaster recovery plan, it probably won’t work when needed.
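To put numbers on the first point: with truly independent components, the chance of everything failing at once shrinks geometrically with each redundant element, but any shared hidden dependency puts a hard floor under it. A minimal sketch, with made-up probabilities (not Cloudflare's):

```python
# Back-of-the-envelope: why redundancy helps but never gets you to zero.
# All probabilities below are made-up illustration values.

def p_total_failure(p_single: float, n_redundant: int, p_correlated: float) -> float:
    """Probability that every instance fails at once.

    p_single     -- chance one instance fails independently in a given window
    n_redundant  -- number of redundant instances
    p_correlated -- chance of a shared failure (hidden dependency, power, ...)
                    that takes out all instances regardless of redundancy
    """
    independent_all_fail = p_single ** n_redundant
    return p_correlated + (1 - p_correlated) * independent_all_fail

# Two independent data centers, each 1% likely to fail in a given window:
print(p_total_failure(0.01, 2, 0.0))    # 0.0001 -- redundancy works great
# Same setup plus a 0.5% chance of a hidden shared dependency failing:
print(p_total_failure(0.01, 2, 0.005))  # ~0.0051 -- the shared dependency dominates
```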
Also (unrelated to the Cloudflare outage):
- Even Cloudflare can suffer an outage. Don’t expect your stretched VLAN fairyland to survive the encounter with reality.
- Keep your design as simple as possible.
- Don’t rely on vendor-supplied miracles.
- Unless you can stress-test your ideas, leave the high-level decisions (for example, when to fail over) to humans.
- Automate the low-level operations as much as you can (see the sketch after this list).
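Here is a minimal sketch of how the last two points can fit together, assuming a hypothetical two-site setup: the low-level steps (health probe, scripted promotion) are automated, while the decision to fail over stays with a human. The host names and helper functions are placeholders, not anything from the Cloudflare report.

```python
import socket

def check_health(host: str, port: int = 443) -> bool:
    """Low-level, fully automated: a crude reachability probe (TCP connect)."""
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False

def promote_standby(site: str) -> None:
    """Low-level, fully automated: the scripted promotion steps would go here."""
    print(f"running scripted promotion of {site} (DNS, routing, services)...")

def maybe_failover(primary: str, standby: str) -> None:
    if check_health(primary):
        return  # primary healthy, nothing to do
    # The high-level decision stays with a human: the script only gathers
    # facts and asks; it never fails over on its own.
    answer = input(f"{primary} looks unreachable. Promote {standby}? [y/N] ")
    if answer.strip().lower() == "y":
        promote_standby(standby)

# Hypothetical usage:
# maybe_failover("primary.example.com", "standby.example.com")
```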
It reads like they are going to migrate from a centralized control plane to a distributed one, aiming for partition tolerance. But according to the CAP theorem, they will have to give up either availability or consistency. At the same time, they want to stick with "the high availability cluster (if they rely on any of our core data centers)". To me these look like conflicting goals, and it also seems they are caught in a complexity trap. I thought that shop would do better, especially with testing, recovery planning, and organization.
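To make the CAP trade-off concrete, here is a toy sketch (a generic illustration, not Cloudflare's design) of the choice a replica faces during a partition: reject writes to stay consistent, or accept them and risk divergence.

```python
# Toy illustration of the CAP trade-off a replica faces during a partition.
# Generic sketch; nothing here describes Cloudflare's actual control plane.

class Replica:
    def __init__(self, mode: str):
        self.mode = mode          # "CP" or "AP"
        self.data = {}
        self.partitioned = False  # True when cut off from the other replicas

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # Consistency over availability: reject the write rather than
            # risk diverging from replicas we cannot reach.
            raise RuntimeError("partitioned: write rejected to stay consistent")
        # "AP" mode: availability over consistency -- accept the write locally
        # and accept that replicas may disagree until the partition heals.
        self.data[key] = value

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True
ap.write("route", "dc-2")                 # succeeds, but may conflict later
try:
    cp.write("route", "dc-2")             # refused while the partition lasts
except RuntimeError as err:
    print(err)
```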
BTW, I know of an outsourcing provider that tests their diesel generators and UPS every month because they know for sure that their crappy "active-active redundant data clusters" will not survive a complete outage of one data center. So their weakest "link" really is electricity. They also keep an impressive stock of diesel at the facility so they can run independently for days.
As an auditor, I always say that if an individual server, router, switch, or other infrastructure element has a long uptime, it tells me you have not exercised your redundancy failover as you should. Since there can be unforeseen surprises in complex systems (because people will never follow rules and principles properly), it is important to induce frequent artificial failures in your production system. You should then document the results of such exercises and take corrective actions. It is like regular fire drills in office buildings. For critical infrastructure, a monthly or weekly major facility failure exercise is highly recommended. Individual server reboots might be done on a rolling schedule, almost every day (a sketch follows below).
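A minimal sketch of such a rolling reboot schedule, assuming a hypothetical 60-host fleet and a 30-day cycle: every host gets rebooted (and the result documented) regularly, but only a small slice of the fleet on any given day.

```python
# Sketch of a rolling reboot schedule: every server gets exercised regularly,
# but only a small slice of the fleet is rebooted on any given day.
# Host names and the 30-day cycle are made-up illustration values.

from datetime import date

FLEET = [f"srv{i:02d}.example.net" for i in range(1, 61)]  # hypothetical hosts
CYCLE_DAYS = 30                                            # full fleet every 30 days

def due_for_reboot(today: date, fleet=FLEET, cycle=CYCLE_DAYS):
    """Return the hosts whose scheduled reboot day is today."""
    slot = today.toordinal() % cycle
    return [host for i, host in enumerate(fleet) if i % cycle == slot]

# Hosts to reboot (and document) today:
print(due_for_reboot(date.today()))
```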
BTW, it would also help with undetected errors in large memories. Your laptop's memory is designed on the assumption that it gets rebooted every night. Your server has ECC memory, but even that is not perfect protection. You might also have software memory leaks or one-time events such as bit flips caused by solar flares. So regular failure exercises and server reboots also give you a clean start and reduce your memory issues, too.