Worth Reading: Cloudflare Control Plane Outage

Cloudflare experienced a significant outage in early November 2023 and published a detailed post-mortem report. You should read the whole report; here are my CliffsNotes:

Also (unrelated to Cloudflare outage):

2 comments:

  1. It reads like they are going to migrate from a centralized control plane to a distributed one, aiming for partition tolerance. But according to the CAP theorem, during a partition they will have to give up either availability or consistency. At the same time they want to stick with "the high availability cluster (if they rely on any of our core data centers)". To me these look like conflicting goals. It also seems to me they are caught in a complexity trap. I thought that shop would do better, especially with testing, recovery planning, and organization.

    BTW, I know of an outsourcing provider that tests its diesel generators and UPS every month because they know for sure that their crappy "active-active redundant data clusters" will not survive a complete outage of one data center. So their weakest "link" really is electricity. They also keep an impressive stock of diesel at the facility so they can run independently for days.

  2. As an auditor, I always say that if an individual server, router, switch, or other infrastructure element has a long uptime, it tells me you have not exercised your redundancy failover as you should. Since there can be unforeseen surprises in complex systems (because people never follow rules and principles perfectly), it is important to have frequent, artificially induced failures in your production system. You should then document the results of such exercises and take corrective actions. It is like regular fire alarm drills in office buildings. For critical infrastructure, a monthly or weekly major facility failure exercise is highly recommended. Individual server reboots might be done on a rolling schedule, almost every day.

    BTW, it would also help with undetected errors in large memories. Your laptop memory is designed to be rebooted each night. Your server has ECC memory, but that is still not perfect protection. You might also have software memory leaks or other one-time events, such as solar flares. So regular failure exercises and server reboots would also give you a clean start and reduce your memory issues.
