Title: “Cloudflare 06/21 Disaster Report”
The official Cloudflare article explains in detail what happened on 06/21/2022 and how it affected users.

The incident affected 19 Cloudflare data centers. Unfortunately, these 19 data centers all handle heavy global traffic, so a large number of users were affected.
The main cause was a change to network configuration (when something breaks, guess BGP first; if it is not BGP, guess DNS…). The outage itself did not last very long:

  1. 06:27 UTC Problem occurs
  2. 06:58 UTC First data center repaired and online
  3. 07:42 UTC All data centers are fixed and online
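To put the timeline above into numbers, here is a small sketch that computes the time to first recovery and the total outage duration. The timestamps are the ones listed above; the variable and function names are only illustrative.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline above.
FMT = "%H:%M"
problem_start = datetime.strptime("06:27", FMT)
first_site_back = datetime.strptime("06:58", FMT)
all_sites_back = datetime.strptime("07:42", FMT)


def minutes_between(earlier, later):
    """Return the whole number of minutes between two timestamps."""
    return int((later - earlier).total_seconds() // 60)


time_to_first_recovery = minutes_between(problem_start, first_site_back)
total_outage = minutes_between(problem_start, all_sites_back)

print(f"First data center back after {time_to_first_recovery} minutes")
print(f"All data centers back after {total_outage} minutes")
```

So the first data center recovered after 31 minutes, and the whole incident lasted 75 minutes.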


Over the past 18 months, Cloudflare has been restructuring its busiest data centers to achieve a more flexible and resilient network architecture. Internally, this architecture is called Multi-Colo POP (MCP), and it currently covers 19 data centers, including Tokyo, Singapore … etc.

The most important part of the new architecture is that the network is designed as a Clos network. Its multi-tier design produces a mesh-like set of connections between devices. This makes it easier to maintain and adjust parts of the network in the future: individual network devices can be taken out of service and worked on without affecting the overall network (the article has a diagram of the structure).
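The maintenance property described above can be sketched with a toy model. This is a simplified two-tier (leaf and spine) fabric, not Cloudflare's actual topology; the device names and sizes are made up for illustration. It shows that when every leaf connects to every spine, removing one spine still leaves all devices reachable.

```python
from collections import deque


def build_leaf_spine(num_spines, num_leaves):
    """Return an adjacency map in which every leaf connects to every spine."""
    spines = [f"spine{i}" for i in range(num_spines)]
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    graph = {node: set() for node in spines + leaves}
    for leaf in leaves:
        for spine in spines:
            graph[leaf].add(spine)
            graph[spine].add(leaf)
    return graph


def remove_node(graph, node):
    """Return a copy of the graph with one device taken out for maintenance."""
    return {n: peers - {node} for n, peers in graph.items() if n != node}


def is_connected(graph):
    """Breadth-first search: True if every device can reach every other one."""
    if not graph:
        return True
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for peer in graph[current]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return len(seen) == len(graph)


# Hypothetical fabric: 4 spines, 8 leaves.
fabric = build_leaf_spine(4, 8)
after_maintenance = remove_node(fabric, "spine0")

print(is_connected(fabric))             # full fabric
print(is_connected(after_maintenance))  # one spine removed
```

Each leaf loses only one of its four uplinks, so the fabric keeps working at reduced capacity while the device is being serviced.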


The problem this time was mainly related to BGP. During a change to Cloudflare’s BGP configuration, the routes for some subnets were not advertised correctly, so traffic for those subnets could no longer be forwarded, which in turn caused problems across the whole network.

The article covers the BGP details in more depth. Readers who are familiar with BGP may want to take a moment to look at it.
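This summary does not spell out the exact mechanism, so the following is only a generic illustration of how a routing-policy change can stop prefixes from being advertised. It assumes a policy made of ordered terms where the first matching term decides the outcome. The term names and prefixes are hypothetical (the prefixes come from the documentation ranges), and this is not Cloudflare's real configuration.

```python
import ipaddress


def evaluate(policy, prefix):
    """Return the action of the first term that matches the prefix."""
    network = ipaddress.ip_network(prefix)
    for _name, covering, action in policy:
        # A term with covering=None matches every prefix.
        if covering is None or network.subnet_of(ipaddress.ip_network(covering)):
            return action
    return "reject"  # nothing matched: do not advertise


def advertised(policy, prefixes):
    """Return the prefixes that the policy would advertise to BGP peers."""
    return [p for p in prefixes if evaluate(policy, p) == "accept"]


# Hypothetical terms: (name, covering prefix or None, action).
accept_site_local = ("advertise-site-local", "198.51.100.0/24", "accept")
reject_the_rest = ("reject-the-rest", None, "reject")

intended_policy = [accept_site_local, reject_the_rest]
reordered_policy = [reject_the_rest, accept_site_local]

site_prefixes = ["198.51.100.0/25", "198.51.100.128/25"]

print(advertised(intended_policy, site_prefixes))
print(advertised(reordered_policy, site_prefixes))
```

With the intended order both prefixes are advertised. With the catch-all reject moved to the front, nothing is advertised, even though no individual term was changed.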


Because the incident had such a wide impact, Cloudflare reviewed the causes from the following three angles:


Process: Although the purpose of the new MCP architecture is to provide better and stronger availability, the procedure for upgrading from the old architecture to the new one is still not perfect. The rollout does not actually touch the new MCP architecture until its last step, so an error introduced earlier in the update only shows up once it hits the MCP data centers, where the impact is largest.
The improvement is that future processes and automation must include more testing against the MCP architecture, to ensure that deployments do not produce unexpected results.
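One way to picture the process gap is a check on a staged rollout plan. The sketch below is an assumption about how such a check could look, not Cloudflare's tooling; the site names and the plan format (a list of stages, each a list of sites) are hypothetical. It flags any plan in which MCP sites first appear in the final stage.

```python
def first_stage_with(stages, sites_of_interest):
    """Return the 1-based index of the first stage containing any given site."""
    for index, stage in enumerate(stages, start=1):
        if sites_of_interest & set(stage):
            return index
    return None


def validates_mcp_early(stages, mcp_sites):
    """True if at least one MCP site is exercised before the final stage."""
    first = first_stage_with(stages, mcp_sites)
    return first is not None and first < len(stages)


# Hypothetical site names.
mcp_sites = {"mcp-1", "mcp-2"}

# MCP sites only appear in the last stage: errors surface too late.
risky_plan = [["small-1"], ["small-2", "small-3"], ["mcp-1", "mcp-2"]]

# One MCP site is included early, before the change goes everywhere.
safer_plan = [["small-1"], ["mcp-1"], ["small-2", "small-3", "mcp-2"]]

print(validates_mcp_early(risky_plan, mcp_sites))
print(validates_mcp_early(safer_plan, mcp_sites))
```

A check like this could run before a change is scheduled, rejecting plans that never test the MCP architecture until the end.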


Architecture: The incorrect router configuration prevented the correct routes from being announced, so network packets could not reach these data centers as expected.
The repair therefore consisted of finding these incorrect settings and correcting them, so that BGP would announce the correct routes again.


Automation: Many parts of the current automation can be improved, and these improvements could fully or partially mitigate the impact when a problem like this occurs.
Improving the automation has two goals:

  1. Reduce the scope of impact when problems occur
  2. Reduce repair time when problems occur
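This summary does not say which automation changes are planned, so the following is only a sketch of one general pattern that serves the second goal: apply a change, run a health check, and restore the previous configuration automatically if the check fails. The configuration format and the health check are hypothetical.

```python
def apply_with_rollback(current_config, new_config, health_check):
    """Keep new_config only if health_check passes; otherwise restore the old one."""
    if health_check(new_config):
        return new_config, "committed"
    return current_config, "rolled back"


# Hypothetical health check: every required prefix must still be advertised.
REQUIRED_PREFIXES = {"198.51.100.0/24"}


def prefixes_still_advertised(config):
    return REQUIRED_PREFIXES <= config["advertised"]


running = {"version": 1, "advertised": {"198.51.100.0/24", "203.0.113.0/24"}}
good_change = {"version": 2, "advertised": {"198.51.100.0/24"}}
bad_change = {"version": 2, "advertised": set()}

active, status = apply_with_rollback(running, good_change, prefixes_still_advertised)
print(active["version"], status)

active, status = apply_with_rollback(running, bad_change, prefixes_still_advertised)
print(active["version"], status)
```

Because the rollback needs no human diagnosis, a bad change is undone in roughly the time it takes to run the health check.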

