Reading Notes: “Cloudflare 06/21 Post Disaster Report”

Original link: https://www.hwchiu.com/read-notes-62.html

Title: “Cloudflare 06/21 Disaster Report”
Category: networks
Link: https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/

The official Cloudflare article explains in detail what happened on 06/21/2022 and how users were affected.

The incident affected 19 of Cloudflare's data centers. Unfortunately, these 19 locations handle a large share of global traffic, so a great many users were affected.
The root cause was a change to network configuration (when something breaks, guess BGP first; if it is not BGP, guess DNS…). The incident itself did not last very long:

  1. 06:27 UTC Problem occurs
  2. 06:58 UTC First data center repaired and online
  3. 07:42 UTC All data centers are fixed and online
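
The durations follow directly from the three timestamps above. A minimal sketch of the arithmetic (the helper and variable names are my own, only the times come from the timeline):

```python
from datetime import datetime, timezone

def utc(hhmm: str) -> datetime:
    """Parse an HH:MM time on 2022-06-21 as a UTC datetime."""
    parsed = datetime.strptime(f"2022-06-21 {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)

def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two datetimes."""
    return int((end - start).total_seconds() // 60)

# Timestamps taken from the timeline in the notes.
incident_start = utc("06:27")
first_restored = utc("06:58")
all_restored = utc("07:42")

time_to_first_recovery = minutes_between(incident_start, first_restored)
total_outage = minutes_between(incident_start, all_restored)

print(f"first data center back after {time_to_first_recovery} minutes")
print(f"all data centers back after {total_outage} minutes")
```

So the first data center was back in about half an hour, and the whole incident lasted an hour and a quarter.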

Background

Over the past 18 months, Cloudflare has been converting its busiest data centers to a more flexible and resilient network architecture. Internally this architecture is called Multi-Colo POP (MCP), and 19 data centers have been converted to it, including Tokyo and Singapore.

The most important part of the new architecture is that the network is built as a Clos network. Its multi-tier design produces mesh-like connectivity between devices. This makes it easier to maintain and adjust parts of the network later: individual network devices can be taken out of service and worked on without affecting the network as a whole (the original article includes a diagram of the architecture).
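
To make the maintenance property concrete, here is a small sketch of a two-tier leaf-spine fabric, which is one common form of Clos network. The tier sizes and device names are hypothetical and are not taken from Cloudflare's actual topology. It only shows that removing one spine device leaves every remaining device reachable.

```python
from collections import deque

def build_leaf_spine(num_spines: int, num_leaves: int) -> dict[str, set[str]]:
    """Build an adjacency map where every leaf connects to every spine."""
    spines = [f"spine{i}" for i in range(num_spines)]
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    graph: dict[str, set[str]] = {node: set() for node in spines + leaves}
    for leaf in leaves:
        for spine in spines:
            graph[leaf].add(spine)
            graph[spine].add(leaf)
    return graph

def remove_node(graph: dict[str, set[str]], node: str) -> dict[str, set[str]]:
    """Return a copy of the graph with one device taken out of service."""
    return {n: peers - {node} for n, peers in graph.items() if n != node}

def is_connected(graph: dict[str, set[str]]) -> bool:
    """Breadth-first search: can every device reach every other device?"""
    if not graph:
        return True
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for peer in graph[current]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return len(seen) == len(graph)

# Hypothetical fabric: 4 spines, 6 leaves.
fabric = build_leaf_spine(num_spines=4, num_leaves=6)
after_maintenance = remove_node(fabric, "spine0")

print(is_connected(fabric))             # full fabric
print(is_connected(after_maintenance))  # one spine drained for maintenance
```

Because each leaf has a link to every spine, draining a single spine only reduces capacity; it does not partition the fabric.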

Problem

The problem was mainly related to BGP. During a BGP configuration update at Cloudflare, some subnets (prefixes) were no longer advertised correctly, so traffic for those subnets could not be forwarded, which in turn caused the wider network outage.

The original article covers the BGP details in more depth. Readers familiar with BGP may want to take a moment to read it.
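
For readers less familiar with BGP, the following toy model shows how a policy change can cause prefixes to stop being advertised. It is an illustration only: the prefixes (taken from the documentation address ranges), the term names, and the first-match-wins evaluation are assumptions made for the sketch, not Cloudflare's actual configuration.

```python
from ipaddress import ip_network

# Hypothetical prefixes, using address ranges reserved for documentation.
INTERNAL = {ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")}
ANYCAST = {ip_network("203.0.113.0/24")}
ALL_PREFIXES = INTERNAL | ANYCAST

# A policy term is (name, match function, action). Names are hypothetical.
accept_internal = ("accept-internal", lambda p: p in INTERNAL, "accept")
accept_anycast = ("accept-anycast", lambda p: p in ANYCAST, "accept")
reject_rest = ("reject-rest", lambda p: True, "reject")

def evaluate(policy, prefix):
    """First matching term decides; anything unmatched is rejected."""
    for _name, matches, action in policy:
        if matches(prefix):
            return action
    return "reject"

def advertised(policy, prefixes):
    """Set of prefixes the export policy allows to be advertised."""
    return {p for p in prefixes if evaluate(policy, p) == "accept"}

# Before the change: both accept terms come before the catch-all reject.
policy_before = [accept_internal, accept_anycast, reject_rest]
# After the change: the catch-all reject now shadows one accept term.
policy_after = [accept_anycast, reject_rest, accept_internal]

withdrawn = advertised(policy_before, ALL_PREFIXES) - advertised(policy_after, ALL_PREFIXES)
for prefix in sorted(withdrawn):
    print(f"no longer advertised: {prefix}")
```

In this model the terms themselves are unchanged; only their order differs, yet two prefixes silently drop out of the advertised set. Comparing the advertised set before and after a change, as the last lines do, is one way such a regression could be caught before deployment.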

Reflection

The incident had a wide impact, and Cloudflare reviewed its causes from the following three angles:

Process

Although the new MCP architecture is meant to provide better and stronger availability, the update process for these data centers was still imperfect. The rollout did not touch any MCP data center until the final step, so an error introduced earlier in the process only became visible once the MCP data centers themselves broke.
The improvement is that future procedures and automation must include more testing against the MCP architecture, so that a deployment does not produce unexpected results.
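
The gap described above can be expressed as a simple check on a staged rollout plan. This is a sketch under my own assumptions: the plan structure, site names, and the rule "each architecture must appear before the final stage" are hypothetical and only illustrate the idea of testing the MCP architecture earlier.

```python
def first_stage_with(plan, architecture):
    """Index of the first stage containing a site of the given architecture."""
    for index, stage in enumerate(plan):
        if any(site["arch"] == architecture for site in stage):
            return index
    return None

def covered_before_final_stage(plan, architecture):
    """True if the architecture is exercised before the last rollout stage."""
    index = first_stage_with(plan, architecture)
    return index is not None and index < len(plan) - 1

# Hypothetical plan resembling the incident: MCP sites only in the last stage.
original_plan = [
    [{"name": "legacy-a", "arch": "legacy"}],
    [{"name": "legacy-b", "arch": "legacy"}, {"name": "legacy-c", "arch": "legacy"}],
    [{"name": "mcp-a", "arch": "mcp"}, {"name": "mcp-b", "arch": "mcp"}],
]

# Hypothetical improved plan: one MCP site acts as an early canary.
improved_plan = [
    [{"name": "legacy-a", "arch": "legacy"}, {"name": "mcp-a", "arch": "mcp"}],
    [{"name": "legacy-b", "arch": "legacy"}, {"name": "legacy-c", "arch": "legacy"}],
    [{"name": "mcp-b", "arch": "mcp"}],
]

print(covered_before_final_stage(original_plan, "mcp"))
print(covered_before_final_stage(improved_plan, "mcp"))
```

With the first plan, a change that only breaks MCP sites passes every early stage and then hits all MCP sites at once. With the second, the same error would surface in the first stage on a single site.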

Architecture

A misconfiguration on the routers prevented the correct routes from being advertised, so packets could not reach these data centers as expected.
The fix was therefore to find and correct the faulty settings so that BGP would advertise the correct routes again.

Automation

Many parts of the current automation can be improved, and those improvements could fully or partially mitigate the impact of a problem like this when it occurs.
Improving automation is meant to achieve two goals:

  1. Reduce the scope of impact when problems occur
  2. Reduce repair time when problems occur

Conclusion

If your CDN stops working, check the community first to see whether your peers are also complaining; that will usually tell you whether the problem is on your side.

This article is reprinted from: https://www.hwchiu.com/read-notes-62.html
This site is for inclusion only, and the copyright belongs to the original author.
