Post-mortem of the Bilibili outage: 2021.07.13, this is how we went down

Source | Reprinted with permission from Bilibili Technology Public Account

The darkest hour

At 22:52 on July 13, 2021, SRE received a flood of alerts that services and the domain-name access layer were unavailable, customer service began receiving large numbers of reports that Bilibili could not be used, and colleagues inside the company reported that the site would not open and even the app home page failed to load. Based on the alert content, SRE immediately suspected a problem with infrastructure such as the data center, the network, the Layer 4 LB, or the Layer 7 SLB, launched an emergency voice conference, and pulled in the relevant people from each team to begin incident handling (for readability, the handling process below is partially simplified).

Locating the cause

22:55 Colleagues working from home logged in to the VPN but could not log in to the intranet authentication system (Bilibili's internal systems use unified authentication; a login session must be obtained before any other internal system can be accessed). As a result they could not open the internal systems, and could not check monitoring and logs in time to locate the problem.

22:57 The on-call SRE in the office (who did not need the VPN or a fresh login to the intranet authentication system) found that the CPU of the Layer 7 SLB (based on OpenResty) in the primary data center for online business was at 100% and it could not process user requests, while the other infrastructure teams reported no problems. The fault was confirmed to be in the Layer 7 SLB at the access layer, and the service layers below the SLB were ruled out.

23:07 The colleagues working from home reached the people responsible for the VPN and the intranet authentication system and learned that they could log in to the intranet systems through a green channel.

23:17 The relevant colleagues logged in to the intranet systems one after another through the green channel and began to help handle the problem. At this point the core people for handling the incident (Layer 7 SLB, Layer 4 LB, CDN) were all in place.

Stopping the loss

23:20 The SLB operators found that there had been a traffic burst at the time of the failure and suspected the SLB had become unavailable due to overload. Because the primary data center's SLB carries all online services and a reload did not bring it back, we dropped user traffic and cold-restarted the SLB. After the cold restart the CPU was still at 100% and it did not recover.

23:22 User feedback suggested that services in the multi-active data center were also unavailable. The SLB operators found that a large number of requests on the multi-active data center's SLB had timed out, although its CPU was not overloaded. They prepared to restart the multi-active SLB first to try to stop the bleeding.

23:23 Colleagues in the internal group reported that the main-site services had recovered. Monitoring of the multi-active data center's SLB showed request timeouts dropping sharply and the business success rate recovering to above 50%. At this point the core functions of the multi-active businesses were basically back to normal, such as app recommendations, app playback, comment and danmaku loading, the feed, series following, and film and TV. Non-multi-active services had not yet recovered.

23:25 – 23:55 There was no other immediately effective mitigation for the unrecovered businesses, so we tried to restore the SLB in the primary data center:

  • Perf showed the SLB's CPU hotspots concentrated in Lua functions, so we suspected recently deployed Lua code and began rolling it back.

  • SLB had recently worked with the security team to launch a self-developed Lua WAF. Suspecting the CPU hotspot was related, we removed the WAF and restarted the SLB; it did not recover.

  • Two weeks earlier, SLB had optimized Nginx's retry logic in the balance_by_lua phase to avoid retrying against the last unavailable node, with a loop of up to 10 iterations (a rough sketch of this kind of retry logic follows this list). Suspecting a performance hotspot there, we rolled it back and restarted the SLB; it did not recover.

  • A week earlier, SLB had started a grayscale rollout of HTTP/2 support. We removed the HTTP/2-related configuration and restarted the SLB; it did not recover.
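For context on the retry item above: retry handling in the balancer_by_lua phase is normally built on OpenResty's ngx.balancer API. The following is a rough, hypothetical sketch of that kind of logic, not the actual SLB code; the picker object and `ngx.ctx.last_peer` bookkeeping are illustrative assumptions.

```lua
local balancer = require("ngx.balancer")

-- Hypothetical picker: choose a peer, skipping the address that just failed,
-- with a bounded loop similar to the "up to 10 iterations" described above.
local function pick_peer(picker, last_failed)
    for _ = 1, 10 do
        local peer = picker:find()
        if peer ~= last_failed then
            return peer
        end
    end
    return nil   -- give up; every attempt landed on the failed node
end

-- Inside balancer_by_lua_block (sketch):
--   local state = balancer.get_last_failure()   -- nil on the first attempt
--   if not state then
--       balancer.set_more_tries(1)              -- let Nginx retry once on failure
--   end
--   local peer = pick_peer(picker, ngx.ctx.last_peer)
--   ngx.ctx.last_peer = peer
--   local ip, port = peer:match("^(.+):(%d+)$")
--   balancer.set_current_peer(ip, tonumber(port))
```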

Building a new origin SLB cluster

00:00 After rolling back the relevant configuration, the SLB operators still could not restore the SLB, and decided to build a brand-new SLB cluster so that the CDN could redirect the public traffic of the affected businesses to it, letting us see through traffic isolation whether the businesses would recover.

00:20 Initialization of the new SLB cluster was complete; configuration of the Layer 4 LB and the public IPs began.

01:00 Initialization and testing of the new SLB cluster were fully complete, and the CDN began switching traffic over, with business SRE colleagues assisting with the cut-over. Meanwhile the SLB operators continued to investigate the 100% CPU problem.

01:18 Live-streaming traffic was switched to the new SLB cluster, and the live-streaming service returned to normal.

01:40 Core businesses such as the main site, e-commerce, comics, and payments were switched to the new SLB cluster one after another and recovered.

01:50 By this time, essentially all online services had been restored.

Restoring the SLB

01:00 Once the new SLB cluster had been built, the SLB operators went back to analyzing the cause of the 100% CPU while the traffic cut-over and mitigation continued.

01:10 – 01:27 Using a Lua profiling tool, we captured a detailed flame graph and analyzed it. The CPU hotspot was clearly concentrated in calls into the lua-resty-balancer module. Tracing from the SLB's traffic-entry logic down into that module, we found several functions that might contain the hotspot.

01:28 – 01:38 We selected one SLB node, added debug logging to the candidate hotspot functions, and restarted it to observe what those functions were doing.

01:39 – 01:58 Analysis of the debug logs showed that the _gcd function in the lua-resty-balancer module returned an unexpected value, nan, after one particular invocation, and we found the trigger condition: a container IP with weight = 0.

01:59 – 02:06 We suspected that the _gcd function was triggering a bug in the JIT compiler, causing a runtime error that fell into an infinite loop and drove the SLB's CPU to 100%. Temporary workaround: disable JIT compilation globally.

02:07 The SLB operators changed the SLB cluster configuration to disable JIT compilation and restarted the processes in batches. All SLB CPUs returned to normal and requests were processed normally. A core file from a process in the abnormal state was kept for later analysis.
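For reference, disabling LuaJIT's JIT compiler globally in OpenResty is a one-line change in the init phase. This is a minimal sketch; the exact configuration used that night may differ.

```nginx
# nginx.conf, http {} level — illustrative sketch only
init_by_lua_block {
    -- Turn off LuaJIT's JIT compiler for all workers; the Lua code still runs,
    -- just interpreted. This was the temporary mitigation attempted here.
    jit.off()
}
```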

02:31 – 03:50 The SLB operators changed the configuration of the other SLB clusters to temporarily disable JIT compilation as well, to head off the same risk there.

Root cause localization

11:40 (the following day) The bug was successfully reproduced in an offline environment, and we found that the problem still occurred even with JIT compilation turned off in the SLB. At the same time we further narrowed down the trigger: under one particular release mode of a service, a container instance's weight becomes 0.

12:30 After internal discussion we concluded that the problem had not been fully solved and the SLB was still at serious risk. To prevent a recurrence, we decided: the release platform would prohibit this release mode, and the SLB would ignore the weight returned by the registry and force its own fixed weights.

13:24 The release platform prohibited this release mode.

14:06 The SLB's Lua code was changed to ignore the weight returned by the registry.
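A minimal sketch of the kind of defensive handling this implies (the helper name and default value are hypothetical, not the actual patch): coerce the registry weight to a number and fall back to a fixed weight, so a string or zero value can never reach the balancer.

```lua
-- Hypothetical guard in the SLB's service-discovery sync code.
local DEFAULT_WEIGHT = 100

local function sanitize_weight(raw)
    local w = tonumber(raw)        -- "0" -> 0, nil/garbage -> nil
    if not w or w <= 0 then
        return DEFAULT_WEIGHT      -- ignore the registry value, force a fixed weight
    end
    return w
end

-- When building the node table:
--   nodes[addr] = sanitize_weight(instance.weight)
```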

14:30 The SLB change was released and verified in the UAT environment; repeated checks confirmed that node weights were as expected and the problem no longer occurred.

15:00 – 20:00 All production SLB clusters were upgraded gradually via grayscale until the full upgrade was complete.

Cause analysis

Background

Bilibili migrated from Tengine to OpenResty in September 2019 and, building on its rich Lua capabilities, developed a service-discovery module that synchronizes service registration information from our self-developed registry into Nginx shared memory. When the SLB forwards a request, it uses Lua to select a node from shared memory, relying on OpenResty's lua-resty-balancer module. By the time of the failure it had been running stably for almost two years.
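For readers unfamiliar with this pattern, here is a minimal, hypothetical sketch of what node selection in the balancer_by_lua phase typically looks like with lua-resty-balancer. Everything other than the OpenResty and lua-resty APIs (the `my_slb.discovery` module, `get_nodes`, `ngx.ctx.service`) is illustrative, not Bilibili's actual code.

```lua
-- Illustrative sketch only. Assumes nodes have already been synced from the
-- registry into memory as a table of "ip:port" -> weight.
local balancer   = require("ngx.balancer")       -- OpenResty dynamic balancer API
local roundrobin = require("resty.roundrobin")   -- from lua-resty-balancer

local _M = {}

-- Called from balancer_by_lua_block in the upstream definition.
function _M.balance()
    -- Hypothetical discovery helper; ngx.ctx.service is assumed to be set
    -- in an earlier (access) phase.
    local nodes = require("my_slb.discovery").get_nodes(ngx.ctx.service)

    -- Real code would cache the picker per service instead of rebuilding it
    -- on every request; kept simple here. Weights are reduced internally
    -- with a gcd helper, which is where this incident's bug lived.
    local rr = roundrobin:new(nodes)
    local server = rr:find()                     -- e.g. "10.23.4.5:8080"

    local ip, port = server:match("^(.+):(%d+)$")
    local ok, err = balancer.set_current_peer(ip, tonumber(port))
    if not ok then
        ngx.log(ngx.ERR, "failed to set current peer: ", err)
        return ngx.exit(500)
    end
end

return _M
```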

In the two months before the failure, some businesses asked to adjust SLB weights dynamically by changing a service's weight in the registry, to get finer-grained grayscale control. The SLB team evaluated the requirement, concluded it could be supported, and rolled it out via grayscale after development was complete.

Trigger

  • In one particular release mode, an application instance's weight is temporarily adjusted to 0. In that case the weight the registry returns to the SLB is the string "0". This release mode is used only in the production environment, and very rarely, so the problem was not triggered during the SLB's earlier grayscale rollout.

  • In the balance_by_lua phase, the SLB passes the service IP, port, and weight stored in shared memory as parameters to the lua-resty-balancer module to select an upstream server. When a node's weight is "0", the parameter b received by the module's _gcd function can be "0".
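To make the trigger concrete, this is roughly the shape of the input under which the bug fired (the addresses are made up; only the string weight matters):

```lua
-- Hypothetical node table handed to lua-resty-balancer for one service.
-- The instance being released comes back from the registry with the
-- *string* "0" as its weight:
local nodes = {
    ["10.23.4.5:8080"] = 100,
    ["10.23.4.6:8080"] = 100,
    ["10.23.4.7:8080"] = "0",   -- release mode: weight temporarily set to 0
}

local roundrobin = require("resty.roundrobin")
local rr = roundrobin:new(nodes)   -- weight reduction inside calls the _gcd helper,
                                   -- which is where the infinite loop begins
```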

Root cause

  • Lua is a dynamically typed language; in common practice variables are not declared with types, they are simply assigned values.

  • When Lua performs arithmetic on a numeric string, it tries to convert the string to a number.

  • In Lua, the arithmetic operation n % 0 evaluates to nan (Not a Number).

  • The _gcd function does not validate the types of its parameters, so b can be passed in as the string "0". Because "0" ~= 0, the first call returns _gcd("0", nan). If the integer 0 had been passed in, the [if b == 0] branch would have been taken and there would have been no infinite loop.

  • When _gcd("0", nan) executes, it in turn returns _gcd(nan, nan), and from then on the Nginx worker is stuck in an infinite loop with the process CPU at 100%.
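To see exactly how the loop arises, here is a simplified reconstruction of the gcd helper (paraphrased, not the verbatim lua-resty-balancer source) together with a trace of the normal and the faulty case:

```lua
-- Simplified reconstruction of the gcd helper in lua-resty-balancer.
local function _gcd(a, b)
    if b == 0 then          -- "0" == 0 is FALSE in Lua: no coercion in comparisons
        return a
    end
    return _gcd(b, a % b)   -- a % "0" coerces "0" to 0, and x % 0 is nan
end

-- Trace with a numeric weight: terminates normally.
--   _gcd(100, 0)    -> b == 0, returns 100
--
-- Trace with the string weight "0": never terminates.
--   _gcd(100, "0")  -> "0" ~= 0, recurse with (b, a % b) = ("0", 100 % 0) = ("0", nan)
--   _gcd("0", nan)  -> nan ~= 0, recurse with (nan, "0" % nan)            = (nan, nan)
--   _gcd(nan, nan)  -> nan ~= 0 and nan % nan == nan, so it recurses forever.
-- The recursion is a tail call, so the stack never overflows; the Nginx worker
-- simply spins at 100% CPU inside this loop.
```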

Problem analysis

Why couldn't colleagues working from home access the internal systems over the VPN?

In the review we found that when a user logs in to the intranet authentication system, the authentication system sets login cookies by redirecting across several domain names. One of those domain names was proxied by the faulty SLB; because of the SLB failure that domain could not process requests, and the login failed.

After the incident, we sorted out the access paths of the office-network systems and separated them from the user-facing paths, so the office-network paths no longer depend on the user access paths.

Why were the multi-active data center's services affected as well?

When the multi-active SLB came under pressure, CDN back-to-origin retries plus user retries pushed traffic to more than 4x normal and the connection count up roughly 100-fold, to the 10-million level, which overloaded that group of SLBs. It recovered gradually once traffic fell off and the SLBs were restarted. This SLB cluster normally runs at about 30% CPU at evening peak, which leaves less than 2x headroom. Had the multi-active SLB had enough capacity, it could in theory have absorbed the burst and the multi-active businesses would have recovered immediately. This also shows that in a data-center-level failure, multi-active is the fastest way to fail a business over and stop the loss, and it became a key focus of our post-incident work.

Why did we build a new origin SLB after the rollbacks failed, instead of continuing both in parallel?

The SLB team was small at the time: one platform developer and one component operator. During the incident other colleagues helped, but core changes to the SLB component still had to be executed or reviewed by the component operator, so the two work streams could not run in parallel.

As for why standing up the new origin took as long as it did: in our public-network architecture, three teams are involved:

  • SLB team: select the SLB machines, initialize the machines, initialize the SLB configuration

  • Layer 4 LB team: configure the Layer 4 LB and public IPs for the SLB

  • CDN team: update the CDN back-to-origin public IPs and switch CDN traffic

In the SLB runbook, only SLB machine initialization and configuration initialization had ever been rehearsed; there had been no full-link drill covering the Layer 4 LB public IP configuration and the coordination with the CDN, and metadata was not shared between the platforms, so things like the Layer 4 LB's real-server list, the public-network carrier lines, and the CDN back-to-origin IP updates had to be communicated manually. A complete new origin therefore takes a long time. Automating and linking up this chain became a key optimization direction after the incident; today a new cluster can be created, initialized, and have its Layer 4 LB public IPs configured within 5 minutes.

Later root-cause work proved that disabling JIT compilation did not actually fix the problem. How, then, did the faulty SLB recover that night?

That night we had established that the trigger was a container IP with weight = "0". The application in question was released again at 01:45, which removed the weight = "0" trigger. So although disabling JIT was in fact ineffective, the trigger had already disappeared, and the SLB returned to normal after the restart.

Even if the trigger had not disappeared and the SLB had not recovered after JIT was disabled, the trigger we had located (a container IP with weight = 0) would still have pointed us to the service and its release mode, and from there quickly to the root cause.

Optimization and improvement

This incident produced many optimizations and improvements on both the technical and the management side. Here we list only the core technical optimization directions defined at the time.

1. Multi-active construction

At 23:23, the core functions of the businesses that had already gone multi-active were basically back to normal, such as app recommendations, app playback, comment and danmaku loading, the feed, series following, and film and TV. The live-streaming business had also done multi-active work, but it was not restored promptly that night because, although the live mobile home-page API was multi-active, multi-data-center scheduling had not been configured for it. As a result the live app's home page could not open while the primary data center's SLB was down, which was a real pity. Through this incident we found some serious problems with our multi-active architecture:

Insufficient multi-active infrastructure capabilities

  • The relationship between data centers and the multi-active placement of businesses was muddled.

  • CDN multi-data-center traffic scheduling did not support fixed routing and sharding by user attributes.

  • The multi-active business architecture did not support writes, so write functionality could not be restored at the time.

  • Some storage components lacked adequate multi-active synchronization and switchover capabilities and could not go multi-active.

No platform-level management of multi-active business metadata

  • Which businesses are multi-active?

  • What kind of multi-active is each business: intra-city active-active, or remote unitization?

  • Which of a business's URL rules support multi-active, and what is the current multi-active traffic-scheduling policy?

  • At the time this information could only be maintained ad hoc in documents; there was no unified management or orchestration on a platform.

Weak failover capability

  • Multi-active cut-over depended on the CDN engineers to execute; no one else had the permissions, so it was slow.

  • There was no cut-over management platform, and the cut-over process was completely opaque.

  • The access layer and the storage layer were handled in isolation, so a cut-over could not be orchestrated end to end.

  • Without multi-active business metadata, cut-over accuracy and the failover outcome were poor.

Our multi-active cut-overs before this often looked like: business A fails and needs to be cut to the multi-active data center; SRE confirms with R&D that domain A plus URL rule A should be switched and tells the CDN operators; after the CDN operators cut the traffic, R&D discovers another URL that was not switched, and the whole process repeats. Efficiency was extremely low and the failover outcome was poor.

The main directions of our multi-active work are therefore:

Multi-active infrastructure capability building

  • Improve the supporting capabilities of the basic multi-active components, such as optimizing the data-layer synchronization components and supporting user-based sharding at the access layer, so that the cost for a business to adopt multi-active is lower.

  • Re-define the role of each data center under the multi-active architecture, and sort the business domains into Czone, Gzone, and Rzone.

  • Push multi-active adoption for core businesses that do not yet support it, and architecture cleanup for businesses that are multi-active but not built to the standard.

Stronger multi-active management and control

  • Manage the metadata and routing rules of all multi-active businesses in one place, linked with the other platforms, to become the multi-active metadata center.

  • Support scheduling of multi-active access-layer rules, data-layer scheduling, runbook scheduling, and traffic scheduling, with an automated and visualized onboarding process.

  • Abstract the multi-active cut-over capability and integrate with the CDN, storage, and other components to achieve one-click full-link cut-over, improving both efficiency and accuracy.

  • Support capacity pre-checks before a multi-active cut-over, risk inspection during the cut-over, and observability of core metrics.

2. SLB governance

Architecture Governance

  • Before the failure, a single SLB cluster in a data center proxied all external traffic, so fault domains could not be isolated. Going forward, SLB clusters will be split by business unit, with core business units getting their own SLB clusters and public IPs.

  • Work with the CDN team and the Layer 4 LB & network team to settle on a management plan for isolating SLB clusters and public IPs.

  • Clarify the boundary of the SLB's responsibilities: capabilities that are not essential to the SLB should move into the API gateway and will no longer be supported by the SLB component and platform, for example the dynamic-weight grayscale capability.

Operations capabilities

  • The SLB management platform implements version management for the Lua code, with platform support for version upgrades and fast rollback.

  • The environment and configuration of SLB nodes are initialized by and hosted on the platform, which integrates with the Layer 4 LB's API so that Layer 4 LB provisioning, public IP allocation, node onboarding, and similar operations can all be done on the SLB platform, with the whole initialization completed within 5 minutes.

  • The SLB is on the critical path of core services yet has no elastic scaling, and 30% utilization is too high; it needs to be expanded so that CPU usage drops to around 15%.

  • Optimize the CDN back-to-origin timeouts to reduce the SLB connection count in extreme failure scenarios, and stress-test the connection-count limits.
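For illustration only, and assuming an nginx-style proxy at the CDN edge (the real CDN configuration is vendor-specific and not described in the article), the back-to-origin knobs in question look roughly like this:

```nginx
# Illustrative sketch — not Bilibili's CDN configuration.
location / {
    proxy_pass https://origin_slb;

    proxy_connect_timeout       2s;   # fail fast instead of letting connections pile up
    proxy_read_timeout          5s;
    proxy_next_upstream_tries   2;    # cap retries so a struggling origin isn't hammered
    proxy_next_upstream_timeout 6s;
}
```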

In-house development capability

  • There is a drawback to an operations team running this kind of development project: after development and self-testing, it goes straight to grayscale with no professional test team involved. This component is far too critical for that, so a basic-components testing team needs to be brought in to run complete abnormal-input testing against the SLB.

  • Together with the community, review the source code of the OpenResty core open-source libraries we use to eliminate other risks, and harden our own Lua code against Lua's known characteristics and pitfalls, for example with variable type checks and explicit coercion.

  • Recruit people who specialize in load balancers. We chose to build on Lua because it is easy to use and the community has similar success stories, but the team has no senior Nginx component developers and no C/C++ developers.

3. Fault drills

In this incident, multi-active traffic scheduling, the speed of standing up a new origin, the speed of CDN cut-over, and the back-to-origin timeout mechanism all fell short of expectations. We will therefore explore data-center-level fault drills:

  • Simulate a single-data-center failure for CDN back-to-origin and, together with business development and QA, verify the failover of multi-active businesses against the real end-user experience, fixing in advance any place where multi-active falls short of expectations.

  • Route a grayscale slice of specific users' traffic to the drill CDN nodes, simulate an origin failure at those nodes, and observe how the CDN and the origin behave.

  • Simulate a single-data-center failure and rehearse the multi-active cut-over and loss-stopping runbooks through the multi-active management platform.

4. Emergency Response

Bilibili has never had a NOC / technical support team. In an emergency, the SRE handling the fault is also responsible for incident response, reporting, and coordination. That is workable for an ordinary incident, but in a major incident there is no time to keep everyone informed. The emergency response mechanism therefore has to be improved:

  • Improve the incident response framework, clarifying the responsibilities of the incident commander and the incident handler so the handler is not carrying everything alone.

  • When an incident occurs, the handler immediately pulls in their backup as incident commander, responsible for notification and coordination. Make this mandatory within the team until it becomes a habit.

  • Build an easy-to-use incident notification platform for entering the incident summary and keeping progress updates synchronized.

The trigger for this failure was a service using a special release mode. Our event-analysis platform currently only supports application-oriented event queries and lacks event analysis oriented toward users, platforms, and components:

  • Work with the monitoring team to build control-plane event reporting for platforms, and push more core platforms to adopt it.

  • Build data-plane change-event reporting and querying for the SLB's underlying engine, so that when service registration information changes, an application's IP updates and weight changes can be queried on the platform.

  • Extend event query and analysis beyond the application-oriented view, adding views by user, by team, and by platform, to help locate fault causes quickly.

Summary

When this incident happened, "Bilibili is down" shot straight to the top of trending searches across the internet; the pressure on us as engineers can be imagined. The incident has happened, and all we can do is reflect deeply, learn the lessons, distill the experience, and keep moving forward.

As the first article in the "713 incident" series, this post briefly covers the failure's trigger, root cause, handling process, and the resulting optimizations and improvements. Later articles will describe in detail how we implemented and landed those optimizations after the "713 incident", so stay tuned.

Finally, one more thing: the multi-active, high-availability disaster-recovery architecture really did prove its worth.

The text and pictures in this article are from InfoQ


This article is reprinted from https://www.techug.com/post/resumption-of-station-b-downtime-accident-on-july-13-2021-we-collapsed-like-this15d48879c7d4fec00481/
This site is for inclusion only, and the copyright belongs to the original author.
