Tech Talk · Cloud Technology | How to Ensure High Reliability of Key Basic Components?

On April 14th, Marshall, a reliability technology expert at Sangfor Cloud, gave a talk titled "Technical Analysis of the Reliability of Key Basic Components" in Sangfor Cloud's "Tech Talk · Cloud Technology" live series, covering the common problems of IT systems, the impact of physical failures on business reliability, and how software-defined approaches can mitigate hardware failures. The following is a summary of the content he shared. To learn more, you can click "Read the original text" to watch the live replay.

1. Definition and Objectives of Reliability

Reliability means that the system does not crash, restart, or lose data unexpectedly. A reliable system must be able to self-repair faults and, for faults that cannot be self-repaired, isolate them as far as possible so that the rest of the system keeps running normally. Simply put, the goal of reliability is to reduce business interruption time caused by failures (product quality, external components, environment, human factors, etc.).

High reliability can be understood on three levels: first, the system always runs normally without failure, which requires improving the quality of hardware R&D; second, failures occur but do not affect the business; third, failures affect the business but recovery is fast. The latter two levels can be addressed with a "software-defined" approach to avoid business interruptions caused by hardware failures.

Reliability starts with understanding the key foundational components of a server. According to industry server statistics, hardware component problems are concentrated in memory, hard disks, CPUs, motherboards, power supplies, and network cards. In a cloud environment, several virtual machines belonging to different services and scenarios may run on the same server; once the physical device crashes, many users are affected and the operator itself suffers huge losses. Among the existing failure modes, memory and hard disk failures are the most frequent and the most serious.

The following two cases illustrate memory and hard disk failures.

Case 1: A memory UCE error causes the server to repeatedly crash and restart. After the server went down and restarted, logging in to the server's BMC management interface and querying the alarm information showed the following alarm: "2019-07-25 08:03:06 memory has an uncorrectable error." Further inspection of the hardware error log file revealed that DIMM020 had a large number of memory CE errors and some memory UCE errors. It can be concluded that the UCE errors on the DIMM020 memory module caused the server to crash and restart.

Case 2: A stuck/slow disk causes a big data cluster failure. A slow disk failure occurred on a cluster node of a big data platform (the system executes the iostat command every second to monitor disk I/O metrics; if svctm exceeds 100 ms more than 30 times within 60 s, the disk is considered faulty and an alert is raised). First ZooKeeper failed and the cluster balance state became abnormal; then other services on the same node also failed; finally all services on the entire node failed and restarted automatically, and the node repeated this cycle every 3-10 minutes. If no other problem is found and the system is simply restarted, the business is interrupted for more than ten minutes.
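As a rough illustration of the detection rule quoted in this case (svctm sampled once per second, the disk flagged when it exceeds 100 ms more than 30 times within 60 s), a minimal Python sketch might look like the one below. The iostat invocation, column parsing, and device name are assumptions for illustration; newer sysstat releases no longer report svctm at all, so this is not the platform's actual detector.

```python
import subprocess
import time
from collections import deque

SVCTM_LIMIT_MS = 100.0   # per-sample threshold from the case description
WINDOW_SECONDS = 60      # sliding-window length
MAX_VIOLATIONS = 30      # violations per window before the disk is flagged

def sample_svctm(device: str) -> float:
    """Return one interval sample of svctm (ms) for `device` via iostat.

    `iostat -dx 1 2` prints a since-boot report followed by a one-second
    interval report; we parse the second one. Column positions depend on
    the sysstat version, so this parsing is illustrative only.
    """
    out = subprocess.run(["iostat", "-dx", "1", "2"],
                         capture_output=True, text=True, check=True).stdout
    header = [l for l in out.splitlines() if l.startswith("Device")][-1].split()
    rows = [l.split() for l in out.splitlines()
            if l.split() and l.split()[0] == device]
    return float(rows[-1][header.index("svctm")])   # last report = interval sample

def monitor(device: str) -> None:
    samples = deque()                                # (timestamp, violated?) pairs
    while True:
        violated = sample_svctm(device) > SVCTM_LIMIT_MS
        now = time.time()
        samples.append((now, violated))
        while samples and now - samples[0][0] > WINDOW_SECONDS:
            samples.popleft()                        # keep only the last 60 s
        if sum(1 for _, v in samples if v) > MAX_VIOLATIONS:
            print(f"ALERT: {device} looks like a slow disk")
            samples.clear()

if __name__ == "__main__":
    monitor("sda")   # hypothetical device under observation
```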

2. Memory Reliability Technology

Externally, a memory module consists of the PCB, the gold fingers, the memory chips, the retention-clip notches, and so on. Internally, it is organized into storage cells (Cell), storage arrays (Bank), chips (devices), ranks, DIMMs, channels, and so on.

Given this structure, advances in memory technology (process shrinkage and higher frequencies) tend to bring a higher failure rate.

(1) Challenges brought by process shrinkage

(1) Lithography is more susceptible to diffraction, focusing, and similar effects that degrade quality.

(2) Epitaxial growth (EPI) is more prone to growth defects and short circuits between adjacent epitaxial structures.

(3) The impact of particles introduced by processes such as etching and cleaning is aggravated.

(4) Each die becomes smaller, so the number of dies per wafer increases.

(5) Future multi-die back-end packaging based on TSV (through-silicon via) will be more difficult, raising the failure rate.

(2) Challenges brought by frequency increase

(1) High-speed signal timing margins shrink, making compatibility problems more prominent.

(2) Signal attenuation is more severe; DDR5 adds DFE (decision feedback equalization) circuits, complicating the design.

(3) Higher frequencies bring higher power consumption and stricter power-integrity (PI) requirements.

Memory faults fall into two categories according to whether they can be corrected. CE (Correctable Error) is the general term for single-bit errors, and some multi-bit errors within a single granularity, that can be corrected. UE (Uncorrectable Error) is the general term for errors that cannot be corrected; some UE errors crash the system because the OS cannot handle them.
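As a concrete reference point, Linux reports these two categories through the EDAC subsystem, which exposes per-memory-controller corrected (CE) and uncorrected (UE) error counters in sysfs. The sketch below simply polls those counters; it assumes an EDAC driver is loaded, and the exact layout below the controller directories varies by platform.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")   # memory-controller instances

def read_edac_counters() -> dict:
    """Return {"mc0": {"ce_count": n, "ue_count": n}, ...} from EDAC sysfs."""
    counters = {}
    for mc in sorted(EDAC_ROOT.glob("mc[0-9]*")):
        counters[mc.name] = {
            # ce_count: corrected errors (CE); ue_count: uncorrected errors (UE)
            name: int((mc / name).read_text())
            for name in ("ce_count", "ue_count")
        }
    return counters

if __name__ == "__main__":
    for mc, counts in read_edac_counters().items():
        print(mc, counts)
```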

The causes of memory failure include: cell charge leakage, high impedance in the data transmission path, abnormal memory supply voltage, abnormal internal timing, abnormal internal operations (such as self-refresh), abnormalities in the bit lines/word lines or the address decoding circuit, soft failures caused by weak cells (normally usable), and cosmic rays or radioactivity (no permanent damage; repeated tests fail to reproduce the error).

Faults are handled hierarchically, and the industry has two approaches: software-led and hardware-led. The hardware-led view selects high-quality components during device selection; in addition, the hardware itself provides some reliability features, such as automatically correcting relatively simple errors.

But hardware alone cannot be made fully reliable, so software must do part of the work. The software-defined approach isolates the faulty memory region so that it is no longer used and therefore has no impact on the business.

If a CE (correctable error) is left unhandled, it may develop into an uncorrectable UE error. It is therefore necessary to nip problems in the bud: when a CE occurs, further processing is required to isolate the suspected fault.
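As an illustration of "isolating on suspicion", one possible policy is to count CEs per physical page within a time window and flag a page once the count crosses a limit. The thresholds and window below are hypothetical and only sketch the idea; they are not Sangfor Cloud's actual values.

```python
import time
from collections import defaultdict, deque

CE_LIMIT = 10                 # hypothetical: CEs tolerated per page in the window
WINDOW_SECONDS = 24 * 3600    # hypothetical: 24-hour observation window

class CePolicy:
    """Track correctable errors per physical page and flag pages for isolation."""

    def __init__(self) -> None:
        self._events = defaultdict(deque)   # page frame number -> CE timestamps

    def record_ce(self, page_frame: int) -> bool:
        """Record one CE on `page_frame`; return True if it should be isolated."""
        now = time.time()
        events = self._events[page_frame]
        events.append(now)
        # drop CEs that fell out of the observation window
        while events and now - events[0] > WINDOW_SECONDS:
            events.popleft()
        return len(events) >= CE_LIMIT
```

A real implementation would feed this policy from EDAC or mcelog events and hand flagged pages to the isolation step described next.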

Sangfor Cloud's Design Ideas for Memory CE Fault Isolation

When the memory hardware raises a CE interrupt, check whether the affected memory can be isolated (that is, it is not in use by the operating system kernel or by peripherals). If it can, add it to the isolation whitelist; the memory isolation function then switches the faulty memory page to a normal memory page, after which the faulty page is isolated and no longer used.

At the same time, detailed information such as the location and frequency of these faults is raised as alarms to help O&M staff replace the faulty memory module. For memory that cannot be isolated online, the error region is recorded before the restart; when the system reboots, and before that memory is put into use, the suspect region is isolated so that the memory the system actually uses contains no problematic part.


↑ The overall architecture of the memory CE fault isolation scheme
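As a rough illustration of the isolation step itself, Linux kernels built with memory-failure handling expose a soft page-offline interface in sysfs: writing a physical address to /sys/devices/system/memory/soft_offline_page asks the kernel to migrate the page's contents to a healthy page and stop using the faulty one. The sketch below is only an approximation of that mechanism, not Sangfor Cloud's implementation.

```python
from pathlib import Path

SOFT_OFFLINE = Path("/sys/devices/system/memory/soft_offline_page")
PAGE_SIZE = 4096   # assumption: 4 KiB pages

def soft_offline_page(phys_addr: int) -> None:
    """Ask the kernel to migrate and retire the page containing `phys_addr`.

    Requires root and a kernel with memory-failure handling enabled; the
    kernel copies the page's contents elsewhere first, so running
    workloads are not interrupted.
    """
    page_addr = phys_addr & ~(PAGE_SIZE - 1)   # align to the page boundary
    SOFT_OFFLINE.write_text(f"0x{page_addr:x}")

if __name__ == "__main__":
    soft_offline_page(0x12345000)   # hypothetical faulty physical address
```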

After Sangfor Cloud deployed this solution, statistics from production environments showed an average isolation success rate of 96.93%. Compared with the common industry approach of simply shielding CEs, which can neither isolate a CE in time nor locate the faulty memory module after an error, Sangfor Cloud's solution has a clear lead, and the company has applied for 5 patents in this field. The isolation scheme adds little CPU and memory overhead, and the effect is obvious.

For memory UE failures, Sangfor Cloud's design idea is to address UE recoverability and early warning: some UE events that would otherwise crash the system are downgraded to killing the corresponding application, and in some cases only the bad pages need to be isolated, avoiding downtime altogether and improving system stability and reliability. This improves memory failure recovery capability by at least 30%. Sangfor Cloud's solution reaches a memory UE failure recovery rate of 60%, better than the publicly available industry figure of about 50%. In actual POC tests it also outperforms the general industry solution (which, for example, goes down, produces no memory fault alarm log, and cannot locate the slot of the faulty memory module).


↑ The overall architecture of the memory UE fault isolation scheme
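The downgrade idea can be pictured as a small escalation ladder: isolate only the bad page when no live data is lost, kill the affected application when only its private data is hit, and go down only when kernel memory is involved. The sketch below is purely illustrative; the categories and actions are assumptions, not the actual recovery logic.

```python
from enum import Enum, auto

class PageKind(Enum):
    FREE_OR_CLEAN_CACHE = auto()   # page holds no unique data
    USER_ANONYMOUS = auto()        # private data of a single application
    KERNEL = auto()                # used by the kernel itself

def handle_ue(page_kind: PageKind) -> str:
    """Return the mildest recovery action that still contains the UE."""
    if page_kind is PageKind.FREE_OR_CLEAN_CACHE:
        return "isolate the bad page; no process is affected"
    if page_kind is PageKind.USER_ANONYMOUS:
        return "kill the owning application, then isolate the page"
    return "unrecoverable: kernel memory was hit, the system must go down"
```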

3. Hard Disk Reliability Technology

Hard disks mainly include system disks, cache disks, and data disks. The system disk generally uses a solid-state drive (SSD) to store the cloud platform system software and host OS, as well as related logs and configuration. Cache disks also generally use SSDs, exploiting their speed as a caching layer that accelerates read and write I/O; they store the data that user services access most frequently, known as hot data. Data disks generally use mechanical hard disks (HDDs), whose high capacity suits them as the final storage location of data (such as the virtual disks of virtual machines).
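When building such a tiered layout automatically, a common trick on Linux is to read each block device's rotational flag in sysfs (1 for spinning HDDs, 0 for SSDs). The sketch below shows only that classification step; the tier assignment is an assumed example rather than the platform's real placement logic.

```python
from pathlib import Path

def classify_disks() -> dict:
    """Map block devices to a tier: SSDs -> cache/system, HDDs -> data."""
    tiers = {}
    for dev in Path("/sys/block").iterdir():
        rotational = (dev / "queue" / "rotational").read_text().strip()
        tiers[dev.name] = "data (HDD)" if rotational == "1" else "cache/system (SSD)"
    return tiers

if __name__ == "__main__":
    for dev, tier in sorted(classify_disks().items()):
        print(f"{dev}: {tier}")
```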

(1) Top hard disk failure modes/categories:

Stuck: hard disk I/O temporarily or permanently stops responding;

Slow disk: hard disk I/O is noticeably slow or stalls intermittently;

Bad sectors: the logical unit (sector) of the hard disk is damaged;

Bad blocks: the physical unit (block) of the hard disk is damaged;

Insufficient lifespan: the mechanical hard disk is physically worn out, or the flash cells of the solid-state drive have reached their erase/write cycle limit.

When a hard disk's input/output (I/O) response time grows, or the disk gets stuck and does not return, the user's business keeps slowing down or even hangs; a stuck hard disk can even interrupt all of the system's business.

As service life increases, so does the probability of bad sectors, head degradation, and other problems. From the distribution of historical issues and the industry's hard disk reliability failure curve, it is clear that stuck and slow disks are becoming one of the most serious problems affecting stable system operation.


↑ The overall architecture of Sangfor Cloud's stuck/slow disk solution

(2) Sangfor Cloud's ideas for the stuck/slow disk solution:

1. Because stuck/slow disk failure modes are complex, detection and diagnosis are carried out across multiple dimensions. The solution uses common Linux tools and information rather than vendor-specific hardware tools, combining kernel log analysis, SMART information analysis, disk I/O monitoring data analysis, and more to accurately locate faulty hard disks.

2. A multi-level isolation algorithm balances the trade-off between business continuity and data safety when handling stuck/slow disks (a simplified sketch of this decision appears after this list). ① Mild slow disk: no isolation; the user is notified by a page alarm. ② Severe slow disk: favour the business; if the peer end is abnormal, the disk is not isolated and the user is notified by a page alarm. ③ Stuck disk (occasional): not permanently isolated; the user is notified by a page alarm. ④ Stuck disk (frequent): favour the data; if 3 exceptions occur within one hour, the disk is permanently isolated.

3. Threshold tuning is performed on top of the multi-level isolation algorithm: a large number of real stuck/slow disks and data collected from the user side are used to set more accurate detection thresholds, and fault injection tools are used to verify them.
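A simplified sketch of the multi-level isolation decision from point 2 above might look like the following; the severity levels follow the list, but the structure, thresholds, and return messages are illustrative assumptions only.

```python
import time
from enum import Enum, auto

class Severity(Enum):
    MILD_SLOW = auto()     # mild slow disk
    SEVERE_SLOW = auto()   # severe slow disk
    STUCK = auto()         # disk stuck, I/O not responding

STUCK_LIMIT = 3            # stuck events within one hour before permanent isolation
STUCK_WINDOW = 3600.0      # seconds

class IsolationPolicy:
    """Illustrative multi-level isolation decision for a single disk."""

    def __init__(self) -> None:
        self._stuck_events = []   # timestamps of stuck events

    def decide(self, severity: Severity, peer_abnormal: bool) -> str:
        now = time.time()
        if severity is Severity.MILD_SLOW:
            return "no isolation; alarm on the management page"
        if severity is Severity.SEVERE_SLOW:
            # favour the business: never isolate while the replica peer is unhealthy
            if peer_abnormal:
                return "no isolation (peer abnormal); alarm only"
            return "isolate the disk; alarm on the management page"
        # stuck disk: count occurrences within a sliding one-hour window
        self._stuck_events = [t for t in self._stuck_events if now - t <= STUCK_WINDOW]
        self._stuck_events.append(now)
        if len(self._stuck_events) >= STUCK_LIMIT:
            return "permanent isolation; favour data safety and replace the disk"
        return "no permanent isolation; alarm on the management page"
```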

With the stuck/slow disk function enabled, isolation is triggered within 1 minute, the virtual machine does not trigger HA, and service I/O returns to stability after isolation.

The above is the main content of this live broadcast. Readers interested in cloud computing can follow the "Sangfor Technology" official account to review this live broadcast and learn more about cloud computing.

Leifeng Network

This article is reproduced from: https://www.leiphone.com/category/industrynews/XouJJ6iy6jTcDt4C.html
This site is for inclusion only, and the copyright belongs to the original author.
