Original link: https://xiaket.github.io/2022/notes-on-release-it.html
Release It! is a book I have been wanting to review since I finished it at the beginning of the year. Beyond general reasoning and key takeaways, it also walks through a number of production-failure stories. Below are some reading notes and incident reports, offered in the spirit of learning from other people's mistakes.
Case: Airline boarding pass system failure
The airline's architecture was simple: user requests hit a load balancer, then a pool of application servers, then a primary/replica pair of database servers. Monitoring could fail over the database automatically and replace faulty application servers. One night, an operator performed a routine database failover. Requests returned to normal afterwards, and after watching for a while without seeing anything unusual, the operator went to bed. Two hours later, however, the service hung, taking down the airline's boarding-pass service across the United States. A programmer restarted the application servers and the service recovered; total downtime was about three hours.
Analysis traced the cause to the database-connection handling code. On a normal query error, the exception handler stopped the current query and closed the connection. But the same exception could also be raised on a path the programmer had not anticipated (during a database primary/replica switch), and on that path the connection was never released properly. Over time every connection in the pool was poisoned this way, and the application servers could no longer query the database at all.
The lesson: a service should expect its dependencies to fail and protect itself accordingly. The simplest measure is to put a timeout on every external call so that nothing waits forever; adding message queues where appropriate also helps.
Case: Database connection loss failure
In a three-tier application, each application server created a connection pool at startup, with every connection going to the database. Every day at around five o'clock the application would hang; a restart fixed it until the same time the next day. The network path from the application servers to the database checked out fine.
The cause: a firewall sat between the application servers and the database, configured to silently drop connections that had been idle for an hour. During quiet periods only one or two connections in the pool were actually alive, and when the five-o'clock traffic arrived, those few live connections could not carry the load and the service fell over. Because the firewall dropped connections unilaterally, neither endpoint received any packet, so both sides still believed the connections were alive and healthy. Writing on such a dead connection only produced a "connection broken" error after a 20-30 minute timeout, and reading on one would block forever.
The fix used Oracle's dead connection detection feature, which periodically sends heartbeat packets to check that the client is still working. As a side effect, each heartbeat resets the connection's last-activity timer on the firewall, preventing the idle connection from being dropped.
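Oracle's dead connection detection is a proprietary feature, but the same idea can be approximated at the transport layer with TCP keepalive probes. A hedged sketch, assuming Linux (the socket option names and the chosen intervals are illustrative, not from the book):

```python
import socket

def enable_keepalive(sock: socket.socket,
                     idle_s: int = 600,
                     interval_s: int = 60,
                     probes: int = 5) -> None:
    """Turn on TCP keepalive so a firewall-dropped connection is
    detected in bounded time, and so the periodic probes refresh the
    firewall's idle timer, keeping a healthy connection from being
    silently discarded."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):  # Linux-specific option names
        # Start probing after idle_s of silence, one probe every
        # interval_s, declare the peer dead after `probes` failures.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

For keepalive to actually save you here, the probe interval must be shorter than the firewall's idle timeout (one hour in this story).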
Reddit scaling failure
The official incident report is here . During a service migration, to keep autoscaling from thrashing the cluster mid-migration, an engineer manually disabled the autoscaling system. A later configuration-management run detected the manual change and re-enabled autoscaling, which brought the system down.
The author argues that an autoscaling system should carry restrictions to avoid this kind of backstab:
- If the autoscaler sees more than 80% of the fleet as unhealthy, the more likely explanation is that the observer itself is broken
- Scale up fast, scale down slowly: starting an extra machine is much safer than shutting one down
- When the autoscaler's desired state differs too much from the current state, require human confirmation before acting
- There should be a hard upper bound on fleet size
- Scaling up should have a cooldown period, to avoid over-provisioning due to the lag between issuing a scale-up command and the new machines actually becoming available
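The guardrails above can be sketched as a single decision function wrapped around the raw autoscaler output. All thresholds here are illustrative placeholders, not values from the book or from Reddit's system:

```python
MAX_SIZE = 100            # hard ceiling: never scale beyond this
COOLDOWN_S = 300          # let the last action settle before acting again
SUSPECT_FAILURE_RATIO = 0.8   # >80% "unhealthy" => distrust the monitor
BIG_JUMP = 0.5            # >50% change requires a human sign-off

def plan_scaling(current: int, desired: int, failed: int,
                 last_action_ts: float, now: float) -> tuple:
    """Filter a raw autoscaling decision through the guardrails.

    Returns (action, target): 'hold' does nothing, 'confirm' asks a
    human, 'scale' proceeds to the given fleet size.
    """
    # 1. If most of the fleet looks down, the observer is probably broken.
    if current and failed / current > SUSPECT_FAILURE_RATIO:
        return ("hold", current)
    # 2. Cooldown: new machines take time to become available.
    if now - last_action_ts < COOLDOWN_S:
        return ("hold", current)
    # 3. Suspiciously large swings need human confirmation.
    if current and abs(desired - current) / current > BIG_JUMP:
        return ("confirm", desired)
    # 4. Never exceed the hard cap.
    desired = min(desired, MAX_SIZE)
    # 5. Scale up freely, but shrink only one machine at a time.
    if desired < current:
        desired = current - 1
    return ("scale", desired)
```

Each branch maps to one bullet above; the ordering matters, since distrusting a broken monitor must come before acting on its numbers.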
Sudden row-count explosion in a data table
A large three-tier application: one day the logic servers started reporting all kinds of failures, and after each restart they would hang again while loading their caches. It turned out that a small lookup table that normally held a handful of rows suddenly contained ten million, and the loading code, written without any limit, ran the servers out of memory. The fault itself is unremarkable; the author's investigation was methodical, and much of the detail is Java-specific, so I'll omit it. The author attributes the failure to the lack of a sensible bound on database query results. This is not unique to databases: any API without well-designed limits invites the same kind of suicidal query.
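As an illustration of bounding query results, a sketch using sqlite3; the `airports` table name and the limit value are hypothetical, chosen just to show the pattern of fetching one row more than the cap so that overflow is detected:

```python
import sqlite3

FETCH_LIMIT = 1000  # hard cap: never load an unbounded table into memory

def load_lookup_table(conn: sqlite3.Connection) -> dict:
    """Load a 'small' lookup table defensively.

    Fetch at most FETCH_LIMIT + 1 rows; if we get more than
    FETCH_LIMIT back, the table is no longer small, so refuse to
    cache it instead of OOMing the whole application server.
    """
    rows = conn.execute(
        "SELECT id, name FROM airports LIMIT ?", (FETCH_LIMIT + 1,)
    ).fetchall()
    if len(rows) > FETCH_LIMIT:
        raise RuntimeError("airports table exceeds cache limit")
    return {row_id: name for row_id, name in rows}
```

Failing loudly at load time turns a mysterious recurring OOM into a single clear error message pointing at the offending table.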
Although the author does not mention it, I think the absence of a change-management system was also a direct reason nobody had a clue where the failure came from. A database change like this immediately breaks the loading logic of any freshly started logic server; had the cause been located quickly, the losses from the outage could have been reduced.
AWS S3 failure
The incident report is here . When this incident happened I thought the report was worth rereading; I read it several times then, and several more now, years later. I won't rehash the specifics, which have been widely discussed, except for one point on which I agree with the author: in the entire report, human error is never mentioned once, only system error, even though the root cause was in fact a mistaken human operation. This is not about shielding anyone from blame; the report instead examines the system as a whole: the operator, the API exposed by the playbook script, and how the system itself behaved when it received bad input.
Ambiguous log events
Thank goodness this one is finally not a fault. The author once chatted with an ops engineer who, on receiving an alert log line ("Data channel lifetime limit reached. Reset required."), went and performed a database primary/replica switch. The log line was in fact written by the author's own code; the intent was that an encrypted channel's key had been in use too long and needed a reset (which the service would perform by itself). The author points out that the log message should have made clear who was expected to perform the reset, which would largely have prevented the confusion. Worse, the line was debugging output the author had added just to see how often the reset happened at runtime, and then forgot to remove. My takeaway: before going to production, review all log output, remove what is unnecessary, and make sure every remaining line states unambiguously what is happening and avoids ambiguity about what the reader should do.
Of course, calling it "not a fault" is not entirely accurate either: that ops engineer had been doing a primary/replica switch once a week, during the busiest period, each time bringing a short outage.
An imperfect specification
The author describes a case where two teams on two continents, in order to collaborate, agreed on a well-established specification. In most projects the work would have ended there, but these teams went a step further and wrote a complete test suite against the specification. In the process, they found quite a few defects in the specification around boundary conditions. Once the entire suite passed, they had much better confidence in how the whole system would operate, and the launch went very smoothly.
Cost accounting for automated testing
An automated test pipeline brings many benefits I won't belabor; the most direct is avoiding the losses caused by a production outage. For example, suppose building the pipeline takes two programmers, each paid 10,000 yuan a month, for a total development cost of 20,000 yuan. Afterwards it needs only occasional maintenance, say three days of one such programmer's time per month, putting the monthly maintenance cost at roughly 2,000 yuan. As for running costs, if each run of the pipeline costs a few yuan and it runs on the order of 50 times a day, the monthly running cost is around 2,000 yuan. Now make a conservative assumption: the pipeline reduces annual outage time by 10 hours. Then, setting aside every other benefit, if the service earns 5,000 yuan per hour, a single year of avoided losses covers the running and maintenance costs. At that point, you should invest in automated testing.
In the real world things are rarely this clean-cut: tests are often run in other ways, and development and operating costs differ. But from a cost-accounting point of view, a rational manager should estimate the development and operating budget from revenue and the SLA/SLO, and allocate resources to maximize return.
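To make the accounting concrete, a toy break-even helper; every parameter here is a placeholder to be filled in with your own figures, not numbers endorsed by the book:

```python
def pipeline_breakeven(dev_cost: float,
                       monthly_maint: float,
                       per_run_cost: float,
                       runs_per_day: float,
                       revenue_per_hour: float,
                       outage_hours_saved_per_year: float) -> float:
    """First-year net benefit of a test pipeline.

    Positive means the avoided outage losses exceed the pipeline's
    development, maintenance, and running costs for the year.
    """
    annual_cost = (dev_cost
                   + 12 * monthly_maint
                   + per_run_cost * runs_per_day * 365)
    annual_saving = revenue_per_hour * outage_hours_saved_per_year
    return annual_saving - annual_cost
```

Plugging in your own SLA/SLO-derived numbers turns the "should we invest?" argument into a one-line calculation that can be revisited as costs change.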
How systems go down
The single sentence that struck me most in the whole book: every system that dies does so because a queue somewhere stopped working properly.
IO queues, TCP queues, thread pools, soft and hard limits, external queue services: if you can find the misbehaving queue and fix it, you can fix the outage.
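The standard defense is to make every queue bounded and give producers backpressure. A minimal Python sketch (the queue size and timeout are arbitrary illustrative values):

```python
import queue

# A bounded queue: when it fills up, producers find out immediately
# instead of growing an unbounded backlog until the process dies.
work = queue.Queue(maxsize=100)

def submit(item, timeout_s: float = 0.1) -> bool:
    """Try to enqueue an item; return False (shed load) if the queue
    stays full past the timeout, so the caller can reject or retry."""
    try:
        work.put(item, timeout=timeout_s)
        return True
    except queue.Full:
        return False
```

Rejecting work at the front door is ugly but survivable; an unbounded queue just converts overload into a memory-exhaustion crash later, when it is hardest to diagnose.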
A catalogue of TCP connection failures
Below are the TCP failure modes listed in the book; a service should be prepared to handle each as needed:
- The TCP connection can be refused
- The TCP connection may time out
- The peer may reply SYN/ACK and then never be heard from again
- The peer (or a middlebox such as the GFW) may send RST
- The peer may advertise a receive window but never send enough data
- The connection may be established, but the peer never sends any data
- The connection is established, but packet loss causes retransmission delays
- The connection is established, but the peer never ACKs received packets, causing endless retransmission
- The service receives the request and sends back the response headers, but never the body
- The service may send one byte of the response every thirty seconds
- The service may send back HTML instead of JSON
- The service may send back a response far larger than expected
- The service may reject authentication requests in various ways
This article is reprinted from: https://xiaket.github.io/2022/notes-on-release-it.html
This site is for inclusion only, and the copyright belongs to the original author.