Introduction: ARIES (short for Algorithm for Recovery and Isolation Exploiting Semantics) is a data recovery algorithm in the storage engine mentioned in the paper “ARIES: A Transaction Recovery Method Supporting Fine-Franularity Locking and Partial Rollbacks Using Write-Ahead Logging” . This paper can be said to be a must-read paper in the field of storage engine data recovery. The two weekly journals are illustrations of this paper. This is the first one.

Graphic ARIES Papers (Part 1)

Before explaining the principle of the ARIES algorithm, you need to have a certain understanding of the log system of the Page oriented storage engine before continuing to explain the recovery algorithm based on this log system.

question

In a storage system, errors are very common, which involves the need to continue to work when the system recovers after an error occurs, that is, the data cannot be damaged and the entire system cannot run.

Therefore, when the system fails and needs to be restarted and restored, the following two actions are involved:

Undo: A transaction that has not been completed or occurred for various reasons, and its modifications need to be undone, that is, rolled back to the old value before the transaction.
Redo: A transaction that has been committed, and the effect of its modification operation needs to be reflected in the new value.

Take a look at the questions posed in the figure below:

In the image above:

There are transactions T1 and T2 executing at the same time:
- Transaction T1: Modify the value of A to 3, but before the transaction is committed, transaction T2 begins to execute.
- Transaction T2: Modifies the value of B to 8 and commits successfully.
- Transaction T1 terminates: After transaction T2 successfully commits, transaction T1 terminates.

The execution order of this transaction schedule raises several questions:

What needs to be done to roll back the uncommitted transaction T1?
For the uncommitted transaction T1, is the modification operation allowed to take effect on the persistent storage (that is, modify A to 3)?
In the database file on the disk, whether the modification operation of the successfully submitted transaction T2 should be immediately placed on the disk (that is, synchronizing the modified content from the buffer pool to the hard disk).

The first question is put aside for now, let’s look at the next two questions.

“Whether the DBMS allows an uncommitted txn to overwrite the most recent committed value of an object in non-volatile storage” is called Steal policy :

steal: Allows the modification of uncommitted transactions to take effect on persistent storage.
no steal: vice versa.

Whether a transaction needs to synchronize all changes to persistent storage before committing (whether the DBMS requires that all updates made by a txn are reflected on non-volatile storage before the txn is allowed to commit.), there are also two strategies:

force: All modifications of the transaction must be synchronized to persistent storage before the transaction is allowed to commit.
no-force: vice versa.

Around these two different dimensions, there are different approaches.

Let’s follow the above definition and use the no-steal+force strategy to revisit the previous problem.

In the image above:

Transaction T1: Due to the no-steal strategy, for the transaction T1 that cannot be committed, the modification of the transaction will be in the buffer pool of the memory at most, and will not take effect on the disk, so the operation of rolling back the transaction only needs to The data in the buffer pool of the memory can be undone.
Transaction T2: Due to the force strategy, all modifications must be synchronized to the database file on disk before the transaction is committed.

As you can see, the strategy of using no-steal+force is relatively simple because:

The undo operation only needs to roll back the data in memory, because the database file has not been synchronized to the disk.
For committed transactions, there is no need for redo operations because all changes to the database file must be forced to be synchronized before the transaction is committed.

However, the problem with this combination of strategies is that it cannot support a transaction that modifies a large amount of data. For example, suppose that the buffer pool in memory can only accommodate a 1K modification buffer, and a transaction modifies a total of 2K data. At this time, should the data exceeding the buffer pool size be placed on the disk immediately? Following this strategy will not solve the problem.

Let’s look at the two most common practices:

Shadow page.
WAL (Write Ahead Log).

shadow page

The principle of shadow pages is this, maintaining two independent copies of database files:

Primary copy: saves only the modifications of committed transactions.
Shadow copy: Copy the content of the page to be modified and save the modifications of the uncommitted transaction.
With two different pieces of data, an ongoing transaction only needs to modify the shadow copy, and when an uncommitted transaction needs to be committed, switch the shadow copy to become the primary copy.

The following diagram explains how shadow pages work:

In the picture above:

Different types of pages are marked with three colors: main page, shadow page, and free page. The free page is the main page that has been replaced. Since a new main page has been generated, the previous main page has become Free pages can be reused the next time a shadow page is generated.
Before the modification, there are already 3 main pages, of which page 1 is the current root page.
Suppose a transaction needs to modify these three main pages at the same time. The system finds that there are no free pages currently available, so the database file is expanded to generate three new physical pages. The content of the original page is first copied, and then the transaction is modified on this basis.
When the transaction modification is completed, it is only necessary to modify the root page to the latest page 1, and several previous main pages involved in this modification become free pages, which can be recycled in the next modification.

According to the previous definitions of the two strategies, the shadow page belongs to the scheme that adopts the no-steal+force strategy:

no-steal: When it is not submitted, it cannot be directly modified to become the main page.
force: Before the transaction is committed, the modified shadow page must be switched to the new main page.

Let’s take a look at how the redo and undo of the shadow page are done:

Revocation: It is only necessary to change the physical pages that are allocated during the write transaction into free pages again, which can be recycled next time. Since there is no switching root page numbering, there is no need to do any other work.
Redo: There is no need to redo the operation, because as long as the operation of switching the shadow page to the main page is completed, it is equivalent to the modification of this transaction being visible.

Let’s take a look at the problem with shadow pages:

Since it is a “shadow”, it is required that these pages must copy the data of the original page before modification, which is very expensive.
For data structures like B-Tree, this modification will always be carried out from bottom to top, and the nodes passing through the path need a shadow page to store the modification.
The replaced main page will generate free pages, and the corresponding garbage collection strategy is required to recycle these free pages.
The page content will be fragmented, because the written shadow page may be a newly added physical page or a free page that is recycled. If the logically adjacent pages are physically scattered, it is not friendly to the read-write cache. .
Finally, when the data is placed on the shadow page, it is essentially a random write to the file, and the cost of random write is very high: it needs to address the specified location first, and then modify it.
At the same time, only one write transaction is allowed to be in progress, because if there are multiple write transactions, and different write transactions modify the same page, it cannot be processed.

Example using shadow pages

The implementation of boltdb and the journal implementation of sqlite were analyzed in previous blogs, both of which are examples of shadow pages:

Analysis of the implementation of boltdb 1.3.0 (1) : In boltdb, before modifying the page, first use the mmap mechanism to copy the page to be modified. After the modification is completed, the main page and the shadow page are switched uniformly, and the main page before the modification is set to idle at the same time The page is available for the next recycling.
sqlite3.36 version btree implementation (3) – journal file backup mechanism : sqlite’s journal file, responsible for backing up the data of the modified page before modification, these backed up data are useless after the transaction is committed, if the intermediate transaction is interrupted, the journal can be used directly Backup the contents of the file to restore the database.

If you are interested, you can learn more about the implementation of shadow pages in these two articles. However, the use of the shadow page mechanism is not much, and more is the WAL described below.

WAL (Write Ahead Log)

In addition to the shadow paging strategy, another strategy for saving changes is to use WAL (Write Ahead Log). Compared with the former, the biggest advantage of WAL is that since the changes are added to the log, it can be considered that the changes are saved, and in the file Tail append operations are relatively much faster than random writes.

For a write transaction, if the WAL mechanism is used, the format of the WAL log has the following three types (will be expanded later):

When the transaction begins, writing a BEGIN indicates the beginning of the write transaction.
Record the modification operation of the transaction: For each modification operation in the write transaction, the following four data must be recorded:
- Transaction ID, so that when multiple write transactions are performed at the same time, you can know which transaction corresponds to.
- Modified key value.
- Modifies the previous value, which is used when a transaction is aborted.
- The modified value is used when redoing the transaction.
When a transaction is committed, a COMMIT is written to indicate the end of the transaction.

The following figure demonstrates the use of the WAL strategy for the previous two write transactions:

In the image above:

At the same time, two write transactions are concurrently performed. The three types of logs of each write transaction need to be written to the buffer pool in memory. The format of the log has been given above.
For the files stored on the disk, in addition to the previous database file, a new log file called WAL is added. This file will not be written randomly, but will only be added to the end of the file in the order of the logs recorded in the buffer pool.
The operation log of a write transaction will first be recorded in the buffer pool, instead of being synchronized to the wal log immediately. The operation of synchronizing the log in the buffer pool to the wal can write multiple records in batches at one time, which further improves the write operation. s efficiency.
After the WAL strategy is adopted, the completion of a write operation does not require the modification to be synchronized to the database file, but only requires that all operations of the transaction be written to the WAL file.

It should be noted here that the two dimensions of steal and force mentioned above are related to whether data needs to be synchronized to persistent storage. However, in WAL, since a new WAL file is added, and the file itself is also stored in persistent storage, the scope of the concept of持久化存储file extends from only database files to database files and WAL files.

Therefore, from the above analysis of WAL, we can see that WAL is a strategy of steal+no-force :

steal: Allows the modification of uncommitted transactions to be synchronized to the WAL file on disk.
no-force: Before the transaction is committed, all modification operations must be written to the WAL of the disk before committing, but it is not required to fall into the database file.

Since the持久化存储includes the WAL file and the database file, it can be considered that the data at any time is: the数据库文件+ WAL中的已提交事务数据, so when querying a data, you need to query the two types of files in turn:

First go to the WAL to query the committed transaction data.
If the first step is not found, it means that there is no modification to the data in the WAL, so the data is queried in the database file.

From the above analysis of shadow pages and WAL, it can be seen from the dimension combination of steal and force strategies (a total of four combinations, see the figure below):

Runtime efficiency: steal + no-force is the fastest, and no-steal + force is the slowest.
Recovery efficiency: Compared with the runtime efficiency, the recovery efficiency is reversed. steal + no-force is the slowest, while no-steal + force is the fastest, because the former needs to perform undo and redo operations. The latter is not required.

Checkpoint

As mentioned earlier, for a database using the WAL strategy, the data at any time is:数据库文件+ WAL中的已提交事务数据.

However, it is impossible for the WAL file to keep growing. Otherwise, the file size will be too large, which will affect the read performance, and the data recovery time will be too long. Therefore, every once in a while, the committed transaction data in the WAL file needs to be placed on the database file. This process is called checkpoint .

In order to perform the checkpoint operation, a new type of checkpoint log needs to be added, combined with the aforementioned types of WAL logs, as follows:

When the transaction begins, writing a BEGIN indicates the beginning of the write transaction.
Record the modification operation of the transaction: For each modification operation in the write transaction, the following four data must be recorded:
- Transaction ID, so that when multiple write transactions are performed at the same time, you can know which transaction corresponds to.
- Modified key value.
- Modifies the previous value, which is used when a transaction is aborted.
- The modified value is used when redoing the transaction.
When a transaction is committed, a COMMIT is written to indicate the end of the transaction.
When doing checkpoint , write a CHECKPOINT to indicate that the checkpoint operation is done at this point.

Let’s see how to deal with the recovery problem after the checkpoint operation, as shown in the following figure:

In the image above:

After transaction T1 is committed, transactions T2 and T3 are started again. Before T2 and T3 are not committed, a checkpoint operation is started.
After the checkpoint completed, T2 completes the submission. But after T3 was modified, downtime occurred.

At this point, if data recovery is performed:

There is already a modification of transaction T1 in the database file. Because T1 is a transaction submitted before checkpoint , the modification of this transaction will be synchronized to the database file during checkpoint . However, the two transactions T2 and T3 that have not been completed at the time of checkpoint will not be synchronized to the database file at this time of checkpoint .
Since the COMMIT log of transaction T2 exists in the WAL, it means that transaction T2 has been committed, so when the system recovers:
- Find the location of the last checkpoint before the crash, and perform the following operations for the log after this checkpoint .
- For a transaction with a COMMIT log, such as transaction T2, just replay its WAL log.
- For transactions that do not have a COMMIT log, such as transaction T3, the transaction cannot be replayed because all modifications were not placed in the WAL before the crash.

In the checkpoint process here, there is an obvious problem: when the checkpoint is required, there must be no write transaction in progress. In other words, the checkpoint operation will affect the write transaction. Therefore, if the frequency of checkpoint is too high, it will affect the write transaction; if the frequency is too low, it will make a single checkpoint time very long.

The next issue of the weekly will introduce the idea of ARIES paper on this basis, and the optimization method of checkpoint is given in the paper.

Example using WAL

After the 3.7.0 version of sqlite, the WAL mechanism was adopted, and an article analysis was written before, see btree implementation of sqlite 3.36 version (4) – implementation of WAL

Summarize

According to the strategy of placing orders, it is divided into two dimensions: steal and force .
The shadow page adopts the no-steal+force strategy, while the WAL adopts the steal+no-force strategy.

Other recommendations

The Naval Collection

“Naval Collection” is a record of the proverbs of investor Naval Ravikant , who was first known because of his article “How to Get Rich Without Luck”, which was translated into Chinese by He Caitou:

英文版: Naval: How To Get Rich Without Luck
English version: Naval on Twitter: “How to Get Rich (without getting lucky):” / Twitter

Later, Eric Jorgenson organized Naval’s Twitter text into “Almanack of Naval Ravikant”, and the free electronic version is fully disclosed on the Internet: Almanack of Naval Ravikant

Recently, the Chinese version of the Naval Collection has also been translated and published: Naval Collection (Douban)

“C Language Programming Perspective”

There is not much syntactic sugar in C, and there are not many things to learn at the syntactic level, but the difficulty of C is mainly at the level of dealing with the system. This document focuses on the explanation of this aspect, which is a relatively rare document in this area. .

Introduction – C Programming Perspective

github address: https://ift.tt/tXfqAPU

“Rust, Databend and the Cloud Warehouse (5) from Git to Fuse Engine storage engine”

Brother Hu’s masterpiece Rust, Databend and the Cloud Warehouse (5) from Git to Fuse Engine storage engine [Brother Hu’s blog]

This “amateur” programmer used 50 sheets of 1080Ti to fight cancer.

An amateur programmer who uses artificial intelligence algorithms to identify tumors and admires his ability to act.

This “amateur” programmer used 50 sheets of 1080Ti to fight cancer.

The author’s original post on V2EX: Remember the AI breast platform I made 4 years ago? This is my new project – V2EX

The original post four years ago: I posted it in the D version, but because many friends can’t see the D version, I will put it here and talk about the project I recently made – V2EX

This article is reprinted from: https://www.codedump.info/post/20220514-weekly-15/
This site is for inclusion only, and the copyright belongs to the original author.

Weekly (Issue 15): Illustrated ARIES Papers (Part 1)

Graphic ARIES Papers (Part 1)

question

shadow page

Example using shadow pages

WAL (Write Ahead Log)

Checkpoint

Example using WAL

Summarize

Other recommendations

The Naval Collection

“C Language Programming Perspective”

“Rust, Databend and the Cloud Warehouse (5) from Git to Fuse Engine storage engine”

This “amateur” programmer used 50 sheets of 1080Ti to fight cancer.

Leave a Comment Cancel Reply