[CSDN Editor’s Note] The author of this article uses an example program to demonstrate the performance optimization process of reading and writing data through Linux pipes, increasing the throughput from the initial 3.5GiB/s to the final 65GiB/s. Even though it is just a small example, it touches on many topics, including zero-copy operations, ring buffers, paging and virtual memory, and synchronization overhead; in particular, concepts such as splicing, paging, and virtual memory address mapping in the Linux kernel are analyzed at the source level. From concepts to code, the article progresses step by step, layer by layer. Although it focuses on optimizing pipe read and write performance, developers of high-performance applications or of the Linux kernel will benefit a lot from it.
Original link: https://ift.tt/WEkI2Yw
Note: This article was translated by CSDN; unauthorized reprinting is prohibited.
Author | Francesco Translator | Wang Xueying
Produced | CSDN (ID: CSDNnews)
This article investigates how Unix pipes are implemented in Linux, by iteratively optimizing a test program that writes and reads data through a pipe.
We start with a simple program with a throughput of about 3.5GiB/s and gradually increase its performance twentyfold. The improvements are verified with Linux's perf profiling tool, and the code is available on GitHub (https://ift.tt/yXFEAhD).
Pipeline Tester Performance Graph
This article was inspired by reading a highly optimized FizzBuzz program, which on my laptop pushes output into a pipe at about 35GiB/s. Our first goal is to reach that speed, describing each step of the optimization process. Later we'll add a further improvement that FizzBuzz doesn't need, since its bottleneck is actually computing the output, not IO, at least on my machine.
We will proceed as follows:
- The first is a slow version of the pipe benchmark;
- Explain how the pipe is implemented internally, and why reading and writing through it is slow;
- Explain how the vmsplice and splice system calls can be used to bypass some (but not all!) of the slowness;
- Explain Linux paging, and use huge pages to implement a fast version;
- Replace waiting with busy looping for the final optimization;
- Summarize
Step 4 contains the most important Linux kernel internals of this article, so it may be of interest even if you are familiar with the other topics discussed here. For readers unfamiliar with the subject, we only assume basic knowledge of C.
The first slow version
Let’s first test the performance of the legendary FizzBuzz program, following the rules of the StackOverflow post:
% ./fizzbuzz | pv >/dev/null
422GiB 0:00:16 [36.2GiB/s]
pv stands for “pipe viewer” and is a handy tool for measuring the throughput of data flowing through a pipe. Shown is fizzbuzz producing output at 36GiB/s.
fizzbuzz writes output in chunks as large as the L2 cache to strike a good balance between cheap memory access and minimal IO overhead.
On my machine, the L2 cache is 256KiB. This article also outputs 256KiB blocks, but does not do any “calculation”. We essentially want to measure an upper bound on the throughput of a program writing to a pipe with a reasonable buffer size.
fizzbuzz uses pv to measure speed, and our setup will be slightly different: we’ll execute programs on both ends of the pipe, so we have full control over the code involved in pushing and pulling data from the pipe.
The code is available in my pipes-speed-test repo. write.cpp implements the writing and read.cpp the reading. write keeps writing the same 256KiB buffer over and over; read terminates after reading 10GiB of data and prints the throughput in GiB/s. Both executables accept various command line options to change their behavior.
The first test of reading and writing from the pipe will use the write and read system calls, using the same buffer size as fizzbuzz. The write-side code is shown below:
int main() {
  size_t buf_size = 1 << 18; // 256KiB
  char* buf = (char*) malloc(buf_size);
  memset((void*)buf, 'X', buf_size); // output Xs
  while (true) {
    size_t remaining = buf_size;
    while (remaining > 0) {
      // Keep invoking `write` until we've written the entirety
      // of the buffer. Remember that write returns how much
      // it could write into the destination -- in this case,
      // our pipe.
      ssize_t written = write(
        STDOUT_FILENO, buf + (buf_size - remaining), remaining
      );
      remaining -= written;
    }
  }
}
For the sake of brevity, all error checking is omitted in this and subsequent code snippets. In addition to ensuring that the output is printable, memset serves another purpose, which we will discuss later.
All the work is done by the write call; the rest just makes sure the entire buffer gets written. The read side is very similar: it reads data into buf and terminates once enough data has been read.
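For completeness, here is a minimal sketch of what such a read side could look like; this is hypothetical code, not the exact read.cpp (which also times the transfer and prints the throughput):

#include <stdlib.h>
#include <unistd.h>

int main() {
  size_t buf_size = 1 << 18; // 256KiB, same as the writer
  char* buf = (char*) malloc(buf_size);
  size_t total = (size_t)10 << 30; // stop after 10GiB
  size_t read_so_far = 0;
  while (read_so_far < total) {
    // read returns how much it could pull out of the pipe,
    // which may be less than buf_size.
    ssize_t got = read(STDIN_FILENO, buf, buf_size);
    if (got <= 0) break; // EOF or error -- error handling omitted
    read_so_far += got;
  }
  return 0;
}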
Once built, the code in the repository can be run as follows:
% ./write | ./read
3.7GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
We write the same 256KiB buffer filled with “X” 40960 times and measure the throughput. Annoyingly, it’s 10 times slower than fizzbuzz! And we are doing nothing but writing bytes to the pipe. It turns out that we can’t get much faster than this using write and read.
The problem with write
To find out what time is spent running the program, we can use perf:
% perf record -g sh -c './write | ./read'
3.2GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
[ perf record: Woken up 6 times to write data ]
[ perf record: Captured and wrote 2.851 MB perf.data (21201 samples) ]
The -g option instructs perf to log the call graph: this allows us to see where the time is spent from top to bottom.
We can use perf report to see where time is spent. Here’s a slightly edited snippet detailing the time spent writing:
% perf report -g --symbol-filter=write
- 48.05%  0.05%  write    libc-2.33.so  [.] __GI___libc_write
   - 48.04% __GI___libc_write
      - 47.69% entry_SYSCALL_64_after_hwframe
         - do_syscall_64
            - 47.54% ksys_write
               - 47.40% vfs_write
                  - 47.23% new_sync_write
                     - 47% pipe_write
                        + 24.08% copy_page_from_iter
                        + 11.76% __alloc_pages
                        + 4.32% schedule
                        + 2.98% __wake_up_common_lock
                          0.95% _raw_spin_lock_irq
                          0.74% alloc_pages
                          0.66% prepare_to_wait_event
47% of the time is spent on pipe_write, which is what write does when we write to the pipe. This is not surprising – we spent about half the time writing and the other half reading.
In pipe_write, 3/4 of the time is spent copying or allocating pages (copy_page_from_iter and __alloc_pages). This makes some sense if we already have an idea of how communication between the kernel and user space works. Either way, to fully understand what’s going on, we must first understand how pipes work.
What are pipes made of?
The data structures that define the pipes can be found in include/linux/pipe_fs_i.h, and operations on them are in fs/pipe.c.
A Linux pipe is a ring buffer that holds references to pages where data is written and read:
The ring buffer in the picture above has 8 slots, but it may have more or fewer, the default being 16. Each page is 4KiB on the x86-64 architecture, but may vary on other architectures. In total, the pipe in the picture can hold 32KiB of data. Here’s a key point: every pipe has an upper limit on the total amount of data it can hold before it fills up.
The shaded parts in the figure represent the current pipe data, and the unshaded parts represent free space in the pipe.
Somewhat counterintuitively, head tracks the write end of the pipe. That is, writers write into the buffer pointed to by head, incrementing head accordingly if they need to move on to the next buffer. In the write buffer, len stores how much has been written into it.
Conversely, tail tracks the read end of the pipe: readers start consuming the pipe from there. offset indicates where within that buffer to start reading.
Note that the tail can appear after the head, as in the image, since we are using a circular/ring buffer. Also note that when the pipe is not completely full, some slots may be unused – the NULL cells in the middle. If the pipe is full (no NULLs and no free space in the pages), write will block. If the pipe is empty (all NULLs), read will block.
The following is an abridged version of the C data structure in pipe_fs_i.h:
struct pipe_inode_info {
  unsigned int head;
  unsigned int tail;
  struct pipe_buffer *bufs;
};

struct pipe_buffer {
  struct page *page;
  unsigned int offset, len;
};
We’ve omitted many fields here, and haven’t explained what’s in the struct page, but it’s a key data structure for understanding how to read and write from a pipe.
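Since head and tail are free-running counters, checking whether the pipe is empty or full only requires comparing them. The following is a minimal sketch, similar in spirit to the helpers in pipe_fs_i.h, but not verbatim kernel code:

// head == tail means no data; the pipe is full once the number of used
// slots reaches the ring size (a power of two, so the counters can wrap).
static inline unsigned int pipe_occupancy(unsigned int head, unsigned int tail) {
  return head - tail;
}

static inline int pipe_empty(unsigned int head, unsigned int tail) {
  return head == tail;
}

static inline int pipe_full(unsigned int head, unsigned int tail, unsigned int limit) {
  return pipe_occupancy(head, tail) >= limit;
}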
read and write pipes
Now let’s go back to the definition of pipe_write and try to understand the perf output shown earlier. A brief description of how pipe_write works:
1. If the pipe is already full, wait for space and restart;
2. If the buffer currently pointed to by head has free space, fill that space first;
3. While there are free slots and remaining bytes to write, allocate new pages, fill them, and update head.
Action when writing to the pipe
The above operations are protected by a lock that pipe_write acquires and releases as needed.
pipe_read is a mirror image of pipe_write: pages are consumed from the pipe, released once they have been completely read, and tail is updated. A rough user-space model of the write path is sketched below.
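This is only an illustrative user-space model of the steps above, not the kernel's actual code (which lives in fs/pipe.c); the struct layout and the full/empty checks mimic the ring buffer described earlier, and a real writer would sleep instead of bailing out when the pipe is full:

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE  4096
#define RING_SLOTS 16

// User-space model of the pipe ring buffer (zero-initialize before use).
struct buf  { char *page; unsigned int offset, len; };
struct pipe { unsigned int head, tail; struct buf bufs[RING_SLOTS]; };

// Mirrors the pipe_write steps described above; returns bytes written,
// which may be less than len if the pipe fills up.
size_t model_pipe_write(struct pipe *p, const char *src, size_t len) {
  size_t written = 0;
  while (written < len) {
    struct buf *b = &p->bufs[(p->head - 1) % RING_SLOTS]; // last written slot
    if (p->head != p->tail && b->len < PAGE_SIZE) {
      // 2. the page at the write end still has room: top it up first
      size_t n = len - written;
      if (n > PAGE_SIZE - b->len) n = PAGE_SIZE - b->len;
      memcpy(b->page + b->offset + b->len, src + written, n);
      b->len += n;
      written += n;
    } else if (p->head - p->tail < RING_SLOTS) {
      // 3. there is a free slot: allocate a fresh page and advance head
      b = &p->bufs[p->head % RING_SLOTS];
      b->page = (char *) malloc(PAGE_SIZE);
      b->offset = b->len = 0;
      p->head++;
    } else {
      break; // 1. pipe full: a real writer would block here until space frees up
    }
  }
  return written;
}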
So the current process creates a rather unpleasant situation:
- Each page is copied twice, once from user memory into the kernel and once from the kernel back into user memory;
- Copying happens one 4KiB page at a time, interleaved with other work such as the synchronization between reads and writes, and page allocation and freeing;
- The memory being processed may not be contiguous, since new pages are constantly being allocated;
- The pipe lock has to be acquired and released all the time during processing.
On this machine, sequential RAM reads are about 16GiB/s:
% sysbench memory --memory-block-size=1G --memory-oper=read --threads=1 run
...
102400.00 MiB transferred (15921.22 MiB/sec)
Considering all the issues listed above, it’s not surprising that it’s 4x slower compared to single-threaded sequential RAM speeds.
Adjusting buffer or pipe sizes to reduce system call and synchronization overhead, or adjusting other parameters won’t help much. Fortunately, there is a way to avoid slow read and write altogether.
Improving with splicing
This copying of buffers from user memory into the kernel and back is a common thorn in the side of people who need fast IO. A common solution is to take the kernel out of the picture and perform the IO operations directly. For example, we can interact directly with the network card, bypassing the kernel, to network with low latency.
Usually when we write to a socket, file or pipe in this case, we first write to some buffer in the kernel and then let the kernel do its work. In the case of pipes, a pipe is a series of buffers in the kernel. All this duplication is not desirable if we care about performance.
Luckily, Linux includes system calls to speed things up when we want to move data to and from pipes, without copying. In particular:
- splice moves data from a pipe to a file descriptor, and vice versa;
- vmsplice moves data from user memory into a pipe.
The point is, both operations work without copying anything.
Now that we know how pipes work, we can roughly imagine how these two operations work: they just “grab” an existing buffer from somewhere and put it into the pipe’s ring buffer, or vice versa, instead of allocating new pages as needed:
We’ll see how it works soon.
Splicing implementation
We replace write with vmsplice. The vmsplice signature is as follows:
struct iovec {
  void  *iov_base; // Starting address
  size_t iov_len;  // Number of bytes
};

// Returns how much we've spliced into the pipe
ssize_t vmsplice(
  int fd, const struct iovec *iov, size_t nr_segs, unsigned int flags
);
fd is the destination pipe, and struct iovec *iov is the array of buffers that will be moved into the pipe. Note that vmsplice returns how much was “spliced” into the pipe, which may not be the full amount, just as write returns how much was written. Don’t forget that pipes are bounded by the number of slots in their ring buffer, while vmsplice is not bound in this way.
You also need to be careful when using vmsplice. Since user memory is moved into the pipe without copying, you must ensure that the read side has consumed it before reusing the spliced buffer.
For this, fizzbuzz uses a double-buffering scheme, which works as follows:
- Split the 256KiB buffer into two halves;
- Set the pipe size to 128KiB, which is equivalent to giving the pipe’s ring buffer 128KiB/4KiB = 32 slots;
- Alternate between the two halves of the buffer, vmsplicing one half into the pipe while the other half is free to be written to.
Setting the pipe size to 128KiB and waiting for vmsplice to fully output a 128KiB buffer ensures that when a vmsplice iteration completes, we know the previous buffer has been fully read – otherwise we could not have fully vmspliced the new 128KiB buffer into a 128KiB pipe.
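The pipe capacity itself can be changed from user space with fcntl and F_SETPIPE_SZ. Here is a minimal sketch, assuming we resize the pipe on stdout once at startup; this is how it could be done, not necessarily the test program’s exact code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Resize the pipe backing `fd`; the kernel rounds the size up as needed and
// returns the actual new capacity, or -1 on failure.
static long set_pipe_size(int fd, long size) {
  return fcntl(fd, F_SETPIPE_SZ, size);
}

// Example: set_pipe_size(STDOUT_FILENO, 128 * 1024); // 128KiB, i.e. 32 slots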
Now, we haven’t actually written anything to the buffer, but we’ll keep the double-buffering scheme since any program that actually writes something needs a similar scheme.
Our write loop now looks like this:
int main() {
  size_t buf_size = 1 << 18; // 256KiB
  char* buf = (char*) malloc(buf_size);
  memset((void*)buf, 'X', buf_size); // output Xs
  char* bufs[2] = { buf, buf + buf_size/2 };
  int buf_ix = 0;
  // Flip between the two buffers, splicing until we're done.
  while (true) {
    struct iovec bufvec = {
      .iov_base = bufs[buf_ix],
      .iov_len = buf_size/2
    };
    buf_ix = (buf_ix + 1) % 2;
    while (bufvec.iov_len > 0) {
      ssize_t ret = vmsplice(STDOUT_FILENO, &bufvec, 1, 0);
      bufvec.iov_base = (void*) (((char*) bufvec.iov_base) + ret);
      bufvec.iov_len -= ret;
    }
  }
}
Here is the result of writing using vmsplice instead of write:
% ./write --write_with_vmsplice | ./read
12.7GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
This halved the amount of copying we needed and more than tripled the throughput, to 12.7GiB/s. After changing the read side to use splice, all the copying was eliminated, giving another 2.5x speedup:
% ./write --write_with_vmsplice | ./read --read_with_splice
32.8GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
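For reference, here is a minimal sketch of what a splice-based read side could look like, draining the pipe into /dev/null as the test program’s reader does; this is hypothetical code, not the exact read.cpp (which also times the transfer and prints the throughput):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main() {
  int devnull = open("/dev/null", O_WRONLY);
  size_t total = (size_t)10 << 30; // stop after 10GiB
  size_t read_so_far = 0;
  while (read_so_far < total) {
    // Move up to 128KiB from stdin (the pipe) to /dev/null; the pages are
    // passed along without ever being copied into user-space memory.
    ssize_t ret = splice(STDIN_FILENO, NULL, devnull, NULL, 128 * 1024, 0);
    if (ret <= 0) break; // EOF or error -- error handling omitted
    read_so_far += ret;
  }
  return 0;
}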
page improvements
What’s next? Let’s ask perf:
% perf record -g sh -c './write --write_with_vmsplice | ./read --read_with_splice'
33.4GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.305 MB perf.data (2413 samples) ]
% perf report --symbol-filter=vmsplice
- 49.59%  0.38%  write  libc-2.33.so  [.] vmsplice
   - 49.46% vmsplice
      - 45.17% entry_SYSCALL_64_after_hwframe
         - do_syscall_64
            - 44.30% __do_sys_vmsplice
               + 17.88% iov_iter_get_pages
               + 16.57% __mutex_lock.constprop.0
                 3.89% add_to_pipe
                 1.17% iov_iter_advance
                 0.82% mutex_unlock
                 0.75% pipe_lock
        2.01% __entry_text_start
        1.45% syscall_return_via_sysret
Most of the time is spent locking the pipe for writing (__mutex_lock.constprop.0) and moving pages into the pipe (iov_iter_get_pages). Not much can be improved about locking, but we can improve the performance of iov_iter_get_pages.
As the name suggests, iov_iter_get_pages converts the struct iovecs we give to vmsplice into struct pages to put into the pipe. To understand what this function actually does, and how to speed it up, we must first understand how the CPU and Linux organize pages.
A quick look at paging
As you know, processes don’t reference locations in RAM directly: they are instead assigned virtual memory addresses, which are resolved to physical addresses. This abstraction is called virtual memory, and we won’t cover its various advantages here – but most notably, it greatly simplifies running multiple processes competing for the same physical memory.
In any case, whenever we execute a program and load from or store to memory, the CPU needs to translate virtual addresses into physical addresses. Storing a mapping from every virtual address to every corresponding physical address would be impractical. Therefore, memory is divided into uniformly sized blocks called pages, and virtual pages are mapped to physical pages:
There’s nothing special about 4KiB: each architecture chooses a size based on various tradeoffs – we’ll explore some of these shortly.
To make this more explicit, let’s imagine using malloc to allocate 10000 bytes:
void* buf = malloc(10000);
printf("%p\n", buf); // 0x6f42430
When we use them, our 10k bytes will appear contiguous in virtual memory, but will be mapped to 3 physical pages that don’t have to be contiguous:
One of the tasks of the kernel is to manage this mapping, which is embodied in a data structure called a page table. The CPU specifies the page table structure (because it needs to understand the page table), and the kernel operates on it as needed. On the x86-64 architecture, the page table is a 4-level 512-way tree, itself located in memory. Each node of the tree is (you guessed it!) 4KiB in size, and each entry within the node that points to the next level is 8 bytes (4KiB/8bytes = 512). These entries contain the address of the next node and other metadata.
Each process has a page table – in other words, each process reserves a virtual address space. When the kernel context switches to a process, it sets the specific register CR3 to the physical address of the root of the tree. Then, whenever a virtual address needs to be translated to a physical address, the CPU splits the address into segments and uses them to traverse the tree and compute the physical address.
To make these concepts less abstract, here is an intuitive description of how the virtual address 0x0000f2705af953c0 resolves to a physical address:
The search starts at the first level, called the “page global directory” or PGD, whose physical location is stored in CR3. The first 16 bits of the address are unused. The next 9 bits select an entry in the PGD, which takes us down to the second level, the “page upper directory” or PUD. The following 9 bits select an entry from the PUD. The process is repeated at the next two levels, the PMD (“page middle directory”) and PTE (“page table entry”). The PTE tells us where the actual physical page is, and we then use the last 12 bits to find the offset within that page.
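To make the bit arithmetic concrete, here is a minimal sketch of how an address is split under the standard 4-level x86-64 layout described above (the variable names are just illustrative):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint64_t vaddr = 0x0000f2705af953c0ULL;
  // Bits 47..39, 38..30, 29..21 and 20..12 index the four page table levels;
  // bits 11..0 are the offset within the 4KiB page. Bits 63..48 are unused.
  unsigned pgd_ix = (vaddr >> 39) & 0x1ff;
  unsigned pud_ix = (vaddr >> 30) & 0x1ff;
  unsigned pmd_ix = (vaddr >> 21) & 0x1ff;
  unsigned pte_ix = (vaddr >> 12) & 0x1ff;
  unsigned offset = vaddr & 0xfff;
  printf("PGD=%u PUD=%u PMD=%u PTE=%u offset=0x%x\n",
         pgd_ix, pud_ix, pmd_ix, pte_ix, offset);
  return 0;
}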
The sparse structure of the page table allows the mapping to be built up incrementally as new pages are needed. Whenever a process needs memory, the kernel will update the page table with new entries.
The role of struct page
The struct page data structure is a key part of this machinery: it is the structure the kernel uses to reference a single physical page, storing its physical address and all sorts of other metadata about it. For example, we can get the struct page from the information contained in the PTE (the last level of the page table described above). In general, it is used pervasively in all code that deals with pages.
In the pipeline scenario, struct page is used to save its data in a ring buffer, as we have already seen:
struct pipe_inode_info {
  unsigned int head;
  unsigned int tail;
  struct pipe_buffer *bufs;
};

struct pipe_buffer {
  struct page *page;
  unsigned int offset, len;
};
However, vmsplice accepts virtual memory as input, while struct page refers directly to physical memory.
So we need to convert any block of virtual memory into a set of struct pages. This is exactly what iov_iter_get_pages does and is where we spend half our time:
ssize_t iov_iter_get_pages(
  struct iov_iter *i,  // input: a sized buffer in virtual memory
  struct page **pages, // output: the list of pages which back the input buffers
  size_t maxsize,      // maximum number of bytes to get
  unsigned maxpages,   // maximum number of pages to get
  size_t *start        // offset into first page, if the input buffer wasn't page-aligned
);
struct iov_iter is a Linux kernel data structure representing various ways of traversing a block of memory, including struct iovec. In our case it will point to a 128KiB buffer. vmsplice uses iov_iter_get_pages to convert the input buffer into a set of struct pages and save them. Now that you know how paging works, you can roughly imagine how iov_iter_get_pages works, which is explained in detail in the next section.
We’ve quickly introduced many new concepts, which can be summarized as follows:
- Modern CPUs use virtual memory for processing;
- Memory is organized in fixed-size pages;
- The CPU translates virtual addresses into physical addresses using page tables, which map virtual pages to physical pages;
- The kernel adds and removes entries to and from the page tables as needed;
- Pipes are made of references to physical pages, so vmsplice must convert a range of virtual memory into references to physical pages, and hold on to them.
The cost of fetching pages
The time spent in iov_iter_get_pages is actually spent entirely in another function, get_user_pages_fast:
% perf report -g --symbol-filter=iov_iter_get_pages
- 17.08%  0.17%  write  [kernel.kallsyms]  [k] iov_iter_get_pages
   - 16.91% iov_iter_get_pages
      - 16.88% internal_get_user_pages_fast
           11.22% try_grab_compound_head
get_user_pages_fast is a simplified version of iov_iter_get_pages:
int get_user_pages_fast(
  // virtual address, page aligned
  unsigned long start,
  // number of pages to retrieve
  int nr_pages,
  // flags, the meaning of which we won't get into
  unsigned int gup_flags,
  // output physical pages
  struct page **pages
);
Here “user” (as opposed to “kernel”) refers to the fact that we are converting user-space virtual pages into references to physical pages.
To get struct pages, get_user_pages_fast does exactly what the CPU does, but in software: it traverses the page table to collect all physical pages, and stores the result in struct pages. Our example is a 128KiB buffer and 4KiB pages, so nr_pages = 32. get_user_page_fast needs to traverse the page table tree, collect 32 leaves, and store the result in 32 struct pages.
get_user_pages_fast also needs to ensure that physical pages are not reused until the caller no longer needs them. This is achieved by using a reference count stored in the struct page in the kernel, which is used to know when the physical page can be freed and reused in the future. The caller of get_user_pages_fast must at some point free the page again using put_page to reduce the reference count.
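To make the reference-counting contract concrete, here is a minimal kernel-side sketch of the usual pattern; it is illustrative only (assuming FOLL_WRITE is the appropriate flag and that nr_pages fits in the local array), with error handling and the actual use of the pages elided:

#include <linux/mm.h>

/* Pin the physical pages backing nr_pages user pages starting at addr
   (nr_pages assumed to be at most 32 here), "use" them, then unpin them. */
static void pin_use_unpin(unsigned long addr, int nr_pages)
{
    struct page *pages[32];
    int got = get_user_pages_fast(addr, nr_pages, FOLL_WRITE, pages);

    /* ... hand the pages to a pipe, a device, etc. ... */

    /* Drop the references taken by get_user_pages_fast so the physical
       pages can eventually be freed and reused. */
    for (int i = 0; i < got; i++)
        put_page(pages[i]);
}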
Finally, get_user_pages_fast behaves differently depending on whether the virtual address is already in the page table. This is where the _fast suffix comes from: the kernel will first try to get the already-existing page table entry and corresponding struct page by walking the page table, which is relatively cheap, and otherwise fall back to generating the struct page through other, more expensive means. The fact that we memset the memory at the beginning ensures that we never end up on the “slow” path of get_user_pages_fast, since the page table entries are created when the buffer is filled with “X”s.
Note that the get_user_pages family of functions is not only useful for pipes – it is in fact central to many drivers. A typical use is related to the kernel bypass we mentioned earlier: a network card driver might use it to convert a region of user memory into physical pages, pass the physical page locations to the network card, and let the network card interact with that memory region directly, without involving the kernel.
Huge pages
The page size we’ve presented so far has always been the same – 4KiB on x86-64. But many CPU architectures, including x86-64, offer larger page sizes. On x86-64, besides 4KiB pages (the “standard” size) we also have 2MiB and even 1GiB pages (“huge” pages). In the rest of this article we’ll only discuss 2MiB huge pages, since 1GiB pages are quite rare and would be wasteful for our task.
Page size in today’s common schema, from Wikipedia
The main advantage of huge pages is that they are cheaper to manage, since fewer of them are needed to cover the same amount of memory. Other operations are cheaper too, such as resolving a virtual address to a physical one, because one less level of page table is required: instead of a 12-bit offset into the page we have a 21-bit offset, which removes one level of the page table walk.
This relieves pressure on the part of the CPU that handles this conversion, thus improving performance in many cases. But in our case, the stress is not on the hardware walking the page tables, but the software running in the kernel.
On Linux, we can allocate 2MiB hugepages in several ways, such as allocating memory aligned to 2MiB and then using madvise to tell the kernel to use hugepages for the provided buffer:
void* buf = aligned_alloc(1 << 21, size);
madvise(buf, size, MADV_HUGEPAGE);
Switching to huge pages gave our program another ~50% performance boost:
% ./write --write_with_vmsplice --huge_page | ./read --read_with_splice
51.0GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
However, the reasons for the boost are not entirely obvious. We might naively think that by using huge pages, each struct page will refer to a 2MiB page rather than a 4KiB page.
Unfortunately, this is not the case: the kernel code assumes that a struct page refers to a page of the “standard” size for the current architecture. The way this works for huge pages (and in general for what Linux calls “compound pages”) is that a “head” struct page contains the actual information about the backing physical page, while the contiguous “tail” pages merely contain a pointer to the head page.
So to represent a huge page of 2MiB, there will be 1 “head” struct page and a maximum of 511 “tail” struct pages. Or for our 128KiB buffer, 31 tail struct pages:
Even though we still need all these struct pages, the code that generates them ends up being significantly faster: once the first entry is found, the following struct pages can be generated in a simple loop, instead of walking the page table multiple times. Hence the performance improvement!
Busy looping
We’ll be done soon, I promise! Take another look at the output of perf:
- 46.91%  0.38%  write  libc-2.33.so  [.] vmsplice
   - 46.84% vmsplice
      - 43.15% entry_SYSCALL_64_after_hwframe
         - do_syscall_64
            - 41.80% __do_sys_vmsplice
               + 14.90% wait_for_space
               + 8.27% __wake_up_common_lock
                 4.40% add_to_pipe
               + 4.24% iov_iter_get_pages
               + 3.92% __mutex_lock.constprop.0
                 1.81% iov_iter_advance
               + 0.55% import_iovec
            + 0.76% syscall_exit_to_user_mode
        1.54% syscall_return_via_sysret
        1.49% __entry_text_start
A lot of time is now spent waiting for the pipe to be writable (wait_for_space), and waking up readers waiting for the pipe to fill up (__wake_up_common_lock).
To avoid these synchronization costs, if the pipe cannot be written to, we can have vmsplice return immediately and busy-loop until it can be – and do the same when reading with splice:
...
// SPLICE_F_NONBLOCK will cause `vmsplice` to return immediately
// if we can't write to the pipe, setting errno to EAGAIN
ssize_t ret = vmsplice(STDOUT_FILENO, &bufvec, 1, SPLICE_F_NONBLOCK);
if (ret < 0 && errno == EAGAIN) {
  continue; // busy loop if not ready to write
}
...
With the busy loop, we get another 25% performance improvement:
% ./write --write_with_vmsplice --huge_page --busy_loop | ./read --read_with_splice --busy_loop
62.5GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
Summarize
By looking at perf output and the Linux source code, we systematically improved the program’s performance. Pipes and splicing are not really hot topics in high-performance programming, but the themes we touched on here are: zero-copy operations, ring buffers, paging and virtual memory, and synchronization overhead.
I left out some details and interesting topics, as this blog post has already grown quite long:
- In the real code, the two buffers are allocated separately, reducing page table contention by placing them in different page table entries (the FizzBuzz program does this too).
- Remember that when a page is fetched with get_user_pages, its refcount is incremented, and decremented again by put_page. If we use two page table entries for the two buffers, rather than one page table entry for both, there is less contention when modifying the refcounts.
- The tests are run by pinning the ./write and ./read processes to two cores with taskset.
- The code in the repository contains many other options that I tried, but that ended up not being discussed because they were irrelevant or not interesting enough.
- The repository also contains a synthetic benchmark for get_user_pages_fast, which can be used to measure precisely how much slower it runs with or without huge pages.
- Splicing in general is a somewhat dubious/dangerous concept, which continues to haunt kernel developers.
Please let me know if you found this article useful, interesting, or neither!
Thanks
Many thanks to Alexandru Scvorţov, Max Staudt, Alex Appetiti, Alex Sayers, Stephen Lavelle, Peter Cawley and Niklas Hambüchen for reviewing drafts of this article. Max Staudt also helped me understand some of the subtleties of get_user_pages.
1. This will be similar in style to my atan2f performance investigation (https://ift.tt/oBrmduT), although the program in question there was for learning purposes only. Also, we will optimize the code at different levels: tuning atan2f was micro-optimization guided by the assembly output, while tuning the pipes involves looking at perf events and reducing various kernel overheads.
2. This test was run on an Intel Skylake i7-8550U CPU and Linux 5.17. Your environment may vary, as the Linux internals that affect the programs described in this article have been changing over the past few years and may be adjusted in future releases.
3. “FizzBuzz” is allegedly a common coding interview question, and while I’ve never personally been asked it, I have solid evidence that it does get asked.
4. We fixed the buffer size here, but the throughput numbers wouldn’t actually be very different with other reasonable buffer sizes, given the impact of the other bottlenecks.
5. For the interesting details, always refer to the repository. In general, I won’t copy the code verbatim here because the details don’t matter. Instead, I’ll post updated code snippets.
6. Note that here we profile a shell call that includes pipe reads and writes – by default, perf record tracks all child processes.
7. When analyzing the program, I noticed that the output of perf was polluted with information from the “Pressure Stall Information” infrastructure (PSI). So these numbers are taken from a kernel compiled with PSI disabled. This can be achieved by setting CONFIG_PSI=n in the kernel build configuration. In NixOS:
boot.kernelPatches = [{
  name = "disable-psi";
  patch = null;
  extraConfig = ''
    PSI n
  '';
}];
Also, in order for perf to correctly display the time spent in system calls, the kernel debug symbols must be present. How to install symbols varies by distribution. In the latest NixOS versions, they are installed by default.
8. If you ran perf record -g, you can use + to expand the call graph in perf report.
9. A single “spare page” called tmp_page is actually reserved by pipe_read and reused by pipe_write. However since there’s always just one page here, I can’t take advantage of it to achieve higher performance because the page reuse is offset by the fixed overhead when calling pipe_write and pipe_read.
10. Technically, vmsplice also supports transferring data in the other direction, albeit to a lesser extent. As the man page states:
vmsplice really only supports true splicing from user memory to pipes. In the opposite direction, it actually just copies the data to user space.
11. Travis Downs points out that this scheme may still be unsafe, since pages might be spliced onward, extending their lifetime. This issue also came up in the original FizzBuzz post. In fact, I’m not entirely sure whether vmsplice without SPLICE_F_GIFT is really unsafe – the vmsplice man page states that it is. However, extreme care is definitely required to implement a zero-copy pipeline while remaining safe in this case. In the test program, the read side splices the pipe into /dev/null, so it may be that the kernel knows it can splice the pages without copying, but I haven’t verified whether this is actually what happens.
12. Here we present a simplified model where physical memory is a simple flat linear sequence. The reality is a little more complicated, but a simple model speaks for itself.
13. The physical address corresponding to the virtual page allocated to the current process can be checked by reading /proc/self/pagemap and multiplying the “page frame number” by the page size.
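A minimal sketch of that check, assuming a 4KiB page size and omitting error handling; note that on recent kernels unprivileged processes see a zeroed page frame number (the entry format is described in the kernel’s pagemap documentation):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  char *buf = (char *) malloc(4096);
  buf[0] = 'X'; // touch the page so it is actually mapped
  uint64_t vaddr = (uint64_t) buf;

  // /proc/self/pagemap has one 64-bit entry per virtual page.
  FILE *pagemap = fopen("/proc/self/pagemap", "rb");
  uint64_t entry = 0;
  fseek(pagemap, (long) (vaddr / 4096 * sizeof(entry)), SEEK_SET);
  fread(&entry, sizeof(entry), 1, pagemap);
  fclose(pagemap);

  if (entry & (1ULL << 63)) { // bit 63: page present
    uint64_t pfn = entry & ((1ULL << 55) - 1); // bits 0-54: page frame number
    printf("virtual %p -> physical 0x%llx\n", (void *) vaddr,
           (unsigned long long) (pfn * 4096 + vaddr % 4096));
  } else {
    printf("page not present\n");
  }
  return 0;
}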
14. Starting with Ice Lake, Intel expanded the page table to level 5, increasing the maximum addressable memory from 256TiB to 128PiB. But this feature must be explicitly turned on, because some programs rely on the upper 16 bits of the pointer not being used.
15. The address in the page table must be a physical address, otherwise we will end up in an infinite loop.
16. Note that the upper 16 bits are unused: this means that each process can address at most 2^48 − 1 bytes, or 256TiB.
17. A struct page might refer to a physical page that has not been allocated yet, i.e. one that doesn’t yet have a physical address or the other page-related bookkeeping. They are best thought of as fully abstract references to physical pages, not necessarily references to already-allocated ones. This subtle point is picked up again in a later side note.
18. In practice, the pipe code always calls get_user_pages_fast with nr_pages = 16, looping if necessary, presumably so it can use a small static buffer. But this is an implementation detail; the total number of spliced pages will still be 32.
19. What follows is a subtlety that is not required to understand the rest of this article!
If the page table does not contain the entry we’re looking for, get_user_pages_fast still needs to return a struct page. The most obvious way to do that is to create the right page table entry and then return the corresponding struct page.
However, get_user_pages_fast will only do so if it was asked to get struct pages with the intent of writing to them. Otherwise it does not update the page table, and instead returns a struct page that refers to a physical page which has not been allocated yet. This is exactly the case for vmsplice, since we only need to produce struct pages to fill the pipe, without actually writing to any memory.
In other words, page allocation is deferred until the page is actually needed. This saves the cost of allocating physical pages, but if the pages are never faulted in by some other means, it causes the slow path of get_user_pages_fast to be hit repeatedly.
So if we didn’t memset beforehand, we would never have “manually” faulted the pages into the page table, and the result would be not only that the first call to get_user_pages_fast takes the slow path, but that all subsequent calls do too, leading to a noticeable slowdown (25GiB/s instead of 30GiB/s):
% ./write --write_with_vmsplice --dont_touch_pages | ./read --read_with_splice
25.0GiB/s, 256KiB buffer, 40960 iterations (10GiB piped)
Moreover, this behaviour does not show up when using huge pages: in that case, when passed a range of virtual memory, get_user_pages_fast’s huge page support will fault the pages in correctly.
If all this is confusing, don’t worry: get_user_pages and friends seem to be a very tricky corner of the kernel, even for kernel developers.
20. Only if the CPU has the PDPE1GB flag.
21. For example, the CPU contains dedicated hardware to cache parts of the page table, the “translation lookaside buffer” (TLB). The TLB is flushed on every context switch (every time the contents of CR3 change). Huge pages significantly reduce TLB misses, since a single entry for a 2MiB page covers 512 times as much memory as a 4KiB page.
22. If you’re thinking “this sucks!”: various efforts are underway to simplify and/or optimize this situation. Recent kernels (since 5.17) include a new type, struct folio, to explicitly identify head pages. This reduces the need to check at runtime whether a struct page is a head or a tail page, improving performance. Other efforts aim to remove the extra struct pages entirely, although I don’t know how.