Reimplementing a Lua Performance Analyzer

Original link: https://blog.gotocoding.com/archives/1851

When I was learning Go last year, a classmate said something that has stuck with me: “We have enough tools to profile the program and find the root cause of performance problems.”

Later I discovered that Go’s profiling tools are indeed very powerful. More importantly, they are designed to sample data directly in a production environment.

However, when I write Lua code I can’t say the same thing with any confidence, even though I have implemented a Lua profiler several times before.

These profilers work roughly like gprof, with slightly different details: the entry time is recorded when a function is entered, and its execution time and call count are accumulated when it returns.

To accurately measure the CPU time of functions such as rpc:call, I added an option to exclude the time spent in coroutine yields.
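For illustration only (this is not the original profiler code), a hook-based Lua profiler along these lines might look like the following sketch, built on the standard Lua C API; the bookkeeping is only outlined in comments.

```c
/* A minimal sketch of a hook-based profiler using the Lua debug hook API.
 * A real implementation would also subtract coroutine yield time as
 * described above. */
#include <lua.h>
#include <lauxlib.h>
#include <time.h>

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Invoked on every call/return event of the instrumented lua_State. */
static void profile_hook(lua_State *L, lua_Debug *ar) {
    if (ar->event == LUA_HOOKCALL) {
        /* record now_ns() as the entry time of the called function */
    } else if (ar->event == LUA_HOOKRET) {
        /* add (now_ns() - entry time) to this function's accumulated
         * time and increment its call counter */
    }
    (void)L;
}

void profiler_start(lua_State *L) {
    lua_sethook(L, profile_hook, LUA_MASKCALL | LUA_MASKRET, 0);
}
```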

However, these profilers have some disadvantages:

First, they have a significant impact on the performance of the host program. When timing is done at function granularity, the overhead can reach 1000%, so they cannot be used in an online environment and are only suitable for self-testing during development.

Second, they can only account for Lua functions (including closures and light C functions), not the overhead of C functions inside C modules. Conversely, C-side profiling tools cannot attribute time to Lua functions, which makes profiling feel very disconnected.

Furthermore, we lose context when profiling with a C profiler. Since Lua is a virtual machine written in C, when we find that a certain C function is expensive we cannot tell which piece of Lua code is responsible. For example, when the CPU usage of tremove is high, there is no way to know which Lua code is calling it.

Finally, these profilers run inside the host process. If the host process falls into an infinite loop, no profiling data can be obtained at all.

At the time I didn’t find a good way to solve these problems. It wasn’t until I recently started studying eBPF that I finally felt I could fix these shortcomings and implement a profiler similar to Go’s.

Looking back now, a year has passed.


The new profiler is based on the same stack sampling technique as Go’s profiler, so its impact on the target program’s performance is minimal.

The difference from Go is that the Lua profiler I implemented this time is a standalone program, like perf on Linux.

This way there is no intrusion into the target program, and the profiler keeps working even if the target is stuck in an infinite loop.

At first glance this didn’t seem too difficult: grab the C call stack and the Lua call stack in the bpf program and merge them in user space.

Finally, output the samples in flame graph format and render a flame graph.

The whole process is not complicated.

However, once I actually started implementing it, things went far beyond my expectations, and the whole process kept hitting blind spots in my knowledge.

I assumed that after nearly nine years of eBPF development, obtaining the C call stack from kernel space would be a simple API call. Reality hit me hard.

Modern compilers omit the frame pointer by default as soon as optimization is enabled, and the built-in bpf API can only walk the entire call stack easily when the frame pointer is retained.
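For reference, the built-in API in question is the bpf_get_stack()/bpf_get_stackid() helper family. A minimal sampler using it might look like the sketch below; when the target is built without frame pointers it usually recovers only the top frame or two.

```c
/* Sketch: sampling the user-space call stack with the built-in helper.
 * This only yields a complete stack when the target keeps frame pointers. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_DEPTH 32

SEC("perf_event")
int do_sample(void *ctx)
{
    __u64 ips[MAX_DEPTH] = {};

    /* returns the number of bytes written, i.e. 8 * frames captured */
    long n = bpf_get_stack(ctx, ips, sizeof(ips), BPF_F_USER_STACK);

    /* with frame pointers omitted, n is usually just one or two frames */
    bpf_printk("captured %ld bytes of user stack", n);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```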

I was faced with two choices: require that the profiled process be compiled with the -fno-omit-frame-pointer option, or perform the stack backtrace manually myself.

My goal is non-intrusive performance analysis of the target program, and I believe that forcing the target to build with -fno-omit-frame-pointer is itself a form of intrusion.

The target program would have to permanently bear the performance cost of -fno-omit-frame-pointer.

Moreover, I cannot require that .so files provided by the system, such as libc, retain the frame pointer.

So only one option remained: performing the stack backtrace manually.

There are two options for manual stack backtracing.

One is to copy the target process’s entire stack data to user space from the bpf program, and then use libunwind to do the backtrace there.

The other is to do the backtrace directly in the bpf program, count occurrences of each call stack there, and send only the aggregated results back to user space.

Soon, I rejected the first option.

bpf was originally designed for filtering network packets, and eBPF programs should follow the same design philosophy:

that is, process and aggregate the data inside the bpf program, and send only what is actually needed back to user space.

Moreover, if we backtraced the stack in user space, the asynchronous nature of the ringbuffer would prevent us from sampling a Lua call stack that matches the collected stack data: we would first have to backtrace the C call stack to obtain the L pointer and only then walk L, by which time the target program’s Lua call stack would already have changed.

Manually backtracing a call stack built without frame pointers was a complete blind spot in my knowledge.

Initially I considered imitating gdb’s approach and backtracing through debug information.

However, debug information is far too large to push into the kernel, and parsing it is not the kind of task bpf is suited for.

By chance, I discovered a section named eh_frame (exception handling frame) in the ELF file. It is designed to allow a complete stack traceback when an exception occurs, which is exactly what we need.

eh_frame consists of a series of CFI instructions that describe how to unwind the stack. These instructions apply in program order within a function: to recover the state of all registers at a given instruction, you must execute every CFI instruction from the start of the function up to that instruction.

Fortunately, during the backtrace we only need the caller’s EIP and the value of the register holding the lua_State *L variable, so the unwind rules for most registers can be ignored.

To speed up the bpf program, we can preprocess the eh_frame data before sending it to bpf.

By simulating the execution of the CFI instructions, we can precompute the unwind rules for every register at each assembly instruction.

Then, once the bpf program has an EIP, it can binary-search this table to quickly find the unwind rules for all registers.

To make better use of the cache, we can also generate an array similar to eh_frame_hdr that contains only the EIPs and is used solely for the bpf program’s binary search.

Once the index is found, we look up the full eh_frame-derived rule at that index.
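To make this concrete, here is a rough sketch of what the preprocessed tables and the in-bpf binary search could look like. All names, sizes, and the exact rule layout are my assumptions, not the actual implementation.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_RULES (1 << 20)

/* One precomputed row per code range, produced in user space by simulating
 * the CFI instructions; only the registers we care about are kept. */
struct unwind_rule {
    __u64 va;        /* first instruction address this rule covers     */
    __s32 cfa_reg;   /* register the CFA is based on (e.g. rsp or rbp) */
    __s32 cfa_off;   /* CFA = cfa_reg + cfa_off                        */
    __s32 ra_off;    /* return address stored at CFA + ra_off          */
    __s32 rbp_off;   /* previous rbp at CFA + rbp_off, if tracked      */
};

/* Sorted start addresses only, mirroring the role of eh_frame_hdr. */
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, MAX_RULES);
    __type(key, __u32);
    __type(value, __u64);
} unwind_index SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, MAX_RULES);
    __type(key, __u32);
    __type(value, struct unwind_rule);
} unwind_rules SEC(".maps");

/* Find the last rule whose va <= pc; nr_rules is the number of valid
 * entries, passed in from user space. */
static __always_inline int find_rule(__u64 pc, __u32 nr_rules)
{
    __u32 lo = 0, hi = nr_rules;

    /* fixed iteration bound (enough for 2^20 entries) keeps the verifier happy */
    for (int i = 0; i < 22 && lo < hi; i++) {
        __u32 mid = lo + (hi - lo) / 2;
        __u64 *va = bpf_map_lookup_elem(&unwind_index, &mid);
        if (!va)
            return -1;
        if (*va <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo ? (int)(lo - 1) : -1;
}
```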


The above scheme works, but it overlooks one detail.

What we read from the ELF file are relative virtual addresses (PC), while what the bpf program sees are the absolute virtual addresses (VA) mapped by the kernel.

When preprocessing eh_frame, we need to convert its PC values into VAs, which requires the Program Header Table of the ELF file.

The Program Header Table describes how the ELF file is mapped into the process address space in segments. Combined with the mapping information in /proc/<pid>/smaps, we can convert between PC and VA.

The conversion works as follows: when a program header’s p_offset equals the offset field of an entry in /proc/<pid>/smaps, they refer to the same mapped region of the file.

Given a PC from eh_frame, we compute its offset within that mapped block of the ELF file and add the mapping’s base address from /proc/<pid>/smaps to obtain the absolute virtual address (VA) in the process’s address space.
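A minimal sketch of that conversion, with a hypothetical struct standing in for the parsed smaps entry:

```c
#include <elf.h>
#include <stdint.h>

/* One entry parsed from /proc/<pid>/smaps for the mapping that corresponds
 * to the program header `phdr` (matched by p_offset == smaps offset). */
struct mapping {
    uint64_t start;    /* mapping base address       */
    uint64_t offset;   /* file offset of the mapping */
};

/* pc is the ELF-relative address read from eh_frame. */
static uint64_t pc_to_va(uint64_t pc, const Elf64_Phdr *phdr,
                         const struct mapping *map)
{
    /* file offset of this pc inside the ELF image */
    uint64_t file_off = pc - phdr->p_vaddr + phdr->p_offset;

    /* its offset inside this mapped block, plus the mapping's base address */
    return map->start + (file_off - map->offset);
}
```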

Now, we can finally perform C stack backtracing in the bpf program.


Lua’s call stack is comparatively simple: just walk the entire L->ci linked list, as sketched below.
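A rough sketch of that walk from the bpf side might look like this; the field offsets are assumed to be resolved by user space for the target’s exact Lua build, and error handling is omitted.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_LUA_DEPTH 128

/* Field offsets resolved by user space for the target's Lua build and
 * written into the program's read-only data before loading (assumption). */
const volatile __u32 off_L_ci;         /* offsetof(lua_State, ci)      */
const volatile __u32 off_ci_func;      /* offsetof(CallInfo, func)     */
const volatile __u32 off_ci_previous;  /* offsetof(CallInfo, previous) */

/* Walks L->ci from the innermost frame outwards; each frame is recorded as
 * the address of its function slot, to be resolved to source:line later.
 * `frames` must have room for MAX_LUA_DEPTH entries. */
static __always_inline int walk_lua_stack(void *L, __u64 *frames)
{
    void *ci = NULL;
    int depth = 0;

    bpf_probe_read_user(&ci, sizeof(ci), (char *)L + off_L_ci);

    for (int i = 0; i < MAX_LUA_DEPTH && ci; i++) {
        void *func = NULL;

        bpf_probe_read_user(&func, sizeof(func), (char *)ci + off_ci_func);
        frames[depth++] = (__u64)func;

        /* follow the linked list towards the outermost caller */
        bpf_probe_read_user(&ci, sizeof(ci), (char *)ci + off_ci_previous);
    }
    return depth;
}
```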

But there is one special problem: Lua functions are dynamic, so a function that exists at the moment of sampling may be collected by the garbage collector (GC) shortly afterwards.

Therefore, when walking Lua’s call stack we need to capture the source file information immediately; otherwise we may not be able to recover it later.

However, storing file paths and line numbers directly in each sampled Lua call stack wastes a lot of space.

A quick calculation: if the maximum Lua call stack depth we want to support is 128, and each file path can be up to 64 bytes long, then each sampled call stack costs 128 * (64 + 4) bytes of storage.

That much storage is unacceptable, and it also makes counting call stacks much more expensive.

To simplify the design, I created a string mapping table named strings in the bpf program.

When walking the Lua call stack, we map each string to a uint32_t id through strings, and then build a virtual address as id << 32 | line_num.

To mark it as a synthetic address rather than a real virtual address, the final value is (uint64_t)id << 32 | line_num | (1ULL << 63).

This works because bit 63 of a user-space address is always 0.
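In code form, the encoding and its inverse might look like these small helpers (the names are mine):

```c
#include <stdint.h>

/* Build a synthetic frame address: string id in bits 32..62, line number in
 * the low 32 bits, and bit 63 set so it can never collide with a real
 * user-space virtual address. */
static inline uint64_t lua_frame_addr(uint32_t string_id, uint32_t line_num)
{
    return ((uint64_t)string_id << 32) | line_num | (1ULL << 63);
}

/* The corresponding decoding used when resolving frames in user space. */
static inline int      is_lua_frame(uint64_t addr)    { return addr >> 63; }
static inline uint32_t frame_string_id(uint64_t addr) { return (addr >> 32) & 0x7fffffff; }
static inline uint32_t frame_line(uint64_t addr)      { return (uint32_t)addr; }
```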

It’s worth noting that strings is a hash table, so there’s no guarantee that collisions will never occur.

When a collision happens, we send the old string and its id back to user space for storage, and assign a new id to the slot.

We rely on the fact that most Lua functions are long-lived, so the TString pointers of their source files are likely to stay the same.

Since we can tolerate collisions, we simply treat the source file’s TString pointer as the string’s hash value: if the pointers differ, we treat them as two different strings.
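A sketch of what such a strings table could look like on the bpf side is below. The slot layout, map sizes, and ring buffer are my assumptions, and forwarding the actual string bytes to user space is omitted for brevity.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define STRING_SLOTS 16384

struct string_slot {
    __u64 tstring_ptr;   /* address of the source TString in the target */
    __u32 id;            /* id used when building synthetic addresses   */
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, STRING_SLOTS);
    __type(key, __u32);
    __type(value, struct string_slot);
} strings SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 20);
} events SEC(".maps");

static __u32 next_id = 1;   /* 0 is reserved for "unknown" */

static __always_inline __u32 string_id(__u64 tstring_ptr)
{
    __u32 slot = (__u32)(tstring_ptr >> 3) % STRING_SLOTS;
    struct string_slot *s = bpf_map_lookup_elem(&strings, &slot);
    if (!s)
        return 0;

    if (s->tstring_ptr == tstring_ptr)
        return s->id;                 /* seen before: reuse its id */

    if (s->tstring_ptr) {
        /* collision: hand the old (pointer, id) pair to user space so it
         * can keep the mapping, then let the new string take the slot */
        struct string_slot *old = bpf_ringbuf_reserve(&events, sizeof(*old), 0);
        if (old) {
            *old = *s;
            bpf_ringbuf_submit(old, 0);
        }
    }
    s->tstring_ptr = tstring_ptr;
    s->id = next_id++;   /* concurrency between CPUs is ignored in this sketch */
    return s->id;
}
```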


Among the bpf helpers there is a function, bpf_get_stackid, that maps an entire call stack to an id, which is very useful for generating flame graphs.

Since we unwind the stack manually, we have to implement this functionality ourselves, but that also brings some benefits.

By communicating with user space, we can exploit its large memory to do things bpf_get_stackid cannot.

First, we need to define a hash table called stacks .

When we obtain a stack trace, we compute hash values for the kernel-space call stack, the user-space call stack, and the Lua call stack at the same time, and the combined hash determines the corresponding slot in stacks.

If the slot is already occupied, we compare it with the current call stack; if they are the same, its counter is incremented.

If they differ, bpf_get_stackid would have to either discard the old call stack in the slot or drop the newly inserted one.

Since we can communicate with user space, we instead send the old call stack back to user space and let the new one occupy the slot.
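Put together, a sketch of the stacks table and its collision handling might look like this. Names, sizes, and the scratch-record arrangement are assumptions, and a full implementation would compare the frames themselves rather than only a 64-bit hash.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_MERGED_DEPTH 64
#define STACK_SLOTS      16384

struct stack_record {
    __u64 hash;                      /* combined kernel/user/Lua stack hash */
    __u32 count;                     /* how many samples hit this stack     */
    __u32 depth;
    __u64 frames[MAX_MERGED_DEPTH];  /* merged C + synthetic Lua frames     */
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, STACK_SLOTS);
    __type(key, __u32);
    __type(value, struct stack_record);
} stacks SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 22);
} evicted SEC(".maps");

/* `cur` is assumed to live in a per-CPU scratch map: the record is far too
 * large for the 512-byte bpf stack. */
static __always_inline void count_stack(struct stack_record *cur)
{
    __u32 slot = (__u32)(cur->hash % STACK_SLOTS);
    struct stack_record *old = bpf_map_lookup_elem(&stacks, &slot);
    if (!old)
        return;

    if (old->count == 0) {                    /* empty slot: take it */
        __builtin_memcpy(old, cur, sizeof(*cur));
        old->count = 1;
        return;
    }
    if (old->hash == cur->hash) {             /* same stack: just count it */
        old->count++;
        return;
    }
    /* collision: instead of dropping a stack as bpf_get_stackid would,
     * evict the old record to user space and let the new stack move in */
    struct stack_record *out = bpf_ringbuf_reserve(&evicted, sizeof(*out), 0);
    if (out) {
        __builtin_memcpy(out, old, sizeof(*old));
        bpf_ringbuf_submit(out, 0);
    }
    __builtin_memcpy(old, cur, sizeof(*cur));
    old->count = 1;
}
```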


Merging the Lua call stack with the C call stack was not smooth sailing either.

Starting from Lua 5.4, Lua supports yielding from within C functions.

This means a C function or C closure may appear in L->ci (the Lua call information list) while having no corresponding frame in the C call stack.

The current solution is to employ a heuristic matching strategy.

When a C function in the L->ci list matches a C function in the C call stack, we assume that the portion of the Lua call stack from its top down to that C function was generated by that C frame, and merge the two stacks accordingly, as sketched below.
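A user-space sketch of that merge heuristic follows. It assumes the C stack entries have already been resolved to the start addresses of their containing functions, so they can be compared with the lua_CFunction pointers recorded from L->ci; everything here is illustrative, not the actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* c_stack:    C frames, innermost first, resolved to function start addresses
 * lua_frames: synthetic Lua frame addresses from L->ci, innermost first
 * lua_cfunc:  for each Lua frame, the C function pointer if that CallInfo is
 *             a C function/closure, or 0 for a pure Lua frame */
size_t merge_stacks(const uint64_t *c_stack, size_t c_n,
                    const uint64_t *lua_frames, const uint64_t *lua_cfunc,
                    size_t lua_n, uint64_t *out, size_t out_cap)
{
    size_t o = 0, li = 0;

    for (size_t ci = 0; ci < c_n; ci++) {
        /* does a pending C-function CallInfo match this C frame? */
        for (size_t j = li; j < lua_n; j++) {
            if (lua_cfunc[j] && lua_cfunc[j] == c_stack[ci]) {
                /* the Lua frames above this CallInfo were driven by this C
                 * call, so they sit just below it in the merged stack */
                while (li < j && o < out_cap)
                    out[o++] = lua_frames[li++];
                li = j + 1;   /* the CallInfo itself is the C frame */
                break;
            }
        }
        if (o < out_cap)
            out[o++] = c_stack[ci];
    }
    /* any remaining (outermost) Lua frames */
    while (li < lua_n && o < out_cap)
        out[o++] = lua_frames[li++];
    return o;
}
```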


A few side notes.

When I first learned about eBPF programs, I heard that the kernel has a bpf verifier that ensures that bpf programs you write will never corrupt kernel data.

I always thought this was amazing and wondered whether the technique would be unbeatable if applied to checking application code as well.

Eventually I learned that the bpf verifier takes a “better to wrongly reject a thousand programs than let one bad one through” approach, and its error reports can leave you confused and frustrated.

A very important point to understand is that the bpf verifier runs on the compiled bytecode, after compilation.

If you write the required if check but the verifier still complains that the check is missing, it may be because the compiler optimized your check away, and you need various unusual tricks to defeat that optimization (I spent a whole weekend on this problem). One such trick is sketched below.
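The exact workaround depends on the compiler version, but one commonly used trick in bpf code is to launder the variable through an empty inline-asm constraint so the optimizer can no longer prove the check redundant. A sketch with assumed names:

```c
/* Force the compiler to keep a bounds check that the verifier needs to see.
 * The empty inline-asm constraint makes the variable's value opaque to the
 * optimizer, so the comparison cannot be proven redundant and removed. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define make_opaque(var) asm volatile("" : "+r"(var))

/* dst is assumed to have at least 64 bytes of space */
static __always_inline long copy_path(char *dst, const char *src, __u32 len)
{
    make_opaque(len);          /* the optimizer can no longer reason about len */
    if (len > 64)              /* the check the verifier insists on seeing */
        return -1;
    return bpf_probe_read_user(dst, len, src);
}
```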

The final source code is here
