Original link: https://blog.gotocoding.com/archives/1851
When I was learning the Go language last year, a classmate said something that stuck with me: “We have enough tooling to analyze performance and find the root cause of performance problems.”
Later I discovered that Go’s performance-analysis tooling is indeed very powerful. More importantly, it is designed to sample data directly in a production environment.
However, when I write Lua code, I can’t say the same thing with confidence, even though I have implemented a Lua profiler several times before.
Those profilers all work on roughly the same principle as gprof, differing only in details: the entry time is recorded when a function is entered, and the execution time and call count are accumulated when it exits.
To accurately evaluate the CPU time of functions such as rpc:call, I also added an option to subtract coroutine yield time.
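The bookkeeping described above can be sketched as a small user-space model. This is only an illustration, not the actual profiler code; the struct layout, function names, and the nanosecond timestamps passed in are all invented for the example. On entry we push a timestamp, on exit we pop it and accumulate the elapsed time, and any time spent yielded is charged against every open frame so it can be subtracted:

```c
#include <stdint.h>

#define MAX_DEPTH 128

// Simplified model of a hook-based profiler: one record per active call.
struct frame {
    uint64_t enter_ns;    // timestamp when the function was entered
    uint64_t yielded_ns;  // time spent yielded while this frame was active
};

struct profiler {
    struct frame stack[MAX_DEPTH];
    int depth;
    uint64_t total_ns;    // accumulated CPU time of profiled functions
    uint64_t calls;
};

// Called from the "function entered" hook.
static void prof_enter(struct profiler *p, uint64_t now_ns) {
    if (p->depth < MAX_DEPTH) {
        p->stack[p->depth].enter_ns = now_ns;
        p->stack[p->depth].yielded_ns = 0;
        p->depth++;
    }
}

// Called when the coroutine resumes: charge the yield gap to every open frame.
static void prof_resumed(struct profiler *p, uint64_t yield_ns) {
    for (int i = 0; i < p->depth; i++)
        p->stack[i].yielded_ns += yield_ns;
}

// Called from the "function returned" hook: wall time minus yield time.
static void prof_exit(struct profiler *p, uint64_t now_ns) {
    if (p->depth > 0) {
        p->depth--;
        struct frame *f = &p->stack[p->depth];
        p->total_ns += (now_ns - f->enter_ns) - f->yielded_ns;
        p->calls++;
    }
}
```

Because every function entry and exit fires a hook, this per-call accounting is exactly where the heavy overhead discussed below comes from.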
However, these profilers have some disadvantages:
First, they have a significant impact on the performance of the host program. When timing at function granularity, the overhead can reach 1000%, so they cannot be used in an online environment and are only suitable for self-testing during development.
Second, they can only measure Lua functions (including closures and light C functions written in C); they cannot measure the overhead of C functions inside C modules. Meanwhile, general C profiling tools cannot attribute time to Lua functions. This makes profiling feel very disconnected.
Furthermore, we lose contextual information when profiling with a C profiler. Since Lua is a virtual machine written in C, when we find that a certain C function is time-consuming, we cannot determine which piece of Lua code caused it. For example, when the CPU usage of the tremove function is high, there is no way to know which Lua code is responsible.
Finally, these profilers are implemented in the host process. If the host process falls into an infinite loop, no performance analysis data will be obtained.
I didn’t have a good way to solve these problems at the time. It wasn’t until recently, when I started studying eBPF, that I finally felt I could address these shortcomings and implement a performance analyzer similar to Go’s.
Looking back now, a year has passed.
The new profiler is based on the same stack-sampling technique as Go’s profiler, so its impact on the target program’s performance is minimal.
Unlike Go’s, the Lua profiler I implemented this time is a standalone program, like perf on Linux.
This way there is no intrusion into the target program, and the profiler keeps working even if the target falls into an infinite loop.
At first glance this didn’t seem too difficult: obtain the C call stack and the Lua call stack in the bpf program, merge them in user space, and finally output them in flame-graph format to generate a flame graph.
The whole process looked straightforward.
However, when I started to actually implement it, things developed far beyond my expectations, and the whole process touched on blind spots in my knowledge.
I assumed that, with eBPF having been around for nearly nine years, obtaining a C call stack in kernel space would be a simple API call. Reality hit me hard.
Modern compilers omit the frame pointer by default as soon as optimization is enabled, and the built-in bpf API can only walk the entire call stack easily when the frame pointer is retained.
I was faced with two choices: require that the profiled process be compiled with the -fno-omit-frame-pointer option, or perform the stack unwinding manually.
My goal is non-intrusive performance analysis, and I consider forcing the target program to build with -fno-omit-frame-pointer to be a form of intrusion as well: the target program would have to permanently bear the performance cost of reserving the frame-pointer register.
Moreover, I cannot require that .so files shipped by the system, such as libc, retain the frame pointer.
That left only one option: unwinding the stack manually. There are two ways to do this.
One is to copy the target process’s complete stack data to user space from the bpf program and then unwind it with libunwind.
The other is to unwind directly in the bpf program, count the occurrences of each call stack, and send only the aggregated results back to user space.
Soon, I rejected the first option.
bpf was originally designed for filtering network packets, and eBPF programs should follow the same design philosophy: process the data inside the bpf program and transfer only what is needed back to user space.
Moreover, if we unwind in user space, the asynchronous nature of the ringbuffer means we cannot capture a Lua call stack that matches the collected C stack data in time: we must first unwind the C call stack to obtain the L pointer, and only then walk L, by which time the target program’s Lua call stack has already changed.
Manually unwinding a call stack whose frame pointer has been omitted was a complete blind spot for me.
Initially I considered imitating gdb and unwinding via debug information. However, debug information is far too large to transfer to the kernel, and parsing it is not a task suited to a bpf program.
By chance, I discovered a section named .eh_frame in the ELF file, whose full name is “exception handling frame”. It is designed to provide a complete stack traceback when an exception is thrown, which is exactly what we need.
.eh_frame consists of a series of CFI instructions that provide guidance for unwinding. These instructions are applied in program order: to recover the state of all registers at a given line of code, you must execute every CFI instruction from the start of the function up to that line.
Fortunately, when unwinding we only need the caller’s EIP and the value of the register holding the lua_State *L variable, so the recovery rules for most registers can be ignored.
To speed up the bpf program, we can preprocess the .eh_frame data before loading it into bpf.
By simulating the execution of the CFI instructions, we obtain the register-recovery rules corresponding to every line of assembly.
This way, once the bpf program has an EIP in hand, it can binary-search for the recovery rules of all registers.
To make better use of the cache, we can also generate an array similar to .eh_frame_hdr that contains only the EIPs, used purely for the bpf program’s binary search.
Once the index is found, we look up the real unwind information.
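The lookup step can be sketched as follows. This is a user-space model, not the actual bpf code; the entry layout (`cfa_off`, `ra_off`) is a simplified stand-in for the real recovery rules. The table is sorted by start address, and we binary-search for the last entry whose address does not exceed the sampled EIP:

```c
#include <stdint.h>
#include <stddef.h>

// One precomputed unwind entry: for any pc in [addr, next entry's addr),
// these rules recover the caller's state. The rule fields are illustrative.
struct unwind_entry {
    uint64_t addr;     // start pc this rule applies from (sorted ascending)
    int32_t  cfa_off;  // offset of the canonical frame address from rsp
    int32_t  ra_off;   // offset of the return address from the CFA
};

// Find the last entry whose addr <= pc; returns -1 if pc precedes the table.
static int unwind_lookup(const struct unwind_entry *tab, size_t n, uint64_t pc) {
    int lo = 0, hi = (int)n - 1, found = -1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (tab[mid].addr <= pc) {
            found = mid;   // candidate; keep looking to the right
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return found;
}
```

Keeping the addresses in their own densely packed array (as .eh_frame_hdr does) makes each probe of this search touch less memory, which is the cache benefit mentioned above.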
The above solution seems fine, but it overlooks one thing: what we read from the ELF file are relative virtual addresses (the PC values), while what the bpf program sees are the absolute virtual addresses (VA) mapped by the kernel.
When preprocessing .eh_frame, we need to convert each PC in it to a VA. This requires the Program Header Table of the ELF file, which describes how the segments of the ELF file are mapped into the process address space. Combined with the mapping information in /proc/&lt;pid&gt;/smaps, we can convert between PC and VA.
The conversion logic is as follows: when a program header’s p_offset equals the offset field of a mapping in /proc/&lt;pid&gt;/smaps, the two refer to the same mapped region of the file.
Given a PC from .eh_frame, we compute its offset within that mapped region and add the mapping’s base address from /proc/&lt;pid&gt;/smaps, which yields the PC’s absolute virtual address (VA) in the process’s address space.
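That conversion is simple arithmetic once the segment and the mapping have been matched. A minimal sketch (the struct and its field names are mine, standing in for values parsed from the program headers and from /proc/&lt;pid&gt;/smaps):

```c
#include <stdint.h>

// Mapping info for one ELF segment, matched against /proc/<pid>/smaps.
struct segment_map {
    uint64_t p_vaddr;    // segment's relative virtual address (program header)
    uint64_t p_offset;   // segment's file offset (program header)
    uint64_t map_start;  // where the kernel mapped this file region (smaps)
    uint64_t map_offset; // file offset of the mapping (smaps)
};

// Convert a PC read from .eh_frame (relative) into the absolute VA the
// bpf program will observe. Returns 0 if this segment/mapping pair does
// not correspond to the same file region.
static uint64_t pc_to_va(const struct segment_map *m, uint64_t pc) {
    if (m->p_offset != m->map_offset)
        return 0; // not the same file region; try another mapping
    // Offset of pc inside the segment, plus the mapping base address.
    return m->map_start + (pc - m->p_vaddr);
}
```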
Now we can finally unwind the C stack in the bpf program.
Lua’s call stack is comparatively simple: just walk the L->ci linked list.
But there is a special problem. Functions in Lua are dynamic: a function that exists at the moment of sampling may be freed by the garbage collector (GC) a moment later.
Therefore, when walking Lua’s call stack we must capture all of the source-file information immediately, or we may not be able to obtain it afterwards.
However, storing file paths and line numbers directly in each sampled Lua call stack wastes a great deal of space.
A quick calculation: if the maximum Lua call-stack depth we want to support is 128 and the maximum file-path length is 64 bytes, each call stack needs 128 * 64 + 4 bytes of storage.
That much storage is unacceptable, and it would also severely hurt the performance of call-stack aggregation.
To simplify the design, I created a string mapping table, strings, in the bpf program.
When walking the Lua call stack, we convert each string to a uint32_t id via strings, and then build a virtual address as id << 32 | line_num.
To mark it as a synthetic address rather than a real one, the final result is (uint64_t)id << 32 | line_num | (1ULL << 63).
This works because bit 63 of a user-space address is always 0.
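The encoding and its inverse are a few lines of bit manipulation. A sketch (function names are mine; the id is assumed to fit in 31 bits so it survives alongside the marker bit):

```c
#include <stdint.h>

// Encode a Lua frame as a synthetic "address": the string id in the high
// bits, the line number in the low 32 bits, and bit 63 set to mark it as
// synthetic (user-space addresses always have bit 63 clear).
static uint64_t lua_frame_encode(uint32_t string_id, uint32_t line) {
    return ((uint64_t)string_id << 32) | line | (1ULL << 63);
}

static int lua_frame_is_synthetic(uint64_t addr) {
    return (addr >> 63) & 1;
}

static uint32_t lua_frame_string_id(uint64_t addr) {
    return (uint32_t)((addr << 1) >> 33); // drop the marker bit, keep the id
}

static uint32_t lua_frame_line(uint64_t addr) {
    return (uint32_t)addr; // low 32 bits are the line number
}
```

User space can then tell at a glance whether a sampled “address” should be symbolized through the ELF file or through the string table.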
It’s worth noting that strings is a hash table, so collisions can occur.
When two strings collide, we send the old string and its id back to user space for storage there, and assign a new id to the slot.
We take advantage of the fact that most Lua functions are resident, so the TString pointer of a function’s source file is very likely to stay the same. We can therefore treat the source file’s TString pointer as the string’s hash value: when the hash values differ, we simply consider them two different strings. Collisions exist, but we don’t need to care much about them.
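A user-space model of that interning scheme (names and slot count are mine; the eviction counter stands in for “send the old pair back to user space”):

```c
#include <stdint.h>

#define STRINGS_SLOTS 1024

// One slot of the bpf-side string table. The source-file TString pointer
// is used directly as the hash/key: resident functions keep the same
// pointer, so lookups are cheap pointer comparisons.
struct string_slot {
    uint64_t tstring_ptr; // key: address of the Lua TString
    uint32_t id;          // id handed out for this string
};

struct string_table {
    struct string_slot slots[STRINGS_SLOTS];
    uint32_t next_id;
    uint32_t evictions;   // stand-in for "ship old entry to user space"
};

// Return the id for this TString pointer, assigning a fresh one on miss.
// On collision the old entry is evicted; in the real profiler it is sent
// to user space first so no id mapping is ever lost.
static uint32_t string_intern(struct string_table *t, uint64_t ptr) {
    struct string_slot *s = &t->slots[(uint32_t)(ptr % STRINGS_SLOTS)];
    if (s->tstring_ptr == ptr)
        return s->id;        // hit: same pointer, same string
    if (s->tstring_ptr != 0)
        t->evictions++;      // collision: flush old pair, then reuse slot
    s->tstring_ptr = ptr;
    s->id = ++t->next_id;
    return s->id;
}
```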
bpf_helper provides a function, bpf_get_stackid, that maps an entire call stack to an id, which is handy for generating flame graphs.
Since we unwind manually, we have to implement this functionality ourselves. But that also brings benefits: by communicating with user space, we can exploit its abundant memory to do things bpf_get_stackid cannot.
First, we define a hash table called stacks.
Each time we obtain a stack sample, we compute hash values for the kernel-space call stack, the user-space call stack, and the Lua call stack together, and use the result to select a slot in stacks.
If the slot is already occupied, we compare its contents with the current call stack; if they are the same, the counter is incremented by one.
If they differ, bpf_get_stackid would have to either discard the old call stack in the slot or discard the newly inserted one.
But since we can communicate with user space, we can instead send the old call stack back to user space and let the new one occupy the slot.
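The aggregation logic, modeled in user space (the layout, slot count, and FNV-style hash are illustrative; the `flushed` counter stands in for sending the evicted stack to user space over the ringbuffer):

```c
#include <stdint.h>
#include <string.h>

#define STACK_SLOTS 256
#define MAX_FRAMES 8

// One aggregated call stack and its sample count (simplified layout).
struct stack_entry {
    uint64_t frames[MAX_FRAMES];
    uint32_t nframes;
    uint32_t count;
    int used;
};

struct stack_table {
    struct stack_entry slots[STACK_SLOTS];
    uint32_t flushed; // stand-in for "old stack sent back to user space"
};

// FNV-1a over the frame addresses.
static uint64_t stack_hash(const uint64_t *frames, uint32_t n) {
    uint64_t h = 1469598103934665603ULL;
    for (uint32_t i = 0; i < n; i++) {
        h ^= frames[i];
        h *= 1099511628211ULL;
    }
    return h;
}

// Count one occurrence of a stack. Unlike bpf_get_stackid, a colliding
// old stack is not discarded: it is flushed to user space (modeled by
// the counter) and the new stack takes over the slot.
static void stack_count(struct stack_table *t, const uint64_t *frames, uint32_t n) {
    struct stack_entry *e = &t->slots[stack_hash(frames, n) % STACK_SLOTS];
    if (e->used && e->nframes == n &&
        memcmp(e->frames, frames, n * sizeof(uint64_t)) == 0) {
        e->count++;       // same stack: just bump the counter
        return;
    }
    if (e->used)
        t->flushed++;     // different stack in the slot: flush, then reuse
    memcpy(e->frames, frames, n * sizeof(uint64_t));
    e->nframes = n;
    e->count = 1;
    e->used = 1;
}
```

Because evicted stacks are preserved in user space instead of dropped, no sample is ever lost to a hash collision.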
Merging the Lua call stack with the C call stack was not smooth sailing either.
Starting from Lua 5.4, Lua supports calling yield inside C functions.
As a result, a C function or C closure may appear in L->ci (Lua’s call-information list) while the corresponding frame no longer exists in the C call stack.
The current solution is a heuristic matching strategy: when a C function in the L->ci list matches a C function in the C call stack, we assume that the frames from the top of the Lua call stack down to that C function were invoked by that C function in the C call stack, and merge them accordingly.
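The heuristic can be sketched like this (a user-space model; the frame representation and function names are mine, with both stacks ordered innermost-first). Whenever a C frame matches the next C function recorded in L->ci, the Lua frames sitting above that C function are emitted first, attributed to it:

```c
#include <stdint.h>
#include <stddef.h>

// A frame from L->ci: either a Lua function (synthetic address) or a
// C function/closure (its real function pointer).
struct lua_frame {
    uint64_t addr;
    int is_c;
};

// Heuristically merge the two stacks into out (innermost-first).
// C frames with no match in L->ci are emitted as-is.
static size_t merge_stacks(const uint64_t *cstack, size_t nc,
                           const struct lua_frame *ci, size_t nci,
                           uint64_t *out, size_t cap) {
    size_t n = 0, li = 0;
    for (size_t i = 0; i < nc; i++) {
        // Does this C frame match a C function recorded in L->ci?
        size_t j = li;
        while (j < nci && !(ci[j].is_c && ci[j].addr == cstack[i]))
            j++;
        if (j < nci) {
            // Emit the Lua frames sitting above the matched C function.
            for (size_t k = li; k < j && n < cap; k++)
                out[n++] = ci[k].addr;
            li = j + 1;
        }
        if (n < cap)
            out[n++] = cstack[i];
    }
    return n;
}
```

A C function that yielded and thus vanished from the C stack simply never matches, so its Lua frames attach to the next C frame that does.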
A few side notes.
When I first learned about eBPF, I heard that the kernel has a bpf verifier that guarantees the bpf programs you write can never corrupt kernel data.
I always thought this was amazing, and wondered whether the technique would be invincible if applied to checking application programs.
Eventually I learned that the bpf verifier takes a “better to wrongly reject a thousand than to let one slip through” approach, and its error messages can leave you confused and frustrated.
A very important thing to know is that the bpf verifier runs after compilation.
If you write the appropriate if check but the verifier still reports that you didn’t, it may be because the compiler optimized your check away, and you will need various unusual tricks to defeat the compiler’s optimizations (I spent an entire weekend on this problem).
The final source code is here
The post Reimplementing a Lua Performance Analyzer first appeared on Return to Chaos BLOG .