Remembering a difficult debugging process

Original link: https://blog.codingnow.com/2023/05/debug.html

Just before May 1st, I ran into a very difficult bug that took me two days to solve. In terms of trickiness it ranks among the top three of my career, so it is well worth writing down.

The problem appeared two days before the holiday, when many colleagues had already asked for leave. I too was planning to take a good rest and spend a few days playing with my children. Right after I routinely updated the game project's repository, the program suddenly crashed. At first I didn't think much of it. Our game is developed mainly in Lua, so it doesn't need recompiling after an update, and I only build the project about once a week; maybe it had simply been too long since my last build, so I rebuilt on a whim. The problem persisted, though with low probability: roughly one crash in every three launches. That was unusual, because I hadn't seen a segmentation fault in a long time.

Our game engine is designed specifically for mobile phones. So that we can watch the result on the phone in real time during development, it does not work like other game engines, which run a Windows or Mac version on the development machine and only export a mobile build when needed. We wrote the engine core in C/C++/Obj-C and compiled it into an app on iOS, with a virtual file system built into the engine. During development, a process called fileserver runs on the development machine and maps one of its local directories into that virtual file system. The engine core running on the phone then reads the development machine's directory directly, and any modification we make on the development machine is immediately reflected in the app on the phone.

The engine core that runs on the phone is compiled and released regularly by the project team; I generally don't compile it myself. What crashed this time was my local fileserver process. It doesn't run any game business logic; it just serves Lua code and resources to the engine core through a simple protocol. It does take on some more complicated work, such as debugging the Lua program on the phone at runtime and compiling local shaders, textures, models, animations and other resources on demand, but this crash occurred during the startup phase, which seemingly involves none of that yet.

The strange thing was that, of the whole project team, only my development machine had the problem. It looked like the debugging would have to be done by me.

The fileserver itself is also written in Lua and depends on only a small number of C modules. At first I thought it would take a few minutes to find the cause, but after an hour of debugging I realized the problem was not that simple.

First of all, I usually run a release build with debug information stripped. Because this program had been running stably for a long time, it had never occurred to me that I would need to debug it at the C level. Once I recompiled a debug build with debug information, it almost never crashed (the probability dropped from 1/3 to below 1/20).

I checked the versions since the last known stable run. The most recent commit that triggered the bug merely modified a few lines of Lua code that looked completely harmless. When I tried to insert a few log lines into the relevant Lua file, the bug disappeared. Even inserting a single line of Lua with no side effects at all, such as local x = 1, made the bug vanish. I quickly gave up on that debugging approach and started thinking about the cause behind it.

Why did the bug appear only on my machine, when none of the project team's many development machines had encountered it? One obvious difference: I use mingw64 for development, while everyone else builds the project with VC. But then another colleague tried building with mingw on his development machine and still failed to reproduce it. He suspected my Windows environment was somehow different. As it happened, my Windows was prompting me to upgrade, so I rebooted the system and took the chance to calm down.

The problem was still there. I ruled out the idea that my development environment was at fault: everything had been normal for the past few months, and by rolling back several commits with git and comparing repeatedly, I had established that the problem was introduced recently. Even if the newly updated Lua code was harmless, it evidently changed the program's behavior. The occasional segmentation fault was a fact: under some condition, C code was accessing invalid memory. There had to be a reason, and I couldn't turn a blind eye to it. As for why adding one harmless line of Lua code masked the problem, the only explanation I had was that Lua chunks of different lengths behave subtly differently in the parsing phase, and that difference changes the timing of the Lua GC during parsing.

Our engine is built on the multi-threaded library ltask. It is very similar to skynet, but simpler, and I have optimized it somewhat differently for the client side. Our program runs multiple Lua virtual machines in parallel, which is why this bug is probabilistic. I suspected the final crash was related to the Lua GC, so I tried disabling the GC of several virtual machines at startup, and indeed the crash went away.
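For reference, the experiment amounts to something like this (a minimal sketch, not the project's actual code; in pure Lua it would just be collectgarbage("stop")):

```cpp
// Minimal sketch: suspend a Lua 5.4 VM's collector right after creating it,
// to test whether a crash is GC-related.
#include <lua.hpp>

lua_State *boot_vm_without_gc(void) {
    lua_State *L = luaL_newstate();
    lua_gc(L, LUA_GCSTOP);   // same effect as collectgarbage("stop")
    return L;
}
```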

Because we had recently updated to Lua 5.4.5 RC2, the impact of the Lua upgrade had to be ruled out first. After rolling back the Lua version, the problem remained. Besides, if it were a problem in Lua itself, I would expect the crashes to be more frequent. This was clearly related to multithreading, not to Lua.

Could a long-dormant problem in ltask have been triggered? Possible; it is a multi-threaded framework, after all, and concurrent code breeds bugs easily. On the other hand, I thought it very unlikely that the problem lay in ltask: it had undergone no major modification for two years and had been running stably. The fileserver that crashed is a program with a very narrow job, while our game engine, which is far more complex, also runs on ltask without problems.


Perhaps using git bisect to pin down the commit that introduced the bug would turn up more clues.

As mentioned earlier, there was a commit A that directly triggered the bug on my machine. But this commit looked completely harmless; the code it touches is rendering-related and doesn't even run during the fileserver's startup phase. So the bug must predate it. I first stripped commit A down to the few lines of Lua changes I had confirmed could trigger the problem, and regenerated a patch from them. Then I started bisecting from one week earlier, and after each bisect step I cherry-picked the freshly made patch back on top for verification. In this way I homed in on another commit B, three days older, that caused the problem.

Commit B was a colleague's optimization of the virtual file system's batch file-reading protocol. This felt closer, since it affects the fileserver's work. It is worth mentioning that the bootstrap process of our engine core is quite involved. The virtual file system (vfs) is itself an ltask service written in Lua, yet the Lua code is stored inside the vfs: a chicken-and-egg problem. The bootstrap code is written in Lua too, and it must also be able to update itself.

Bootstrap does not run inside ltask at first. It boots with a simple vfs module, communicates with the fileserver through that module to fetch the latest version of the bootstrap code, and, if that version differs, updates itself and runs the new version once more. Only then is the ltask framework loaded; the simple vfs module used by bootstrap is then formally replaced by a vfs service of ltask itself, which the subsequent code uses.

As you can see, bootstrap may execute repeatedly, and the virtual machine it lives in keeps migrating between threads. Any modification to the vfs module is bound to affect the entire bootstrap process.

After double-checking commit B, I still thought it was harmless: it just adds several lines of Lua code. Yet even without running the newly added code, merely adding it was enough to trigger the bug. I cut commit B down as well, merged it with the earlier patch from commit A into a new patch sufficient to trigger the bug, and continued rolling back with git bisect, testing the patch after each step.

At the same time, I rewrote the fileserver's startup code, removing as much irrelevant business logic as possible and keeping only the modules implicated in the bug. After some attempts this even raised the probability of the bug (to roughly 1/2), and when nothing went wrong the program would exit quickly and cleanly, which made automated testing easier.

By the time I had rolled back to the February version of the engine, I realized this was an inefficient way to debug (even though I had automated the testing with scripts), and an unsustainable one. I had accumulated several modifications that could trigger the bug, all of them plain Lua code, completely harmless on the surface; remove any one of them and the bug disappeared. The real cause was still missing. And going back and forth like this, sooner or later some version would fail to run at all (because too many later patches had piled up on it).

Although this process did not locate the problem, it at least confirmed a few points:

  1. The bug had been dormant for a long time and was only exposed by a series of (unrelated) changes over the past few months.
  2. The direct cause of the crash was that memory managed by a Lua virtual machine was unexpectedly modified, crashing the Lua code. Mostly this happened during Lua GC (because the GC traverses the VM's data); occasionally it crashed elsewhere in Lua's execution.
  3. This is a concurrency-related bug.

The last point came from experimenting with ltask's configuration. By default the fileserver opens 8 worker threads; after I reduced the number of workers, the bug disappeared. This may also be why it could not be reproduced on colleagues' machines: my development machine does not have Cloud Shell installed, while my colleagues' machines all run that background monitoring program. It often saturates one CPU core, which can leave the fileserver with too little real concurrency.

I also found a recent small modification to ltask that interfered with the bug. A while ago, a colleague suggested that I use SRWLOCK instead of CRITICAL_SECTION on Windows. Surely an insignificant little change. But when I rolled back to the CRITICAL_SECTION version, the bug disappeared. This again pointed to concurrency as the cause. It also explains why adding a few log lines to the C code made the bug vanish, never mind that it wouldn't show itself at all under gdb.
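For context, the swap in question looks roughly like this (a sketch, not ltask's actual code). SRWLOCK is lighter-weight than CRITICAL_SECTION and has different acquire/release timing, which is apparently enough to perturb a latent race:

```cpp
// The two Windows lock flavors involved. Swapping one for the other does not
// create or fix the bug; it merely changes timing enough to hide or expose it.
#include <windows.h>

struct lock_with_cs {                     // the old version
    CRITICAL_SECTION cs;
    lock_with_cs()  { InitializeCriticalSection(&cs); }
    ~lock_with_cs() { DeleteCriticalSection(&cs); }
    void lock()     { EnterCriticalSection(&cs); }
    void unlock()   { LeaveCriticalSection(&cs); }
};

struct lock_with_srw {                    // the new version
    SRWLOCK srw = SRWLOCK_INIT;           // no destroy call needed
    void lock()     { AcquireSRWLockExclusive(&srw); }
    void unlock()   { ReleaseSRWLockExclusive(&srw); }
};
```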


On the last day before the May Day holiday, I was stuck. A slight modification to the Lua code and the crash might disappear; yet it was a fact that a few lines of Lua scattered across multiple files and multiple virtual machines could cause a segmentation fault. Under gdb the bug hid. Adding logs to the C code lowered its probability. Adding a few lines of Lua made everything look fine, but I knew the bug was there, and I couldn't ignore it.

On the first day of the holiday, I got up early and arrived at the company at seven o'clock. I had played with the children until late the night before, so they wouldn't be up until noon. I figured I had one morning to untangle the knot.

It takes half an hour to walk to the office. Walking along the Pearl River, I sorted out my thoughts. Hunting for the fault with git bisect was too inefficient, and interactive debugging looked hopeless. It seemed I would have to rely on reading code, which happens to be the debugging method I am best at. But too much of this code was written by colleagues; I had not yet built a mental model of it, and the people involved were on vacation, with no one to ask. That was the biggest obstacle.

Normally, the most direct tool for hunting a segmentation fault is AddressSanitizer. But mingw does not support it yet, and although VC does, the VC build was not the one misbehaving. I had a whole morning, which I estimated was enough to build a crude AddressSanitizer of my own to assist debugging.

Our project uses Lua as the main framework, with only the necessary modules implemented in C/C++. That simplifies checking for out-of-bounds memory accesses. I could assume that neither Lua nor ltask had anything to do with the bug; then the problem could only lie in the C modules loaded during the fileserver's boot stage, and the C modules imported into Lua have a clear boundary.

Memory managed by Lua and memory managed by C/C++ are also clearly separated. The C modules I implemented myself have complete memory hooks. What is slightly troublesome is that several modules my colleagues wrote in C++ are hard to hook precisely at the memory-management level.

If I could hook the memory management of the C/C++ modules, I would switch CRT allocation from the heap to reserving virtual address space directly for each block (with invalid pages on both ends), releasing it when no longer in use. That would make the problem much faster to pinpoint; at the very least, I would be more likely to observe the bug in gdb. AddressSanitizer works in a similar way, and building one for a specific situation is not much trouble.
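The idea, in sketch form (my reconstruction of the technique, not code from the project): give every allocation its own virtual address range with inaccessible pages on both ends, so any out-of-bounds access or use-after-free faults immediately:

```cpp
// Each block lives alone in a reserved range: [guard page][data pages][guard
// page]. Only the middle is committed; touching either end faults at once.
#include <windows.h>
#include <cstdint>

static size_t page_size(void) {
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return si.dwPageSize;
}

static void *guarded_alloc(size_t size) {
    size_t page = page_size();
    size_t pages = (size + page - 1) / page;
    char *base = (char *)VirtualAlloc(NULL, (pages + 2) * page,
                                      MEM_RESERVE, PAGE_NOACCESS);
    if (!base) return NULL;
    if (!VirtualAlloc(base + page, pages * page, MEM_COMMIT, PAGE_READWRITE)) {
        VirtualFree(base, 0, MEM_RELEASE);
        return NULL;
    }
    // Push the block against the trailing guard page so overruns trap
    // immediately (a real page heap would also respect alignment).
    return base + page + (pages * page - size);
}

static void guarded_free(void *p) {
    size_t page = page_size();
    // The user pointer always lies in the first committed page, so rounding
    // down and stepping back one page recovers the reservation base.
    uintptr_t start = (uintptr_t)p & ~(uintptr_t)(page - 1);
    VirtualFree((void *)(start - page), 0, MEM_RELEASE);
    // After release, any late access to the block faults: a use-after-free
    // becomes visible in the debugger instead of corrupting a live heap.
}
```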

After some searching, there seems to be an API on Windows called _CrtSetAllocHook, which I had never used before; I also found something called _setheaphook(_HEAPHOOK _NewHook) in the headers of mingw and the Windows SDK, but googling it turned up no documentation at all.

I didn't think one morning was enough to figure out how to use these CRT memory hooks on Windows, and I didn't want to sink the whole May Day holiday into it and have my children complain. Better to find a less labor-intensive method.

Since there is a clear boundary between memory managed by Lua and memory managed by C/C++, and the C/C++ side is the harder one to hook, I might as well start from the Lua side. The Lua VM lets you supply a custom memory-management function. I redefined a lua_Alloc on top of Windows' HeapAlloc, so that each virtual machine uses its own independent heap, separate from the C/C++ modules. Sure enough, once the Lua VMs' memory was carved out of the shared heap, the crash disappeared.
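Something like this (a minimal sketch under my assumptions, not the fileserver's exact code):

```cpp
// Per-VM allocator: every lua_State gets a private Win32 heap, so Lua memory
// and CRT (C/C++ module) memory come from disjoint regions.
#include <windows.h>
#include <lua.hpp>

static void *heap_alloc(void *ud, void *ptr, size_t osize, size_t nsize) {
    HANDLE heap = (HANDLE)ud;   // the VM's private heap, passed as userdata
    (void)osize;
    if (nsize == 0) {           // Lua asks us to free the block
        if (ptr) HeapFree(heap, 0, ptr);
        return NULL;
    }
    if (ptr == NULL)            // fresh allocation
        return HeapAlloc(heap, 0, nsize);
    return HeapReAlloc(heap, 0, ptr, nsize);  // grow or shrink
}

lua_State *new_isolated_vm(void) {
    HANDLE heap = HeapCreate(0, 0, 0);        // growable private heap
    return lua_newstate(heap_alloc, heap);
}
```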

This also proved that the bug had nothing to do with the Lua interpreter's own code: some C/C++ module must be corrupting the memory used by the Lua VMs.

At this point an interesting idea struck me: in the custom lua_Alloc I could keep calling malloc as before, preserving the program's original allocation behavior, but not actually use the malloc'd memory. Instead, I would allocate another block twice as large from the separate heap: one half handed to the program, the other half holding a snapshot of the malloc'd block's contents.

If the program is well-behaved, the malloc'd memory will always remain exactly identical to my snapshot; if a bug modifies it, I can detect that precisely. In other words, the malloc'd memory is a honeypot I set out for the bug.

Implementing this idea was not difficult and took just over an hour. Then I caught the bug in gdb. It was caught at program exit, when my verification code detected an abnormal modification of the honeypot: a single byte in the middle of a small block had been incremented by 1.
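In sketch form, the honeypot allocator works roughly like this (a simplified reconstruction; the real code also handles frees, reallocation, and bookkeeping):

```cpp
// lua_Alloc still calls malloc with the original sizes, so the allocation
// pattern is unchanged, but the malloc'd block is pure bait that nobody
// should touch. Lua's real memory and a snapshot of the bait's contents
// live side by side in a block from the private heap.
#include <windows.h>
#include <cstdlib>
#include <cstring>
#include <cstdio>

struct header {
    void  *bait;    // malloc'd block; the program never uses it
    size_t size;
};

static void *honeypot_alloc(HANDLE heap, size_t n) {
    void *bait = malloc(n);   // allocate exactly as the old program did
    char *p = (char *)HeapAlloc(heap, 0, sizeof(header) + 2 * n);
    header *h = (header *)p;
    h->bait = bait;
    h->size = n;
    char *user = p + sizeof(header);   // first half: memory Lua really uses
    memcpy(user + n, bait, n);         // second half: snapshot of the bait
    return user;
}

// Any write into the bait by foreign code makes it diverge from the
// snapshot, which is exactly what the check at exit (and later, in ltask's
// scheduler) looks for.
static void honeypot_check(void *user) {
    header *h = (header *)((char *)user - sizeof(header));
    const char *bait = (const char *)h->bait;
    const char *snap = (const char *)user + h->size;
    for (size_t i = 0; i < h->size; i++)
        if (bait[i] != snap[i])
            fprintf(stderr, "honeypot hit: %p offset %zu\n", h->bait, i);
}
```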

Because none of the Lua-facing C++ modules were implemented by me, I couldn't guess which line of code this behavior might come from. I left my colleague a message on DingTalk describing my debugging approach and new findings. Not that I expected him to think of anything; it was just a bit of rubber-duck debugging.

For the second step, I wanted to make the bug still easier to expose and, with luck, to narrow down which C++ modules might contain it. I did two things:

  1. Extended the lua_Alloc to string all memory blocks together on a doubly linked list. Since multiple virtual machines run concurrently, multiple lists have to be maintained. This let me check every honeypot from inside ltask's scheduler.
  2. Modified the Lua bindings of the three related C++ modules written by colleagues, using a macro to wrap every C function exported to Lua in a shell that logs each call from Lua into C (see the sketch after this list).
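The wrapper in item 2 might look like this (a sketch; the macro and function names are mine, not the project's):

```cpp
// Wrap a lua_CFunction in a logging shell, so that honeypot hits can be
// correlated with the C++ API calls that happened around the same time.
#include <lua.hpp>
#include <cstdio>

#define LOGGED(fn)                                              \
    static int fn##_logged(lua_State *L) {                      \
        fprintf(stderr, "[lua->C] " #fn "\n");                  \
        return fn(L);   /* forward to the real binding */       \
    }

static int vfs_open(lua_State *L) {   // stand-in for a real binding
    lua_pushboolean(L, 1);
    return 1;
}
LOGGED(vfs_open)

// Registration then uses the shell: {"open", vfs_open_logged}, ...
```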

In the first job I added a concurrency check along the way: if the lua_Alloc function of one and the same Lua virtual machine were ever executed concurrently, it would be detected, which would mean a concurrency bug inside ltask itself. As expected, this check never fired.
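The check can be as small as an atomic flag (a sketch of the mechanism as I would build it, not ltask's code):

```cpp
// ltask should never schedule the same VM on two workers at once, so the
// same VM's allocator must never be entered concurrently.
#include <atomic>
#include <cassert>

struct vm_alloc_guard {
    std::atomic<int> inside{0};
    void enter() {
        int was = inside.exchange(1, std::memory_order_acquire);
        assert(was == 0 && "same VM's allocator entered concurrently");
        (void)was;
    }
    void leave() { inside.store(0, std::memory_order_release); }
};
```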

The second job involved three C++ modules: the wrapper around OS file-system functions (used to implement the vfs), the wrapper around the socket API, and the API for inter-thread communication that bypasses ltask (used during bootstrap, parts of which run before ltask exists).

I enabled the honeypot check in ltask's scheduler, and sure enough it caught the problem earlier (that mysterious byte-incremented-by-one showed up again). Several C++ API calls had happened nearby, and they became the prime suspects.

I read the relevant code carefully. One suspect was the implementation of socket select, which has a slight flaw but no bug; I'll discuss it with my colleague after the holiday. The other was the inter-thread communication function. I almost let it go at first glance, but half an hour later I found the clue. “When you have eliminated the impossible, whatever remains, however improbable…”

The process goes roughly like this: when a Lua VM wants to bypass ltask and talk to a VM on another thread, my colleague wrote it this way: the sender serializes the message, packs it into a malloc'd data block, puts the block's address into a C structure, and then waits on a std::binary_semaphore. Once the receiver has taken the data block pointer, it wakes the sender through this semaphore, and the sender continues with its subsequent work.

This is an open source module, so I can show the code.

It didn't look wrong at first glance, but I happened to have done something similar while developing skynet, so I realized the bug was here.

The C structure for this message lives in a userdata of the sender's Lua VM. It has no gc method; it simply relies on Lua's GC to reclaim the userdata's memory. When the sender receives the signal raised by the receiver, logically the receiver is finished with this C structure, but in practice it is not quite finished. So if the sender reclaims the memory quickly enough, the receiver's final modification of that memory can land after it has already been reclaimed.

Since Lua 5.4 introduced generational GC, this kind of short-lived userdata may be collected immediately after it goes out of scope. With unlucky timing, that is just a few instruction cycles faster than the receiver: before the receiver finishes its last write to the semaphore, the memory may already have been handed out again to another virtual machine.
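Distilled, the hazardous pattern looks like this (a sketch, not the module's actual code; note that freeing or destroying a semaphore while another thread may still be inside release() is undefined behavior):

```cpp
#include <semaphore>

struct message {
    void *data;                         // malloc'd serialized message
    std::binary_semaphore done{0};
};  // lives in a userdata of the sender's Lua VM

void receiver_side(message *m) {
    void *data = m->data;   // take ownership of the data block
    m->done.release();      // RACE: may still be writing into *m after the
                            // sender has already been unblocked
    // ... consume data ...
}

void sender_side(message *m) {
    m->done.acquire();      // can return while release() is still running
    // The sender resumes; Lua's generational GC may collect the userdata
    // right away, the same memory may be reallocated to another VM, and the
    // tail end of release() then scribbles on it (presumably the observed
    // byte incremented by 1).
}
```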

The specific byte-plus-one we observed is presumably part of the semaphore's release operation (a guess, not yet confirmed; my colleague will scrutinize it and fix the code after the holiday).

This is why the bug was so easily perturbed by the Lua GC, and why it only triggered under sufficiently high concurrency.


Finally, I could spend the holiday with my children in peace.

In the evening, the DingTalk message showed as read, and my colleague gave it a thumbs up.
