Inside the Go CPU profiler

Original link: https://colobu.com/2022/09/29/Inside-the-Go-CPU-profiler/

Original text: Inside the Go CPU Profiler, by Sümer Cip. I guessed Felix Geisendörfer would weigh in, and sure enough, he did.

Go is one of those languages that ships with its own profiler. Its runtime includes a powerful set of profilers in its own right. Other languages, such as Ruby, Python, and Node.js, also include profilers or APIs for writing profilers, but they offer limited functionality compared to Go's out-of-the-box profilers. If you want to learn more about the observability tools Go provides, I highly recommend reading The Busy Developer's Guide to Go Profiling, Tracing and Observability by Felix Geisendörfer.

As a curious engineer who loves to dig into how things work under the hood, I've always wanted to learn how the Go CPU profiler works. This article is the result of that exploration. Whenever I read Go runtime code, I learn something new, and this time was no exception.

Basics

There are currently two types of profilers:

  • Tracing: measures whenever predefined events occur, for example when a function is called or when a function exits.
  • Sampling: takes a measurement at regular intervals.

The Go CPU profiler is a sampling profiler. Go also has an execution tracer, which is a tracing profiler that tracks certain events, such as acquiring locks, GC-related events, and so on.

A sampling profiler usually consists of two basic parts:

  • Sampler: a callback is invoked periodically, and a piece of profiling data, usually a stack trace, is collected. Different profilers use different strategies to trigger the callback.
  • Data collection: this is where the profiler collects its data: it might be memory consumption or call counts, essentially some metric associated with a stack trace. A toy sketch of both parts follows below.
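To make the two parts concrete, here is a toy, self-contained Go sketch (not how any real profiler is implemented): a ticker plays the role of the sampler and triggers a callback, and a map keyed by stack trace plays the role of the data collection part. All names are made up for illustration.

package main

import (
	"fmt"
	"runtime"
	"time"
)

// collector is the "data collection" part: counts keyed by stack trace.
type collector struct {
	samples map[string]int
}

// sample is the callback triggered by the sampler: it captures the current
// goroutine's stack trace and aggregates it into the map.
func (c *collector) sample() {
	buf := make([]byte, 4096)
	n := runtime.Stack(buf, false) // false: only the current goroutine
	c.samples[string(buf[:n])]++
}

func main() {
	c := &collector{samples: make(map[string]int)}

	// The "sampler" part: a ticker periodically triggers the callback.
	// A real sampling profiler interrupts running threads (e.g. via SIGPROF)
	// instead of polling from a loop like this.
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	deadline := time.After(200 * time.Millisecond)
	for {
		select {
		case <-ticker.C:
			c.sample()
		case <-deadline:
			for stack, count := range c.samples {
				fmt.Printf("%d sample(s) of:\n%s\n", count, stack)
			}
			return
		}
	}
}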

A small survey of how other profilers work

Linux perf uses PMU (Performance Monitoring Unit) counters for sampling. You instruct the PMU to generate an interrupt after some event has occurred N times, for example every 1000 CPU clock cycles.
A detailed article by Denis Bakhvalov explains how tools like perf and VTune use PMU counters to achieve this. Once the data collection callback is being triggered periodically, all that's left is to collect stack traces and aggregate them properly. For completeness, Linux perf uses perf_event_open(PERF_SAMPLE_STACK_USER,…) to collect stack information, and captured stack traces are written to user space via an mmap'd ring buffer.

py-spy and rbspy are well-known sampling profilers for Python and Ruby. They both run as external processes and periodically read the target application's memory to capture stack traces of running threads. On Linux, they use process_vm_readv to collect data; if I remember correctly, this API pauses the target application for a few milliseconds during the memory read. They then chase pointers in the memory they read to find the currently running thread structures and stack trace information. As one might guess, this is an error-prone and complicated approach, but it works very well. If I recall correctly, pyflame uses a similar approach.

Recent profilers such as Parca (and several others) use eBPF. eBPF is a relatively new technology that allows running user-supplied programs inside an in-kernel VM. It is an excellent technique used in many areas such as security, networking, and observability. I highly recommend reading up on eBPF; it is a very large topic well beyond the scope of this post.

How can I trigger the profiler periodically?

The Go CPU profiler is a sampling profiler. On Linux, the Go runtime uses setitimer/timer_create/timer_settime to set up delivery of the SIGPROF signal. The handler fires at the interval set by runtime.SetCPUProfileRate, which is 100 Hz (every 10 ms) by default. By the way, prior to Go 1.18 there were some serious issues with the Go CPU profiler's sampler, the details of which you can see here. If I recall correctly, setitimer was the recommended way on Linux to trigger time-based signals per thread: technically, it works just like you'd expect from a process-level signal delivery mechanism, but it is not a good mechanism for multi-core profiling.

Let’s see how to manually enable the Go CPU profiler:


f, err := os.Create("profile.pb.gz")
if err != nil {
	log.Fatal(err) // error handling elided in the original; os, log, and runtime/pprof imports assumed
}
if err := pprof.StartCPUProfile(f); err != nil {
	log.Fatal(err)
}
defer pprof.StopCPUProfile()

Once pprof.StartCPUProfile is called, the Go runtime starts generating SIGPROF signals at the configured interval. The Linux kernel delivers each SIGPROF signal to one of the running threads in the application. Since Go uses non-blocking I/O, goroutines waiting on I/O are not counted as running, and the Go CPU profiler does not capture data for such goroutines. By the way, this is the basic reason fgprof was implemented: fgprof uses runtime.GoroutineProfile to get both on-CPU and off-CPU data.
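Since fgprof came up, here is a rough, hedged sketch of the idea behind it: periodically snapshot all goroutines with runtime.GoroutineProfile (running and waiting alike) and count their stacks. This is a simplification of what fgprof actually does, and the helper names here are made up.

package main

import (
	"fmt"
	"runtime"
	"strings"
	"time"
)

// snapshot records the stacks of all goroutines, whether they are running
// on the CPU or waiting (off-CPU), which is the data fgprof builds on.
func snapshot(counts map[string]int) {
	records := make([]runtime.StackRecord, 256)
	n, ok := runtime.GoroutineProfile(records)
	if !ok {
		return // slice too small; a real implementation would grow it and retry
	}
	for _, rec := range records[:n] {
		frames := runtime.CallersFrames(rec.Stack())
		var sb strings.Builder
		for {
			frame, more := frames.Next()
			sb.WriteString(frame.Function)
			sb.WriteString(";")
			if !more {
				break
			}
		}
		counts[sb.String()]++
	}
}

func main() {
	counts := make(map[string]int)
	for i := 0; i < 10; i++ { // take 10 samples, 10 ms apart
		snapshot(counts)
		time.Sleep(10 * time.Millisecond)
	}
	for stack, n := range counts {
		fmt.Println(n, stack)
	}
}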

A picture is worth a thousand words. Here's how the Go runtime handles the SIGPROF signal:

How does the profiler collect data?

A randomly chosen running goroutine receives the SIGPROF signal, is interrupted, and the signal handler runs. The stack trace of the interrupted goroutine is obtained in the context of this signal handler and then saved into a lock-free log structure along with the current profiler labels (each captured stack trace can be associated with custom labels that you can later filter on). This special lock-free structure is named profBuf; it is defined in runtime/profbuf.go, and the comments there explain in detail how it works. It is a single-writer, single-reader lock-free ring buffer, similar to the data structure presented here. The writer is the signal handler and the reader is the profileWriter goroutine. The profileWriter goroutine periodically reads this buffer and aggregates the results into the final hashmap. This final hashmap is named profMap and is defined in runtime/pprof/map.go.
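To get a feel for what a single-writer, single-reader lock-free ring buffer looks like, here is a heavily simplified, illustrative sketch. This is not how profBuf stores data; the real structure in runtime/profbuf.go handles variable-length records, tags, overflow counting, and coordination with the scheduler.

package main

import "sync/atomic"

// spscRing is a toy single-writer, single-reader lock-free ring buffer,
// loosely inspired by the idea behind runtime/profbuf.go.
type spscRing struct {
	buf  []uint64
	head atomic.Uint64 // advanced only by the reader (think: profileWriter)
	tail atomic.Uint64 // advanced only by the writer (think: the signal handler)
}

func newSPSCRing(size int) *spscRing {
	return &spscRing{buf: make([]uint64, size)}
}

// push is called by the single writer. It must not allocate or block.
func (r *spscRing) push(v uint64) bool {
	head, tail := r.head.Load(), r.tail.Load()
	if tail-head == uint64(len(r.buf)) {
		return false // full: the real profBuf counts this as a lost sample
	}
	r.buf[tail%uint64(len(r.buf))] = v
	r.tail.Store(tail + 1)
	return true
}

// pop is called by the single reader.
func (r *spscRing) pop() (uint64, bool) {
	head, tail := r.head.Load(), r.tail.Load()
	if head == tail {
		return 0, false // empty
	}
	v := r.buf[head%uint64(len(r.buf))]
	r.head.Store(head + 1)
	return v, true
}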

Here is a visualization that puts together what we just introduced:

As you can see, the final structure is very similar to a regular pprof.Profile object: it is a hashmap where the keys are stack traces plus labels, and the values are the number of times that call stack was observed in the application. When pprof.StopCPUProfile() is called, profiling stops and the profileWriter goroutine calls build(). This function is responsible for writing the profMap structure to the io.Writer provided in the initial call to pprof.StartCPUProfile. Basically, this is how the final pprof.Profile is produced.
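Here is a tiny, illustrative approximation of that final aggregation step. The real profMap in runtime/pprof/map.go keys on the raw program counters and tag pointers rather than strings; the types below are invented for the example.

package main

import "fmt"

// sampleKey is a simplified stand-in for what the aggregation map keys on:
// the captured call stack plus the profiler labels attached to it.
type sampleKey struct {
	stack  string // the real code keys on the []uintptr program counters
	labels string // serialized label set
}

// profile counts how many times each (stack, labels) pair was observed.
type profile struct {
	counts map[sampleKey]int64
}

func (p *profile) add(stack, labels string) {
	if p.counts == nil {
		p.counts = make(map[sampleKey]int64)
	}
	p.counts[sampleKey{stack: stack, labels: labels}]++
}

func main() {
	var p profile
	p.add("main;work;encode", `{"handler":"/api"}`)
	p.add("main;work;encode", `{"handler":"/api"}`)
	p.add("main;gc", "")
	for k, n := range p.counts {
		fmt.Printf("%d\t%s\t%s\n", n, k.stack, k.labels)
	}
}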

The pseudocode for profileWriter below may help you understand the process:


func profileWriter(wi io.Writer) {
	b := newProfileBuilder(wi) // accumulates the samples (elided in the original pseudocode)
	for {
		time.Sleep(100 * time.Millisecond)
		data, tags := readProfile()
		if err := b.addCPUData(data, tags); err != nil {
			// record the error and keep going
		}
		if profilerStopped {
			break
		}
	}
	b.build() // -> generates the final pprof.Profile
}

Once I had a high-level understanding of the design, I asked myself the following questions:
Why does Go go to so much trouble to implement a unique lock-free structure just to hold temporary profiling data? Why not write everything into the hashmap periodically?

The answer lies in the design itself.

The first thing the SIGPROF handler does is disable memory allocation. In addition, the profiler code path involves no locks, and even the maximum depth of the stack walk is hardcoded; as of Go 1.19, it is 64. All of these details give the profiler a more efficient and predictable overhead. Low and predictable overhead is key to a production-ready profiler.
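As a rough illustration of the "hardcoded maximum depth, no allocations" idea, here is a sketch that captures a stack trace into a fixed-size array using the public runtime.Callers API. The signal handler itself uses the runtime-internal stack walker, not this API; the constant and helper below are made up for the example.

package main

import (
	"fmt"
	"runtime"
)

const maxStackDepth = 64 // mirrors the hardcoded depth mentioned above

// captureStack fills a fixed-size, stack-allocated array of program counters,
// so no heap allocation happens on the capture path.
func captureStack() ([maxStackDepth]uintptr, int) {
	var pcs [maxStackDepth]uintptr
	// Skip runtime.Callers itself and captureStack.
	n := runtime.Callers(2, pcs[:])
	return pcs, n
}

func main() {
	pcs, n := captureStack()
	frames := runtime.CallersFrames(pcs[:n])
	for {
		frame, more := frames.Next()
		fmt.Println(frame.Function)
		if !more {
			break
		}
	}
}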

Overhead

So is the profiler overhead constant by design? Well, it depends. Let me explain why. In a single profiler interrupt, the following happens:

  • A randomly chosen running goroutine is context-switched to run the SIGPROF handler,
  • A stack walk is performed, and the Go runtime saves the stack trace into the lock-free ring buffer,
  • The goroutine resumes.

In theory, all of the above should run in constant time, since no memory allocations or lock contention occur. And as far as I remember, this is true: all of the above takes about 1 microsecond (on a typical CPU). In practice, however, things get worse. If you search for the overhead of the Go CPU profiler, you will see numbers ranging from 0.5% to 1%-5% (I could not find a public mention of this number backed by proper evidence). The reason behind this is mostly related to how CPUs work. Modern CPUs are complex monsters and use their caches aggressively. A typical CPU core has three layers of cache: L1, L2, and L3. These caches are heavily utilized when CPU-intensive code is running, especially in applications that heavily read small (cache-fitting), sequentially laid-out data.

Matrix multiplication is a good example: during matrix multiplication, the CPU accesses lots of individual cells stored sequentially in memory. Such cache-friendly applications can see the worst overhead from a sampling profiler. While it would be easy to run some benchmarks with perf to verify this claim, that is beyond the scope of this post.
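For readers who want to check this themselves, here is a minimal, hypothetical benchmark sketch; one could run it with and without CPU profiling (for example, go test -bench . -cpuprofile cpu.out) and compare the reported ns/op. The matrix size and data layout are arbitrary choices.

package matmul

import "testing"

// naiveMatMul multiplies two n×n matrices stored as flat, row-major slices.
// The sequential access pattern keeps the CPU caches hot, which is exactly
// the kind of workload where profiler interrupts are most noticeable.
func naiveMatMul(a, b, c []float64, n int) {
	for i := 0; i < n; i++ {
		for k := 0; k < n; k++ {
			aik := a[i*n+k]
			for j := 0; j < n; j++ {
				c[i*n+j] += aik * b[k*n+j]
			}
		}
	}
}

func BenchmarkNaiveMatMul(b *testing.B) {
	const n = 256
	x := make([]float64, n*n)
	y := make([]float64, n*n)
	z := make([]float64, n*n)
	for i := range x {
		x[i], y[i] = float64(i%7), float64(i%13)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		naiveMatMul(x, y, z, n)
	}
}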

To sum up, the Go runtime does a good job of keeping the profiler overhead as low and predictable as possible. If you don't believe me and still have doubts, perhaps the discussion quoted below will convince you:

At Google, we continuously profile Go production services, and it is safe to do so.

Another code commit from Datadog's continuous profiler confirms this as well:

After testing this default on many high volume internal workloads (with the profiler turned on by default), we have determined that this default is safe for production.

Finally, based on the theory above, we can make the following observation:

On typical I/O-intensive applications, the profiler overhead will be very small.

This is because when there are many sleeping or idle goroutines, CPU cache thrashing does not make much difference. We observed this over and over again during Go CPU profiler benchmarks: on a typical I/O-intensive application, the overhead is practically zero (or statistically insignificant). But again, providing empirical evidence is beyond the scope of this post; one could do so by observing the throughput of a Go web application under load with the profiler turned on and off.
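For example, a quick (and admittedly unscientific) way to check this on an I/O-bound service is to load test a handler like the sketch below with a tool such as hey or wrk, once with no profiling and once while fetching a CPU profile, and compare the throughput. The endpoint path and sleep duration here are arbitrary.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	"time"
)

func main() {
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		// Simulate an I/O-bound handler: the goroutine sleeps, so it is not
		// "running" and is rarely interrupted by SIGPROF.
		time.Sleep(5 * time.Millisecond)
		w.Write([]byte("ok"))
	})
	// CPU profiling can then be enabled on demand, e.g.:
	//   go tool pprof "http://localhost:8080/debug/pprof/profile?seconds=30"
	log.Fatal(http.ListenAndServe(":8080", nil))
}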

Conclusion

What I like about this design is that it shows you can optimize your code based on how well you understand the access patterns of the underlying data structures. In this case, the Go runtime uses lock-free structures even though that would be complete overkill in most situations. The Go runtime is full of clever optimizations like this, and I highly recommend taking the time to dig into the parts you find interesting.

I hope you enjoyed it!

References

  1. https://github.com/DataDog/go-profiler-notes/blob/main/guide/README.md
  2. https://easyperf.net/blog/2018/06/01/PMU-counters-and-profiling-basics
  3. https://www.instana.com/blog/go-profiler-internals/
  4. https://www.datadoghq.com/blog/engineering/profiling-improvements-in-go-1-18/
  5. http://www.cse.cuhk.edu.hk/~pclee/www/pubs/ancs09poster.pdf
