How to optimize? I mean Go program

Original link: https://colobu.com/2022/10/17/a-first-look-at-arena/

Go language is a very easy to use language, and the optimization routines of Go programs are basically unknown to everyone. If you have the heart, you can search for many Go program optimization skills on the Internet. Some articles may only introduce Several optimization points, some articles from CPU architecture to slice pre-allocation, to finding performance bottlenecks through pprof, etc., comprehensively introduce the optimization of Go programs, so the visible methods are basically understood by everyone. Tapir asked a question, as shown below, you can see how deeply everyone has been optimizing the Go language.


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

const N = 1000
var a [N] int
//go:noinline
func g0(a *[N] int ) {
for i := range a {
a[i] = i // line 12
}
}
//go:noinline
func g1(a *[N] int ) {
_ = *a // line 18
for i := range a {
a[i] = i // line 20
}
}

Go officials are not idle either. Although the Go language did not aim to match the performance of the C++ language at the beginning of its creation, the Go team has been working on the compilation and runtime optimization of the Go language.

Recently, the Go language is also adding two new performance optimization features, one is cmd/compile: profile-guided optimization , this proposal has been accepted, and we will introduce the subsequent functions after the initial formation. Another increases the memory arena .

In addition to the optimization of common general languages, one of the biggest problems affecting the performance of Go programs is garbage collection, so one of the reasons why programmers who use C++ and Rust develop diss Go programs. However, this is also a feature that cannot be bypassed by garbage collection programming languages. Basically, there is an unavoidable cost of STW. Even without STW, resources will be consumed for object convenience and inspection during garbage collection, so theoretically, Go performance is similar. Comparing C++/Rust language performance will always be worse, unless you disable garbage collection and do pure CPU calculations.

Debian’s benchmark’s game website has tested and published the performance comparison of some scenarios in many languages. For example, the following is a performance comparison of several implementation versions of Rust and Go:

It can be seen that Go’s performance is much worse than Rust’s in this binary tree scenario. But the best performing Rust implementation uses arena ‘s memory allocation:


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
twenty one
twenty two
twenty three
twenty four
25
26

use bumpalo::Bump;
use rayon::prelude::*;
#[derive(Debug, PartialEq, Clone, Copy)]
struct Tree<‘a> {
left: Option<&’a Tree<‘a>>,
right: Option<&’a Tree<‘a>>,
}
fn item_check (tree: &Tree) -> i32 {
if let (Some(left), Some(right)) = (tree.left, tree.right) {
1 + item_check(right) + item_check(left)
} else {
1
}
}
fn bottom_up_tree <‘r>(arena: &’r Bump, depth: i32 ) -> &’r Tree<‘r> {
let tree = arena.alloc(Tree { left: None, right: None });
if depth > 0 {
tree.right = Some(bottom_up_tree(arena, depth – 1 ));
tree.left = Some(bottom_up_tree(arena, depth – 1 ));
}
tree
}

Arena is a memory pool technology. Generally speaking, arena will create a large contiguous memory block. This memory block only needs to be pre-allocated once. The creation and release of this memory are performed manually.

The Go language is ready to add new arena features and provide a new package in the standard library: arena . At present, this proposal is still in the state of holding, but the relevant code has been mentioned in the master branch one after another, so the distribution approval is basically impossible to run, and it should be tried in Go 1.20, which is the version next spring. (Of course, there are also developers who are dissatisfied with this approach of Go, because the idea of ​​external developers is basically rejected or not paid attention to, and the people in the Go team have this idea and can immediately implement it, even before the proposal is approved).

Package arena currently provides several methods:

  • NewArena(): Create an Arena, you can create multiple Arenas, create a batch of objects in batches, and release them manually. It is not thread safe.
  • Free(): Free Arena and all objects created on it. You should not use the freed object any more, or it may cause unexpected errors.
  • New T any *T: Create an object
  • MakeSlice T any []T: Creates a Slice in Arena.
  • Clone T any : Clone an object on Arena, which can only be a pointer, slice or string. If the incoming object is not allocated in Arena, return the original object directly, otherwise create a new object from Arena.

    There is currently no method for creating maps and channels on Arena such as MakeMap and MakeChan , which may be added in the future.

    The arena function creates a piece of memory for a group of Go objects, and releases it manually as a whole at one time, which can avoid garbage collection. After all, we also mentioned that garbage collection is one of the biggest performance killers of Go programs.

Officials suggest that when creating a large number of Go objects in batches, it is more efficient to use Mib to allocate memory every time, and even they found a scene: protobuf deserialization.

Because of the problems of garbage collection and memory allocation, this function is not simple to implement, and it involves the transformation of runtime code. Regardless of how the arena is handled by garbage collection, the main implementation of arena is in arena.go at runtime. Since this feature is still under development, there may be changes to this file.

Next, we use the binary tree example of the debian benchmark’s game to make a comparison with and without arena:


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
twenty one
twenty two
twenty three
twenty four
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122

package main
import (
“arena”
“flag”
“fmt”
“strconv”
“time”
)
// gotip run -tags “goexperiment.arenas” main.go -arena 21
// GOEXPERIMENT=arenas gotip run main.go -arena 21
var n = 0
type Node struct {
left, right *Node
value[] byte
}
func bottomUpTree(depth int ) *Node {
if depth <= 0 {
return &Node{}
}
return &Node{bottomUpTree(depth – 1 ), bottomUpTree(depth – 1 ), make ([] byte , 128 , 128 )}
}
func bottomUpTreeWithArena(depth int , a *arena.Arena) *Node {
node := arena.New[Node](a)
node.value = arena.MakeSlice[ byte ](a, 128 , 128 )
if depth <= 0 {
return node
}
node.left = bottomUpTreeWithArena(depth -1 , a)
node.right = bottomUpTreeWithArena(depth -1 , a)
return node
}
func (n *Node) itemCheck() int {
if n.left == nil {
return 1
}
return 1 + n.left.itemCheck() + n.right.itemCheck()
}
const minDepth = 4
var useArena = flag.Bool( “arena” , false , “use arena” )
func main() {
flag.Parse()
if flag.NArg() > 0 {
n, _ = strconv.Atoi(flag.Arg (0 ))
}
appStart := time.Now()
defer func () {
fmt.Printf( “benchmark took: %v\n” , time.Since(appStart))
}()
if *useArena {
maxDepth := n
if minDepth +2 > n {
maxDepth = minDepth + 2
}
stretchDepth := maxDepth + 1
a := arena.NewArena()
start := time.Now()
check := bottomUpTreeWithArena(stretchDepth, a).itemCheck()
a.Free()
fmt.Printf( “stretch tree of depth %d\t check: %d, took: %v\n” , stretchDepth, check, time.Since(start))
a = arena.NewArena()
longLiveStart := time.Now()
longLivedTree := bottomUpTreeWithArena(maxDepth, a)
defer a.Free()
for depth := minDepth; depth <= maxDepth; depth += 2 {
iterations := 1 << uint (maxDepth-depth+minDepth)
check = 0
start := time.Now()
for i := 1 ; i <= iterations; i++ {
a := arena.NewArena()
check += bottomUpTreeWithArena(depth, a).itemCheck()
a.Free()
}
fmt.Printf( “%d\t trees of depth %d\t check: %d, took: %v\n” , iterations, depth, check, time.Since(start))
}
fmt.Printf( “long lived tree of depth %d\t check: %d, took: %v\n” , maxDepth, longLivedTree.itemCheck(), time.Since(longLiveStart))
} else {
maxDepth := n
if minDepth +2 > n {
maxDepth = minDepth + 2
}
stretchDepth := maxDepth + 1
start := time.Now()
check := bottomUpTree(stretchDepth).itemCheck()
fmt.Printf( “stretch tree of depth %d\t check: %d, took: %v\n” , stretchDepth, check, time.Since(start))
longLiveStart := time.Now()
longLivedTree := bottomUpTree(maxDepth)
for depth := minDepth; depth <= maxDepth; depth += 2 {
iterations := 1 << uint (maxDepth-depth+minDepth)
check = 0
start := time.Now()
for i := 1 ; i <= iterations; i++ {
check += bottomUpTree(depth).itemCheck()
}
fmt.Printf( “%d\t trees of depth %d\t check: %d, took: %v\n” , iterations, depth, check, time.Since(start))
}
fmt.Printf( “long lived tree of depth %d\t check: %d, took: %v\n” , maxDepth, longLivedTree.itemCheck(), time.Since(longLiveStart))
}
}

In this program, we use the -arena parameter to control whether or not to use arena . First you must install or update gotip to the latest version (if you have gotip installed, execute gotip downloamd , if not, go install golang.org/dl/gotip@latest ).

  • enable -arena : run GOEXPERIMENT=arenas gotip run -arena main.go 21
  • without -arena : run GOEXPERIMENT=arenas gotip run -arena=false main.go 21

However, this feature is still under development and the function is not perfect.

I tested it on MacOS, and the performance of arena will be significantly improved, but when tested under windows, the performance has dropped.

This article is reprinted from: https://colobu.com/2022/10/17/a-first-look-at-arena/
This site is for inclusion only, and the copyright belongs to the original author.