Even developers with many years of experience hesitate over these Go concurrency problems

Original link: https://colobu.com/2022/09/12/go-synchronization-is-hard/

Go has always been known for its simplicity and ease of learning. I have even met developers who claim that Go can be mastered in half a day, and that two or three years of Go development experience makes one an expert.

Compared with a language such as Rust, Go is indeed easy to learn. Its language specification is very short compared with those of other languages and really can be read in an hour. But if you dig deeper into the language (the same is true of other languages), you will find many details that can only be mastered with time and effort. Don't believe it? Let me test you with a few Go concurrency puzzles and see whether you can answer them.

Several Go concurrency puzzles

Write order seen by another goroutine

Is it possible for the following code to output 1, 0 at ❶?


package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    wg.Add(2)

    var x, y int

    go func() {
        x = 1
        y = 1
        wg.Done()
    }()

    go func() {
        r1 := y
        r2 := x
        fmt.Println(r1, r2) // ❶ r1 = 1, r2 = 0 possible?
        wg.Done()
    }()

    wg.Wait()
}

Two goroutines with crossed reads and writes

Is it possible for the following code to output r1=0, r2=0?


package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    wg.Add(2)

    var x, y int

    go func() {
        x = 1
        r1 := y
        fmt.Println(r1) // ❶
        wg.Done()
    }()

    go func() {
        y = 1
        r2 := x
        fmt.Println(r2) // ❷
        wg.Done()
    }()

    wg.Wait()
}

The write order seen by the two goroutines

In the following code, can ❶ and ❷ both output 1,0? In other words, can the two reading goroutines observe the writes to x and y in opposite orders?


package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    wg.Add(4)

    var x, y int

    go func() {
        x = 1
        wg.Done()
    }()

    go func() {
        y = 1
        wg.Done()
    }()

    go func() {
        r1 := x
        r2 := y
        fmt.Println(r1, r2) // ❶ 1,0 ?
        wg.Done()
    }()

    go func() {
        r3 := y
        r4 := x
        fmt.Println(r3, r4) // ❷ 1,0 ? (note: this goroutine reads y first, then x)
        wg.Done()
    }()

    wg.Wait()
}

sync.Once

The following is the implementation of sync.Once in the Go standard library. It is only a few lines of code, but does anything about it puzzle you?


func (o *Once) Do(f func()) {
    if atomic.LoadUint32(&o.done) == 0 { // ❶
        o.doSlow(f)
    }
}

func (o *Once) doSlow(f func()) {
    o.m.Lock()         // ❷
    defer o.m.Unlock() // ❸
    if o.done == 0 { // ❹
        defer atomic.StoreUint32(&o.done, 1) // ❺
        f()
    }
}

Why is atomic not used at ❹? And would it be okay not to use atomic at ❶ and ❺?

sync.WaitGroup

The following is sample code using WaitGroup. Do you need atomic.LoadInt64 when reading count at ❶? If you don't use it, is the output guaranteed to be 10?


func main() {
    var count int64
    var wg sync.WaitGroup
    wg.Add(10)

    for i := 0; i < 10; i++ {
        go func() {
            atomic.AddInt64(&count, 1)
            wg.Done()
        }()
    }

    wg.Wait()
    fmt.Println(count) // ❶
}

TryLock

The following code is the implementation of Mutex's TryLock method. Why is atomic not used at ❶? Can it happen that the lock has been released, yet TryLock still fails to acquire it?


func (m *Mutex) TryLock() bool {
    old := m.state // ❶
    if old&(mutexLocked|mutexStarving) != 0 {
        return false
    }
    if !atomic.CompareAndSwapInt32(&m.state, old, old|mutexLocked) {
        return false
    }
    return true
}

Think these problems over first; we won't analyze them just yet. Let's first look at the latest Go memory model, and then come back to these questions.

Go memory model

The new version of the memory model (Go 1.19) makes significant changes to the original text:

  • Documents Go's overall approach to its memory model. A data race is a concurrent write-write or write-read to the same memory location; unless atomic operations are used, data races are possible (note that read-read does not constitute a data race). Programs without data races behave as if all goroutines were multiplexed onto a single processor, a property called DRF-SC (data-race-free programs execute in a sequentially consistent manner).
  • Documents that multiword races can cause crashes.
  • Documents the happens-before rules for runtime.SetFinalizer.
  • Documents (or adds) happens-before rules for more synchronization primitives.
  • Documents the happens-before rules for sync/atomic, corresponding to sequentially consistent atomics in C++ (and similar constructs in Java, JavaScript, Rust, Swift, C, …).
  • Documents which compiler optimizations are disallowed.

Memory operations can be divided into three categories:

  • read-like: for example read, atomic read, mutex lock, and channel receive
  • write-like: for example write, atomic write, mutex unlock, channel send, and channel close
  • both read-like and write-like: for example atomic compare-and-swap

The memory model defines two relations: sequenced before and synchronized before.
Statements and expressions executed within a single goroutine follow sequenced before: they are executed in the order of control flow and expression evaluation defined by the Go language specification.
If a synchronizing read-like memory operation r observes a synchronizing write-like memory operation w, then w is synchronized before r.
The happens before relation is then defined as the transitive closure of the union of sequenced before and synchronized before.

In my understanding, the new Go memory model essentially refines the happens before relation.

For an ordinary (non-synchronizing) read r of a memory location x, a write w to x is visible to r, that is, r is guaranteed to observe w, when both of the following hold:

  • w happens before r
  • w does not happen before any other write w' (to x) that happens before r
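
To make these definitions concrete, here is a minimal sketch of my own (not from the original post) that builds happens before out of the two relations using a channel. The write to a is sequenced before the send (same goroutine), the send is synchronized before the corresponding receive (the channel rule), and the receive is sequenced before the read of a; by transitivity, the write happens before the read:

package main

import "fmt"

var a string

func main() {
    done := make(chan struct{})
    go func() {
        a = "hello, world" // sequenced before the send below
        done <- struct{}{} // synchronized before the receive in main
    }()
    <-done         // after the receive, the write to a is guaranteed visible
    fmt.Println(a) // always prints "hello, world"
}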

Actually, I don't want to go too deep into the theory of the Go memory model here. That would amount to translating the official document, and after translating it a few times I found the result unsatisfactory; it is better to read the official document directly. I also personally feel that the official document is not especially well written. Although it gives some definitions up front, some terms and claims are not well explained or rigorously argued (just my personal opinion). For example, when the document says some, it ought to define precisely which scenarios some covers; when it uses a specific term or phrase, it ought to define the term or add a citation.

Anyway, what matters, and what is easiest for Gophers to digest, is that it gives strict guarantees (definitions) for various synchronization scenarios, such as init, the creation and execution of goroutines, channels, Mutex, RWMutex, Once, and the newly added SetFinalizer. Most importantly, it finally defines the synchronization semantics of atomic operations.

atomic guarantee

The sync/atomic package provides a set of atomic operations. If the effect of an atomic operation A is observed by atomic operation B, then A is synchronized before B.
All atomic operations in a program behave as if executed in some sequentially consistent order.

This is the same guarantee as sequentially consistent atomics in C++ and volatile variables in Java.

Because instruction sets differ between CPUs, the concrete implementation of atomic also differs by architecture. Let's take the AMD64 architecture as an example.

For read operations such as atomic.Load, no special processing is actually needed, because a single-word read is already atomic:


func Load(ptr *uint32) uint32 {
    return *ptr
}

func Loadp(ptr unsafe.Pointer) unsafe.Pointer {
    return *(*unsafe.Pointer)(ptr)
}

func Load64(ptr *uint64) uint64 {
    return *ptr
}

For write (and read-modify-write) operations, the LOCK prefix is used:


TEXT ·Cas64(SB), NOSPLIT, $0-25
    MOVQ ptr+0(FP), BX
    MOVQ old+8(FP), AX
    MOVQ new+16(FP), CX
    LOCK
    CMPXCHGQ CX, 0(BX)
    SETEQ ret+24(FP)
    RET

The LOCK prefix guarantees exclusive access to the shared memory location and makes the modification visible to other processors via the cache-coherence protocol (MESI). On old Intel CPUs, LOCK locked the bus; modern CPUs lock the cache line instead. A clearer explanation can be found in the article x86 LOCK prefix.

Store uses the XCHG instruction, which does not need the LOCK prefix: it carries LOCK semantics by itself.


TEXT ·Store64(SB), NOSPLIT, $0-16
    MOVQ ptr+0(FP), BX
    MOVQ val+8(FP), AX
    XCHGQ AX, 0(BX)
    RET

Although we have been looking at the atomic implementations for the AMD64 architecture, it is worth taking a look at Load64 on the 32-bit x86 (386) port, which involves an interesting piece of knowledge.
386 is a 32-bit architecture whose word size is 32 bits, so reading an int64 value is a multiword operation, and without atomic it is possible to read a partially modified value. Therefore, on 32-bit platforms an atomic read of an int64 needs special handling, unlike the direct read on AMD64:


TEXT ·Load64(SB), NOSPLIT, $0-12
    NO_LOCAL_POINTERS
    MOVL ptr+0(FP), AX
    TESTL $7, AX
    JZ 2(PC)
    CALL ·PanicUnaligned(SB)
    MOVQ (AX), M0
    MOVQ M0, ret+4(FP)
    EMMS
    RET

It first checks whether the address is 8-byte aligned, and panics if it is not. It then cleverly uses the 64-bit MMX0 register to perform the read as a single atomic access.
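
This alignment check has a practical consequence at the Go level. As a hedged sketch of my own (the struct layout is chosen purely for illustration): the sync/atomic documentation only guarantees 64-bit alignment for the first word in a variable or in an allocated struct, array, or slice, so on 32-bit platforms a 64-bit field placed after a 32-bit field can trigger exactly this panic:

package main

import "sync/atomic"

type counters struct {
    flag int32
    n    int64 // on 386 or 32-bit ARM this field may end up only 4-byte aligned
}

func main() {
    c := &counters{}
    atomic.AddInt64(&c.n, 1) // fine on amd64; on 32-bit platforms this can panic ("unaligned 64-bit atomic operation")
}

Moving n to the front of the struct restores the guaranteed alignment and avoids the panic.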

WaitGroup, Cond, sync.Pool, sync.Map

The latest Go release also documents the memory model guarantees of the various concurrency primitives in their godoc, although I feel it would be more reasonable to put them in the Go memory model document itself.

For example WaitGroup:

In the terminology of the Go memory model, a call to Done “synchronizes before” the return of any Wait call that it unblocks.

For example Cond:

In the terminology of the Go memory model, Cond arranges that a call to Broadcast or Signal “synchronizes before” any Wait call that it unblocks.

For example Pool:

In the terminology of the Go memory model, a call to Put(x) “synchronizes before” a call to Get returning that same value x. Similarly, a call to New returning x “synchronizes before” a call to Get returning that same value x.

For example sync.Map:

In the terminology of the Go memory model, Map arranges that a write operation “synchronizes before” any read operation that observes the effect of the write.

For example RWMutex:

In the terminology of the Go memory model, the n’th call to Unlock “synchronizes before” the m’th call to Lock for any n < m, just as for Mutex. For any call to RLock, there exists an n such that the n’th call to Unlock “synchronizes before” that call to RLock, and the corresponding call to RUnlock “synchronizes before” the n+1’th call to Lock.

For example Mutex:

In the terminology of the Go memory model, the n’th call to Unlock “synchronizes before” the m’th call to Lock for any n < m. A successful call to TryLock is equivalent to a call to Lock. A failed call to TryLock does not establish any “synchronizes before” relation at all.

For example Once:

In the terminology of the Go memory model, the return from f “synchronizes before” the return of any call of once.Do(f).

For example runtime.SetFinalizer:

In the terminology of the Go memory model, a call SetFinalizer(x, f) “synchronizes before” the finalization call f(x). However, there is no guarantee that KeepAlive(x) or any other use of x “synchronizes before” f(x).

Why should we understand these guarantees, or the Go memory model in general? So that we can write data-race-free programs without concurrency bugs.

For example, the following concurrent code:


var a string
var done bool

func setup() {
    a = "hello, world"
    done = true
}

func main() {
    go setup()
    for !done { // spin, waiting for done to become true
    }
    print(a)
}

Here, the reads and writes of done are not synchronized with atomic operations, so there is no guarantee that the main goroutine observes the assignment to a before the assignment to done. The program may therefore print an empty string. In the extreme case, the main goroutine may never observe the assignment to done at all, and the program may never terminate.
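
A hedged fix of my own for this program: Go 1.19's new atomic.Bool type turns the flag accesses into synchronizing operations, so the write to a happens before the print, and the spin loop is guaranteed to terminate:

package main

import "sync/atomic"

var a string
var done atomic.Bool

func setup() {
    a = "hello, world"
    done.Store(true) // atomic write: publishes the write to a
}

func main() {
    go setup()
    for !done.Load() { // atomic read: guaranteed to eventually observe the store
    }
    print(a) // always prints "hello, world"
}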

TSO for x86

Different CPU architectures handle memory ordering differently, and the Go memory model must guarantee consistent behavior across all supported architectures. Some concurrent code is fine on x86 but problematic on ARM.

Let's look at the memory model of modern x86 processors. The architecture can be sketched as follows:

Each processor has its own write queue and is connected to shared main memory. When a processor reads, it first consults its own write queue and then main memory; processors do not consult each other's write queues.

A processor therefore sees its own writes before other processors do, but once writes reach main memory they are serialized: all processors observe the writes to main memory in the same order. That is why the x86 memory model is called the total store order model (TSO).

With the x86 memory model, as soon as a value is written to main memory, future reads will see it (until a new write overwrites it).

ARM/POWER uses a more relaxed memory model: conceptually, each processor reads and writes its own copy of memory, and each write propagates to the other processors independently. Reordering is allowed during propagation, and there is no total store order as on x86.

Still, even on ARM/POWER, writes by different threads to the same memory address have a total order: the order of writes to a single address observed by one thread is the same as that observed by any other thread, even though the observed orders of writes to different addresses may differ. This property is called coherence.

Analyzing the problems above

Everything above is background on memory models. This material is also introduced in Russ Cox's article on hardware memory models, and many other articles are dedicated to memory ordering and related topics. To revise the Go memory model, Russ Cox did in-depth research on the subject and wrote a three-part series of articles introducing it.

That was all theory; what does the Go memory model have to do with real concurrent programs? Let's analyze the questions raised at the beginning one by one.

Write order seen by another goroutine

The first question: is it possible for the following code to output 1, 0 at ❶?

For the x86 architecture, the answer is: no.
x86 guarantees total store order (TSO). In the writing goroutine, the write to x is sequenced before the write to y, the write queue preserves that order, and because stores drain to main memory in order under TSO, every other goroutine also observes the write to x before the write to y. (x86 also does not reorder loads, so the reading goroutine really does read y before x.)

For the ARM architecture, the answer is: yes. In the writing goroutine the writes to x and y are issued in order, but since ARM is not TSO, the writes may be reordered while propagating to other processors, so some processors may observe the write to x first while others observe the write to y first.


package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    wg.Add(2)

    var x, y int

    go func() {
        x = 1
        y = 1
        wg.Done()
    }()

    go func() {
        r1 := y
        r2 := x
        fmt.Println(r1, r2) // ❶ r1 = 1, r2 = 0 possible?
        wg.Done()
    }()

    wg.Wait()
}

Two goroutines with crossed reads and writes

The second question: is it possible for the following code to output r1=0, r2=0?

Intuitively, either x=1 or y=1 must execute first, so the output should always contain at least one 1, and r1=0, r2=0 seems impossible.

However, on x86, the two goroutines (threads) put their writes into their own write buffers and then read the other variable. At that point neither write may have drained to main memory yet, so both reads can still return zero. So on x86 the answer is: yes, it can happen.

For ARM/POWER the situation is similar: each goroutine's write may not yet have propagated to the other processor, so the answer is also: yes, it can happen.


package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    wg.Add(2)

    var x, y int

    go func() {
        x = 1
        r1 := y
        fmt.Println(r1) // ❶
        wg.Done()
    }()

    go func() {
        y = 1
        r2 := x
        fmt.Println(r2) // ❷
        wg.Done()
    }()

    wg.Wait()
}
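
A sketch of my own (not in the original post): because Go's atomic operations are sequentially consistent, rewriting the example above with sync/atomic rules out the r1=0, r2=0 outcome on every architecture. In the single total order of atomic operations, one of the two stores comes first, so at least one of the loads must observe a 1:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var wg sync.WaitGroup
    wg.Add(2)

    var x, y int32

    go func() {
        atomic.StoreInt32(&x, 1)
        r1 := atomic.LoadInt32(&y)
        fmt.Println(r1)
        wg.Done()
    }()

    go func() {
        atomic.StoreInt32(&y, 1)
        r2 := atomic.LoadInt32(&x)
        fmt.Println(r2)
        wg.Done()
    }()

    wg.Wait()
}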

The write order seen by the two goroutines

In the following code, can ❶ and ❷ both output 1,0? In other words, can the two reading goroutines observe the writes to x and y in opposite orders?

For the x86 architecture, the answer is: no, it cannot happen. Suppose the third goroutine observes 1,0, i.e. it sees the write to x but not yet the write to y. Because x86 is TSO, stores reach memory in a single total order, so the write to x precedes the write to y in that order. The fourth goroutine reads y first: if it sees y = 1 it must also see x = 1, so it can print 0,0, 0,1, or 1,1, but never 1,0.

For the ARM architecture, the answer is: yes, it is possible. The orders of writes to different memory addresses observed by different threads may differ.


package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    wg.Add(4)

    var x, y int

    go func() {
        x = 1
        wg.Done()
    }()

    go func() {
        y = 1
        wg.Done()
    }()

    go func() {
        r1 := x
        r2 := y
        fmt.Println(r1, r2) // ❶ 1,0 ?
        wg.Done()
    }()

    go func() {
        r3 := y
        r4 := x
        fmt.Println(r3, r4) // ❷ 1,0 ? (note: this goroutine reads y first, then x)
        wg.Done()
    }()

    wg.Wait()
}
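
Again a sketch of my own: if all accesses go through sync/atomic, the sequentially consistent total order of atomic operations forces every goroutine to agree on the order of the two writes, so the disagreement outcome is ruled out even on ARM:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var wg sync.WaitGroup
    wg.Add(4)

    go func() { atomic.StoreInt32(&x, 1); wg.Done() }()
    go func() { atomic.StoreInt32(&y, 1); wg.Done() }()

    go func() {
        r1 := atomic.LoadInt32(&x)
        r2 := atomic.LoadInt32(&y)
        fmt.Println(r1, r2) // ❶ cannot print 1,0 while ❷ also prints 1,0
        wg.Done()
    }()

    go func() {
        r3 := atomic.LoadInt32(&y)
        r4 := atomic.LoadInt32(&x)
        fmt.Println(r3, r4) // ❷
        wg.Done()
    }()

    wg.Wait()
}

var x, y int32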

sync.Once

The examples above were artificial tests. To illustrate the complexity of memory synchronization further, let's look at how the standard library itself handles it.

The following is the implementation of sync.Once in the Go standard library. It is only a few lines of code, but does anything about it puzzle you?

These few lines often puzzle readers who dig into the Go source: why is atomic not used at ❹? And would it be okay not to use atomic at ❶ and ❺?

❹ performs a double check, and ❺ writes to o.done. Don't they form a data race?

Suppose goroutine g1 executes ❺ while a concurrent goroutine g2 is just reaching ❷.

In g1, the deferred store at ❺ runs before the deferred unlock at ❸ (deferred calls run in LIFO order), so ❺ is sequenced before ❸. By Mutex's guarantee, g1's unlock at ❸ is synchronized before g2's lock at ❷, and ❷ is sequenced before ❹. So when g2 acquires the lock and reaches ❹, it is guaranteed to see that g1 has already set o.done to 1; there is no data race between ❹ and ❺.

So is it okay not to use atomic at ❶ and ❺? No. Strictly speaking it would be a data race, and in practice it would also hurt performance.

To establish synchronization between ❶ and ❺, a synchronization primitive must be used, and the simplest one here is atomic: the write at ❺ is then guaranteed to be seen by the read at ❶.

If ❶ and ❺ did not use atomic, then in the extreme case ❶ might keep reading o.done == 0 and always fall into doSlow for the double check, wasting performance.

On the x86 architecture the atomic load at ❶ could in principle be omitted, but ❺ must still use an atomic store so that done = 1 is written back to main memory.


func (o *Once) Do(f func()) {
    if atomic.LoadUint32(&o.done) == 0 { // ❶
        o.doSlow(f)
    }
}

func (o *Once) doSlow(f func()) {
    o.m.Lock()         // ❷
    defer o.m.Unlock() // ❸
    if o.done == 0 { // ❹
        defer atomic.StoreUint32(&o.done, 1) // ❺
        f()
    }
}
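
To see why the atomic fast path matters, here is a hedged sketch of my own of a naive variant without it (naiveOnce is a made-up name; this is not the standard library's code). The unsynchronized read races with the write, the race detector will flag it, and a goroutine that races past the check might not even see f's side effects:

package main

import "sync"

type naiveOnce struct {
    m    sync.Mutex
    done uint32
}

func (o *naiveOnce) Do(f func()) {
    if o.done == 0 { // racy read: a data race with the write below
        o.m.Lock()
        defer o.m.Unlock()
        if o.done == 0 { // safe: protected by the mutex
            defer func() { o.done = 1 }() // racy write
            f()
        }
    }
}

func main() {
    var once naiveOnce
    once.Do(func() { println("init") })
}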

sync.WaitGroup

The following is sample code using WaitGroup. Do you need atomic.LoadInt64 when reading count at ❹? If you don't use it, is the output guaranteed to be 10?

The answer is: no, atomic is not needed, and the program must output 10.

❶ is sequenced before ❷, ❷ is synchronized before ❸ (Done synchronizes before the return of the Wait call it unblocks), and ❸ is sequenced before ❹, so ❹ is guaranteed to see every goroutine's write to count. No atomic is needed at ❹.


func main() {
    var count int64
    var wg sync.WaitGroup
    wg.Add(10)

    for i := 0; i < 10; i++ {
        go func() {
            atomic.AddInt64(&count, 1) // ❶
            wg.Done()                  // ❷
        }()
    }

    wg.Wait()          // ❸
    fmt.Println(count) // ❹
}

TryLock

The following code is the implementation of Mutex's TryLock method. Why is atomic not used at ❶? Can it happen that the lock has been released, yet TryLock still fails to acquire it?

The answer is: yes. On some CPU architectures it is possible that the lock has already been released, yet a subsequent TryLock still fails to acquire it.

As far as the memory model is concerned, l.TryLock (or l.TryRLock) may be considered to be able to return false even when the mutex l is unlocked.

To us ordinary developers this looks like a bug or a data race, but the Go team calls it a deliberate design choice for performance.

The memory model is defining when one event is synchronized before another. That sentence is saying that in a sequence Lock -> Unlock -> TryLock -> Lock there is no promise that the TryLock is synchronized before the Lock. The TryLock can fail to lock the mutex even though the mutex is unlocked.

From Ian Lance Taylor

On x86 this is not a problem: when a goroutine releases the lock, the release is observed promptly at ❶. On the ARM architecture, however, ❶ may not see the release in time after the lock is released, so TryLock may return false even though the lock is free. This is an extremely rare edge case.


func (m *Mutex) TryLock() bool {
    old := m.state // ❶
    if old&(mutexLocked|mutexStarving) != 0 {
        return false
    }
    if !atomic.CompareAndSwapInt32(&m.state, old, old|mutexLocked) {
        return false
    }
    return true
}
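
Given that guarantee (or rather the lack of one), here is a hedged usage sketch of my own: treat a false result from TryLock only as "I did not get the lock right now", never as proof that the lock is held, and fall back to a blocking Lock when the work must be done:

package main

import "sync"

var mu sync.Mutex

func doWork(critical func()) {
    if mu.TryLock() {
        // Fast path: we acquired the lock immediately.
        critical()
        mu.Unlock()
        return
    }
    // TryLock returning false does NOT prove the lock is currently held;
    // under weak memory orders it may fail spuriously. Block instead.
    mu.Lock()
    critical()
    mu.Unlock()
}

func main() {
    doWork(func() { println("work") })
}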

References

  1. https://research.swtch.com/hwmm
  2. https://github.com/golang/go/issues/5045
  3. https://go.dev/ref/mem
  4. https://github.com/golang/go/discussions/47141
  5. https://hackmd.io/@vesuppi/Syvoiw1f8
  6. https://www.chanmufeng.com/posts/concurrency/%E7%BC%93%E5%AD%98%E4%B8%80%E8%87%B4%E6%80%A7%E4%B8%8E%E5%86%85%E5%AD%98%E5%B1%8F%E9%9A%9C.html
