Replace the map of the standard library to implement SwissTable faster

Original link: https://colobu.com/2023/06/29/replace-std-map-faster/

In the latest weekly meeting of the Go development team, they discussed Google3 re-evaluating the performance of SwissTable, and it may be worth following up to replace the standard library map.

  • SwissTable
    • [MichaelP]: Google3 may want to do some benchmarking on this. Maps are used heavily in google3. This may be of value to us.
    • So this may be worth investing in.

Some students may not know what SwissTable (or Swiss Tables) is? What is Google3?

Google3 may be familiar to students inside Google, but this concept is rarely introduced outside. Google3 refers to Google’s monorepo. 95% of Google’s code is under this large warehouse, except for some special projects such as Chrome and Android.

SwissTable is an implementation of a hash table (hash table, map) open sourced by Google. It was announced at the cppcon 2017 conference. Google internally uses it to replace std:unordered_map . Its exquisite design makes it very efficient, and it is also ported In other languages, for example, there is a library hashbrown in the Rust language that is implemented based on SwissTable, and began to replace the HashMap of the standard library in Rust 1.36.

The advantages of SwissTable are introduced by official documents and some developers, such as

Therefore, the Go development team has also begun to pay attention to SwissTable, and may implement it in a future version and replace the map of the standard library. After all, the improvement of these basic data types and algorithms will help thousands of projects that use Go for development, such as sorting. Algorithms. Last year, the classmates of Byte submitted a pdqsort algorithm, which replaced the sorting algorithm in the standard library. Now someone has proposed dual-pivot quicksort . According to the test performance, it is better and more concise. Everyone has been using these for decades. People’s basic algorithms and data structures are still being optimized tirelessly.

Although the Go development team is only paying attention to and discussing SwissTable at present, it may not be implemented in the future, but some people in the community have recently implemented SwissTable: swiss .

I was witty in an article last year! Let’s introduce the author’s maphash for the hash function of the map. It seems that the author has been preparing for his SwissTable. This year he submitted the Go language implementation of SwissTable. The method of calculating the hash is to use his maphash .

The author specially wrote an article to introduce his swiss design. We can at least understand how the Go language designs SwissTable from this library. In some optimization scenarios, you can use this library to replace the built-in map privately, just like the author did.

The built-in map adopts the open addressing method ( open-hashing , also known as the zipper method), and the key-value pairs with hash value conflicts (the same hash value) will be placed in a bucket. When searching, first according to the hash Value finds the corresponding bucket, and then traverses the elements in the bucket to find the corresponding value. Of course, there are still some optimizations in it, and the extra bit is used for faster inspection. For details of the built-in map, please refer to the sharing of the 2016 GopherCon conference. : Inside the Map Implementation .

SwissTable uses a different hashing scheme called ” Closed Hashing Hashing” (also known as the open address method). Each hash value will have its own slot ( slot ), the choice of the slot is determined by the hash value, the easiest way is to start searching from the slot of hash(key) mod size , and keep searching until the corresponding keys or empty slots (keys that do not exist), which is also a common routine of the open address method. SwissTable also uses a short hash ( short hash ) like the built-in map to support quick checks, but its metadata is stored independently, separate from the hash value storage.

In this way, it is more friendly to the CPU cache, and more importantly, 16 short hashes can be compared in parallel through SSE instructions.

Because the find method is the basis of other methods Get() , Has() , Put() , and Delete() , it needs to find the corresponding slot, so let’s see how the swiss library implements the above logic (pseudocode ):


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

func (m Map) find(key) (slot int , ok bool ) {
h1, h2 := hashOf(key) // High and low hash bits, similar to the built-in map processing, to achieve short addresses
s := modulus(h1, len (m.keys) /16 ) // Calculate the position where probing starts
for {
// Probe implemented using SSE instruction, using h2 for short address matching
matches := matchH2(m.metadata[s:s +16 ], h2)
for _, idx := range matches {
if m.keys[idx] == key {
return idx, true // found, return the corresponding slot
}
}
// Use the SSE instruction to find an empty slot
matches = matchEmpty(m. metadata[s:s +16 ])
for _, idx := range matches {
return idx, false // Found an empty slot, return
}
s += 16
}
}

This library also provides the benchmark code, we tested the performance of it and the built-in map on the Apple M2 and the virtual machine respectively.

Apple M1


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
twenty one
twenty two
twenty three
twenty four
25
26

goos: darwin
goarch: arm64
pkg: github.com/dolthub/swiss
BenchmarkStringMaps/n= 16 /runtime_map- 8 190298112 6.716 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 16 /swiss.Map- 8 114514650 9.000 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 128 /runtime_map- 8 181336688 6.934 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 128 /swiss.Map- 8 124560243 13.63 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 1024 /runtime_map- 8 100000000 10.09 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 1024 /swiss.Map- 8 100000000 11.27 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 8192 /runtime_map- 8 60224208 19.38 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 8192 /swiss.Map- 8 88910022 13.28 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 131072 /runtime_map- 8 53933996 22.12 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 131072 /swiss.Map- 8 76083596 16.36 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 16 /runtime_map- 8 262228116 4.678 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 16 /swiss.Map- 8 227993193 5.439 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 128 /runtime_map- 8 242425221 4.708 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 128 /swiss.Map- 8 185926908 5.876 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 1024 /runtime_map- 8 173284822 6.709 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 1024 /swiss.Map- 8 186861410 6.550 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 8192 /runtime_map- 8 71231763 16.57 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 8192 /swiss.Map- 8 139595205 8.635 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 131072 /runtime_map- 8 64039132 19.05 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 131072 /swiss.Map- 8 100000000 11.56 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/dolthub/swiss 33.538 s

Linux


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
twenty one
twenty two
twenty three
twenty four
25

goos: linux
goarch: amd64
pkg: github.com/dolthub/swiss
cpu: Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz
BenchmarkStringMaps/n= 16 /runtime_map- 2 97587910 12.65 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 16 /swiss.Map- 2 61206505 19.38 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 128 /runtime_map- 2 92861481 13.35 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 128 /swiss.Map- 2 58353951 20.83 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 1024 /runtime_map- 2 51516268 22.70 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 1024 /swiss.Map- 2 51832698 23.09 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 8192 /runtime_map- 2 40324459 29.54 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 8192 /swiss.Map- 2 45826951 26.19 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 131072 /runtime_map- 2 26659296 43.71 ns/op 0 B/op 0 allocs/op
BenchmarkStringMaps/n= 131072 /swiss.Map- 2 30621675 39.84 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 16 /runtime_map- 2 128090280 9.332 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 16 /swiss.Map- 2 87047056 15.01 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 128 /runtime_map- 2 125906628 9.837 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 128 /swiss.Map- 2 79239448 15.52 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 1024 /runtime_map- 2 65208208 17.22 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 1024 /swiss.Map- 2 77075527 15.81 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 8192 /runtime_map- 2 48505800 25.28 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 8192 /swiss.Map- 2 64617066 18.19 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 131072 /runtime_map- 2 36938596 32.38 ns/op 0 B/op 0 allocs/op
BenchmarkInt64Maps/n= 131072 /swiss.Map- 2 41026358 28.41 ns/op 0 B/op 0 allocs/op

It can be seen that when the number of keys is relatively small, swiss cannot take advantage of its SSE instructions, but when there are many key values, swiss has obvious advantages.
The author’s maphash library was also introduced earlier. On some platforms that support AES instructions, the ability of the hardware can be used to calculate the hash value faster.

More than that, swiss also has its advantages in terms of memory usage. According to the author’s test, swiss can save more memory:
chunk-index-memory.jpg

When the number of keys in the built-in map increases, the memory will increase in steps. The reason is that everyone generally adopts the bit-hacking optimization mode. In order to improve the method of finding the remainder, the general divisor will use an exponent of 2, so the number of keys When the memory needs to be expanded when it grows, the allocated memory will be doubled. So you can see that the memory of Go’s built-in map will increase in steps, which is sometimes very wasteful.

The swiss library uses a fast remainder (more precisely, modulo reduction) algorithm proposed by Daniel Lemire . This idea seems simple, but it is actually very clever (x * N) / 232 :


1
2
3

func fastModN(x, n uint32 ) uint32 {
return uint32 (( uint64 (x) * uint64 (n)) >> 32 )
}

This method limits the type of n to uint32 , which supports a maximum of 2 ^ 32 , and because each bucket has 16 elements, the swiss library supports a maximum of 2 ^ 36 elements, which is sufficient for most scenarios.

This article is transferred from: https://colobu.com/2023/06/29/replace-std-map-faster/
This site is only for collection, and the copyright belongs to the original author.