Permalink to this article – https://ift.tt/mTtyrvR
Recently , bitmap index (bitmap index) was used when implementing a data structure. This article will briefly talk about bitmap (bitmap).
1. What is a bitmap
The bitmap index is implemented using a bit array (bit array, also called bitset, usually called a bitmap (bitmap), and the name bitmap is used below). A bitmap is a mapping from some field (usually an integer range) to values in the set {0, 1}:
映射:f(x) -> {0, 1}, x是[0, n)的集合中的元素。
Take the set {1, 2, 5} of n=8 as an example:
f(0) = 0 f(1) = 1 f(2) = 1 f(3) = 0 f(4) = 0 f(5) = 1 f(6) = 0 f(7) = 0
If we use bit to represent the value obtained after mapping, we will get a binary number 0b00100110 (the value on the rightmost bit indicates the existence of the value 0 in the set), so we can use a byte-sized value 0b00100110 To represent the existence of values at each position in the set {1, 2, 5}.
We see that compared to using a byte array to represent the set {1, 2, 5} (even if it is 8 values, at least 8×8=64 bytes), bitmap undoubtedly has higher space utilization . At the same time, through bitmap operations such as AND, OR, and XOR, we can easily and high-performance obtain the results of set operations such as intersection, union, and Top-K.
However, the traditional bitmap does not always save space. For example, if we want to represent a collection like {1, 2, 10, 50000000}, using the traditional bitmap will bring a lot of space overhead. For such a collection with sparse element characteristics, the traditional bitmap implementation loses its advantages, and compressed bitmap (compressed bitmap) becomes a better choice.
2. Compressed bitmap
Compressed bitmaps can support sparse collections well, while retaining the advantages of traditional bitmap space and high-performance collection operations. The most common bitmap compression scheme is RLE (run-length encoding). A rough understanding of this scheme is to count consecutive 0s and 1s separately. For example, the following bitmap can be compressed and encoded into n 0s and m and 1 :
0b0000....00001111...111
The RLE scheme (and its variants) has a good compression ratio and the codec is also very efficient. However, its disadvantage is that it is difficult to access a certain bit randomly, and each time a specific bit is accessed, it must be decompressed from the beginning. If you want to intersect two big bitmaps, you have to decompress the whole big bitmap.
A compressed bitmap scheme called roaring bitmap can solve the above problems.
3. Introduction to the working principle of roaring bitmap
The roaring bitmap works like this: it divides the integer number [0, 4294967296) that can be represented by a 32-bit integer into 2^16 chunks (for example, [0, 2^16), [2^16, 2×2^16),…). When adding a number to the roaring bitmap or obtaining the existence of a number from the roaring bitmap, the roaring bitmap determines which trunk the number is in through the first 16 bits of the number. Once the trunk is determined, the container that actually stores the last 16-bit value of the number can be found through the container pointer associated with the trunk, and located in the container through a search algorithm:
As shown in the figure above: there are more than one type of container associated with the trunk of the roaring bitmap:
- array container: This is an ordered 16-bit integer array, which is also the default container type, and can store up to 4096 values. When this amount is exceeded, the bitset container will be considered for storage;
- bitset container: It is an uncompressed bitmap with 2^16 bits;
- run container: This is a container type that uses RLE compression and is suitable for storing continuous values. It can also be seen from the above figure that this container stores a number pair <s,l>, and the range of values represented is [ s, s + l].
The roaring bitmap will select the appropriate container type according to the characteristics of the number in the trunk, and this selection is dynamic, with the goal of minimizing memory usage. When we add or delete values to the roaring bitmap, the container type corresponding to the trunk may change. However, from an overall perspective, no matter which container is used, the roaring bitmap supports fast random access to a certain bit. At the same time, roaring bitmap is also easier to use the high-performance instructions provided by modern CPUs at the implementation level, and it is cache-friendly.
Four. The effect of roaring bitmap
The roaring bitmap officially provides implementations in multiple mainstream languages, among which the implementation of the Go language is the roaring package . The use of the roaring package is very simple, here is a simple example:
package main import ( "fmt" "github.com/RoaringBitmap/roaring" ) func main() { rb := roaring.NewBitmap() rb.Add(1) rb.Add(100000000) fmt.Println(rb.String()) fmt.Println(rb.Contains(1)) fmt.Println(rb.Contains(2)) fmt.Println(rb.Contains(100000000)) fmt.Println("cardinality:", rb.GetCardinality()) fmt.Println("rb size=", rb.GetSizeInBytes()) }
Running the example yields the following results:
{1,100000000} true false true cardinality: 2 rb size= 16
We see that the sparse set of {1, 100000000} mapped to roaring bitmap takes only 16 bytes of space (compared to uncompressed bitmap).
The following is an example of mapping a set of random integers within 3000w to a roaring bitmap:
func main() { rb := roaring.NewBitmap() for i := 0; i < 30000000; i++ { rb.Add(uint32(rand.Int31n(30000000))) } fmt.Println("cardinality:", rb.GetCardinality()) fmt.Println("rb size=", rb.GetSizeInBytes()) }
The following is the result of its execution:
cardinality: 18961805 rb size= 3752860
We see that a total of nearly 19 million numbers have been added to the collection, and the roaring bitmap occupies a total of 3.6MB of memory space, which does not widen the gap with the non-compressed bitmap.
The following is an example of mapping a collection of consecutive 3000w numbers to a roaring bitmap:
func main() { rb := roaring.NewBitmap() for i := 0; i < 30000000; i++ { rb.Add(uint32(i)) } fmt.Println("cardinality:", rb.GetCardinality()) fmt.Println("rb size=", rb.GetSizeInBytes()) }
Its execution results are as follows:
cardinality: 30000000 rb size= 21912
Obviously, for such a set of continuous numbers, the space efficiency of roaring bitmap is very obvious.
Five. Serialization of roaring bitmap
The above is a rough introduction to roaring bitmap. If you are interested in roaring bitmap, you can go to its official website or the homepage of open source projects for in-depth understanding and learning. But what I want to talk about here is the serialization problem of roaring bitmap (it can be transmitted and stored persistently after serialization), taking serialization to JSON and deserialization from JSON as an example.
Considering performance issues, I chose byte open source sonic project for json serialization. Although sonic is a Go open source project, due to its extreme optimization requirements for JSON parsing, currently the proportion of Go code in this project is less than 30%, and more than 60% is assembly code . Sonic provides a functional interface compatible with the Go standard library json package, and sonic also supports streaming I/O mode, which supports serializing a specific type of object to io.Writer or deserializing data from io.Reader into a specific type of object, This is also not supported by the standard library json package. When encountering a large JSON, the streaming I/O mode is very idiomatic. io.Writer and Reader can prevent your Go application from allocating a large amount of memory in an instant, or even being killed by oom.
However, roaring bitmap does not natively provide functions/methods for serialization (marshal) to JSON (or deserialization), so how do we serialize a roaring bitmap into a JSON text? The json package in the Go standard library provides the Marshaler and Unmarshaler interfaces. For any custom type that implements these two interfaces, the json package can support the serialization and deserialization of the custom type. In this regard, the sonic project remains compatible with the Go standard library json package .
However, the roaring.Bitmap type does not implement the Marshaler and Unmarshaler interfaces. The serialization and deserialization of roaring.Bitmap need to be done by ourselves.
Then, the first thing we think of is to customize a new type based on roaring.Bitmap, such as MyRB:
// https://github.com/bigwhite/experiments/blob/master/roaring-bitmap-examples/bitmap_json.go type MyRB struct { RB *roaring.Bitmap }
Then, we give the implementation of MyRB’s MarshalJSON and UnmarshalJSON methods to meet the requirements of the Marshaler and Unmarshaler interfaces:
// https://github.com/bigwhite/experiments/blob/master/roaring-bitmap-examples/bitmap_json.go func (rb *MyRB) MarshalJSON() ([]byte, error) { s, err := rb.RB.ToBase64() if err != nil { return nil, err } r := fmt.Sprintf(`{"rb":"%s"}`, s) return []byte(r), nil } func (rb *MyRB) UnmarshalJSON(data []byte) error { // data => {"rb":"OjAAAAEAAAAAAB4AEAAAAAAAAQACAAMABAAFAAYABwAIAAkACgALAAwADQAOAA8AEAARABIAEwAUABUAFgAXABgAGQAaABsAHAAdAB4A"} _, err := rb.RB.FromBase64(string(data[7 : len(data)-2])) if err != nil { return err } return nil }
We use the ToBase64 method provided by roaring.Bitmap to convert the roaring bitmap into a base64 string, and then serialize it into JSON; deserialization uses FromBase64 to decode the JSON data. Let’s test the conversion between MyRB type and JSON:
// https://github.com/bigwhite/experiments/blob/master/roaring-bitmap-examples/bitmap_json.go func main() { var myrb = MyRB{ RB: roaring.NewBitmap(), } for i := 0; i < 31; i++ { myrb.RB.Add(uint32(i)) } fmt.Printf("the cardinality of origin bitmap = %d\n", myrb.RB.GetCardinality()) buf, err := sonic.Marshal(&myrb) if err != nil { panic(err) } fmt.Printf("bitmap2json: %s\n", string(buf)) var myrb1 = MyRB{ RB: roaring.NewBitmap(), } err = sonic.Unmarshal(buf, &myrb1) if err != nil { panic(err) } fmt.Printf("after json2bitmap, the cardinality of new bitmap = %d\n", myrb1.RB.GetCardinality()) }
Run the example:
the cardinality of origin bitmap = 31 bitmap2json: {"rb":"OjAAAAEAAAAAAB4AEAAAAAAAAQACAAMABAAFAAYABwAIAAkACgALAAwADQAOAA8AEAARABIAEwAUABUAFgAXABgAGQAaABsAHAAdAB4A"} after json2bitmap, the cardinality of new bitmap = 31
The output is as expected.
Based on MyRB that supports serialization, by the way, let’s take a look at the benchmark comparison between sonic and standard library json, and we write a simple comparison test case:
// https://github.com/bigwhite/experiments/blob/master/roaring-bitmap-examples/benchmark_test.go type Foo struct { N int `json:"num"` Name string `json:"name"` Addr string `json:"addr"` Age string `json:"age"` RB MyRB `json:"myrb"` } func BenchmarkSonicJsonEncode(b *testing.B) { var f = Foo{ N: 5, RB: MyRB{ RB: roaring.NewBitmap(), }, } for i := 0; i < 3000; i++ { f.RB.RB.Add(uint32(i)) } b.ReportAllocs() b.ResetTimer() for i := 0; i < bN; i++ { _, err := sonic.Marshal(&f) if err != nil { panic(err) } } } func BenchmarkSonicJsonDecode(b *testing.B) { var f = Foo{ N: 5, RB: MyRB{ RB: roaring.NewBitmap(), }, } for i := 0; i < 3000; i++ { f.RB.RB.Add(uint32(i)) } buf, err := sonic.Marshal(&f) if err != nil { panic(err) } var f1 = Foo{ RB: MyRB{ RB: roaring.NewBitmap(), }, } b.ReportAllocs() b.ResetTimer() for i := 0; i < bN; i++ { err = sonic.Unmarshal(buf, &f1) if err != nil { panic(err) } } } func BenchmarkStdJsonEncode(b *testing.B) { var f = Foo{ N: 5, RB: MyRB{ RB: roaring.NewBitmap(), }, } for i := 0; i < 3000; i++ { f.RB.RB.Add(uint32(i)) } b.ReportAllocs() b.ResetTimer() for i := 0; i < bN; i++ { _, err := json.Marshal(&f) if err != nil { panic(err) } } } func BenchmarkStdJsonDecode(b *testing.B) { var f = Foo{ N: 5, RB: MyRB{ RB: roaring.NewBitmap(), }, } for i := 0; i < 3000; i++ { f.RB.RB.Add(uint32(i)) } buf, err := json.Marshal(&f) if err != nil { panic(err) } var f1 = Foo{ RB: MyRB{ RB: roaring.NewBitmap(), }, } b.ReportAllocs() b.ResetTimer() for i := 0; i < bN; i++ { err = json.Unmarshal(buf, &f1) if err != nil { panic(err) } } }
Execute this benchmark:
$go test -bench . goos: darwin goarch: amd64 pkg: demo ... ... BenchmarkSonicJsonEncode-8 71176 16331 ns/op 49218 B/op 13 allocs/op BenchmarkSonicJsonDecode-8 85080 13710 ns/op 37236 B/op 11 allocs/op BenchmarkStdJsonEncode-8 24490 49345 ns/op 47409 B/op 10 allocs/op BenchmarkStdJsonDecode-8 20083 59593 ns/op 29000 B/op 15 allocs/op PASS ok demo 6.166s
From our benchmark results, we can see that sonic is 3-4 times faster than the standard library json package.
The code in this article can be downloaded here .
6. References
- Roaring Bitmap : June 2015 report – https://ift.tt/LhgROxl
- Roaring Bitmap official website – https://ift.tt/qWlHthr
- Roaring Bitmap Spec – https://ift.tt/CDi6IWS
- Roaring Bitmap Go implementation – https://ift.tt/kvCQIo1
- ByteDance’s sonic project – https://ift.tt/NeqylKf
- paper: Consistently faster and smaller compressed bitmaps with Roaring – https://ift.tt/6bD8XKC
- Accurate deduplication and user behavior analysis based on Bitmap – https://ift.tt/xdJ27co
- paper: Roaring Bitmaps: Implementation of an Optimized Software Library – https://ift.tt/GuNwK4M
“Gopher Tribe” knowledge planet aims to create a boutique Go learning and advanced community! High-quality first release of Go technical articles, “three days” first reading right, analysis of the development status of Go language twice a year, fresh Gopher daily 1 hour in advance every day, online courses, technical columns, book content preview, must answer within six hours Guaranteed to meet all your needs about the Go language ecology! In 2023, the Gopher tribe will further focus on how to write elegant, authentic, readable, and testable Go code, pay attention to code quality and deeply understand Go core technology, and continue to strengthen interaction with star friends. Everyone is welcome to join!
The well-known cloud hosting service provider DigitalOcean released the latest hosting plan. The entry-level Droplet configuration is upgraded to: 1 core CPU, 1G memory, 25G high-speed SSD, and the price is 5$/month. Friends who need to use DigitalOcean, you can open this link address : https://ift.tt/H1qTWdN to start your DO host road.
Gopher Daily (Gopher Daily News) archive repository – https://ift.tt/bDZoj9G
my contact information:
- Weibo (temporarily unavailable): https://ift.tt/ZmkRa2y
- Weibo 2: https://ift.tt/tQd6jXs
- Blog: tonybai.com
- github: https://ift.tt/eRxuI9X
Business cooperation methods: writing, publishing, training, online courses, partnerships, consulting, advertising cooperation.
© 2023, bigwhite . All rights reserved.
This article is transferred from https://tonybai.com/2023/02/01/serialize-roaring-bitmap-to-json/
This site is only for collection, and the copyright belongs to the original author.