Go string comparison, someone finally made it clear

Permalink to this article – https://ift.tt/mvTcNw8

Thea is a programmer who has just started learning Go, and today she ran into a problem that genuinely surprised her. Here is the Go code that confused her:

    package main

    import "fmt"

    func main() {
        s1 := "12345"
        s2 := "2"
        fmt.Println(`"12345" > "2":`, s1 > s2) // false

        s3 := "零"
        s4 := "一"
        s5 := "二"
        fmt.Println(`"一" > "零":`, s4 > s3) // false
        fmt.Println(`"二" > "零":`, s5 > s3) // false
        fmt.Println(`"二" > "一":`, s5 > s4) // true
    }

In this code on Go string comparison:

  • Why does the expression “12345” > “2” evaluate to false?
  • Why do the expressions “一” > “零” and “二” > “零” both evaluate to false?
  • And why does “二” > “一” evaluate to true?

The four results left Thea puzzled, so she searched the Internet for Go material that could explain them.

She came across a “little yellow book” called “The Road to Go Language Improvement”, which is said to explain in detail how Go strings and string comparison work.

Thea then happened to notice a thick yellow book on the desk of her colleague Tony, sitting next to her. It was exactly the book she wanted to read! So Thea asked Tony whether she could borrow it. Tony has never been able to say no to a charming request, so Thea walked away with both volumes of “The Road to Go Language Improvement”. During her lunch break, Thea spent an hour and a half studying the three sections of the book that deal with Go strings: Section 15, “Understanding String Implementation Principles and Efficient Use”; Section 52, “Mastering the Principles of Character Sets and Conversion Between Character Encoding Schemes”; and Section 56, “Mastering the Basic Operations of the bytes and strings Packages”. When she finished, she exclaimed, “Wonderful!” The explanations in the book completely answered her questions.

At this point Thea remembered the author’s advice on how to learn Go from “Welcome to the Golden Decade of Go with You”, the closing remarks of the column “Go Language First Lesson”: learn by producing output! Writing things down is what turns newly learned material into your own knowledge, so Thea wrote up her understanding of what she had read. Tony, next to her, had just woken up from his nap, and Thea decided to play teacher. Tony was dragged over to act as the student :).

The following is Thea’s explanation.

1. String type in Go language

The string type is one of the most commonly used data types in modern programming languages. In C, one of Go’s ancestors, there is no explicitly defined string type; strings appear either as string literal constants or as ‘\0’-terminated arrays of the character type (char).

Go fixes this “defect” of C by building the string type into the language, which unifies the abstraction of a “string”. In Go, string constants, string variables, and string literals appearing in code all have the same type: string.

Go’s string type design draws on the lessons of C string design and combines best practices from string types in other mainstream languages. The string type that Gophers end up with has the following characteristics:

  • String data is immutable

That is, once a string value has been created, whether it is held by a constant or a variable, its bytes cannot be modified for the rest of the program’s lifetime. A variable can be reassigned to a different string, but the original data is never altered in place.

  • The zero value is usable

Go’s string type follows the “zero value is usable” idea. Go strings have no trailing ‘\0’ character to account for as in C, so the zero value is “” and its length is 0.

  • Getting the length takes O(1) time

  • Various comparison operators are supported: ==, != , >=, <=, > and <

Because Go strings are immutable, an equality check can take shortcuts. If two strings have different lengths, they are known to be different without looking at the string data at all. If the lengths are the same, the next check is whether the two data pointers refer to the same underlying storage: if they do, the strings are equal; if they do not, the actual data has to be compared. How that data is compared is covered below.

  • Native support for non-ASCII characters

This feature raises two questions: which characters can a Go string hold, and which character encoding does it use? The next section covers both.
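The characteristics listed above can be observed in a short program. This is only a minimal sketch; the helper name describe is made up for illustration and does not come from the book or the standard library.

```go
package main

import "fmt"

// describe is a hypothetical helper: it reports the length of s in bytes
// and whether s equals the zero value of the string type.
func describe(s string) (int, bool) {
	return len(s), s == ""
}

func main() {
	// Zero value: a declared but unassigned string is "" with length 0.
	var zero string
	fmt.Println(describe(zero)) // 0 true

	// Immutability: the statement s[0] = 'H' would not compile.
	// To "change" a string, build a new one, here via a byte slice copy.
	s := "hello"
	b := []byte(s) // copies the bytes of s
	b[0] = 'H'
	t := string(b)
	fmt.Println(s, t) // hello Hello

	// len reads the stored length; it does not scan the data.
	fmt.Println(len(s)) // 5

	// All six comparison operators are defined for strings.
	fmt.Println(s == t, s != t, s < t, s <= t, s > t, s >= t)
	// false true false false true true
}
```

The last line already hints at byte-wise ordering: “hello” is greater than “Hello” because the byte for ‘h’ (0x68) is greater than the byte for ‘H’ (0x48).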

2. Character set encoding used by Go strings

Go uses the Unicode character set by default and adopts the UTF-8 encoding scheme. Go also provides the native rune type to represent Unicode code points. Unicode, whose first version was published in 1991, is a unified character set that aims to cover all human writing systems. It lines up the vast majority of characters in use around the world and assigns each of them a number. For example, the following is a fragment of the Unicode character set table:

Serial number (code point)    Character
U+0000                        …
…                             …
U+0031                        1
U+0032                        2
…                             …
U+4E2D                        中
…                             …
U+4EBA                        人
…                             …
U+56FD                        国
…                             …
U+10FFFF                      …

We see that each Unicode character (such as “1”, “中”, etc. in the table) has its own unique serial number. This serial number is called the code point of the character. The rune type in Go can be used to represent the code point.
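As a small illustration of runes as code points, the sketch below prints the code point of every character in a string. The function name codePoints is an assumption made for this example only.

```go
package main

import "fmt"

// codePoints (a hypothetical helper) returns the Unicode code points
// of s in "U+XXXX" notation.
func codePoints(s string) []string {
	var out []string
	for _, r := range s { // ranging over a string yields runes, not bytes
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	fmt.Println(codePoints("12中人国"))
	// [U+0031 U+0032 U+4E2D U+4EBA U+56FD]
}
```

The printed values match the rows of the table fragment above.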

Now comes the question: given the Unicode character set table, how does Go store these characters in memory? Several storage schemes are in use in the industry, including UTF-32 (4 bytes for every code point), UTF-16 (2 or 4 bytes for every code point), and UTF-8.

UTF-8 encodes Unicode code points using a variable number of bytes. How many bytes are used depends on the character’s position in the code point table: characters with small code points take fewer bytes, and characters with large code points take more.

A UTF-8 encoding is between 1 and 4 bytes long. The first 128 code points (U+0000 to U+007F), which coincide with ASCII, take 1 byte. Code points U+0080 to U+07FF, which include Latin letters with diacritics as well as Greek, Cyrillic, Arabic, and similar scripts, take 2 bytes. Code points U+0800 to U+FFFF, which include most East Asian characters (Chinese characters among them), take 3 bytes. The remaining code points, U+10000 to U+10FFFF, which hold rarely used characters, take 4 bytes.

This encoding scheme is compatible with the in-memory representation of ASCII: when UTF-8 is used to represent Unicode characters, existing ASCII text can be stored and transmitted as Unicode text without any change. For ASCII-heavy text, UTF-8 is also more space-efficient than UTF-16 and UTF-32. Moreover, because UTF-8 is defined as a sequence of single bytes, encoding and decoding never have to deal with endianness.

Therefore, Go uses the UTF-8 encoding scheme to store Unicode characters in memory.

Taking the character “中” as an example, its code point (serial number) is U+4E2D and its UTF-8 encoding is “0xE4 0xB8 0xAD”. In other words, Go uses three bytes in memory to represent the Unicode character “中”.
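The byte counts described above can be checked with utf8.RuneLen from the standard library. The helper byteCounts below is a made-up name for this sketch, and the sample string is written with escapes so that each character is unambiguous.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteCounts (a hypothetical helper) returns, for each character in s,
// the number of bytes its UTF-8 encoding occupies.
func byteCounts(s string) []int {
	var out []int
	for _, r := range s {
		out = append(out, utf8.RuneLen(r))
	}
	return out
}

func main() {
	// 'A' (U+0041), 'é' (U+00E9), '中' (U+4E2D), and U+1F600 (an emoji).
	fmt.Println(byteCounts("A\u00e9中\U0001F600")) // [1 2 3 4]
}
```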
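To see the stored bytes directly, a string can be printed with the % X verb, which shows each byte in hexadecimal. This is a minimal sketch; hexBytes is a name chosen for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// hexBytes (a hypothetical helper) renders the raw bytes of s as
// space-separated uppercase hexadecimal.
func hexBytes(s string) string {
	return fmt.Sprintf("% X", s)
}

func main() {
	s := "中"
	fmt.Println(hexBytes(s))               // E4 B8 AD
	fmt.Println(len(s))                    // 3 (bytes)
	fmt.Println(utf8.RuneCountInString(s)) // 1 (character)
}
```

Note that len counts bytes, not characters, which is why it reports 3 for a single Chinese character.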

3. Go string comparison

All of the background above was laid out to clear the way for string comparison. On this topic the Go language specification has just one sentence: “String values are comparable and ordered, lexically byte-wise.” What does that mean? The sentence makes three points:

  • Qualitative: Strings are comparable
  • Quantitative: Strings are ordered
  • Method: byte by byte

Let me go through the opening examples one by one. First, look at the following code:

    s1 := "12345"
    s2 := "2"
    fmt.Println(`"12345" > "2":`, s1 > s2)

Every character in s1 and s2 is an ASCII character, so each one is encoded in memory as a single byte. Following Go’s comparison rule, we compare s1 and s2 byte by byte, starting with the first character “1” of s1 and the first character “2” of s2. The byte for “2” is 0x32 and the byte for “1” is 0x31. Since 0x32 is greater than 0x31, the result is already decided: s1 is less than s2, and the program does not compare any later characters. The lengths of the strings play no role here. This is why the expression s1 > s2 is false.

What if s2 = “12346”? The first four characters of s1 and s2 are equal, so the order can only be determined by the fifth character. The fifth character of s2, “6”, is greater than the fifth character of s1, “5”, so when s2 = “12346”, s2 is greater than s1.
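The byte-by-byte procedure can be written out as a short function. This is an illustrative sketch only: compareBytewise is a made-up name, and it is not the code the Go runtime actually uses. It also covers a case not shown in the examples: when one string is a prefix of the other, the shorter one sorts first.

```go
package main

import "fmt"

// compareBytewise is an illustrative re-implementation of lexical
// byte-wise ordering; it is not the Go runtime's actual code.
// It returns -1 if a < b, 0 if a == b, and +1 if a > b.
func compareBytewise(a, b string) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i] != b[i] {
			// The first differing byte decides the result.
			if a[i] < b[i] {
				return -1
			}
			return +1
		}
	}
	// All shared bytes are equal: the shorter string sorts first.
	switch {
	case len(a) < len(b):
		return -1
	case len(a) > len(b):
		return +1
	}
	return 0
}

func main() {
	fmt.Println(compareBytewise("12345", "2"))     // -1
	fmt.Println(compareBytewise("12345", "12346")) // -1
	fmt.Println(compareBytewise("12345", "1234"))  // 1
	fmt.Println(compareBytewise("12345", "12345")) // 0
}
```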

Let’s take another look at the example of a string containing Chinese characters:

    s3 := "零"
    s4 := "一"
    s5 := "二"
    fmt.Println(`"一" > "零":`, s4 > s3) // false
    fmt.Println(`"二" > "零":`, s5 > s3) // false
    fmt.Println(`"二" > "一":`, s5 > s4) // true

To make the explanation easier to follow, we first work out the UTF-8 encodings of the three Chinese characters “零”, “一”, and “二”:

  • The UTF-8 encoding of “零” is: 0xE9 0x9B 0xB6
  • The UTF-8 encoding of “一” is: 0xE4 0xB8 0x80
  • The UTF-8 encoding of “二” is: 0xE4 0xBA 0x8C

Each of the three characters is encoded in three bytes.

Now let’s compare s4 (“一”) and s3 (“零”). The program compares s3 and s4 byte by byte: the first byte of “零” is 0xE9 and the first byte of “一” is 0xE4. Since 0xE9 > 0xE4, the comparison stops there and the result is s3 > s4, so s4 > s3 is false.

Similarly, s3 > s5, so s5 > s3 is false.

When comparing s4 (“一”) and s5 (“二”), their first bytes are both 0xE4, so the second byte decides the order: 0xBA > 0xB8, so s5 > s4.
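The walkthrough above can be reproduced with a small program that prints each character’s code point and bytes and locates the byte that decides each comparison. The helper firstDiff is a hypothetical name used only in this sketch.

```go
package main

import "fmt"

// firstDiff (a hypothetical helper) returns the index of the first byte
// at which a and b differ, or -1 if they agree over their shared length.
func firstDiff(a, b string) int {
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i] != b[i] {
			return i
		}
	}
	return -1
}

func main() {
	for _, s := range []string{"零", "一", "二"} {
		r := []rune(s)[0] // each sample string holds exactly one character
		fmt.Printf("%s %U % X\n", s, r, s)
	}
	// 零 U+96F6 E9 9B B6
	// 一 U+4E00 E4 B8 80
	// 二 U+4E8C E4 BA 8C

	fmt.Println(firstDiff("一", "零")) // 0: decided by the first byte
	fmt.Println(firstDiff("一", "二")) // 1: decided by the second byte

	fmt.Println("一" > "零", "二" > "零", "二" > "一") // false false true
}
```

The output also shows that the order follows the characters’ positions in the Unicode table (U+4E00 < U+4E8C < U+96F6), not the numbers they stand for: UTF-8 preserves code point order, so comparing bytes gives the same result as comparing code points.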

4. Compare function in Go strings package

The Go standard library provides the Compare function in the strings package for three-way comparison of two strings. According to the Go team’s comment on the function, however, it exists mainly to keep the API consistent with the bytes package, and in the Go release current when this article was written it was itself implemented with the built-in ordering operators:

    // $GOROOT/src/strings/compare.go
    func Compare(a, b string) int {
        if a == b {
            return 0
        }
        if a < b {
            return -1
        }
        return +1
    }

In practice we seldom use strings.Compare. Comparing string variables directly with the ordering operators is more common: it is more intuitive, and with the implementation shown above it is usually at least as fast, since it avoids a function call.

“Okay, that’s everything I wanted to tell you. Did you follow?” Thea said cheerfully to Tony, who was wide awake by now.

“That was really good. It’s more thorough than what I wrote in my book.” Tony clapped and smiled. “Thea has finally made Go string comparison clear.”

Thea was surprised. “Your book?”

Tony pointed to the little yellow book on the desk and said, “I wrote this book ^_^”.

A blush appeared on Thea’s face…



© 2022, bigwhite . All rights reserved.

This article is reprinted from: https://tonybai.com/2022/04/18/inside-go-string-comparison/
This site is for inclusion only, and the copyright belongs to the original author.
