Deeply understand string in Golang - LiNPX

string

Almost no program can do without text (strings). In Go, string is a built-in type with properties similar to those of ordinary slice types. For example, it supports slice expressions, which is why Go lacks some string functions found in other languages, such as a substring function: slicing already covers that need. At the same time, a string is immutable (read-only), so it cannot be modified in place; its length can be obtained with len(), but cap() cannot be applied to it. In addition, the Go standard library contains several packages dedicated to text processing.

The strings package provides many simple functions for manipulating strings. Most common string-handling needs are covered by this package.
The strconv package provides conversions between basic data types and strings. Go has no implicit type conversion. An ordinary numeric conversion is written as int32(i), which converts i (for example of type int) to int32, but conversions between strings and types such as int, float and bool are not that simple.
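As a minimal sketch of the difference, the program below contrasts a plain numeric conversion with string conversions through strconv. The helper incrementDecimal is a hypothetical name introduced only for this illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// incrementDecimal is a hypothetical helper: it parses s as a decimal
// integer, adds one, and formats the result back into a string.
func incrementDecimal(s string) (string, error) {
	n, err := strconv.Atoi(s) // string -> int
	if err != nil {
		return "", err
	}
	return strconv.Itoa(n + 1), nil // int -> string
}

func main() {
	i := 42
	fmt.Println(int32(i)) // numeric types convert with a plain conversion

	out, err := incrementDecimal("41")
	fmt.Println(out, err)

	_, err = incrementDecimal("4x")
	fmt.Println(err) // parsing can fail, so strconv returns an error

	b, _ := strconv.ParseBool("true")
	f, _ := strconv.ParseFloat("3.5", 64)
	fmt.Println(b, f, strconv.FormatFloat(f, 'f', 2, 64))
}
```

Because parsing text can fail, the strconv parsing functions return an error alongside the value, which a plain conversion such as int32(i) never needs to do.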

Compare with C

Go is often compared with C; after all, the two share creators. One could say that Go's string was designed to solve the problems of working with strings in C.

Here are the characteristics of Go string and the improvements compared with C:

The string type itself is the biggest improvement: C has no such type and has to use character arrays instead;
Getting the length is O(1) because there is a len field, while in C it is O(n) because the whole character array must be traversed;
A Go string is immutable, while a C character array is mutable;
The comparison operators ==, !=, >, <, >= and <= are supported;
The + and += operators are supported for string concatenation;
String literals hold Unicode text, stored in UTF-8 encoding;
Multiline string declarations are supported.
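The points in this list can be tried out in a few lines. The following is only a sketch; sizes is a hypothetical helper added for the illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes is a hypothetical helper returning the byte length and the
// number of Unicode code points of s.
func sizes(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	fmt.Println(sizes("h\u00e9llo")) // len counts bytes, not characters

	fmt.Println("apple" < "banana", "go" == "go") // byte-wise comparison

	s := "Go"
	s += "pher" // builds a new string; the old one is never modified
	fmt.Println(s)

	// s[0] = 'g' would not compile: strings are immutable.

	multi := `first line
second line` // raw string literal spanning two lines
	fmt.Println(multi)
}
```

Note that len is O(1) but reports bytes; counting code points with utf8.RuneCountInString has to walk the string.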

Source code reading

The builtin package of the Go standard library documents all built-in types. The source is in src/builtin/builtin.go, where string is described as follows:

    // string is the set of all strings of 8-bit bytes, conventionally but not
    // necessarily representing UTF-8-encoded text. A string may be empty, but
    // not nil. Values of string type are immutable.
    type string string

So string is a collection of 8-bit bytes, usually but not necessarily UTF-8 encoded text.
In addition, two important points are mentioned:

String can be empty (length is 0), but not nil;
String objects cannot be modified.

In the runtime package, the structure of a string is:

    type stringStruct struct {
        str unsafe.Pointer
        len int
    }

Its data structure is simple:

stringStruct.str: the address of the first byte of the string data;
stringStruct.len: the length of the string in bytes.

The string data structure is similar to that of a slice, except that a slice also has a member for the capacity. In fact, strings and slices (byte slices, to be exact) are frequently converted to each other.

In the reflect package, the structure of a string is:

    // StringHeader is the runtime representation of a string.
    // It cannot be used safely or portably and its representation may
    // change in a later release.
    // Moreover, the Data field is not sufficient to guarantee the data
    // it references will not be garbage collected, so programs must keep
    // a separate, correctly typed pointer to the underlying data.
    type StringHeader struct {
        Data uintptr
        Len  int
    }

The two are similar but not identical. As the comments on StringHeader say, StringHeader is the runtime representation of a string. It does not store the string data itself; it only holds a pointer to the underlying storage (Data uintptr) and an int field for the length (Len int). The underlying storage of a string is an array of bytes.
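One way to observe that a string value is only a small header is to measure it with unsafe.Sizeof. The sketch below assumes the standard gc toolchain, where the header is a pointer plus a length; headerWords is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unsafe"
)

// headerWords is a hypothetical helper reporting how many machine words
// the string value itself occupies (not the bytes it points to).
func headerWords(s string) uintptr {
	return unsafe.Sizeof(s) / unsafe.Sizeof(uintptr(0))
}

func main() {
	fmt.Println(headerWords(""))
	fmt.Println(headerWords("a much longer string than the empty one"))
}
```

Both calls report the same size, because the text lives in the underlying byte array and only the pointer and length are part of the value. This is also why passing a string to a function is cheap regardless of its length.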


The Go Language Specification (https://golang.org/ref/spec) describes uintptr as "an unsigned integer large enough to store the uninterpreted bits of a pointer value", that is, an unsigned integer large enough to hold any pointer value. By contrast, uint is a platform-dependent unsigned integer: 32 bits wide on 32-bit machines and 64 bits wide on 64-bit machines.

Common Operations

String comparison

The Compare function compares two strings lexicographically. It returns 0 if the two strings are equal, -1 if a is less than b, and 1 otherwise. Using it is generally not recommended: the operators ==, !=, >, <, >= and <= are more intuitive.

 func Compare(a, b string) int

The EqualFold function reports whether s and t are equal when letter case is ignored.

 func EqualFold(s, t string) bool
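The following sketch shows both functions next to the built-in operators. threeWay is a hypothetical helper that reproduces the result of Compare with operators only.

```go
package main

import (
	"fmt"
	"strings"
)

// threeWay is a hypothetical helper that mirrors strings.Compare using
// only the built-in comparison operators.
func threeWay(a, b string) int {
	switch {
	case a < b:
		return -1
	case a > b:
		return 1
	}
	return 0
}

func main() {
	fmt.Println(strings.Compare("a", "b"), threeWay("a", "b"))
	fmt.Println(strings.Compare("b", "a"), threeWay("b", "a"))
	fmt.Println(strings.Compare("x", "x"), threeWay("x", "x"))

	// EqualFold ignores case; == does not.
	fmt.Println(strings.EqualFold("Gopher", "GOPHER"), "Gopher" == "GOPHER")
}
```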

Whether a character or substring exists

    // Contains reports whether substr is within s.
    func Contains(s, substr string) bool

    // ContainsAny reports whether any Unicode code point in chars is within s.
    func ContainsAny(s, chars string) bool

    // ContainsRune reports whether the Unicode code point r is within s.
    func ContainsRune(s string, r rune) bool
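A short usage sketch for these three functions follows, including the edge cases with empty arguments. hasVowel is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// hasVowel is a hypothetical helper built on ContainsAny.
func hasVowel(s string) bool {
	return strings.ContainsAny(s, "aeiou")
}

func main() {
	fmt.Println(strings.Contains("seafood", "foo"))   // substring search
	fmt.Println(strings.Contains("seafood", ""))      // the empty string is contained in every string
	fmt.Println(strings.ContainsAny("team", ""))      // no code points to look for
	fmt.Println(strings.ContainsRune("chicken", 'k')) // single code point
	fmt.Println(hasVowel("gopher"), hasVowel("rhythm"))
}
```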

Number of substrings (string matching)

Courses on data structures and algorithms usually cover the following string-matching algorithms:

Naive matching algorithm
KMP algorithm
Rabin-Karp algorithm
Boyer-Moore algorithm

There are other algorithms that are not listed here; you can look them up if you are interested.

In Go, counting the occurrences of a substring, that is, string pattern matching, builds on the Rabin-Karp algorithm (the details vary between Go versions). The Count function is declared as follows:

 func Count(s, sep string) int
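To make the idea behind Rabin-Karp concrete, here is a minimal sketch of a counter based on a rolling hash. It is not the standard library's code: the function name rabinKarpCount and the multiplier base are choices made for this illustration. Like strings.Count, it counts non-overlapping occurrences.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// base is the multiplier of the rolling hash; its value is an arbitrary
// choice for this sketch.
const base = 16777619

// rabinKarpCount counts non-overlapping occurrences of sep in s.
func rabinKarpCount(s, sep string) int {
	n, m := len(s), len(sep)
	if m == 0 {
		return utf8.RuneCountInString(s) + 1 // same convention as strings.Count
	}
	if m > n {
		return 0
	}
	// Hash sep and the first window of s; pow ends up as base^m.
	var hsep, h uint32
	pow := uint32(1)
	for i := 0; i < m; i++ {
		hsep = hsep*base + uint32(sep[i])
		h = h*base + uint32(s[i])
		pow *= base
	}
	count, next := 0, 0 // next: first index where a new match may start
	for i := 0; ; i++ {
		// A hash hit is only a candidate; confirm it by comparing the bytes.
		if h == hsep && i >= next && s[i:i+m] == sep {
			count++
			next = i + m
		}
		if i+m >= n {
			return count
		}
		// Slide the window: append s[i+m], drop s[i].
		h = h*base + uint32(s[i+m]) - pow*uint32(s[i])
	}
}

func main() {
	cases := [][2]string{{"cheese", "e"}, {"aaaa", "aa"}, {"abababa", "aba"}, {"five", ""}}
	for _, c := range cases {
		fmt.Println(rabinKarpCount(c[0], c[1]), strings.Count(c[0], c[1]))
	}
}
```

The hash of each window is derived from the previous one in constant time, so the expected cost is linear in len(s); the byte comparison only runs when the hashes agree.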

Splitting a string into []string

The strings package provides six splitting functions in three groups: Fields and FieldsFunc, Split and SplitAfter, SplitN and SplitAfterN.

Fields splits the string s around each run of one or more consecutive whitespace characters and returns a slice of substrings. If s contains only whitespace, an empty slice is returned (the []string has length 0). Whitespace is defined by unicode.IsSpace.

    fmt.Printf("Fields are: %q", strings.Fields("  foo bar  baz   "))

Output:

    Fields are: ["foo" "bar" "baz"]

FieldsFunc splits s at each run of Unicode code points c for which f(c) returns true, and returns a []string. If every code point in s satisfies f(c), or if s is empty, FieldsFunc returns an empty slice.

 fmt.Println(strings.FieldsFunc("  foo bar  baz   ", unicode.IsSpace))

In fact, Fields behaves exactly like a call to FieldsFunc with unicode.IsSpace, and has been implemented that way:

    func Fields(s string) []string {
        return FieldsFunc(s, unicode.IsSpace)
    }
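FieldsFunc becomes useful once the separator is something other than whitespace. As a sketch, the hypothetical helper words below treats every code point that is neither a letter nor a digit as a separator.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// words is a hypothetical helper that splits s at every run of code
// points that are neither letters nor digits.
func words(s string) []string {
	notWord := func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	}
	return strings.FieldsFunc(s, notWord)
}

func main() {
	fmt.Printf("%q\n", words("  foo1;bar2,baz3..."))
	fmt.Printf("%q\n", words(";;;"))        // only separators: empty slice
	fmt.Println(len(strings.Fields("   "))) // only whitespace: length 0
}
```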

Split, SplitAfter, SplitN and SplitAfterN are all implemented through the same internal function, genSplit:

    func Split(s, sep string) []string { return genSplit(s, sep, 0, -1) }
    func SplitAfter(s, sep string) []string { return genSplit(s, sep, len(sep), -1) }
    func SplitN(s, sep string, n int) []string { return genSplit(s, sep, 0, n) }
    func SplitAfterN(s, sep string, n int) []string { return genSplit(s, sep, len(sep), n) }

Split(s, sep) is equivalent to SplitN(s, sep, -1), and SplitAfter(s, sep) is equivalent to SplitAfterN(s, sep, -1).

Split removes sep from the results, while SplitAfter keeps it at the end of each piece:

    fmt.Printf("%q\n", strings.Split("foo,bar,baz", ","))
    fmt.Printf("%q\n", strings.SplitAfter("foo,bar,baz", ","))

Output:

    ["foo" "bar" "baz"]
    ["foo," "bar," "baz"]
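The n parameter of SplitN and SplitAfterN limits the number of pieces; the last piece holds the unsplit remainder. A small sketch follows, in which keyValue is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// keyValue is a hypothetical helper that splits "key=value" at the
// first "=" only, so the value may itself contain "=".
func keyValue(s string) (key, value string) {
	parts := strings.SplitN(s, "=", 2)
	if len(parts) == 1 {
		return parts[0], ""
	}
	return parts[0], parts[1]
}

func main() {
	fmt.Printf("%q\n", strings.SplitN("a,b,c,d", ",", 2))
	fmt.Printf("%q\n", strings.SplitAfterN("a,b,c,d", ",", 2))
	fmt.Printf("%q\n", strings.Split("a,b,", ",")) // a trailing separator yields a trailing empty string
	fmt.Println(keyValue("path=/usr/bin=x"))
}
```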

Whether the string has a prefix or suffix

    // HasPrefix reports whether s begins with prefix.
    func HasPrefix(s, prefix string) bool {
        return len(s) >= len(prefix) && s[0:len(prefix)] == prefix
    }

    // HasSuffix reports whether s ends with suffix.
    func HasSuffix(s, suffix string) bool {
        return len(s) >= len(suffix) && s[len(s)-len(suffix):] == suffix
    }

If prefix or suffix is "", the return value is always true.

    fmt.Println(strings.HasPrefix("Gopher", "Go"))
    fmt.Println(strings.HasPrefix("Gopher", "C"))
    fmt.Println(strings.HasPrefix("Gopher", ""))

Output:

    true
    false
    true

The position of the character or substring in the string

    // Index returns the index of the first occurrence of sep in s.
    func Index(s, sep string) int

    // IndexByte returns the index of the first occurrence of the byte c in s.
    func IndexByte(s string, c byte) int

    // IndexAny returns the index of the first occurrence in s of any Unicode code point in chars.
    func IndexAny(s, chars string) int

    // IndexFunc returns the index of the first character c in s for which f(c) returns true.
    func IndexFunc(s string, f func(rune) bool) int

    // IndexRune returns the index of the first occurrence of the Unicode code point r in s.
    func IndexRune(s string, r rune) int

    // The corresponding functions for the last occurrence:
    func LastIndex(s, sep string) int
    func LastIndexByte(s string, c byte) int
    func LastIndexAny(s, chars string) int
    func LastIndexFunc(s string, f func(rune) bool) int
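All of these functions return a byte offset, or -1 when nothing is found, which combines naturally with slicing. A brief sketch, where splitAt is a hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitAt is a hypothetical helper that splits s around the first
// occurrence of sep, using Index and slice expressions.
func splitAt(s, sep string) (before, after string, found bool) {
	i := strings.Index(s, sep)
	if i < 0 {
		return s, "", false
	}
	return s[:i], s[i+len(sep):], true
}

func main() {
	fmt.Println(strings.Index("chicken", "ken"))              // 4
	fmt.Println(strings.Index("chicken", "dmr"))              // -1
	fmt.Println(strings.LastIndex("go gopher", "go"))         // 3
	fmt.Println(strings.IndexFunc("abc123", unicode.IsDigit)) // 3
	fmt.Println(splitAt("user@example.com", "@"))
}
```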

Joining strings

A string array (or slice) can be concatenated with Join:

    fmt.Println(strings.Join([]string{"name=xxx", "age=xx"}, "&"))

Output:

    name=xxx&age=xx
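Join is the natural counterpart of the splitting functions. As an illustration, the hypothetical helper normalizeSpaces below combines Fields and Join to collapse runs of whitespace.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeSpaces is a hypothetical helper: it trims s and collapses
// every run of whitespace into a single space.
func normalizeSpaces(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Printf("%q\n", normalizeSpaces("  foo bar  baz   "))
	fmt.Printf("%q\n", strings.Join(nil, ",")) // joining nothing gives ""
}
```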

Character substitution

Map function: map and replace each character of s according to the rules of mapping. If the return value of mapping is<0, the character will be discarded. This method can only process each character, but the processing method is very flexible, which can easily filter and filter Chinese characters.

 mapping := func(r rune) rune { switch { Case r>='A '&&r<='Z'://convert uppercase letters to lowercase letters return r + 32 Case r>='a '&&r<='z'://Lower case letters are not processed return r Case unicode. Is (unicode. Han, r)://Chinese character line breaking return '\n' } Return - 1//Filter all non alphabetic and Chinese characters } Fmt. Println (strings. Map (mapping, "Hello you # ￥%...  n ('World  n, good Hello ^ (&(* gopher..."))) hello world hello gopher

String substring replacement

When replacing strings, keep performance in mind: prefer the functions here over regular expressions whenever they are sufficient.

    // Replace returns s with the first n occurrences of old replaced by new.
    // If n < 0, there is no limit on the number of replacements, that is, all occurrences are replaced.
    func Replace(s, old, new string, n int) string

    // ReplaceAll simply calls Replace(s, old, new, -1).
    func ReplaceAll(s, old, new string) string
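The effect of the n parameter is easiest to see side by side. In this sketch replaceFirst is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// replaceFirst is a hypothetical helper that replaces only the first
// occurrence of old in s.
func replaceFirst(s, old, repl string) string {
	return strings.Replace(s, old, repl, 1)
}

func main() {
	fmt.Println(strings.Replace("oink oink oink", "k", "ky", 2))      // first two only
	fmt.Println(strings.Replace("oink oink oink", "oink", "moo", -1)) // all occurrences
	fmt.Println(strings.ReplaceAll("oink oink oink", "oink", "moo"))  // same as n = -1
	fmt.Println(replaceFirst("a-b-c", "-", "+"))
}
```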

Case conversion

    func ToLower(s string) string
    func ToLowerSpecial(c unicode.SpecialCase, s string) string
    func ToUpper(s string) string
    func ToUpperSpecial(c unicode.SpecialCase, s string) string

Trimming

    // Trim removes from both ends of s all characters contained in cutset.
    func Trim(s string, cutset string) string

    // TrimLeft removes from the left end of s all characters contained in cutset.
    func TrimLeft(s string, cutset string) string

    // TrimRight removes from the right end of s all characters contained in cutset.
    func TrimRight(s string, cutset string) string

    // TrimPrefix returns s without the leading prefix; if s does not start with prefix, s is returned unchanged.
    func TrimPrefix(s, prefix string) string

    // TrimSuffix returns s without the trailing suffix; if s does not end with suffix, s is returned unchanged.
    func TrimSuffix(s, suffix string) string

    // TrimSpace removes whitespace from both ends of s. Common whitespace
    // characters include '\t', '\n', '\v', '\f', '\r', ' ' and U+0085 (NEL).
    func TrimSpace(s string) string

    // TrimFunc removes from both ends of s all characters for which f returns true.
    func TrimFunc(s string, f func(rune) bool) string

    // TrimLeftFunc removes from the left end of s all characters for which f returns true.
    func TrimLeftFunc(s string, f func(rune) bool) string

    // TrimRightFunc removes from the right end of s all characters for which f returns true.
    func TrimRightFunc(s string, f func(rune) bool) string
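A common mistake is to confuse the cutset functions with the prefix and suffix functions: a cutset is a set of characters, not a substring. The sketch below shows the difference; both helper functions are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// stripScheme removes the literal prefix "https://" if present.
func stripScheme(url string) string {
	return strings.TrimPrefix(url, "https://")
}

// stripSchemeWrong treats "https://" as a SET of characters and keeps
// removing leading characters as long as they belong to that set.
func stripSchemeWrong(url string) string {
	return strings.TrimLeft(url, "https://")
}

func main() {
	fmt.Println(stripScheme("https://host.example"))      // host.example
	fmt.Println(stripSchemeWrong("https://host.example")) // ost.example: the leading 'h' is in the set
	fmt.Printf("%q\n", strings.TrimSpace(" \t hello \n"))
	fmt.Println(strings.Trim("xxhixx", "x"))
}
```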

Advanced operations

Replacer type

This is a struct that exports no fields. It is instantiated with the function func NewReplacer(oldnew ...string) *Replacer, where the variadic parameter oldnew is a list of old, new pairs, so that several replacements are made at once.

    r := strings.NewReplacer("<", "&lt;", ">", "&gt;")
    fmt.Println(r.Replace("This is <b>HTML</b>!"))

Output:

    This is &lt;b&gt;HTML&lt;/b&gt;!
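A Replacer applies all of its pairs in a single pass over the input, so text produced by one replacement is not touched by another. The sketch below contrasts this with chained ReplaceAll calls in a deliberately unlucky order; escape and escapeChained are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// htmlEscaper is built once and reused; a Replacer is safe for
// concurrent use by multiple goroutines.
var htmlEscaper = strings.NewReplacer("&", "&amp;", "<", "&lt;", ">", "&gt;")

// escape performs all three replacements in one pass.
func escape(s string) string {
	return htmlEscaper.Replace(s)
}

// escapeChained replaces step by step; because "&" is handled last,
// the "&" of an already inserted "&lt;" is escaped a second time.
func escapeChained(s string) string {
	s = strings.ReplaceAll(s, "<", "&lt;")
	s = strings.ReplaceAll(s, ">", "&gt;")
	return strings.ReplaceAll(s, "&", "&amp;")
}

func main() {
	fmt.Println(escape("a<b && c>d"))
	fmt.Println(escapeChained("a<b"))
}
```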

Builder Type

    type Builder struct {
        addr *Builder // of receiver, to detect copies by value
        buf  []byte
    }

This type implements the Writer, ByteWriter, StringWriter and other interfaces of the io package, so data can be written to it. Builder does not implement Reader or similar interfaces, so it cannot be read from, but it provides a String method to obtain the accumulated data.

    // WriteByte appends the byte c to b.
    func (b *Builder) WriteByte(c byte) error

    // WriteRune appends the character r to b.
    func (b *Builder) WriteRune(r rune) (int, error)

    // Write appends the byte slice p to b.
    func (b *Builder) Write(p []byte) (int, error)

    // WriteString appends the string s to b.
    func (b *Builder) WriteString(s string) (int, error)

    // Len returns the length of the data in b.
    func (b *Builder) Len() int

    // Cap returns the capacity of b.
    func (b *Builder) Cap() int

    // Grow grows b's capacity, if necessary, to guarantee space for
    // another n bytes. If n is negative, Grow panics.
    func (b *Builder) Grow(n int)

    // Reset clears all contents of b.
    func (b *Builder) Reset()

    // String returns the data of b as a string.
    func (b *Builder) String() string
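When the final size is known in advance, Grow lets the Builder reserve its buffer once instead of growing it step by step. A minimal sketch, with repeatWithSep as a hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"
)

// repeatWithSep is a hypothetical helper that writes word n times,
// separated by sep, into a Builder sized up front with Grow.
func repeatWithSep(word, sep string, n int) string {
	if n <= 0 {
		return ""
	}
	var b strings.Builder
	b.Grow(n*len(word) + (n-1)*len(sep)) // reserve the exact final size
	for i := 0; i < n; i++ {
		if i > 0 {
			b.WriteString(sep)
		}
		b.WriteString(word)
	}
	return b.String()
}

func main() {
	fmt.Println(repeatWithSep("go", "-", 3))
	fmt.Println(len(repeatWithSep("gopher", ", ", 100)))
}
```

A Builder must not be copied after first use; the addr field shown above exists to detect such copies.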

The problem of converting to a slice

Let's take a look at an example:

    package main

    import (
        "fmt"
        "testing"
    )

    func main() {
        fmt.Println(testing.AllocsPerRun(1, convert1)) // Output 1
        fmt.Println(testing.AllocsPerRun(1, convert2)) // Output 0
    }

    func convert1() {
        s := "China welcomes you, Beijing welcomes you"
        sl := []byte(s)
        for _, v := range sl {
            _ = v
        }
    }

    func convert2() {
        s := "Welcome to China"
        sl := []byte(s)
        for _, v := range sl {
            _ = v
        }
    }

Why does convert1 allocate memory for the conversion while convert2 does not?

Because for a string of at most 32 bytes whose byte slice does not escape, the bytes are copied into a small buffer supplied by the caller instead of into newly allocated memory. The following is the source code:

    // The constant is known to the compiler.
    // There is no fundamental theory behind this number.
    const tmpStringBufSize = 32

    type tmpBuf [tmpStringBufSize]byte

    func stringtoslicebyte(buf *tmpBuf, s string) []byte {
        var b []byte
        if buf != nil && len(s) <= len(buf) {
            *buf = tmpBuf{}
            b = buf[:len(s)]
        } else {
            b = rawbyteslice(len(s))
        }
        copy(b, s)
        return b
    }

    // rawbyteslice allocates a new byte slice. The byte slice is not zeroed.
    func rawbyteslice(size int) (b []byte) {
        cap := roundupsize(uintptr(size))
        p := mallocgc(cap, nil, false)
        if cap != uintptr(size) {
            memclrNoHeapPointers(add(p, uintptr(size)), cap-uintptr(size))
        }

        *(*slice)(unsafe.Pointer(&b)) = slice{p, size, int(cap)}
        return
    }

If the string is longer than 32 bytes but you still want to achieve 0 allocations, what should you do?

The first approach is to do the conversion directly in the for range clause.

    package main

    import (
        "fmt"
        "testing"
    )

    func main() {
        fmt.Println(testing.AllocsPerRun(1, convert)) // Output 0
    }

    func convert() {
        s := "China welcomes you, Beijing welcomes you"
        for _, v := range []byte(s) {
            _ = v
        }
    }

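The second approach is a forced conversion based on reflect and unsafe. The resulting slice shares memory with the string, so it must never be written to.

    package main

    import (
        "fmt"
        "reflect"
        "testing"
        "unsafe"
    )

    func main() {
        fmt.Println(testing.AllocsPerRun(1, convert)) // Output 0
    }

    func convert() {
        s := "China welcomes you, Beijing welcomes you"
        sl := StringToBytes(s)
        for _, v := range sl {
            _ = v
        }
    }

    // StringToBytes reinterprets the string header as a slice header
    // without copying the underlying bytes.
    func StringToBytes(s string) (b []byte) {
        sh := *(*reflect.StringHeader)(unsafe.Pointer(&s))
        bh := (*reflect.SliceHeader)(unsafe.Pointer(&b))
        bh.Data, bh.Len, bh.Cap = sh.Data, sh.Len, sh.Len
        return b
    }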
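On newer toolchains the same zero-copy conversion can be written without reflect. The following is a sketch that assumes Go 1.20 or later, where unsafe.StringData, unsafe.Slice, unsafe.String and unsafe.SliceData are available; the function names are hypothetical, and the same read-only restriction applies.

```go
package main

import (
	"fmt"
	"unsafe"
)

// stringToBytes returns a []byte that shares memory with s (no copy).
// The result must be treated as read-only: writing to it is invalid.
func stringToBytes(s string) []byte {
	if s == "" {
		return nil
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

// bytesToString is the reverse direction; b must not be modified for
// as long as the returned string is in use.
func bytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

func main() {
	b := stringToBytes("China welcomes you, Beijing welcomes you")
	fmt.Println(len(b), cap(b), bytesToString(b[:5]))
}
```

Either variant trades safety for speed, so it is best kept for hot paths where a benchmark shows that the copy actually matters.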

Performance test:

    var x = "China welcomes you, Beijing welcomes you"

    // BenchmarkBytesToString measures the forced conversion.
    func BenchmarkBytesToString(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = StringToBytes(x)
        }
    }

    // BenchmarkBytesToStringNormal measures the native conversion.
    func BenchmarkBytesToStringNormal(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = []byte(x)
        }
    }

Results:

    BenchmarkBytesToString-4          1000000000     0.3269 ns/op
    BenchmarkBytesToStringNormal-4      36955354    34.13 ns/op

Clearly, the forced conversion performs much better than []byte(string).

String concatenation

In Go, there are five common ways to concatenate strings:

Use +

    func plusConcat(n int, str string) string {
        s := ""
        for i := 0; i < n; i++ {
            s += str
        }
        return s
    }

Use fmt.Sprintf

    func sprintfConcat(n int, str string) string {
        s := ""
        for i := 0; i < n; i++ {
            s = fmt.Sprintf("%s%s", s, str)
        }
        return s
    }

Use strings.Builder

    func builderConcat(n int, str string) string {
        var builder strings.Builder
        for i := 0; i < n; i++ {
            builder.WriteString(str)
        }
        return builder.String()
    }

Use bytes.Buffer

    func bufferConcat(n int, s string) string {
        buf := new(bytes.Buffer)
        for i := 0; i < n; i++ {
            buf.WriteString(s)
        }
        return buf.String()
    }

Use []byte

    func byteConcat(n int, str string) string {
        buf := make([]byte, 0)
        for i := 0; i < n; i++ {
            buf = append(buf, str...)
        }
        return string(buf)
    }

Performance test:

    BenchmarkPlusConcat-4         15    107333204 ns/op    530997245 B/op    10018 allocs/op
    BenchmarkSprintfConcat-4       7    154160789 ns/op    832915797 B/op    34114 allocs/op
    BenchmarkBuilderConcat-4    6637       171604 ns/op       514801 B/op       23 allocs/op
    BenchmarkBufferConcat-4     7231       164885 ns/op       368576 B/op       13 allocs/op
    BenchmarkByteConcat-4       7371       248181 ns/op       621298 B/op       24 allocs/op

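From the benchmark results, strings.Builder, bytes.Buffer and []byte perform in the same range (bytes.Buffer is marginally the fastest in this run) and are several hundred times faster than +, while fmt.Sprintf is the slowest of all.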
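The allocation counts behind such results can also be checked outside a benchmark run. The sketch below uses testing.AllocsPerRun, as earlier in this article, to compare + with strings.Builder; allocsFor and sink are hypothetical names, and the exact numbers depend on the Go version.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

var sink string // keeps results alive so the work is not optimized away

func plusConcat(n int, str string) string {
	s := ""
	for i := 0; i < n; i++ {
		s += str // allocates a new, longer string each time
	}
	return s
}

func builderConcat(n int, str string) string {
	var b strings.Builder
	for i := 0; i < n; i++ {
		b.WriteString(str) // appends into a buffer that grows geometrically
	}
	return b.String()
}

// allocsFor reports the average number of heap allocations made by one
// call of concat(n, str).
func allocsFor(concat func(int, string) string, n int, str string) float64 {
	return testing.AllocsPerRun(10, func() { sink = concat(n, str) })
}

func main() {
	fmt.Println("+      :", allocsFor(plusConcat, 1000, "abc"))
	fmt.Println("Builder:", allocsFor(builderConcat, 1000, "abc"))
}
```

With + every iteration copies the whole string built so far, so the total work grows quadratically with n, whereas the buffer-based approaches only reallocate occasionally.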


This article was written by Chakhsu Lau and is licensed under the Creative Commons Attribution 4.0 International License.