How to get the number of characters in a string

You can try RuneCountInString from the utf8 package.

returns the number of runes in p

that, as illustrated in this script: the length of "World" might be 6 (when written in Chinese: "世界"), but the rune count of "世界" is 2:

package main
    
import "fmt"
import "unicode/utf8"
    
func main() {
    fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
}

Phrozen adds in the comments:

Actually you can do len() over runes by just type casting.
len([]rune("世界")) will print 2. At least in Go 1.3.

And with CL 108985 (May 2018, for Go 1.11), len([]rune(string)) is now optimized. (Fixes issue 24923)

The compiler detects len([]rune(string)) pattern automatically, and replaces it with for r := range s call.

Adds a new runtime function to count runes in a string. Modifies the compiler to detect the pattern len([]rune(string)) and replaces it with the new rune counting runtime function.
RuneCount/lenruneslice/ASCII        27.8ns ± 2%  14.5ns ± 3%  -47.70%
RuneCount/lenruneslice/Japanese     126ns ± 2%   60  ns ± 2%  -52.03%
RuneCount/lenruneslice/MixedLength  104ns ± 2%   50  ns ± 1%  -51.71%

Stefan Steiger points to the blog post "Text normalization in Go"

What is a character?

As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character.

The definition of a character may vary depending on the application.
For normalization we will define it as:

a sequence of runes that starts with a starter,

a rune that does not modify or combine backwards with any other rune,

followed by possibly empty sequence of non-starters, that is, runes that do (typically accents).

The normalization algorithm processes one character at at time.

Using that package and its Iter type, the actual number of "character" would be:

package main
    
import "fmt"
import "golang.org/x/text/unicode/norm"
    
func main() {
    var ia norm.Iter
    ia.InitString(norm.NFKD, "école")
    nc := 0
    for !ia.Done() {
        nc = nc + 1
        ia.Next()
    }
    fmt.Printf("Number of chars: %d\n", nc)
}

Here, this uses the Unicode Normalization form NFKD "Compatibility Decomposition"

Oliver's answer points to UNICODE TEXT SEGMENTATION as the only way to reliably determining default boundaries between certain significant text elements: user-perceived characters, words, and sentences.

For that, you need an external library like rivo/uniseg, which does Unicode Text Segmentation.

That will actually count "grapheme cluster", where multiple code points may be combined into one user-perceived character.

package uniseg
    
import (
    "fmt"
    
    "github.com/rivo/uniseg"
)
    
func main() {
    gr := uniseg.NewGraphemes("ðð¼!")
    for gr.Next() {
        fmt.Printf("%x ", gr.Runes())
    }
    // Output: [1f44d 1f3fc] [21]
}

Two graphemes, even though there are three runes (Unicode code points).

You can see other examples in "How to manipulate strings in GO to reverse them?"

dark skin (1f3fe)
ZERO WIDTH JOINER (200d)
ð¦°red hair (1f9b0)

There is a way to get count of runes without any packages by converting string to []rune as len([]rune(YOUR_STRING)):

package main

import "fmt"

func main() {
    russian := "Спутник и погром"
    english := "Sputnik & pogrom"

    fmt.Println("count of bytes:",
        len(russian),
        len(english))

    fmt.Println("count of runes:",
        len([]rune(russian)),
        len([]rune(english)))

}

count of bytes 30 16

count of runes 16 16

How to get the number of characters in a string

What is a character?

Tags:

String

Character

String Length

Go

Related

Recent Posts