Splitting Unicode Strings into "Character" Units

As I mentioned previously in "Troublesome Japanese", a Unicode string is not simply "1 code point = 1 character". Emojis are particularly tricky, and I have summarized this topic on my own blog.

https://text.baldanders.info/remark/2021/03/terrible-emoji/
https://text.baldanders.info/remark/2021/04/emoji-list/

As briefly introduced in those articles, there is a Go package called github.com/rivo/uniseg that is supposed to segment UTF-8 strings into "character" units (what Unicode calls grapheme clusters).

Let's try running the sample code from github.com/rivo/uniseg (with a few minor adjustments) right away.

sample1.go
//go:build run

package main

import (
    "fmt"

    "github.com/rivo/uniseg"
)

func main() {
    text := "👍🏼!"
    fmt.Println("Text:", text)
    gr := uniseg.NewGraphemes(text)
    for gr.Next() {
        rs := gr.Runes()
        fmt.Printf("%v : %U\n", string(rs), rs)
    }
}

Running this yields:

$ go run sample1.go
Text: 👍🏼!
👍🏼 : [U+1F44D U+1F3FC]
! : [U+0021]
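
For comparison, the standard library on its own counts this string in bytes or in code points, and neither matches the two "characters" reported above. The following is a minimal sketch using only the standard library; the helper name counts is made up for illustration and is not part of any package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper that returns the length of s
// in bytes and in code points (runes).
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// Same string as in sample1.go, written with escapes:
	// U+1F44D (thumbs up), U+1F3FC (skin tone modifier), and "!".
	text := "\U0001F44D\U0001F3FC!"
	b, r := counts(text)
	fmt.Println("bytes:", b)       // 9
	fmt.Println("code points:", r) // 3
}
```

Neither number is 2, which is why a grapheme-aware package is needed for this job.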

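Now, let's change the input text to the following. The first word is written with base katakana followed by combining dakuten and handakuten marks (U+3099 and U+309A), and the second word is written in half-width katakana:

sample2.go
text := "ペンギン ﾍﾟﾝｷﾞﾝ"

And see what happens.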

$ go run sample2.go
Text: ペンギン ﾍﾟﾝｷﾞﾝ
ペ : [U+30D8 U+309A]
ン : [U+30F3]
ギ : [U+30AD U+3099]
ン : [U+30F3]
  : [U+0020]
ﾍﾟ : [U+FF8D U+FF9F]
ﾝ : [U+FF9D]
ｷﾞ : [U+FF77 U+FF9E]
ﾝ : [U+FF9D]

I see. It correctly keeps the dakuten and handakuten marks together with their base characters, in both the full-width and the half-width forms. Very impressive.
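
To get a feel for what this grouping involves, here is a minimal sketch that uses only the standard library. It is not how github.com/rivo/uniseg works internally. The helpers isExtend and naiveClusters are hypothetical names, and they approximate only the "extend" part of the grapheme cluster rules by attaching nonspacing marks, enclosing marks, and Other_Grapheme_Extend characters to the preceding rune. Flags and ZWJ sequences are out of scope for this sketch.

```go
package main

import (
	"fmt"
	"unicode"
)

// isExtend roughly approximates the Grapheme_Extend property:
// nonspacing marks (Mn), enclosing marks (Me), and Other_Grapheme_Extend.
func isExtend(r rune) bool {
	return unicode.In(r, unicode.Mn, unicode.Me, unicode.Other_Grapheme_Extend)
}

// naiveClusters is a hypothetical helper that appends each extending
// rune to the cluster started by the preceding rune.
func naiveClusters(s string) []string {
	var clusters []string
	for _, r := range s {
		if isExtend(r) && len(clusters) > 0 {
			clusters[len(clusters)-1] += string(r)
			continue
		}
		clusters = append(clusters, string(r))
	}
	return clusters
}

func main() {
	// Same code points as the sample2.go output: decomposed full-width
	// katakana, a space, then half-width katakana.
	text := "\u30d8\u309a\u30f3\u30ad\u3099\u30f3 \uff8d\uff9f\uff9d\uff77\uff9e\uff9d"
	for _, c := range naiveClusters(text) {
		fmt.Printf("%s : %U\n", c, []rune(c))
	}
}
```

The half-width sound marks U+FF9E and U+FF9F are modifier letters rather than combining marks in terms of general category, so checking Mn alone would leave them as separate units. That is one reason to rely on a package that follows the Unicode segmentation rules instead of a hand-rolled check.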

Now, let's try it with various emoji patterns:

sample3.go
text := "|#️⃣|☝️|☝🏻|🇯🇵|🏴󠁧󠁢󠁥󠁮󠁧󠁿|👩🏻‍❤️‍💋‍👨🏼|"

Running this yields:

$ go run sample3.go
Text: |#️⃣|☝️|☝🏻|🇯🇵|🏴󠁧󠁢󠁥󠁮󠁧󠁿|👩🏻‍❤️‍💋‍👨🏼|
| : [U+007C]
#️⃣ : [U+0023 U+FE0F U+20E3]
| : [U+007C]
☝️ : [U+261D U+FE0F]
| : [U+007C]
☝🏻 : [U+261D U+1F3FB]
| : [U+007C]
🇯🇵 : [U+1F1EF U+1F1F5]
| : [U+007C]
🏴󠁧󠁢󠁥󠁮󠁧󠁿 : [U+1F3F4 U+E0067 U+E0062 U+E0065 U+E006E U+E0067 U+E007F]
| : [U+007C]
👩🏻‍❤️‍💋‍👨🏼 : [U+1F469 U+1F3FB U+200D U+2764 U+FE0F U+200D U+1F48B U+200D U+1F468 U+1F3FC]
| : [U+007C]

Wow! It separated them perfectly. For reference, each emoji can be classified as follows:

Emoji | Sequence Type | Name
#️⃣ | emoji keycap sequence | keycap: #
☝️ | emoji presentation sequence | index pointing up
☝🏻 | emoji modifier sequence | index pointing up: light skin tone
🇯🇵 | emoji flag sequence | flag: Japan
🏴󠁧󠁢󠁥󠁮󠁧󠁿 | emoji tag sequence | flag: England
👩🏻‍❤️‍💋‍👨🏼 | emoji zwj sequence | kiss: woman, man, light skin tone, medium-light skin tone

(Note that the emoji further down the list are displayed correctly only on some platforms.) The last one is particularly brutal: it is a sequence of 10 code points in which the following four emoji are joined with ZWJ (U+200D) to form the single emoji 👩🏻‍❤️‍💋‍👨🏼:

Emoji | Code Points | Name
👩🏻 | U+1F469 U+1F3FB | woman: light skin tone
❤️ | U+2764 U+FE0F | red heart
💋 | U+1F48B | kiss mark
👨🏼 | U+1F468 U+1F3FC | man: medium-light skin tone
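
The breakdown above can be reproduced with a few lines of standard-library Go by splitting the sequence at each ZWJ. This is only an illustrative sketch; splitZWJ is a hypothetical helper, not something github.com/rivo/uniseg provides.

```go
package main

import (
	"fmt"
	"strings"
)

// zwj is ZERO WIDTH JOINER (U+200D).
const zwj = "\u200d"

// splitZWJ is a hypothetical helper that returns the components
// of an emoji ZWJ sequence.
func splitZWJ(s string) []string {
	return strings.Split(s, zwj)
}

func main() {
	// The 10 code points of the "kiss" emoji from the sample3.go output.
	kiss := "\U0001F469\U0001F3FB\u200d\u2764\ufe0f\u200d\U0001F48B\u200d\U0001F468\U0001F3FC"
	for _, part := range splitZWJ(kiss) {
		fmt.Printf("%s : %U\n", part, []rune(part))
	}
}
```

Each printed line corresponds to one row of the table: three ZWJ characters join four components of 2, 2, 1, and 2 code points.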

Anyway, I've confirmed that Unicode strings can be separated into "character" units, including emojis. Mission accomplished.
