Splitting Unicode Strings into "Character" Units
As I mentioned previously in "Troublesome Japanese", a Unicode string does not follow the simple rule "1 code point = 1 character". Emojis are particularly tricky, and I have summarized that topic on my own blog.
As briefly introduced in those articles, there is a Go package called github.com/rivo/uniseg that is said to segment UTF-8 strings into "character" units, that is, the grapheme clusters a reader perceives as single characters.
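Before reaching for the package, it may help to see the mismatch using the standard library alone. The following is a minimal sketch, not part of the uniseg samples; the helper name `counts` is made up for illustration. It counts bytes and code points for a string that a reader would see as two characters.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the UTF-8 byte length and the number of code points in s.
// (Hypothetical helper, introduced only for this illustration.)
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// Thumbs up (U+1F44D) + medium-light skin tone (U+1F3FC) + "!".
	text := "\U0001F44D\U0001F3FC!"
	b, r := counts(text)
	fmt.Println("bytes:", b) // 4 + 4 + 1 bytes
	fmt.Println("runes:", r) // 3 code points, yet only 2 visible "characters"
}
```

Neither number matches the two "characters" on screen, which is the gap a grapheme cluster segmenter is meant to close.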
Let's try running the sample code from github.com/rivo/uniseg (with a few minor adjustments) right away.
//go:build run

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	text := "👍🏼!"
	fmt.Println("Text:", text)

	// Iterate over the grapheme clusters ("characters") of text.
	gr := uniseg.NewGraphemes(text)
	for gr.Next() {
		rs := gr.Runes() // code points that make up this cluster
		fmt.Printf("%v : %U\n", string(rs), rs)
	}
}
Running this yields:
$ go run sample1.go
Text: 👍🏼!
👍🏼 : [U+1F44D U+1F3FC]
! : [U+0021]
Now, let's try changing the input text to:
text := "ペンギン ペンギン"
And see what happens.
$ go run sample2.go
Text: ペンギン ペンギン
ペ : [U+30D8 U+309A]
ン : [U+30F3]
ギ : [U+30AD U+3099]
ン : [U+30F3]
: [U+0020]
ペ : [U+FF8D U+FF9F]
ン : [U+FF9D]
ギ : [U+FF77 U+FF9E]
ン : [U+FF9D]
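I see. It keeps the combining Dakuten (U+3099) and Handakuten (U+309A) attached to their base characters, and it does the same for the half-width voiced sound marks (U+FF9E, U+FF9F). Very impressive.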
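To illustrate why this result is not trivial, here is a hedged sketch of a naive splitter that uses only the standard library. It simply attaches every nonspacing mark (general category Mn) to the preceding rune. This is not how uniseg works, and `naiveClusters` is a made-up name. It handles the combining marks in the first word, but the half-width marks U+FF9E and U+FF9F belong to category Lm rather than Mn, so the second word falls apart.

```go
package main

import (
	"fmt"
	"unicode"
)

// naiveClusters attaches each nonspacing mark (category Mn) to the rune
// before it. This is a naive stand-in for illustration, NOT the Unicode
// text segmentation algorithm that a real segmenter implements.
func naiveClusters(s string) []string {
	var out []string
	for _, r := range s {
		if unicode.Is(unicode.Mn, r) && len(out) > 0 {
			out[len(out)-1] += string(r)
			continue
		}
		out = append(out, string(r))
	}
	return out
}

func main() {
	full := "\u30D8\u309A\u30F3\u30AD\u3099\u30F3" // full-width kana with combining marks
	half := "\uFF8D\uFF9F\uFF9D\uFF77\uFF9E\uFF9D" // half-width kana with half-width marks

	fmt.Println(len(naiveClusters(full))) // marks are Mn, so they get attached
	fmt.Println(len(naiveClusters(half))) // marks are Lm, so they stay separate
}
```

The first word comes out as 4 pieces, but the second as 6, whereas the uniseg output above reports 4 "characters" for both.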
Now, let's try it with various emoji patterns:
text := "|#️⃣|☝️|☝🏻|🇯🇵|🏴|👩🏻❤️💋👨🏼|"
$ go run sample3.go
Text: |#️⃣|☝️|☝🏻|🇯🇵|🏴|👩🏻❤️💋👨🏼|
| : [U+007C]
#️⃣ : [U+0023 U+FE0F U+20E3]
| : [U+007C]
☝️ : [U+261D U+FE0F]
| : [U+007C]
☝🏻 : [U+261D U+1F3FB]
| : [U+007C]
🇯🇵 : [U+1F1EF U+1F1F5]
| : [U+007C]
🏴 : [U+1F3F4 U+E0067 U+E0062 U+E0065 U+E006E U+E0067 U+E007F]
| : [U+007C]
👩🏻❤️💋👨🏼 : [U+1F469 U+1F3FB U+200D U+2764 U+FE0F U+200D U+1F48B U+200D U+1F468 U+1F3FC]
| : [U+007C]
Wow! It separated them perfectly. For reference, each emoji can be classified as follows:
| Emoji | Sequence Type | Name |
|---|---|---|
| #️⃣ | emoji keycap sequence | keycap: # |
| ☝️ | emoji presentation sequence | index pointing up |
| ☝🏻 | emoji modifier sequence | index pointing up: light skin tone |
| 🇯🇵 | emoji flag sequence | flag: Japan |
| 🏴 | emoji tag sequence | flag: England |
| 👩🏻❤️💋👨🏼 | emoji zwj sequence | kiss: woman, man, light skin tone, medium-light skin tone |
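The flag rows in the table above can be rebuilt from plain ASCII, which may make the code points in the output easier to read. The sketch below is only an illustration: the function names are hypothetical, and it assumes the input is already valid (two uppercase ASCII letters for a region, lowercase ASCII for a subdivision tag) without checking.

```go
package main

import "fmt"

// flagFromRegion maps each uppercase ASCII letter of a region code to the
// matching regional indicator symbol (U+1F1E6 corresponds to 'A').
// Hypothetical helper; it does not validate its input.
func flagFromRegion(code string) []rune {
	rs := make([]rune, 0, len(code))
	for _, c := range code {
		rs = append(rs, 0x1F1E6+(c-'A'))
	}
	return rs
}

// subdivisionFlag builds an emoji tag sequence: a black flag (U+1F3F4),
// one tag character per ASCII character (U+E0000 + c), then the cancel
// tag (U+E007F). Hypothetical helper; it does not validate its input.
func subdivisionFlag(tag string) []rune {
	rs := []rune{0x1F3F4}
	for _, c := range tag {
		rs = append(rs, 0xE0000+c)
	}
	return append(rs, 0xE007F)
}

func main() {
	fmt.Printf("%U\n", flagFromRegion("JP"))     // same code points as the Japan flag above
	fmt.Printf("%U\n", subdivisionFlag("gbeng")) // same code points as the England flag above
}
```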
(Note that the emojis further down the table are displayed correctly only on certain platforms.) The last one is particularly brutal: it is a sequence of 10 code points in which the following 4 characters are joined with ZWJ (U+200D) to form a single emoji 👩🏻❤️💋👨🏼:
| Emoji | Code Points | Name |
|---|---|---|
| 👩🏻 | U+1F469 U+1F3FB | woman: light skin tone |
| ❤️ | U+2764 U+FE0F | red heart |
| 💋 | U+1F48B | kiss mark |
| 👨🏼 | U+1F468 U+1F3FC | man: medium-light skin tone |
Anyway, I've confirmed that Unicode strings can be separated into "character" units, including emojis. Mission accomplished.