
Handling Grapheme Clusters in Python


To handle grapheme clusters in Python, you can use the third-party regex module or PyICU. For Debian's system Python, both are available as apt packages.

sudo apt install python3-regex python3-icu

For a Python installed with a version manager such as mise or asdf, install the ICU C library development package first, because PyICU is built against it.

sudo apt install libicu-dev

Install the modules in a virtual environment created with venv.

python3 -m venv .venv
source .venv/bin/activate
python -m pip install regex
python -m pip install pyicu
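
To confirm that both modules landed in the active environment, a quick check like the following can help. This is only a sketch: the helper name `installed` is made up for illustration, and it relies on regex being imported as `regex` and PyICU being imported as `icu`.

```python
import importlib.util


def installed(name):
    """Return True if a top-level module called `name` can be found (hypothetical helper)."""
    return importlib.util.find_spec(name) is not None


# Note: PyICU's import name is `icu`, not `pyicu`.
for name in ("regex", "icu"):
    print(name, "ok" if installed(name) else "missing")
```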

First, let's try regex. Its \X pattern matches one extended grapheme cluster.

>>> import regex
>>> text = 'ハ\u309Aハ\u309A'
>>> matches = regex.compile(r'\X').finditer(text)
>>> [m.group() for m in matches]
['パ', 'パ']
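
For comparison, the sketch below uses only the standard library to show why a plain `str` is not enough here: `len()` and iteration work on code points, so each base character and its combining mark are counted separately. The variable names are illustrative.

```python
import unicodedata

text = 'ハ\u309Aハ\u309A'

# str operations see four code points, not two user-perceived characters.
print(len(text))  # 4

# U+309A is a combining mark (non-zero canonical combining class),
# so it attaches to the preceding base character.
for ch in text:
    print(f'U+{ord(ch):04X}', unicodedata.name(ch), unicodedata.combining(ch))
```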

With PyICU, use a character BreakIterator as follows:

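import icu

text = 'ハ\u309A\u309A'
# Character-level break iterator for the default locale
b = icu.BreakIterator.createCharacterInstance(icu.Locale())
b.setText(text)

# Iterating yields the end offset of each grapheme cluster in turn
i = 0
for j in b:
    print(text[i:j])
    i = j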

With a small change, this becomes a generator:

import icu

def breakIter(text):
    """Yield the grapheme clusters of text one at a time."""
    b = icu.BreakIterator.createCharacterInstance(icu.Locale())
    b.setText(text)
    i = 0
    for j in b:
        yield text[i:j]
        i = j

text = 'ハ\u309A\u309A'

for g in breakIter(text):
    print(g)
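
One caveat with the slicing above: to my understanding, ICU reports boundaries as UTF-16 code-unit offsets, while Python indexes `str` by code point. The two agree for text in the Basic Multilingual Plane, such as the example here, but can diverge once the text contains characters like emoji. The sketch below shows one possible way to slice by UTF-16 offsets using only the standard library. `slice_utf16` is a hypothetical helper, and the boundary list is written by hand in place of the values a BreakIterator would yield.

```python
def slice_utf16(text, boundaries):
    """Yield pieces of `text` cut at UTF-16 code-unit offsets (hypothetical helper).

    `boundaries` is assumed to be an increasing sequence of end offsets,
    like the values produced by iterating a BreakIterator.
    """
    # Each UTF-16 code unit is 2 bytes, so offset n maps to byte position 2 * n.
    data = text.encode('utf-16-le')
    start = 0
    for end in boundaries:
        yield data[2 * start:2 * end].decode('utf-16-le')
        start = end


# '😀' is one code point but two UTF-16 code units, so the hand-written
# boundaries are 2 (after the emoji) and 3 (after 'a').
print(list(slice_utf16('😀a', [2, 3])))
```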
