Handling Grapheme Clusters in Python
To handle grapheme clusters in Python, you can use the third-party regex module or PyICU. For the system Python on Debian, both are available as apt packages.
sudo apt install python3-regex python3-icu
If Python was installed through a version manager such as mise or asdf, install the ICU C library development package first, since PyICU needs it.
sudo apt install libicu-dev
Install the modules in a virtual environment created with venv.
python3 -m venv .venv
source .venv/bin/activate
python -m pip install regex
python -m pip install pyicu
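After installing, it can help to confirm that both modules are visible to the interpreter in the active virtual environment. The following is a minimal sketch using only the standard library; the helper name `installed` is hypothetical, and `regex` and `icu` are the import names of the two packages.

```python
import importlib.util

def installed(names):
    """Map each top-level module name to True if it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    # "regex" and "icu" are the import names of the regex and PyICU packages.
    for name, ok in installed(["regex", "icu"]).items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

If either module is reported as missing, check that the virtual environment is activated and that the pip commands above ran inside it.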
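First, let's try regex. Its \X pattern matches a single grapheme cluster. The variables are named text and matches here so that they do not shadow the built-in str and iter.
>>> import regex
>>> text = 'ハ\u309Aハ\u309A'
>>> matches = regex.compile(r'\X').finditer(text)
>>> [m.group() for m in matches]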
['パ', 'パ']
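To see why the four code points in this string come back as two items, here is a rough approximation that uses only the standard library. It is not the segmentation algorithm that regex or ICU implement: it simply attaches every character with a nonzero canonical combining class to the preceding character, so it will get emoji ZWJ sequences, Hangul jamo, and similar cases wrong. The function name `rough_clusters` is hypothetical.

```python
import unicodedata

def rough_clusters(text):
    """Yield a base character together with the combining marks that follow it."""
    cluster = ""
    for ch in text:
        # A combining class of 0 means ch starts a new cluster in this approximation.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster

text = 'ハ\u309Aハ\u309A'
print(len(text))                        # number of code points
print(len(list(rough_clusters(text))))  # number of approximate clusters
```

For real text, prefer regex or PyICU as shown in this article; the sketch is only meant to illustrate the difference between code points and clusters.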
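With PyICU, use a character BreakIterator as follows:
import icu
text = 'ハ\u309Aハ\u309A'
# A character BreakIterator reports grapheme cluster boundaries.
b = icu.BreakIterator.createCharacterInstance(icu.Locale())
b.setText(text)
i = 0
# Each j is the offset of the next boundary; slice from the previous one.
for j in b:
    print(text[i:j])
    i = j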
A small change turns this into a generator:
import icu
def breakIter(text):
    # Yield each grapheme cluster of text in order.
    b = icu.BreakIterator.createCharacterInstance(icu.Locale())
    b.setText(text)
    i = 0
    for j in b:
        yield text[i:j]
        i = j
text = 'ハ\u309Aha\u309A'.replace('ha', 'ハ')
for g in breakIter(text):
    print(g)
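Because the generator yields one cluster at a time, it combines naturally with itertools. As one possible use, the sketch below truncates text to a given number of grapheme clusters. The helper name `take_clusters` is hypothetical, and a plain list stands in for the generator so that the example runs without PyICU.

```python
from itertools import islice

def take_clusters(clusters, limit):
    """Join at most `limit` grapheme clusters from an iterable of cluster strings."""
    return "".join(islice(clusters, limit))

# Stand-in input; in practice this could be the output of breakIter(text).
clusters = iter(['ハ\u309A', 'ハ\u309A', 'a'])
print(take_clusters(clusters, 2))
```

Truncating this way never separates a base character from its combining mark, which slicing the string by code point index can do.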