📙

Creating a function that builds a compiled regular expression for a single character in the Japanese-related Unicode ranges

Published on 2023/09/10

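Notes

  • In this article, the Unicode ranges treated as related to Japanese are the ones listed below.
  • Applying Unicode normalization (NFC) should cover most of the remaining cases, but please leave a comment if a range is missing or if there is one you would like added.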
  1. hiragana (U+3040 to U+309F): Hiragana PDF
  2. katakana (U+30A0 to U+30FF): Katakana PDF
  3. katakana_half_width (U+FF65 to U+FF9F): Half-width Katakana PDF
  4. katakana_phonetic_extensions (U+31F0 to U+31FF): Katakana Phonetic Extensions PDF
  5. cjk_unified_ideographs (U+4E00 to U+9FFF): CJK Unified Ideographs PDF
  6. cjk_ideographs_extension_a (U+3400 to U+4DBF): CJK Extension A PDF
  7. cjk_ideographs_extension_b (U+20000 to U+2A6DF): CJK Extension B PDF
  8. cjk_ideographs_extension_c (U+2A700 to U+2B73F): CJK Extension C PDF
  9. cjk_ideographs_extension_d (U+2B740 to U+2B81F): CJK Extension D PDF
  10. cjk_ideographs_extension_e (U+2B820 to U+2CEAF): CJK Extension E PDF
  11. cjk_ideographs_extension_f (U+2CEB0 to U+2EBEF): CJK Extension F PDF
  12. japanese_punctuation (U+3000 to U+303F): Japanese Punctuation PDF

In addition to the above: a-z, A-Z, and 0-9.
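
To make the note about normalization concrete, here is a minimal sketch using the standard `unicodedata` module. The sample characters are my own choice, not the author's: U+F900 is a CJK compatibility ideograph that sits outside every range listed above, and the pattern covers only one of the ranges for brevity.

```python
import re
import unicodedata

# One of the ranges listed above, written out by hand for this sketch.
cjk_unified = re.compile(r"[\u4e00-\u9fff]")

# U+F900 is a CJK compatibility ideograph, outside every range in the list.
compat = "\uf900"
print(bool(cjk_unified.fullmatch(compat)))  # False

# NFC replaces it with the unified ideograph U+8C48, which is in range.
normalized = unicodedata.normalize("NFC", compat)
print(bool(cjk_unified.fullmatch(normalized)))  # True

# Half-width katakana are compatibility characters, so NFC leaves them as-is;
# NFKC (a different normalization form) folds them into full-width katakana.
half_width = "\uff76\uff9e"  # half-width KA + half-width voiced sound mark
print(unicodedata.normalize("NFC", half_width) == half_width)  # True
print(unicodedata.normalize("NFKC", half_width))  # ガ
```

Whether NFC or NFKC is the better fit depends on whether half-width and full-width forms should be treated as the same character, which the article leaves open.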

Function

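  • A function that builds a compiled regular expression pattern matching a single character in the Japanese-related Unicode ranges.
  • As described later, combining positive_match=False with re.sub() removes every character outside the specified Unicode ranges from a string.
  • With positive_match=True, the pattern matches one character inside the Japanese-related Unicode ranges.
  • With positive_match=False, the pattern matches one character outside the Japanese-related Unicode ranges.
  • The argument exclude: List[str] names the ranges to leave out of the set.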
import re
from typing import Dict, List


def select_unicode_ranges_related_japanese(
    positive_match: bool = False,
    exclude: List[str]=[]
) -> re.Pattern[str]:
    """
        Parameters
        ======
        positive_match: bool, default False
            If True, the pattern matches one character inside the ranges;
            if False, one character outside them
        exclude: List[str], default []
            Names of the ranges to leave out (keys of unicode_ranges)

        Returns
        ======
        re.Pattern[str]

        References
        ======
        The Unicode ranges are taken from the following code charts
        1.  **hiragana (U+3040 to U+309F)**: [Hiragana PDF](https://www.unicode.org/charts/PDF/U3040.pdf)
        2.  **katakana (U+30A0 to U+30FF)**: [Katakana PDF](https://www.unicode.org/charts/PDF/U30A0.pdf)
        3.  **katakana_half_width (U+FF65 to U+FF9F)**: [Half-width Katakana PDF](https://www.unicode.org/charts/PDF/UFF00.pdf)
        4.  **katakana_phonetic_extensions (U+31F0 to U+31FF)**: [Katakana Phonetic Extensions PDF](https://www.unicode.org/charts/PDF/U31F0.pdf)
        5.  **cjk_unified_ideographs (U+4E00 to U+9FFF)**: [CJK Unified Ideographs PDF](https://www.unicode.org/charts/PDF/U4E00.pdf)
        6.  **cjk_ideographs_extension_a (U+3400 to U+4DBF)**: [CJK Extension A PDF](https://www.unicode.org/charts/PDF/U3400.pdf)
        7.  **cjk_ideographs_extension_b (U+20000 to U+2A6DF)**: [CJK Extension B PDF](https://www.unicode.org/charts/PDF/U20000.pdf)
        8.  **cjk_ideographs_extension_c (U+2A700 to U+2B73F)**: [CJK Extension C PDF](https://www.unicode.org/charts/PDF/U2A700.pdf)
        9.  **cjk_ideographs_extension_d (U+2B740 to U+2B81F)**: [CJK Extension D PDF](https://www.unicode.org/charts/PDF/U2B740.pdf)
        10. **cjk_ideographs_extension_e (U+2B820 to U+2CEAF)**: [CJK Extension E PDF](https://www.unicode.org/charts/PDF/U2B820.pdf)
        11. **cjk_ideographs_extension_f (U+2CEB0 to U+2EBEF)**: [CJK Extension F PDF](https://www.unicode.org/charts/PDF/U2CEB0.pdf)
        12. **japanese_punctuation (U+3000 to U+303F)**: [Japanese Punctuation PDF](https://www.unicode.org/charts/PDF/U3000.pdf)
    """
    # unicode range names and values
    unicode_ranges: Dict[str, str] = {
        "alphabet_upper": "A-Z",
        "alphabet_lower": "a-z",
        "number_half_width": "0-9",
        "hiragana": r"\u3040-\u309f",
        "katakana": r"\u30a0-\u30ff",
        "katakana_half_width": r"\uff65-\uff9f",
        "katakana_phonetic_extensions": r"\u31f0-\u31ff",
        "cjk_ideographs": r"\u4e00-\u9fff",
        "cjk_ideographs_extension_a": r"\u3400-\u4dbf",
        "cjk_ideographs_extension_b": r"\U00020000-\U0002a6df",
        "cjk_ideographs_extension_c": r"\U0002a700-\U0002b73f",
        "cjk_ideographs_extension_d": r"\U0002b740-\U0002b81f",
        "cjk_ideographs_extension_e": r"\U0002b820-\U0002ceaf",
        "cjk_ideographs_extension_f": r"\U0002ceb0-\U0002ebef",
        "japanese_punctuation": r"\u3000-\u303f",
    }

    # Type Check
    if not isinstance(positive_match, bool):
        raise TypeError(f"Input type is invalid (expected bool): {positive_match}")
    if not isinstance(exclude, list):
        raise TypeError(f"Input type is invalid (expected list): {exclude}")
    for obj in exclude:
        if not isinstance(obj, str):
            raise TypeError(f"Input type is invalid (expected str): {obj}")
    
    # Input Validation
    unicode_range_names: List[str] = list(unicode_ranges)
    for ex in exclude:
        if ex not in unicode_range_names:
            error_message = f"Invalid range name: {ex}\n"
            helper_message = f"Valid names are: {unicode_range_names}"
            raise ValueError(error_message + helper_message)

    # Create regex
    unicode_range: str = r"[" if positive_match else r"[^"
    for key in unicode_ranges:
        if key in exclude:
            continue
        unicode_range += unicode_ranges[key]
    unicode_range += "]"

    return re.compile(unicode_range)
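
As a rough illustration of what the returned pattern does when combined with re.sub(), the sketch below hand-writes a negated character class over a few of the ranges. It is a stand-in for the positive_match=False result, not the function's exact output, and the sample text is made up.

```python
import re

# A hand-written negated class over a few of the ranges from the function;
# it stands in for the pattern returned with positive_match=False.
outside_japanese = re.compile(
    r"[^A-Za-z0-9"
    r"\u3040-\u309f"          # hiragana
    r"\u30a0-\u30ff"          # katakana
    r"\u4e00-\u9fff"          # CJK unified ideographs
    r"\U00020000-\U0002a6df"  # CJK extension B (needs the 8-digit \U form)
    r"]"
)

text = "Hello, 世界! こんにちは 123 😀 𠮷"
# Every character outside the listed ranges is replaced with "".
cleaned = outside_japanese.sub("", text)
print(cleaned)  # Hello世界こんにちは123𠮷
```

Code points above U+FFFF, such as those in Extension B, need the eight-digit `\U` escape; the four-digit `\u` form cannot express them.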

Test

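  • I wrote some simple tests.
  • They exercise the function in combination with re.sub().
  • For example, with positive_match=False and exclude=["hiragana"], the negated character class lists every Japanese-related range except hiragana, so the pattern does not match those characters. It therefore matches hiragana, as shown below. (Note that it also matches any character outside the Japanese-related ranges, not only hiragana.)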
only_hiragana = select_unicode_ranges_related_japanese(
    positive_match=False,
    exclude=["hiragana"]
)
res: str = only_hiragana.sub("", "こんにちは")
assert res == ""
print("negative match is ok")
import re

def test_returning_instance_of_rePattern() -> None:
    # use default value
    match_obj = select_unicode_ranges_related_japanese()
    assert isinstance(match_obj, re.Pattern)

    # use user's inputs
    match_obj = select_unicode_ranges_related_japanese(
        exclude=[
            "alphabet_upper",
            "alphabet_lower"
        ]
    )
    assert isinstance(match_obj, re.Pattern)
    print("returning valid instance")


def test_raising_value_error_when_input_is_invalid() -> None:
    try:
        select_unicode_ranges_related_japanese(exclude=['hoge'])
    except ValueError:
        print("value error was successfully raised!")
    else:
        # Reaching this branch means no exception was raised
        raise AssertionError("ValueError was not raised")


def test_positive_match() -> None:
    all_with_hiragana = select_unicode_ranges_related_japanese(
        positive_match=True,
    )
    res: str = all_with_hiragana.sub("", "こんにちは")
    assert res == ""

    without_hiragana = select_unicode_ranges_related_japanese(
        positive_match=True,
        exclude=["hiragana"]
    )
    res = without_hiragana.sub("", "こんにちは")
    assert res == "こんにちは"
    print("positive match is ok")


def test_negative_match() -> None:
    no_one_match = select_unicode_ranges_related_japanese(
        positive_match=False,
    )
    res: str = no_one_match.sub("", "こんにちは")
    assert res == "こんにちは"

    only_hiragana = select_unicode_ranges_related_japanese(
        positive_match=False,
        exclude=["hiragana"]
    )
    res = only_hiragana.sub("", "こんにちは")
    assert res == ""
    print("negative match is ok")


test_returning_instance_of_rePattern()
test_raising_value_error_when_input_is_invalid()
test_positive_match()
test_negative_match()
  • Running the tests prints the following:
returning valid instance
value error was successfully raised!
positive match is ok
negative match is ok
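
The caveat about negative matching can be seen in a small sketch. The two ranges and the sample text below are hand-picked for illustration and are not the function's full set: a negated class that omits hiragana removes hiragana, but it also removes everything else that is not listed.

```python
import re

# Hand-written stand-ins for two of the ranges used in the article.
HIRAGANA = r"\u3040-\u309f"
KATAKANA = r"\u30a0-\u30ff"

text = "カタカナ and ひらがな!"

# Negated class listing only katakana: hiragana is matched (and removed),
# but so are the Latin letters, the spaces, and "!".
not_katakana = re.compile(f"[^{KATAKANA}]")
negated_result = not_katakana.sub("", text)
print(negated_result)  # カタカナ

# A positive class over hiragana alone removes hiragana and nothing else.
only_hiragana = re.compile(f"[{HIRAGANA}]")
positive_result = only_hiragana.sub("", text)
print(positive_result)  # カタカナ and !
```

When the goal is to target a single script, a positive class over just that range is the more direct tool.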
