Converting Between Japanese Hiragana, Katakana, Romaji, and Kanji in R

About this article

This article shows how to convert Japanese text between Hiragana, Katakana, Romaji, and Kanji in R. All conversions shown here are vectorized, so the example input "カッターを買ったうれしかった" can be replaced with a character vector of length 2 or more, such as c("カッターを買ったうれしかった", "竹やぶ焼けた").
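As a quick check of the vectorization claim, here is a minimal sketch that passes a length-2 vector to stringi::stri_trans_general() (the function used in the next section). The variable names x and res are only for illustration.

```r
# Two input strings instead of one
x <- c("カッターを買ったうれしかった", "竹やぶ焼けた")

# The same call that works on a single string also accepts a vector
res <- stringi::stri_trans_general(x, "Kana-Hira")

# One result per input element, in the same order
length(res)
res
```

Each element is converted independently, so res[1] should match what you get from converting the first string on its own.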

Conversion between Hiragana and Katakana

Simple conversions can be done with stringi::stri_trans_general(). The transliterator IDs it accepts are listed by stringi::stri_trans_list(). There are about 750 of them, but most are unrelated to Japanese, so there is no need to go through them all.

stringi::stri_trans_general("カッターを買ったうれしかった", "Kana-Hira")
#> [1] "かったあを買ったうれしかった"
stringi::stri_trans_general("カッターを買ったうれしかった", "Hira-Kana")
#> [1] "カッターヲ買ッタウレシカッタ"
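Rather than reading the full output of stringi::stri_trans_list(), you can filter it down to IDs that look relevant to Japanese. The sketch below is one way to do that. The search pattern is my own guess at useful keywords, not an official list, and the exact IDs returned depend on the ICU version bundled with stringi.

```r
# All transliterator IDs known to the installed ICU
ids <- stringi::stri_trans_list()

# Keep only IDs mentioning kana-related scripts (pattern is an assumption)
ja_ids <- grep("Hira|Kana|Hrkt|Jpan", ids, value = TRUE, ignore.case = TRUE)

length(ids)
ja_ids
```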

The zipangu package provides wrappers named zipangu::str_conv_* for this purpose, so you can use them intuitively even if you don't remember the transliterator IDs.
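For example, the Hiragana/Katakana conversion above might be written with the wrapper as follows. This is a sketch that assumes zipangu is installed and that str_conv_hirakana() takes a to argument of "hiragana" or "katakana"; check the package documentation for your installed version.

```r
x <- "カッターを買ったうれしかった"

# Katakana -> Hiragana (assumed to correspond to the "Kana-Hira" conversion)
hira <- zipangu::str_conv_hirakana(x, to = "hiragana")

# Hiragana -> Katakana (assumed to correspond to the "Hira-Kana" conversion)
kana <- zipangu::str_conv_hirakana(x, to = "katakana")

hira
kana
```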

Conversion from Hiragana/Katakana to Romaji

Hiragana and Katakana characters can be converted to Romaji as follows:

stringi::stri_trans_general("カッターを買ったうれしかった", "Hira-Latn")
#> [1] "カッタ̄wo買ttaureshikatta"
stringi::stri_trans_general("カッターを買ったうれしかった", "Kana-Latn")
#> [1] "kattāを買ったうれしかった"
stringi::stri_trans_general("カッターを買ったうれしかった", "Any-Latn")
#> [1] "kattāwo mǎittaureshikatta"

With Any-Latn, Kanji are converted as well, but they appear to follow Chinese readings (買 becomes "mǎi" above). If you want to convert only the Hiragana/Katakana characters using the Hepburn system, you can use the following. Note that "the Hepburn system is phonetic and does not distinguish between characters that are pronounced the same in standard Japanese, such as 'ji/dji', 'zu/dzu', or 'o/wo'" (from Wikipedia).

stringi::stri_trans_general("カッターを買ったうれしかった", "ja_Hrkt-ja_Latn/BGN")
#> [1] "kattāo買ttaureshikatta"

If you want "を" to be converted to "wo", use audubon::strj_romanize(config = "traditional hepburn"). The config argument lets you choose among several other romanization systems, but with every setting, Kanji and other non-kana characters are dropped and do not appear in the result.

audubon::strj_romanize("カッターを買ったうれしかった", config = "traditional hepburn")
#> [1] "kattāwottaureshikatta"
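To compare the available notation systems side by side, you can loop over the config values. The sketch below assumes the accepted values are "wikipedia", "traditional hepburn", "modified hepburn", "kunrei", and "nihon"; confirm this against ?audubon::strj_romanize for your installed version.

```r
x <- "カッターを買ったうれしかった"

# Config names as I understand them from the audubon documentation (assumption)
configs <- c("wikipedia", "traditional hepburn", "modified hepburn", "kunrei", "nihon")

# Romanize the same input once per config; the result is named by config
romaji <- vapply(
  configs,
  function(cfg) audubon::strj_romanize(x, config = cfg),
  character(1)
)

romaji
```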

Conversion from Kanji-mixed Text to Katakana

You can convert text to Hiragana or Katakana as follows, but this method will not convert Kanji.

stringi::stri_trans_general("カッターを買ったうれしかった", "Any-Hira")
#> [1] "かったあを買ったうれしかった"
stringi::stri_trans_general("カッターを買ったうれしかった", "Any-Kana")
#> [1] "カッターヲ買ッタウレシカッタ"

If you want to convert the entire text, Kanji included, to Katakana, you can use MeCab to extract the readings. This works easily for ordinary sentences that do not contain unusual proper nouns.

kana <-
  gibasa::tokenize("カッターを買ったうれしかった") |>
  gibasa::prettify(col_select = "Yomi1") |>
  gibasa::pack(Yomi1, .collapse = "") |>
  dplyr::pull(text)

kana
#> [1] "カッターヲカッタウレシカッタ"

Since this result contains no Kanji, it can then be converted to Romaji like this:

stringi::stri_trans_general(kana, "Any-Latn")
#> [1] "kattāwokattaureshikatta"

Conversion from Romaji to Hiragana/Katakana

I don't think there are many situations where you'd want to do this, but it is possible.

stringi::stri_trans_general("kattāwokattaureshikatta", "Latn-Hira")
#> [1] "かったあをかったうれしかった"
stringi::stri_trans_general("kattāwokattaureshikatta", "Latn-Kana")
#> [1] "カッターヲカッタウレシカッタ"

If you want to convert input-style notation such as "katta-wokattauresikatta" to Hiragana, you can do so as follows. However, the dictionary built here does not cover every spelling used in so-called "Romaji input", so some parts may not be converted correctly.

temp_dir <- tempdir()

gibasa::build_sys_dic(
  dic_dir = system.file("latin", package = "gibasa"),
  out_dir = temp_dir,
  encoding = "utf8"
)

file.copy(
  system.file("latin/dicrc", package = "gibasa"),
  temp_dir
)
#> [1] TRUE

gibasa::tokenize("katta-wokattauresikatta", sys_dic = temp_dir) |>
  gibasa::pack(feature, .collapse = "") |>
  dplyr::pull(text)
#> [1] "かったーをかったうれしかった"

Conversion from Hiragana to Kanji-mixed Text

This is possible too.

if (!requireNamespace("kelpbeds", quietly = TRUE)) {
  install.packages("kelpbeds", repos = c("https://paithiov909.r-universe.dev", "https://cran.r-project.org"))
}

temp_dir <- kelpbeds::prep_skkserv(dic_dir = tempdir())

gibasa::build_sys_dic(
  dic_dir = temp_dir,
  out_dir = temp_dir,
  encoding = "utf8"
)

gibasa::tokenize("かったーをかったうれしかった", sys_dic = temp_dir) |>
  gibasa::pack(feature, .collapse = "") |>
  dplyr::pull(text)
#> [1] "カッターを買ったうれしかった"

This conversion uses a dictionary from an old program called mecab-skkserv. It may not be very practical: the dictionary is quite old, and gibasa does not support N-best output, so it cannot return multiple conversion candidates.
