Text Classification with Japanese Word Embeddings in R

What this article covers

Following Emil Hvitfeldt's post "Textrecipes series: Pretrained Word Embedding", this article solves a binary classification problem for Japanese text using word embeddings.

For the word embeddings, we use chive-1.1-mc5-aunit.magnitude, one of the chiVe resources, which is limited to "A-unit" words.

For the dataset, we use the rhr.tsv data from the JRTE Corpus. It consists of very short texts such as "I would definitely like to use it again" or "There is a convenience store nearby," each labeled according to whether it expresses a reputation (review) of a hotel.

Preparation

The JRTE Corpus can be loaded with the ldccr package.

rhr <- ldccr::read_jrte(keep_rhr = TRUE)
#> Parsing rte.nlp2020_base.tsv...
#> Parsing rte.nlp2020_append.tsv...
#> Parsing rte.lrec2020_surf.tsv...
#> Parsing rte.lrec2020_sem_short.tsv...
#> Parsing rte.lrec2020_sem_long.tsv...
#> Parsing rte.lrec2020_me.tsv...
#> Done.
rhr <- rhr$rhr |>
  dplyr::select(example_id, label, text, usage)

summary(rhr)
#>   example_id                   label          text             usage
#>  Length:5553        reputation    :2069   Length:5553        dev  :1122
#>  Class :character   not_reputation:3484   Class :character   test : 553
#>  Mode  :character                         Mode  :character   train:3878

Word embeddings in magnitude format can be loaded with apportita. Here, we extract the embeddings for the first 10,000 words as a tibble.

require(apportita)
#> Loading required package: apportita

conn <- apportita::magnitude("magnitude/chive-1.1-mc5-aunit.magnitude")
dim(conn)
#> [1] 322094    300

embeddings <- apportita::slice_n(conn, 10000)

close(conn)

Since chiVe is trained based on Sudachi linguistic resources, texts need to be tokenized into word forms that match the Sudachi dictionary.

Therefore, for morphological analysis, we use the fledgingr package, which provides a wrapper for sudachi.rs, the Rust implementation of Sudachi.

To pass it to the custom_token argument of textrecipes::step_tokenize, we prepare the following wrapper function, which returns a list with one character vector of normalized forms per input text:

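tokenize <- \(x) {
  # Tokenize each text with Sudachi split mode "A" (one row per token)
  fledgingr::tokenize(x, mode = "A") |>
    dplyr::group_by(id) |>
    # Return one character vector of normalized forms per group (input text)
    dplyr::group_map(~ {
      .x$normalized_form
    })
}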

tokenize("新しい朝が来た")
#> [[1]]
#> [1] "新しい" "朝"     "が"     "来る"   "た"
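
Because only the first 10,000 words of the embeddings are kept, some tokens will have no vector at all. The following is a minimal sketch in base R of how one might check that coverage. The vocabulary and documents are hypothetical toy data, not taken from chiVe or the JRTE Corpus.

```r
# Hypothetical toy data: a tiny vocabulary and two tokenized documents.
vocab <- c("新しい", "朝", "が", "来る")
docs <- list(
  c("新しい", "朝", "が", "来る", "た"),
  c("コンビニ", "が", "近い")
)

# Share of tokens in each document that have an embedding.
coverage <- vapply(docs, \(tokens) mean(tokens %in% vocab), numeric(1))
coverage

# Tokens that have no embedding in the vocabulary.
missing_tokens <- setdiff(unlist(docs), vocab)
missing_tokens
```

With the real data, the same check could presumably be run with docs <- tokenize(rhr$text) and vocab taken from the word column of embeddings (assumed here to be the first column, which is what step_word_embeddings expects).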

Training the Model

The data already has a usage column that assigns each row to dev/test/train, but we ignore it here and split the data into training and test sets ourselves.

require(tidymodels)
#> Loading required package: tidymodels
#> ── Attaching packages ─────────────────────────────────────── tidymodels 1.0.0 ──
#> ✔ broom        1.0.0     ✔ recipes      1.0.1
#> ✔ dials        1.0.0     ✔ rsample      1.0.0
#> ✔ dplyr        1.0.9     ✔ tibble       3.1.8
#> ✔ ggplot2      3.3.6     ✔ tidyr        1.2.0
#> ✔ infer        1.0.2     ✔ tune         1.0.0
#> ✔ modeldata    1.0.0     ✔ workflows    1.0.0
#> ✔ parsnip      1.0.0     ✔ workflowsets 1.0.0
#> ✔ purrr        0.3.4     ✔ yardstick    1.0.0
#> ── Conflicts ────────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Dig deeper into tidy modeling with R at https://www.tmwr.org
require(textrecipes)
#> Loading required package: textrecipes
tidymodels::tidymodels_prefer()

set.seed(5553)
rhr_split <- initial_split(rhr, strata = label)
rhr_train <- training(rhr_split)
rhr_test <- testing(rhr_split)
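
To illustrate what strata = label is meant to achieve, here is a rough base R sketch of a stratified split that samples 3/4 of the rows within each class (3/4 is the default prop of initial_split). The toy data frame is hypothetical, and this is not how rsample implements the split internally.

```r
set.seed(1)

# Hypothetical toy data with two imbalanced classes.
toy <- data.frame(
  id = 1:20,
  label = factor(rep(c("reputation", "not_reputation"), times = c(8, 12)))
)

# Sample 3/4 of the row indices separately within each label.
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$label),
  \(idx) idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
))

toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# Both sets keep the original 8:12 class ratio.
table(toy_train$label)
table(toy_test$label)
```

Sampling within each class keeps the class proportions of the training and test sets close to those of the full data, which matters when the labels are imbalanced as they are in rhr.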

Define the workflow.

rhr_rec <-
  recipe(label ~ text, data = rhr_train) |>
  step_text_normalization(text, normalization_form = "nfkc") |>
  step_tokenize(text, custom_token = tokenize) |>
  step_word_embeddings(text, embeddings = embeddings)

rhr_spec <-
  logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

rhr_wflow <-
  workflow() |>
  add_recipe(rhr_rec) |>
  add_model(rhr_spec)
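
As a rough illustration of what step_word_embeddings does to each text, the sketch below looks up the vector of every token and sums the vectors per dimension (sum is the step's default aggregation). The 3-dimensional embeddings are hypothetical toy values, and the code is a base R approximation, not the textrecipes implementation.

```r
# Hypothetical 3-dimensional embeddings for a tiny vocabulary.
toy_embeddings <- data.frame(
  token = c("朝", "来る", "新しい"),
  d1 = c(0.1, 0.4, -0.2),
  d2 = c(0.3, -0.1, 0.5),
  d3 = c(-0.5, 0.2, 0.0)
)

tokens <- c("新しい", "朝", "が", "来る", "た")

# Keep only tokens that have a vector, then look up their rows.
hit <- match(tokens, toy_embeddings$token)
vectors <- toy_embeddings[hit[!is.na(hit)], c("d1", "d2", "d3")]

# One feature per embedding dimension: the column-wise sum.
doc_features <- colSums(vectors)
doc_features
```

In the recipe above, the same idea applied to the 300-dimensional chiVe vectors gives 300 numeric predictors per text, which the lasso-penalized logistic regression then uses.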

Search for the best penalty value using bootstrap resamples of the training data.

rhr_grid <-
  tune_grid(
    rhr_wflow,
    resamples = bootstraps(rhr_train, times = 10),
    grid = grid_regular(penalty(), levels = 30)
  )

autoplot(rhr_grid)

(Figure: tuning results plotted by autoplot(rhr_grid))

rhr_grid |>
  show_best(metric = "roc_auc")
#> # A tibble: 5 × 7
#>    penalty .metric .estimator  mean     n std_err .config
#>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1 0.00174  roc_auc binary     0.921    10 0.00202 Preprocessor1_Model22
#> 2 0.000788 roc_auc binary     0.921    10 0.00192 Preprocessor1_Model21
#> 3 0.00386  roc_auc binary     0.918    10 0.00205 Preprocessor1_Model23
#> 4 0.000356 roc_auc binary     0.918    10 0.00215 Preprocessor1_Model20
#> 5 0.000161 roc_auc binary     0.914    10 0.00210 Preprocessor1_Model19
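
The penalty values in this table can be reproduced by hand. Assuming dials::penalty() keeps its default range of 10^-10 to 10^0 on the log10 scale, grid_regular(penalty(), levels = 30) amounts to 30 points evenly spaced in the exponent, as in this base R sketch:

```r
# Assumption: the default penalty() range, i.e. exponents from -10 to 0.
penalty_grid <- 10^seq(-10, 0, length.out = 30)

# The 22nd and 21st candidates correspond to Model22 and Model21 above.
signif(penalty_grid[c(22, 21)], 3)
```

These two values should round to 0.00174 and 0.000788, matching the first two rows of the table, so the best candidates sit well inside the searched range rather than at its edges.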

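Finalize the workflow with the best penalty, then run last_fit, which fits the model on the training set and evaluates it on the test set.

rhr_wflow <- finalize_workflow(rhr_wflow, select_best(rhr_grid, metric = "roc_auc"))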
rhr_last_res <- last_fit(rhr_wflow, rhr_split)

Since the purpose here is simply to try training a model, we skip a detailed evaluation, but let's plot the ROC curve.

rhr_last_res |>
  collect_predictions() |>
  roc_curve(label, .pred_reputation) |>
  autoplot()

(Figure: ROC curve for the test set predictions)
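
For readers who want to see what goes into an ROC curve, the following base R sketch computes sensitivity and specificity at each threshold, along with the area under the curve, for a handful of hypothetical predictions. It is an illustration of the idea, not the yardstick implementation.

```r
# Hypothetical predicted probabilities of the positive class and true labels.
prob <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
truth <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Use every observed probability as a threshold, from strict to lenient.
thresholds <- sort(unique(prob), decreasing = TRUE)
roc <- data.frame(
  threshold = thresholds,
  sensitivity = vapply(thresholds, \(th) mean(prob[truth] >= th), numeric(1)),
  specificity = vapply(thresholds, \(th) mean(prob[!truth] < th), numeric(1))
)
roc

# AUC as the share of positive/negative pairs ranked correctly (ties count half).
pairs <- outer(prob[truth], prob[!truth], "-")
auc <- mean((pairs > 0) + 0.5 * (pairs == 0))
auc
```

In this toy example 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9. In the article's code, the positive class is reputation, which is why .pred_reputation is passed to roc_curve.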

Let's also plot the confusion matrix.

rhr_last_res |>
  collect_predictions() |>
  conf_mat(label, .pred_class) |>
  autoplot(type = "heatmap")

(Figure: confusion matrix heatmap for the test set predictions)

Session Information

sessioninfo::session_info()
#> ─ Session info ────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 19043)
#>  system   x86_64, mingw32
#>  ui       RStudio
#>  language (EN)
#>  collate  Japanese_Japan.utf8
#>  ctype    Japanese_Japan.utf8
#>  tz       Asia/Tokyo
#>  date     2022-07-23
#>  rstudio  2022.07.0+548 Spotted Wakerobin (desktop)
#>  pandoc   2.18 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ────────────────────────────────────────────────────────────────────
#>  package      * version        date (UTC) lib source
#>  apportita    * 0.0.1          2022-05-10 [1] https://paithiov909.r-universe.dev (R 4.2.0)
#>  assertthat     0.2.1          2019-03-21 [1] CRAN (R 4.2.0)
#>  backports      1.4.1          2021-12-13 [1] CRAN (R 4.2.0)
#>  bit            4.0.4          2020-08-04 [1] CRAN (R 4.2.0)
#>  bit64          4.0.5          2020-08-30 [1] CRAN (R 4.2.0)
#>  blob           1.2.3          2022-04-10 [1] CRAN (R 4.2.0)
#>  broom        * 1.0.0          2022-07-01 [1] CRAN (R 4.2.0)
#>  bslib          0.4.0          2022-07-16 [1] RSPM (R 4.2.0)
#>  cachem         1.0.6          2021-08-19 [1] CRAN (R 4.2.0)
#>  class          7.3-20         2022-01-16 [2] CRAN (R 4.2.1)
#>  cleanrmd       0.1.0          2022-06-14 [1] RSPM (R 4.2.0)
#>  cli            3.3.0          2022-04-25 [1] CRAN (R 4.2.0)
#>  codetools      0.2-18         2020-11-04 [2] CRAN (R 4.2.1)
#>  colorspace     2.0-3          2022-02-21 [1] CRAN (R 4.2.0)
#>  conflicted     1.1.0          2021-11-26 [1] CRAN (R 4.2.0)
#>  crayon         1.5.1          2022-03-26 [1] CRAN (R 4.2.0)
#>  DBI            1.1.3          2022-06-18 [1] CRAN (R 4.2.0)
#>  dbplyr         2.2.1          2022-06-27 [1] CRAN (R 4.2.0)
#>  dials        * 1.0.0          2022-06-14 [1] CRAN (R 4.2.0)
#>  DiceDesign     1.9            2021-02-13 [1] CRAN (R 4.2.0)
#>  digest         0.6.29         2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr        * 1.0.9          2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis       0.3.2          2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate       0.15           2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi          1.0.3          2022-03-24 [1] CRAN (R 4.2.0)
#>  farver         2.1.1          2022-07-06 [1] CRAN (R 4.2.1)
#>  fastmap        1.1.0.9000     2022-05-15 [1] https://fastverse.r-universe.dev (R 4.2.0)
#>  fledgingr      0.0.0.9003     2022-07-18 [1] https://yutannihilation.r-universe.dev (R 4.2.1)
#>  foreach        1.5.2          2022-02-02 [1] CRAN (R 4.2.0)
#>  furrr          0.3.0          2022-05-04 [1] CRAN (R 4.2.0)
#>  future         1.27.0         2022-07-22 [1] CRAN (R 4.2.1)
#>  future.apply   1.9.0          2022-04-25 [1] CRAN (R 4.2.0)
#>  generics       0.1.3          2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2      * 3.3.6          2022-05-03 [1] CRAN (R 4.2.0)
#>  glmnet       * 4.1-4          2022-04-15 [1] CRAN (R 4.2.0)
#>  globals        0.15.1         2022-06-24 [1] CRAN (R 4.2.0)
#>  glue           1.6.2          2022-02-24 [1] CRAN (R 4.2.0)
#>  gower          1.0.0          2022-02-03 [1] CRAN (R 4.2.0)
#>  GPfit          1.0-8          2019-02-08 [1] CRAN (R 4.2.0)
#>  gtable         0.3.0          2019-03-25 [1] CRAN (R 4.2.0)
#>  hardhat        1.2.0          2022-06-30 [1] CRAN (R 4.2.0)
#>  highr          0.9            2021-04-16 [1] CRAN (R 4.2.0)
#>  hms            1.1.1          2021-09-26 [1] CRAN (R 4.2.0)
#>  htmltools      0.5.3          2022-07-18 [1] CRAN (R 4.2.1)
#>  infer        * 1.0.2          2022-06-10 [1] CRAN (R 4.2.0)
#>  ipred          0.9-13         2022-06-02 [1] CRAN (R 4.2.0)
#>  iterators      1.0.14         2022-02-05 [1] CRAN (R 4.2.0)
#>  jquerylib      0.1.4          2021-04-26 [1] CRAN (R 4.2.0)
#>  jsonlite       1.8.0          2022-02-22 [1] CRAN (R 4.2.0)
#>  knitr          1.39           2022-04-26 [1] CRAN (R 4.2.0)
#>  labeling       0.4.2          2020-10-20 [1] CRAN (R 4.2.0)
#>  lattice        0.20-45        2021-09-22 [2] CRAN (R 4.2.1)
#>  lava           1.6.10         2021-09-02 [1] CRAN (R 4.2.0)
#>  ldccr          0.0.9.20220709 2022-07-09 [1] https://paithiov909.r-universe.dev (R 4.2.1)
#>  lhs            1.1.5          2022-03-22 [1] CRAN (R 4.2.0)
#>  lifecycle      1.0.1          2021-09-24 [1] CRAN (R 4.2.0)
#>  listenv        0.8.0          2019-12-05 [1] CRAN (R 4.2.0)
#>  lubridate      1.8.0.9000     2022-05-14 [1] https://fastverse.r-universe.dev (R 4.2.0)
#>  magrittr       2.0.3.9000     2022-05-13 [1] https://fastverse.r-universe.dev (R 4.2.0)
#>  MASS           7.3-57         2022-04-22 [2] CRAN (R 4.2.1)
#>  Matrix       * 1.4-1          2022-03-23 [2] CRAN (R 4.2.1)
#>  memoise        2.0.1          2021-11-26 [1] CRAN (R 4.2.0)
#>  modeldata    * 1.0.0          2022-07-01 [1] CRAN (R 4.2.1)
#>  munsell        0.5.0          2018-06-12 [1] CRAN (R 4.2.0)
#>  nnet           7.3-17         2022-01-16 [2] CRAN (R 4.2.1)
#>  parallelly     1.32.1         2022-07-21 [1] CRAN (R 4.2.1)
#>  parsnip      * 1.0.0          2022-06-16 [1] CRAN (R 4.2.0)
#>  pillar         1.8.0          2022-07-18 [1] CRAN (R 4.2.1)
#>  pkgconfig      2.0.3          2019-09-22 [1] CRAN (R 4.2.0)
#>  prodlim        2019.11.13     2019-11-17 [1] CRAN (R 4.2.0)
#>  purrr        * 0.3.4          2020-04-17 [1] CRAN (R 4.2.0)
#>  R.cache        0.16.0         2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3    1.8.2          2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.25.0         2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils        2.12.0         2022-06-28 [1] CRAN (R 4.2.0)
#>  R6             2.5.1          2021-08-19 [1] CRAN (R 4.2.0)
#>  rappdirs       0.3.3          2021-01-31 [1] CRAN (R 4.2.0)
#>  Rcpp           1.0.9          2022-07-08 [1] CRAN (R 4.2.1)
#>  RcppSimdJson   0.1.7          2022-02-18 [1] CRAN (R 4.2.0)
#>  readr          2.1.2          2022-01-30 [1] CRAN (R 4.2.0)
#>  recipes      * 1.0.1          2022-07-07 [1] CRAN (R 4.2.0)
#>  rlang          1.0.4          2022-07-12 [1] RSPM (R 4.2.0)
#>  rmarkdown      2.14           2022-04-25 [1] CRAN (R 4.2.0)
#>  rpart          4.1.16         2022-01-24 [2] CRAN (R 4.2.1)
#>  rsample      * 1.0.0          2022-06-24 [1] CRAN (R 4.2.0)
#>  RSQLite        2.2.15         2022-07-17 [1] RSPM (R 4.2.0)
#>  rstudioapi     0.13           2020-11-12 [1] CRAN (R 4.2.0)
#>  sass           0.4.2          2022-07-16 [1] RSPM (R 4.2.0)
#>  scales       * 1.2.0          2022-04-13 [1] CRAN (R 4.2.0)
#>  sessioninfo    1.2.2          2021-12-06 [1] CRAN (R 4.2.0)
#>  shape          1.4.6          2021-05-19 [1] CRAN (R 4.2.0)
#>  stringi      * 1.7.8          2022-07-11 [1] RSPM (R 4.2.0)
#>  stringr        1.4.0.9000     2022-05-14 [1] https://fastverse.r-universe.dev (R 4.2.0)
#>  styler         1.7.0          2022-03-13 [1] CRAN (R 4.2.0)
#>  survival       3.3-1          2022-03-03 [2] CRAN (R 4.2.1)
#>  textrecipes  * 1.0.0          2022-07-02 [1] CRAN (R 4.2.1)
#>  tibble       * 3.1.8          2022-07-22 [1] CRAN (R 4.2.1)
#>  tidymodels   * 1.0.0          2022-07-13 [1] CRAN (R 4.2.1)
#>  tidyr        * 1.2.0          2022-02-01 [1] CRAN (R 4.2.0)
#>  tidyselect     1.1.2          2022-02-21 [1] CRAN (R 4.2.0)
#>  timeDate       4021.104       2022-07-19 [1] RSPM (R 4.2.0)
#>  tune         * 1.0.0          2022-07-07 [1] CRAN (R 4.2.0)
#>  tzdb           0.3.0          2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8           1.2.2          2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs          0.4.1          2022-04-13 [1] CRAN (R 4.2.0)
#>  vroom          1.5.7          2021-11-30 [1] CRAN (R 4.2.0)
#>  withr          2.5.0          2022-03-03 [1] CRAN (R 4.2.0)
#>  workflows    * 1.0.0          2022-07-05 [1] CRAN (R 4.2.0)
#>  workflowsets * 1.0.0          2022-07-12 [1] RSPM (R 4.2.0)
#>  xfun           0.31           2022-05-10 [1] CRAN (R 4.2.0)
#>  yaml           2.3.5          2022-02-21 [1] CRAN (R 4.2.0)
#>  yardstick    * 1.0.0          2022-06-06 [1] CRAN (R 4.2.0)
#>
#>  [1] C:/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.1/library
#>
#> ───────────────────────────────────────────────────────────────────────────────