
tidymodels Practice: Training XGBoost with Sparse Matrices


About this article

Referencing "Practical Introduction to tidymodels for R Users" (the "tidymodels book"), I will perform author classification using tidymodels and textrecipes.

https://gihyo.jp/book/2023/978-4-297-13236-1

My motivation is to try a pattern not covered in the "tidymodels book"—using XGBoost while providing a sparse matrix weighted by TF-IDF as training data—and compare it with Feature Hashing.

suppressPackageStartupMessages({
  require(tidyverse)
  require(tidymodels)
  require(textrecipes)
})
tidymodels::tidymodels_prefer()

Preparing the dataset

Here, I use a further-reduced subset of a dataset collected from Aozora Bunko, consisting of works written in modern kanji and kana usage (shinji shin-kana) by the following 10 authors.

  • Ryunosuke Akutagawa
  • Osamu Dazai
  • Kyoka Izumi
  • Kan Kikuchi
  • Ogai Mori
  • Soseki Natsume
  • Kido Okamoto
  • Mitsuzo Sasaki
  • Toson Shimazaki
  • Juza Unno

Since text volume varies significantly by author and work, I will extract and use only works with fewer than 20,000 characters.

tmp <- tempfile(fileext = ".zip")
download.file("https://github.com/paithiov909/shiryo/raw/main/data/aozora.csv.zip", tmp)

df <- readr::read_csv(tmp, col_types = "cccf") |>
  dplyr::filter(nchar(text) < 20000) |>
  dplyr::distinct(title, .keep_all = TRUE) |>
  dplyr::mutate(author = forcats::fct_lump(author, n = 4)) |>
  dplyr::filter(author != "Other") |>
  dplyr::mutate(author = forcats::fct_drop(author))

nrow(df)
#> [1] 325

Specifically, it consists of 325 works by four authors: Ryunosuke Akutagawa, Osamu Dazai, Kido Okamoto, and Juza Unno. The approximate volume and breakdown are as follows.

df |>
  dplyr::mutate(
    nchar = nchar(text)
  ) |>
  dplyr::group_by(author) |>
  dplyr::summarise(
    nchar_mean = mean(nchar),
    nchar_median = median(nchar),
    nchar_min = min(nchar),
    nchar_max = max(nchar),
    nchar_total = sum(nchar),
    n = dplyr::n()
  ) |>
  dplyr::mutate(across(where(is.numeric), trunc))
#> # A tibble: 4 × 7
#>   author     nchar_mean nchar_median nchar_min nchar_max nchar_total     n
#>   <fct>           <dbl>        <dbl>     <dbl>     <dbl>       <dbl> <dbl>
#> 1 芥川竜之介       4499         3858       392     14895      346424    77
#> 2 太宰治           5439         3385        95     19928      582029   107
#> 3 岡本綺堂         6853         5877       558     18801      459217    67
#> 4 海野十三         8220         7531       824     19643      608341    74

These 325 works seem to consist mostly of short stories for all the authors, with occasional longer pieces.

df |>
  dplyr::mutate(nchar = nchar(text)) |>
  ggpubr::ggdensity(
    "nchar",
    y = "density",
    color = "author",
    palette = viridis::turbo(6)
  )

(Figure fig_nchar: density of character counts per work, colored by author)

Preprocessing

In Chapter 6 of the "tidymodels book," a tokenizer is passed to the custom_token argument of textrecipes::step_tokenize. However, since I won't be swapping that part of the recipe here, the morphological-analysis preprocessing is identical for every model. Therefore, I will run the morphological analysis outside the tidymodels framework beforehand and write the result back to the dataset as space-delimited text.

The MeCab dictionary used here is the IPA dictionary. In this dataset, the total number of tokens is approximately 1.27 million.

corp <- df |>
  dplyr::mutate(text = audubon::strj_normalize(text)) |>
  dplyr::select(title, text, author) |>
  gibasa::tokenize(text, title, split = TRUE) |>
  gibasa::prettify(col_select = c("POS1", "POS2", "Original"))

nrow(corp)
#> [1] 1273878

I will segment the text into words, keeping only vocabulary found in the dictionary while excluding particles (助詞), auxiliary verbs (助動詞), and symbols (記号), and attaching part-of-speech (POS) information to each token. Since it is quite common for tokens with the same surface form to carry different meanings depending on POS, attaching POS information increases the number of distinct tokens (i.e., the number of features), which likely improves the final accuracy.
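To get a feel for how much POS-tagging inflates the vocabulary, one could compare the number of distinct surface forms with and without the POS suffix. This is a quick sketch, assuming the `corp` tibble produced by gibasa::prettify above (with its `Original`, `POS1`, and `POS2` columns) is still available:

```r
# Compare vocabulary size with and without POS information (sketch).
# `n_with_pos` should be >= `n_surface`, since one surface form can
# split into several surface/POS1/POS2 combinations.
corp |>
  dplyr::summarise(
    n_surface  = dplyr::n_distinct(Original),
    n_with_pos = dplyr::n_distinct(
      stringr::str_c(Original, POS1, POS2, sep = "/")
    )
  )
```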

corp <- corp |>
  dplyr::mutate(
    token = stringr::str_c(Original, POS1, POS2, sep = "/")
  ) |>
  gibasa::mute_tokens(POS1 %in% c("記号", "助詞", "助動詞") |
    is.na(Original)) |>
  gibasa::pack() |>
  dplyr::rename(title = doc_id) |>
  dplyr::left_join(
    dplyr::select(df, doc_id, title, author),
    by = "title"
  )

The resulting word segmentation looks like this:

dplyr::pull(corp[1, ], text) |> stringr::str_sub(end = 200L)
#> [1] "私/名詞/代名詞 家/名詞/一般 代々/名詞/副詞可能 お/接頭詞/名詞接続 奥/名詞/一般 坊主/名詞/一般 の/名詞/非自立 父/名詞/一般 母/名詞/一般 はなはだ/副詞/一般 特徴/名詞/一般 ない/形容詞/自立 平凡/名詞/形容動詞語幹 人間/名詞/一般 父/名詞/一般 一中節/名詞/一般 囲碁/名詞/一般 盆栽/名詞/一般 俳句/名詞/一般 道楽/名詞/サ変接続 ある/動詞/自立 いず"

I will split this dataset into training and test data.

corp_split <- initial_split(corp, strata = author)
corp_train <- training(corp_split)
corp_test <- testing(corp_split)
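Because the split is stratified on `author`, the class proportions in the training set should mirror the full dataset. A quick sanity check, assuming the `corp_train` object from above:

```r
# Check class balance of the stratified training split (sketch)
dplyr::count(corp_train, author) |>
  dplyr::mutate(prop = n / sum(n))
```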

Modeling 1 (TF-IDF Sparse Matrix Case)

In Chapter 6 of the "tidymodels book," Feature Hashing appears quite suddenly. However, one would still want to try training with TF-IDF first. Therefore, I will first prepare a recipe that includes textrecipes::step_tfidf.

tfidf_rec <-
  recipe(author ~ text, data = corp_train) |>
  step_tokenize(text, custom_token = \(x) {
    strsplit(x, " +")
  }) |>
  step_tokenfilter(text, min_times = 30, max_tokens = 1e3) |>
  step_tfidf(text)

I will create a workflow using this recipe; the key point here is to include blueprint = hardhat::default_recipe_blueprint(composition = "dgCMatrix").

xgb_spec <-
  boost_tree(
    trees = 1000,
    learn_rate = .2,
    sample_size = tune(),
    loss_reduction = tune(),
    tree_depth = tune(),
    stop_iter = 5
  ) |>
  set_engine("xgboost") |>
  set_mode("classification")

tfidf_wflow <-
  workflow() |>
  add_recipe(
    tfidf_rec,
    blueprint = hardhat::default_recipe_blueprint(composition = "dgCMatrix")
  ) |>
  add_model(xgb_spec)

This argument specifies the composition of the predictors once preprocessing is complete.

A document-term matrix weighted by TF-IDF is a highly sparse object, but by default the predictors after preprocessing are returned as a tibble, i.e., an ordinary dense object.

Checking in practice, the zero entries are indeed materialized as 0.0:

tfidf_rec |>
  prep() |>
  bake(new_data = NULL) |>
  dplyr::select(1:10) |>
  dplyr::glimpse()
#> Rows: 242
#> Columns: 10
#> $ author                            <fct> 岡本綺堂, 岡本綺堂, 岡本綺堂, 岡本…
#> $ `tfidf_text_あいだ/名詞/副詞可能` <dbl> 0.0005113384, 0.0017296145, 0.00549…
#> $ `tfidf_text_あがる/動詞/自立`     <dbl> 0.0004354376, 0.0014728782, 0.00000…
#> $ `tfidf_text_あける/動詞/自立`     <dbl> 0.0008516207, 0.0000000000, 0.00057…
#> $ `tfidf_text_あげる/動詞/自立`     <dbl> 0.0020833483, 0.0009395968, 0.00055…
#> $ `tfidf_text_あげる/動詞/非自立`   <dbl> 0.0009433636, 0.0010636501, 0.00000…
#> $ `tfidf_text_あたし/名詞/代名詞`   <dbl> 0.0000000000, 0.0008180038, 0.00000…
#> $ `tfidf_text_あたり/名詞/一般`     <dbl> 0.0003917866, 0.0017669703, 0.00052…
#> $ `tfidf_text_あと/名詞/一般`       <dbl> 0.0019714064, 0.0018523139, 0.00264…
#> $ `tfidf_text_あなた/名詞/代名詞`   <dbl> 0.0000000000, 0.0004801046, 0.00171…

On the other hand, by specifying composition = "dgCMatrix" in the blueprint, the predictors at the end of preprocessing become a sparse matrix (dgCMatrix). Only certain engines can accept sparse matrices, such as glmnet's logistic regression (including multinomial logit), XGBoost, and ranger's random forests, but for data like this they allow far more memory- and compute-efficient training than passing dense objects.
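To see why this matters, one can measure how sparse the baked TF-IDF matrix actually is. The following sketch assumes the `tfidf_rec` recipe from above; it converts the dense baked tibble to a dgCMatrix with the Matrix package and reports the proportion of zero entries:

```r
# Measure the sparsity of the TF-IDF document-term matrix (sketch)
baked <- tfidf_rec |>
  prep() |>
  bake(new_data = NULL)

m <- Matrix::Matrix(
  as.matrix(dplyr::select(baked, -author)),
  sparse = TRUE
)

# Proportion of entries that are exactly zero;
# the closer to 1, the more a sparse representation pays off
1 - Matrix::nnzero(m) / length(m)
```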

Now that the workflow is ready, I will perform a hyperparameter search.

set.seed(123)
straps <- bootstraps(corp_train, times = 3, strata = author)

corp_tfidf_grid <-
  tune_grid(
    tfidf_wflow,
    resamples = straps,
    grid = grid_latin_hypercube(
      sample_prop(),
      loss_reduction(),
      tree_depth(),
      size = 10
    ),
    metrics = metric_set(f_meas),
    control = control_grid(save_pred = TRUE)
  )

autoplot(corp_tfidf_grid)

(Figure tfidf_tune: autoplot of the tuning results for the TF-IDF workflow)

It seems that this approach allows for training with good accuracy.

tfidf_wflow_best <-
  finalize_workflow(tfidf_wflow, select_best(corp_tfidf_grid))
tfidf_last_fit <-
  last_fit(tfidf_wflow_best, corp_split, metrics = metric_set(f_meas))

collect_metrics(tfidf_last_fit)
#> # A tibble: 1 × 4
#>   .metric .estimator .estimate .config
#>   <chr>   <chr>          <dbl> <chr>
#> 1 f_meas  macro          0.935 Preprocessor1_Model1

Modeling 2 (Feature Hashing)

I will also try Feature Hashing, as in the "tidymodels book". I actually wanted to train everything at once using workflowsets, but since I wasn't sure how to specify the blueprint there, I will instead create a new workflow by swapping in a different recipe.

hashing_rec <-
  recipe(author ~ text, data = corp_train) |>
  step_tokenize(text, custom_token = \(x) {
    strsplit(x, " +")
  }) |>
  step_tokenfilter(text, min_times = 30, max_tokens = 1e3) |>
  step_texthash(text)

hashing_wflow <-
  tfidf_wflow |>
  update_recipe(hashing_rec)

I will perform the training in the same way as before.

corp_hashing_grid <-
  tune_grid(
    hashing_wflow,
    resamples = straps,
    grid = grid_latin_hypercube(
      sample_prop(),
      loss_reduction(),
      tree_depth(),
      size = 10
    ),
    metrics = metric_set(f_meas),
    control = control_grid(save_pred = TRUE)
  )

autoplot(corp_hashing_grid)

(Figure hashing_tune: autoplot of the tuning results for the Feature Hashing workflow)

This produced a model that looks slightly better than the previous one. That said, given how little test data there is to begin with, a difference of this size is probably within the noise.

hashing_wflow_best <-
  finalize_workflow(hashing_wflow, select_best(corp_hashing_grid))
hashing_last_fit <-
  last_fit(hashing_wflow_best, corp_split, metrics = metric_set(f_meas))

collect_metrics(hashing_last_fit)
#> # A tibble: 1 × 4
#>   .metric .estimator .estimate .config
#>   <chr>   <chr>          <dbl> <chr>
#> 1 f_meas  macro          0.952 Preprocessor1_Model1
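For a direct side-by-side comparison of the two approaches, the two final fits can be stacked into one table. A small sketch, assuming both `tfidf_last_fit` and `hashing_last_fit` from above are available:

```r
# Compare the test-set macro F-measure of both workflows (sketch)
dplyr::bind_rows(
  tfidf   = collect_metrics(tfidf_last_fit),
  hashing = collect_metrics(hashing_last_fit),
  .id = "approach"
)
```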
