Translated by AI
tidymodels Practice: Training XGBoost with Sparse Matrices
About this article
Referencing "Practical Introduction to tidymodels for R Users" (the "tidymodels book"), I will perform author classification using tidymodels and textrecipes.
My motivation is to try a pattern not covered in the "tidymodels book"—using XGBoost while providing a sparse matrix weighted by TF-IDF as training data—and compare it with Feature Hashing.
suppressPackageStartupMessages({
require(tidyverse)
require(tidymodels)
require(textrecipes)
})
tidymodels::tidymodels_prefer()
Preparing the dataset
Here, I use (a further subset of) a dataset collected from works written in shinji shin-kana (modern kanji and kana usage) by the following 10 authors on Aozora Bunko.
- Ryunosuke Akutagawa
- Osamu Dazai
- Kyoka Izumi
- Kan Kikuchi
- Ogai Mori
- Soseki Natsume
- Kido Okamoto
- Mitsuzo Sasaki
- Toson Shimazaki
- Juza Unno
Since the volume varies significantly depending on the author and work, I will extract and use works with fewer than 20,000 characters.
tmp <- tempfile(fileext = ".zip")
download.file("https://github.com/paithiov909/shiryo/raw/main/data/aozora.csv.zip", tmp)
df <- readr::read_csv(tmp, col_types = "cccf") |>
dplyr::filter(nchar(text) < 20000) |>
dplyr::distinct(title, .keep_all = TRUE) |>
dplyr::mutate(author = forcats::fct_lump(author, n = 4)) |>
dplyr::filter(author != "Other") |>
dplyr::mutate(author = forcats::fct_drop(author))
nrow(df)
#> [1] 325
Specifically, it consists of 325 works by four authors: Ryunosuke Akutagawa, Osamu Dazai, Kido Okamoto, and Juza Unno. The approximate volume and breakdown are as follows.
df |>
dplyr::mutate(
nchar = nchar(text)
) |>
dplyr::group_by(author) |>
dplyr::summarise(
nchar_mean = mean(nchar),
nchar_median = median(nchar),
nchar_min = min(nchar),
nchar_max = max(nchar),
nchar_total = sum(nchar),
n = dplyr::n()
) |>
dplyr::mutate(across(where(is.numeric), trunc))
#> # A tibble: 4 × 7
#> author nchar_mean nchar_median nchar_min nchar_max nchar_total n
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 芥川竜之介 4499 3858 392 14895 346424 77
#> 2 太宰治 5439 3385 95 19928 582029 107
#> 3 岡本綺堂 6853 5877 558 18801 459217 67
#> 4 海野十三 8220 7531 824 19643 608341 74
These 325 works seem to consist mostly of short stories for all the authors, with occasional longer pieces.
df |>
dplyr::mutate(nchar = nchar(text)) |>
ggpubr::ggdensity(
"nchar",
y = "density",
color = "author",
palette = viridis::turbo(6)
)

Preprocessing
In Chapter 6 of the "tidymodels book," a tokenizer is passed to the custom_token argument of textrecipes::step_tokenize. Here, however, I won't be swapping out that part of the recipe, so the morphological-analysis step is common to both recipes. I will therefore run the morphological analysis outside the tidymodels framework beforehand and put the result back into the dataset as space-delimited text.
The MeCab dictionary used here is the IPA dictionary. In this dataset, the total number of tokens is approximately 1.27 million.
corp <- df |>
dplyr::mutate(text = audubon::strj_normalize(text)) |>
dplyr::select(title, text, author) |>
gibasa::tokenize(text, title, split = TRUE) |>
gibasa::prettify(col_select = c("POS1", "POS2", "Original"))
nrow(corp)
#> [1] 1273878
I will perform word segmentation, keeping only vocabulary found in the dictionary while excluding particles, auxiliary verbs, and symbols, and attaching part-of-speech (POS) information to each token. Incidentally, since tokens that share a surface form but differ in meaning depending on part of speech are quite common, attaching POS information increases the number of distinct words (i.e., the number of features), which likely helps the final accuracy.
corp <- corp |>
dplyr::mutate(
token = stringr::str_c(Original, POS1, POS2, sep = "/")
) |>
gibasa::mute_tokens(POS1 %in% c("記号", "助詞", "助動詞") |
is.na(Original)) |>
gibasa::pack() |>
dplyr::rename(title = doc_id) |>
dplyr::left_join(
dplyr::select(df, doc_id, title, author),
by = "title"
)
The resulting word segmentation looks like this:
dplyr::pull(corp[1, ], text) |> stringr::str_sub(end = 200L)
#> [1] "私/名詞/代名詞 家/名詞/一般 代々/名詞/副詞可能 お/接頭詞/名詞接続 奥/名詞/一般 坊主/名詞/一般 の/名詞/非自立 父/名詞/一般 母/名詞/一般 はなはだ/副詞/一般 特徴/名詞/一般 ない/形容詞/自立 平凡/名詞/形容動詞語幹 人間/名詞/一般 父/名詞/一般 一中節/名詞/一般 囲碁/名詞/一般 盆栽/名詞/一般 俳句/名詞/一般 道楽/名詞/サ変接続 ある/動詞/自立 いず"
I will split this dataset into training and test data.
corp_split <- initial_split(corp, strata = author)
corp_train <- training(corp_split)
corp_test <- testing(corp_split)
Modeling 1 (TF-IDF Sparse Matrix Case)
In Chapter 6 of the "tidymodels book," Feature Hashing appears quite suddenly. However, one would still want to try training with TF-IDF first. Therefore, I will first prepare a recipe that includes textrecipes::step_tfidf.
tfidf_rec <-
recipe(author ~ text, data = corp_train) |>
step_tokenize(text, custom_token = \(x) {
strsplit(x, " +")
}) |>
step_tokenfilter(text, min_times = 30, max_tokens = 1e3) |>
step_tfidf(text)
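Before moving on, here is a toy illustration of what step_tfidf computes for each document-term pair. This base-R sketch uses plain tf = term count / document length and idf = log(N / document frequency); note that textrecipes' defaults (e.g., idf smoothing) may differ slightly, so this only shows the idea.

```r
# Toy TF-IDF in base R: tf = term count / document length,
# idf = log(N / number of documents containing the term).
# textrecipes::step_tfidf may use slightly different defaults.
docs <- c("cat dog cat", "dog dog", "cat bird bird bird")
toks <- strsplit(docs, " +")
vocab <- sort(unique(unlist(toks)))

# Term-frequency matrix: one row per document, one column per term
tf <- t(vapply(
  toks,
  function(x) table(factor(x, levels = vocab)) / length(x),
  numeric(length(vocab))
))
idf <- log(length(docs) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

Terms that occur in every document get idf = 0 and are effectively discounted, while terms concentrated in few documents are weighted up.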
I will create a workflow using this recipe, but the key point here is to include the incantation blueprint = hardhat::default_recipe_blueprint(composition = "dgCMatrix").
xgb_spec <-
boost_tree(
trees = 1000,
learn_rate = .2,
sample_size = tune(),
loss_reduction = tune(),
tree_depth = tune(),
stop_iter = 5
) |>
set_engine("xgboost") |>
set_mode("classification")
tfidf_wflow <-
workflow() |>
add_recipe(
tfidf_rec,
blueprint = hardhat::default_recipe_blueprint(composition = "dgCMatrix")
) |>
add_model(xgb_spec)
This argument specifies the data structure that the predictors take once preprocessing is complete. A document-term matrix weighted by TF-IDF is a highly sparse object, but by default the preprocessed predictors come back as a tibble, that is, as a standard dense object. Checking in practice, the zero entries are indeed materialized as 0.0.
tfidf_rec |>
prep() |>
bake(new_data = NULL) |>
dplyr::select(1:10) |>
dplyr::glimpse()
#> Rows: 242
#> Columns: 10
#> $ author <fct> 岡本綺堂, 岡本綺堂, 岡本綺堂, 岡本…
#> $ `tfidf_text_あいだ/名詞/副詞可能` <dbl> 0.0005113384, 0.0017296145, 0.00549…
#> $ `tfidf_text_あがる/動詞/自立` <dbl> 0.0004354376, 0.0014728782, 0.00000…
#> $ `tfidf_text_あける/動詞/自立` <dbl> 0.0008516207, 0.0000000000, 0.00057…
#> $ `tfidf_text_あげる/動詞/自立` <dbl> 0.0020833483, 0.0009395968, 0.00055…
#> $ `tfidf_text_あげる/動詞/非自立` <dbl> 0.0009433636, 0.0010636501, 0.00000…
#> $ `tfidf_text_あたし/名詞/代名詞` <dbl> 0.0000000000, 0.0008180038, 0.00000…
#> $ `tfidf_text_あたり/名詞/一般` <dbl> 0.0003917866, 0.0017669703, 0.00052…
#> $ `tfidf_text_あと/名詞/一般` <dbl> 0.0019714064, 0.0018523139, 0.00264…
#> $ `tfidf_text_あなた/名詞/代名詞` <dbl> 0.0000000000, 0.0004801046, 0.00171…
On the other hand, specifying composition = "dgCMatrix" in the blueprint makes the predictors at the end of preprocessing a sparse matrix (dgCMatrix). Only certain engines can handle sparse matrices, such as glmnet's logistic regression (including multinomial logit), XGBoost, and ranger's random forests, but they allow for much more efficient training than passing a dense object.
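The memory savings are easy to confirm with the Matrix package. The toy matrix below is only a stand-in for the actual preprocessed predictors: 99% zeros, roughly the sparsity of a typical document-term matrix.

```r
library(Matrix)

# A stand-in for a document-term matrix: 300 documents x 1,000
# features with 99% zeros, similar to typical TF-IDF output.
set.seed(123)
dense <- matrix(0, nrow = 300, ncol = 1000)
idx <- sample(length(dense), length(dense) * 0.01)
dense[idx] <- runif(length(idx))

sparse <- Matrix(dense, sparse = TRUE)  # a dgCMatrix

# The sparse representation stores only the nonzero entries,
# so it takes a small fraction of the dense object's memory.
object.size(dense)
object.size(sparse)
```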
Now that the workflow is ready, I will perform a hyperparameter search.
set.seed(123)
straps <- bootstraps(corp_train, times = 3, strata = author)
corp_tfidf_grid <-
tune_grid(
tfidf_wflow,
resamples = straps,
grid = grid_latin_hypercube(
sample_prop(),
loss_reduction(),
tree_depth(),
size = 10
),
metrics = metric_set(f_meas),
control = control_grid(save_pred = TRUE)
)
autoplot(corp_tfidf_grid)

It seems that this approach allows for training with good accuracy.
tfidf_wflow_best <-
finalize_workflow(tfidf_wflow, select_best(corp_tfidf_grid))
tfidf_last_fit <-
last_fit(tfidf_wflow_best, corp_split, metrics = metric_set(f_meas))
collect_metrics(tfidf_last_fit)
#> # A tibble: 1 × 4
#> .metric .estimator .estimate .config
#> <chr> <chr> <dbl> <chr>
#> 1 f_meas macro 0.935 Preprocessor1_Model1
Modeling 2 (Feature Hashing)
Following the "tidymodels book," I will also try Feature Hashing. I actually wanted to train everything at once using workflowsets, but since I wasn't sure how to specify the blueprint there, I will instead create a new workflow by swapping in a different recipe.
hashing_rec <-
recipe(author ~ text, data = corp_train) |>
step_tokenize(text, custom_token = \(x) {
strsplit(x, " +")
}) |>
step_tokenfilter(text, min_times = 30, max_tokens = 1e3) |>
step_texthash(text)
hashing_wflow <-
tfidf_wflow |>
update_recipe(hashing_rec)
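As a side note, the hashing trick behind step_texthash can be sketched in a few lines: each token is mapped straight to a column index by a hash function, so no vocabulary needs to be stored, at the cost of occasional collisions. The bucket function below is a deliberately crude stand-in; textrecipes uses a proper hash (MurmurHash3).

```r
# Toy hashing trick: map tokens to one of num_terms buckets and
# count them, instead of keeping a vocabulary. The hash here is a
# crude stand-in for the real one used by textrecipes.
num_terms <- 8L
bucket <- function(token) sum(utf8ToInt(token)) %% num_terms + 1L

tokens <- c("吾輩", "猫", "名前", "ない", "猫")
idx <- vapply(tokens, bucket, integer(1))

# Fixed-length feature vector: colliding tokens simply add up
feat <- tabulate(idx, nbins = num_terms)
feat
```

Because the output length is fixed by num_terms rather than by the vocabulary, hashing scales to arbitrarily many distinct tokens, which is exactly why it pairs well with large text corpora.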
I will perform the training in the same way as before.
corp_hashing_grid <-
tune_grid(
hashing_wflow,
resamples = straps,
grid = grid_latin_hypercube(
sample_prop(),
loss_reduction(),
tree_depth(),
size = 10
),
metrics = metric_set(f_meas),
control = control_grid(save_pred = TRUE)
)
autoplot(corp_hashing_grid)

This looks like a somewhat better model than before. Then again, considering there isn't much test data to begin with, the result is probably about what one should expect.
hashing_wflow_best <-
finalize_workflow(hashing_wflow, select_best(corp_hashing_grid))
hashing_last_fit <-
last_fit(hashing_wflow_best, corp_split, metrics = metric_set(f_meas))
collect_metrics(hashing_last_fit)
#> # A tibble: 1 × 4
#> .metric .estimator .estimate .config
#> <chr> <chr> <dbl> <chr>
#> 1 f_meas macro 0.952 Preprocessor1_Model1
Discussion