
Demo of spacyr and cleanNLP in R


spacyr

About spacyr

spacyr is an R package that calls spaCy via the reticulate package.

https://spacyr.quanteda.io/

Basically, it is designed for performing POS tagging, dependency labeling, and named entity recognition using models that can be installed via pip.

Currently, you cannot modify the pipeline, retrain models, or load custom models through this package. You should probably write Python code yourself for those tasks.

Notes

  • Although not in this list, spaCy officially added Japanese models in version 2.3, so Japanese models are available in that version and later.
  • This issue has already been closed, so installing the latest version of spaCy should generally be fine.
  • Multi-process settings like nlp.pipe(n_process=2) cannot be passed. Multi-threading is supported on the Python side (internally, it runs a Python script wrapping spaCy using itertools).
  • While spaCy can use GPUs by setting spacy.prefer.gpu(), it seems GPU usage is not supported via spacyr (furthermore, spacyr::spacy_install currently doesn't support installing GPU-enabled versions).
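For the features spacyr does not expose (custom pipelines, `n_process`, GPU), one option is to drive spaCy directly through reticulate. A minimal sketch, assuming the spacy_condaenv environment and the ja_core_news_sm model described below are already installed:

```r
library(reticulate)

# Call spaCy directly when spacyr's wrapper is not enough.
# Assumes spacy_condaenv and ja_core_news_sm are already set up.
use_condaenv("spacy_condaenv")
spacy <- import("spacy")
nlp <- spacy$load("ja_core_news_sm")

doc <- nlp("望遠鏡で泳ぐ彼女を見た")

# Iterate over the Python Doc object from R
reticulate::iterate(doc, function(token) {
  cat(token$text, token$pos_, "\n")
})
```

This way you keep full access to the Python API (e.g. `nlp$pipe()` with its keyword arguments) at the cost of writing reticulate-flavored code yourself.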

Usage

(1) Installation via conda

If you use a Miniconda environment, Python libraries will naturally all be installed via conda.

While it is easy to set up, there is a trap specific to Windows: the following steps require running R with administrator privileges. If using RStudio, right-click the shortcut and select "Run as administrator."

Executing the following script as an administrator will download spaCy and the Japanese model ja_core_news_sm into a condaenv named spacy_condaenv (Japanese models come in _sm, _md, and _lg sizes).

Note that on non-Windows systems, you can also isolate the environment using virtualenv; in that case, replace spacy_install with spacy_install_virtualenv.

spacyr::spacy_install(lang_models = "ja_core_news_sm")

Load the model to use it.

spacyr::spacy_initialize(model = "ja_core_news_sm")

As a side note, in spaCy > 2.3, the Japanese tokenizer has migrated from fugashi to sudachipy (sudachidict-core), so installing spaCy will automatically install sudachipy. (spaCy also depends on natto-py, but that is for Korean; using the Japanese model should not depend on MeCab).

Especially when using large models, keeping them in memory might not be ideal, so it is recommended to run the following command after you finish.

spacyr::spacy_finalize()
# It seems that once you finalize, you need to restart the R session to initialize again.

(2) Installation via pip with GiNZA

There are cases where you might want to use pip even in a conda environment, such as when you want to use GiNZA models. Since mixing conda and pip can sometimes break the Miniconda environment itself, it's not generally recommended. However, it can work if you create a separate environment and use only pip within it without mixing with conda packages.

The latest CRAN release of spacyr (1.2.1) does not let you pin a specific spaCy version during installation, so to use GiNZA models (which at the time of writing require spaCy v2.x) you can do the following:

Note that R must still be run with administrator privileges, and on Windows, Visual C++ is required to build certain dependencies when installing via pip. Also, if you have already created spacy_condaenv via conda, the environments will get mixed; in that case, delete the environment to start fresh or use a different environment name.

require(reticulate)
conda_create("spacy_condaenv", python_version = "3.9")
conda_install("spacy_condaenv", packages = "spacy<3", pip = TRUE)

require(spacyr)
spacy_download_langmodel("ja_core_news_sm")

conda_install("spacy_condaenv", packages = "ginza", pip = TRUE)

Now you can load the GiNZA model. You can call it directly without needing reticulate::import("ginza").

spacyr::spacy_initialize(model = "ja_ginza")

Don't forget to finalize when you're done.

spacyr::spacy_finalize()

Parsing

Let's try it after loading ja_ginza. Here is how to run only tokenization.

spacyr::spacy_tokenize(
  c("望遠鏡で泳ぐ彼女を見た", "頭が赤い魚を食べる猫", "外国人参政権")
)
#> $text1
#> [1] "望遠鏡" "で"     "泳ぐ"   "彼女"   "を"     "見"     "た"
#>
#> $text2
#> [1] "頭"     "が"     "赤い"   "魚"     "を"     "食べる" "猫"
#>
#> $text3
#> [1] "外国人参政権"

To perform labeling, call the spacy_parse function. Since lemmatization doesn't work well with Japanese models, set lemma = FALSE to avoid warnings.

spacyr::spacy_parse(
  c("望遠鏡で泳ぐ彼女を見た", "頭が赤い魚を食べる猫", "外国人参政権"),
  lemma = FALSE
)
#>    doc_id sentence_id token_id        token  pos         entity
#> 1   text1           1        1       望遠鏡 NOUN
#> 2   text1           1        2           で  ADP
#> 3   text1           1        3         泳ぐ VERB
#> 4   text1           1        4         彼女 PRON
#> 5   text1           1        5           を  ADP
#> 6   text1           1        6           見 VERB
#> 7   text1           1        7           た  AUX
#> 8   text2           1        1           頭 NOUN  Animal_Part_B
#> 9   text2           1        2           が  ADP
#> 10  text2           1        3         赤い  ADJ Nature_Color_B
#> 11  text2           1        4           魚 NOUN
#> 12  text2           1        5           を  ADP
#> 13  text2           1        6       食べる VERB
#> 14  text2           1        7           猫 NOUN       Mammal_B
#> 15  text3           1        1 外国人参政権 NOUN

Since the models are trained on Universal Dependencies treebanks, you can also output dependency relations.

spacyr::spacy_parse(
  c("望遠鏡で泳ぐ彼女を見た", "頭が赤い魚を食べる猫", "外国人参政権"),
  dependency = TRUE,
  lemma = FALSE,
  pos = FALSE
)
#>    doc_id sentence_id token_id        token head_token_id dep_rel         entity
#> 1   text1           1        1       望遠鏡             3     obl
#> 2   text1           1        2           で             1    case
#> 3   text1           1        3         泳ぐ             4     acl
#> 4   text1           1        4         彼女             6     obj
#> 5   text1           1        5           を             4    case
#> 6   text1           1        6           見             6    ROOT
#> 7   text1           1        7           た             6     aux
#> 8   text2           1        1           頭             3   nsubj  Animal_Part_B
#> 9   text2           1        2           が             1    case
#> 10  text2           1        3         赤い             4     acl Nature_Color_B
#> 11  text2           1        4           魚             6     obj
#> 12  text2           1        5           を             4    case
#> 13  text2           1        6       食べる             7     acl
#> 14  text2           1        7           猫             7    ROOT       Mammal_B
#> 15  text3           1        1 外国人参政権             1    ROOT

There are dedicated functions for extracting noun chunks and performing named entity recognition.
Since spacyr::spacy_parse returns a data frame, you could do the same on the R side using dplyr or similar, but the Python-side functions may be faster depending on the environment.

spacyr::spacy_parse(
  c("望遠鏡で泳ぐ彼女を見た", "頭が赤い魚を食べる猫", "外国人参政権"),
  lemma = FALSE,
  entity = TRUE
) %>%
  spacyr::entity_extract()
#>
#>   doc_id sentence_id entity entity_type
#> 1  text2           1     頭      Animal
#> 2  text2           1   赤い      Nature
#> 3  text2           1     猫      Mammal
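The dedicated helpers mentioned above can also be called on raw text directly, without going through spacy_parse first. A sketch (output omitted):

```r
# Named entities extracted on the Python side in a single call
spacyr::spacy_extract_entity("頭が赤い魚を食べる猫")

# Noun chunks likewise have a dedicated helper
spacyr::spacy_extract_nounphrases("望遠鏡で泳ぐ彼女を見た")
```

Both return data frames, so the result plugs into the same downstream dplyr workflow.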

Displaying dependency graphs

You might wonder if there are graphs like those produced by spacy-streamlit. Yes, there are.

You can use a package called textplot. It can be installed normally from CRAN.

textplot::textplot_dependencyparser normally takes a data frame of udpipe results as an argument, but since you don't need all the columns, you can plot spacyr results with a little processing.

spacyr::spacy_parse(
  c("望遠鏡で泳ぐ彼女を見た", "頭が赤い魚を食べる猫", "外国人参政権"),
  dependency = TRUE,
  lemma = FALSE,
  pos = TRUE # upos in UD
) %>%
  dplyr::rename(upos = pos) %>% # rename to upos here
  dplyr::filter(doc_id == "text1") %>% # needs to be passed one sentence at a time
  textplot::textplot_dependencyparser()

(Figure: dependency parse plot of text1, spacy_plot-1.png)

cleanNLP

About cleanNLP

This is an R package that allows you to use UDPipe, spaCy, and CoreNLP in a tidy way!

https://statsmaths.github.io/cleanNLP/

For UDPipe, it uses udpipe as a backend. For spaCy and CoreNLP, it seems to use pip to install Python libraries separately and use them as a backend (it is unrelated to spacyr or coreNLP). Effectively, it is a wrapper for udpipe.

Usage

cleanNLP::cnlp_init_udpipe gets it running. On Windows, convert the input text to UTF-8 before passing it in.

cleanNLP::cnlp_init_udpipe(model_name = "japanese")
annotation <- cleanNLP::cnlp_annotate(input = iconv("望遠鏡で泳ぐ彼女を見た", to = "UTF-8"))
annotation$token
#>
#> # A tibble: 7 x 11
#>   doc_id   sid tid   token  token_with_ws lemma  upos  xpos  feats tid_source relation
#> *  <int> <int> <chr> <chr>  <chr>         <chr>  <chr> <chr> <chr> <chr>      <chr>
#> 1      1     1 1     望遠鏡 望遠鏡        望遠鏡 NOUN  NN    <NA>  6          obl
#> 2      1     1 2     で     で            で     ADP   PS    <NA>  1          case
#> 3      1     1 3     泳ぐ   泳ぐ          泳ぐ   VERB  VV    <NA>  4          acl
#> 4      1     1 4     彼女   彼女          彼女   PRON  NP    <NA>  6          obj
#> 5      1     1 5     を     を            を     ADP   PS    <NA>  4          case
#> 6      1     1 6     見     見            見る   VERB  VV    <NA>  0          root
#> 7      1     1 7     た     た            た     AUX   AV    <NA>  6          aux
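Because cnlp_annotate returns plain tibbles, the result can be manipulated with dplyr directly. For example, keeping only content words from the annotation object above (a sketch):

```r
library(dplyr)

# Filter the token table down to nouns and verbs
annotation$token %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  select(doc_id, token, lemma, upos)
```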

To be honest, it doesn't feel much different from udpipe, but cleanNLP does manage the loaded models within an environment inside the package's namespace.

The udpipe equivalent is the following, so cleanNLP is a little less verbose.

model_path <- udpipe::udpipe_download_model("japanese")
model <- udpipe::udpipe_load_model(model_path$file_model)
udpipe::udpipe(iconv("これをこうやって使う", to = "UTF-8"), model)

Which corpus each model is trained on

This is likely current as of May 2021. For information on Japanese UD itself, you should read this.

Model         Training corpus           Word count   License
spaCy > 2.3   UD Japanese-GSD v2.6-NE   186k         CC BY-SA
GiNZA > 4     UD Japanese-BCCWJ v2.6    1,098k       MIT
UDPipe 1      UD Japanese-GSD v2.5      186k         CC BY-NC-SA
  • The "Word Count" here refers to the size of the original training corpus, not the model's vocabulary.
  • It seems a different annotated corpus (GSK2014-A (2019) BCCWJ edition) is used for the NER model in GiNZA v4.
  • As for UDPipe, models trained on UD 2.6 have already been released, but they cannot be used from the R package.

Session Information

Tested on Windows 10, Miniconda3 (Miniconda environment itself: Python=3.7.9, conda=4.9.2).

sessioninfo::session_info()
#> - Session info --------------------------------------------------------------------------------
#>  setting  value
#>  version  R version 4.0.2 (2020-06-22)
#>  os       Windows 10 x64
#>  system   x86_64, mingw32
#>  ui       RStudio
#>  language (EN)
#>  collate  Japanese_Japan.932
#>  ctype    Japanese_Japan.932
#>  tz       Asia/Tokyo
#>  date     2021-05-24
#>
#> - Packages ------------------------------------------------------------------------------------
#>  package      * version date       lib source
#>  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.0.2)
#>  backports      1.2.1   2020-12-09 [1] CRAN (R 4.0.3)
#>  bslib          0.2.4   2021-01-25 [1] CRAN (R 4.0.3)
#>  cli            2.5.0   2021-04-26 [1] CRAN (R 4.0.5)
#>  colorspace     2.0-1   2021-05-04 [1] CRAN (R 4.0.5)
#>  crayon         1.4.1   2021-02-08 [1] CRAN (R 4.0.2)
#>  data.table     1.14.0  2021-02-21 [1] CRAN (R 4.0.2)
#>  DBI            1.1.1   2021-01-15 [1] CRAN (R 4.0.3)
#>  digest         0.6.27  2020-10-24 [1] CRAN (R 4.0.3)
#>  dplyr          1.0.6   2021-05-05 [1] CRAN (R 4.0.2)
#>  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.0.2)
#>  evaluate       0.14    2019-05-28 [1] CRAN (R 4.0.2)
#>  fansi          0.4.2   2021-01-15 [1] CRAN (R 4.0.3)
#>  farver         2.1.0   2021-02-28 [1] CRAN (R 4.0.2)
#>  generics       0.1.0   2020-10-31 [1] CRAN (R 4.0.3)
#>  ggforce        0.3.3   2021-03-05 [1] CRAN (R 4.0.4)
#>  ggplot2        3.3.3   2020-12-30 [1] CRAN (R 4.0.3)
#>  ggraph         2.0.5   2021-02-23 [1] CRAN (R 4.0.4)
#>  ggrepel        0.9.1   2021-01-15 [1] CRAN (R 4.0.3)
#>  glue           1.4.2   2020-08-27 [1] CRAN (R 4.0.3)
#>  graphlayouts   0.7.1   2020-10-26 [1] CRAN (R 4.0.4)
#>  gridExtra      2.3     2017-09-09 [1] CRAN (R 4.0.2)
#>  gtable         0.3.0   2019-03-25 [1] CRAN (R 4.0.2)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.0.5)
#>  htmltools      0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#>  igraph         1.2.6   2020-10-06 [1] CRAN (R 4.0.3)
#>  jquerylib      0.1.4   2021-04-26 [1] CRAN (R 4.0.5)
#>  jsonlite       1.7.2   2020-12-09 [1] CRAN (R 4.0.3)
#>  knitr          1.33    2021-04-24 [1] CRAN (R 4.0.5)
#>  labeling       0.4.2   2020-10-20 [1] CRAN (R 4.0.3)
#>  lattice        0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
#>  lifecycle      1.0.0   2021-02-15 [1] CRAN (R 4.0.2)
#>  magrittr     * 2.0.1   2020-11-17 [1] CRAN (R 4.0.3)
#>  MASS           7.3-54  2021-05-03 [1] CRAN (R 4.0.5)
#>  Matrix         1.3-3   2021-05-04 [1] CRAN (R 4.0.2)
#>  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.0.2)
#>  pillar         1.6.0   2021-04-13 [1] CRAN (R 4.0.5)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.0.2)
#>  png            0.1-7   2013-12-03 [1] CRAN (R 4.0.3)
#>  polyclip       1.10-0  2019-03-14 [1] CRAN (R 4.0.3)
#>  purrr          0.3.4   2020-04-17 [1] CRAN (R 4.0.2)
#>  R.cache        0.15.0  2021-04-30 [1] CRAN (R 4.0.5)
#>  R.methodsS3    1.8.1   2020-08-26 [1] CRAN (R 4.0.3)
#>  R.oo           1.24.0  2020-08-26 [1] CRAN (R 4.0.3)
#>  R.utils        2.10.1  2020-08-26 [1] CRAN (R 4.0.3)
#>  R6             2.5.0   2020-10-28 [1] CRAN (R 4.0.3)
#>  rappdirs       0.3.3   2021-01-31 [1] CRAN (R 4.0.3)
#>  Rcpp           1.0.6   2021-01-15 [1] CRAN (R 4.0.3)
#>  rematch2       2.1.2   2020-05-01 [1] CRAN (R 4.0.2)
#>  reticulate     1.20    2021-05-03 [1] CRAN (R 4.0.5)
#>  rlang          0.4.11  2021-04-30 [1] CRAN (R 4.0.5)
#>  rmarkdown      2.8     2021-05-07 [1] CRAN (R 4.0.5)
#>  sass           0.3.1   2021-01-24 [1] CRAN (R 4.0.3)
#>  scales         1.1.1   2020-05-11 [1] CRAN (R 4.0.2)
#>  sessioninfo    1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
#>  spacyr         1.2.1   2020-03-04 [1] CRAN (R 4.0.2)
#>  stringi        1.6.1   2021-05-10 [1] CRAN (R 4.0.2)
#>  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
#>  styler         1.4.1   2021-03-30 [1] CRAN (R 4.0.5)
#>  textplot       0.1.4   2020-10-10 [1] CRAN (R 4.0.5)
#>  tibble         3.1.1   2021-04-18 [1] CRAN (R 4.0.2)
#>  tidygraph      1.2.0   2020-05-12 [1] CRAN (R 4.0.4)
#>  tidyr          1.1.3   2021-03-03 [1] CRAN (R 4.0.4)
#>  tidyselect     1.1.1   2021-04-30 [1] CRAN (R 4.0.2)
#>  tweenr         1.0.2   2021-03-23 [1] CRAN (R 4.0.2)
#>  utf8           1.2.1   2021-03-12 [1] CRAN (R 4.0.2)
#>  vctrs          0.3.8   2021-04-29 [1] CRAN (R 4.0.2)
#>  viridis        0.6.0   2021-04-15 [1] CRAN (R 4.0.5)
#>  viridisLite    0.4.0   2021-04-13 [1] CRAN (R 4.0.5)
#>  withr          2.4.2   2021-04-18 [1] CRAN (R 4.0.5)
#>  xfun           0.22    2021-03-11 [1] CRAN (R 4.0.4)
#>  yaml           2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] C:/Users/user/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.2/library