Demo of spacyr and cleanNLP in R
spacyr
About spacyr
spacyr is an R package that calls spaCy via the reticulate package.
Basically, it is designed for performing POS tagging, dependency labeling, and named entity recognition using models that can be installed via pip.
Currently, you cannot modify the pipeline, retrain models, or load custom models through this package. You should probably write Python code yourself for those tasks.
Notes
- spaCy officially added Japanese models in version 2.3, so Japanese models are available in that version and later.
- The issue in question has already been closed, so installing the latest version of spaCy should generally be fine.
- Multi-process settings such as nlp.pipe(n_process=2) cannot be passed. Multi-threading is supported on the Python side (internally, spacyr runs a Python script that wraps spaCy using itertools).
- While spaCy itself can use a GPU via spacy.prefer_gpu(), GPU usage is not supported through spacyr (furthermore, spacyr::spacy_install currently does not support installing a GPU-enabled build).
Usage
(1) Installation via conda
If you use a Miniconda environment, the natural approach is to install all the Python libraries via conda.
While it is easy to set up, there is a trap specific to Windows: the following steps require running R with administrator privileges. If using RStudio, right-click the shortcut and select "Run as administrator."
Executing the following script as an administrator will download spaCy and the Japanese model ja_core_news_sm into a condaenv named spacy_condaenv (Japanese models come in _sm, _md, and _lg sizes).
Note that on non-Windows systems, you can also isolate the environment using virtualenv; in that case, replace spacy_install with spacy_install_virtualenv.
spacyr::spacy_install(lang_models = "ja_core_news_sm")
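As noted above, on non-Windows systems you can isolate the environment with virtualenv instead. A sketch, assuming spacy_install_virtualenv accepts the same lang_models argument as spacy_install:

```r
# virtualenv-based setup on macOS/Linux (sketch; assumes spacy_install_virtualenv
# takes the same lang_models argument as spacy_install)
spacyr::spacy_install_virtualenv(lang_models = "ja_core_news_sm")
```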
Load the model to use it.
spacyr::spacy_initialize(model = "ja_core_news_sm")
As a side note, in spaCy > 2.3, the Japanese tokenizer has migrated from fugashi to sudachipy (sudachidict-core), so installing spaCy will automatically install sudachipy. (spaCy also depends on natto-py, but that is for Korean; using the Japanese model should not depend on MeCab).
Especially when using large models, keeping them in memory might not be ideal, so it is recommended to run the following command after you finish.
spacyr::spacy_finalize()
# It seems that once you finalize, you need to restart the R session to initialize again.
(2) Installation via pip with GiNZA
There are cases where you might want to use pip even in a conda environment, such as when you want to use GiNZA models. Since mixing conda and pip can sometimes break the Miniconda environment itself, it's not generally recommended. However, it can work if you create a separate environment and use only pip within it without mixing with conda packages.
With the latest CRAN release of spacyr (1.2.1), you cannot pin and install a specific spaCy v2.x through spacy_install. To use GiNZA models, you can instead set up the environment with reticulate directly:
Note that R must still be run with administrator privileges, and on Windows, Visual C++ is required to build certain dependencies when installing via pip. Also, if you have already created spacy_condaenv via conda, the environments will get mixed; in that case, delete the environment to start fresh or use a different environment name.
require(reticulate)
# Create a fresh conda environment and install spaCy v2.x into it via pip
conda_create("spacy_condaenv", python_version = "3.9")
conda_install("spacy_condaenv", packages = "spacy<3", pip = TRUE)

require(spacyr)
spacy_download_langmodel("ja_core_news_sm")
# Install GiNZA via pip into the same environment
conda_install("spacy_condaenv", packages = "ginza", pip = TRUE)
Now you can load the GiNZA model. You can call it directly without needing reticulate::import("ginza").
spacyr::spacy_initialize(model = "ja_ginza")
Don't forget to finalize when you're done.
spacyr::spacy_finalize()
Parsing
Let's try using it after loading ja_ginza. Here is how to call only the tokenizer.
spacyr::spacy_tokenize(
c("望遠鏡で泳ぐ彼女を見た", "頭が赤い魚を食べる猫", "外国人参政権")
)
#> $text1
#> [1] "望遠鏡" "で" "泳ぐ" "彼女" "を" "見" "た"
#>
#> $text2
#> [1] "頭" "が" "赤い" "魚" "を" "食べる" "猫"
#>
#> $text3
#> [1] "外国人参政権"
To perform labeling, call the spacy_parse function. Since lemmatization doesn't work well with Japanese models, set lemma = FALSE to avoid warnings.
spacyr::spacy_parse(
c("望遠鏡で泳ぐ彼女を見た", "頭が赤い魚を食べる猫", "外国人参政権"),
lemma = FALSE
)
#> doc_id sentence_id token_id token pos entity
#> 1 text1 1 1 望遠鏡 NOUN
#> 2 text1 1 2 で ADP
#> 3 text1 1 3 泳ぐ VERB
#> 4 text1 1 4 彼女 PRON
#> 5 text1 1 5 を ADP
#> 6 text1 1 6 見 VERB
#> 7 text1 1 7 た AUX
#> 8 text2 1 1 頭 NOUN Animal_Part_B
#> 9 text2 1 2 が ADP
#> 10 text2 1 3 赤い ADJ Nature_Color_B
#> 11 text2 1 4 魚 NOUN
#> 12 text2 1 5 を ADP
#> 13 text2 1 6 食べる VERB
#> 14 text2 1 7 猫 NOUN Mammal_B
#> 15 text3 1 1 外国人参政権 NOUN
Since the models follow Universal Dependencies, you can also output dependency relations.
spacyr::spacy_parse(
c("望遠鏡で泳ぐ彼女を見た", "頭が赤い魚を食べる猫", "外国人参政権"),
dependency = TRUE,
lemma = FALSE,
pos = FALSE
)
#> doc_id sentence_id token_id token head_token_id dep_rel entity
#> 1 text1 1 1 望遠鏡 3 obl
#> 2 text1 1 2 で 1 case
#> 3 text1 1 3 泳ぐ 4 acl
#> 4 text1 1 4 彼女 6 obj
#> 5 text1 1 5 を 4 case
#> 6 text1 1 6 見 6 ROOT
#> 7 text1 1 7 た 6 aux
#> 8 text2 1 1 頭 3 nsubj Animal_Part_B
#> 9 text2 1 2 が 1 case
#> 10 text2 1 3 赤い 4 acl Nature_Color_B
#> 11 text2 1 4 魚 6 obj
#> 12 text2 1 5 を 4 case
#> 13 text2 1 6 食べる 7 acl
#> 14 text2 1 7 猫 7 ROOT Mammal_B
#> 15 text3 1 1 外国人参政権 1 ROOT
There are dedicated functions for extracting noun_chunks and for named entity recognition.
Since the return value of spacyr::spacy_parse is a data frame, you could do this on the R side using dplyr or similar, but using the Python side might be faster depending on the environment.
spacyr::spacy_parse(
c("望遠鏡で泳ぐ彼女を見た", "頭が赤い魚を食べる猫", "外国人参政権"),
lemma = FALSE,
entity = TRUE
) %>%
spacyr::entity_extract()
#>
#> doc_id sentence_id entity entity_type
#> 1 text2 1 頭 Animal
#> 2 text2 1 赤い Nature
#> 3 text2 1 猫 Mammal
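As mentioned above, the same result can be approximated on the R side with dplyr. A sketch that simply keeps the rows whose entity column is non-empty (the BIO-style suffixes such as _B in the output above are left as-is here):

```r
library(dplyr)

# keep only tokens that were tagged as (part of) a named entity
spacyr::spacy_parse(
  c("望遠鏡で泳ぐ彼女を見た", "頭が赤い魚を食べる猫", "外国人参政権"),
  lemma = FALSE,
  entity = TRUE
) %>%
  filter(entity != "") %>%
  select(doc_id, sentence_id, token, entity)
```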
Displaying dependency graphs
You might wonder if there are graphs like those produced by spacy-streamlit. Yes, there are.
You can use a package called textplot. It can be installed normally from CRAN.
textplot::textplot_dependencyparser normally takes a data frame of udpipe results as an argument, but since you don't need all the columns, you can plot spacyr results with a little processing.
spacyr::spacy_parse(
c("望遠鏡で泳ぐ彼女を見た", "頭が赤い魚を食べる猫", "外国人参政権"),
dependency = TRUE,
lemma = FALSE,
pos = TRUE # upos in UD
) %>%
dplyr::rename(upos = pos) %>% # rename to upos here
dplyr::filter(doc_id == "text1") %>% # needs to be passed one sentence at a time
textplot::textplot_dependencyparser()

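Since textplot_dependencyparser needs one sentence at a time, you can split the parse by doc_id to plot every document. A sketch:

```r
parsed <- spacyr::spacy_parse(
  c("望遠鏡で泳ぐ彼女を見た", "頭が赤い魚を食べる猫"),
  dependency = TRUE,
  lemma = FALSE,
  pos = TRUE
) %>%
  dplyr::rename(upos = pos)

# one plot object per document
plots <- lapply(split(parsed, parsed$doc_id), textplot::textplot_dependencyparser)
```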
cleanNLP
About cleanNLP
This is an R package that allows you to use UDPipe, spaCy, and CoreNLP in a tidy way!
For UDPipe, it uses udpipe as a backend. For spaCy and CoreNLP, it seems to use pip to install Python libraries separately and use them as a backend (it is unrelated to spacyr or coreNLP). Effectively, it is a wrapper for udpipe.
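For reference, initializing the spaCy backend looks roughly like the following. This is a sketch: cnlp_init_spacy requires the spacy Python library to be installed separately via pip, and I have not confirmed that the Japanese model works through this route.

```r
# assumes the spacy Python library and ja_core_news_sm are already installed via pip
cleanNLP::cnlp_init_spacy(model_name = "ja_core_news_sm")
annotation <- cleanNLP::cnlp_annotate(input = iconv("望遠鏡で泳ぐ彼女を見た", to = "UTF-8"))
```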
Usage
cleanNLP::cnlp_init_udpipe will get it running. On Windows, you need to convert input strings to UTF-8 before passing them.
cleanNLP::cnlp_init_udpipe(model_name = "japanese")
annotation <- cleanNLP::cnlp_annotate(input = iconv("望遠鏡で泳ぐ彼女を見た", to = "UTF-8"))
annotation$token
#>
#> # A tibble: 7 x 11
#> doc_id sid tid token token_with_ws lemma upos xpos feats tid_source relation
#> * <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 1 1 望遠鏡 望遠鏡 望遠鏡 NOUN NN <NA> 6 obl
#> 2 1 1 2 で で で ADP PS <NA> 1 case
#> 3 1 1 3 泳ぐ 泳ぐ 泳ぐ VERB VV <NA> 4 acl
#> 4 1 1 4 彼女 彼女 彼女 PRON NP <NA> 6 obj
#> 5 1 1 5 を を を ADP PS <NA> 4 case
#> 6 1 1 6 見 見 見る VERB VV <NA> 0 root
#> 7 1 1 7 た た た AUX AV <NA> 6 aux
To be honest, it doesn't feel much different from udpipe, but cleanNLP does manage the loaded models within an environment inside the package's namespace.
The equivalent in udpipe is the following, so cleanNLP is a bit easier to write.
model_path <- udpipe::udpipe_download_model("japanese")
model <- udpipe::udpipe_load_model(model_path$file_model)
udpipe::udpipe(iconv("これをこうやって使う", to = "UTF-8"), model)
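cnlp_annotate also accepts a data frame input (a column named text, plus an optional doc_id), which is convenient for whole corpora. A sketch, assuming the udpipe backend has been initialized as above:

```r
# annotate a small corpus passed as a data frame
docs <- data.frame(
  doc_id = c("d1", "d2"),
  text = iconv(c("望遠鏡で泳ぐ彼女を見た", "頭が赤い魚を食べる猫"), to = "UTF-8")
)
annotation <- cleanNLP::cnlp_annotate(input = docs)
annotation$token
```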
Which corpus each model is trained on
This is likely current as of May 2021. For information on Japanese UD itself, you should read this.
| Model | Training Corpus | Word Count | Model License |
|---|---|---|---|
| spaCy > 2.3 | UD Japanese-GSD v2.6-NE | 186k | CC BY-SA |
| GiNZA > 4 | UD Japanese-BCCWJ v2.6 | 1,098k | MIT |
| UDPipe 1 | UD Japanese-GSD v2.5 | 186k | CC BY-NC-SA |
- The "Word Count" here refers to the size of the original training corpus, not the model's vocabulary.
- It seems a different annotated corpus (GSK2014-A (2019) BCCWJ edition) is used for the NER model in GiNZA v4.
- As for UDPipe, models trained on UD 2.6 have already been released, but they cannot be used from the R package.
References
- Using Tidytext and SpacyR in R to do Sentiment Analysis on the COVID-19 Update Speeches by the President of Ghana. | Ghana Data Stuff
- spaCy からたどる最近の日本語自然言語処理ライブラリの調査 | ハカセノオト
- 固有表現抽出のアノテーションデータについて - NLP太郎のブログ
Session Information
Tested on Windows 10, Miniconda3 (Miniconda environment itself: Python=3.7.9, conda=4.9.2).
sessioninfo::session_info()
#> - Session info --------------------------------------------------------------------------------
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RStudio
#> language (EN)
#> collate Japanese_Japan.932
#> ctype Japanese_Japan.932
#> tz Asia/Tokyo
#> date 2021-05-24
#>
#> - Packages ------------------------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.3)
#> bslib 0.2.4 2021-01-25 [1] CRAN (R 4.0.3)
#> cli 2.5.0 2021-04-26 [1] CRAN (R 4.0.5)
#> colorspace 2.0-1 2021-05-04 [1] CRAN (R 4.0.5)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.2)
#> data.table 1.14.0 2021-02-21 [1] CRAN (R 4.0.2)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
#> dplyr 1.0.6 2021-05-05 [1] CRAN (R 4.0.2)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.3)
#> farver 2.1.0 2021-02-28 [1] CRAN (R 4.0.2)
#> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3)
#> ggforce 0.3.3 2021-03-05 [1] CRAN (R 4.0.4)
#> ggplot2 3.3.3 2020-12-30 [1] CRAN (R 4.0.3)
#> ggraph 2.0.5 2021-02-23 [1] CRAN (R 4.0.4)
#> ggrepel 0.9.1 2021-01-15 [1] CRAN (R 4.0.3)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.3)
#> graphlayouts 0.7.1 2020-10-26 [1] CRAN (R 4.0.4)
#> gridExtra 2.3 2017-09-09 [1] CRAN (R 4.0.2)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.2)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.0.5)
#> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#> igraph 1.2.6 2020-10-06 [1] CRAN (R 4.0.3)
#> jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.0.5)
#> jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.3)
#> knitr 1.33 2021-04-24 [1] CRAN (R 4.0.5)
#> labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.3)
#> lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.2)
#> magrittr * 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> MASS 7.3-54 2021-05-03 [1] CRAN (R 4.0.5)
#> Matrix 1.3-3 2021-05-04 [1] CRAN (R 4.0.2)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.2)
#> pillar 1.6.0 2021-04-13 [1] CRAN (R 4.0.5)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> png 0.1-7 2013-12-03 [1] CRAN (R 4.0.3)
#> polyclip 1.10-0 2019-03-14 [1] CRAN (R 4.0.3)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.0.5)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.0.3)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.0.3)
#> R.utils 2.10.1 2020-08-26 [1] CRAN (R 4.0.3)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
#> rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.0.3)
#> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.3)
#> rematch2 2.1.2 2020-05-01 [1] CRAN (R 4.0.2)
#> reticulate 1.20 2021-05-03 [1] CRAN (R 4.0.5)
#> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.0.5)
#> rmarkdown 2.8 2021-05-07 [1] CRAN (R 4.0.5)
#> sass 0.3.1 2021-01-24 [1] CRAN (R 4.0.3)
#> scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> spacyr 1.2.1 2020-03-04 [1] CRAN (R 4.0.2)
#> stringi 1.6.1 2021-05-10 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> styler 1.4.1 2021-03-30 [1] CRAN (R 4.0.5)
#> textplot 0.1.4 2020-10-10 [1] CRAN (R 4.0.5)
#> tibble 3.1.1 2021-04-18 [1] CRAN (R 4.0.2)
#> tidygraph 1.2.0 2020-05-12 [1] CRAN (R 4.0.4)
#> tidyr 1.1.3 2021-03-03 [1] CRAN (R 4.0.4)
#> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.0.2)
#> tweenr 1.0.2 2021-03-23 [1] CRAN (R 4.0.2)
#> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.2)
#> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.0.2)
#> viridis 0.6.0 2021-04-15 [1] CRAN (R 4.0.5)
#> viridisLite 0.4.0 2021-04-13 [1] CRAN (R 4.0.5)
#> withr 2.4.2 2021-04-18 [1] CRAN (R 4.0.5)
#> xfun 0.22 2021-03-11 [1] CRAN (R 4.0.4)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] C:/Users/user/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.2/library