Errors and Workarounds for Extracting Text from Wikipedia with wikiextractor

https://dumps.wikimedia.org/jawiki/latest/

The Wikipedia article dumps distributed at the URL above are in XML format, so the body text has to be extracted from them.

This should be easy with a package called wikiextractor, but when I tried it on March 20, 2022, I ran into several problems, so here is the procedure that worked for me.

https://github.com/attardi/wikiextractor

The wikiextractor README introduces the following method:

git clone https://github.com/attardi/wikiextractor
cd wikiextractor
python -m wikiextractor.WikiExtractor <Wikipedia dump file>

However, this method does not work: it fails at runtime with a bdb.BdbQuit exception.

Solution

Install wikiextractor with pip instead, and pin the version to v3.0.4. The error described above appears to occur in versions 3.0.5 and 3.0.6.

Also make sure you are running Python 3.7 or earlier; if you are on 3.8 or later, switch to an older version. On 3.8 or later, even v3.0.4 fails with TypeError: cannot pickle '_io.TextIOWrapper' object.
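
As a quick sanity check before running the extraction, the version constraint above can be tested from Python itself. This is only a minimal sketch: the helper name `is_supported` and the constant `FIRST_FAILING` are made up for illustration and are not part of wikiextractor, and the 3.8 threshold is taken from the behavior reported in this article.

```python
import sys
from typing import Tuple

# First Python release on which this article reports the pickle TypeError
# with wikiextractor 3.0.4 (an assumption based on the article, not on the
# wikiextractor documentation).
FIRST_FAILING = (3, 8)


def is_supported(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if (major, minor) is older than FIRST_FAILING."""
    # Tuples compare element by element, so (3, 10) is correctly treated
    # as newer than (3, 8), which a string comparison would get wrong.
    return tuple(version[:2]) < FIRST_FAILING


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    if is_supported():
        print(f"Python {major}.{minor}: should work with wikiextractor 3.0.4")
    else:
        print(f"Python {major}.{minor}: switch to 3.7 or earlier first")
```

Running this with the interpreter that pip installs into shows whether the pyenv switch has actually taken effect.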

If you are managing Python with pyenv, do the following:

pyenv local 3.7
pip install wikiextractor==3.0.4
wikiextractor <Wikipedia dump file>

This extracts the article body text as plain text.
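
To use the extracted text afterwards, the output files need to be split back into articles. The sketch below assumes the output layout I would expect from wikiextractor's defaults: files named like `wiki_00` under a `text` directory, with each article wrapped in a `<doc id="..." url="..." title="...">` ... `</doc>` block. Neither detail is stated in this article, so check them against your actual output. The function names `iter_docs` and `iter_dump_dir` are hypothetical.

```python
import re
from pathlib import Path
from typing import Dict, Iterator

# Matches one <doc ...>...</doc> block. The attribute order (id, url, title)
# is an assumption about wikiextractor's output format.
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r'(?P<body>.*?)\n?</doc>',
    re.DOTALL,
)


def iter_docs(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <doc> block found in the given string."""
    for m in DOC_RE.finditer(text):
        yield {
            "id": m["id"],
            "url": m["url"],
            "title": m["title"],
            "body": m["body"].strip(),
        }


def iter_dump_dir(root: str = "text") -> Iterator[Dict[str, str]]:
    """Walk an output directory (assumed default: ./text) file by file."""
    for path in sorted(Path(root).rglob("wiki_*")):
        yield from iter_docs(path.read_text(encoding="utf-8"))


# Small made-up sample in the assumed format, for illustration only.
SAMPLE = '''<doc id="1" url="https://example.org/wiki?curid=1" title="First">
First

Body of the first article.
</doc>
<doc id="2" url="https://example.org/wiki?curid=2" title="Second">
Second

Body of the second article.
</doc>
'''

if __name__ == "__main__":
    for doc in iter_docs(SAMPLE):
        print(doc["id"], doc["title"], len(doc["body"]))
```

Reading one file at a time keeps memory use bounded by the size of a single output file rather than the whole dump.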
