Translated by AI
Errors and Workarounds for Extracting Text from Wikipedia with wikiextractor
The Wikipedia article dump data distributed here is in XML format, so the body text has to be extracted from it.
The wikiextractor package should make this easy, but when I tried it on March 20, 2022, I ran into several issues, so I'll share the method that worked for me.
The wikiextractor README introduces the following method:
git clone https://github.com/attardi/wikiextractor
cd wikiextractor
python -m wikiextractor.WikiExtractor <Wikipedia dump file>
However, following these steps fails at runtime with a bdb.BdbQuit error.
Solution
Install wikiextractor with pip instead of cloning the repository, and pin it to v3.0.4. The bdb.BdbQuit error described above appears to occur in versions 3.0.5 and 3.0.6.
Also, make sure you are running Python 3.7 or earlier, and switch versions if you are on 3.8 or later. On Python 3.8 or later, even v3.0.4 fails with TypeError: cannot pickle '_io.TextIOWrapper' object.
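The version constraint above can be checked before starting a long extraction run. The following is only a minimal sketch: the function names `is_supported` and `check_message` are made up for illustration, and the 3.8 cutoff is taken from my own experience described above, not from the wikiextractor documentation.

```python
import sys

# Cutoff taken from the text above: v3.0.4 failed for me on Python 3.8 and later.
# This is an observation, not an officially documented limit.
FIRST_FAILING_VERSION = (3, 8)


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version is below the cutoff."""
    return tuple(version_info[:2]) < FIRST_FAILING_VERSION


def check_message(version_info=sys.version_info):
    """Build a human-readable message about the given interpreter version."""
    major, minor = version_info[:2]
    if is_supported(version_info):
        return "Python {}.{} should work with wikiextractor 3.0.4".format(major, minor)
    return "Python {}.{} is too new; switch to 3.7 or earlier".format(major, minor)


# Report on the interpreter that is running this script.
print(check_message())
```

Running this with the interpreter you plan to use for wikiextractor tells you whether you need to switch versions first.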
If you are managing Python with pyenv, do the following:
pyenv local 3.7
pip install wikiextractor==3.0.4
wikiextractor <Wikipedia dump file>
This extracts the article body text as plain text.
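For reference, here is a rough sketch of how the extracted output could be read back into Python. It assumes that the output was written with default settings into a directory named text, with files such as text/AA/wiki_00, and that each article is wrapped in a `<doc id="..." url="..." title="...">` ... `</doc>` block. Please check this against your own output, since I have not confirmed it for every version or option. The names `iter_docs` and `iter_output_dir` are made up for this example.

```python
import re
from pathlib import Path
from typing import Dict, Iterator

# Assumed shape of one extracted article (verify against your own output):
#   <doc id="12" url="..." title="Some title">
#   body text ...
#   </doc>
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r"(?P<body>.*?)</doc>",
    re.DOTALL,
)


def iter_docs(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per <doc> block found in the given string."""
    for match in DOC_RE.finditer(text):
        yield {
            "id": match.group("id"),
            "url": match.group("url"),
            "title": match.group("title"),
            "body": match.group("body").strip(),
        }


def iter_output_dir(root: str = "text") -> Iterator[Dict[str, str]]:
    """Walk an output directory (assumed layout: <root>/AA/wiki_00, ...)."""
    for path in sorted(Path(root).glob("*/wiki_*")):
        for doc in iter_docs(path.read_text(encoding="utf-8")):
            yield doc
```

For example, `for doc in iter_output_dir("text"): print(doc["title"])` would list the titles of all extracted articles, provided the layout matches the assumptions above.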