🥬

Speeding up MeCab word segmentation (wakati-gaki) with parallel processing

Published on 2021/07/18

Python

MeCab

tech

Motivation

MeCab can segment a text file into words (wakati-gaki) straight from the command line, like this.

mecab -Owakati ./wiki.txt -o ./wiki-out.txt

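However, this invocation cannot be parallelized as-is, so processing a large text takes quite a while.
As a test, I timed it on Japanese Wikipedia (3.1 GiB) with the time command and got the following.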

time mecab -Owakati ./wiki.txt -o ./wiki-out.txt
________________________________________________________
Executed in  571.49 secs   fish           external
   usr time  537.57 secs  982.00 micros  537.57 secs
   sys time   29.01 secs  432.00 micros   29.01 secs

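Waiting almost 10 minutes is pretty painful, so I wanted to speed it up.
The articles I could find on speeding up MeCab all get their speedup by processing multiple files in parallel, and I could not find one that looks at speeding up a single file, so I am leaving notes on the things I tried.
All experiments were run on a 12-core / 24-thread CPU.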

Addendum: 2021-07-18

Someone on Twitter showed me how to do the parallel processing with the parallel command, so I added a section at the end of this article.

Conclusion

I got it done with the following bash script.
It splits the file, segments each piece in parallel, and then sorts the output files by name and concatenates them.

run.sh

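# Work directory for the chunks
mkdir -p ./tmp
# Split into 500,000-line chunks with 4-digit numeric suffixes (wiki-0000.txt, ...)
split -a 4 -l 500000 -d --additional-suffix .txt wiki.txt ./tmp/wiki-

# Run one mecab per chunk, up to $(nproc) at a time (-I already implies one file per command)
find ./tmp -name 'wiki-*.txt' | xargs -P $(nproc) -I {} mecab -Owakati {} -o {}.out

# Concatenate the outputs in chunk order
cat `find ./tmp -name 'wiki-*.txt.out' | sort` > wiki-out.txt

rm -r ./tmp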
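
The final concatenation step relies on sort putting the chunk outputs back in their original order. That works because `split -a 4 -d` produces fixed-width, zero-padded numeric suffixes, so plain string order matches numeric order. A small Python illustration of the point (the file names are made up here to mirror the script's naming, and 12 chunks is an arbitrary count):

```python
# Illustration only: names mimic `split -a 4 -d --additional-suffix .txt`
# output plus the ".out" suffix added by the mecab step.
padded = [f"./tmp/wiki-{i:04d}.txt.out" for i in range(12)]
unpadded = [f"./tmp/wiki-{i}.txt.out" for i in range(12)]

# Zero-padded names: a plain string sort already equals chunk order.
padded_sorted_ok = sorted(padded) == padded

# Unpadded names: "wiki-10" sorts before "wiki-2", so chunks would be misordered.
unpadded_sorted_ok = sorted(unpadded) == unpadded

print(padded_sorted_ok)
print(unpadded_sorted_ok)
```

With four digits the suffix covers up to 10,000 chunks before the width would have to grow.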

The result is below: about 7.4 times faster than the plain mecab command in wall-clock time (571.49 s down to 77.23 s).

time bash run.sh 
________________________________________________________
Executed in   77.23 secs   fish           external 
   usr time  643.31 secs  1119.00 micros  643.31 secs 
   sys time   77.29 secs  497.00 micros   77.29 secs

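File I/O can become the bottleneck, so be careful on NFS or other storage where reads and writes are slow. In that case, the Python-based parallel processing shown in the next section may be worth trying.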

There may well be better ways to do this, so if you have any ideas, please let me know in the comments.

Trial and error

multiprocessing with Python

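MeCab has a Python wrapper library, so it should be possible to control the parallel processing from Python.
A naive multi-process version written with Python's multiprocessing module looks like this.
It assumes the text file to segment is in the current directory.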

src/parallel-mecab-1.py

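import os
import MeCab
from multiprocessing import Pool
from pathlib import Path


def run(text):
    # Tagger objects cannot be pickled, so one is built inside the worker.
    # This happens once per input line.
    mecab = MeCab.Tagger("-Owakati")
    return mecab.parse(text)

def main():
    with Path("./wiki.txt").open() as f, Path("./wiki-out.txt").open("w") as w:
        with Pool(processes=os.cpu_count()) as pool:
            # imap yields results in input order, one line at a time.
            for sentence in pool.imap(run, f):
                # -Owakati output already ends with a newline.
                w.write(sentence)

if __name__ == "__main__":
    main()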

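Tagger instances are not picklable, so I tried initializing one inside each worker process. Written this way, though, a Tagger is initialized for every single line, which is extremely slow. Pointless.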

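The measurement is below. It got far slower: about 22 minutes of wall-clock time, against under 10 minutes for the plain mecab command.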

time poetry run python src/parallel-mecab-1.py  
________________________________________________________
Executed in   22.04 mins   fish           external 
   usr time  199.72 mins    0.00 micros  199.72 mins 
   sys time  252.65 mins  1376.00 micros  252.65 mins
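
The per-line cost in parallel-mecab-1.py comes from building a Tagger inside `run` for every input line. One common way around that is the `initializer` hook of `multiprocessing.Pool`, which runs once per worker process. The following is only a sketch and was not measured in this article: `make_tagger`, `init_worker`, `parse_line` and the `chunksize` value are names and numbers picked here for illustration.

```python
import os
from multiprocessing import Pool
from pathlib import Path

_tagger = None  # one Tagger per worker process, set by init_worker


def make_tagger():
    # Imported lazily so that MeCab is only loaded where a Tagger is built.
    import MeCab
    return MeCab.Tagger("-Owakati")


def init_worker(factory=make_tagger):
    # Runs once in each worker process when the pool starts.
    global _tagger
    _tagger = factory()


def parse_line(line):
    # Reuses the per-process Tagger instead of building one per line.
    return _tagger.parse(line)


def main(src="./wiki.txt", dst="./wiki-out.txt", chunksize=1000):
    with Path(src).open() as f, Path(dst).open("w") as w:
        with Pool(processes=os.cpu_count(), initializer=init_worker) as pool:
            # imap keeps input order; chunksize sends lines to workers in batches.
            for parsed in pool.imap(parse_line, f, chunksize=chunksize):
                # Assumes -Owakati output already ends with a newline.
                w.write(parsed)
```

To run it as a script, call `main()` under an `if __name__ == "__main__":` guard, as in the original file. Passing the factory as an argument also makes it easy to swap in a stand-in object with a `parse` method when MeCab is not installed.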

multiprocessing with Python (load-into-memory version)

Next, I wrote it like this.

src/parallel-mecab-2.py

import os
import MeCab
from multiprocessing import Pool
from pathlib import Path
from more_itertools import divide


def run(text_iter):
    # One Tagger per part, reused for every line in that part.
    mecab = MeCab.Tagger("-Owakati")
    return [mecab.parse(text) for text in text_iter]


def main():
    with Path("./wiki.txt").open() as f, Path("./wiki-out.txt").open("w") as w:
        num_process = os.cpu_count()
        with Pool(processes=num_process) as pool:
            # divide() cuts the lines into num_process consecutive parts;
            # imap returns the parts in input order.
            for sentences in pool.imap(run, divide(num_process, f)):
                # Each parsed line already ends with a newline.
                w.writelines(sentences)


if __name__ == "__main__":
    main()

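In exchange for holding the segmentation results in memory, each worker process (one per CPU reported by os.cpu_count()) initializes the Tagger just once and then simply loops over its share of the lines.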

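The measurement is below. It got much faster (roughly one fifth of the wall-clock time of the plain mecab command), but it ate an enormous amount of memory.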

time poetry run python src/parallel-mecab-2.py

________________________________________________________
Executed in  100.06 secs   fish           external 
   usr time  630.11 secs  1579.00 micros  630.11 secs 
   sys time   60.04 secs    0.00 micros   60.04 secs
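
The memory use in parallel-mecab-2.py has two sources: to cut the input into exactly `num_process` parts, `divide` has to read all the lines first, and each worker hands back its whole share as a single list. Handing out fixed-size batches of lines instead should keep far less data in flight at once. A sketch using only the standard library; `batched_lines` and the batch size are hypothetical and were not part of what this article measured.

```python
from itertools import islice


def batched_lines(lines, size):
    """Yield lists of up to `size` lines, reading the input lazily."""
    it = iter(lines)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Tiny demonstration with an in-memory "file" of 5 lines.
demo = [f"line {i}\n" for i in range(5)]
batches = list(batched_lines(demo, 2))
print([len(b) for b in batches])  # → [2, 2, 1]
```

Passing something like `batched_lines(f, 10000)` to `pool.imap` in place of `divide(num_process, f)` would leave `run` unchanged. The Tagger would then be built once per batch rather than once per process, so the batch size trades initialization cost against memory.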

Fiddling with split

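The script above eats a fair amount of memory, so going through file I/O seemed like the better idea, and I decided to cobble together a shell script.
The plan: split the file with the split command, run mecab on each piece, then sort the segmented output files and concatenate them.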

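# Work directory for the chunks
mkdir -p ./tmp
# Split into $(nproc) chunks of roughly equal size
split -a 4 -n $(nproc) -d --additional-suffix .txt wiki.txt ./tmp/wiki-

# Run one mecab per chunk, up to $(nproc) at a time (-I already implies one file per command)
find ./tmp -name 'wiki-*.txt' | xargs -P $(nproc) -I {} mecab -Owakati {} -o {}.out

# Concatenate the outputs in chunk order
cat `find ./tmp -name 'wiki-*.txt.out' | sort` > wiki-out.txt

rm -r ./tmp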

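The measurement is below. It is quite fast and used hardly any memory, but note that it eats extra storage. File I/O can also become the bottleneck, so think twice before running it on NFS or other environments where reads and writes are heavy.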

time bash run-1.sh 
________________________________________________________
Executed in   84.26 secs   fish           external 
   usr time  643.91 secs    1.44 millis  643.90 secs 
   sys time   74.85 secs    0.61 millis   74.85 secs

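...and having got this far, I noticed that the split command was splitting the file by bytes. That is no good, because a chunk boundary can land in the middle of a line, so I adjusted it a bit.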
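
To see why byte-based splitting is a problem for line-oriented input, here is a small Python illustration. `split_by_bytes` and `split_by_lines` are hypothetical helpers that only approximate the two splitting strategies (they are not how split itself is implemented), and the sample text is made up.

```python
def split_by_bytes(data, n):
    """Cut data into n chunks of (almost) equal byte length, ignoring line breaks."""
    size = -(-len(data) // n)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def split_by_lines(data, lines_per_chunk):
    """Cut data into chunks that hold whole lines only."""
    lines = data.splitlines(keepends=True)
    return [b"".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]


text = "今日は晴れです。\n明日は雨が降るでしょう。\n".encode("utf-8")

byte_chunks = split_by_bytes(text, 2)
line_chunks = split_by_lines(text, 1)

# Byte-based: the first chunk stops partway through the second line.
print(byte_chunks[0].endswith(b"\n"))
# Line-based: every chunk ends on a line boundary.
print(all(c.endswith(b"\n") for c in line_chunks))
```

A sentence cut in half like this gets segmented as two unrelated fragments, and with UTF-8 a cut can even fall inside a multi-byte character. GNU split also documents an `l/N` form of `-n` that makes N chunks without splitting lines; it was not tried in this article.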

Fiddling with split (improved version)

So I changed the split command to split by lines.

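# Work directory for the chunks
mkdir -p ./tmp
# Split into 500,000-line chunks with 4-digit numeric suffixes (wiki-0000.txt, ...)
split -a 4 -l 500000 -d --additional-suffix .txt wiki.txt ./tmp/wiki-

# Run one mecab per chunk, up to $(nproc) at a time (-I already implies one file per command)
find ./tmp -name 'wiki-*.txt' | xargs -P $(nproc) -I {} mecab -Owakati {} -o {}.out

# Concatenate the outputs in chunk order
cat `find ./tmp -name 'wiki-*.txt.out' | sort` > wiki-out.txt

rm -r ./tmp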

time bash run-2.sh 
________________________________________________________
Executed in   77.23 secs   fish           external 
   usr time  643.31 secs  1119.00 micros  643.31 secs 
   sys time   77.29 secs  497.00 micros   77.29 secs

The result looks good too.

Addendum: using the parallel command

After I published this article, someone on Twitter told me that the parallel command is a good fit for this.

mkdir -p ./tmp

cat ./wiki.txt | parallel --blocksize 1073741824 -P $(nproc) --pipe -N 500000 --rpl '{##} $_=sprintf("%04d",$job->seq()-1)' 'mecab -Owakati -o ./tmp/wiki-out.{##}.txt'

cat `find ./tmp -name 'wiki-out.*.txt' | sort` > ./wiki-out.txt

rm -r ./tmp

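With this approach there is no need for the intermediate files that split produces, which are nothing more than pieces of the input, so that is nice.
The measurement is below. Fast.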

time bash run-3.sh
________________________________________________________
Executed in   81.53 secs   fish           external 
   usr time  724.94 secs  746.00 micros  724.94 secs 
   sys time   88.04 secs  410.00 micros   88.04 secs
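
To put the measurements side by side, the wall-clock times reported in this article can be turned into speedup factors relative to the plain mecab command. The labels in the dictionary are shorthand chosen here; the numbers are copied from the time outputs above.

```python
# Wall-clock times in seconds, copied from the `time` outputs in this article.
timings = {
    "mecab (single process)": 571.49,
    "parallel-mecab-1.py": 22.04 * 60,
    "parallel-mecab-2.py": 100.06,
    "split -n (bytes)": 84.26,
    "split -l (lines)": 77.23,
    "parallel --pipe": 81.53,
}

baseline = timings["mecab (single process)"]
speedups = {name: baseline / secs for name, secs in timings.items()}

for name, factor in speedups.items():
    print(f"{name:24s} {factor:5.2f}x")
```

By this measure the line-based split script comes out at about 7.4x and the parallel version at about 7.0x, while parallel-mecab-1.py is below 1x, i.e. slower than the baseline.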