
Japanese Text Classification with a Topic Model

Published 2022/05/30

Python

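What you will be able to do after reading this article

Classify your own texts with a topic model (LDA: Latent Dirichlet Allocation)
Because LDA is unsupervised, collecting the texts is all the preparation required; no labeling is needed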

Gensim

A Python library for representing documents as semantic vectors
It can also process text with unsupervised machine learning

Looking at the sample code

Read Gensim's sample code to find out what needs to be passed to LdaModel

from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=10)

common_texts is a list of word lists, one list per document

[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]

The dictionary is a set of word-to-ID pairs. Given data in the common_texts format, Dictionary builds the mapping with duplicate words removed

print(common_dictionary.token2id)
{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}
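To make the ID assignment concrete, here is a minimal pure-Python sketch that reproduces the mapping printed above for the sample data. It is an illustration inferred from that output, not Gensim's own implementation, and `build_token2id` is a hypothetical helper name.

```python
def build_token2id(texts):
    """Assign consecutive integer IDs to unique tokens.

    Assumption: within each document, unseen tokens receive IDs in
    sorted order, which matches the output shown for common_texts.
    """
    token2id = {}
    for tokens in texts:
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


sample_texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

token2id = build_token2id(sample_texts)
print(token2id)
```

Each word appears once in the mapping no matter how many documents contain it, which is what "duplicates removed" means here.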

bag-of-words
Represents a document as pairs of (word ID, number of occurrences), ignoring word order

common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
print(common_corpus)
[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]
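The conversion itself is just counting. The following sketch, which assumes the word-to-ID mapping printed earlier and uses a hypothetical helper `to_bow`, shows the idea with the standard library only. Like doc2bow with its default settings, it skips words that are not in the dictionary.

```python
from collections import Counter

# Word-to-ID mapping copied from the token2id output of the sample data
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'response': 3,
            'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8,
            'trees': 9, 'graph': 10, 'minors': 11}


def to_bow(tokens, token2id):
    """Return sorted (word_id, count) pairs for the known words in tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


bow = to_bow(['system', 'human', 'system', 'eps'], token2id)
print(bow)  # [(1, 1), (5, 2), (8, 1)]
```

The word "system" occurs twice in the document, so its ID 5 is paired with a count of 2, matching the fourth entry of common_corpus.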

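In short, the data flows through the following steps

Creating the list of texts

This step depends on the task at hand, so it is wrapped in a function called get_text_list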
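Since the body of get_text_list is left open, here is one hypothetical version as a starting point. It assumes the texts are stored one per line in a UTF-8 file; the file name `texts.txt` and the `path` parameter are assumptions made for illustration only, and a real task might read from a database or an API instead.

```python
from pathlib import Path


def get_text_list(path="texts.txt"):
    """Return one document per non-empty line of a UTF-8 text file.

    Hypothetical loader: the file name and the one-document-per-line
    layout are assumptions, not part of the original article.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Because `path` has a default value, the function can still be called as get_text_list() with no arguments.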

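Creating the list of word lists

The texts are Japanese, so they need morphological analysis to be split into words.
This article uses sudachipy, which makes morphological analysis easy from Python

Normalizing the text

Remove unneeded information from the text and unify its format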

import pandas as pd

def normalize_text_list(text_list):
    text_series = pd.Series(text_list)
    text_series = text_series.str.normalize("NFKC")  # Unicode normalization (e.g. full-width alphanumerics to half-width)
    text_series = text_series.str.lower()  # lowercase English letters
    text_series = text_series.replace('@[0-9a-zA-Z_]{1,15}', "", regex=True)  # remove @user IDs
    text_series = text_series.replace("　", "", regex=True)  # remove full-width spaces
    text_series = text_series.replace(" ", "", regex=True)  # remove half-width spaces
    text_series = text_series.replace("\n", "", regex=True)  # remove line breaks
    return list(text_series)

text_list = get_text_list()
text_list = normalize_text_list(text_list)
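To see what these steps do to a single string, here is a standard-library sketch of the same pipeline. The helper name `normalize_text` and the sample sentence are made up for illustration; the steps mirror the pandas version above one by one.

```python
import re
import unicodedata


def normalize_text(text):
    """Apply the same normalization steps to one string."""
    text = unicodedata.normalize("NFKC", text)  # full-width to half-width, etc.
    text = text.lower()                         # lowercase English letters
    text = re.sub(r"@[0-9a-zA-Z_]{1,15}", "", text)  # remove @user IDs
    text = re.sub(r"[ \u3000\n]", "", text)     # remove spaces and line breaks
    return text


sample = "@user_01 ＡＷＳ　の Ｌａｍｂｄａ\nが便利"
print(normalize_text(sample))  # awsのlambdaが便利
```

NFKC already maps the full-width space (U+3000) to an ordinary space, so the explicit full-width-space removal acts only as a safety net.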

Split each text into words with sudachipy

import re

from sudachipy import dictionary
from sudachipy import tokenizer

# Build the tokenizer once; reloading the dictionary on every call is wasteful
tokenizer_obj = dictionary.Dictionary().create()
split_mode = tokenizer.Tokenizer.SplitMode.C
kana_re = re.compile("^[ぁ-ゖ]+$")  # matches tokens written only in hiragana

def sudachipy_tokenizer(text):
    tokens = tokenizer_obj.tokenize(text, split_mode)
    token_list = [t.dictionary_form() for t in tokens]  # use the dictionary (base) form
    token_list = [t.lower() for t in token_list]
    token_list = [t for t in token_list if len(t) != 1]  # drop single-character tokens
    token_list = [t for t in token_list if not kana_re.match(t)]  # drop hiragana-only tokens
    return token_list

common_texts = [sudachipy_tokenizer(text) for text in text_list]
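The two filters after tokenization are easier to follow in isolation. The sketch below applies them to a hand-written token list (not real sudachipy output) using only the standard library; `filter_tokens` is a hypothetical helper name.

```python
import re

KANA_ONLY = re.compile("^[ぁ-ゖ]+$")  # tokens written only in hiragana


def filter_tokens(tokens):
    """Lowercase tokens, then drop single-character and hiragana-only ones."""
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if len(t) != 1]
    return [t for t in tokens if not KANA_ONLY.match(t)]


# Hand-written example tokens, assumed to be already in dictionary form
sample_tokens = ["Amazon", "ECS", "と", "する", "方法", "学ぶ", "モニタリング", "を", "こと"]
print(filter_tokens(sample_tokens))
# ['amazon', 'ecs', '方法', '学ぶ', 'モニタリング']
```

Particles such as "と" and "を" and generic words such as "する" and "こと" are removed, while words containing kanji, katakana, or Latin letters remain. This acts as a rough stop-word filter, at the cost of also dropping any content word written purely in hiragana.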

Building the Dictionary and the bag-of-words list

Now that we have a list of word lists, the rest is the same as in the sample code

from gensim.corpora.dictionary import Dictionary
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

Running LDA

Set num_topics to the number of topics you want to sort the texts into

from gensim.models import LdaModel
lda = LdaModel(common_corpus, num_topics=4)

Classifying with the LDA model

Passing a document's BoW to lda.get_document_topics returns its topic probabilities, which can be used to classify it

def get_document_topics(text, lda, gensim_dict):
    # Apply the same normalization and tokenization that were used for training
    normalized = normalize_text_list([text])[0]
    tokens = sudachipy_tokenizer(normalized)
    bow = gensim_dict.doc2bow(tokens)
    lda_result = lda.get_document_topics(bow)  # list of (topic_id, probability)
    print(lda_result)
    # Return the (topic_id, probability) pair with the highest probability
    return max(lda_result, key=lambda x: x[1])

text = "Amazon ECSとAmazon EKSの主要なメトリクスをモニタリングする方法を学び、コンテナ化されたAWSアプリケーションを大規模に追跡するための戦略を解説しています。無料のeBookを今すぐダウンロードしてください。"
topic = get_document_topics(text, lda, common_dictionary)
print(topic)  # e.g. (0, 0.62431365)
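Taking the maximum always yields some topic, even when no topic clearly dominates. One possible refinement, sketched below with only the standard library, is to accept the top topic only above a probability threshold. The helper `pick_topic`, the `min_prob` value of 0.5, and the example probabilities are all assumptions for illustration.

```python
def pick_topic(doc_topics, min_prob=0.5):
    """Return the most probable topic ID, or None if it is not confident enough.

    doc_topics is a list of (topic_id, probability) pairs in the shape
    returned by lda.get_document_topics.
    """
    if not doc_topics:  # max() would raise ValueError on an empty list
        return None
    topic_id, prob = max(doc_topics, key=lambda pair: pair[1])
    return topic_id if prob >= min_prob else None


# Hypothetical result for a 4-topic model
example_result = [(0, 0.62), (1, 0.13), (2, 0.13), (3, 0.12)]
print(pick_topic(example_result))                # 0
print(pick_topic(example_result, min_prob=0.7))  # None
```

Documents that return None can be set aside as "unclassified" instead of being forced into the nearest topic.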