Getting started with sentiment analysis (natural language processing) using SudachiPy

nihei

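Articles I want to try (BERT, sentiment analysis, Python)

[Hands-on] Let's do sentiment analysis with Python and BERT!

[Hands-on] Let's do sentiment analysis with Python and janome!

I used SudachiPy in my graduation research and I like Sudachi, so I plan to follow these articles but swap in Sudachi for the morphological segmentation step only.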


To start with, a quick SudachiPy refresher...

from sudachipy import tokenizer
from sudachipy import dictionary

text = '外国人参政権'
dict_name = 'small'  # one of 'small', 'core', 'full'; renamed so it does not shadow the built-in dict
mode = tokenizer.Tokenizer.SplitMode.A
tokenizer_obj = dictionary.Dictionary(config_path=None, resource_dir=None, dict=dict_name).create()

# Run morphological analysis
token = tokenizer_obj.tokenize(text, mode)

print(token)

$ python3 emotionalanalyzer.py
外国 人 参政 権
## modeA , dict = small
$ python3 emotionalanalyzer.py
外国 人 参政 権
## modeA , dict = core
$ python3 emotionalanalyzer.py
外国 人参 政権
## modeA , dict = full

Yep, looks good!


from sudachipy import tokenizer
from sudachipy import dictionary

text = '外国人参政権'
# 'full' matches the output below; in the earlier run, 'small' split this text as 外国 / 人 / 参政 / 権
dict_name = 'full'
mode = tokenizer.Tokenizer.SplitMode.A
tokenizer_obj = dictionary.Dictionary(config_path=None, resource_dir=None, dict=dict_name).create()

# Run morphological analysis
token = tokenizer_obj.tokenize(text, mode)

# Print the surface form, part of speech, reading, and normalized form of each morpheme
for t in token:
    row = (t.surface(), t.part_of_speech(), t.reading_form(), t.normalized_form())
    print(row)

('外国', ('名詞', '普通名詞', '一般', '*', '*', '*'), 'ガイコク', '外国')
('人参', ('名詞', '普通名詞', '一般', '*', '*', '*'), 'ニンジン', '人参')
('政権', ('名詞', '普通名詞', '一般', '*', '*', '*'), 'セイケン', '政権')

Yep, looks good!


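Now that the refresher is done, from here on I'll work through this one!

So far I've gotten as far as storing the verbs, nouns, and adjectival nouns (形状詞) in a list.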

from sudachipy import tokenizer
from sudachipy import dictionary

text = '手ブレ補正はお手軽だ'
dict_name = 'full'
mode = tokenizer.Tokenizer.SplitMode.A
tokenizer_obj = dictionary.Dictionary(config_path=None, resource_dir=None, dict=dict_name).create()

# Run morphological analysis
token = tokenizer_obj.tokenize(text, mode)

# Keep the normalized form of nouns, adjectival nouns, and verbs
words = []
for t in token:
    if t.part_of_speech()[0] in ['名詞','形状詞','動詞']:
        words.append(t.normalized_form())
print(words)

['手振れ', '補正', '手軽']

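By the way, t.normalized_form() uses Sudachi's built-in normalization, which is why 手ブレ in the input comes out as 手振れ.

I forgot to mention: Sudachi's mode A is the mode that splits text most finely, and dict='full' means the largest of the official Sudachi dictionaries is used.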
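
The filtering step above can be kept separate from the tokenizer, which makes it easy to check without a dictionary installed. The following is a minimal sketch rather than the original code: `extract_content_words` and `tokenize_with_sudachi` are hypothetical helper names, and the tokenizer is built the same way as in the snippet above.

```python
from typing import Iterable, List

# POS tags kept by the snippet above: nouns, adjectival nouns, verbs.
TARGET_POS = ('名詞', '形状詞', '動詞')


def extract_content_words(morphemes: Iterable, target_pos=TARGET_POS) -> List[str]:
    """Return normalized forms of morphemes whose top-level POS is in target_pos.

    Accepts any objects that expose part_of_speech() and normalized_form(),
    the two morpheme methods used in the snippet above.
    """
    return [m.normalized_form() for m in morphemes
            if m.part_of_speech()[0] in target_pos]


def tokenize_with_sudachi(text: str, dict_name: str = 'full'):
    """Tokenize text in split mode A (hypothetical wrapper).

    Requires SudachiPy and the chosen dictionary to be installed.
    """
    # Imported lazily so the filter above can be used without SudachiPy.
    from sudachipy import dictionary, tokenizer
    tokenizer_obj = dictionary.Dictionary(dict=dict_name).create()
    return tokenizer_obj.tokenize(text, tokenizer.Tokenizer.SplitMode.A)
```

With SudachiPy and the full dictionary installed, `extract_content_words(tokenize_with_sudachi('手ブレ補正はお手軽だ'))` should reproduce the list printed above.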


Next, download the dictionary from this site, place it in the same directory as the script,
and run the code below.

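# Load the Japanese Sentiment Polarity Dictionary (nouns)
with open('pn.csv.m3.120408.trim.csv', encoding='utf-8') as f:
    lines = f.readlines()
    # Column 1 is the word, column 2 is its polarity label
    dic = {x.split('\t')[0]: x.split('\t')[1] for x in lines}

# Look up the polarity of each word ('-' when the word is not in the dictionary)
for word in words:
    judge = dic.get(word, '-')
    print(f'{word} : {judge}')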

['手振れ', '補正', '手軽']
手振れ : -
補正 : e
手軽 : p
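
The lookup above relies on each dictionary line being tab-separated, with the word in the first column and the label in the second. Here is a minimal sketch of the same parsing on an in-memory sample. `parse_polarity_lines` is a hypothetical helper, the sample rows are illustrative (their labels come from the output above and the third column is a placeholder), and skipping lines without a tab is an assumption rather than the original behavior.

```python
from typing import Dict, Iterable


def parse_polarity_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map each word (column 1) to its polarity label (column 2)."""
    dic = {}
    for line in lines:
        cols = line.rstrip('\r\n').split('\t')
        if len(cols) < 2:
            continue  # assumption: silently skip blank or malformed lines
        dic[cols[0]] = cols[1]
    return dic


# Illustrative rows only; the real file is read from disk as shown above.
sample = ['補正\te\tplaceholder\n', '手軽\tp\tplaceholder\n', '\n']
dic = parse_polarity_lines(sample)
for word in ['手振れ', '補正', '手軽']:
    print(f"{word} : {dic.get(word, '-')}")
```

Passing an open file object instead of `sample` works the same way, since a file iterates line by line.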

Looking good!


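All that's left is to wrap this in a class. Since I'm using Sudachi, I changed a few parts,
and since my current environment is a Mac, I took that into account too, so I rewrote the original code a little.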

emotionalanalyzer.py

from sudachipy import tokenizer
from sudachipy import dictionary
import codecs

class SentimentAnalysis:
    def __init__(self,dic_path):
        self.words = []
        self.dic = self.read_dic(dic_path)

    def analyze(self):
        '''
        Sentiment analysis: count positive, negative, neutral, and unmatched words
        '''
        posi = 0
        nega = 0
        neut = 0
        err = 0

        for word in self.words:
            res = self.dic.get(word,'-')
            if res == 'p':
                posi += 1
            elif res == 'n':
                nega += 1
            elif res == 'e':
                neut += 1
            else:
                err += 1
        return posi,nega,neut,err

    def word_separation(self, text):
        # Tokenize with the full dictionary in split mode A (the finest split)
        dict_name = 'full'
        mode = tokenizer.Tokenizer.SplitMode.A
        tokenizer_obj = dictionary.Dictionary(config_path=None, resource_dir=None, dict=dict_name).create()
        token = tokenizer_obj.tokenize(text, mode)
        # Keep the normalized form of nouns, adjectival nouns, and verbs
        words = []
        for t in token:
            if t.part_of_speech()[0] in ['名詞','形状詞','動詞']:
                words.append(t.normalized_form())
        return words

    def read_dic(self, filename):
        # Each line is tab-separated: column 1 is the word, column 2 is its polarity label
        with codecs.open(filename, 'r', 'utf-8', 'ignore') as f:
            dic = {}
            for line in f:
                cols = line.split('\t')
                dic[cols[0]] = cols[1]
        return dic

    def read_file(self,filename='./pn.csv.m3.120408.trim.csv',encoding='utf-8'):
        with codecs.open(filename,'r',encoding,'ignore') as f:
            self.read_text(f.read())

    def read_text(self, text):
        # Build the word list (nouns, adjectival nouns, verbs) via morphological analysis
        self.words = self.word_separation(text)
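
As a point of comparison, the if/elif chain in `analyze` can also be written with `collections.Counter`. This is a sketch under assumptions, not part of the class above: `count_polarity` is a hypothetical function, and it folds every label other than p, n, and e into the error count, as the `else` branch does.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_polarity(words: Iterable[str], dic: Dict[str, str]) -> Tuple[int, int, int, int]:
    """Return (positive, negative, neutral, error) counts for words."""
    # Words missing from dic get '-', which ends up in the error count below.
    labels = Counter(dic.get(word, '-') for word in words)
    posi, nega, neut = labels['p'], labels['n'], labels['e']
    err = sum(labels.values()) - posi - nega - neut
    return posi, nega, neut, err


# Labels taken from the lookup output earlier in this scrap.
example_dic = {'補正': 'e', '手軽': 'p'}
print(count_polarity(['手振れ', '補正', '手軽'], example_dic))  # (1, 0, 1, 1)
```

A `Counter` returns 0 for labels it has never seen, so no special handling is needed when a sentence has no positive or negative words.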

index.py

from emotionalanalyzer import SentimentAnalysis

# Create an instance
sa = SentimentAnalysis('./pn.csv.m3.120408.trim.csv')

# Sentences to analyze
lines = [
    'やはり1型のセンサーは素晴らしい画質です。スマホとは比べ物になりません。',
    '悩んで購入しましたが、高いわりに画質がイマイチでした。後悔しています。',
    '小型軽量で電池も長持ちするし、お勧めできる製品です。',
    '購入して１年ちょっとで壊れてしまいました。残念です。修理に出します。',
    'このサイズでAPC-Cセンサー搭載とは驚きです。値段は少し高めですが、満足です'
]

# Analyze one line at a time
for line in lines:
    print(line)         # show the original sentence
    sa.read_text(line)  # tokenize the sentence and store its words
    res = sa.analyze()  # run the sentiment analysis
    print(res)          # show the result
    print("-----------")

Running this gives...

command_line

やはり1型のセンサーは素晴らしい画質です。スマホとは比べ物になりません。
(1, 0, 0, 5)
-----------
悩んで購入しましたが、高いわりに画質がイマイチでした。後悔しています。
(1, 1, 0, 6)
-----------
小型軽量で電池も長持ちするし、お勧めできる製品です。
(3, 0, 1, 4)
-----------
購入して１年ちょっとで壊れてしまいました。残念です。修理に出します。
(0, 1, 1, 7)
-----------
このサイズでAPC-Cセンサー搭載とは驚きです。値段は少し高めですが、満足です
(2, 1, 1, 4)
-----------

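Each tuple holds the counts of
(positive, negative, neutral, error), where error covers words that are not in the dictionary or whose label is something other than p, n, or e.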
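
To compare sentences at a glance, the four counts can be reduced to a single number. The sketch below is an assumption, not something taken from the original code: `polarity_score` is a hypothetical helper that ignores the neutral and error counts and returns (positive - negative) / (positive + negative), or 0.0 when a sentence has no polar words at all.

```python
from typing import Tuple


def polarity_score(counts: Tuple[int, int, int, int]) -> float:
    """Reduce (positive, negative, neutral, error) to a score in [-1.0, 1.0]."""
    posi, nega, _neut, _err = counts
    total = posi + nega
    if total == 0:
        return 0.0  # assumption: no polar words means a neutral score
    return (posi - nega) / total


# Tuples copied from the command-line output above.
for counts in [(1, 0, 0, 5), (1, 1, 0, 6), (3, 0, 1, 4), (0, 1, 1, 7), (2, 1, 1, 4)]:
    print(counts, round(polarity_score(counts), 2))
```

Because most words fall into the error count here, a score built from only one or two polar words is fragile, so it is best read as a rough indicator.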

Yay, it works!

Next time, I plan to work through this one, which uses BERT.

This scrap was closed on 2023/06/29.