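NLP 100 Exercise 2020 (Rev 2), Chapter 6: Machine Learning, 54. Measuring Accuracy
Problem
54. Measuring accuracy
Measure the accuracy of the logistic regression model trained in problem 52 on both the training data and the evaluation data.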
solution54.py
import re

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def preprocessing(data):
    """Normalize titles and map category letters to integer labels."""
    x = []
    y = []
    label = {"b": 0, "e": 1, "t": 2, "m": 3}
    for title, category in data:
        # Collapse every run of digits into a single "0", then lowercase.
        title = re.sub("[0-9]+", "0", title)
        x.append(title.lower())
        y.append(label[category])
    return x, y


def score(model, X):
    """Return the predicted class for one feature vector and its probability."""
    pred = model.predict([X])
    pred_proba = model.predict_proba([X])[0, pred]
    return pred[0], pred_proba[0]


# Seed NumPy's global RNG so the shuffled splits below are reproducible.
np.random.seed(123)

df = pd.read_csv(
    "chapter06/NewsAggregatorDataset/newsCorpora.csv",
    header=None,
    sep="\t",
    names=["ID", "TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"],
)
# Keep only the five target publishers and the two columns that are needed.
publishers = ["Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", "Daily Mail"]
df = df.loc[df["PUBLISHER"].isin(publishers), ["TITLE", "CATEGORY"]]

# Split 80% / 10% / 10% into train / valid / test.
train, valid_test = train_test_split(df, test_size=0.2, shuffle=True)
valid, test = train_test_split(valid_test, test_size=0.5, shuffle=True)

train = np.array(train)
valid = np.array(valid)
test = np.array(test)

X_train, Y_train = preprocessing(train)
X_valid, Y_valid = preprocessing(valid)
X_test, Y_test = preprocessing(test)

# Fit the TF-IDF vocabulary on the training titles only, then reuse it.
tfidfvectorizer = TfidfVectorizer(min_df=0.001)
X_train = tfidfvectorizer.fit_transform(X_train).toarray()
X_valid = tfidfvectorizer.transform(X_valid).toarray()
X_test = tfidfvectorizer.transform(X_test).toarray()

model = LogisticRegression()
model.fit(X_train, Y_train)

# Predict one row at a time and keep only the predicted class.
train_pred = []
test_pred = []
for X in X_train:
    train_pred.append(score(model, X)[0])
for X in X_test:
    test_pred.append(score(model, X)[0])

train_acc = accuracy_score(Y_train, train_pred)
test_acc = accuracy_score(Y_test, test_pred)
print("train accuracy", train_acc)
print("test accuracy", test_acc)
output
train accuracy 0.9162293853073463
test accuracy 0.8673163418290855
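Both numbers are accuracies, that is, the fraction of examples whose predicted label equals the gold label. As a minimal sketch of that definition, the snippet below computes it by hand. The accuracy helper and the gold and pred lists are hypothetical and made up for illustration; they are not taken from the news corpus.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels using the solution's encoding: b=0, e=1, t=2, m=3.
gold = [0, 1, 2, 3, 1, 0, 1, 2]
pred = [0, 1, 2, 1, 1, 0, 3, 2]

print(accuracy(gold, pred))  # 6 of the 8 predictions match
```

For lists of equal length, this is the same quantity that accuracy_score returns with its default settings.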
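In this solution, the lists train_pred and test_pred store the predicted classes obtained by running the score function on each row of X_train and X_test, respectively. The score function takes a single input X and returns the class predicted by the model together with the probability of that class. The accuracy_score function then computes the accuracy of train_pred against Y_train and of test_pred against Y_test. Finally, the accuracies on the training data and the test data are printed.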
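Calling predict once per row works, but predict also accepts a whole feature matrix, so the same accuracy can usually be obtained in a single call. The sketch below is not the article's code: it uses a small synthetic dataset from make_classification (an assumption made here only so the example runs on its own) to compare row-by-row prediction, batched prediction, and the classifier's own score method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic 4-class stand-in for the TF-IDF features (illustrative assumption).
X, y = make_classification(
    n_samples=200,
    n_features=10,
    n_informative=5,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

# max_iter is raised here only to avoid convergence warnings on the toy data.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Row-by-row prediction, mirroring the loop in the solution.
loop_pred = [model.predict([row])[0] for row in X]

# Batched prediction over the whole matrix in one call.
batch_pred = model.predict(X)

loop_acc = accuracy_score(y, loop_pred)
batch_acc = accuracy_score(y, batch_pred)
score_acc = model.score(X, y)  # for classifiers, score() is the mean accuracy

print(loop_acc, batch_acc, score_acc)
```

On the article's data, the equivalent calls would be model.predict(X_train) and model.predict(X_test), which avoid the Python-level loop over tens of thousands of rows.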