🎅

[Day 25] Recasting a Regression Problem as a Classification Problem [2021 Advent Calendar]

Published 2021/12/26

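This is the final article of my 2021 solo Advent Calendar (machine learning).

The theme is classification.

The dataset is naturally suited to regression, but I use binning to turn it into a classification problem.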

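I also convert each predicted label, such as "1.0 to 2.0", back into a number (the code below uses the bin's lower edge, 1.0 in this example) and compare it with the actual sales to compute RMSE.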

The Colab code is here

Binning

To make the prediction deliberately harder, I split the target into 11 bins.

import pandas as pd

# Bin edges: ten 1.0-wide bins plus one open-ended top bin.
# The last edge is pushed past the maximum so the largest value still falls inside a bin.
bins = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, df["Global_Sales"].max()+10]

# Integer class index (0-10), used as the classification target
df_dropna["Target"] = pd.cut(
    df_dropna["Global_Sales"],
    bins,
    right=False,
    labels=False
    )

hit_labels = ['normal', 'milion_hit', 'double_milion_hit', 'triple_milion_hit', 'quadruple_milion_hit', 
              'quintuple_milion_hit', 'sextaple_milion_hit', 'septaple_milion_hit', 'octaple_milion_hit', 
              'nonuple_milion_hit', 'ten_million_hit']

# Human-readable label for the same bins
df_dropna["Target_Label"] = pd.cut(
    df_dropna["Global_Sales"],
    bins,
    right=False,
    labels=hit_labels
    )
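
To make the half-open bins concrete, here is a minimal sketch on made-up sales figures (the values in `sales` are hypothetical, not taken from the dataset). It uses the same edge layout as the code above and shows which class index each value receives when `right=False`.

```python
import pandas as pd

# Toy sales figures (hypothetical values, in millions of units)
sales = pd.Series([0.3, 1.0, 1.99, 2.0, 9.5, 10.0, 82.7])

# Same layout as above: edges 0.0 .. 10.0 plus an upper edge past the maximum
edges = [float(i) for i in range(11)] + [sales.max() + 10]

# right=False makes every bin [left, right), so 1.0 lands in bin 1, not bin 0
bin_index = pd.cut(sales, edges, right=False, labels=False)
print(bin_index.tolist())
```

Because the bins are closed on the left only, a value exactly equal to the final edge would not belong to any bin, which is presumably why the last edge is set well above the maximum.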

The label names are taken from the reference below.

Let's check the label distribution.

pd.DataFrame(
    [
        df_dropna["Target"].value_counts(normalize=True).sort_index().values,
        y_train.value_counts(normalize=True).sort_index().values,
        y_test.value_counts(normalize=True).sort_index().values,
    ],
    columns = hit_labels,
    index = ["all", "Train", "Test"],
).T

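The data is heavily skewed.

As a simple first attempt, I predict with StratifiedKFold, which keeps the class ratio roughly equal across folds and is therefore suited to heavily imbalanced labels.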
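
As a small illustration of what stratification preserves, the sketch below splits a toy imbalanced label vector (hypothetical data, 90 vs. 10 samples) and counts the minority samples that land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 90 samples of class 0, 10 of class 1 (hypothetical)
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # the features do not matter for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many minority samples end up in each validation fold
minority_per_fold = []
for train_idx, test_idx in skf.split(X, y):
    minority_per_fold.append(int((y[test_idx] == 1).sum()))

print(minority_per_fold)
```

Each fold receives the same share of the minority class, whereas a plain shuffled split could leave a fold with very few or no minority samples.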

n_classes = 11

skf = StratifiedKFold(
    n_splits=5, 
    shuffle=True,
    random_state=SEED
    )

cv_result = []
cv_rmse = []

for i, (train_index, test_index) in enumerate(skf.split(X_train, y_train)):
    X_train_skf, X_test_skf = X_train.drop("Global_Sales", axis=1).iloc[train_index], X_train.drop("Global_Sales", axis=1).iloc[test_index]
    y_train_skf, y_test_skf = y_train.iloc[train_index], y_train.iloc[test_index]

    print(f"fold:{i}")

    # Label distribution
    histogram = pd.concat([
            y_train_skf.value_counts(normalize=True).sort_index(),
            y_test_skf.value_counts(normalize=True).sort_index()
        ], axis=1).T

    histogram.columns = hit_labels
    histogram.index = ["Train", "Test"]

    print("Label distribution")
    display(histogram)

    # Fit and predict
    pipe.fit(X_train_skf, y_train_skf)

    y_pred = pipe.predict(X_test_skf)

    # Build the row/column names for the confusion matrix
    confuse_matrix_columns = np.concatenate([y_test_skf.unique(), np.unique(y_pred)])
    confuse_matrix_columns = np.unique(confuse_matrix_columns)
    confuse_matrix_columns = [hit_labels[x] for x in confuse_matrix_columns]

    print("Confusion matrix")
    display(
        pd.DataFrame(
            confusion_matrix(y_test_skf, y_pred),
            index = confuse_matrix_columns,
            columns = confuse_matrix_columns
        )
    )

    # ROC curves
    y_score = pipe.predict_proba(X_test_skf)
    y_test_skf_bin = label_binarize(y_test_skf, classes=list(range(n_classes)))

    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    for j in range(n_classes):
        fpr[j], tpr[j], _ = roc_curve(y_test_skf_bin[:, j], y_score[:, j])
        roc_auc[j] = auc(fpr[j], tpr[j])

    fpr["micro"], tpr["micro"], _ = roc_curve(y_test_skf_bin.ravel(), y_score.ravel())
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

    # Plot ROC curve
    plt.figure()
    plt.plot(fpr["micro"], tpr["micro"],
            label='micro-average ROC curve (area = {0:0.2f})'
                ''.format(roc_auc["micro"]))
    for k in range(n_classes):
        plt.plot(fpr[k], tpr[k], label='ROC curve of class {0} (area = {1:0.2f})'
                                    ''.format(k, roc_auc[k]))

    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title(f'Fold {i} ROC curves')
    plt.legend(bbox_to_anchor=(1.05, 1.0), loc='upper left')
    plt.show()
    print()
    print()

    pred = pipe.predict(X_test.drop("Global_Sales", axis=1))
    score = f1_score(y_test, pred, average="macro")
    cv_result.append(score)

    rmse = mean_squared_error(
            X_test["Global_Sales"], 
            [[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, df["Global_Sales"].max()][x] for x in pred],
            squared=False
            )
    cv_rmse.append(rmse)

Output:
Test data F1Score: 0.15898311764016076
Test data RMSE 1.2553489368798378
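
The label-to-number conversion behind the RMSE can be sketched in isolation as follows. The arrays `actual_sales` and `pred_labels` are hypothetical; only the idea of mapping each class index to its bin's lower edge follows the code above.

```python
import numpy as np

# Lower edge of each of the 11 bins, indexed by class label
bin_values = np.arange(11, dtype=float)

# Hypothetical actual sales and predicted class labels
actual_sales = np.array([0.4, 1.6, 2.2, 11.0])
pred_labels = np.array([0, 1, 3, 10])

# Map each predicted class to a number, then compute RMSE by hand
pred_values = bin_values[pred_labels]  # [0., 1., 3., 10.]
rmse = np.sqrt(np.mean((actual_sales - pred_values) ** 2))
print(round(float(rmse), 3))
```

Note that even a perfect classifier would not reach an RMSE of zero under this scheme, because every value inside a bin is represented by a single number.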

The F1 score came out quite low.

The labels are so heavily imbalanced that the model apparently fails to predict the minority classes well.
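
To see why macro-averaged F1 punishes this kind of imbalance, consider a toy sketch (hypothetical labels) in which a degenerate model always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 95 samples of class 0, 5 of class 1
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without a warning
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(round(acc, 3), round(macro_f1, 3))
```

Accuracy looks high, but macro F1 averages the per-class scores with equal weight, so the class that is never predicted pulls it down. With 11 classes that all appear in the evaluation data, a model that only ever predicts the majority class cannot exceed 1/11 (about 0.09) macro F1.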

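Oversampling

So I try oversampling, which makes up for the shortage by adding data derived from the minority classes.

(Here I simply duplicated the existing rows, though...)

Taking the number of normal rows as the baseline, I print how many times each class has to be multiplied to match normal.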

# Count the rows per class once, sorted by class index
target_counts = df_dropna["Target"].value_counts().sort_index()

for index_num, count in target_counts.items():
    print(
            index_num,                  # class index
            hit_labels[index_num],      # label name
            target_counts[0] // count,  # multiplier needed to match "normal"
            "x"
        )

0 normal 1 x
1 milion_hit 17 x
2 double_milion_hit 59 x
3 triple_milion_hit 139 x
4 quadruple_milion_hit 250 x
5 quintuple_milion_hit 493 x
6 sextaple_milion_hit 772 x
7 septaple_milion_hit 1615 x
8 octaple_milion_hit 2538 x
9 nonuple_milion_hit 3554 x
10 ten_million_hit 772 x

I reuse the logic above to inflate the training data.

pd.concat([xxx] * num) simply duplicates the rows num times.

# Multipliers are computed from the full data, as in the previous step
target_counts = df_dropna["Target"].value_counts().sort_index()

# Collect the duplicated rows per class, then concatenate once at the end
X_parts, y_parts = [], []

for index_num, count in target_counts.items():

    coef = target_counts[0] // count  # multiplier needed to match "normal"

    mask = y_train == index_num  # training rows of this class
    X_parts.append(pd.concat([X_train[mask]] * coef))
    y_parts.append(pd.concat([y_train[mask]] * coef))

# Reset the index
X_train_over = pd.concat(X_parts).reset_index(drop=True)
y_train_over = pd.concat(y_parts).reset_index(drop=True)
y_train_over = y_train_over.astype(int) # make sure the labels are int

The scores improved considerably.

F1Score on the test data: 0.9986972627154034
RMSE on the test data: 0.3375619095795954
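
For reference, the duplication approach can be reproduced on a toy frame as follows. `X_toy` and `y_toy` are hypothetical stand-ins for the training data; the integer multiplier follows the same idea as the loop above.

```python
import pandas as pd

# Hypothetical training set: 6 rows of class 0, 2 rows of class 1
X_toy = pd.DataFrame({"feature": range(8)})
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 1], name="Target")

counts = y_toy.value_counts().sort_index()
majority = counts.max()

X_parts, y_parts = [], []
for label, count in counts.items():
    coef = majority // count  # integer multiplier relative to the largest class
    mask = y_toy == label
    X_parts.append(pd.concat([X_toy[mask]] * coef))
    y_parts.append(pd.concat([y_toy[mask]] * coef))

X_over = pd.concat(X_parts).reset_index(drop=True)
y_over = pd.concat(y_parts).reset_index(drop=True)
print(y_over.value_counts().sort_index().to_dict())
```

Only training rows are duplicated here; the test rows should stay untouched so that the evaluation still reflects the original distribution. A library such as imbalanced-learn (`RandomOverSampler`) offers a comparable random-duplication approach, although it was not used in this article.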

Stacking

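Finally, I try Stacking for a classification task.

Besides building the training-side prediction features by fitting and predicting with each model on the training data, I also use the same fitted models to build the evaluation-side prediction features from the test data.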
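
The general idea of two-layer stacking can be sketched on synthetic data as below. This is a simplified illustration, not the procedure used in this article: the data comes from `make_classification`, the base models are arbitrary choices, and the test-side features are produced by refitting each base model on all training rows rather than by reusing per-fold models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the real dataset (assumption)
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

base_models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

train_cols, test_cols = [], []
for name, model in base_models.items():
    # Out-of-fold predictions: every training row is predicted by a model
    # that did not see that row during fitting
    train_cols.append(cross_val_predict(model, X_tr, y_tr, cv=skf))
    # Simplification: refit on all training rows to predict the test rows
    test_cols.append(model.fit(X_tr, y_tr).predict(X_te))

meta_train = np.column_stack(train_cols)  # one column per base model
meta_test = np.column_stack(test_cols)

# Second layer: a meta-model trained on the first-layer predictions
meta_model = LogisticRegression(max_iter=1000).fit(meta_train, y_tr)
meta_pred = meta_model.predict(meta_test)
print(meta_train.shape, meta_test.shape)
```

Using out-of-fold predictions for the training-side features keeps the second layer from learning on predictions that the base models made for rows they had already memorized.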

To speed things up, CatBoost and XGBoost run on the GPU.

ThunderSVC, an SVC implementation that can use the GPU, stalled on this dataset, so I did not use it.

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
from catboost import CatBoostClassifier
import lightgbm as lgb

models = {
    "logistic":LogisticRegression(random_state=SEED),
    "kneighbor":KNeighborsClassifier(),
    "mlp":MLPClassifier(random_state=SEED),
    "svc":SVC(random_state=SEED),
    "adaboost":AdaBoostClassifier(random_state=SEED),
    "random_forest":RandomForestClassifier(random_state=SEED),
    "gradient":GradientBoostingClassifier(random_state=SEED),
    "catboost":CatBoostClassifier(random_state=SEED, 
                                 silent=True,     # suppress training logs
                                 task_type='GPU', # run on the GPU
                                 ),
    "xgboost":xgb.XGBClassifier(
        random_state=SEED,
        objective='multi:softmax',
        tree_method='gpu_hist',  # run on the GPU
        gpu_id=0                 # GPU device index
        ),
    "lightgbm":lgb.LGBMClassifier(
        random_state=SEED,
        ),
}

Run the first layer.

import pickle

n_classes = 11

# Training and inference
skf = StratifiedKFold(
    n_splits=5, 
    shuffle=True,
    random_state=SEED
    )

pred_train = pd.DataFrame()
pred_test = pd.DataFrame()

cv_result_stck = {}

for i, (model_name, model) in enumerate(models.items()):

    print(i, model)

    # Pipeline steps
    step = [
        ("columns_transformers", columns_transformers),
        ('model', model)
    ]

    # Create the pipeline
    pipe = Pipeline(
        step
    )

    each_model_train = pd.DataFrame()
    each_model_test = pd.DataFrame()

    each_model_result = []

    for j, (train_index, test_index) in enumerate(skf.split(X_train, y_train)):
        X_train_skf, X_test_skf = X_train.drop("Global_Sales", axis=1).iloc[train_index], X_train.drop("Global_Sales", axis=1).iloc[test_index]
        y_train_skf, y_test_skf = y_train.iloc[train_index], y_train.iloc[test_index]

        # Fit and predict
        pipe.fit(X_train_skf, y_train_skf)

        # Save the fitted model of this fold (the preprocessing step is not saved)
        with open(f"model_{model_name}_{j}.pkl", 'wb') as f:
            pickle.dump(pipe["model"], f)

        y_pred = pipe.predict(X_test_skf)

        tmp_train = pd.DataFrame(
                [y_pred.flatten()], # flatten to 1-D
                columns=test_index
            )

        score = f1_score(y_test_skf, y_pred, average="macro")
        each_model_result.append(score)
        each_model_train = pd.concat([each_model_train , tmp_train.T]) # stack each fold's predictions vertically

    # Split the test data into 5 chunks (dummy labels, so the split is not really stratified)
    # and predict chunk j with the model fitted on training fold j
    for j, (train_index, test_index) in enumerate(skf.split(X_test, [0]*len(X_test))):
        X_train_skf, X_test_skf = X_test.drop("Global_Sales", axis=1).iloc[train_index], X_test.drop("Global_Sales", axis=1).iloc[test_index]

        with open(f"model_{model_name}_{j}.pkl", 'rb') as f:
            load_model = pickle.load(f)

        # Rebuild the pipeline around the loaded model.
        # columns_transformers is a shared object, so it carries the state of its most recent fit.
        load_step = [
            ("columns_transformers", columns_transformers),
            ('model', load_model)
        ]

        # Create the pipeline
        load_pipe = Pipeline(
            load_step
        )

        pred = load_pipe.predict(X_test_skf)

        tmp_test = pd.DataFrame(
                [pred.flatten()], # flatten to 1-D
                columns=test_index
            )

        each_model_test = pd.concat([each_model_test , tmp_test.T]) # stack each fold's predictions vertically

    cv_result_stck[model_name] = each_model_result # collect the per-fold F1 scores of each model

    # Build the prediction features for training
    each_model_train.columns = [model_name] # rename the column to the model name
    pred_train = pd.concat([pred_train, each_model_train.sort_index()], axis=1) # append this model's predictions as a new column

    # Build the prediction features for testing
    each_model_test.columns = [model_name] # rename the column to the model name
    pred_test = pd.concat([pred_test, each_model_test.sort_index()], axis=1) # append this model's predictions as a new column

Select the best model.

best_model_score = 0
best_model_name = ""

for model_name, f1score in cv_result_stck.items():
    print(model_name, np.mean(f1score))

    if best_model_score < np.mean(f1score):
        best_model_score = np.mean(f1score)
        best_model_name = model_name

print()
print("Best Model is ", best_model_name, "! Best Score is ", best_model_score, "!")

logistic 0.09035313505672425
kneighbor 0.09291644081877312
mlp 0.11015595330137307
svc 0.09036116103514505
adaboost 0.4292619012974509
random_forest 0.49770355314340337
gradient 0.9391359225535506
catboost 0.5109665774376353
xgboost 0.5189826613966799
lightgbm 0.15411815818209879

Best Model is gradient ! Best Score is 0.9391359225535506 !
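
The same selection can also be written more compactly with `max`. The sketch below uses a hypothetical `cv_scores` dictionary with made-up fold scores in place of the real results.

```python
import numpy as np

# Hypothetical per-fold F1 scores keyed by model name
cv_scores = {
    "model_a": [0.10, 0.12, 0.11],
    "model_b": [0.50, 0.48, 0.52],
    "model_c": [0.30, 0.35, 0.25],
}

# Average the folds, then pick the name with the highest mean
mean_scores = {name: float(np.mean(scores)) for name, scores in cv_scores.items()}
best_name = max(mean_scores, key=mean_scores.get)
print(best_name, round(mean_scores[best_name], 3))
```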

Run the second layer.

# Training and inference
skf = StratifiedKFold(
    n_splits=5, 
    shuffle=True,
    random_state=SEED
    )

cv_result_stck = []
cv_rmse_stck = []

step = [
    ("numeric_transformers", numeric_transformer), # every column is numeric now, so use the numeric transformer
    ('model', models[best_model_name]) # use the best-scoring model
]

# Create the pipeline
pipe = Pipeline(
    step
)

for train_index, test_index in skf.split(pred_train, y_train):

    X_train_skf, X_test_skf = pred_train.iloc[train_index], pred_train.iloc[test_index]
    y_train_skf, y_test_skf = y_train.iloc[train_index], y_train.iloc[test_index]

    pipe.fit(X_train_skf, y_train_skf)
    y_pred = pipe.predict(X_test_skf)
    print("F1 score", f1_score(y_test_skf, y_pred, average="macro"))

    pred = pipe.predict(pred_test)
    score = f1_score(y_test, pred, average="macro")
    cv_result_stck.append(score)

    rmse = mean_squared_error(
            X_test["Global_Sales"], 
            [[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, df["Global_Sales"].max()][x] for x in pred],
            squared=False
            )
    cv_rmse_stck.append(rmse)

The gain is smaller than with oversampling, but the scores still improved over the baseline.

F1 score: 0.5782900285036405
RMSE: 0.39725272490719277

Score summary

Oversampling gave by far the best scores.

Method	F1Score	RMSE
Baseline	0.159	1.255
Oversampling	0.999	0.338
Stacking	0.578	0.397

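I also tried label distribution learning, which is said to work well for ordered labels like these, but it had little effect, perhaps because my implementation was flawed, so I left it out.

That concludes the Advent Calendar. Thank you for reading to the end.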
