🦡

不均衡データ分類とundersampling + bagging、正例:負例の比率による調整

2022/12/05に公開

概要

機械学習でクラス分類を行う際にクラス間のサンプル数に偏りがあるデータは、不均衡データ（imbalanced data）と呼ばれているが、これを扱う手法としてundersampling + baggingがあり、以下で紹介されている。

本稿では、undersampling + baggingを試し、undersamplingの際に正例:負例の比率を操作して検知性能の調整を試してみた。

undersampling + bagging

undersampling + baggingは以下の論文がオリジナルである。

B. C. Wallace, K. Small, C. E. Brodley and T. A. Trikalinos, "Class Imbalance, Redux," 2011 IEEE 11th International Conference on Data Mining, 2011, pp. 754-763, doi: 10.1109/ICDM.2011.33.

https://doi.org/10.1109/ICDM.2011.33

詳細については、上記を読んでいただくとして、簡単な説明としては、上記論文の著者の一人のC. E. Brodleyが2015年に講義した際の資料がわかりやすい。

※ ppt注意

以下ではこの資料の図に基づいて説明する。（全ての図の引用元は上記の講義資料である）

不均衡データ分類の問題点

以下のような不均衡データの標本を仮定する。

真の境界は ${w^*}$ で示したものであるが、限られた標本からは偏った境界 $\hat{W}$ が得られてしまう。

一次元で表現すると以下のようになる。

標本の数が少なく、Pで表した分布の一部しか観察できていない（もっと右側のxが必要）ため、偏った境界 $\hat{W}$ が得られてしまう。

オーバーサンプリングでなくアンダーサンプリングする理由

上記のminority classに対して、オーバーサンプリング（SMOTE; Synthetic Minority Oversamping Technique）を実施したとしよう。

SMOTEは、観測された標本同士の内挿により、合成された標本を生成する手法である。

次の図で観測された標本同士で内挿した場合、

以下のように、２つのx間のデータが増えるだけで、偏った境界 $\hat{W}$ を修正できない。

バギングが有効な理由

majority classのアンダーサンプリングを繰り返して多くの境界wを候補とすることで、真の境界 ${w^*}$ と同等の分類を意図したものがバギングである。

以下のように、アンダーサンプリングを繰り返していく。

以下のように繰り返して得られた境界は偏りを少なくしている。

以上が、オーバーサンプリングせずにアンダーサンプリングし、かつバギングと組み合わせること（undersampling + bagging）が有効とされている理由である。

例

ここでは、Classification on imbalanced data | TensorFlow Coreに出てくるKaggleの Credit Card Fraud Detectionを使って、undersampling + baggingを試してみた。検知性能の算出などもClassification on imbalanced data | TensorFlow Coreの方法に準じたものにした。

notebook全体は以下に公開したので参照いただきたい。

元の記事では、random seedが固定されておらず、実行するたびに異なるデータ分割が行われていたため、以下の部分で設定するよう変更した。

# Use a utility from sklearn to split and shuffle your dataset.
train_df, test_df = train_test_split(cleaned_df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)

尚、random seedを固定した状態で元の記事のnotebook（kerasによるシンプルなNN）を実行したものが以下である。

ここでは、上記のデータ分割に加え、keras用に以下のコードを追加している。

import random
import numpy as np
import tensorflow as tf

seed = 42
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)

アンダーサンプリングなし

では、最初にアンダーサンプリングせずにバギングのみのランダムフォレストの結果を見てみよう。

コードとしては以下である。random_stateの指定、n_jobsの指定（コアを全て使う）以外のパラメータはデフォルトである。

from sklearn.ensemble import RandomForestClassifier

neg, pos = np.bincount(train_labels)
total = neg + pos
print('Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))
train_f_count = neg
train_t_count = pos

classifier = RandomForestClassifier(
    random_state = 42,
    n_jobs = -1
)

classifier.fit(train_df, train_labels)

今回のデータだとMinority（正例）が330件に対し、Majority（負例）が182,276件で、Minority:Majority（正例:負例）は、1:552であった。

Confusion Matrixは以下である。不均衡データでよくあるのは、Trueの予測が全く無くTPが全く無い結果だったりするが、そのような状況ではないようだ。FPが少なく、FNが多い点に注意。

正例:負例を1:1にする

次に、Minority:Majority（正例:負例）を1:1で試してみる。この例だとMajority（負例）が182,276件あるので、これをMinority（正例）の数330と同数までアンダーサンプリングする。

from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier

neg, pos = np.bincount(train_labels)
total = neg + pos
print('Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))
train_f_count = neg
train_t_count = pos

under_sampling_rate = 1
sampler = RandomUnderSampler(
    sampling_strategy = {0 : int(train_t_count * under_sampling_rate), 1 : train_t_count}, 
    random_state = 42
)

classifier = RandomForestClassifier(
    random_state = 42,
    n_jobs = -1
)

train_res_df, train_res_labels = sampler.fit_resample(train_df, train_labels)
classifier.fit(train_res_df, train_res_labels)

Confusion Matrixは以下である。FPが非常に多く、FNが比較的少ない結果になった。

ここでの状況は、Undersampling + baggingで不均衡データに対処した際の予測確率のバイアスを補正して、その結果を可視化してみる - 渋谷駅前で働くデータサイエンティストのブログでTJOさんの仰っている

「undersampling + baggingで不均衡データを補正するとfalse positiveは物凄く多くなる」

と同様のケースと思われる。

正例:負例の比率によるグリッドサーチ

次に、Minority:Majority（正例:負例）の比率を1:1〜1:10まで調整した結果を見てみよう。

f1, precision, recallをそれぞれプロットしているが、

f1, recall重視であればratio=1
precision重視であればratio=7

ということになりそうである。

また、全体的な傾向としては、

Majority（負例）が増えるとFNが増え、FPが減る
Minority（正例）が増えるとFNが減り、FPが増える

ということのようである。

正例:負例を1:7にする

というわけで、Minority:Majority（正例:負例）=1:7を試してみよう。

from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier

neg, pos = np.bincount(train_labels)
total = neg + pos
print('Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))
train_f_count = neg
train_t_count = pos

under_sampling_rate = 7
sampler = RandomUnderSampler(
    sampling_strategy = {0 : int(train_t_count * under_sampling_rate), 1 : train_t_count}, 
    random_state = 42
)

classifier = RandomForestClassifier(
    random_state = 42,
    n_jobs = -1
)

train_res_df, train_res_labels = sampler.fit_resample(train_df, train_labels)
classifier.fit(train_res_df, train_res_labels)

1:1と比較すると、FPの数がだいぶ少なくなった。

この性能が妥当かどうかは、実務においてはビジネス要件（FNが絶対に許容されないのか、FPが多いことが問題になるか）如何と思われるが、正例:負例の比率を操作することで、precision, recallを調整できそうなことは分かった。

学習データ・テストデータのROC曲線を見ると全体的に過学習なので、ここから別途max_features, min_samples_split, min_samples_leaf, max_depthといったパラメータで調整することになる。

reference

不均衡データをundersampling + baggingで補正すると汎化性能も確保できて良さそう - 渋谷駅前で働くデータサイエンティストのブログ
Undersampling + baggingで不均衡データに対処した際の予測確率のバイアスを補正して、その結果を可視化してみる - 渋谷駅前で働くデータサイエンティストのブログ
B. C. Wallace, K. Small, C. E. Brodley and T. A. Trikalinos, "Class Imbalance, Redux," 2011 IEEE 11th International Conference on Data Mining, 2011, pp. 754-763, doi: 10.1109/ICDM.2011.33.
- https://doi.org/10.1109/ICDM.2011.33
C. E. Brodleyの講義ページ
- https://course.ccs.neu.edu/cs6140sp15/schedulen_mod_Spring2015.html
- その資料
  - https://course.ccs.neu.edu/cs6140sp15/4_boosting/slides/wallace_imbalance_icdm_11_for_class_2012_final.pptx

以下は、undersampling + baggingについては取り上げていないが、不均衡データについて比較的簡潔に書かれたもので、undersampling + bagging以外の手法を検討する際にもしかしたら役に立つかもしれない。

大崎美穂, 私のブックマーク：不均衡データ分類, 人工知能, 2022, 37 巻, 3 号, p. 376-381, 公開日 2022/05/01, Online ISSN 2435-8614, Print ISSN 2188-2266, https://doi.org/10.11517/jjsai.37.3_376, https://www.jstage.jst.go.jp/article/jjsai/37/3/37_376/_article/-char/ja
藤原幸一：スモールデータ解析と機械学習，pp. 296，オーム社（2022）

以下ランダムフォレスト関連

https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
- Leo Breimanらによる解説
- Wikipediaからもこちらに繋がっていなかったようなので記載
- Random Forests(tm) is a trademark of Leo Breiman and Adele Cutler and is licensed exclusively to Salford Systems for the commercial release of the software. Our trademarks also include RF(tm), RandomForests(tm), RandomForest(tm) and Random Forest(tm). だそう
https://mlu-explain.github.io/random-forest/
- AmazonのMachine Learning Universityから派生したコンテンツ
- 個人的には直感的で理解しやすかった

GitHubで編集を提案

概要