How to Find Which Features Drive Predictions in Random Forest (Feature Importance)

Published 2025/03/12

With Random Forest, you can compute a **feature importance** for each feature and see which inputs influence the predictions most.

1. Using feature_importances_
In scikit-learn, RandomForestClassifier and RandomForestRegressor expose a feature_importances_ attribute after fitting. It holds one impurity-based importance score per feature, normalized so that the scores sum to 1.

Python code example
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Load the data (breast cancer dataset)
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Collect the feature importances, sorted from most to least important
feature_importances = pd.DataFrame({'Feature': X.columns, 'Importance': rf.feature_importances_})
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

# Plot the results as a horizontal bar chart
plt.figure(figsize=(10, 6))
plt.barh(feature_importances['Feature'], feature_importances['Importance'])
plt.xlabel("Feature Importance")
plt.ylabel("Feature Name")
plt.title("Feature Importance in Random Forest")
plt.gca().invert_yaxis()  # Put the most important features at the top
plt.show()
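The forest-level score is an average over the individual trees, so the spread across trees gives a rough sense of how stable each importance is. The following is a minimal sketch on the same breast cancer dataset; the names per_tree, mean_importance, and std_importance are illustrative and not part of the code above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# One row of importances per tree in the forest
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])
mean_importance = per_tree.mean(axis=0)
std_importance = per_tree.std(axis=0)

# Print the five features with the highest mean importance
top = np.argsort(mean_importance)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]}: {mean_importance[i]:.3f} +/- {std_importance[i]:.3f}")
```

A large standard deviation relative to the mean suggests that the trees disagree about that feature, so its ranking should be read with some caution.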

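2. Using SHAP (SHapley Additive exPlanations)
feature_importances_ summarizes each feature's average influence across the whole model, whereas SHAP lets you analyze in detail how each feature contributes to individual predictions.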

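SHAP analysis
import shap

# Compute the SHAP values
explainer = shap.Explainer(rf, X_train)
shap_values = explainer(X_test)

# SHAP summary plot
# Assumption: for a classifier, recent SHAP versions return values with a
# trailing class axis, so select class 1 to get a 2D slice before plotting.
shap.summary_plot(shap_values[:, :, 1], X_test)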

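The summary plot visualizes how each feature influences the predictions, in both direction and magnitude.

With shap_values you can also inspect each feature's contribution for any individual data point.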

3. Permutation Importance
Another option is permutation_importance, which randomly shuffles one feature at a time and measures how much the model's score changes.
from sklearn.inspection import permutation_importance

# Measure the effect of shuffling each feature on the test score
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)

# Visualize the results
perm_importance = pd.DataFrame({'Feature': X.columns, 'Importance': result.importances_mean})
perm_importance = perm_importance.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(perm_importance['Feature'], perm_importance['Importance'])
plt.xlabel("Permutation Importance")
plt.ylabel("Feature Name")
plt.title("Permutation Feature Importance")
plt.gca().invert_yaxis()
plt.show()

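Because permutation_importance only measures how the score changes when a feature's values are shuffled, it is model-agnostic: it works with any fitted estimator, not just tree-based models.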
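To make the mechanism concrete, here is a hand-rolled sketch of the same idea. The helper manual_permutation_importance is hypothetical and written only for illustration; it is not scikit-learn's implementation, and its numbers will not match permutation_importance exactly because the random shuffles differ.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)


def manual_permutation_importance(model, X, y, n_repeats=5, seed=42):
    """Mean drop in model.score when one column at a time is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, r] = baseline - model.score(X_perm, y)
    return baseline, drops.mean(axis=1)


baseline, importances = manual_permutation_importance(rf, X_test, y_test)
print(f"baseline accuracy: {baseline:.3f}")
print("index of the most important feature:", int(np.argmax(importances)))
```

Nothing in the helper depends on the model being a random forest. It only calls score, which is why the approach carries over to other estimators.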

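Summary

| Method | How it works | Advantage | Drawback |
| --- | --- | --- | --- |
| `feature_importances_` | Impurity-based (Gini) importance from tree splits | Fast to compute | Can be biased, for example toward high-cardinality features |
| SHAP | Detailed analysis of each feature's effect | Shows the effect on individual predictions | High computational cost |
| Permutation Importance | Importance based on predictive score | Model-agnostic evaluation | Takes time to compute |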
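The bias listed for feature_importances_ can be illustrated by appending a column of pure random noise and comparing how the two scikit-learn methods score it. This is a sketch under stated assumptions: the noise column is a hypothetical addition that is not part of the dataset, and the exact numbers depend on the random seeds.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical extra feature: continuous random noise, unrelated to the target
rng = np.random.default_rng(0)
X_noisy = np.column_stack([X, rng.normal(size=X.shape[0])])

X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Impurity-based importance is derived from the training data only
noise_impurity = rf.feature_importances_[-1]

# Permutation importance is measured on held-out data
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
noise_permutation = result.importances_mean[-1]

print(f"impurity-based importance of noise: {noise_impurity:.4f}")
print(f"permutation importance of noise:    {noise_permutation:.4f}")
```

The impurity-based score for the noise column is typically small but above zero, because fully grown trees still split on it, while its permutation importance on the test set tends to sit near zero.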


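Recommended usage
For a quick overview, use feature_importances_.
For detailed analysis aimed at improving the model, use SHAP.
For a model-agnostic evaluation, use Permutation Importance.
Choose the method that fits your situation.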
