
Easy tuning with Optuna's LightGBMTuner

Published on 2022/01/20

Quick source code

A single call to optuna.integration.lightgbm.train tunes LightGBM's parameters for you. Handy!

import optuna.integration.lightgbm as optuna_lgb

trn_ds = optuna_lgb.Dataset(trn_x, trn_y)
val_ds = optuna_lgb.Dataset(val_x, val_y)

# Specify the params you want to keep fixed up front
params = {
    'objective': 'regression',
    'metric': 'rmse'
}

opt = optuna_lgb.train(
    params, trn_ds, valid_sets=val_ds,
    verbose_eval=False,
    show_progress_bar=False,  # hide the progress bar
    num_boost_round=100,
    early_stopping_rounds=50,
)

print(opt.params)
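
The snippet above passes verbose_eval and early_stopping_rounds as keyword arguments, which is how LightGBM 3.x worked when this was written. LightGBM 4.0 removed both from lgb.train in favor of callbacks. Below is a minimal sketch of the same call using callbacks, assuming LightGBM 4.x and an Optuna install that includes the LightGBM integration. The helper name tune_with_callbacks is hypothetical, and the imports sit inside the function only so that it can be defined where the libraries are not installed.

```python
def tune_with_callbacks(trn_x, trn_y, val_x, val_y, params):
    """Run Optuna's LightGBM tuner using callback-based early stopping.

    Assumes LightGBM 4.x; this is a sketch, not the article's original code.
    """
    # Imported lazily so defining this helper does not require the libraries.
    import lightgbm as lgb
    import optuna.integration.lightgbm as optuna_lgb

    trn_ds = lgb.Dataset(trn_x, trn_y)
    val_ds = lgb.Dataset(val_x, val_y)

    return optuna_lgb.train(
        params, trn_ds, valid_sets=val_ds,
        num_boost_round=100,
        show_progress_bar=False,
        callbacks=[
            # Replaces early_stopping_rounds=50; verbose=False keeps it quiet.
            lgb.early_stopping(stopping_rounds=50, verbose=False),
        ],
    )
```

Calling opt = tune_with_callbacks(trn_x, trn_y, val_x, val_y, params) and then reading opt.params should mirror the original snippet.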

Example

I tried solving a regression problem with the California housing dataset. A description of the dataset is below.

The strategy: first hold out a test set with train_test_split, then compare the tuned and untuned models on that test data. The evaluation metric is RMSE.
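
For reference, RMSE is the square root of the mean of the squared prediction errors. Here is a minimal pure-Python sketch of that definition. The function name rmse and the toy values are illustrative assumptions, not taken from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy values: errors are 1, 0 and -2, so the MSE is 5/3.
toy_score = rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(toy_score)
```

The article's own code computes the same quantity with np.sqrt(mean_squared_error(...)).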

Source code

import numpy as np
import pandas as pd
import lightgbm as lgb
import optuna.integration.lightgbm as optuna_lgb

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split
from sklearn.datasets import fetch_california_housing


california_housing = fetch_california_housing()

features = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
labels = pd.Series(california_housing.target)

train_x, test_x, train_y, test_y = train_test_split(features, labels, test_size=0.2)

train_x.reset_index(drop=True, inplace=True)
train_y.reset_index(drop=True, inplace=True)

test_x.reset_index(drop=True, inplace=True)
test_y.reset_index(drop=True, inplace=True)

params = {
    'objective': 'regression',
    'metric': 'rmse'
}

cv = KFold(n_splits=3)
pred_y1 = np.zeros(len(test_y))
pred_y2 = np.zeros(len(test_y))
for trn_idx, val_idx in cv.split(train_x):
    trn_x, trn_y = train_x.iloc[trn_idx, :], train_y[trn_idx]
    val_x, val_y = train_x.iloc[val_idx, :], train_y[val_idx]
    trn_ds = optuna_lgb.Dataset(trn_x, trn_y)
    val_ds = optuna_lgb.Dataset(val_x, val_y)

    opt = optuna_lgb.train(
        params, trn_ds, valid_sets=val_ds,
        verbose_eval=False,
        num_boost_round=100,
        early_stopping_rounds=50,
        show_progress_bar=False,
    )

    # clf1: tuned parameters
    clf1 = lgb.train(
        opt.params, trn_ds, valid_sets=val_ds,
        verbose_eval=False, num_boost_round=100, early_stopping_rounds=50,
    )

    # clf2: default parameters
    clf2 = lgb.train(
        params, trn_ds, valid_sets=val_ds,
        verbose_eval=False, num_boost_round=100, early_stopping_rounds=50,
    )

    # Average the test predictions of the fold models
    pred_y1 += clf1.predict(test_x) / cv.get_n_splits()
    pred_y2 += clf2.predict(test_x) / cv.get_n_splits()

# With tuning
print("Tuning: o", np.sqrt(mean_squared_error(test_y, pred_y1)))

# Without tuning
print("Tuning: x", np.sqrt(mean_squared_error(test_y, pred_y2)))

Results

Tuning: o 0.4407947838448555
Tuning: x 0.4653107255633059

Even on unseen data, the tuned model did better than the untuned one (RMSE of about 0.441 versus 0.465).

Thoughts

I can't quite tell whether it is the hyperparameters or the ensembling that is doing the work, but if tuning is this easy, "just tune it first" looks like a perfectly valid option.
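
One way to tell the two effects apart would be to score each fold model on the test set individually and compare those scores with the score of the averaged prediction, for the tuned and the default models alike. Below is a minimal pure-Python sketch of that comparison. The function names and the toy predictions are hypothetical, not results from the experiment above.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def fold_vs_ensemble_rmse(y_true, fold_preds):
    """Return (list of per-fold RMSEs, RMSE of the fold-averaged prediction)."""
    per_fold = [rmse(y_true, preds) for preds in fold_preds]
    n_folds = len(fold_preds)
    # Average the predictions of all fold models, sample by sample.
    averaged = [sum(sample) / n_folds for sample in zip(*fold_preds)]
    return per_fold, rmse(y_true, averaged)


# Hypothetical toy data: three fold models predicting three test samples.
y_true = [1.0, 2.0, 3.0]
fold_preds = [
    [1.5, 2.0, 2.5],
    [0.5, 2.0, 3.5],
    [1.0, 2.6, 3.0],
]
per_fold, ensemble = fold_vs_ensemble_rmse(y_true, fold_preds)
print("per fold:", per_fold)
print("ensemble:", ensemble)
```

In this toy case the averaged prediction beats every single fold model, which is the ensembling effect. Running the same comparison separately on the clf1 and clf2 predictions would show how much of the gap remains between individual tuned and default models.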
