[Kaggle] Tabular Playground Series - May 2022: Competition Write-up (9th/1151)

Published 2022/11/12


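In May of this year (2022), I took part in Kaggle's Tabular Playground Series - May 2022, and this post is a look back at it. (There are bound to be rough spots; if you notice any, I would be grateful if you pointed them out.)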

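About me

- I am an engineer in my fifth year of work, in a non-IT field (I work for a manufacturer).
- I started studying Python seriously only after I started working.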

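What is the Tabular Playground Series?

The Tabular Playground Series is a competition hosted by Kaggle once a month that deals only with tabular data. In terms of difficulty, it is roughly what a beginner who has gotten comfortable with Titanic would try next. That said, each competition runs for a full month and attracts a fair number of teams, so although no medals are awarded, it is considerably more serious than Titanic. I decided to enter as a step toward leaving the beginner stage.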

Competition overview

The competition overview page contained the following description.

The May edition of the 2022 Tabular Playground series binary classification problem that includes a number of different feature interactions. This competition is an opportunity to explore various methods for identifying and exploiting these feature interactions.

In other words, the key to this competition is to explore interactions among the given features and find the combinations that actually help predict the target. (It can also be seen as a competition for practicing how to find such interaction features.)

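Concretely, we are given manufacturing control data, and the goal is to predict whether a machine is in state 0 or state 1. As noted above, the data apparently contains various feature interactions that matter for determining the machine's state.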

Competition period

The competition ran for one month, from May 1 to May 31, 2022.

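Data

train.csv
The training data, train.csv, contains 900,000 rows. It is clean data with no missing values. There are 31 features, f_00 through f_30: 16 of float type, 14 of int type, and 1 of string type. The string feature is the distinctive one: f_27 alone is a string of 10 letters. The target to predict is binary, 0 or 1, and the classes are balanced, with roughly 51% zeros and 49% ones.

test.csv
The test data contains 700,000 rows with the same 31 features, f_00 through f_30.

submission.csv
The CSV file in which the predictions for the test data are entered for submission.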

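Evaluation metric

Submissions were evaluated on the area under the ROC curve (ROC AUC), a metric commonly used to assess model performance on binary classification tasks.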
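To make the metric concrete, here is a minimal dependency-free sketch of ROC AUC, using its interpretation as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The function name `roc_auc` and the toy labels and scores are made up for illustration; the training code in this post uses scikit-learn's `roc_auc_score` instead.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    # Count pairwise wins; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (not from the competition).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75
```

This pairwise version is quadratic in the number of rows, so it is only meant to show what the number means. Since only the ordering of the scores matters, any monotonic rescaling of the predictions leaves the AUC unchanged.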

Feature engineering

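Feature engineering looked like the most important part of this competition. From the overview it was clear that interactions between features mattered in particular, and many participants combined features in various ways to search for combinations that worked.

From the early to the middle stage of the competition, discussions and EDA notebooks established that the following steps help.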

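- Add the top three interaction features
Three combinations stood out as having strong interactions when examined with tools such as seaborn heatmaps: f_02 and f_21, f_05 and f_22, and f_00, f_01 and f_26. Each combination is added as a new feature.
- Split the f_27 string into 10 separate features, and add a count of its unique characters
f_27, the only string feature, is split into individual characters, and each character is converted to a number via its Unicode code point (offset so that 'A' becomes 0) and added as a new feature. In addition, the number of distinct characters in the string (for example, 3 for 'AABAABCABA', which contains 'A', 'B' and 'C') is added as a feature named unique_characters.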

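With these changes and additions, the number of features fed to the model went from 31 to 44: the 30 original non-string features, plus 10 character features, unique_characters, and 3 interaction features (f_27 itself is dropped).

Below is the code that adds the features.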
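As a worked example of the f_27 encoding on a single string, the sketch below uses plain Python rather than pandas. The helper name `encode_f27` is hypothetical and exists only for this illustration.

```python
def encode_f27(s):
    """Return per-character codes (A=0, B=1, ...) and the number of distinct characters."""
    codes = [ord(c) - ord('A') for c in s]
    return codes, len(set(s))


codes, unique_characters = encode_f27('AABAABCABA')
print(codes)              # [0, 0, 1, 0, 0, 1, 2, 0, 1, 0]
print(unique_characters)  # 3
```

Each of the 10 codes becomes its own column, so the model can treat every position in the string as a separate ordinal feature.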

# Assumes `train` and `test` are pandas DataFrames already loaded from train.csv and test.csv
features = [f for f in test.columns if f != 'id' and f != 'f_27']
float_features = [f for f in features if test[f].dtype == float]

for df in [train, test]:
    # Extract the 10 letters of f_27 into individual features
    for i in range(10):
        df[f'ch{i}'] = df.f_27.str.get(i).apply(ord) - ord('A')
        
    # unique_characters feature is from https://www.kaggle.com/code/cabaxiom/tps-may-22-eda-lgbm-model
    df["unique_characters"] = df.f_27.apply(lambda s: len(set(s)))
    
    # Feature interactions: create three ternary features
    # Every ternary feature can have the values -1, 0 and +1
    df['i_02_21'] = (df.f_21 + df.f_02 > 5.2).astype(int) - (df.f_21 + df.f_02 < -5.3).astype(int)
    df['i_05_22'] = (df.f_22 + df.f_05 > 5.1).astype(int) - (df.f_22 + df.f_05 < -5.4).astype(int)
    i_00_01_26 = df.f_00 + df.f_01 + df.f_26
    df['i_00_01_26'] = (i_00_01_26 > 5.0).astype(int) - (i_00_01_26 < -5.0).astype(int)
    
features = [f for f in test.columns if f != 'id' and f != 'f_27']
float_features = [f for f in features if test[f].dtype == float]
int_features = [f for f in features if test[f].dtype == int and f.startswith('f')]
ch_features = [f for f in features if f.startswith('ch')]
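The ternary interaction features in the code above map a sum of features to -1, 0 or +1 depending on which side of two thresholds it falls. A small sketch of that mapping in plain Python follows; the thresholds 5.2 and -5.3 are the ones used for i_02_21, while the helper name `ternary` and the sample (f_02, f_21) values are made up for illustration.

```python
def ternary(total, upper, lower):
    """Return +1 if total > upper, -1 if total < lower, and 0 otherwise."""
    return int(total > upper) - int(total < lower)


# Hypothetical (f_02, f_21) pairs, one for each of the three regions.
samples = [(3.0, 2.5), (0.4, -1.0), (-3.0, -2.6)]
values = [ternary(f_02 + f_21, 5.2, -5.3) for f_02, f_21 in samples]
print(values)  # [1, 0, -1]
```

Encoding the region directly spares the model from having to learn the diagonal boundaries f_02 + f_21 = const from the raw features.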

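Approach

Early on I built and tried a variety of baseline models, including gradient boosting models such as XGBoost and LightGBM. It turned out that neural networks scored higher, so from then on I focused on improving the network architecture.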

Model

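The neural network model I finally used has the following structure.
Making the model larger (for example, adding more layers) did not improve the score, possibly because of overfitting, so I kept the model from growing too large.

Below is the model definition code.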

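import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

def my_model():
    # Fixed seed so that weight initialization is reproducible
    operation_level_seed = 42
    initializer = tf.keras.initializers.GlorotUniform(seed=operation_level_seed)
    activation = 'silu'
    # shape must be a tuple, hence the trailing comma
    inputs = Input(shape=(len(features),))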
    x = Dense(64, kernel_regularizer=tf.keras.regularizers.l2(40e-6),
              activation=activation,kernel_initializer=initializer
             )(inputs)
    x = Dense(128, kernel_regularizer=tf.keras.regularizers.l2(40e-6),
              activation='PReLU',kernel_initializer=initializer
             )(x)
    x = Dense(64, kernel_regularizer=tf.keras.regularizers.l2(40e-6),
              activation='ReLU',kernel_initializer=initializer
             )(x)
    x = Dense(32, kernel_regularizer=tf.keras.regularizers.l2(40e-6),
              activation=activation,kernel_initializer=initializer
             )(x)
    x = Dense(16, kernel_regularizer=tf.keras.regularizers.l2(40e-6),
              activation=activation,kernel_initializer=initializer
             )(x)
    x = Dense(1, #kernel_regularizer=tf.keras.regularizers.l2(1e-6),
              activation='sigmoid',
             )(x)
    model = Model(inputs, x)
    return model

Train

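I set the number of epochs to 200 (increasing it further did not improve the score). Note that in the code shown below, fit_model passes epochs = EPOCHS_COSINEDECAY (150) to model.fit, so the EPOCHS constant is not used in this excerpt.
The learning rate is decreased smoothly using cosine decay.

Below is the training code.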

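import datetime
import math
import random

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.callbacks import LearningRateScheduler
from tqdm import tqdm

# Note: plot_history() is a plotting helper that is not shown in this excerpt.

EPOCHS = 200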
EPOCHS_COSINEDECAY = 150
CYCLES = 1
VERBOSE = 0 # set to 0 for less output, or to 2 for more output
DIAGRAMS = True
USE_PLATEAU = False
BATCH_SIZE = 2048
ONLY_FIRST_FOLD = False
global_seed = 42

np.random.seed(global_seed)
random.seed(global_seed)
tf.random.set_seed(global_seed)

def fit_model(X_tr, y_tr, X_va=None, y_va=None, run=0):
    """Scale the data, fit a model, plot the training history and optionally validate the model
    
    Returns a trained instance of tensorflow.keras.models.Model.
    
    As a side effect, updates y_va_pred, history_list and score_list.
    """
    global y_va_pred
    start_time = datetime.datetime.now()
    
    scaler = StandardScaler()
    X_tr = scaler.fit_transform(X_tr)
    
    if X_va is not None:
        X_va = scaler.transform(X_va)
        validation_data = (X_va, y_va)
    else:
        validation_data = None

    # Define the learning rate schedule and EarlyStopping
    lr_start=0.01

    
    epochs = EPOCHS_COSINEDECAY
    lr_end = 0.00005
    def cosine_decay(epoch):
         # w decays from 1 to 0 in every cycle
         # epoch == 0                  -> w = 1 (first epoch of cycle)
         # epoch == epochs_per_cycle-1 -> w = 0 (last epoch of cycle)
        epochs_per_cycle = epochs // CYCLES
        epoch_in_cycle = epoch % epochs_per_cycle
        if epochs_per_cycle > 1:
            w = (1 + math.cos(epoch_in_cycle / (epochs_per_cycle-1) * math.pi)) / 2
        else:
            w = 1
        return w * lr_start + (1 - w) * lr_end

    lr = LearningRateScheduler(cosine_decay, verbose=0)
    callbacks = [lr, tf.keras.callbacks.TerminateOnNaN()]
        
    # Construct and compile the model (a fresh model is built on every call)
    model = my_model()
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_start),
                  metrics=['AUC'],
                  loss=tf.keras.losses.BinaryCrossentropy())

    # Train the model
    history = model.fit(X_tr, y_tr, 
                        validation_data=validation_data, 
                        epochs=epochs,
                        verbose=VERBOSE,
                        batch_size=BATCH_SIZE,
                        shuffle=True,
                        callbacks=callbacks)

    history_list.append(history.history)
    callbacks, es, lr, history = None, None, None, None
    
    if X_va is None:
        print(f"Training loss: {history_list[-1]['loss'][-1]:.4f}")
    else:
        lastloss = f"Training loss: {history_list[-1]['loss'][-1]:.4f} | Val loss: {history_list[-1]['val_loss'][-1]:.4f}"
        
        # Inference for validation
        y_va_pred = model.predict(X_va, batch_size=len(X_va), verbose=0)
        #oof_list[run][val_idx] = y_va_pred
        
        # Evaluation: Execution time, loss and AUC
        score = roc_auc_score(y_va, y_va_pred)
        print(f"Fold {run}.{fold} | {str(datetime.datetime.now() - start_time)[-12:-7]}"
              f" | {lastloss} | AUC: {score:.5f}")
        score_list.append(score)
        
        if DIAGRAMS and fold == 0 and run == 0:
            # Plot training history
            plot_history(history_list[-1], 
                         title=f"Learning curve (validation AUC = {score:.5f})",
                         plot_lr=True)

            # Plot y_true vs. y_pred
            plt.figure(figsize=(10, 4))
            plt.hist(y_va_pred[y_va == 0], bins=np.linspace(0, 1, 21),
                     alpha=0.5, density=True)
            plt.hist(y_va_pred[y_va == 1], bins=np.linspace(0, 1, 21),
                     alpha=0.5, density=True)
            plt.xlabel('y_pred')
            plt.ylabel('density')
            plt.title('OOF Predictions')
            plt.show()

    return model, scaler


print(f"{len(features)} features")
history_list = []
score_list = []
kf = KFold(n_splits=5)

for fold, (idx_tr, idx_va) in tqdm(enumerate(kf.split(train))):
    X_tr = train.iloc[idx_tr][features]
    X_va = train.iloc[idx_va][features]
    y_tr = train.iloc[idx_tr].target
    y_va = train.iloc[idx_va].target
    
    fit_model(X_tr, y_tr, X_va, y_va)
    if ONLY_FIRST_FOLD: break # we only need the first fold

print(f"OOF AUC:                       {np.mean(score_list):.5f}")
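The cosine schedule inside fit_model can be checked in isolation. The sketch below copies its formula into a standalone function, with the constants from the code above (150 epochs, 1 cycle, learning rate from 0.01 down to 0.00005) passed as default arguments; making them parameters is a choice made for this illustration only.

```python
import math


def cosine_decay(epoch, epochs=150, cycles=1, lr_start=0.01, lr_end=0.00005):
    """Learning rate for a given epoch: lr_start at the start of a cycle, lr_end at its end."""
    epochs_per_cycle = epochs // cycles
    epoch_in_cycle = epoch % epochs_per_cycle
    if epochs_per_cycle > 1:
        # w goes from 1 (first epoch of the cycle) to 0 (last epoch of the cycle)
        w = (1 + math.cos(epoch_in_cycle / (epochs_per_cycle - 1) * math.pi)) / 2
    else:
        w = 1
    return w * lr_start + (1 - w) * lr_end


schedule = [cosine_decay(epoch) for epoch in range(150)]
for epoch in (0, 75, 149):
    print(epoch, f"{schedule[epoch]:.6f}")
```

The rate starts at lr_start, passes roughly the midpoint between the two values halfway through, and reaches lr_end at the last epoch, with the slowest change at both ends.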

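Predict

For prediction, I retrained on the entire training set. I trained 10 models (the same architecture with different random seeds), had each one predict the test set, converted each model's predictions to ranks, and used the average as the final prediction.

Below is the prediction code.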

import scipy.stats

X_tr = train[features]
y_tr = train.target

pred_list = []
seeds_list = [0, 5, 10, 15, 20, 25, 30, 42, 50, 52]
for seed in seeds_list:
    np.random.seed(seed)
    random.seed(seed)
    tf.random.set_seed(seed)
    model, scaler = fit_model(X_tr, y_tr, run=seed)
    pred_list.append(scipy.stats.rankdata(model.predict(scaler.transform(test[features]),
                                                        batch_size=len(test), verbose=0)))
    print(f"{seed:2}", pred_list[-1])
print()
submission = test[['id']].copy()
submission['target'] = np.array(pred_list).mean(axis=0)
submission.to_csv('submission.csv', index=False)
submission
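The rank averaging done in the loop above can be illustrated on a tiny example. The sketch below is dependency-free; the helper `rank` is written here to mimic the default tie handling of scipy.stats.rankdata (ties share their average rank), and the two prediction lists are made up.

```python
def rank(values):
    """Ranks from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


# Hypothetical predictions from two models on four test rows.
preds_a = [0.10, 0.90, 0.50, 0.70]
preds_b = [0.02, 0.30, 0.25, 0.99]
ranks = [rank(preds_a), rank(preds_b)]
blend = [sum(col) / len(col) for col in zip(*ranks)]
print(blend)  # [1.0, 3.5, 2.0, 3.5]
```

The blended values are not probabilities, which is fine here because AUC depends only on the ordering of the predictions. Averaging ranks rather than raw outputs keeps a model with unusually extreme outputs from dominating the blend.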

At the end of the competition, my public score was 0.99829, which put me in 8th place.

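Other things I tried

- Semi-supervised learning with pseudo-labels
When I got stuck late in the competition, I tried pseudo-labeling because the top finishers had used it in a certain whale competition I had entered just before. In this competition, however, it did not seem to help much.
- Model ensembling
From the middle of the competition until shortly before the end, I borrowed high-scoring public models and ensembled them for small score gains. I then noticed that the score was almost the same as that of my single model, so I went with the single model for the final stretch.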
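For readers unfamiliar with pseudo-labeling, the selection step can be sketched as follows. The post does not say how the pseudo-labels were chosen, so the function name `select_pseudo_labels`, the confidence thresholds 0.01 and 0.99, and the sample probabilities are all assumptions made for illustration.

```python
def select_pseudo_labels(test_probs, low=0.01, high=0.99):
    """Return (row_index, label) pairs for test rows predicted with high confidence."""
    selected = []
    for i, p in enumerate(test_probs):
        if p >= high:
            selected.append((i, 1))  # confidently positive
        elif p <= low:
            selected.append((i, 0))  # confidently negative
    return selected


# Hypothetical predicted probabilities for five test rows.
test_probs = [0.999, 0.42, 0.003, 0.97, 0.995]
pseudo = select_pseudo_labels(test_probs)
print(pseudo)  # [(0, 1), (2, 0), (4, 1)]
```

The selected rows would then be appended to the training set with their predicted labels and the model retrained. When a model is already this accurate, the added rows may carry little new information, which would be consistent with the limited effect observed here.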

Results

My private score was 0.99824; I dropped one place and finished 9th out of 1,151 teams. From 2nd place down, the scores were bunched tightly together.

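Retrospective

- I suspect that a large share of the top finishers used neural networks.
In the winning team's solution, the input to the neural network was split into two branches.

I had not thought of anything like that when I was competing, so it was very instructive. From now on I want to be more flexible in how I design neural network architectures.
This was also more or less the first time I seriously investigated and visualized relationships between features. The techniques and code of strong Kagglers were very helpful here too, and they sparked my curiosity.

I hope to put what I learned in this competition to use in future Kaggle competitions and elsewhere. I will keep working at it so that one day I can finish among the top teams in more serious competitions as well.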

Notes

The full code is published as a Kaggle notebook.
kaggle: https://www.kaggle.com/code/deepernet/9th-place-solution-in-tpsmay22

Thank you to everyone who read this far.
