【Kaggle】 BELKA NeurIPS2024 1st solution explained

Published on 2024/07/12

In this article, I'll explain the 1st place solution of the NeurIPS 2024 BELKA competition.

0. Overview

Links:
BELKA NeurIPS2024
1st Solution

The objective of this competition is to predict whether a small molecule binds to each of three specified protein targets.

A more detailed description is available here.

1. Solution

The solution is a very basic Transformer encoder (Self-Attention -> FeedForward) with 4 layers, 8 heads per layer, and a key/value dimension of 32.
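
As a rough illustration, here is a minimal PyTorch sketch of such an encoder. The model dimension of 256 (8 heads × 32 key/value dimensions per head), the learned positional embedding, and the maximum sequence length are my assumptions; the writeup only specifies the layer, head, and key/value sizes.

```python
import torch
import torch.nn as nn

class SmilesEncoder(nn.Module):
    """Basic Transformer encoder: 4 layers, 8 heads, key/value dim 32."""

    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=4, max_len=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positions (assumption)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,             # 8 heads * 32 per-head dim = 256
            nhead=nhead,
            dim_feedforward=4 * d_model,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded SMILES
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.embed(token_ids) + self.pos(positions)
        return self.encoder(x)  # (batch, seq_len, d_model)
```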

Preprocessing: The author used the atomInSmiles tokenizer, but applied it incorrectly, so the resulting tokenization scheme was almost character-based. No pre-trained models such as ChemBERTa were used.
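
For illustration, an almost character-based scheme would look something like the sketch below (the helper names and special-token ids are hypothetical, not from the writeup):

```python
def build_vocab(smiles_list):
    # Collect every character that appears in the training SMILES.
    chars = sorted({ch for smi in smiles_list for ch in smi})
    # Reserve 0 for padding and 1 for the mask token used in MLM pre-training.
    return {ch: i + 2 for i, ch in enumerate(chars)}

def tokenize(smiles, vocab):
    # Character-based tokenization: one token id per character.
    return [vocab[ch] for ch in smiles]

vocab = build_vocab(["CC(=O)Oc1ccccc1C(=O)O"])  # toy example (aspirin)
print(tokenize("CC(=O)O", vocab))
```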

Training: The author used two-stage pre-training.

  1. MLM with a 15% masking rate
    Masked language modeling with SMILES as both input and output.
  2. SMILES-to-ECFP prediction
    Training with SMILES as input and ECFP as output (see the sketch after this list).
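
Below is a minimal sketch of how the two pre-training targets could be built. The mask-token id follows the hypothetical vocabulary above, and the ECFP settings (radius 2, 2048 bits) are my assumptions; only the 15% masking rate comes from the writeup.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

MASK_ID = 1  # hypothetical mask-token id, matching the vocabulary sketch above

def mask_tokens(token_ids, mask_rate=0.15, rng=None):
    # Stage 1 (MLM): corrupt 15% of the tokens; the model has to
    # reconstruct the original SMILES from the corrupted input.
    rng = rng or np.random.default_rng()
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < mask_rate
    return np.where(mask, MASK_ID, token_ids), token_ids  # (input, target)

def ecfp_target(smiles, radius=2, n_bits=2048):
    # Stage 2: the prediction target is the molecule's ECFP bit vector,
    # computed here with RDKit's Morgan fingerprint (ECFP4 for radius 2).
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.float32)
```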

After this, the heads are replaced with dense layers. In the author's words:

"I've pre-trained transformer on MLM/ECFP tasks (input - SMILES, outputs - SMILES and ECFP) and then just changed the head (dense layer with 3 units and sigmoid activation)."
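
In code, swapping the head could look like the sketch below. The mean pooling over token positions is my assumption; the writeup only specifies the dense layer with 3 units (one per protein target) and sigmoid activation.

```python
import torch
import torch.nn as nn

class BindingClassifier(nn.Module):
    """Fine-tuning model: pre-trained encoder + dense head (3 units, sigmoid)."""

    def __init__(self, pretrained_encoder, d_model=256):
        super().__init__()
        self.encoder = pretrained_encoder  # weights from MLM/ECFP pre-training
        self.head = nn.Linear(d_model, 3)  # one output per protein target

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)          # (batch, seq_len, d_model)
        pooled = hidden.mean(dim=1)               # mean pooling (assumption)
        return torch.sigmoid(self.head(pooled))   # binding probability per target
```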

Pre-training teaches the model how to extract features from SMILES, so that during the actual training, the model can make its predictions using these effectively learned feature extractors.

Finally, here is the author's list of things that didn't work out as expected:
  1. Complex tokenization schemes: bi- and tri-grams, atomInSmiles
  2. Any model with a depth above 32 and more than 6 encoder layers
  3. Multi-input models (SMILES + fingerprints)
  4. Pre-training on a larger dataset—I spent about a month experimenting with ZINC…
  5. Custom loss functions: BinaryFocalLoss was just fine
  6. Gated fusion of building blocks
  7. And many more—I will update the list.

2. Summary

A very smart and effective solution.
Transformers with MLM pre-training are also a strong way to handle complex data.

Reference

[1] 1st place solution
