【Kaggle】 BELKA NeurIPS2024 1st solution explained

Published on 2024/07/12

In this article, I'll explain the 1st place solution of the NeurIPS 2024 BELKA competition.

0. Overview

Links:
BELKA NeurIPS2024
1st Solution

The objective of this competition is to predict whether a small molecule binds to each of three specified protein targets.

A more detailed description is available here.

1. Solution

The solution is a very basic Transformer encoder (Self-Attention -> FeedForward) with 4 layers, 8 heads per layer, and a key/value dimension of 32.
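
As a rough illustration, here is a minimal PyTorch sketch of such an encoder. The model dimension of 256 (8 heads × 32 key/value dimensions per head), the learned positional embedding, and the maximum sequence length are my assumptions; the writeup only specifies the layer, head, and key/value sizes.

```python
import torch
import torch.nn as nn

class SmilesEncoder(nn.Module):
    """Basic Transformer encoder: 4 layers, 8 heads, key/value dim 32."""

    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=4, max_len=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positions (assumption)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,             # 8 heads * 32 per-head dim = 256
            nhead=nhead,
            dim_feedforward=4 * d_model,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded SMILES
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.embed(token_ids) + self.pos(positions)
        return self.encoder(x)  # (batch, seq_len, d_model)
```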

Preprocessing: The author used the atomInSmiles tokenizer, but applied it incorrectly, so the resulting tokenization scheme was almost character-based. No pre-trained models such as ChemBERTa were used.
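
For illustration, an almost character-based scheme would look something like the sketch below (the helper names and special-token ids are hypothetical, not from the writeup):

```python
def build_vocab(smiles_list):
    # Collect every character that appears in the training SMILES.
    chars = sorted({ch for smi in smiles_list for ch in smi})
    # Reserve 0 for padding and 1 for the mask token used in MLM pre-training.
    return {ch: i + 2 for i, ch in enumerate(chars)}

def tokenize(smiles, vocab):
    # Character-based tokenization: one token id per character.
    return [vocab[ch] for ch in smiles]

vocab = build_vocab(["CC(=O)Oc1ccccc1C(=O)O"])  # toy example (aspirin)
print(tokenize("CC(=O)O", vocab))
```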

Training: The author used two-stage pre-training.

  1. MLM with a 15% masking rate
    Masked language modeling with SMILES as both input and output.
  2. SMILES-to-ECFP prediction
    Training with SMILES as input and ECFP as output (see the sketch after this list).
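
Below is a minimal sketch of how the two pre-training targets could be built. The mask-token id follows the hypothetical vocabulary above, and the ECFP settings (radius 2, 2048 bits) are my assumptions; only the 15% masking rate comes from the writeup.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

MASK_ID = 1  # hypothetical mask-token id, matching the vocabulary sketch above

def mask_tokens(token_ids, mask_rate=0.15, rng=None):
    # Stage 1 (MLM): corrupt 15% of the tokens; the model has to
    # reconstruct the original SMILES from the corrupted input.
    rng = rng or np.random.default_rng()
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < mask_rate
    return np.where(mask, MASK_ID, token_ids), token_ids  # (input, target)

def ecfp_target(smiles, radius=2, n_bits=2048):
    # Stage 2: the prediction target is the molecule's ECFP bit vector,
    # computed here with RDKit's Morgan fingerprint (ECFP4 for radius 2).
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.float32)
```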

After this, the heads are replaced with dense layers. In the author's words:

"I've pre-trained transformer on MLM/ECFP tasks (input - SMILES, outputs - SMILES and ECFP) and then just changed the head (dense layer with 3 units and sigmoid activation)."
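
In code, swapping the head could look like the sketch below. The mean pooling over token positions is my assumption; the writeup only specifies the dense layer with 3 units (one per protein target) and sigmoid activation.

```python
import torch
import torch.nn as nn

class BindingClassifier(nn.Module):
    """Fine-tuning model: pre-trained encoder + dense head (3 units, sigmoid)."""

    def __init__(self, pretrained_encoder, d_model=256):
        super().__init__()
        self.encoder = pretrained_encoder  # weights from MLM/ECFP pre-training
        self.head = nn.Linear(d_model, 3)  # one output per protein target

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)          # (batch, seq_len, d_model)
        pooled = hidden.mean(dim=1)               # mean pooling (assumption)
        return torch.sigmoid(self.head(pooled))   # binding probability per target
```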

Pre-training teaches the model how to extract features from SMILES, so that during the actual training, the model can make its predictions using these effectively learned feature extractors.

Finally, here is the author's list of things that didn't work out as expected:
  1. Complex tokenization schemes: bi- and tri-grams, atomInSmiles
  2. Any model with a depth above 32 and more than 6 encoder layers
  3. Multi-input models (SMILES + fingerprints)
  4. Pre-training on a larger dataset—I spent about a month experimenting with ZINC…
  5. Custom loss functions: BinaryFocalLoss was just fine
  6. Gated fusion of building blocks
  7. And many more—I will update the list.

2. Summary

A very smart and effective solution.
Transformers with MLM pre-training are also a strong way to handle complex data.

Reference

[1] 1st place solution
