
【Kaggle】BELKA NeurIPS 2024 Overview

Published 2024/07/09
Plan

  • XGB → x, no input
  • AutoML → trying
  • 1DCNN → o, 1h44m on TPU; input: lightning 2.2.1 downloaded
  • ChemBERTa → o, 3h; input: enc_dataset downloaded with frontier
  • Mamba → trying
  • GNN → o, 12m; input: none

If you are interested in the competition, I recommend first asking a chat AI something like: "please summarize this (as succinctly as possible): [Kaggle's competition description]".

1. Objective

In this competition, we'll develop machine learning models to predict the "binding affinity" of small molecules to specific protein targets, which helps drug development.

We'll predict which "drug-like small molecules" (chemicals) will bind to three possible protein targets.

2. Description

Small molecule drugs are chemicals that interact with cellular protein machinery and affect the functions of this machinery in some way.

A classic approach to find such molecules is to physically make them, one by one, then expose them to the protein target of interest and test whether the two interact.
This can be a fairly laborious and time-intensive process.

The FDA has approved around 2,000 novel molecular entities, but the druglike chemical space contains an estimated 10^60 molecules, making it impractical to search physically. Leash Biosciences tested 133 million small molecules for interactions with three protein targets using DNA-encoded chemical library (DEL) technology, creating the Big Encoded Library for Chemical Assessment (BELKA) dataset.

This competition aims to revolutionize small molecule binding prediction using machine learning (ML) techniques to search chemical space computationally. Participants will build predictive models to estimate the binding affinity of unknown compounds to protein targets, advancing small molecule chemistry and accelerating drug discovery.

3. Evaluation

The metric for this competition is the average precision, calculated for each (protein, split group) pair and then averaged for the final score.
Please see this forum post for important details.
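
A minimal sketch of this scoring, assuming a validation dataframe with columns protein_name, split_group, binds (true label), and pred (model output); the column names are my assumptions:

    # Average precision per (protein, split group), then macro-averaged
    import pandas as pd
    from sklearn.metrics import average_precision_score

    def competition_score(df: pd.DataFrame) -> float:
        scores = df.groupby(["protein_name", "split_group"]).apply(
            lambda g: average_precision_score(g["binds"], g["pred"])
        )
        return float(scores.mean())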

・Submission File

    id,binds
    295246830,0.5
    295246831,0.5
    295246832,0.5
    etc.

4. Data

4.1 Overview

Each example in the competition dataset is a binary classification of whether a given small molecule is a binder or not to one of three protein targets (the same three protein targets every time). Each row corresponds to one of the three targets, and the label says whether the molecule binds to that target.

The data are represented as SMILES strings, with labels as binary binding classifications.

4.2 Files

[train/test].[csv/parquet]
The train or test data, available in both csv and parquet formats. Regardless of format, the contents are the same.

columns:

  • id: A unique example ID used to identify the molecule-binding target pair.
  • buildingblock1_smiles: The structure, in SMILES, of the first building block
  • buildingblock2_smiles: The structure, in SMILES, of the second building block
  • buildingblock3_smiles: The structure, in SMILES, of the third building block
  • molecule_smiles: The structure of the fully assembled molecule, in SMILES. This includes the three building blocks and the triazine core. Note that [Dy] is used as a stand-in for the DNA linker
  • protein_name: The protein target name
  • binds: The target column. A binary class label of whether the molecule binds to the protein. Not available for the test set.
sample_submission.csv

A sample submission file in the correct format.

4.3 Competition data

All data were generated in-house at Leash Biosciences. They are providing roughly 98M training examples per protein, 200K validation examples per protein, and 360K test molecules per protein. To test generalizability, the test set contains building blocks that are not in the training set.

These datasets are very imbalanced:
roughly 0.5% of examples are classified as binders; they used 3 rounds of selection in triplicate to identify binders experimentally.

Following the competition, Leash will make all the data available for future use (3 targets * 3 rounds of selection * 3 replicates * 133M molecules, or 3.6B measurements).

4.4 Targets

Proteins are typically named by their discoverers and often have different names because of these histories.
They screened three protein targets for this competition.

EPHX2(sEH)

The first target, epoxide hydrolase 2, is encoded by the EPHX2 genetic locus, and its protein product is commonly named “soluble epoxide hydrolase”, or abbreviated to sEH. Hydrolases are enzymes that catalyze certain chemical reactions, and EPHX2/sEH also hydrolyzes certain phosphate groups. EPHX2/sEH is a potential drug target for high blood pressure and diabetes progression, and small molecules inhibiting EPHX2/sEH from earlier DEL efforts made it to clinical trials.

They screened EPHX2/sEH purchased from Cayman Chemical, a life sciences commercial vendor. For those contestants wishing to incorporate protein structural information in their submissions, the amino acid sequence is positions 2-555 from UniProt entry P34913, the crystal structure can be found in PDB entry 3i28, and the predicted structure can be found in AlphaFold2 entry P34913. Additional EPHX2/sEH crystal structures with ligands bound can be found in PDB.

BRD4

The second target, bromodomain 4, is encoded by the BRD4 locus and its protein product is also named BRD4. Bromodomains bind to protein spools in the nucleus that DNA wraps around (called histones) and affect the likelihood that the DNA nearby is going to be transcribed, producing new gene products. Bromodomains play roles in cancer progression and a number of drugs have been discovered to inhibit their activities.

They screened BRD4 purchased from Active Motif, a life sciences commercial vendor. For those contestants wishing to incorporate protein structural information in their submissions, the amino acid sequence is positions 44-460 from UniProt entry O60885-1, the crystal structure (for a single domain) can be found in PDB entry 7USK and predicted structure can be found in AlphaFold2 entry O60885. Additional BRD4 crystal structures with ligands bound can be found in PDB.

ALB(HSA)

The third target, serum albumin, is encoded by the ALB locus and its protein product is also named ALB. The protein product is sometimes abbreviated as HSA, for “human serum albumin”. ALB, the most common protein in the blood, is used to drive osmotic pressure (to bring fluid back from tissues into blood vessels) and to transport many ligands, hormones, fatty acids, and more.

Albumin, being the most abundant protein in the blood, often plays a role in absorbing candidate drugs in the body and sequestering them from their target tissues. Adjusting candidate drugs to bind less to albumin and other blood proteins is a strategy to help these candidate drugs be more effective.

ALB has been screened with DEL approaches previously but the screening data were not published. We included ALB to allow contestants to build models that might have a larger impact on drug discovery across many disease types. The ability to predict ALB binding well would allow drug developers to improve their candidate small molecule therapies much more quickly than physically manufacturing many variants and testing them against ALB empirically in an iterative process.

We screened ALB purchased from Active Motif. For those contestants wishing to incorporate protein structural information in their submissions, the amino acid sequence is positions 25 to 609 from UniProt entry P02768, the crystal structure can be found in PDB entry 1AO6, and predicted structure can be found in AlphaFold2 entry P02768. Additional ALB crystal structures with ligands bound can be found in PDB.

5. Discussions


5.1 [LB0.613, low GPU resources!] my experimental results

⭐️⭐️⭐️
It seems to contain some important experimental results, but I couldn't understand them yet. I'll come back to it later.

5.2 Data shrinking method

⭐️⭐️

  1. No ID column.
  2. The binds column saved as bytes.
  3. The buildingblock1_smiles/buildingblock2_smiles/buildingblock3_smiles columns saved as int16, holding the encoded index of each building block. The building blocks and their indices are saved in separate dictionaries.
  4. The protein/label columns transformed into three label columns, one per protein, shrinking the dataset length by a factor of three (the other columns have identical values for each three consecutive rows). A sketch of these steps follows below.

Original size: over 50GB.
Shrunken size: ~10GB.
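
A minimal sketch of steps 1-4, assuming the standard competition columns (the actual shrinking notebook may differ):

    import pandas as pd

    df = pd.read_parquet("train.parquet")
    df = df.drop(columns=["id"])                  # 1. no ID column
    df["binds"] = df["binds"].astype("int8")      # 2. labels as single bytes

    # 3. building blocks as int16 indices plus lookup dictionaries
    bb_cols = ["buildingblock1_smiles", "buildingblock2_smiles", "buildingblock3_smiles"]
    bb_dicts = {}
    for col in bb_cols:
        codes, uniques = pd.factorize(df[col])
        df[col] = codes.astype("int16")
        bb_dicts[col] = dict(enumerate(uniques))

    # 4. one row per molecule with three label columns (one per protein)
    df = df.pivot_table(index=["molecule_smiles", *bb_cols],
                        columns="protein_name", values="binds").reset_index()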

Shrinking notebook.
Shrunken dataset.
Shrunken dataset loading notebook.

5.3 The groupings and permutations of the competition data, in terms of building blocks

⭐️⭐️
Will read later.

5.4 Background papers and resources

⭐️
Will read if I have time.

⭐️⭐️
As the title says.

5.6 Similar competitions in the past

⭐️⭐️⭐️
Past competitions.

5.7 Fast GNN experiment results

⭐️⭐️⭐️
Looks useful as base code.
LB 0.602

5.8 Starter kernel, competition, solutions handbook

⭐️⭐️⭐️
A summary of various past methods for starters. Useful.

5.9 CV split strategy

⭐️⭐️
Should read after understanding how the scoring works.

5.10 faster fingerprint calculation

⭐️⭐️
This library provides roughly 10x faster fingerprint creation than sklearn. It looks very useful. (What is a fingerprint? See the embedding notes in section 7.)

5.11 The train data has 295,246,830 rows !!

⭐️
Some past competitions with huge datasets:

  • The stanford-ribonanza competition had a somewhat similar train size.
  • IceCube with ~117GB (NeurIPS: about 70GB)
  • Ribonanza had around ~1M samples (in terms of computational complexity, since each sample has 200-400 labels)

5.12 CV/LB

⭐️⭐️⭐️
They share their CV strategy. Looks useful.

5.13 3d-Graph-NN solution

⭐️⭐️
I may use this as base code.

5.14 useful papers on ML molecule representations

⭐️
This looks useful, but I feel it would take too much time.

5.15 Common ways to represent molecules in ML

⭐️☆
Looks great, but I don't have enough time, so I think I should stick to the representations used in public notebooks.

5.16 Videos: How DEL works?

⭐️⭐️
Videos on how DEL works. I'll watch them if I have time.

5.17 A fast way to tokenize(encoding)

⭐️⭐️⭐️
How to make a new dataset in my preferred way.

5.18 some new architecture

⭐️⭐️⭐️
Mamba is here! Great.

6. Additional info

One of the goals of this competition is to explore and compare many different ways of representing molecules. Small molecules have been represented with SMILES, graphs, 3D structures and more, including more esoteric methods such as spherical CNNs.
The competition hosts also expect us to try different ways of representing the molecules.

The data are provided in SMILES format, which is sufficient to translate into any other chemical representation format (e.g., via RDKit) you want to try.
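
For example, a minimal sketch of going from a SMILES string to other representations with RDKit (the example molecule is aspirin, not one from the dataset):

    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # parse SMILES

    # Morgan fingerprint: binary vector, radius 2, 2048 bits
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

    # 3D structure: add hydrogens and embed coordinates
    mol3d = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol3d, randomSeed=42)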

7. Codes


1DCNN starter

⭐️⭐️⭐️
As the title says: a 1DCNN starter. It simply encodes each character to a number, e.g. 'l' → 1.
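
A minimal sketch of that character-level encoding, with a toy vocabulary and a padding length I chose for illustration:

    import numpy as np

    smiles_list = ["CC(=O)N", "c1ccccc1"]                 # toy examples
    vocab = {ch: i + 1 for i, ch in enumerate(sorted(set("".join(smiles_list))))}
    MAX_LEN = 142                                         # pad/truncate length (assumption)

    def encode(s):
        ids = [vocab.get(ch, 0) for ch in s][:MAX_LEN]    # 0 = padding/unknown
        return np.array(ids + [0] * (MAX_LEN - len(ids)), dtype=np.int32)

    X = np.stack([encode(s) for s in smiles_list])        # (n_samples, MAX_LEN), 1DCNN input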

Many embedding methods

⭐️⭐️⭐️⭐️
So useful.

  1. Descriptors (RDKit): represented by many (208) meaningful parameters such as molecular weight and LogP (hydrophobicity).
  2. Molecular fingerprints (MACCS or Morgan): represented by a binary vector; each bit indicates the presence of a specific atomic or structural feature within the compound. Beware of bit collisions when nBits (the resolution) is small.
  3. Graph (DGL from RDKit): represented as a graph, with atoms as nodes and bonds as edges.
  4. Mol2Vec (pretrained mol2vec model): a pretrained model transforms each molecule into a fixed-size vector, which can be handled by many models (1DCNN, linear models, SVM, FNN, RNN, GBDT, etc.), but note the loss of information.
  5. Chemical language model embeddings (see the sketch after this list):
    There are language models pretrained for the chemical domain (chemical language models), which use tokenizers trained on SMILES.
    ・Models:
    ChemBERTa: adapted for chemical SMILES from the RoBERTa architecture, trained on a dataset of 77 million molecules
    MoLFormer: another transformer-based model adapted for SMILES but trained on a larger dataset (1.1 billion molecules!)
    So we can use these tokenizers for embedding, though they are presumably optimized for chemical language models.
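
A minimal sketch of method 5 using Hugging Face transformers; the checkpoint name is a public ChemBERTa variant I'm assuming here, not necessarily the one from the notebook:

    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "DeepChem/ChemBERTa-77M-MLM"   # assumed public checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    inputs = tokenizer(["CC(=O)Oc1ccccc1C(=O)O"], return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    embedding = out.last_hidden_state.mean(dim=1)   # mean-pooled SMILES embedding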

Additional_sEH_Data

⭐️⭐️
Additional data with new cores (the train set has only the triazine core, while the test set has more core types? Worth testing this addition; if it makes results worse, drop it). It looks like important information, but if I don't have much time I can ignore it. When you use it, you should add the binds (target) label using the code from the comments:

    # Label compounds as 'binders' or 'non-binders' based on experimental read count values
    # ('data' here is the additional sEH table loaded as a pandas DataFrame)
    data['bind'] = [0 if x == 0 else 1 for x in data['read_count']]

ChemBERTa baseline

⭐️⭐️⭐️
A ChemBERTa baseline.

Graph NN example

⭐️⭐️⭐️
A graph NN baseline.

AutoML Baseline

⭐️⭐️⭐️
An AutoML baseline with AutoGluon. It uses equal amounts of positive and negative data (down-sampling the negatives; a sketch follows below):

binds=1 data : 1.5M records (all positive data in the train data)
binds=0 data : 1.5M records (randomly picked)
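
A minimal sketch of that balanced down-sampling, assuming the standard competition columns:

    import pandas as pd

    df = pd.read_parquet("train.parquet")
    pos = df[df["binds"] == 1]                                        # all positives (~1.5M)
    neg = df[df["binds"] == 0].sample(n=len(pos), random_state=42)    # matched negatives
    balanced = pd.concat([pos, neg]).sample(frac=1, random_state=42)  # shuffle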

xgb baseline

⭐️⭐️⭐️
An XGBoost baseline.

So simple ensemble

⭐️⭐️
Nothing fancy, but it can get a good result (see the sketch below).
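
A minimal sketch of a simple blend, assuming each submission CSV has id and binds columns (the file names are placeholders):

    import pandas as pd

    files = ["sub_1dcnn.csv", "sub_gnn.csv", "sub_chemberta.csv"]   # placeholders
    subs = [pd.read_csv(f) for f in files]
    blend = subs[0].copy()
    blend["binds"] = sum(s["binds"] for s in subs) / len(subs)      # simple average
    blend.to_csv("submission.csv", index=False)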

GNN baseline 2

⭐️⭐️⭐️
A second GNN baseline.

BELKA - split CV like the host

⭐️⭐️⭐️⭐️
A great CV split method! It considers the additional cores. We should use this as the dataset for train and test.
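
A minimal sketch of the general idea, holding out building blocks so that validation molecules contain blocks unseen in training, mimicking the host's test split (the linked notebook's exact scheme may differ):

    import pandas as pd
    from sklearn.model_selection import GroupKFold

    df = pd.read_parquet("train.parquet")   # competition training table
    gkf = GroupKFold(n_splits=5)
    # group by the first building block so its molecules never straddle folds
    for tr_idx, va_idx in gkf.split(df, groups=df["buildingblock1_smiles"]):
        train_df, valid_df = df.iloc[tr_idx], df.iloc[va_idx]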

8. Plan

Make models that use each of the embeddings, and ensemble them.
I think this is the best approach here.
