Running MolSnapper Locally

I've been interested in AI-based molecule generation, so I decided to set up and try out the code from a recent paper.
This time, the paper is MolSnapper.
The GitHub repository is here: https://github.com/oxpig/MolSnapper

In this article, I walk through running the MolSnapper repository locally on the protein-ligand complex with PDB ID 4AG8, with command and script examples for each of the following steps:

  1. Environment Setup
  2. Raw Data Preparation
  3. Preprocessing (Cleaning & Pocket Extraction)
  4. Molecule Sampling
  5. Evaluation Scoring
  6. Extracting Top Candidates & Merging SDFs (by Similarity / QED)
  7. Visualization in PyMOL

1. Environment Setup

# Clone the MolSnapper repository
# (writing to /usr/local/apps may require sudo; any directory you own also works)
mkdir -p /usr/local/apps
cd /usr/local/apps
git clone https://github.com/oxpig/MolSnapper.git
cd MolSnapper

# Create Conda environment
conda env create -f env.yml
conda activate MolSnapper

Required modules:

  • Python 3.9.18
  • PyTorch 2.0.1 + CUDA 11.7
  • PyTorch Geometric 2.3.1
  • RDKit 2022.03.5
  • Biopython 1.83
  • pdb-tools ≥2.0, etc.
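
After activating the environment, it can be worth confirming that the versions listed above are what actually got installed. The following is a minimal sketch using only the standard library. The distribution names in EXPECTED are my assumptions (conda and pip names sometimes differ from import names), and a result of None only means that no package metadata was found under that name.

```python
from importlib import metadata

# Distribution names are assumptions; adjust them to match your environment.
EXPECTED = {
    "torch": "2.0.1",
    "torch-geometric": "2.3.1",
    "rdkit": "2022.03.5",
    "biopython": "1.83",
}


def installed_versions(names):
    """Return {distribution name: version string, or None if no metadata is found}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions(EXPECTED).items():
        print(f"{name}: expected {EXPECTED[name]}, found {version}")
```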

2. Raw Data Preparation

# Create a working directory
mkdir -p ~/molsnapper/4AG8_test
cd ~/molsnapper/4AG8_test

# Download PDB file (4AG8)
wget -O 4AG8.pdb https://files.rcsb.org/download/4AG8.pdb

# Check the three-letter code for the ligand (e.g., LIG)
grep '^HETATM' 4AG8.pdb | awk '{print $4}' | sort | uniq

# Download the ligand model SDF
# (replace LIG in the URL with the three-letter code found above)
mkdir -p ligands
wget -O ligands/4AG8_0.sdf https://files.rcsb.org/ligands/download/LIG_model.sdf
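
The grep/awk step for finding the ligand code splits on whitespace, which can pick the wrong field when PDB columns run together (for example, atom serial numbers of 10000 or more fuse with the HETATM record name). As a hedged alternative, here is a small sketch that reads the residue name from its fixed column position (columns 18 to 20 of the PDB format). The function name and the choice to skip water (HOH) are my own.

```python
def hetero_residue_names(pdb_text, skip=("HOH",)):
    """Return the sorted residue names found on HETATM lines, read by fixed columns."""
    names = set()
    for line in pdb_text.splitlines():
        if line.startswith("HETATM") and len(line) >= 20:
            name = line[17:20].strip()  # PDB columns 18-20 hold the residue name
            if name and name not in skip:
                names.add(name)
    return sorted(names)


if __name__ == "__main__":
    import sys

    # Usage: python hetero_names.py 4AG8.pdb  (the script name is hypothetical)
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            print(" ".join(hetero_residue_names(handle.read())))
```

Besides the ligand, the list may also contain ions or crystallization additives, so the result still needs a quick manual check.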

Example directory structure:

~/molsnapper/4AG8_test/
├── 4AG8.pdb
└── ligands/
    └── 4AG8_0.sdf

3. Preprocessing (Cleaning & Splitting -> Pocket Extraction)

3.1 Cleaning & Splitting ATOM/HETATM

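# Run the scripts from the MolSnapper repository root so that scripts/ resolves
# (adjust the path if you cloned the repository elsewhere)
cd /usr/local/apps/MolSnapper
python scripts/clean_and_split.py \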
  --in-dir       ~/molsnapper/4AG8_test \
  --proteins-dir ~/molsnapper/4AG8_test/proteins \
  --ligands-dir  ~/molsnapper/4AG8_test/ligands

Example output:

proteins/4AG8_protein.pdb
ligands/4AG8_0.sdf

3.2 Generate Pocket .pkl File

python scripts/prepare_single_complex.py \
  --root_dir         ~/molsnapper/4AG8_test \
  --ligand_filename  ligands/4AG8_0.sdf \
  --protein_filename proteins/4AG8_protein.pdb \
  --out_pockets_path ~/molsnapper/4AG8_test/processed_pocket_4AG8.pkl
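
Before moving on to sampling, I find it useful to confirm that the pocket file was written and is not empty. I have not checked what the .pkl actually contains, so the sketch below only reports the top-level type and size of whatever is stored. The helper names are my own, and if the pickle holds objects defined inside the repository, it probably needs to be run from the repository root so that those classes can be imported.

```python
import pickle


def describe(obj, max_items=10):
    """Return a short, human-readable summary of a loaded object."""
    if isinstance(obj, dict):
        keys = list(obj)[:max_items]
        return f"dict with {len(obj)} keys: {keys}"
    if isinstance(obj, (list, tuple)):
        kind = type(obj[0]).__name__ if obj else "n/a"
        return f"{type(obj).__name__} of {len(obj)} items, first item type: {kind}"
    return type(obj).__name__


def describe_pickle(path):
    """Load a pickle file and summarize its top-level structure."""
    with open(path, "rb") as handle:
        return describe(pickle.load(handle))
```

For example, describe_pickle("processed_pocket_4AG8.pkl") prints nothing by itself but returns a one-line summary that can be passed to print().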

4. Molecule Sampling

mkdir -p ~/molsnapper/4AG8_test/outputs

python scripts/sample_single_pocket.py \
  --outdir       ~/molsnapper/4AG8_test/outputs \
  --config       configs/sample/sample_MolDiff.yml \
  --device       cuda:0 \
  --batch_size   0 \
  --pocket_path  ~/molsnapper/4AG8_test/processed_pocket_4AG8.pkl \
  --sdf_path     ~/molsnapper/4AG8_test/ligands/4AG8_0.sdf \
  --use_pharma   False \
  --clash_rate   0.1

The generated molecules are written as individual SDF files to a subdirectory of outputs/ (matched as sample_MolDiff_*_SDF in the commands below):

0.sdf, 1.sdf, ..., *_shifted.sdf
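
To see at a glance how many molecules were produced, the output files can be counted and separated into the plain and the _shifted variants. This is a minimal sketch assuming only the file naming shown above. The function name is my own, and the directory passed in is whichever sample_MolDiff_*_SDF folder was created on your machine.

```python
from pathlib import Path


def split_outputs(sample_dir):
    """Return (plain, shifted) lists of SDF paths found directly in sample_dir."""
    files = sorted(Path(sample_dir).glob("*.sdf"))
    shifted = [p for p in files if p.stem.endswith("_shifted")]
    plain = [p for p in files if not p.stem.endswith("_shifted")]
    return plain, shifted


if __name__ == "__main__":
    import sys

    # Usage: python count_outputs.py <sample directory>  (script name is hypothetical)
    if len(sys.argv) > 1:
        plain, shifted = split_outputs(sys.argv[1])
        print(f"{len(plain)} plain SDF files, {len(shifted)} shifted SDF files")
```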

5. Evaluation Scoring

mkdir -p ~/molsnapper/4AG8_test/outputs/eval

python scripts/evaluate.py \
  ~/molsnapper/4AG8_test/outputs/sample_MolDiff_*_SDF \
  --protein_path ~/molsnapper/4AG8_test/4AG8.pdb \
  --reflig_path  ~/molsnapper/4AG8_test/ligands/4AG8_0.sdf \
  --save_path    ~/molsnapper/4AG8_test/outputs/eval

Output:

~/molsnapper/4AG8_test/outputs/eval/eval_all.pt

→ Convert this file to results.csv with the following commands:

cd ~/molsnapper/4AG8_test/outputs/eval
python - << 'EOF'
import torch, pandas as pd

data = torch.load('eval_all.pt')
df   = pd.DataFrame(data)
chem = pd.json_normalize(df['chem_results'])
df2  = pd.concat([df.drop(columns=['chem_results']), chem], axis=1)
df2['sample'] = df2.index
df2.to_csv('results.csv', index=False)
print(df2.head())
EOF
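
The conversion above relies on each evaluation record carrying a nested chem_results dictionary that gets flattened into ordinary columns. To make that step easier to follow, here is a plain-Python sketch of the same idea on made-up records. It is not what pandas does internally, and the field names other than chem_results and sample are placeholders.

```python
def flatten_records(records, nested_key="chem_results"):
    """Lift the nested dictionary of each record into top-level columns."""
    rows = []
    for index, record in enumerate(records):
        # Keep every top-level field except the nested dictionary itself
        row = {key: value for key, value in record.items() if key != nested_key}
        # Merge the nested metrics in as ordinary columns
        row.update(record.get(nested_key) or {})
        # Row position doubles as the sample ID, as in the script above
        row["sample"] = index
        rows.append(row)
    return rows


if __name__ == "__main__":
    demo = [{"similarity": 0.4, "chem_results": {"qed": 0.7}}]  # made-up values
    print(flatten_records(demo))
```

Note that the sample column is simply the row position, so it only matches the SDF file names if the evaluation kept the molecules in file order, which is worth verifying on a few entries.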

6. Extracting Top Candidates & Merging SDFs

6.1 Top 5 by Similarity

# Get IDs
cd ~/molsnapper/4AG8_test/outputs/eval
ids=$(python - << 'EOF'
import pandas as pd
df = pd.read_csv('results.csv')
print(" ".join(df.sort_values('similarity', ascending=False).head(5)['sample'].astype(str)))
EOF
)

# Merge
merged=~/molsnapper/4AG8_test/outputs/top5_sim_shifted.sdf
> "$merged"
for id in $ids; do
  cat ~/molsnapper/4AG8_test/outputs/sample_MolDiff_*_SDF/${id}_shifted.sdf \
    >> "$merged"
  echo '$$$$' >> "$merged"
done

# Verify
grep -c '^\$\$\$\$$' "$merged"  # => 5
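
The shell loop above appends a $$$$ line after every file, which gives the expected count only when the individual files do not already end with that delimiter. As a hedged alternative, the sketch below merges SDF texts so that each record ends with exactly one $$$$ line either way. The function name is my own, and it works on file contents as strings so it does not depend on RDKit.

```python
def merge_sdf_texts(texts):
    """Join SDF records so that each input ends with exactly one '$$$$' line."""
    merged = []
    for text in texts:
        lines = text.rstrip("\n").split("\n")
        # Drop trailing blank lines and any delimiter the record already has;
        # leading lines are left untouched because the molfile header may be blank
        while lines and lines[-1].strip() in ("", "$$$$"):
            lines.pop()
        if lines:
            merged.extend(lines)
            merged.append("$$$$")
    return "\n".join(merged) + "\n" if merged else ""
```

In practice, the contents of each ${id}_shifted.sdf would be read with open(...).read(), passed in as a list, and the returned string written to the merged file.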

6.2 Top 5 by QED

# Get IDs
cd ~/molsnapper/4AG8_test/outputs/eval
ids=$(python - << 'EOF'
import pandas as pd
df = pd.read_csv('results.csv')
print(" ".join(df.sort_values('qed', ascending=False).head(5)['sample'].astype(str)))
EOF
)

# Merge
merged=~/molsnapper/4AG8_test/outputs/top5_qed_shifted.sdf
> "$merged"
for id in $ids; do
  cat ~/molsnapper/4AG8_test/outputs/sample_MolDiff_*_SDF/${id}_shifted.sdf \
    >> "$merged"
  echo '$$$$' >> "$merged"
done

# Verify
grep -c '^\$\$\$\$$' "$merged"  # => 5
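
Both rankings follow the same pattern: sort results.csv by one column and keep the first five sample IDs. The sketch below shows that selection with only the standard library, as one reusable function. The function name is my own, and unlike the pandas one-liner it drops rows whose score is missing or NaN instead of sorting them to the end.

```python
import csv
import io
import math


def top_ids(csv_text, column, n=5, id_column="sample"):
    """Return the IDs of the n rows with the highest value in the given column."""
    scored = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            score = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # missing column, short row, or empty cell
        if not math.isnan(score):
            scored.append((score, row[id_column]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sample_id for _, sample_id in scored[:n]]
```

With the contents of results.csv read into a string, top_ids(text, "similarity") and top_ids(text, "qed") give the ID lists for the two merges.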

7. Visualization in PyMOL

Let's load everything into PyMOL and take a look at the generated structures.

pymol
# Receptor
load ~/molsnapper/4AG8_test/4AG8.pdb, receptor

# Input structure
load ~/molsnapper/4AG8_test/ligands/4AG8_0.sdf, ligand


# Top 5 by Similarity
load ~/molsnapper/4AG8_test/outputs/top5_sim_shifted.sdf, sim_candidates

# Top 5 by QED
load ~/molsnapper/4AG8_test/outputs/top5_qed_shifted.sdf, qed_candidates
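
Typing these load commands each time gets tedious, so one option is to generate a small PyMOL script and open it with pymol view.pml. The following is only a sketch: the function name, the file name view.pml, and the display choices (cartoon for the receptor, sticks for the ligands, an 8 Å zoom buffer) are my own, while load, hide, show, and zoom are standard PyMOL commands.

```python
import os


def build_pml(receptor, ligands):
    """Return PyMOL script text; ligands maps an object name to a file path."""
    lines = [
        f"load {os.path.expanduser(receptor)}, receptor",
        "hide everything, receptor",
        "show cartoon, receptor",
    ]
    for name, path in ligands.items():
        lines.append(f"load {os.path.expanduser(path)}, {name}")
        lines.append(f"show sticks, {name}")
    if ligands:
        # Center the view on the first ligand with an 8 Angstrom buffer
        lines.append(f"zoom {next(iter(ligands))}, 8")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    base = "~/molsnapper/4AG8_test"
    script = build_pml(
        f"{base}/4AG8.pdb",
        {
            "ligand": f"{base}/ligands/4AG8_0.sdf",
            "sim_candidates": f"{base}/outputs/top5_sim_shifted.sdf",
            "qed_candidates": f"{base}/outputs/top5_qed_shifted.sdf",
        },
    )
    print(script)  # redirect to view.pml, then run: pymol view.pml
```

As far as I know, PyMOL loads a multi-record SDF as separate states of a single object, so the five candidates in each merged file appear one at a time unless all states are displayed together.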

Partway through, I realized that this method appears to work by transforming an input molecular structure within the confines of the pocket, so it is not exactly de novo generation.
Pocket2Mol and DiffSBDD appear to be able to generate molecular structures without a small-molecule input, so I'd like to try those next time.