Running MolSnapper Locally
I've been interested in AI-based molecule generation, so I decided to set up and test the method from a recent paper: MolSnapper.
The GitHub repository is at https://github.com/oxpig/MolSnapper.
In this article, I walk through running the MolSnapper repository locally on the protein-ligand complex with PDB ID 4AG8, with command and script examples for the following steps:
- Data Preparation
- Preprocessing (Cleaning & Pocket Extraction)
- Molecule Sampling
- Evaluation Scoring
- Extracting Top Candidates & Merging SDFs (by Similarity / QED)
1. Environment Setup
# Clone the MolSnapper repository
mkdir -p /usr/local/apps  # may require sudo, depending on permissions
cd /usr/local/apps
git clone https://github.com/oxpig/MolSnapper.git
cd MolSnapper
# Create Conda environment
conda env create -f env.yml
conda activate MolSnapper
Required modules:
- Python 3.9.18
- PyTorch 2.0.1 + CUDA 11.7
- PyTorch Geometric 2.3.1
- RDKit 2022.03.5
- Biopython 1.83
- pdb-tools ≥2.0, etc.
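After activating the environment, it can be worth confirming that the installed versions match the list above. The following is a minimal sketch using only the standard library; the distribution names in `EXPECTED` are my assumptions (conda builds sometimes register a package under a different name, or without metadata at all), so a `None` result does not necessarily mean the package is missing.

```python
from importlib import metadata

# Versions taken from the list above. The distribution names (the keys)
# are assumptions and may differ between pip and conda builds.
EXPECTED = {
    "torch": "2.0.1",
    "torch-geometric": "2.3.1",
    "rdkit": "2022.03.5",
    "biopython": "1.83",
    "pdb-tools": "2.0",
}


def installed_versions(names):
    """Return {name: installed version string, or None if not found}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions(EXPECTED).items():
        print(f"{name:16s} expected {EXPECTED[name]:10s} found {version}")
```

The script only prints both values side by side rather than comparing them, because version strings are not always formatted identically (for example, leading zeros may be dropped).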
2. Raw Data Preparation
# Create a working directory
mkdir -p ~/molsnapper/4AG8_test
cd ~/molsnapper/4AG8_test
# Download PDB file (4AG8)
wget -O 4AG8.pdb https://files.rcsb.org/download/4AG8.pdb
# Check the three-letter code for the ligand (e.g., LIG)
grep '^HETATM' 4AG8.pdb | awk '{print $4}' | sort | uniq
# Download the ligand model SDF (replace LIG with the three-letter code found above)
mkdir -p ligands
wget -O ligands/4AG8_0.sdf https://files.rcsb.org/ligands/download/LIG_model.sdf
Example directory structure:
~/molsnapper/4AG8_test/
├── 4AG8.pdb
└── ligands/
    └── 4AG8_0.sdf
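The grep/awk check for the ligand code splits each line on whitespace, but PDB is a fixed-column format and neighbouring fields can run together, which shifts the awk field numbers. As a hedged alternative, here is a small sketch that reads the residue name from columns 18-20 and skips water; the function name `hetero_residues` is my own and not part of MolSnapper.

```python
from collections import Counter


def hetero_residues(pdb_lines, skip=("HOH",)):
    """Count HETATM atoms per residue name, read from PDB columns 18-20."""
    counts = Counter()
    for line in pdb_lines:
        if line.startswith("HETATM"):
            resname = line[17:20].strip()  # columns 18-20, 0-based slice
            if resname and resname not in skip:
                counts[resname] += 1
    return counts
```

Called as `hetero_residues(open("4AG8.pdb"))`, it returns a counter of residue names with their atom counts. The bound ligand is usually the entry with the most atoms, while ions and buffer components tend to show up with only a few.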
3. Preprocessing (Cleaning & Splitting -> Pocket Extraction)
3.1 Cleaning & Splitting ATOM/HETATM
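# Run from the MolSnapper repository root, since scripts/ is a relative path
cd /usr/local/apps/MolSnapper
python scripts/clean_and_split.py \
  --in-dir ~/molsnapper/4AG8_test \
  --proteins-dir ~/molsnapper/4AG8_test/proteins \
  --ligands-dir ~/molsnapper/4AG8_test/ligands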
Example output:
proteins/4AG8_protein.pdb
ligands/4AG8_0.sdf
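To make the idea of this step concrete, here is a minimal sketch of splitting a PDB file into protein and ligand records. It is not the implementation of `clean_and_split.py`, whose internals are not shown here; it only illustrates the ATOM/HETATM separation under the assumption that the ligand is identified by its three-letter residue code.

```python
def split_pdb(pdb_lines, ligand_code):
    """Split PDB records into (protein_lines, ligand_lines).

    ATOM records go to the protein; HETATM records whose residue name
    equals ligand_code go to the ligand. Waters, ions and all other
    records are dropped.
    """
    protein, ligand = [], []
    for line in pdb_lines:
        record = line[:6].strip()
        resname = line[17:20].strip()
        if record == "ATOM":
            protein.append(line)
        elif record == "HETATM" and resname == ligand_code:
            ligand.append(line)
    return protein, ligand
```

Note that the ligand lines obtained this way are still PDB records without bond orders, so turning them into an SDF would need a cheminformatics toolkit. In this walkthrough the ligand SDF is downloaded separately instead.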
3.2 Generate Pocket .pkl File
python scripts/prepare_single_complex.py \
  --root_dir ~/molsnapper/4AG8_test \
  --ligand_filename ligands/4AG8_0.sdf \
  --protein_filename proteins/4AG8_protein.pdb \
  --out_pockets_path ~/molsnapper/4AG8_test/processed_pocket_4AG8.pkl
4. Molecule Sampling
mkdir -p ~/molsnapper/4AG8_test/outputs
python scripts/sample_single_pocket.py \
  --outdir ~/molsnapper/4AG8_test/outputs \
  --config configs/sample/sample_MolDiff.yml \
  --device cuda:0 \
  --batch_size 0 \
  --pocket_path ~/molsnapper/4AG8_test/processed_pocket_4AG8.pkl \
  --sdf_path ~/molsnapper/4AG8_test/ligands/4AG8_0.sdf \
  --use_pharma False \
  --clash_rate 0.1
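The generated molecules are written to a subdirectory of outputs/ (matched by sample_MolDiff_*_SDF in the next step) as 0.sdf, 1.sdf, ..., together with the corresponding *_shifted.sdf files.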
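Before moving on to evaluation, it helps to check how many molecules were actually written. Below is a small sketch that collects the `<id>_shifted.sdf` files from one sample directory; the helper name `list_shifted_samples` is hypothetical, and the file-naming pattern is taken from the output described above.

```python
from pathlib import Path

SUFFIX = "_shifted.sdf"


def list_shifted_samples(sample_dir):
    """Return {sample_id: path} for files named <id>_shifted.sdf, sorted by id."""
    samples = {}
    for path in Path(sample_dir).glob("*" + SUFFIX):
        stem = path.name[: -len(SUFFIX)]
        if stem.isdigit():  # ignore anything that is not a numeric sample id
            samples[int(stem)] = path
    return dict(sorted(samples.items()))
```

`len(list_shifted_samples(sample_dir))` then gives the number of generated molecules, and the keys are the numeric IDs used later when picking top candidates.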
5. Evaluation Scoring
mkdir -p ~/molsnapper/4AG8_test/outputs/eval
python scripts/evaluate.py \
  ~/molsnapper/4AG8_test/outputs/sample_MolDiff_*_SDF \
  --protein_path ~/molsnapper/4AG8_test/4AG8.pdb \
  --reflig_path ~/molsnapper/4AG8_test/ligands/4AG8_0.sdf \
  --save_path ~/molsnapper/4AG8_test/outputs/eval
Output:
~/molsnapper/4AG8_test/outputs/eval/eval_all.pt
Convert this file to results.csv with the following commands:
cd ~/molsnapper/4AG8_test/outputs/eval
python - << 'EOF'
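import torch, pandas as pd
# Load the evaluation results saved by evaluate.py
data = torch.load('eval_all.pt')
df = pd.DataFrame(data)
# Expand the nested chem_results dicts into their own columns
chem = pd.json_normalize(df['chem_results'])
df2 = pd.concat([df.drop(columns=['chem_results']), chem], axis=1)
# The row index is used later as the sample ID to locate <id>_shifted.sdf
df2['sample'] = df2.index
df2.to_csv('results.csv', index=False)
print(df2.head())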
EOF
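The conversion above relies on `json_normalize` to lift the nested `chem_results` entries into top-level columns. To show what that flattening does, here is a standard-library sketch of the same idea. It assumes, as the pandas script implies, that the data is a list of dicts, each with a nested `chem_results` dict; the keys and values in the usage example are made up for illustration.

```python
import csv


def flatten(record, nested_key="chem_results"):
    """Lift the keys of one nested dict up into the parent record."""
    flat = {k: v for k, v in record.items() if k != nested_key}
    flat.update(record.get(nested_key) or {})
    return flat


def write_results(records, csv_path):
    """Flatten each record, add a 'sample' index column, and write a CSV."""
    rows = []
    for i, record in enumerate(records):
        row = flatten(record)
        row["sample"] = i
        rows.append(row)
    # Collect column names in first-seen order so rows may differ in keys
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames
```

For example, `write_results([{"similarity": 0.5, "chem_results": {"qed": 0.7}}], "toy.csv")` writes a file with the columns `similarity`, `qed` and `sample`.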
6. Extracting Top Candidates & Merging SDFs
6.1 Top 5 by Similarity
# Get IDs
cd ~/molsnapper/4AG8_test/outputs/eval
ids=$(python - << 'EOF'
import pandas as pd
df = pd.read_csv('results.csv')
print(" ".join(df.sort_values('similarity', ascending=False).head(5)['sample'].astype(str)))
EOF
)
# Merge
merged=~/molsnapper/4AG8_test/outputs/top5_sim_shifted.sdf
> "$merged"
for id in $ids; do
  cat ~/molsnapper/4AG8_test/outputs/sample_MolDiff_*_SDF/${id}_shifted.sdf \
    >> "$merged"
  # Append the record delimiter only if the file did not already end with one
  tail -n 1 "$merged" | grep -q '^\$\$\$\$' || echo '$$$$' >> "$merged"
done
# Verify
grep -c '^\$\$\$\$$' "$merged" # => 5
6.2 Top 5 by QED
# Get IDs
cd ~/molsnapper/4AG8_test/outputs/eval
ids=$(python - << 'EOF'
import pandas as pd
df = pd.read_csv('results.csv')
print(" ".join(df.sort_values('qed', ascending=False).head(5)['sample'].astype(str)))
EOF
)
# Merge
merged=~/molsnapper/4AG8_test/outputs/top5_qed_shifted.sdf
> "$merged"
for id in $ids; do
  cat ~/molsnapper/4AG8_test/outputs/sample_MolDiff_*_SDF/${id}_shifted.sdf \
    >> "$merged"
  # Append the record delimiter only if the file did not already end with one
  tail -n 1 "$merged" | grep -q '^\$\$\$\$' || echo '$$$$' >> "$merged"
done
# Verify
grep -c '^\$\$\$\$$' "$merged" # => 5
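Sections 6.1 and 6.2 differ only in the sort column and the output name, so the merge step could also be written once as a small helper. The sketch below is one way to do that with the standard library. It makes sure each record ends with exactly one `$$$$` line, whether or not the input files already contain the delimiter. The function names are my own.

```python
def read_records(path):
    """Split an SDF file into records, without the $$$$ delimiter lines."""
    records, current = [], []
    with open(path) as handle:
        for line in handle:
            if line.strip() == "$$$$":
                if any(part.strip() for part in current):
                    records.append("".join(current))
                current = []
            else:
                current.append(line)
    # Handle a final record that has no trailing delimiter
    if any(part.strip() for part in current):
        records.append("".join(current))
    return records


def merge_sdf(paths, out_path):
    """Write all records from paths to out_path; return the record count."""
    n_written = 0
    with open(out_path, "w") as out:
        for path in paths:
            for record in read_records(path):
                if not record.endswith("\n"):
                    record += "\n"
                out.write(record + "$$$$\n")
                n_written += 1
    return n_written
```

The return value plays the same role as the `grep -c` check: for five single-molecule input files it should be 5.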
7. Visualization in PyMOL
Let's load everything into PyMOL and take a look at the generated structures.
# Receptor
load ~/molsnapper/4AG8_test/4AG8.pdb, receptor
# Input structure
load ~/molsnapper/4AG8_test/ligands/4AG8_0.sdf, ligand
# Top 5 by Similarity
load ~/molsnapper/4AG8_test/outputs/top5_sim_shifted.sdf, sim_candidates
# Top 5 by QED
load ~/molsnapper/4AG8_test/outputs/top5_qed_shifted.sdf, qed_candidates
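If you repeat this for several runs, the load commands can be generated rather than typed. The following sketch builds a PyMOL script (.pml) as plain text. The display commands added at the end (`hide`, `show`, `set all_states`, `zoom`) are my additions rather than part of the steps above, and the helper name `build_pml` is hypothetical.

```python
def build_pml(receptor, ligand, candidates):
    """Return the text of a PyMOL script that loads and displays the structures.

    candidates maps a PyMOL object name to the path of a merged SDF file.
    """
    lines = [
        f"load {receptor}, receptor",
        f"load {ligand}, ligand",
    ]
    for name, path in candidates.items():
        lines.append(f"load {path}, {name}")
    lines += [
        "hide everything",
        "show cartoon, receptor",
        "show sticks, ligand",
    ]
    for name in candidates:
        lines.append(f"show sticks, {name}")
    # A multi-record SDF is loaded as one object with several states;
    # all_states displays every state at once instead of only the first.
    lines.append("set all_states, on")
    lines.append("zoom ligand, 8")
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file such as `view.pml` and opening it with `pymol view.pml` should reproduce the session, assuming the paths are valid on your machine.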
Partway through, I realized that this method seems to modify an input molecular structure within the confines of the pocket, so it isn't exactly de novo generation.
Pocket2Mol and DiffSBDD appear to be able to generate molecular structures without a small-molecule input, so I'd like to try those next time.