This GitHub repository contains the dataset and relevant code for the paper:
SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction
The dataset is available in the Hugging Face dataset repo.
Please download and unzip all files.
The download should contain:
The dataset used to train the GNN and CNN-3D models.
The dataset used to train Uni-Mol or ProFSA model for the SIU 0.6 version. It contains train.lmdb, valid.lmdb and test.lmdb.
The dataset used to train Uni-Mol or ProFSA model for the SIU 0.9 version. It contains train.lmdb, valid.lmdb and test.lmdb.
Pretrained weights for Uni-Mol and ProFSA:
  weights for the pretrained ProFSA model
  weights for the pretrained Uni-Mol molecular encoder
  weights for the pretrained Uni-Mol pocket encoder
Complete dataset file in pickle format
Contains all structure files for proteins and docked small molecules.
In the pickle file, each key is a UniProt ID.
The corresponding value is a list of dictionaries. Each dictionary is a data point and has the following keys:
| Key | Description |
| --- | --- |
| atoms | atom types in ligand |
| coordinates | list of different conformations of the ligand |
| pocket_atoms | atom types in pocket |
| pocket_coordinates | atom positions of the pocket |
| source_data | UniProt ID and PDB ID information |
| label | dictionary for assay types and assay values |
| ik | InChI key of the ligand |
| smi | SMILES notation of the ligand |
All training and testing data are in LMDB format and use the same keys as shown above.
Note that for single-task learning, the label is a float value instead of a dictionary.
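Because the label is a dictionary in the multi-task data and a float in the single-task data, downstream code usually needs to handle both. The sketch below is one way to do that, not the repository's own code. It assumes the dictionary maps assay-type names to single numeric values; the helper name `label_to_vector` and the lowercase task names used in the example are hypothetical.

```python
import math


def label_to_vector(label, tasks):
    """Convert a data point's label into a list of floats.

    Assumption: a multi-task label is a dict {assay_type: numeric value}.
    Tasks with no measurement are filled with NaN so they can be masked later.
    A single-task label is already a float and is returned as a 1-element list.
    """
    if isinstance(label, dict):
        return [float(label[t]) if t in label else math.nan for t in tasks]
    return [float(label)]


if __name__ == "__main__":
    # Hypothetical task names; check the actual keys in your copy of the data.
    tasks = ["ic50", "ec50", "ki", "kd"]
    print(label_to_vector({"ki": 6.5}, tasks))
    print(label_to_vector(7.2, tasks))
```

Using NaN for missing assays keeps the vector length fixed, which makes it easy to mask unmeasured tasks in a loss function.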
In, we provide a script to read from lmdb files and pickle files.
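For orientation, here is a minimal sketch of how the two formats could be read. It is not the provided script. The function names are hypothetical, and the LMDB reader assumes the Uni-Mol convention of a single-file LMDB whose values are pickled dictionaries; verify both against the files you downloaded. It requires the third-party `lmdb` package only when `iter_lmdb` is actually called.

```python
import pickle


def load_pickle_dataset(path):
    """Load the complete dataset: {UniProt ID: [data point dict, ...]}."""
    with open(path, "rb") as f:
        return pickle.load(f)


def iter_data_points(dataset):
    """Yield (uniprot_id, data_point) pairs from the nested structure."""
    for uniprot_id, points in dataset.items():
        for point in points:
            yield uniprot_id, point


def iter_lmdb(path):
    """Yield decoded records from an LMDB file such as train.lmdb.

    Assumptions: the file is a single-file LMDB (subdir=False) and each
    value is a pickled dict with the keys listed in the table above.
    """
    import lmdb  # third-party package, imported lazily

    env = lmdb.open(path, subdir=False, readonly=True, lock=False,
                    readahead=False, meminit=False, max_readers=256)
    try:
        with env.begin() as txn:
            for _key, value in txn.cursor():
                yield pickle.loads(value)
    finally:
        env.close()
```

A typical use would be `for uniprot_id, point in iter_data_points(load_pickle_dataset(path))`, then reading fields such as `point["smi"]` or `point["label"]`.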
Follow the environment setup instructions for Uni-Mol and Atom3D.
Alternatively, use siu.yaml.
cd ./atom3d/examples/lba
for cnn3d
cd cnn3d
for gnn
cd gnn
start training
Note that the data path in should be changed to atom3d_data/split_60 or atom3d_data/split_90
All the parameters are in We train the model with one NVIDIA A100 GPU.
cd ./unimol_train_code
bash or
use the pretrained weights in
Note that data_path in the bash file should point to the split_60 or split_90 directory (for the 0.6 and 0.9 versions, respectively).
For multi-task learning, set --num-heads to 5. Otherwise, set it to 1 and point data_path to the corresponding directory inside split_60 or split_90 (ic50, ec50, ki, kd).
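The rule for choosing the data path and the number of heads can be summarized in a small helper. This is only an illustrative sketch: `resolve_run` is a hypothetical function, not part of the training code, and it assumes the single-task directories are named exactly `ic50`, `ec50`, `ki`, and `kd` inside each split directory.

```python
from pathlib import PurePosixPath

SPLITS = ("split_60", "split_90")          # 0.6 and 0.9 versions
SINGLE_TASKS = ("ic50", "ec50", "ki", "kd")


def resolve_run(split_dir, task=None):
    """Return (data_path, num_heads) for a training run.

    task=None means multi-task learning: use the split directory itself
    with 5 heads. Otherwise use the task's directory with 1 head.
    Assumption: task directories are named after the assay type.
    """
    if split_dir not in SPLITS:
        raise ValueError(f"unknown split directory: {split_dir}")
    if task is None:
        return split_dir, 5
    if task not in SINGLE_TASKS:
        raise ValueError(f"unknown task: {task}")
    return str(PurePosixPath(split_dir) / task), 1


if __name__ == "__main__":
    print(resolve_run("split_60"))
    print(resolve_run("split_90", "ki"))
```

The returned values correspond to what you would set for data_path and --num-heads in the bash script.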
All parameters are set in the bash scripts. We train the model with 4 NVIDIA A100 GPUs.