MultiGeo-DTA: Multi-modal Geometric Deep Learning Enables Drug-Target Affinity Prediction with Robustness and Generalization

MultiGeoDTA is a multimodal neural network that integrates protein pocket 3D structure, drug 3D structure, protein full sequence, protein pocket sequence, and drug SMILES sequence information to predict drug-target affinity.

1. Project layout

MultiGeoDTA/
├── assets/                              # Figures (e.g. MultiGeoDTA.png, benchmark plots)
├── docs/                                # Detailed documentation (MultiGeoDTA.md)
├── src/multigeodta/                     # Installable package
│   ├── cli.py                           # Unified CLI (train / evaluate / screen)
│   ├── config/                          # YAML loading
│   ├── data/                            # Datasets, featurizers, task registry
│   ├── models/                          # DTA model & GVP
│   ├── training/                        # Trainer & experiment loop
│   ├── inference/                       # Virtual screening
│   ├── metrics/                         # Regression metrics
│   └── utils/
├── configs/
│   ├── base.yaml
│   └── tasks/                           # Per-benchmark YAML
├── scripts/
│   ├── install.sh                       # Environment setup
│   ├── download_datasets_and_model_weights.sh  # Hugging Face data & checkpoints
│   ├── run_predict_pipeline.sh          # End-to-end affinity prediction
│   ├── build_zinc_vs_dataset.sh         # ZINC virtual-screening prep
│   ├── smoke_install_and_train.sh       # Smoke test
│   ├── upload_to_hf.sh                  # Upload artifacts to Hugging Face
│   ├── predict_affinity_from_sequence/  # New-target prediction pipeline
│   │   ├── predict_structure.py         # ESMFold2 + DoGSite3 pocket detection
│   │   ├── structure_pipeline.py        # Structure generation helpers
│   │   └── predict_affinity.py          # MultiGeo-DTA ensemble inference
│   ├── dataset/                         # Raw PDBBind preprocessing
│   ├── virtual_screening/
│   │   ├── docking/                     # Molecular docking demo
│   │   └── filter_vs_results.ipynb
│   └── lib/
│       └── install_deps.sh              # Dependency installer (used by install.sh)
├── requirements/                        # base.txt, cuda118.txt
├── data/                                # Downloaded datasets (gitignored)
├── outputs/                             # Checkpoints, logs & user predictions (gitignored)
│   └── user_request_results/            # Timestamped outputs from run_predict_pipeline.sh
├── pyproject.toml
└── environment.yml

2. Quick install

Recommended — one script installs PyTorch 2.1 + cu118, DGL, PyG, and mamba wheels in the correct order:

cd MultiGeoDTA
bash scripts/install.sh
conda activate multigeodta

Other CUDA / PyTorch versions: pick matching wheels from:

3. Download preprocessed data and trained model weights

export HF_ENDPOINT=https://hf-mirror.com
bash scripts/download_datasets_and_model_weights.sh
export MULTIGEODTA_DATA_DIR=/path/to/MultiGeoDTA/data

4. End-to-end drug-target affinity prediction for new data (uses model weights trained on PDBBind v2020)

Step 1: Create a new virtual environment for ESMFold2

conda create -n esmfold2 python=3.12 -y
conda activate esmfold2
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install "esm@git+https://github.com/Biohub/esm.git@main"
export HF_ENDPOINT=https://hf-mirror.com   # Hugging Face mirror used when downloading the ESMFold2 model weights

Step 2: Run end-to-end prediction

bash scripts/run_predict_pipeline.sh -p "GENFMDIECFMVLNPSQQLAIAVLSLTLGTFTVLENLLVLCVILHSRSLRCRPSYHFIGSLAVADLLGSVIFVYSFIDFHVFHRKDSRNVFLFKLGGVTASFTASVGSLFLAAIDRYISIHRPLAYKRIVTRPKAVVAFCLMWTIAIVIAVLPLLGWNCEKLQSVCSDIFPHIDKTYLMFWIGVVSVLLLFIVYAYMYILWKAHSHAVAKALIVYGSTTGNTEYTAETIARELADAGYEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSIELQDDFIPLFDSLEETGAQGRKVACFGCGDSSWEYFCGAVDAIEEKLKNLGAEIVQDGLRIDGDPRAARDDIVGWAHDVRGAIPDQARMDIELAKTLVLILVVLIICWGPLLAIMVYDVFGKMNKLIKTVFAFCSMLCLLNSTVNPIIYALRSKDLRHAFRSMFPS" -s "Cc1nn(CCCNC(=O)c2ccco2)c(C)c1Br"

Note: -p takes the full protein sequence (one-letter amino acid codes); -s takes the ligand SMILES string.
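
Because the pipeline takes both inputs as raw strings, a quick check before launching it can catch copy-paste errors early. The sketch below is a hypothetical pre-flight check, not part of the MultiGeoDTA package: the function names are made up, the residue alphabet is assumed to be the 20 standard amino acids, and the SMILES check only verifies bracket balance, which is far weaker than parsing with a cheminformatics toolkit.

```python
# Hypothetical pre-flight checks; these helpers are not part of MultiGeoDTA.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence: str) -> set[str]:
    """Return the characters in `sequence` that are not standard amino acids."""
    return set(sequence.upper()) - STANDARD_AMINO_ACIDS


def brackets_balanced(smiles: str) -> bool:
    """Cheap syntax check: every '(' and '[' is closed in the right order."""
    closers = {")": "(", "]": "["}
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


protein = "GENFMDIECFMVLNPSQQLAIAVLSL"          # first residues of the example sequence
smiles = "Cc1nn(CCCNC(=O)c2ccco2)c(C)c1Br"      # ligand from the example command

print("unexpected residues:", invalid_residues(protein))
print("SMILES brackets balanced:", brackets_balanced(smiles))
```

A sequence that fails the first check, or a SMILES string that fails the second, is probably worth fixing before spending time on structure prediction.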

5. Gold Standard Benchmark Results

PDBBind v2016

PDBBind v2020

6. Other Usage

All commands use the unified CLI (python -m multigeodta). Each benchmark has a YAML under configs/tasks/; override hyperparameters via CLI flags when needed.
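
To make the override behaviour concrete, here is a minimal sketch of layered configuration. It assumes a precedence of base config, then task config, then CLI flags; only the "CLI flags override YAML" part is stated in this README, and the rest is an assumption. The `merge_config` helper and the `lr` and `batch_size` values are hypothetical; `split_method`, `thre`, and `output_dir` mirror flags used in the commands in this section.

```python
# Hypothetical illustration of layered overrides; the real loader may differ.
def merge_config(base: dict, task: dict, cli_overrides: dict) -> dict:
    """Later layers win; CLI values left as None (flag not given) are ignored."""
    merged = {**base, **task}
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged


base = {"lr": 1e-4, "batch_size": 32}  # made-up values standing in for configs/base.yaml
task = {"task": "pdbbind_v2021_similarity", "split_method": "new_new", "thre": 0.5}
cli = {"split_method": "new_protein", "thre": 0.4, "output_dir": None}

cfg = merge_config(base, task, cli)
print(cfg["split_method"], cfg["thre"])
```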

PDBBind v2016

# Train
python -m multigeodta train --config configs/tasks/pdbbind_v2016.yaml

# Evaluate
python -m multigeodta evaluate --config configs/tasks/pdbbind_v2016.yaml \
  --model_file pdbbind_v2016 --output_dir outputs/pdbbind_v2016

PDBBind v2020

# Train
python -m multigeodta train --config configs/tasks/pdbbind_v2020.yaml

# Evaluate
python -m multigeodta evaluate --config configs/tasks/pdbbind_v2020.yaml \
  --model_file pdbbind_v2020 --output_dir outputs/pdbbind_v2020

PDBBind v2021 (similarity split)

Similarity-based splits with ligand/protein novelty settings (new_compound, new_protein, new_new) and Tanimoto thresholds (0.3–0.6). Default config uses new_new at threshold 0.5.

# Train (default: new_new / 0.5)
python -m multigeodta train --config configs/tasks/pdbbind_v2021_similarity.yaml

# Train another split
python -m multigeodta train --task pdbbind_v2021_similarity \
  --split_method new_protein --thre 0.4 \
  --output_dir outputs/pdbbind_v2021_similarity/new_protein/0.4

# Evaluate
python -m multigeodta evaluate --config configs/tasks/pdbbind_v2021_similarity.yaml \
  --model_file pdbbind_v2021_similarity/new_new/0.5 \
  --output_dir outputs/pdbbind_v2021_similarity/new_new/0.5
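
For readers unfamiliar with the threshold, the sketch below computes the Tanimoto coefficient, |A ∩ B| / |A ∪ B|, on toy ligand fingerprints represented as sets of on-bit indices. The novelty rule shown (a compound counts as new when its similarity to every training compound is below the threshold) is an assumption for illustration; the README does not spell out how the splits were built, and how protein similarity is measured is not covered here.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_novel(query: set[int], train: list[set[int]], threshold: float) -> bool:
    """Assumed rule: novel if similarity to every training item is below threshold."""
    return all(tanimoto(query, fp) < threshold for fp in train)


train_fps = [{1, 2, 3, 4}, {2, 3, 5, 8}]  # toy fingerprints, not real molecules
query_fp = {1, 2, 9, 10}

sims = [tanimoto(query_fp, fp) for fp in train_fps]
print("similarities:", [round(s, 3) for s in sims])
print("novel at 0.5:", is_novel(query_fp, train_fps, 0.5))
print("novel at 0.3:", is_novel(query_fp, train_fps, 0.3))
```

Under this rule a lower threshold is stricter: the toy query has a maximum similarity of 1/3, so it counts as new at 0.5 but not at 0.3.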

PDBBind v2021 (time split)

# Train
python -m multigeodta train --config configs/tasks/pdbbind_v2021_time.yaml

# Evaluate
python -m multigeodta evaluate --config configs/tasks/pdbbind_v2021_time.yaml \
  --model_file pdbbind_v2021_time --output_dir outputs/pdbbind_v2021_time

LP-PDBBind

# Train
python -m multigeodta train --config configs/tasks/lp_pdbbind.yaml

# Evaluate
python -m multigeodta evaluate --config configs/tasks/lp_pdbbind.yaml \
  --model_file lp_pdbbind --output_dir outputs/lp_pdbbind

PDBBind v2016 robustness benchmark

Robustness CSVs live under data/pdbbind_v2016/pdbbind_v2016_robustness_test/.
Only the training set is perturbed; validation and test use the standard PDBBind v2016 splits.

| Type | Variants | Description |
| --- | --- | --- |
| Missing samples | `missing_0.2` … `missing_0.8` | Randomly drop 20%–80% of training rows |
| Label noise | `noised_scale_0.2` … `noised_scale_1.0` | Add Gaussian noise: `label + scale × N(0,1)` |

See [data/pdbbind_v2016/pdbbind_v2016_robustness_test/README.md](data/pdbbind_v2016/pdbbind_v2016_robustness_test/README.md) for file details and regeneration scripts.
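
The two perturbations can be sketched in a few lines. This is only an illustration of the definitions above, using the standard library; the function names, the seed, and the rounding of the number of dropped rows are assumptions, and the regeneration scripts referenced in that README remain the authoritative version.

```python
import random


def drop_rows(rows: list, fraction: float, rng: random.Random) -> list:
    """Randomly remove `fraction` of the rows, keeping the original order."""
    n_keep = len(rows) - round(len(rows) * fraction)
    kept_idx = sorted(rng.sample(range(len(rows)), n_keep))
    return [rows[i] for i in kept_idx]


def add_label_noise(labels: list[float], scale: float, rng: random.Random) -> list[float]:
    """Apply label + scale * N(0, 1) independently to each label."""
    return [y + scale * rng.gauss(0.0, 1.0) for y in labels]


rng = random.Random(0)            # fixed seed so the toy run is repeatable
train_rows = list(range(100))     # stand-in for training-set rows
train_labels = [6.5] * 100        # stand-in affinity labels

kept = drop_rows(train_rows, 0.2, rng)              # analogous to missing_0.2
noisy = add_label_noise(train_labels, 0.4, rng)     # analogous to noised_scale_0.4
print(len(kept), len(noisy))
```

As stated above, only the training set would be passed through either function; validation and test data stay untouched.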

# Train one variant (YAML per variant under configs/tasks/pdbbind_v2016_robustness/)
python -m multigeodta train \
  --config configs/tasks/pdbbind_v2016_robustness/noised_scale_0.4.yaml

# Or select the variant with --task / --variant instead of a per-variant YAML
python -m multigeodta train --task pdbbind_v2016_robustness \
  --variant missing_0.6 \
  --output_dir outputs/pdbbind_v2016_robustness/missing_0.6

# Evaluate
python -m multigeodta evaluate \
  --config configs/tasks/pdbbind_v2016_robustness/noised_scale_0.4.yaml \
  --model_file pdbbind_v2016_robustness/noised_scale_0.4 \
  --output_dir outputs/pdbbind_v2016_robustness/noised_scale_0.4

# Run all 9 variants
for cfg in configs/tasks/pdbbind_v2016_robustness/*.yaml; do
  python -m multigeodta train --config "$cfg"
done
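
The same sweep can be driven from Python when the variant list needs to be built or filtered programmatically. The sketch below only constructs the command lines and prints them; it assumes the nine variants are spaced 0.2 apart (4 `missing_*` plus 5 `noised_scale_*`), which is inferred from the ranges and the count above rather than stated explicitly. The flags are the ones used in the `--task` example in this section.

```python
import sys


def robustness_variants() -> list[str]:
    """Variant names, assuming a step of 0.2 within each range."""
    missing = [f"missing_{x / 10:.1f}" for x in range(2, 9, 2)]        # 0.2 .. 0.8
    noised = [f"noised_scale_{x / 10:.1f}" for x in range(2, 11, 2)]   # 0.2 .. 1.0
    return missing + noised


def train_command(variant: str) -> list[str]:
    """argv list for training one variant with the unified CLI."""
    return [
        sys.executable, "-m", "multigeodta", "train",
        "--task", "pdbbind_v2016_robustness",
        "--variant", variant,
        "--output_dir", f"outputs/pdbbind_v2016_robustness/{variant}",
    ]


commands = [train_command(v) for v in robustness_variants()]
for cmd in commands:
    print("python", " ".join(cmd[1:]))
# To actually launch the runs: subprocess.run(cmd, check=True) for each cmd.
```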

Virtual Screening

Screen ZINC compounds against a target (protein sequence + pocket positions). Default config uses a CB1R example and checkpoints trained on PDBBind v2020.

# Screen
python -m multigeodta screen --config configs/tasks/zinc_vs.yaml \
  --model_file pdbbind_v2020 --output_dir outputs/zinc --device 0

See [data/zinc/README.md](data/zinc/README.md) for the full ZINC download, preprocessing, and new-target workflow. One-command build: bash scripts/build_zinc_vs_dataset.sh.
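
After screening, the usual next step is to shortlist the highest-scoring compounds. The sketch below assumes predictions are available as (SMILES, predicted affinity) pairs and that a higher predicted value means stronger binding, as with PDBBind-style negative-log affinities; both are assumptions, the scores are made up, and the actual output format of the `screen` command is not described in this README.

```python
import heapq


def top_hits(predictions, k: int):
    """Return the k (smiles, score) pairs with the highest predicted affinity."""
    return heapq.nlargest(k, predictions, key=lambda item: item[1])


# Made-up scores for illustration only.
preds = [("CCO", 4.1), ("c1ccccc1", 5.3), ("CC(=O)O", 3.8), ("CCN", 6.0)]

for smiles, score in top_hits(preds, 2):
    print(f"{score:.2f}\t{smiles}")
```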

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| `MULTIGEODTA_DATA_DIR` | `./data` or `./create_dataset` | Dataset root |
| `MULTIGEODTA_OUTPUT_DIR` | `./outputs` | Checkpoints & logs |
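
A minimal sketch of how the dataset root could be resolved from these settings. The `resolve_data_dir` helper and its fallback order (environment variable first, then `./data`, then `./create_dataset`) are assumptions for illustration; the package's own lookup may differ.

```python
import os
import tempfile
from pathlib import Path


def resolve_data_dir(env=None, cwd=None) -> Path:
    """Pick the dataset root: env var if set, else the first existing default."""
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    explicit = env.get("MULTIGEODTA_DATA_DIR")
    if explicit:
        return Path(explicit)
    for name in ("data", "create_dataset"):  # assumed order of the two defaults
        candidate = cwd / name
        if candidate.is_dir():
            return candidate
    return cwd / "data"


# Explicit setting wins.
explicit = resolve_data_dir(env={"MULTIGEODTA_DATA_DIR": "/path/to/MultiGeoDTA/data"})

# Without the variable, fall back to whichever default directory exists.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "create_dataset").mkdir()
    fallback = resolve_data_dir(env={}, cwd=tmp)

print(explicit, fallback.name)
```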

7. Detailed project documentation

To build on MultiGeoDTA or to understand how it works in more depth, see the detailed project documentation: [docs/MultiGeoDTA.md](docs/MultiGeoDTA.md)

8. Citation

If you use this code, please cite the MultiGeo-DTA paper and contact Yazi Li (yazi_li@tongji.edu.cn) for questions.

9. Contact

GitHub issues or yazi_li@tongji.edu.cn
