GENOVA: Generative Modeling Framework for Highly Bioavailable and Blood Brain Barrier Permeant Drug Design
Hasegawa K., Papadopoulos E., Xie E., Kementzidis G., Chorev M., Aktas B.H., Deng Y.
GENOVA is a generative AI framework for the de novo design of highly bioavailable, blood-brain barrier (BBB) permeant drug candidates. As a proof of concept, GENOVA is applied to the discovery of novel BACE1 inhibitors as potential therapeutics for Alzheimer's disease (AD).
GENOVA integrates:
- A SELFIES-based Autoencoder for robust molecular representation
- QSAR Neural Networks for predicting key pharmacological properties
- A Transfer Learning (TL)-based WGAN-GP for de novo molecule generation
- A Genetic Algorithm (GA) for multi-property fitness optimization
- SAS-QED-PAINS filtering for compound selection
- AutoDock Vina molecular docking for independent validation
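
For orientation, the WGAN-GP component named in the list above is usually trained with the standard critic objective from the original WGAN-GP formulation. The expression below is that textbook form, not a statement of GENOVA's exact loss; the penalty weight $\lambda$ and the interpolation scheme used in this repository are not specified here and should be read from `config.py` and `models/WGANGP.py`.

$$
L_{\text{critic}} = \mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim P_r}\left[D(x)\right] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]
$$

Here $D$ is the critic, $P_r$ and $P_g$ are the real and generated distributions, and $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$ with $\epsilon \sim U[0, 1]$ is a random interpolate between a real and a generated sample.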
Starting from over 2 million generated novel compounds, GENOVA identifies 190 BBB-permeable candidate BACE1 inhibitors that outperform all BACE1 inhibitors that have advanced to human clinical trials in terms of binding affinity and/or specificity.
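
The genetic algorithm step can be illustrated with a minimal, self-contained sketch of elitist selection over a weighted multi-property fitness. This is not the code in `models/GA.py`: the individuals are plain lists of floats standing in for whatever encoding the pipeline uses, and `toy_properties`, the weights, the crossover and mutation operators, and all default parameters are hypothetical choices made only for illustration.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def weighted_fitness(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Collapse several property scores into one value (assumed: a weighted sum)."""
    return sum(w * s for w, s in zip(weights, scores))


def toy_properties(ind: Vector) -> List[float]:
    """Two made-up scores standing in for QSAR predictions (hypothetical)."""
    closeness = -sum((g - 1.0) ** 2 for g in ind)
    smallness = -sum(abs(g) for g in ind)
    return [closeness, smallness]


def toy_fitness(ind: Vector) -> float:
    return weighted_fitness(toy_properties(ind), weights=[0.8, 0.2])


def evolve(
    population: List[Vector],
    fitness: Callable[[Vector], float],
    n_elite: int = 2,
    mutation_std: float = 0.1,
    generations: int = 30,
    seed: int = 0,
) -> List[Vector]:
    """Simple GA with elitism: the top n_elite individuals survive unchanged.

    Assumes at least two individuals, each with at least two genes.
    Returns the final population sorted from best to worst.
    """
    rng = random.Random(seed)
    pop = [list(ind) for ind in population]  # work on copies
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elites = [list(ind) for ind in pop[:n_elite]]
        parents = pop[: max(2, len(pop) // 2)]  # truncation selection
        children: List[Vector] = []
        while len(elites) + len(children) < len(pop):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(a))  # one-point crossover
            child = a[:cut] + b[cut:]
            # Gaussian mutation applied to every gene
            children.append([g + rng.gauss(0.0, mutation_std) for g in child])
        pop = elites + children
    pop.sort(key=fitness, reverse=True)
    return pop


# Usage: evolve a random starting population and inspect the best individual.
_rng = random.Random(1)
start_population = [[_rng.uniform(-3, 3) for _ in range(4)] for _ in range(20)]
final_population = evolve(start_population, toy_fitness)
best = final_population[0]
```

Because the elites are copied into the next generation untouched, the best fitness in the population can never decrease from one generation to the next, which is the property that motivates elitism.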
GENOVA-pytorch/
│
├── Dataset/ # All datasets used in this study
│ ├── 500k_small_molecule # Subset of ChEMBL 33 bioactive small molecules for autoencoder training (500k datapoints; not included in this folder, available upon request)
│ ├── 100k_small_molecule # Subset of ChEMBL 33 bioactive small molecules for WGAN-GP pre-training (100k datapoints)
│ ├── BACE1 # BACE1 inhibitors (7,096 datapoints after preprocessing)
│ ├── pIC50 # pIC50 values of BACE1 inhibitors (7,096 datapoints after preprocessing)
│ ├── logBB # FDA-approved CNS drugs with logBB values (1,021 datapoints)
│ ├── Bioavailability # Compound bioavailability values from multiple sources (2,405 datapoints)
│ ├── specificity # BACE1 vs. BACE2 specificity scores calculated with AutoDock Vina (10,619 datapoints)
│ ├── SAS # Synthetic accessibility scores (100k datapoints)
│ └── BACE1_clinical_trial # 11 BACE1 clinical trial candidate drugs
│
├── Runfiles/ # Run scripts for each task in the pipeline
│ ├── run_AE_selfies.py # Train the SELFIES autoencoder
│ ├── run_AE_smiles.py # Train the SMILES autoencoder
│ ├── run_QSAR.py # Train QSAR neural networks
│ ├── run_WGANGP.py # Pre-train WGAN-GP on 100k small molecules / BACE1 dataset
│ ├── run_WGANGP_TL.py # Fine-tune WGAN-GP on BACE1 dataset (Transfer Learning)
│ └── run_GA.py # Run Genetic Algorithm optimization
│
├── models/ # Model architecture definitions
│ ├── AE.py # Encoder-Decoder with bidirectional LSTM layers
│ ├── QSAR.py # Feed-forward QSAR neural networks (pIC50, logBB, bioavailability, SAS, specificity)
│ ├── WGANGP.py # Wasserstein GAN with Gradient Penalty (generator + critic)
│ ├── WGANGP_TL.py # Wasserstein GAN with Gradient Penalty (generator + critic) + transfer learning
│ └── GA.py # Genetic Algorithm with elitist selection
│
├── Utility.py # Shared utility functions
├── config.py # Configuration file (hyperparameters, file paths, thresholds, etc.)
├── requirements.txt # Python dependencies
└── README.md
Requirements:
- Python 3.8+
- CUDA-compatible GPU (trained and tested on NVIDIA V100)
- AutoDock Vina (for molecular docking)
- RDKit (for cheminformatics utilities)
- Open Babel (for molecular format conversion)
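
Before running the pipeline, it can help to confirm that the prerequisites listed above are visible from the current environment. The sketch below uses only the standard library. The executable names `vina` and `obabel` and the module names `rdkit` and `torch` are assumptions about a typical installation, not something this README specifies, so adjust them to match your setup. It does not check for a GPU.

```python
import importlib.util
import shutil
import sys


def check_environment(
    min_python=(3, 8),
    executables=("vina", "obabel"),  # assumed command names on PATH
    modules=("rdkit", "torch"),      # assumed importable package names
):
    """Return a dict mapping each prerequisite to True (found) or False (missing)."""
    report = {"python": sys.version_info >= min_python}
    for exe in executables:
        # shutil.which returns None when the command is not on PATH
        report[exe] = shutil.which(exe) is not None
    for mod in modules:
        # find_spec returns None when a top-level module cannot be located
        report[mod] = importlib.util.find_spec(mod) is not None
    return report


# Usage: print a short status line per prerequisite.
for name, found in check_environment().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

A missing entry only means the tool was not found under the assumed name; AutoDock Vina in particular may be installed under a different binary name or path.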
This project is licensed under the MIT License. See LICENSE for details.