Generative VAE for De Novo Molecular Design

This repository contains a PyTorch implementation of a Character-level Variational Autoencoder (VAE) designed to compress, reconstruct, and generate novel chemical structures using SMILES representations.

Note: This repository serves as a development pipeline and architectural proof-of-concept. The codebase is fully engineered for distributed training on high-compute clusters using large-scale molecular databases (e.g., ZINC15, ChEMBL). The included model weights are from a minimal dummy-data training run intended solely to verify pipeline execution, not for actual molecular generation.

Architecture

┌────────────────────────────────────────────────────────────┐
│              Character-Level SMILES VAE                    │
│                                                            │
│  Input: SMILES String (e.g., "CCO", "c1ccccc1")            │
│      │                                                     │
│      ▼                                                     │
│  ┌──────────────────────────────────────┐                  │
│  │  Encoder                             │                  │
│  │  Embedding(vocab, 128)               │                  │
│  │  GRU(128, 256, batch_first=True)     │                  │
│  │  → μ = Linear(256, 64)               │                  │
│  │  → σ² = Linear(256, 64)              │                  │
│  └──────────────┬───────────────────────┘                  │
│                 │                                          │
│       Reparameterization Trick                             │
│       z = μ + ε·σ,   ε ~ N(0,1)                            │
│                 │                                          │
│                 ▼                                          │
│  ┌──────────────────────────────────────┐                  │
│  │  Decoder                             │                  │
│  │  Linear(64, 256) → hidden state      │                  │
│  │  GRU(128, 256, batch_first=True)     │                  │
│  │  Linear(256, vocab) → logits         │                  │
│  └──────────────────────────────────────┘                  │
│                 │                                          │
│                 ▼                                          │
│  Reconstructed / Novel SMILES String                       │
│  → RDKit Validation (Validity, Uniqueness, Novelty)        │
└────────────────────────────────────────────────────────────┘

Architecture Highlights

Encoder: GRU-based sequence encoder mapping SMILES sequences to a continuous Gaussian latent space.
Latent Space: Reparameterization trick for continuous sampling and interpolation.
Decoder: GRU-based autoregressive decoder reconstructing SMILES from latent vectors.
Evaluation: Integrated RDKit pipeline for validating molecular feasibility (Validity, Uniqueness, Novelty metrics).
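
The components above can be sketched as a single PyTorch module. This is a minimal illustration built from the layer sizes in the diagram, not the contents of vae.py. The class and attribute names are hypothetical, and two details are assumptions: the second encoder head is treated as a log-variance (so the variance stays positive), and the decoder reuses the encoder's embedding table.

```python
import torch
import torch.nn as nn


class SmilesVAE(nn.Module):
    """Character-level SMILES VAE sketch using the dimensions from the diagram."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, latent_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        # Assumption: this head predicts log(sigma^2), not sigma^2 directly.
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.latent_to_hidden = nn.Linear(latent_dim, hidden_dim)
        # Assumption: the decoder shares the embedding table with the encoder.
        self.decoder_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_logits = nn.Linear(hidden_dim, vocab_size)

    def encode(self, tokens):
        # tokens: (batch, seq_len) integer ids
        _, h_n = self.encoder_gru(self.embedding(tokens))  # h_n: (1, batch, hidden)
        h = h_n[-1]
        return self.to_mu(h), self.to_logvar(h)

    @staticmethod
    def reparameterize(mu, logvar):
        # z = mu + eps * sigma with eps drawn from a standard normal
        std = torch.exp(0.5 * logvar)
        return mu + torch.randn_like(std) * std

    def decode(self, z, decoder_inputs):
        # The latent vector initialises the decoder GRU's hidden state.
        h0 = self.latent_to_hidden(z).unsqueeze(0)  # (1, batch, hidden)
        out, _ = self.decoder_gru(self.embedding(decoder_inputs), h0)
        return self.to_logits(out)  # (batch, seq_len, vocab)

    def forward(self, tokens):
        mu, logvar = self.encode(tokens)
        z = self.reparameterize(mu, logvar)
        # Teacher forcing: feed tokens[:-1], predict tokens[1:].
        logits = self.decode(z, tokens[:, :-1])
        return logits, mu, logvar
```

With a batch of shape (batch, seq_len), `forward` returns logits of shape (batch, seq_len - 1, vocab), to be compared against `tokens[:, 1:]` during training.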

Training Configuration

Parameter	Value
Embedding dimension	128
Hidden size (GRU)	256
Latent dimension	64
Loss	Reconstruction (CE) + KL Divergence
Optimizer	Adam
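
To make the loss row concrete, here is a dependency-free sketch of the two terms for a single sequence. It assumes the encoder outputs a mean and a log-variance per latent dimension and that the two terms are summed without weighting. Whether train.py sums or averages over characters and batch items is not stated here, so treat the reduction as an assumption. The function names are hypothetical.

```python
import math


def kl_divergence(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one latent vector."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits) at one position."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def vae_loss(step_logits, targets, mu, logvar):
    """Per-character cross-entropy summed over the sequence, plus the KL term."""
    recon = sum(cross_entropy(l, t) for l, t in zip(step_logits, targets))
    return recon + kl_divergence(mu, logvar)
```

The KL term is zero exactly when the encoder outputs mu = 0 and logvar = 0, which is what pulls the latent codes toward the standard normal prior used at sampling time.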

Training Loss Curve

Training loss (Reconstruction + KL Divergence) showing convergence of the VAE on the SMILES character-level objective.

Sample Generated Molecules

When the model is trained on sufficient molecular data, sampling the latent space is expected to yield chemically valid SMILES strings that pass RDKit validation. Examples of the target output:

CC1=CC=C(C=C1)NC(=O)C2=CC=CC=C2
CN1C=NC2=C1C(=O)N(C(=O)N2C)C
CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
COC1=C(C=CC(=C1)C=O)O
O=C(C)Oc1ccccc1C(=O)O

These strings correspond to known compounds rather than analogs: the second, third, and fifth are caffeine, ibuprofen, and aspirin, and the fourth is vanillin. They illustrate the drug-like region of chemical space the VAE is meant to explore, not novel structures.
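
The Validity, Uniqueness, and Novelty metrics can be computed along the following lines. This is a sketch under stated assumptions, not the repository's evaluation code: validity is the fraction of generated strings RDKit can parse, uniqueness is the fraction of valid strings that are distinct after canonicalization, and novelty is the fraction of distinct valid strings absent from the training set. The function names are hypothetical. The RDKit calls used are `Chem.MolFromSmiles` and `Chem.MolToSmiles`.

```python
def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, or None if the string does not parse."""
    from rdkit import Chem  # imported lazily so evaluate() has no hard dependency

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def evaluate(generated, training_smiles, canonicalize=canonical_smiles):
    """Compute validity, uniqueness, and novelty as fractions in [0, 1]."""
    canon = [canonicalize(s) for s in generated]
    valid = [c for c in canon if c is not None]
    unique = set(valid)
    # Canonicalize the training set too, so equivalent spellings compare equal.
    known = {c for c in map(canonicalize, training_smiles) if c is not None}
    novel = unique - known
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```

Canonicalizing before comparison matters because one molecule has many SMILES spellings. Kekulé and aromatic forms of the same ring, for example, should count as one structure.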

Usage Pipeline

1. Training (Requires High-Compute)

To train the model on a real dataset, replace the placeholder data loader in train.py with a large SMILES dataset (e.g., ZINC subset) and run:

python train.py
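
When swapping in a real dataset, the character-level encoding step might look like the sketch below. It assumes a text file with one SMILES per line (optionally followed by whitespace-separated columns, which are ignored). The special tokens, function names, and file layout are illustrative and are not taken from dataset.py.

```python
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def build_vocab(smiles_list):
    """Map every character seen in the data, plus special tokens, to an integer id."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {tok: i for i, tok in enumerate([PAD, SOS, EOS] + chars)}


def encode(smiles, vocab, max_len):
    """Encode one SMILES as <sos> chars <eos>, right-padded to max_len ids."""
    ids = [vocab[SOS]] + [vocab[ch] for ch in smiles] + [vocab[EOS]]
    if len(ids) > max_len:
        raise ValueError(f"{smiles!r} needs {len(ids)} tokens, max_len is {max_len}")
    return ids + [vocab[PAD]] * (max_len - len(ids))


def load_smiles(path):
    """Read the first whitespace-separated field of each non-blank line."""
    with open(path, encoding="utf-8") as fh:
        return [line.split()[0] for line in fh if line.strip()]
```

Because the tokenization is per character, two-letter atoms such as Cl and Br are split into two tokens. The model has to learn to emit them as a pair.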

2. Generation & Sampling

To sample the latent space and generate SMILES strings:

python sample.py
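
Conceptually, sampling draws a latent vector from the standard normal prior and decodes it one character at a time. The sketch below shows that loop with greedy (argmax) decoding, kept independent of any framework. `step` is a hypothetical callable standing in for one decoder GRU step plus the output projection, and `state` stands in for the hidden state derived from the latent vector. sample.py may use a different strategy, such as sampling from the softmax.

```python
import random


def sample_latent(latent_dim=64, rng=random):
    """Draw z from a standard normal, one value per latent dimension."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]


def greedy_decode(state, step, sos_id, eos_id, id_to_char, max_len=120):
    """Generate characters until <eos> is predicted or max_len steps have run.

    `step(token_id, state)` must return `(logits, new_state)`.
    """
    token, chars = sos_id, []
    for _ in range(max_len):
        logits, state = step(token, state)
        token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if token == eos_id:
            break
        chars.append(id_to_char[token])
    return "".join(chars)
```

The `max_len` cap guards against a decoder that never emits the end token, which is common with undertrained weights.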

Requirements

torch>=2.0.0
rdkit>=2023.3.0
numpy>=1.24.0
matplotlib>=3.7.0

License

This project is licensed under the MIT License.
