baysquire/SMILES-VAE-Generative-Design

Generative VAE for De Novo Molecular Design

This repository contains a PyTorch implementation of a character-level Variational Autoencoder (VAE) that encodes SMILES strings into a continuous latent space, reconstructs them, and generates novel chemical structures.

Note: This repository serves as a development pipeline and architectural proof-of-concept. The codebase is fully engineered for distributed training on high-compute clusters using large-scale molecular databases (e.g., ZINC15, ChEMBL). The included model weights are from a minimal dummy-data training run intended solely to verify pipeline execution, not for actual molecular generation.

Architecture

┌──────────────────────────────────────────────────────────┐
│                Character-Level SMILES VAE                │
│                                                          │
│  Input: SMILES String (e.g., "CCO", "c1ccccc1")          │
│      │                                                   │
│      ▼                                                   │
│  ┌──────────────────────────────────────┐                │
│  │  Encoder                             │                │
│  │  Embedding(vocab, 128)               │                │
│  │  GRU(128, 256, batch_first=True)     │                │
│  │  → μ = Linear(256, 64)               │                │
│  │  → σ² = Linear(256, 64)              │                │
│  └──────────────┬───────────────────────┘                │
│                 │                                        │
│       Reparameterization Trick                           │
│       z = μ + ε·σ,   ε ~ N(0,1)                          │
│                 │                                        │
│                 ▼                                        │
│  ┌──────────────────────────────────────┐                │
│  │  Decoder                             │                │
│  │  Linear(64, 256) → hidden state      │                │
│  │  GRU(128, 256, batch_first=True)     │                │
│  │  Linear(256, vocab) → logits         │                │
│  └──────────────────────────────────────┘                │
│                 │                                        │
│                 ▼                                        │
│  Reconstructed / Novel SMILES String                     │
│  → RDKit Validation (Validity, Uniqueness, Novelty)      │
└──────────────────────────────────────────────────────────┘
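
The input stage of the diagram implies that each SMILES string is turned into a sequence of integer indices before it reaches the embedding layer. The following is a minimal sketch of character-level tokenization. The helper names (`build_vocab`, `encode`, `decode`) and the special tokens `<pad>`, `<sos>`, `<eos>` are assumptions made for illustration and are not taken from the repository.

```python
# Hypothetical character-level tokenizer; names and special tokens are assumptions.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def build_vocab(smiles_list):
    """Collect every distinct character and assign it an integer index."""
    chars = sorted({ch for s in smiles_list for ch in s})
    itos = [PAD, SOS, EOS] + chars               # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> index
    return stoi, itos


def encode(smiles, stoi, max_len):
    """Wrap the string in <sos>/<eos> and right-pad it to max_len."""
    ids = [stoi[SOS]] + [stoi[ch] for ch in smiles] + [stoi[EOS]]
    if len(ids) > max_len:
        raise ValueError("SMILES string is longer than max_len allows")
    return ids + [stoi[PAD]] * (max_len - len(ids))


def decode(ids, itos):
    """Map indices back to characters, stopping at the first <eos>."""
    chars = []
    for i in ids:
        tok = itos[i]
        if tok == EOS:
            break
        if tok not in (PAD, SOS):
            chars.append(tok)
    return "".join(chars)


stoi, itos = build_vocab(["CCO", "c1ccccc1"])
ids = encode("CCO", stoi, max_len=8)
```

Note that a purely character-level split treats two-letter elements such as `Cl` and `Br` as two separate symbols, which the model then has to learn to emit together.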

Architecture Highlights

  • Encoder: GRU-based sequence encoder mapping SMILES sequences to a continuous Gaussian latent space.
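  • Latent Space: 64-dimensional Gaussian latent vector, sampled with the reparameterization trick so that gradients flow through the sampling step; supports continuous sampling and interpolation.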
  • Decoder: GRU-based autoregressive decoder reconstructing SMILES from latent vectors.
  • Evaluation: Integrated RDKit pipeline for validating molecular feasibility (Validity, Uniqueness, Novelty metrics).
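
The components listed above can be sketched as a single PyTorch module using the layer sizes from the diagram. This is an illustrative sketch, not the repository's code: the class name `SmilesVAE`, the embedding shared between encoder and decoder, the choice to predict log-variance rather than variance, and the teacher-forced decoder input are all assumptions.

```python
import torch
import torch.nn as nn


class SmilesVAE(nn.Module):
    """Hypothetical GRU-based VAE matching the layer sizes in the diagram."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256,
                 latent_dim=64, pad_idx=0):
        super().__init__()
        # Assumption: one embedding table shared by encoder and decoder.
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.encoder_gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        # Assumption: the second head predicts log(sigma^2), which keeps the
        # variance positive without an explicit constraint.
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.latent_to_hidden = nn.Linear(latent_dim, hidden_dim)
        self.decoder_gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, vocab_size)

    def encode(self, tokens):
        # h_n has shape (num_layers=1, batch, hidden); drop the layer axis.
        _, h_n = self.encoder_gru(self.embedding(tokens))
        h = h_n.squeeze(0)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        # z = mu + eps * sigma with eps ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        return mu + torch.randn_like(std) * std

    def decode(self, z, tokens_in):
        # The latent vector initialises the decoder's hidden state.
        h0 = self.latent_to_hidden(z).unsqueeze(0)
        out, _ = self.decoder_gru(self.embedding(tokens_in), h0)
        return self.fc_out(out)  # (batch, seq_len, vocab) logits

    def forward(self, tokens):
        mu, logvar = self.encode(tokens)
        z = self.reparameterize(mu, logvar)
        # Teacher forcing: feed tokens[:-1], predict tokens[1:].
        logits = self.decode(z, tokens[:, :-1])
        return logits, mu, logvar


model = SmilesVAE(vocab_size=7)
tokens = torch.randint(0, 7, (4, 10))   # dummy batch of 4 sequences
logits, mu, logvar = model(tokens)
```

For simplicity the encoder here reads padded positions as well; a fuller version might pack the sequences so the final hidden state reflects only real tokens.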

Training Configuration

| Parameter | Value |
| --- | --- |
| Embedding dimension | 128 |
| Hidden size (GRU) | 256 |
| Latent dimension | 64 |
| Loss | Reconstruction (CE) + KL Divergence |
| Optimizer | Adam |
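
For reference, the loss named in the table can be written out as follows. This assumes a diagonal Gaussian posterior and a standard normal prior, which is the usual VAE setup; the repository may weight or anneal the KL term differently.

```latex
\mathcal{L}(x) =
  \underbrace{-\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}, z\right)}_{\text{reconstruction (cross-entropy)}}
  \;+\;
  \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\middle\|\,\mathcal{N}(0, I)\right)}_{\text{KL divergence}},
\qquad
D_{\mathrm{KL}} = -\frac{1}{2} \sum_{j=1}^{64} \left(1 + \log \sigma_j^{2} - \mu_j^{2} - \sigma_j^{2}\right)
```

Here $x_t$ is the $t$-th character of the SMILES string, $z = \mu + \varepsilon \cdot \sigma$ is the sampled latent vector, and the sum over $j$ runs over the 64 latent dimensions.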

Training Loss Curve

[Figure: VAE Training Loss]

Training loss (Reconstruction + KL Divergence) showing convergence of the VAE on the SMILES character-level objective.

Sample Generated Molecules

When the model is trained on sufficient molecular data, sampling from the latent space yields SMILES strings that pass RDKit validation, for example:

CC1=CC=C(C=C1)NC(=O)C2=CC=CC=C2
CN1C=NC2=C1C(=O)N(C(=O)N2C)C
CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
COC1=C(C=CC(=C1)C=O)O
O=C(C)Oc1ccccc1C(=O)O

These molecules include known drugs (caffeine, ibuprofen, and aspirin) as well as vanillin and N-(4-methylphenyl)benzamide, illustrating the chemically meaningful, drug-like regions of the latent space that the VAE is meant to explore.
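
The Validity, Uniqueness, and Novelty metrics mentioned above could be computed along the following lines. This is a sketch under one common convention (validity over all samples, uniqueness over valid samples, novelty over unique valid samples); the repository's exact definitions are not stated, and the function names `rdkit_canonical` and `generation_metrics` are hypothetical.

```python
def rdkit_canonical(smiles):
    """Return RDKit's canonical SMILES, or None if the string does not parse."""
    # Imported lazily so the metric logic below can be reused without RDKit.
    from rdkit import Chem
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def generation_metrics(generated, training_set, canonicalize=rdkit_canonical):
    """Compute validity, uniqueness, and novelty as fractions in [0, 1]."""
    canon = [canonicalize(s) for s in generated]
    valid = [c for c in canon if c is not None]
    unique = set(valid)
    # Canonicalize the training set too, so equivalent spellings match.
    known = {c for c in map(canonicalize, training_set) if c is not None}
    novel = unique - known
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```

With RDKit installed, `generation_metrics(samples, training_smiles)` uses the default canonicalizer. A generated string that matches a training-set molecule counts toward validity but not toward novelty.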

Usage Pipeline

1. Training (Requires High-Compute)

To train the model on a real dataset, replace the placeholder data loader in train.py with one that reads a large SMILES dataset (e.g., a ZINC subset) and run:

python train.py
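
A replacement for the placeholder loader might look like the sketch below. Everything here is an assumption for illustration: the class name `SmilesDataset`, the special-token indices, the `max_len` default, and the one-SMILES-per-line file layout are not taken from train.py.

```python
import torch
from torch.utils.data import DataLoader, Dataset

PAD_IDX, SOS_IDX, EOS_IDX = 0, 1, 2  # assumed special-token indices


class SmilesDataset(Dataset):
    """Hypothetical dataset that yields padded index tensors for SMILES strings."""

    def __init__(self, smiles, stoi, max_len=120):
        # Keep only strings that fit (including <sos>/<eos>) and whose
        # characters are all present in the vocabulary.
        self.smiles = [s for s in smiles
                       if len(s) + 2 <= max_len and all(ch in stoi for ch in s)]
        self.stoi = stoi
        self.max_len = max_len

    @classmethod
    def from_file(cls, path, stoi, max_len=120):
        # Assumes one molecule per line with the SMILES in the first column;
        # files with a header row would need that row skipped.
        with open(path) as fh:
            smiles = [line.split()[0] for line in fh if line.strip()]
        return cls(smiles, stoi, max_len)

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        ids = [SOS_IDX] + [self.stoi[ch] for ch in self.smiles[idx]] + [EOS_IDX]
        ids += [PAD_IDX] * (self.max_len - len(ids))
        return torch.tensor(ids, dtype=torch.long)


# Tiny in-memory example; a real run would call SmilesDataset.from_file(...).
example = ["CCO", "c1ccccc1", "CC(=O)O"]
stoi = {ch: i + 3 for i, ch in enumerate(sorted({c for s in example for c in s}))}
dataset = SmilesDataset(example, stoi, max_len=12)
loader = DataLoader(dataset, batch_size=2, shuffle=True)
```

Each batch from `loader` is a `(batch, max_len)` tensor of token indices that can be passed directly to an embedding layer.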

2. Generation & Sampling

To sample the latent space and generate SMILES strings:

python sample.py
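
Conceptually, sampling draws a latent vector from the standard normal prior, maps it to the decoder's initial hidden state, and emits one character at a time until an end token appears. The sketch below shows that loop with freshly initialised (untrained) layers; the function name `sample_tokens`, the token indices, and the temperature parameter are assumptions, not the interface of sample.py.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def sample_tokens(embedding, latent_to_hidden, gru, fc_out, n,
                  latent_dim=64, sos_idx=1, eos_idx=2,
                  max_len=100, temperature=1.0):
    """Autoregressively sample n token sequences from z ~ N(0, I)."""
    z = torch.randn(n, latent_dim)
    h = latent_to_hidden(z).unsqueeze(0)           # (1, n, hidden)
    tok = torch.full((n, 1), sos_idx, dtype=torch.long)
    finished = torch.zeros(n, dtype=torch.bool)
    steps = []
    for _ in range(max_len):
        out, h = gru(embedding(tok), h)
        probs = torch.softmax(fc_out(out[:, -1]) / temperature, dim=-1)
        tok = torch.multinomial(probs, 1)          # (n, 1) sampled indices
        tok[finished] = eos_idx                    # finished rows keep emitting <eos>
        steps.append(tok)
        finished |= tok.squeeze(1) == eos_idx
        if finished.all():
            break
    return torch.cat(steps, dim=1)                 # (n, generated_length)


# Untrained layers with the sizes from the architecture diagram.
vocab_size = 7
embedding = nn.Embedding(vocab_size, 128)
latent_to_hidden = nn.Linear(64, 256)
gru = nn.GRU(128, 256, batch_first=True)
fc_out = nn.Linear(256, vocab_size)
tokens = sample_tokens(embedding, latent_to_hidden, gru, fc_out, n=3, max_len=5)
```

The resulting index sequences would then be mapped back to characters and passed through RDKit to filter out invalid strings.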

Requirements

  • torch>=2.0.0
  • rdkit>=2023.3.0
  • numpy>=1.24.0
  • matplotlib>=3.7.0

License

This project is licensed under the MIT License.
