This repository contains a PyTorch implementation of a Character-level Variational Autoencoder (VAE) designed to compress, reconstruct, and generate novel chemical structures using SMILES representations.
Note: This repository serves as a development pipeline and architectural proof-of-concept. The codebase is fully engineered for distributed training on high-compute clusters using large-scale molecular databases (e.g., ZINC15, ChEMBL). The included model weights are from a minimal dummy-data training run intended solely to verify pipeline execution, not for actual molecular generation.
┌──────────────────────────────────────────────────────────┐
│ Character-Level SMILES VAE │
│ │
│ Input: SMILES String (e.g., "CCO", "c1ccccc1") │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ Encoder │ │
│ │ Embedding(vocab, 128) │ │
│ │ GRU(128, 256, batch_first=True) │ │
│ │ → μ = Linear(256, 64) │ │
│ │ → σ² = Linear(256, 64) │ │
│ └──────────────┬───────────────────────┘ │
│ │ │
│ Reparameterization Trick │
│ z = μ + ε·σ, ε ~ N(0,1) │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ Decoder │ │
│ │ Linear(64, 256) → hidden state │ │
│ │ GRU(128, 256, batch_first=True) │ │
│ │ Linear(256, vocab) → logits │ │
│ └──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Reconstructed / Novel SMILES String │
│ → RDKit Validation (Validity, Uniqueness, Novelty) │
└──────────────────────────────────────────────────────────┘
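The diagram above can be read as the following PyTorch module. This is a minimal sketch built from the layer sizes shown, not the repository's actual code: the class name `SmilesVAE`, the shared embedding, the use of a log-variance head, and the teacher-forcing split are all assumptions made for illustration, and padding is not handled.

```python
import torch
import torch.nn as nn


class SmilesVAE(nn.Module):
    """Illustrative character-level VAE using the layer sizes from the diagram."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, latent_dim=64):
        super().__init__()
        # Assumption: encoder and decoder share one embedding table.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        # Assumption: the second head predicts log(sigma^2), a common choice
        # that keeps the variance positive without an explicit constraint.
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.latent_to_hidden = nn.Linear(latent_dim, hidden_dim)
        self.decoder_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_logits = nn.Linear(hidden_dim, vocab_size)

    def encode(self, tokens):
        # h_n has shape (num_layers=1, batch, hidden); drop the layer axis.
        _, h_n = self.encoder_gru(self.embed(tokens))
        h = h_n.squeeze(0)
        return self.to_mu(h), self.to_logvar(h)

    def reparameterize(self, mu, logvar):
        # z = mu + eps * sigma, with eps ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        return mu + torch.randn_like(std) * std

    def decode(self, z, decoder_inputs):
        # The latent vector initialises the decoder GRU's hidden state.
        h0 = self.latent_to_hidden(z).unsqueeze(0)
        out, _ = self.decoder_gru(self.embed(decoder_inputs), h0)
        return self.to_logits(out)

    def forward(self, tokens):
        mu, logvar = self.encode(tokens)
        z = self.reparameterize(mu, logvar)
        # Teacher forcing: feed tokens[:, :-1]; the targets are tokens[:, 1:].
        logits = self.decode(z, tokens[:, :-1])
        return logits, mu, logvar
```

For a batch of token ids with shape `(batch, seq_len)`, `forward` returns logits of shape `(batch, seq_len - 1, vocab_size)` together with `mu` and `logvar` of shape `(batch, 64)`.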
- Encoder: GRU-based sequence encoder mapping SMILES sequences to a continuous Gaussian latent space.
- Latent Space: Reparameterization trick for continuous sampling and interpolation.
- Decoder: GRU-based autoregressive decoder reconstructing SMILES from latent vectors.
- Evaluation: Integrated RDKit pipeline for validating molecular feasibility (Validity, Uniqueness, Novelty metrics).
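The three evaluation metrics can be sketched as below. The definitions used here follow a common convention (validity over all generated strings, uniqueness over valid ones, novelty over unique ones) and may differ from the repository's own; the function names `generation_metrics` and `rdkit_canonical` are hypothetical. The canonicalizer is passed in as an argument so that the counting logic can be exercised without RDKit installed.

```python
def generation_metrics(generated, training_set, canonicalize):
    """Return validity, uniqueness and novelty as fractions in [0, 1].

    `canonicalize` maps a SMILES string to a canonical form,
    or returns None when the string is not a valid molecule.
    """
    canonical = [canonicalize(s) for s in generated]
    valid = [c for c in canonical if c is not None]
    unique = set(valid)
    known = {c for c in (canonicalize(s) for s in training_set) if c is not None}
    novel = unique - known
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def rdkit_canonical(smiles):
    """Canonicalize with RDKit; returns None when the string does not parse."""
    from rdkit import Chem  # imported lazily so the counting logic has no hard dependency

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None
```

With RDKit available, a call would look like `generation_metrics(samples, train_smiles, rdkit_canonical)`, where `samples` and `train_smiles` are lists of SMILES strings.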
| Parameter | Value |
|---|---|
| Embedding dimension | 128 |
| Hidden size (GRU) | 256 |
| Latent dimension | 64 |
| Loss | Reconstruction (CE) + KL Divergence |
| Optimizer | Adam |
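The loss row in the table can be made concrete with a short sketch. It assumes the encoder outputs a log-variance, that padding tokens use index `pad_idx`, and that both terms are averaged per sequence; `kl_weight` is a hypothetical knob for KL annealing that the table does not mention.

```python
import torch
import torch.nn.functional as F


def vae_loss(logits, targets, mu, logvar, pad_idx=0, kl_weight=1.0):
    """Cross-entropy reconstruction loss plus KL divergence to N(0, I)."""
    batch = targets.size(0)
    # Sum token-level cross-entropy over non-padding positions, then average per sequence.
    recon = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_idx,
        reduction="sum",
    ) / batch
    # Closed-form KL between N(mu, sigma^2) and N(0, I) for a diagonal Gaussian.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch
    return recon + kl_weight * kl, recon, kl
```

Returning the two terms separately makes it easy to plot reconstruction and KL curves side by side during training.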
Training loss (Reconstruction + KL Divergence) showing convergence of the VAE on the SMILES character-level objective.
When trained on sufficient molecular data, sampling from the latent space is expected to produce SMILES strings that RDKit parses as valid molecules, for example:
CC1=CC=C(C=C1)NC(=O)C2=CC=CC=C2
CN1C=NC2=C1C(=O)N(C(=O)N2C)C
CC(C)CC1=CC=C(C=C1)C(C)C(=O)O
COC1=C(C=CC(=C1)C=O)O
O=C(C)Oc1ccccc1C(=O)O
These strings encode known compounds rather than analogs of them: the second is caffeine, the third is ibuprofen, the fourth is vanillin, and the fifth is aspirin. They illustrate the drug-like, chemically meaningful regions that a well-trained latent space is expected to cover.
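Sampling from the latent space can be sketched as drawing `z` from the standard normal prior and decoding greedily, one character at a time. The function `sample_tokens`, its argument names, and the start/end token indices are assumptions for illustration; the layers created below are untrained, so the ids they produce are meaningless and only demonstrate the control flow.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def sample_tokens(embed, gru, latent_to_hidden, to_logits, z, sos_idx, eos_idx, max_len=100):
    """Greedily decode token ids from latent vectors z of shape (batch, latent_dim)."""
    batch = z.size(0)
    hidden = latent_to_hidden(z).unsqueeze(0)  # (1, batch, hidden)
    token = torch.full((batch, 1), sos_idx, dtype=torch.long, device=z.device)
    finished = [False] * batch
    sequences = [[] for _ in range(batch)]
    for _ in range(max_len):
        out, hidden = gru(embed(token), hidden)
        # Pick the most likely next character for every sequence in the batch.
        token = to_logits(out[:, -1]).argmax(dim=-1, keepdim=True)
        for i, t in enumerate(token.squeeze(1).tolist()):
            if finished[i]:
                continue
            if t == eos_idx:
                finished[i] = True
            else:
                sequences[i].append(t)
        if all(finished):
            break
    return sequences


# Hypothetical, untrained layers with the sizes from the architecture diagram.
vocab_size, latent_dim = 30, 64
embed = nn.Embedding(vocab_size, 128)
gru = nn.GRU(128, 256, batch_first=True)
latent_to_hidden = nn.Linear(latent_dim, 256)
to_logits = nn.Linear(256, vocab_size)

z = torch.randn(5, latent_dim)  # draw from the prior N(0, I)
ids = sample_tokens(embed, gru, latent_to_hidden, to_logits, z,
                    sos_idx=1, eos_idx=2, max_len=20)
```

Mapping the returned ids back to characters and passing the resulting strings through RDKit would complete the generation loop; replacing `argmax` with multinomial sampling is a common way to increase diversity.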
To train the model on a real dataset, replace the placeholder data loader in `train.py` with a large SMILES dataset (e.g., ZINC subset) and run:

    python train.py

To sample the latent space and generate SMILES strings:

    python sample.py

Requirements:

    torch>=2.0.0
    rdkit>=2023.3.0
    numpy>=1.24.0
    matplotlib>=3.7.0

This project is licensed under the MIT License.
