Alchemia - Molecular Database — Accelerating Drug Discovery with AI

Python 3.11 · RDKit · PyTorch · CUDA · License: MIT

8.7M+ Standardized Molecules · AIMNet2 3D Minimization · admet_ai Predictions · 5 Molecular Fingerprints · Docker


Early Access

We're processing millions of molecules. While we prepare the full dataset for public release, sign up for early access — we'll notify you by email when we launch.

→ preview.alchemiadatabase.com/pt ←
Be among the first to access the Alchemia dataset.


Overview

ALCHEMIA MOLECULAR DATABASE is an open-source molecular data bank built with Spec-Driven Development for ultra-large virtual screening (ULVS) in drug discovery. It aggregates molecules from 6 public databases, standardizes them through a reproducible, GPU-accelerated Python pipeline, and delivers research-ready datasets at scale.

Phase 01 deliverables:

  • Standardized chemical structures — SMILES, InChI, InChIKey, cross-source deduplication (ALC_XXXXXXX compound keys)
  • Energy-minimized 3D conformers — AIMNet2 neural network potential (near-QM accuracy) via Auto3D, MMFF94s fallback
  • ADMET property predictions — 30+ endpoints via ADMET-AI (absorption, distribution, metabolism, excretion, toxicity)
  • Molecular fingerprints — ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet, PostgreSQL bfp compatible
  • Drug-likeness & structural filters — Lipinski, Veber, Ghose, QED, PAINS (480+), Brenk, NIH, toxicophores
  • PDBQT-ready ligands — meeko-based preparation for AutoDock/Vina workflows
  • PostgreSQL database — RDKit cartridge + pgvector + pg_trgm for similarity search at scale

All pipeline stages are chunked, checkpointed, and resumable. No stage ever loads a full SDF/MOL2/CSV into memory.
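
The chunked, resumable pattern described above can be sketched as follows. This is a minimal illustration, assuming an SDF input and a small JSON checkpoint file; the names `iter_sdf_records` and `process_in_chunks` and the checkpoint layout are hypothetical and are not taken from the Alchemia codebase.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Callable, Iterator, TextIO


def iter_sdf_records(handle: TextIO) -> Iterator[str]:
    """Yield one SDF record at a time; each record ends with a '$$$$' line."""
    lines: list[str] = []
    for line in handle:
        if line.strip() == "$$$$":
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)


def process_in_chunks(sdf_path: Path, checkpoint: Path, chunk_size: int,
                      handle_chunk: Callable[[list[str]], None]) -> int:
    """Process records chunk by chunk, resuming from the checkpoint if present."""
    done = 0
    if checkpoint.exists():
        done = json.loads(checkpoint.read_text())["records_done"]
    with sdf_path.open() as handle:
        records = iter_sdf_records(handle)
        # Skip records that a previous run already finished.
        for _ in islice(records, done):
            pass
        # Only one chunk of records is held in memory at any time.
        while chunk := list(islice(records, chunk_size)):
            handle_chunk(chunk)
            done += len(chunk)
            checkpoint.write_text(json.dumps({"records_done": done}))
    return done
```

Because the checkpoint is written only after a chunk completes, an interrupted run repeats at most one chunk, so `handle_chunk` should be safe to call twice on the same records.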

Alchemia Platform — 5 Core Deliverables


Architecture

Alchemia — The Data Factory for Drug Discovery: Phase 01 ETL + Phase 02 Cloud & Web

Phase 01 — Data Pipeline & ETL: Raw ingestion, rigorous standardization (InChI key, 22+ properties, AIMNet2), 3D minimization, and PostgreSQL load for millions of molecules from 6 databases.

Phase 02 — Cloud & Web Product: FastAPI chemical search API (Tanimoto similarity + SMARTS substructure), Next.js frontend with Ketcher / Mol* / 3Dmol.js, and scalable AWS + Cloudflare infrastructure.
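
Tanimoto similarity, which the planned search API is described as using, can be sketched in pure Python on packed bit vectors such as the binary fingerprint columns. This illustrates the metric only; the function name is hypothetical, and a production search would more likely rely on the RDKit PostgreSQL cartridge than on Python code like this.

```python
def tanimoto(fp_a: bytes, fp_b: bytes) -> float:
    """Tanimoto coefficient |A & B| / |A | B| for two equal-length bit vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = int.from_bytes(fp_a, "big")
    b = int.from_bytes(fp_b, "big")
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two all-zero fingerprints
    return bin(a & b).count("1") / union
```

For example, `tanimoto(b"\x0f", b"\x03")` compares a vector with four bits set against one with two of the same bits set, giving 2 shared bits over 4 total.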


Phase 01 Pipeline

Raw sources (SDF / MOL2 / CSV)
  → Audit & Inventory             (source manifest, file checksums)
  → Standardization + Cross-Dedup (RDKit MolStandardize, InChI keys, ALC_XXXXXXX)
  → 2D Properties + Drug-Likeness (MW, logP, TPSA, HBD/HBA, QED, Lipinski/Veber/Ghose)
  → Structural Filters            (PAINS 480+, Brenk, NIH, toxicophores — flags, not exclusions)
  → Fingerprints                  (ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet)
  → Conformer Generation          (ETKDGv3, MMFF94s geometry refinement, 3D QC)
  → AIMNet2 Energy Minimization   (GPU, near-QM accuracy, H/C/N/O/F/S/Cl)
  → 3D SDF Export                 (AIMNet2-ok first, MMFF94s fallback)
  → ADMET Predictions             (GPU, 30+ endpoints, ADMET-AI)
  → PDBQT Preparation             (meeko, AutoDock4 format)
  → Molecular Classification      (196 SMARTS classes)
  → Visualization                 (RDKit mol grid images)
  → PostgreSQL Load               (RDKit cartridge + pgvector + pg_trgm)
  → Validation & Audit            (QC log, qc_failures.parquet)
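
The drug-likeness step in the listing above can be illustrated with a short sketch. It assumes the conventional Lipinski and Veber thresholds and that one Lipinski violation is tolerated; the thresholds actually configured in the pipeline are not stated here, and the function names are hypothetical.

```python
def lipinski_violations(mw: float, logp: float, hbd: int, hba: int) -> int:
    """Count rule-of-five violations: MW > 500, logP > 5, HBD > 5, HBA > 10."""
    return sum([mw > 500, logp > 5, hbd > 5, hba > 10])


def drug_likeness_flags(mw: float, logp: float, hbd: int, hba: int,
                        tpsa: float, rotatable_bonds: int) -> dict:
    """Return pass/fail flags; compounds are flagged, never dropped."""
    return {
        # Assumption: at most one Lipinski violation still counts as a pass.
        "lipinski_pass": lipinski_violations(mw, logp, hbd, hba) <= 1,
        # Veber: rotatable bonds <= 10 and TPSA <= 140 square angstroms.
        "veber_pass": rotatable_bonds <= 10 and tpsa <= 140,
    }
```

Returning flags rather than filtering mirrors the "flags, not exclusions" approach noted for the structural filters, leaving the decision to downstream users.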

The pipeline is orchestrated with Snakemake (14 rules) and runs on Docker images optimized for NVIDIA DGX clusters (8× A100 80GB) or local GPU workstations (RTX 5070+).


Source Databases

| Database | Description | Compounds | Type | Format | Phase 01 Status |
|---|---|---|---|---|---|
| BrNPDB | Brazilian Natural Products Database: curated natural products isolated from Brazilian biodiversity | 9,215 | Natural products | MOL2 (3D available) | ✅ All stages complete + PDBQT + PostgreSQL |
| NCI | National Cancer Institute Open Chemical Repository: compounds screened for antitumor activity | 85,495 | Bioactive / screening | SDF (multiple files) | ✅ All stages complete + PDBQT + PostgreSQL |
| COCONUT | Collection of Open Natural Products: the largest open-access natural products database | 725,267 | Natural products | CSV + 2D/3D SDF | Std–Props–FP–ADMET–Class ✅ · AIMNet2 🔄 (8× A100 shards) |
| ChEMBL | EMBL-EBI database of bioactive molecules with drug-like properties, curated from medicinal chemistry literature | ~2,854,815 | Bioactive / drugs | SDF + PostgreSQL dump | Std–Props–FP–Class ✅ · ADMET 🔄 ~18.7% · AIMNet2 🔄 shards active |
| Enamine | Enamine REAL Space: make-on-demand screening compounds with synthetic feasibility guarantees | ~5,000,000+ | Synthetic / screening | CSV + SDF | Std–Props–FP–Class ✅ · ADMET 🔄 running · AIMNet2 🔄 shards active |
| Molport | Commercial compound catalog: in-stock and make-on-demand molecules from 200+ suppliers | ~7,000,000+ | Commercial catalog | CSV + SDF + SMILES | ⏳ Queued |

Standardized to date: 8.7M+ unique compounds across 5 databases, deduplicated by InChI. The full pipeline is complete for BrNPDB and NCI; COCONUT is completing AIMNet2 minimization on 8× NVIDIA A100 80GB; ADMET and AIMNet2 runs are active for ChEMBL and Enamine. Molport is queued and not yet included in this count.


Features

| Feature | Details |
|---|---|
| Multi-Source | COCONUT, NCI, BrNPDB, ChEMBL, Enamine, Molport: 6 databases, one unified schema |
| AIMNet2 Minimization | GPU-accelerated 3D energy minimization, near-QM accuracy (supports H, C, N, O, F, S, Cl) |
| ADMET Predictions | 30+ endpoints: absorption, distribution, metabolism, excretion, toxicity via ADMET-AI |
| 5 Fingerprint Types | ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion; binary Parquet + PostgreSQL bfp |
| Structural Filters | PAINS (480+), Brenk, NIH, toxicophores; warning flags, not hard exclusions |
| Drug-Likeness | Lipinski, Veber, Ghose, lead-like, fragment-like, QED |
| 196-Class SMARTS | Chemical taxonomy from utils/smarts.json |
| PDBQT Prep | meeko-based AutoDock/Vina-ready PDBQT generation (pure Python 3.11, no MGLTools) |
| Docker Pipeline | 5 specialized images: base (3.35 GB) · cpu (3.38 GB) · gpu (36.2 GB) · pdbqt (3.43 GB) · snakemake (2.52 GB) |
| Streaming I/O | Chunked, checkpointed, resumable; never OOM on multi-million-compound sources |
| Deterministic Keys | compound_key = ALC_XXXXXXX (SHA256 of InChI); stable join key across all tables |
| PostgreSQL-Ready | RDKit cartridge + pgvector (similarity search) + pg_trgm + pgcrypto |
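
The deterministic key idea can be sketched with the standard library. The README states only that the key is an ALC_-prefixed value derived from the SHA256 of the InChI; how the digest is mapped to the seven-character suffix is not specified, so the truncation below is an assumption made purely for illustration.

```python
import hashlib


def compound_key(inchi: str, width: int = 7) -> str:
    """Derive a deterministic ALC_-prefixed key from the SHA-256 of an InChI."""
    digest = hashlib.sha256(inchi.encode("utf-8")).hexdigest()
    # Assumption: keep the first `width` hex characters, upper-cased.
    return "ALC_" + digest[:width].upper()
```

The same InChI always yields the same key, which is what makes it usable as a join key across tables. Note that seven hex characters cover only about 268 million values, so plain truncation would produce collisions at multi-million-compound scale; a real scheme would need a wider suffix or explicit collision handling.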

Data Outputs

Each source produces a suite of Parquet files, SDF exports, and PDBQT files:

| File | Key Columns |
|---|---|
| {source}_unique_compounds.parquet | compound_key, canonical_smiles, inchi, inchikey, source_compound_id |
| {source}_properties.parquet | compound_key, mw, logp, tpsa, hbd, hba, qed, lipinski_pass |
| {source}_fingerprints.parquet | compound_key, ecfp4, ecfp6, maccs, rdkit_fp, atompair, torsion (binary) |
| {source}_admet.parquet | compound_key, admet_model_name, admet_model_version, predictions_json |
| {source}_complete_3d.sdf | 3D molblocks: AIMNet2-minimized first, MMFF94s fallback, _Name = compound_key |
| {source}_complete_minimized.parquet | compound_key, minimization_method, energy_kcal_mol, min_status |
| {source}_pdbqt_manifest.parquet | compound_key, pdbqt_path, sha256, num_torsions, pdbqt_status |
| {source}_classification.parquet | compound_key, matched_classes (JSON), primary_class, num_classes |
| master_compounds.parquet | Cross-source deduplicated: compound_key (ALC_*), source_name, inchi, canonical_smiles |
| cross_reference.parquet | source_compound_key, source_name, source_compound_id, master_compound_key |
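
The relationship between the per-source rows, master_compounds, and cross_reference can be sketched on plain dictionaries. This is a simplified illustration that uses a subset of the columns listed above; the function `build_master` is hypothetical, and the real pipeline presumably operates on Parquet data rather than Python lists.

```python
from typing import Iterable


def build_master(records: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Collapse per-source rows into unique master rows plus a cross-reference."""
    master: dict[str, dict] = {}
    cross_reference: list[dict] = []
    for row in records:
        key = row["compound_key"]
        # Assumption: the first source to contribute a structure defines its master row.
        master.setdefault(key, {
            "compound_key": key,
            "inchi": row["inchi"],
            "canonical_smiles": row["canonical_smiles"],
        })
        # Every source occurrence is kept, pointing back to the master key.
        cross_reference.append({
            "source_name": row["source_name"],
            "source_compound_id": row["source_compound_id"],
            "master_compound_key": key,
        })
    return list(master.values()), cross_reference
```

A compound present in two sources therefore appears once in the master list and twice in the cross-reference, so source provenance is not lost by deduplication.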

3D Structure Pipeline

Input: raw SMILES / existing 3D coordinates
  1. ETKDGv3 conformer generation (RDKit)
  2. MMFF94s geometry refinement (UFF fallback)
  3. AIMNet2 energy minimization via Auto3D
     — Supported elements: H, C, N, O, F, S, Cl
     — GPU: NVIDIA A100 80GB (8×, DGX cluster) / RTX 5070 (local)
     — Graceful skip for unsupported elements (MMFF94s result kept)
  4. Complete 3D SDF export (AIMNet2-ok first, MMFF94s fallback)
  5. PDBQT generation via meeko (AutoDock4 format)
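
The element check behind step 3 can be sketched as a simple set comparison. The element list is the one given above; the function name and the returned labels are hypothetical.

```python
from typing import Iterable

# Elements listed above as supported by the AIMNet2 minimization step.
AIMNET2_ELEMENTS = frozenset({"H", "C", "N", "O", "F", "S", "Cl"})


def choose_minimizer(element_symbols: Iterable[str]) -> str:
    """Return 'aimnet2' when every atom is supported, otherwise 'mmff94s'."""
    if set(element_symbols) <= AIMNET2_ELEMENTS:
        return "aimnet2"
    # Unsupported element present: keep the MMFF94s-refined geometry.
    return "mmff94s"
```

With RDKit, the symbols for a molecule could be collected with `atom.GetSymbol()` for each atom in `mol.GetAtoms()` and passed to this function before dispatching work to the GPU.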

Installation

Prerequisites

  • Python 3.11, conda/mamba
  • NVIDIA GPU + CUDA 12.8 (for AIMNet2 minimization and ADMET-AI)
  • Docker + NVIDIA Container Toolkit (for containerized pipeline)

Local Setup

git clone https://github.com/alchemia-db/alchemia.git
cd alchemia

# Create conda environment (RDKit, Polars, PyArrow included)
conda env create -f environment.yml
conda activate alchemia-ph1

# PyTorch with CUDA 12.8 (RTX 5070 Blackwell / A100 Ampere)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Editable install
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

Docker Pipeline

# Build all images (base → cpu → gpu → pdbqt → snakemake)
bash docker/build.sh

# CPU stage (standardization, properties, fingerprints)
docker run --rm -v "$PWD:/workspace" alchemia/cpu \
  python scripts/run_standardization.py --source BrNPDB --sample 1000

# GPU stage (AIMNet2 minimization)
docker run --rm --gpus all -v "$PWD:/workspace" alchemia/gpu \
  python scripts/run_minimization.py --source BrNPDB --device cuda

# GPU stage (ADMET predictions)
docker run --rm --gpus all -v "$PWD:/workspace" alchemia/gpu \
  python scripts/run_admet.py --source BrNPDB

# PDBQT preparation
docker run --rm -v "$PWD:/workspace" alchemia/pdbqt \
  python scripts/run_pdbqt.py --source BrNPDB

# Full Snakemake pipeline (dry-run first)
docker run --rm -v "$PWD:/workspace" alchemia/snakemake \
  snakemake --cores 8 --dry-run

Pipeline Scripts

All scripts support --sample N, --dry-run, --resume:

python scripts/audit_repository.py          # Audit & inventory
python scripts/run_standardization.py       # Standardize + cross-dedup
python scripts/run_properties.py            # 2D descriptors + drug-likeness
python scripts/run_filters.py               # PAINS / Brenk / NIH flags
python scripts/run_fingerprints.py          # ECFP4/6 / MACCS / RDKit FP / AtomPair / Torsion
python scripts/run_conformers.py            # ETKDGv3 conformer generation
python scripts/patch_auto3d.py              # Patch Auto3D np.min([]) crash
python scripts/run_minimization.py          # AIMNet2 energy minimization
python scripts/merge_minimized_sdf.py       # Complete 3D SDF export
python scripts/run_admet.py                 # ADMET-AI predictions
python scripts/run_pdbqt.py                 # PDBQT via meeko
python scripts/run_classification.py        # 196-class SMARTS taxonomy
python scripts/run_viz.py                   # Molecular image grids
python scripts/load_postgres.py             # PostgreSQL load
python scripts/orchestrate_gpus.py          # GPU watchdog — auto-dispatches tasks to idle GPUs
python scripts/dashboard.py                 # Terminal TUI — real-time pipeline progress and GPU monitoring
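
A shared command-line interface for these flags could be built with `argparse`, as in the sketch below. The flag names come from this README; the helper `build_parser`, the help texts, and the exact semantics (for example, that `--sample N` takes the first N compounds) are assumptions.

```python
import argparse


def build_parser(description: str) -> argparse.ArgumentParser:
    """Build a parser exposing the flags shared by the pipeline scripts."""
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument("--source", help="source database name, e.g. BrNPDB")
    parser.add_argument("--sample", type=int, metavar="N", default=None,
                        help="process only N compounds (assumed: the first N)")
    parser.add_argument("--dry-run", action="store_true",
                        help="report planned work without writing outputs")
    parser.add_argument("--resume", action="store_true",
                        help="continue from the last checkpoint")
    return parser
```

Calling `build_parser("Standardize").parse_args(["--source", "BrNPDB", "--sample", "1000"])` yields a namespace with `sample` set to the integer 1000 and both boolean flags left false.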

Repository Structure

Only GitHub-tracked files are shown (databases, pipeline outputs, and local agent files are gitignored):

alchemia/
├── .gitignore
├── LICENSE
├── README.md
├── docker-compose.yml          ← PostgreSQL, Redis, API services
├── environment.yml             ← Conda env: Python 3.11, RDKit, Polars, CUDA
├── pyproject.toml              ← Package definition + dev dependencies
│
├── assets/                     ← Logos, banners, pipeline diagrams
│   ├── logos/                  ← 6 SVG logo variants (white, dark, icon-only)
│   ├── 1.png                   ← Hero banner (header)
│   ├── 2.png                   ← Platform pillars overview
│   ├── 3.png                   ← Full architecture: Phase 01 ETL + Phase 02 Cloud
│   └── 5.png                   ← Footer with social links
│
├── configs/
│   ├── hardware.yaml           ← CPU/GPU/RAM tuning, checkpoint cadence
│   ├── pipeline.yaml           ← Stage settings, batch sizes, paths
│   └── sources/                ← Per-source YAML configs (6 files)
│
├── docker/
│   ├── build.sh                ← Build all 5 images in dependency order
│   ├── base/                   ← Miniconda + conda-forge
│   ├── cpu/                    ← RDKit, Polars, Snakemake
│   ├── gpu/                    ← PyTorch 2.8 + ADMET-AI + AIMNet2 (CUDA 12.8)
│   ├── pdbqt/                  ← Python 3.11 + meeko + gemmi
│   ├── snakemake/              ← Snakemake orchestrator
│   └── postgres/               ← RDKit cartridge init SQL
│
├── pipeline/
│   ├── Snakefile               ← Main DAG entry point
│   ├── config/pipeline.yaml
│   ├── config/profiles/dgx/   ← NVIDIA DGX A100 Snakemake profile
│   └── rules/                  ← 14 .smk rule files (00_audit → 13_validate)
│
├── scripts/                    ← Thin CLI entrypoints (delegate to src/)
│   └── (16 scripts: pipeline stage entrypoints plus helper utilities)
│
├── src/
│   └── alchemia/               ← Main Python package
│       ├── admet/              ← ADMET-AI runner, chunked + checkpointed
│       ├── classification/     ← 196-class SMARTS classifier
│       ├── conformers/         ← ETKDGv3 generator + 3D QC
│       ├── descriptors/        ← 2D properties + drug-likeness
│       ├── filters/            ← PAINS, Brenk, NIH, toxicophores
│       ├── fingerprints/       ← ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion
│       ├── io/                 ← Streaming SDF/MOL2/CSV readers
│       ├── minimization/       ← AIMNet2 via Auto3D + MMFF94s fallback
│       ├── pdbqt/              ← meeko PDBQT preparation
│       ├── postgres/           ← Schema, staging, loaders
│       ├── sources/            ← Per-source profiler + parser
│       ├── standardization/    ← RDKit MolStandardize + InChI dedup
│       ├── utils/              ← Logging, checksums, QC logger
│       └── viz/                ← Molecular image grid generator
│
├── tests/unit/                 ← 22 test modules, 63+ tests
│
├── utils/
│   ├── pains.json              ← 480+ PAINS SMARTS patterns
│   ├── smarts.json             ← 196-class chemical taxonomy
│   └── unwanted_substructures.csv
│
├── docs/
│   ├── decisions/              ← Architecture Decision Records
│   ├── runbooks/               ← current_state.md, next_actions.md
│   └── superpowers/            ← Implementation plans + design specs
│
└── data/                       ← Pipeline outputs (gitignored; .gitkeep only)
    ├── admet/ · classification/ · conformers/ · fingerprints/
    ├── minimization/ · pdbqt/ · properties/ · standardized/
    └── tables/ · viz/

Credits


Acknowledgments

The authors gratefully acknowledge Barretos Cancer Hospital (Hospital de Amor, Barretos, São Paulo, Brazil) for providing access to its high-performance computing infrastructure.

All 3D energy minimization stages — conformer generation, AIMNet2 neural network potential optimization, and MMFF94s fallback refinement across millions of molecules — were carried out on the institution's NVIDIA DGX A100 cluster (8× NVIDIA A100 80GB SXM4, CUDA 12.8). Without this computational infrastructure, minimizing structures at the scale of 8.7M+ compounds would not have been feasible.

We also extend our sincere gratitude to the following researchers for their scientific guidance, institutional support, and contributions to the drug discovery mission that motivates this work:

  • Dr. Rui Manuel Reis — for scientific leadership and vision in oncology research
  • Dra. Luciane Sussuchi — for contributions to molecular oncology and research coordination
  • Dr. Renato J. S. Oliveira — for support in computational and structural biology initiatives and preclinical drug discovery
  • Dra. Simone Queiroz Pantaleão — for contributions to chemistry research and drug discovery
  • Dr. André Luiz Pinto Santos and the Digital Health AI Laboratory (LiaaOnco) — for hardware support and access to the NVIDIA DGX A100 cluster at Barretos Cancer Hospital

We also thank the HPC support team at Barretos Cancer Hospital for technical assistance, and the open-source scientific software community — particularly the developers of RDKit, Auto3D, AIMNet2, ADMET-AI, meeko, Polars, Snakemake, and PostgreSQL — whose tools made this project possible.


References

Tools & Models

| Tool | Citation |
|---|---|
| RDKit | Landrum G. RDKit: Open-source cheminformatics. rdkit.org |
| Auto3D | Liu Z et al. J Chem Inf Model. 2022;62:5373. doi:10.1021/acs.jcim.2c00817 |
| AIMNet2 | Anstine DM et al. ChemRxiv 2023. doi:10.26434/chemrxiv-2023-296ch |
| ADMET-AI | Swanson K et al. Bioinformatics 2024. doi:10.1093/bioinformatics/btae416 |
| meeko | Forli S et al. AutoDock Meeko. github.com/forlilab/meeko |
| ECFP | Rogers D, Hahn M. J Chem Inf Model. 2010;50:742. doi:10.1021/ci100050t |
| PAINS | Baell JB, Holloway GA. J Med Chem. 2010;53:2719. doi:10.1021/jm901137j |
| Snakemake | Mölder F et al. F1000Research 2021. doi:10.12688/f1000research.29032.2 |
| pgvector | Kane A. pgvector: Open-source vector similarity search for PostgreSQL. github.com/pgvector/pgvector |

Databases

| Database | Citation |
|---|---|
| BrNPDB | Pilon AC et al. J Chem Inf Model. 2017;57(7):1652–1657. doi:10.1021/acs.jcim.7b00083 |
| NCI | Zaharevitz DW et al. National Cancer Institute Open Repository. cactus.nci.nih.gov |
| COCONUT | Sorokina M et al. J Cheminform. 2021;13:2. doi:10.1186/s13321-020-00478-9 |
| ChEMBL | Mendez D et al. Nucleic Acids Res. 2019;47:D930. doi:10.1093/nar/gky1075 |
| Enamine REAL | Grygorenko OO et al. iScience 2020;23(11):101681. doi:10.1016/j.isci.2020.101681 |
| Molport | Molport SIA. Molport Compound Catalog. molport.com |

Alchemia Molecular Database — LinkedIn · GitHub · preview.alchemiadatabase.com
