We're processing millions of molecules. While we prepare the full dataset for public release, sign up for early access — we'll notify you by email when we launch.
→ preview.alchemiadatabase.com/pt ←
Be among the first to access the Alchemia dataset.
ALCHEMIA MOLECULAR DATABASE is an open-source molecular data bank built with Spec-Driven Development for ultra-large virtual screening (ULVS) in drug discovery. It aggregates molecules from 6 public databases, standardizes them through a reproducible, GPU-accelerated Python pipeline, and delivers research-ready datasets at scale.
Phase 01 deliverables:
- Standardized chemical structures — SMILES, InChI, InChIKey, cross-source deduplication (ALC_XXXXXXX compound keys)
- Energy-minimized 3D conformers — AIMNet2 neural network potential (near-QM accuracy) via Auto3D, MMFF94s fallback
- ADMET property predictions — 30+ endpoints via ADMET-AI (absorption, distribution, metabolism, excretion, toxicity)
- Molecular fingerprints — ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet, PostgreSQL bfp-compatible
- Drug-likeness & structural filters — Lipinski, Veber, Ghose, QED, PAINS (480+), Brenk, NIH, toxicophores
- PDBQT-ready ligands — meeko-based preparation for AutoDock/Vina workflows
- PostgreSQL database — RDKit cartridge + pgvector + pg_trgm for similarity search at scale
All pipeline stages are chunked, checkpointed, and resumable. No stage ever loads a full SDF/MOL2/CSV into memory.
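A minimal sketch of what "chunked, checkpointed, and resumable" can look like in practice, using only the standard library. The function names (`iter_chunks`, `run_resumable`), the JSON checkpoint layout, and the default chunk size are assumptions made for illustration, not the project's actual implementation.

```python
import json
import os
from itertools import islice
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def iter_chunks(records: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield lists of at most chunk_size records without materializing the input."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def run_resumable(
    records: Iterable[str],
    process_chunk: Callable[[int, List[str]], None],
    checkpoint_path: str,
    chunk_size: int = 1000,
) -> int:
    """Process records chunk by chunk and return how many chunks ran this time.

    Chunks already recorded in the checkpoint file are skipped, so a rerun
    after an interruption continues where the previous run stopped.
    """
    checkpoint = Path(checkpoint_path)
    done = 0
    if checkpoint.exists():
        done = json.loads(checkpoint.read_text())["chunks_done"]

    processed_now = 0
    for index, chunk in enumerate(iter_chunks(records, chunk_size)):
        if index < done:
            continue  # finished in an earlier run
        process_chunk(index, chunk)
        # Write to a temp file and rename, so the checkpoint is never half-written.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps({"chunks_done": index + 1}))
        os.replace(tmp, checkpoint)
        processed_now += 1
    return processed_now
```

Because `records` can be any iterator (for example a generator that reads one SDF record at a time), memory use is bounded by the chunk size rather than by the size of the source file.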
Phase 01 — Data Pipeline & ETL: Raw ingestion, rigorous standardization (InChI key, 22+ properties, AIMNet2), 3D minimization, and PostgreSQL load for millions of molecules from 6 databases.
Phase 02 — Cloud & Web Product: FastAPI chemical search API (Tanimoto similarity + SMARTS substructure), Next.js frontend with Ketcher / Mol* / 3Dmol.js, and scalable AWS + Cloudflare infrastructure.
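For readers unfamiliar with the similarity metric named above, the Tanimoto coefficient of two binary fingerprints A and B is |A ∩ B| / |A ∪ B|. The sketch below computes it in pure Python on fingerprints stored as raw bytes. The byte layout and the function name `tanimoto` are assumptions for illustration; a production search would more likely run inside the database than in application code.

```python
def tanimoto(fp_a: bytes, fp_b: bytes) -> float:
    """Tanimoto coefficient of two equal-length bit vectors stored as bytes."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = int.from_bytes(fp_a, "big")
    b = int.from_bytes(fp_b, "big")
    common = bin(a & b).count("1")  # bits set in both fingerprints
    union = bin(a | b).count("1")   # bits set in at least one fingerprint
    # Two all-zero fingerprints are treated as dissimilar here by convention.
    return common / union if union else 0.0


# Toy 8-bit fingerprints: 2 shared bits, 6 bits in the union.
score = tanimoto(bytes([0b11110000]), bytes([0b00111100]))
```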
Raw sources (SDF / MOL2 / CSV)
→ Audit & Inventory (source manifest, file checksums)
→ Standardization + Cross-Dedup (RDKit MolStandardize, InChI keys, ALC_XXXXXXX)
→ 2D Properties + Drug-Likeness (MW, logP, TPSA, HBD/HBA, QED, Lipinski/Veber/Ghose)
→ Structural Filters (PAINS 480+, Brenk, NIH, toxicophores — flags, not exclusions)
→ Fingerprints (ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet)
→ Conformer Generation (ETKDGv3, MMFF94s geometry refinement, 3D QC)
→ AIMNet2 Energy Minimization (GPU, near-QM accuracy, H/C/N/O/F/S/Cl)
→ 3D SDF Export (AIMNet2-ok first, MMFF94s fallback)
→ ADMET Predictions (GPU, 30+ endpoints, ADMET-AI)
→ PDBQT Preparation (meeko, AutoDock4 format)
→ Molecular Classification (196 SMARTS classes)
→ Visualization (RDKit mol grid images)
→ PostgreSQL Load (RDKit cartridge + pgvector + pg_trgm)
→ Validation & Audit (QC log, qc_failures.parquet)
The pipeline is orchestrated with Snakemake (14 rules) and runs on Docker images optimized for NVIDIA DGX clusters (8× A100 80GB) or local GPU workstations (RTX 5070+).
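To illustrate the "flags, not exclusions" approach used by the drug-likeness and filter stages in the flow above, here is a small sketch that annotates property rows with Lipinski rule-of-five results instead of dropping compounds. The thresholds are the standard published ones (MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10). The column names follow the properties file described in this README, but the function names and the "at most one violation" pass criterion are assumptions, since the project's exact definition of `lipinski_pass` is not stated here.

```python
from typing import Dict, List, Union

Number = Union[int, float]

# Upper limits from Lipinski's rule of five.
LIPINSKI_LIMITS: Dict[str, Number] = {"mw": 500.0, "logp": 5.0, "hbd": 5, "hba": 10}


def lipinski_flags(props: Dict[str, Number], max_violations: int = 1) -> Dict[str, object]:
    """Return the violated rules and a pass flag for one compound."""
    violations: List[str] = [
        name for name, limit in LIPINSKI_LIMITS.items() if props[name] > limit
    ]
    return {
        "lipinski_violations": violations,
        # Assumption: one violation is tolerated, a common convention.
        "lipinski_pass": len(violations) <= max_violations,
    }


def annotate(rows: List[Dict[str, Number]]) -> List[Dict[str, object]]:
    """Add flag columns to every row; no row is ever removed."""
    return [{**row, **lipinski_flags(row)} for row in rows]


rows = [
    {"mw": 350.4, "logp": 2.1, "hbd": 2, "hba": 5},
    {"mw": 812.0, "logp": 7.3, "hbd": 1, "hba": 3},
]
annotated = annotate(rows)
```

Keeping failing compounds with a flag lets downstream users decide which filters matter for their own screening campaign.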
| Database | Description | Compounds | Type | Format | Phase 01 Status |
|---|---|---|---|---|---|
| BrNPDB | Brazilian Natural Products Database — curated natural products isolated from Brazilian biodiversity | 9,215 | Natural products | MOL2 (3D available) | ✅ All stages complete + PDBQT + PostgreSQL |
| NCI | National Cancer Institute Open Chemical Repository — compounds screened for antitumor activity | 85,495 | Bioactive / screening | SDF (multiple files) | ✅ All stages complete + PDBQT + PostgreSQL |
| COCONUT | Collection of Open Natural Products — the largest open-access natural products database | 725,267 | Natural products | CSV + 2D/3D SDF | Std–Props–FP–ADMET–Class ✅ · AIMNet2 🔄 (8× A100 shards) |
| ChEMBL | EMBL-EBI database of bioactive molecules with drug-like properties, curated from medicinal chemistry literature | ~2,854,815 | Bioactive / drugs | SDF + PostgreSQL dump | Std–Props–FP–Class ✅ · ADMET 🔄 ~18.7% · AIMNet2 🔄 shards active |
| Enamine | Enamine REAL Space — make-on-demand screening compounds with synthetic feasibility guarantees | ~5,000,000+ | Synthetic / screening | CSV + SDF | Std–Props–FP–Class ✅ · ADMET 🔄 running · AIMNet2 🔄 shards active |
| Molport | Commercial compound catalog — in-stock and make-on-demand molecules from 200+ suppliers | ~7,000,000+ | Commercial catalog | CSV + SDF + SMILES | ⏳ Queued |
Standardized to date: ~8.7M+ unique compounds across 5 databases, deduplicated by InChI. Full pipeline complete for BrNPDB and NCI; COCONUT completing AIMNet2 minimization on 8× NVIDIA A100 80GB; ChEMBL and Enamine ADMET and AIMNet2 active.
| Feature | Details |
|---|---|
| Multi-Source | COCONUT, NCI, BrNPDB, ChEMBL, Enamine, Molport — 6 databases, one unified schema |
| AIMNet2 Minimization | GPU-accelerated 3D energy minimization, near-QM accuracy (supports H, C, N, O, F, S, Cl) |
| ADMET Predictions | 30+ endpoints: absorption, distribution, metabolism, excretion, toxicity via ADMET-AI |
| 5 Fingerprint Types | ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet + PostgreSQL bfp |
| Structural Filters | PAINS (480+), Brenk, NIH, toxicophores — warning flags, not hard exclusions |
| Drug-Likeness | Lipinski, Veber, Ghose, lead-like, fragment-like, QED |
| 196-Class SMARTS | Chemical taxonomy from utils/smarts.json |
| PDBQT Prep | meeko-based AutoDock/Vina-ready PDBQT generation (pure Python 3.11, no MGLTools) |
| Docker Pipeline | 5 specialized images: base (3.35 GB) · cpu (3.38 GB) · gpu (36.2 GB) · pdbqt (3.43 GB) · snakemake (2.52 GB) |
| Streaming I/O | Chunked, checkpointed, resumable — never OOM on multi-million-compound sources |
| Deterministic Keys | compound_key = ALC_XXXXXXX (SHA256 of InChI) — stable join key across all tables |
| PostgreSQL-Ready | RDKit cartridge + pgvector (similarity search) + pg_trgm + pgcrypto |
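The deterministic key row above says only that `compound_key` has the form `ALC_XXXXXXX` and is derived from a SHA256 hash of the InChI. The sketch below shows one way such a key could be built. How the seven characters are actually taken from the digest is not documented here, so the truncation to the first seven hex digits is purely an assumption, as is the function name.

```python
import hashlib


def compound_key(inchi: str, length: int = 7) -> str:
    """Derive a stable key from an InChI string.

    Assumption: the key is the first `length` hex digits of the SHA256
    digest, upper-cased. The real scheme may differ.
    """
    digest = hashlib.sha256(inchi.strip().encode("utf-8")).hexdigest()
    return "ALC_" + digest[:length].upper()


methane_key = compound_key("InChI=1S/CH4/h1H4")
```

The same InChI always yields the same key, which is what makes it usable as a join key across tables. Note that seven hex digits give only about 268 million distinct values, so a naive truncation like this one would collide at multi-million-compound scale; any real scheme needs a longer or denser encoding, or explicit collision handling.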
Each source produces a suite of Parquet files, SDF exports, and PDBQT files:
| File | Key Columns |
|---|---|
| {source}_unique_compounds.parquet | compound_key, canonical_smiles, inchi, inchikey, source_compound_id |
| {source}_properties.parquet | compound_key, mw, logp, tpsa, hbd, hba, qed, lipinski_pass |
| {source}_fingerprints.parquet | compound_key, ecfp4, ecfp6, maccs, rdkit_fp, atompair, torsion (binary) |
| {source}_admet.parquet | compound_key, admet_model_name, admet_model_version, predictions_json |
| {source}_complete_3d.sdf | 3D molblocks — AIMNet2-minimized first, MMFF94s fallback, _Name = compound_key |
| {source}_complete_minimized.parquet | compound_key, minimization_method, energy_kcal_mol, min_status |
| {source}_pdbqt_manifest.parquet | compound_key, pdbqt_path, sha256, num_torsions, pdbqt_status |
| {source}_classification.parquet | compound_key, matched_classes (JSON), primary_class, num_classes |
| master_compounds.parquet | Cross-source deduplicated — compound_key (ALC_*), source_name, inchi, canonical_smiles |
| cross_reference.parquet | source_compound_key, source_name, source_compound_id, master_compound_key |
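The relationship between the last two files can be sketched as follows: per-source records are collapsed into one master row per unique InChI, while every original record keeps a pointer to its master row. The column names come from the table above. Everything else is an assumption for illustration: the function names, the "first source seen wins" policy, the `source:id` format used for `source_compound_key`, and the truncated-hash key.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]


def make_key(inchi: str) -> str:
    # Hypothetical key derivation; the project's real scheme is not shown here.
    return "ALC_" + hashlib.sha256(inchi.encode("utf-8")).hexdigest()[:7].upper()


def build_master(records: Iterable[Row]) -> Tuple[List[Row], List[Row]]:
    """Return (master_compounds, cross_reference) for a stream of source records."""
    master: Dict[str, Row] = {}  # keyed by InChI, one row per unique structure
    cross_reference: List[Row] = []
    for rec in records:
        inchi = rec["inchi"]
        if inchi not in master:
            # Assumption: the first source that contributes a structure owns the row.
            master[inchi] = {
                "compound_key": make_key(inchi),
                "source_name": rec["source_name"],
                "inchi": inchi,
                "canonical_smiles": rec["canonical_smiles"],
            }
        cross_reference.append({
            "source_compound_key": f'{rec["source_name"]}:{rec["source_compound_id"]}',
            "source_name": rec["source_name"],
            "source_compound_id": rec["source_compound_id"],
            "master_compound_key": master[inchi]["compound_key"],
        })
    return list(master.values()), cross_reference


# The record IDs below are made up for the example.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
records = [
    {"source_name": "COCONUT", "source_compound_id": "X1", "inchi": ETHANOL, "canonical_smiles": "CCO"},
    {"source_name": "ChEMBL", "source_compound_id": "Y9", "inchi": ETHANOL, "canonical_smiles": "CCO"},
    {"source_name": "NCI", "source_compound_id": "Z3", "inchi": "InChI=1S/CH4/h1H4", "canonical_smiles": "C"},
]
master_rows, xref_rows = build_master(records)
```

Here the two ethanol records collapse into a single master row, while `xref_rows` keeps all three source entries so provenance is never lost.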
Input: raw SMILES / existing 3D coordinates
1. ETKDGv3 conformer generation (RDKit)
2. MMFF94s geometry refinement (UFF fallback)
3. AIMNet2 energy minimization via Auto3D
— Supported elements: H, C, N, O, F, S, Cl
— GPU: NVIDIA A100 80GB (8×, DGX cluster) / RTX 5070 (local)
— Graceful skip for unsupported elements (MMFF94s result kept)
4. Complete 3D SDF export (AIMNet2-ok first, MMFF94s fallback)
5. PDBQT generation via meeko (AutoDock4 format)
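The fallback logic in the steps above can be sketched as a small decision function: a molecule goes to AIMNet2 only if all of its elements are in the supported set, and the MMFF94s geometry is kept otherwise or when AIMNet2 did not finish successfully. The element list is taken from this README; the function names, return shape, and the `aimnet2_ok` argument are assumptions for illustration.

```python
from typing import Iterable, List, Tuple

# Elements listed in this README as supported by the AIMNet2 stage.
AIMNET2_ELEMENTS = frozenset({"H", "C", "N", "O", "F", "S", "Cl"})


def choose_minimizer(elements: Iterable[str]) -> Tuple[str, List[str]]:
    """Return the minimization method and any elements that forced a fallback."""
    unsupported = sorted(set(elements) - AIMNET2_ELEMENTS)
    if unsupported:
        return "mmff94s", unsupported  # graceful skip: keep the MMFF94s result
    return "aimnet2", []


def export_method(elements: Iterable[str], aimnet2_ok: bool) -> str:
    """Pick which 3D structure to export: AIMNet2 if usable, else MMFF94s."""
    method, _ = choose_minimizer(elements)
    return "aimnet2" if method == "aimnet2" and aimnet2_ok else "mmff94s"


ethanol = choose_minimizer(["C", "C", "O", "H"])
bromide = choose_minimizer(["C", "H", "Br"])
```

Checking elements before dispatching to the GPU avoids spending accelerator time on molecules the model cannot handle.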
- Python 3.11, conda/mamba
- NVIDIA GPU + CUDA 12.8 (for AIMNet2 minimization and ADMET-AI)
- Docker + NVIDIA Container Toolkit (for containerized pipeline)
git clone https://github.com/alchemia-db/alchemia.git
cd alchemia
# Create conda environment (RDKit, Polars, PyArrow included)
conda env create -f environment.yml
conda activate alchemia-ph1
# PyTorch with CUDA 12.8 (RTX 5070 Blackwell / A100 Ampere)
pip install torch --index-url https://download.pytorch.org/whl/cu128
# Editable install
pip install -e ".[dev]"
# Run tests
pytest tests/ -v

# Build all images (base → cpu → gpu → pdbqt → snakemake)
bash docker/build.sh
# CPU stage (standardization, properties, fingerprints)
docker run --rm -v "$PWD:/workspace" alchemia/cpu \
python scripts/run_standardization.py --source BrNPDB --sample 1000
# GPU stage (AIMNet2 minimization)
docker run --rm --gpus all -v "$PWD:/workspace" alchemia/gpu \
python scripts/run_minimization.py --source BrNPDB --device cuda
# GPU stage (ADMET predictions)
docker run --rm --gpus all -v "$PWD:/workspace" alchemia/gpu \
python scripts/run_admet.py --source BrNPDB
# PDBQT preparation
docker run --rm -v "$PWD:/workspace" alchemia/pdbqt \
python scripts/run_pdbqt.py --source BrNPDB
# Full Snakemake pipeline (dry-run first)
docker run --rm -v "$PWD:/workspace" alchemia/snakemake \
snakemake --cores 8 --dry-run

All scripts support --sample N, --dry-run, --resume:
python scripts/audit_repository.py # Audit & inventory
python scripts/run_standardization.py # Standardize + cross-dedup
python scripts/run_properties.py # 2D descriptors + drug-likeness
python scripts/run_filters.py # PAINS / Brenk / NIH flags
python scripts/run_fingerprints.py # ECFP4/6 / MACCS / RDKit / AtomPair / Torsion
python scripts/run_conformers.py # ETKDGv3 conformer generation
python scripts/patch_auto3d.py # Patch Auto3D np.min([]) crash
python scripts/run_minimization.py # AIMNet2 energy minimization
python scripts/merge_minimized_sdf.py # Complete 3D SDF export
python scripts/run_admet.py # ADMET-AI predictions
python scripts/run_pdbqt.py # PDBQT via meeko
python scripts/run_classification.py # 196-class SMARTS taxonomy
python scripts/run_viz.py # Molecular image grids
python scripts/load_postgres.py # PostgreSQL load
python scripts/orchestrate_gpus.py # GPU watchdog — auto-dispatches tasks to idle GPUs
python scripts/dashboard.py # Terminal TUI — real-time pipeline progress and GPU monitoring

Only GitHub-tracked files are shown (databases, pipeline outputs, and local agent files are gitignored):
alchemia/
├── .gitignore
├── LICENSE
├── README.md
├── docker-compose.yml ← PostgreSQL, Redis, API services
├── environment.yml ← Conda env: Python 3.11, RDKit, Polars, CUDA
├── pyproject.toml ← Package definition + dev dependencies
│
├── assets/ ← Logos, banners, pipeline diagrams
│ ├── logos/ ← 6 SVG logo variants (white, dark, icon-only)
│ ├── 1.png ← Hero banner (header)
│ ├── 2.png ← Platform pillars overview
│ ├── 3.png ← Full architecture: Phase 01 ETL + Phase 02 Cloud
│ └── 5.png ← Footer with social links
│
├── configs/
│ ├── hardware.yaml ← CPU/GPU/RAM tuning, checkpoint cadence
│ ├── pipeline.yaml ← Stage settings, batch sizes, paths
│ └── sources/ ← Per-source YAML configs (6 files)
│
├── docker/
│ ├── build.sh ← Build all 5 images in dependency order
│ ├── base/ ← Miniconda + conda-forge
│ ├── cpu/ ← RDKit, Polars, Snakemake
│ ├── gpu/ ← PyTorch 2.8 + ADMET-AI + AIMNet2 (CUDA 12.8)
│ ├── pdbqt/ ← Python 3.11 + meeko + gemmi
│ ├── snakemake/ ← Snakemake orchestrator
│ └── postgres/ ← RDKit cartridge init SQL
│
├── pipeline/
│ ├── Snakefile ← Main DAG entry point
│ ├── config/pipeline.yaml
│ ├── config/profiles/dgx/ ← NVIDIA DGX A100 Snakemake profile
│ └── rules/ ← 14 .smk rule files (00_audit → 13_validate)
│
├── scripts/ ← Thin CLI entrypoints (delegate to src/)
│ └── (16 scripts, one per pipeline stage)
│
├── src/
│ └── alchemia/ ← Main Python package
│ ├── admet/ ← ADMET-AI runner, chunked + checkpointed
│ ├── classification/ ← 196-class SMARTS classifier
│ ├── conformers/ ← ETKDGv3 generator + 3D QC
│ ├── descriptors/ ← 2D properties + drug-likeness
│ ├── filters/ ← PAINS, Brenk, NIH, toxicophores
│ ├── fingerprints/ ← ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion
│ ├── io/ ← Streaming SDF/MOL2/CSV readers
│ ├── minimization/ ← AIMNet2 via Auto3D + MMFF94s fallback
│ ├── pdbqt/ ← meeko PDBQT preparation
│ ├── postgres/ ← Schema, staging, loaders
│ ├── sources/ ← Per-source profiler + parser
│ ├── standardization/ ← RDKit MolStandardize + InChI dedup
│ ├── utils/ ← Logging, checksums, QC logger
│ └── viz/ ← Molecular image grid generator
│
├── tests/unit/ ← 22 test modules, 63+ tests
│
├── utils/
│ ├── pains.json ← 480+ PAINS SMARTS patterns
│ ├── smarts.json ← 196-class chemical taxonomy
│ └── unwanted_substructures.csv
│
├── docs/
│ ├── decisions/ ← Architecture Decision Records
│ ├── runbooks/ ← current_state.md, next_actions.md
│ └── superpowers/ ← Implementation plans + design specs
│
└── data/ ← Pipeline outputs (gitignored; .gitkeep only)
├── admet/ · classification/ · conformers/ · fingerprints/
├── minimization/ · pdbqt/ · properties/ · standardized/
└── tables/ · viz/
- Lead scientist: Aryel J. A. Bezerra (@HighScientist)
- Lead developer: Gabriel C. Furniel (@gabriel1734)
The authors gratefully acknowledge Barretos Cancer Hospital (Hospital de Amor, Barretos, São Paulo, Brazil) for providing access to its high-performance computing infrastructure.
All 3D energy minimization stages — conformer generation, AIMNet2 neural network potential optimization, and MMFF94s fallback refinement across millions of molecules — were carried out on the institution's NVIDIA DGX A100 cluster (8× NVIDIA A100 80GB SXM4, CUDA 12.8). Without this computational infrastructure, minimizing structures at the scale of 8.7M+ compounds would not have been feasible.
We also extend our sincere gratitude to the following researchers for their scientific guidance, institutional support, and contributions to the drug discovery mission that motivates this work:
- Dr. Rui Manuel Reis — for scientific leadership and vision in oncology research
- Dra. Luciane Sussuchi — for contributions to molecular oncology and research coordination
- Dr. Renato J. S. Oliveira — for support in computational and structural biology initiatives and preclinical drug discovery
- Dra. Simone Queiroz Pantaleão — for contributions to chemistry research and drug discovery
- Dr. André Luiz Pinto Santos and the Digital Health AI Laboratory (LiaaOnco) — for hardware support and access to the NVIDIA DGX A100 cluster at Barretos Cancer Hospital
We also thank the HPC support team at Barretos Cancer Hospital for technical assistance, and the open-source scientific software community — particularly the developers of RDKit, Auto3D, AIMNet2, ADMET-AI, meeko, Polars, Snakemake, and PostgreSQL — whose tools made this project possible.
| Tool | Citation |
|---|---|
| RDKit | Landrum G. RDKit: Open-source cheminformatics. rdkit.org |
| Auto3D | Liu Z et al. J Chem Inf Model. 2022;62:5373. doi:10.1021/acs.jcim.2c00817 |
| AIMNet2 | Anstine DM et al. ChemRxiv 2023. doi:10.26434/chemrxiv-2023-296ch |
| ADMET-AI | Swanson K et al. Bioinformatics 2024. doi:10.1093/bioinformatics/btae416 |
| meeko | Forli S et al. AutoDock Meeko. github.com/forlilab/meeko |
| ECFP | Rogers D, Hahn M. J Chem Inf Model. 2010;50:742. doi:10.1021/ci100050t |
| PAINS | Baell JB, Holloway GA. J Med Chem. 2010;53:2719. doi:10.1021/jm901137j |
| Snakemake | Mölder F et al. F1000Research 2021. doi:10.12688/f1000research.29032.2 |
| pgvector | Kane A. pgvector: Open-source vector similarity search for PostgreSQL. github.com/pgvector/pgvector |
| Database | Citation |
|---|---|
| BrNPDB | Pilon AC et al. J Chem Inf Model. 2017;57(7):1652–1657. doi:10.1021/acs.jcim.7b00083 |
| NCI | Zaharevitz DW et al. National Cancer Institute Open Repository. cactus.nci.nih.gov |
| COCONUT | Sorokina M et al. J Cheminform. 2021;13:2. doi:10.1186/s13321-020-00478-9 |
| ChEMBL | Mendez D et al. Nucleic Acids Res. 2019;47:D930. doi:10.1093/nar/gky1075 |
| Enamine REAL | Grygorenko OO et al. iScience 2020;23(11):101681. doi:10.1016/j.isci.2020.101681 |
| Molport | Molport SIA. Molport Compound Catalog. molport.com |



