GitHub - alchemia-db/alchemia: Alchemia Molecular Database: Accelerating Drug Discovery with AI

Early Access

We're processing millions of molecules. While we prepare the full dataset for public release, sign up for early access — we'll notify you by email when we launch.

→ preview.alchemiadatabase.com/pt ←
Be among the first to access the Alchemia dataset.

Overview

ALCHEMIA MOLECULAR DATABASE is an open-source molecular data bank built with Spec-Driven Development for ultra-large virtual screening (ULVS) in drug discovery. It aggregates molecules from 6 public databases, standardizes them through a reproducible, GPU-accelerated Python pipeline, and delivers research-ready datasets at scale.

Phase 01 deliverables:

Standardized chemical structures — SMILES, InChI, InChIKey, cross-source deduplication (ALC_XXXXXXX compound keys)
Energy-minimized 3D conformers — AIMNet2 neural network potential (near-QM accuracy) via Auto3D, MMFF94s fallback
ADMET property predictions — 30+ endpoints via ADMET-AI (absorption, distribution, metabolism, excretion, toxicity)
Molecular fingerprints — ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet, PostgreSQL bfp compatible
Drug-likeness & structural filters — Lipinski, Veber, Ghose, QED, PAINS (480+), Brenk, NIH, toxicophores
PDBQT-ready ligands — meeko-based preparation for AutoDock/Vina workflows
PostgreSQL database — RDKit cartridge + pgvector + pg_trgm for similarity search at scale

All pipeline stages are chunked, checkpointed, and resumable. No stage ever loads a full SDF/MOL2/CSV into memory.
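The streaming guarantee above can be pictured with a small sketch. It is not the project's reader (that code lives under `src/alchemia/io/`); it only illustrates reading an SDF record by record, grouping records into chunks, and storing a chunk counter so a run can resume. The function names, the `chunks_done` checkpoint field, and the JSON checkpoint file are all assumptions made for illustration.

```python
import io
import json
from pathlib import Path
from typing import Iterator, List, TextIO


def iter_sdf_records(handle: TextIO) -> Iterator[str]:
    """Yield one SDF record at a time; a "$$$$" line ends each record."""
    lines: List[str] = []
    for line in handle:
        lines.append(line)
        if line.strip() == "$$$$":
            yield "".join(lines)
            lines = []
    # Trailing lines without a "$$$$" terminator are treated as incomplete and dropped.


def iter_chunks(handle: TextIO, chunk_size: int, skip_chunks: int = 0) -> Iterator[List[str]]:
    """Group records into chunks of chunk_size, skipping the first skip_chunks chunks."""
    chunk: List[str] = []
    index = 0
    for record in iter_sdf_records(handle):
        chunk.append(record)
        if len(chunk) == chunk_size:
            if index >= skip_chunks:
                yield chunk
            index += 1
            chunk = []
    if chunk and index >= skip_chunks:
        yield chunk


def load_checkpoint(path: Path) -> int:
    """Return the number of chunks already processed (0 if no checkpoint exists)."""
    return json.loads(path.read_text())["chunks_done"] if path.exists() else 0


def save_checkpoint(path: Path, chunks_done: int) -> None:
    """Persist progress so a later run can resume after the last finished chunk."""
    path.write_text(json.dumps({"chunks_done": chunks_done}))


# Tiny in-memory demo: five fake records, chunks of two.
demo = "".join(f"mol{i}\n$$$$\n" for i in range(5))
sizes = [len(c) for c in iter_chunks(io.StringIO(demo), chunk_size=2)]
print(sizes)  # [2, 2, 1]
```

Only one chunk is held in memory at a time, so peak memory depends on `chunk_size` rather than on the size of the source file. A resumed run would pass `skip_chunks=load_checkpoint(path)`.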

Architecture

Phase 01 (Data Pipeline & ETL): Raw ingestion, standardization with InChIKey generation, 22+ computed properties, AIMNet2 3D energy minimization, and PostgreSQL load for millions of molecules from 6 databases.

Phase 02 (Cloud & Web Product): FastAPI chemical search API (Tanimoto similarity + SMARTS substructure), Next.js frontend with Ketcher / Mol* / 3Dmol.js, and scalable AWS + Cloudflare infrastructure.
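For readers unfamiliar with the Tanimoto similarity mentioned for the search API, here is a minimal sketch of the metric on binary fingerprints stored as raw bytes. It is not the Phase 02 implementation (in PostgreSQL the RDKit cartridge would typically do this work). The function name and the choice to return 0.0 for two empty fingerprints are assumptions.

```python
def tanimoto(fp_a: bytes, fp_b: bytes) -> float:
    """Tanimoto coefficient |A & B| / |A | B| for two equal-length bit vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = int.from_bytes(fp_a, "big")
    b = int.from_bytes(fp_b, "big")
    common = bin(a & b).count("1")  # bits set in both fingerprints
    union = bin(a | b).count("1")   # bits set in at least one fingerprint
    # Convention chosen here (an assumption): two all-zero fingerprints score 0.0.
    return common / union if union else 0.0


# 0x0F = 00001111 and 0x3C = 00111100 share 2 bits out of 6 set overall.
print(round(tanimoto(b"\x0f", b"\x3c"), 3))  # 0.333
```

Identical non-empty fingerprints score 1.0 and fingerprints with no shared bits score 0.0, which is what makes a similarity threshold meaningful in a search query.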

Phase 01 Pipeline

Raw sources (SDF / MOL2 / CSV)
  → Audit & Inventory             (source manifest, file checksums)
  → Standardization + Cross-Dedup (RDKit MolStandardize, InChI keys, ALC_XXXXXXX)
  → 2D Properties + Drug-Likeness (MW, logP, TPSA, HBD/HBA, QED, Lipinski/Veber/Ghose)
  → Structural Filters            (PAINS 480+, Brenk, NIH, toxicophores — flags, not exclusions)
  → Fingerprints                  (ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet)
  → Conformer Generation          (ETKDGv3, MMFF94s geometry refinement, 3D QC)
  → AIMNet2 Energy Minimization   (GPU, near-QM accuracy, H/C/N/O/F/S/Cl)
  → 3D SDF Export                 (AIMNet2-ok first, MMFF94s fallback)
  → ADMET Predictions             (GPU, 30+ endpoints, ADMET-AI)
  → PDBQT Preparation             (meeko, AutoDock4 format)
  → Molecular Classification      (196 SMARTS classes)
  → Visualization                 (RDKit mol grid images)
  → PostgreSQL Load               (RDKit cartridge + pgvector + pg_trgm)
  → Validation & Audit            (QC log, qc_failures.parquet)

The pipeline is orchestrated with Snakemake (14 rules) and runs on Docker images optimized for NVIDIA DGX clusters (8× A100 80GB) or local GPU workstations (RTX 5070+).

Source Databases

Database	Description	Compounds	Type	Format	Phase 01 Status
BrNPDB	Brazilian Natural Products Database — curated natural products isolated from Brazilian biodiversity	9,215	Natural products	MOL2 (3D available)	✅ All stages complete + PDBQT + PostgreSQL
NCI	National Cancer Institute Open Chemical Repository — compounds screened for antitumor activity	85,495	Bioactive / screening	SDF (multiple files)	✅ All stages complete + PDBQT + PostgreSQL
COCONUT	Collection of Open Natural Products — the largest open-access natural products database	725,267	Natural products	CSV + 2D/3D SDF	Std–Props–FP–ADMET–Class ✅ · AIMNet2 🔄 (8× A100 shards)
ChEMBL	EMBL-EBI database of bioactive molecules with drug-like properties, curated from medicinal chemistry literature	~2,854,815	Bioactive / drugs	SDF + PostgreSQL dump	Std–Props–FP–Class ✅ · ADMET 🔄 ~18.7% · AIMNet2 🔄 shards active
Enamine	Enamine REAL Space — make-on-demand screening compounds with synthetic feasibility guarantees	~5,000,000+	Synthetic / screening	CSV + SDF	Std–Props–FP–Class ✅ · ADMET 🔄 running · AIMNet2 🔄 shards active
Molport	Commercial compound catalog — in-stock and make-on-demand molecules from 200+ suppliers	~7,000,000+	Commercial catalog	CSV + SDF + SMILES	⏳ Queued

Standardized to date: roughly 8.7 million unique compounds across 5 of the 6 databases (Molport is still queued), deduplicated by InChI. The full pipeline is complete for BrNPDB and NCI; COCONUT is finishing AIMNet2 minimization on 8× NVIDIA A100 80GB; ADMET prediction and AIMNet2 minimization are in progress for ChEMBL and Enamine.

Features

Feature	Details
Multi-Source	COCONUT, NCI, BrNPDB, ChEMBL, Enamine, Molport — 6 databases, one unified schema
AIMNet2 Minimization	GPU-accelerated 3D energy minimization, near-QM accuracy (supports H, C, N, O, F, S, Cl)
ADMET Predictions	30+ endpoints: absorption, distribution, metabolism, excretion, toxicity via ADMET-AI
5 Fingerprint Types	ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion — binary Parquet + PostgreSQL `bfp`
Structural Filters	PAINS (480+), Brenk, NIH, toxicophores — warning flags, not hard exclusions
Drug-Likeness	Lipinski, Veber, Ghose, lead-like, fragment-like, QED
196-Class SMARTS	Chemical taxonomy from `utils/smarts.json`
PDBQT Prep	meeko-based AutoDock/Vina-ready PDBQT generation (pure Python 3.11, no MGLTools)
Docker Pipeline	5 specialized images: `base` (3.35 GB) · `cpu` (3.38 GB) · `gpu` (36.2 GB) · `pdbqt` (3.43 GB) · `snakemake` (2.52 GB)
Streaming I/O	Chunked, checkpointed, resumable — never OOM on multi-million-compound sources
Deterministic Keys	`compound_key` = ALC_XXXXXXX (SHA256 of InChI) — stable join key across all tables
PostgreSQL-Ready	RDKit cartridge + pgvector (similarity search) + pg_trgm + pgcrypto
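The "Deterministic Keys" row says `compound_key` is derived from the SHA256 of the InChI. The sketch below shows why such a key is stable across runs and sources. The README does not say how the digest is mapped to the seven-character `XXXXXXX` suffix, so the truncation used here (first 7 hex characters, uppercased) is purely an assumption, as are the function name and the `width` parameter.

```python
import hashlib


def compound_key(inchi: str, width: int = 7) -> str:
    """Derive a deterministic ALC_* style key from an InChI string.

    Assumption: the suffix is the first `width` hex characters of the
    SHA256 digest. The project's real mapping may differ.
    """
    digest = hashlib.sha256(inchi.encode("utf-8")).hexdigest()
    return "ALC_" + digest[:width].upper()


methane = "InChI=1S/CH4/h1H4"
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

# The same InChI always maps to the same key, on any machine.
print(compound_key(methane) == compound_key(methane))  # True
print(compound_key(methane) != compound_key(ethanol))  # True
```

A 7-character hex prefix carries only 28 bits, so at millions of compounds this particular truncation would be expected to collide. A real implementation would need a longer suffix or an explicit collision check.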

Data Outputs

Each source produces a suite of Parquet files, SDF exports, and PDBQT files:

File	Key Columns
`{source}_unique_compounds.parquet`	`compound_key`, `canonical_smiles`, `inchi`, `inchikey`, `source_compound_id`
`{source}_properties.parquet`	`compound_key`, `mw`, `logp`, `tpsa`, `hbd`, `hba`, `qed`, `lipinski_pass`
`{source}_fingerprints.parquet`	`compound_key`, `ecfp4`, `ecfp6`, `maccs`, `rdkit_fp`, `atompair`, `torsion` (binary)
`{source}_admet.parquet`	`compound_key`, `admet_model_name`, `admet_model_version`, `predictions_json`
`{source}_complete_3d.sdf`	3D molblocks — AIMNet2-minimized first, MMFF94s fallback, `_Name` = `compound_key`
`{source}_complete_minimized.parquet`	`compound_key`, `minimization_method`, `energy_kcal_mol`, `min_status`
`{source}_pdbqt_manifest.parquet`	`compound_key`, `pdbqt_path`, `sha256`, `num_torsions`, `pdbqt_status`
`{source}_classification.parquet`	`compound_key`, `matched_classes` (JSON), `primary_class`, `num_classes`
`master_compounds.parquet`	Cross-source deduplicated — `compound_key (ALC_*)`, `source_name`, `inchi`, `canonical_smiles`
`cross_reference.parquet`	`source_compound_key`, `source_name`, `source_compound_id`, `master_compound_key`
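As an illustration of what a flag such as `lipinski_pass` in `{source}_properties.parquet` could encode, the sketch below applies the published Lipinski and Veber thresholds to precomputed properties. The field names `mw`, `logp`, `hbd`, `hba`, and `tpsa` follow the table above. `rotatable_bonds`, the `Properties` class, and the "at most one violation" pass criterion are assumptions, and the project's own rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Properties:
    mw: float             # molecular weight, Da
    logp: float           # octanol/water partition coefficient
    hbd: int              # hydrogen-bond donors
    hba: int              # hydrogen-bond acceptors
    tpsa: float           # topological polar surface area, square angstroms
    rotatable_bonds: int  # hypothetical column name


def lipinski_violations(p: Properties) -> int:
    """Count violations of Lipinski's rule of five."""
    return sum([p.mw > 500, p.logp > 5, p.hbd > 5, p.hba > 10])


def lipinski_pass(p: Properties, max_violations: int = 1) -> bool:
    """Assumed criterion: pass when at most one rule is violated."""
    return lipinski_violations(p) <= max_violations


def veber_pass(p: Properties) -> bool:
    """Veber rules: at most 10 rotatable bonds and TPSA of at most 140."""
    return p.rotatable_bonds <= 10 and p.tpsa <= 140


# Illustrative values only, not pipeline output.
small = Properties(mw=180.2, logp=1.2, hbd=1, hba=4, tpsa=63.6, rotatable_bonds=3)
large = Properties(mw=800.0, logp=7.0, hbd=6, hba=12, tpsa=200.0, rotatable_bonds=15)
print(lipinski_pass(small), veber_pass(small))  # True True
print(lipinski_pass(large), veber_pass(large))  # False False
```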

3D Structure Pipeline

Input: raw SMILES / existing 3D coordinates
  1. ETKDGv3 conformer generation (RDKit)
  2. MMFF94s geometry refinement (UFF fallback)
  3. AIMNet2 energy minimization via Auto3D
     — Supported elements: H, C, N, O, F, S, Cl
     — GPU: NVIDIA A100 80GB (8×, DGX cluster) / RTX 5070 (local)
     — Graceful skip for unsupported elements (MMFF94s result kept)
  4. Complete 3D SDF export (AIMNet2-ok first, MMFF94s fallback)
  5. PDBQT generation via meeko (AutoDock4 format)
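The element check and fallback in steps 3 and 4 can be sketched as a small decision function. The element set comes from the list above. The function names and the returned method labels are assumptions, and "AIMNet2-ok first, MMFF94s fallback" is read here as "prefer the AIMNet2 geometry when it exists and succeeded".

```python
from typing import Iterable

# Elements the README lists as supported by the AIMNet2 stage.
AIMNET2_ELEMENTS = frozenset({"H", "C", "N", "O", "F", "S", "Cl"})


def aimnet2_supported(elements: Iterable[str]) -> bool:
    """True when every element symbol in the molecule is in the supported set."""
    return set(elements) <= AIMNET2_ELEMENTS


def choose_final_method(elements: Iterable[str], aimnet2_ok: bool) -> str:
    """Pick which geometry is exported (labels are hypothetical).

    The AIMNet2 result is used only when the molecule is supported and the
    minimization succeeded; otherwise the MMFF94s geometry is kept.
    """
    if aimnet2_supported(elements) and aimnet2_ok:
        return "AIMNet2"
    return "MMFF94s"


print(choose_final_method(["C", "H", "O"], aimnet2_ok=True))    # AIMNet2
print(choose_final_method(["C", "H", "Br"], aimnet2_ok=True))   # MMFF94s
print(choose_final_method(["C", "H", "N"], aimnet2_ok=False))   # MMFF94s
```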

Installation

Prerequisites

Python 3.11, conda/mamba
NVIDIA GPU + CUDA 12.8 (for AIMNet2 minimization and ADMET-AI)
Docker + NVIDIA Container Toolkit (for containerized pipeline)

Local Setup

git clone https://github.com/alchemia-db/alchemia.git
cd alchemia

# Create conda environment (RDKit, Polars, PyArrow included)
conda env create -f environment.yml
conda activate alchemia-ph1

# PyTorch with CUDA 12.8 (RTX 5070 Blackwell / A100 Ampere)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Editable install
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

Docker Pipeline

# Build all images (base → cpu → gpu → pdbqt → snakemake)
bash docker/build.sh

# CPU stage (standardization, properties, fingerprints)
docker run --rm -v "$PWD:/workspace" alchemia/cpu \
  python scripts/run_standardization.py --source BrNPDB --sample 1000

# GPU stage (AIMNet2 minimization)
docker run --rm --gpus all -v "$PWD:/workspace" alchemia/gpu \
  python scripts/run_minimization.py --source BrNPDB --device cuda

# GPU stage (ADMET predictions)
docker run --rm --gpus all -v "$PWD:/workspace" alchemia/gpu \
  python scripts/run_admet.py --source BrNPDB

# PDBQT preparation
docker run --rm -v "$PWD:/workspace" alchemia/pdbqt \
  python scripts/run_pdbqt.py --source BrNPDB

# Full Snakemake pipeline (dry-run first)
docker run --rm -v "$PWD:/workspace" alchemia/snakemake \
  snakemake --cores 8 --dry-run

Pipeline Scripts

All scripts support --sample N, --dry-run, --resume:

python scripts/audit_repository.py          # Audit & inventory
python scripts/run_standardization.py       # Standardize + cross-dedup
python scripts/run_properties.py            # 2D descriptors + drug-likeness
python scripts/run_filters.py               # PAINS / Brenk / NIH flags
python scripts/run_fingerprints.py          # ECFP4/6 / MACCS / RDKit FP / AtomPair / Torsion
python scripts/run_conformers.py            # ETKDGv3 conformer generation
python scripts/patch_auto3d.py              # Patch Auto3D np.min([]) crash
python scripts/run_minimization.py          # AIMNet2 energy minimization
python scripts/merge_minimized_sdf.py       # Complete 3D SDF export
python scripts/run_admet.py                 # ADMET-AI predictions
python scripts/run_pdbqt.py                 # PDBQT via meeko
python scripts/run_classification.py        # 196-class SMARTS taxonomy
python scripts/run_viz.py                   # Molecular image grids
python scripts/load_postgres.py             # PostgreSQL load
python scripts/orchestrate_gpus.py          # GPU watchdog — auto-dispatches tasks to idle GPUs
python scripts/dashboard.py                 # Terminal TUI — real-time pipeline progress and GPU monitoring
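A shared command-line interface like the one described above (`--sample N`, `--dry-run`, `--resume`, plus the `--source` flag used in the Docker examples) could be wired up with `argparse` roughly as follows. This is a sketch under assumptions: the `build_parser` helper and the help texts are hypothetical, and the real scripts may define their options differently.

```python
import argparse
from typing import Optional


def build_parser(description: str) -> argparse.ArgumentParser:
    """Build a parser exposing the flags the README says every script accepts."""
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument("--source", help="source database name, e.g. BrNPDB")
    parser.add_argument("--sample", type=int, default=None, metavar="N",
                        help="process only the first N compounds")
    parser.add_argument("--dry-run", action="store_true",
                        help="report what would be done without writing outputs")
    parser.add_argument("--resume", action="store_true",
                        help="continue from the last checkpoint")
    return parser


def describe(args: argparse.Namespace) -> str:
    """Summarize the parsed options (stand-in for a real stage entrypoint)."""
    limit: Optional[int] = args.sample
    scope = f"first {limit} compounds" if limit is not None else "all compounds"
    mode = "dry run" if args.dry_run else "full run"
    return f"{args.source}: {scope}, {mode}, resume={args.resume}"


args = build_parser("Standardize a source").parse_args(
    ["--source", "BrNPDB", "--sample", "1000", "--dry-run"]
)
print(describe(args))  # BrNPDB: first 1000 compounds, dry run, resume=False
```

Building the parser in one helper keeps the three flags consistent across scripts, which fits the "thin CLI entrypoints" layout described for `scripts/`.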

Repository Structure

Only GitHub-tracked files are shown (databases, pipeline outputs, and local agent files are gitignored):

alchemia/
├── .gitignore
├── LICENSE
├── README.md
├── docker-compose.yml          ← PostgreSQL, Redis, API services
├── environment.yml             ← Conda env: Python 3.11, RDKit, Polars, CUDA
├── pyproject.toml              ← Package definition + dev dependencies
│
├── assets/                     ← Logos, banners, pipeline diagrams
│   ├── logos/                  ← 6 SVG logo variants (white, dark, icon-only)
│   ├── 1.png                   ← Hero banner (header)
│   ├── 2.png                   ← Platform pillars overview
│   ├── 3.png                   ← Full architecture: Phase 01 ETL + Phase 02 Cloud
│   └── 5.png                   ← Footer with social links
│
├── configs/
│   ├── hardware.yaml           ← CPU/GPU/RAM tuning, checkpoint cadence
│   ├── pipeline.yaml           ← Stage settings, batch sizes, paths
│   └── sources/                ← Per-source YAML configs (6 files)
│
├── docker/
│   ├── build.sh                ← Build all 5 images in dependency order
│   ├── base/                   ← Miniconda + conda-forge
│   ├── cpu/                    ← RDKit, Polars, Snakemake
│   ├── gpu/                    ← PyTorch 2.8 + ADMET-AI + AIMNet2 (CUDA 12.8)
│   ├── pdbqt/                  ← Python 3.11 + meeko + gemmi
│   ├── snakemake/              ← Snakemake orchestrator
│   └── postgres/               ← RDKit cartridge init SQL
│
├── pipeline/
│   ├── Snakefile               ← Main DAG entry point
│   ├── config/pipeline.yaml
│   ├── config/profiles/dgx/   ← NVIDIA DGX A100 Snakemake profile
│   └── rules/                  ← 14 .smk rule files (00_audit → 13_validate)
│
├── scripts/                    ← Thin CLI entrypoints (delegate to src/)
│   └── (16 scripts: stage entrypoints plus helpers such as patch_auto3d.py, orchestrate_gpus.py, dashboard.py)
│
├── src/
│   └── alchemia/               ← Main Python package
│       ├── admet/              ← ADMET-AI runner, chunked + checkpointed
│       ├── classification/     ← 196-class SMARTS classifier
│       ├── conformers/         ← ETKDGv3 generator + 3D QC
│       ├── descriptors/        ← 2D properties + drug-likeness
│       ├── filters/            ← PAINS, Brenk, NIH, toxicophores
│       ├── fingerprints/       ← ECFP4/6, MACCS, RDKit FP, AtomPair, Torsion
│       ├── io/                 ← Streaming SDF/MOL2/CSV readers
│       ├── minimization/       ← AIMNet2 via Auto3D + MMFF94s fallback
│       ├── pdbqt/              ← meeko PDBQT preparation
│       ├── postgres/           ← Schema, staging, loaders
│       ├── sources/            ← Per-source profiler + parser
│       ├── standardization/    ← RDKit MolStandardize + InChI dedup
│       ├── utils/              ← Logging, checksums, QC logger
│       └── viz/                ← Molecular image grid generator
│
├── tests/unit/                 ← 22 test modules, 63+ tests
│
├── utils/
│   ├── pains.json              ← 480+ PAINS SMARTS patterns
│   ├── smarts.json             ← 196-class chemical taxonomy
│   └── unwanted_substructures.csv
│
├── docs/
│   ├── decisions/              ← Architecture Decision Records
│   ├── runbooks/               ← current_state.md, next_actions.md
│   └── superpowers/            ← Implementation plans + design specs
│
└── data/                       ← Pipeline outputs (gitignored; .gitkeep only)
    ├── admet/ · classification/ · conformers/ · fingerprints/
    ├── minimization/ · pdbqt/ · properties/ · standardized/
    └── tables/ · viz/

Credits

Lead scientist: Aryel J. A. Bezerra (@HighScientist)
Lead developer: Gabriel C. Furniel (@gabriel1734)

Acknowledgments

The authors gratefully acknowledge Barretos Cancer Hospital (Hospital de Amor, Barretos, São Paulo, Brazil) for providing access to its high-performance computing infrastructure.

All 3D energy minimization stages — conformer generation, AIMNet2 neural network potential optimization, and MMFF94s fallback refinement across millions of molecules — were carried out on the institution's NVIDIA DGX A100 cluster (8× NVIDIA A100 80GB SXM4, CUDA 12.8). Without this computational infrastructure, minimizing structures at the scale of 8.7M+ compounds would not have been feasible.

We also extend our sincere gratitude to the following researchers for their scientific guidance, institutional support, and contributions to the drug discovery mission that motivates this work:

Dr. Rui Manuel Reis — for scientific leadership and vision in oncology research
Dra. Luciane Sussuchi — for contributions to molecular oncology and research coordination
Dr. Renato J. S. Oliveira — for support in computational and structural biology initiatives and preclinical drug discovery
Dra. Simone Queiroz Pantaleão — for contributions to chemistry research and drug discovery
Dr. André Luiz Pinto Santos and the Digital Health AI Laboratory (LiaaOnco) — for hardware support and access to the NVIDIA DGX A100 cluster at Barretos Cancer Hospital

We also thank the HPC support team at Barretos Cancer Hospital for technical assistance, and the open-source scientific software community — particularly the developers of RDKit, Auto3D, AIMNet2, ADMET-AI, meeko, Polars, Snakemake, and PostgreSQL — whose tools made this project possible.

References

Tools & Models

Tool	Citation
RDKit	Landrum G. RDKit: Open-source cheminformatics. rdkit.org
Auto3D	Liu Z et al. J Chem Inf Model. 2022;62:5373. doi:10.1021/acs.jcim.2c00817
AIMNet2	Anstine DM et al. ChemRxiv 2023. doi:10.26434/chemrxiv-2023-296ch
ADMET-AI	Swanson K et al. Bioinformatics 2024. doi:10.1093/bioinformatics/btae416
meeko	Forli S et al. AutoDock Meeko. github.com/forlilab/meeko
ECFP	Rogers D, Hahn M. J Chem Inf Model. 2010;50:742. doi:10.1021/ci100050t
PAINS	Baell JB, Holloway GA. J Med Chem. 2010;53:2719. doi:10.1021/jm901137j
Snakemake	Mölder F et al. F1000Research 2021. doi:10.12688/f1000research.29032.2
pgvector	Kane A. pgvector: Open-source vector similarity search for PostgreSQL. github.com/pgvector/pgvector

Databases

Database	Citation
BrNPDB	Pilon AC et al. J Chem Inf Model. 2017;57(7):1652–1657. doi:10.1021/acs.jcim.7b00083
NCI	Zaharevitz DW et al. National Cancer Institute Open Repository. cactus.nci.nih.gov
COCONUT	Sorokina M et al. J Cheminform. 2021;13:2. doi:10.1186/s13321-020-00478-9
ChEMBL	Mendez D et al. Nucleic Acids Res. 2019;47:D930. doi:10.1093/nar/gky1075
Enamine REAL	Grygorenko OO et al. iScience 2020;23(11):101681. doi:10.1016/j.isci.2020.101681
Molport	Molport SIA. Molport Compound Catalog. molport.com

