This project is a cheminformatics and chemical data science workflow for adsorption-based removal of organic water contaminants (pharmaceuticals, herbicides, fungicides, dyes, etc.).
It combines:
- real experimental adsorption data (~3,700 records)
- molecular representations (RDKit descriptors, fingerprints, ChemBERTa embeddings)
- physics-informed feature engineering
- a hybrid Langmuir + machine learning model
- an interactive Streamlit app for research support
Given:
- pollutant (structure or descriptors)
- adsorbent properties (biochar composition, surface area, pore structure)
- adsorption conditions (pH, dosage, time, temperature)
Predict:
- adsorption capacity (mg/g)
and support:
- adsorbent design and condition optimization
Primary dataset:
data/raw/ec_biochar_adsorption_raw.csv
- ~3,757 experimental data points
- ~24 variables
- Includes:
- adsorption conditions
- biochar properties
- elemental composition
- experimental adsorption capacity
- heterogeneous sources
- varying experimental conditions
- partial missing data (SMILES coverage ~0.76)
This reflects real industrial chemical data, not curated benchmarks.
A key strength of this project is the use of structure-derived molecular descriptors, including:
- logP (hydrophobicity)
- TPSA (polarity)
- HBD / HBA (hydrogen bonding)
- molecular weight
- ring count
These descriptors are:
- computed from structure (SMILES or manual input)
- independent of specific datasets
- applicable to any organic pollutant
This enables the model and app to:
- evaluate new contaminants not in the dataset
- support exploratory chemical screening
- generalize beyond fixed pollutant lists
This project implements a modular workflow:
- cleaning and validation of experimental data
- handling missing and inconsistent values
- RDKit descriptors (interpretable)
- Morgan fingerprints (structural similarity)
- ChemBERTa embeddings (learned representations)
- concentration-to-dosage ratio (C/dosage)
- log transformations
- time-to-dosage interactions
This combines domain knowledge with ML-ready features.
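The engineered features listed above can be sketched in a few lines of plain Python. This is only an illustration: the key names (`initial_conc_mg_L`, `dosage_g_L`, `contact_time_min`) and their units are assumptions, not the dataset's actual column names, and the use of `log1p` for the log transforms is likewise an assumed choice.

```python
import math


def engineer_features(record):
    """Add physics-informed features to one adsorption record (a dict).

    All key names here are hypothetical placeholders.
    """
    c0 = record["initial_conc_mg_L"]        # initial concentration, mg/L (assumed unit)
    dosage = record["dosage_g_L"]           # adsorbent dosage, g/L (assumed unit)
    time_min = record["contact_time_min"]   # contact time, min (assumed unit)
    if dosage <= 0:
        raise ValueError("dosage must be positive")

    features = dict(record)
    # mg/L divided by g/L gives mg/g, the same unit as adsorption capacity.
    features["c_per_dosage"] = c0 / dosage
    # log1p keeps zero-valued inputs finite.
    features["log_c_per_dosage"] = math.log1p(c0 / dosage)
    features["log_time"] = math.log1p(time_min)
    # Time-to-dosage interaction term.
    features["time_per_dosage"] = time_min / dosage
    return features


example = engineer_features(
    {"initial_conc_mg_L": 50.0, "dosage_g_L": 0.5, "contact_time_min": 120.0}
)
print(example["c_per_dosage"], example["time_per_dosage"])
```

With these units, C/dosage is the capacity that would result if all of the pollutant were removed (by mass balance, q = (C0 - Ce)/dosage cannot exceed C0/dosage), which is one reason the ratio is a natural feature for capacity prediction.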
- Random Forest
- HistGradientBoosting
- SVR (RBF kernel)
- Kernel Ridge
- tree models → step-like, non-physical behavior
- kernel ridge → unstable curvature
- SVR → smooth and physically consistent trends
SVR was selected for hybrid modeling.
Final model integrates machine learning with adsorption physics:
$$ q = \frac{q_{max} K C}{1 + K C} $$
Where:
- ML predicts qmax (capacity limit)
- ML predicts K (adsorption affinity)
- smooth monotonic increase
- realistic saturation behavior
- avoids plateau/decrease artifacts
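A minimal sketch of how the two ML outputs could be combined through the Langmuir expression is shown here. The predictor functions are hypothetical stand-ins that return made-up constants; in the project they would be the trained regressors for qmax and K.

```python
def langmuir_capacity(q_max, k, conc):
    """Langmuir isotherm: q = q_max * K * C / (1 + K * C)."""
    if q_max < 0 or k < 0 or conc < 0:
        raise ValueError("q_max, k and conc must be non-negative")
    return q_max * k * conc / (1.0 + k * conc)


def hybrid_predict(predict_q_max, predict_k, features, conc):
    """Let the ML models supply q_max and K, then apply the isotherm."""
    q_max = predict_q_max(features)
    k = predict_k(features)
    return langmuir_capacity(q_max, k, conc)


def stub_q_max(features):
    """Hypothetical stand-in for the trained q_max regressor (mg/g)."""
    return 120.0


def stub_k(features):
    """Hypothetical stand-in for the trained K regressor (L/mg)."""
    return 0.05


concentrations = (0, 10, 20, 50, 100, 500)
curve = [hybrid_predict(stub_q_max, stub_k, {}, c) for c in concentrations]
for c, q in zip(concentrations, curve):
    print(f"C = {c:>3} mg/L -> q = {q:.1f} mg/g")
```

Because the concentration dependence is carried entirely by the isotherm, the predicted curve rises monotonically and stays below the predicted qmax regardless of how the regressors behave.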
| Model | MAE | R² |
|---|---|---|
| Hybrid (qmax only) | ~10.4 | ~0.84 |
| Hybrid (qmax + K) | ~9.9 | ~0.88 |
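For reference, the two metrics in the table can be computed as in the sketch below. The measured and predicted values are made-up toy numbers for illustration only and are unrelated to the project's results.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy capacities in mg/g (illustrative only, not project data).
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

mae = mean_absolute_error(measured, predicted)
r2 = r2_score(measured, predicted)
print(f"MAE = {mae:.2f} mg/g, R2 = {r2:.3f}")
```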
- strong agreement with experiments
- R² ≈ 0.88
- robust across wide capacity range
- capacity correlates more strongly with C/dosage than concentration alone
- RDKit PCA → physicochemical similarity
- fingerprints → structural similarity
- ChemBERTa → learned relationships
- captures physicochemical properties
- separates molecules by polarity & size
- captures structural similarity
- clusters similar compounds
- learned chemical representation
- captures deeper molecular relationships
- hybrid model preserves physical adsorption trends
- sensitivity analysis shows realistic saturation
- smooth increase
- realistic saturation
- consistent with adsorption physics
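These trends follow from the Langmuir form itself. Assuming $q_{max} > 0$ and $K > 0$, a short check gives:

$$ \frac{dq}{dC} = \frac{q_{max} K}{(1 + K C)^2} > 0, \qquad \lim_{C \to \infty} q = q_{max}, \qquad q \approx q_{max} K C \quad \text{for } K C \ll 1 $$

The positive derivative means predicted capacity never decreases with concentration, the limit gives the saturation plateau at $q_{max}$, and the low-concentration form shows the initial slope is set by the product $q_{max} K$.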
A prototype parser demonstrates how semi-structured chemical records can be converted into structured features.
This simulates real workflows where data comes from:
- reports
- industrial systems
- regulatory text
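The idea can be illustrated with a small regex-based sketch. It is not the project's parser: the record text, the field names, and the patterns are all hypothetical, and real reports would need far more robust handling of units and phrasing.

```python
import re

# Hypothetical field patterns; each captures one numeric value.
FIELD_PATTERNS = {
    "ph": re.compile(r"pH\s*[:=]?\s*(\d+(?:\.\d+)?)"),
    "dosage_g_L": re.compile(
        r"dos(?:e|age)\s*[:=]?\s*(\d+(?:\.\d+)?)\s*g/L", re.IGNORECASE
    ),
    "temperature_C": re.compile(r"(\d+(?:\.\d+)?)\s*°?\s*C\b"),
    "contact_time_min": re.compile(r"(\d+(?:\.\d+)?)\s*min\b", re.IGNORECASE),
}


def parse_record(text):
    """Extract numeric fields from a free-text record.

    Fields that cannot be found are returned as None so that
    downstream code can treat them as missing values.
    """
    parsed = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        parsed[name] = float(match.group(1)) if match else None
    return parsed


# Made-up example record.
record_text = (
    "Atrazine on biochar, pH 6.5, dosage 0.5 g/L, 25 °C, contact time 120 min."
)
parsed = parse_record(record_text)
print(parsed)
```

Returning `None` for unmatched fields keeps partially specified records usable, which matters when the source text is inconsistent.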
The application is designed as a research support tool, not just a predictor.
Users can:
- input pollutant descriptors (manual or SMILES-based)
- modify adsorbent properties (surface area, composition)
- adjust experimental conditions
They can then explore how changes in biochar properties and conditions affect adsorption capacity.
This helps researchers:
- design better adsorbents
- test hypotheses
- guide experiments
The app provides chemistry-informed guidance based on:
- hydrophobicity (logP)
- polarity (TPSA)
- aromaticity (π-π interactions)
- surface area and pore structure
- adsorption conditions
Intended as screening support, not a replacement for experiments.
Model reliability is constrained by:
- training feature ranges
- chemical space coverage
Input constraints prevent unreliable extrapolation.
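One simple way to enforce such constraints is a range check against the training data, sketched below. The feature names and bounds are placeholder values for illustration; in practice they would be derived from the minimum and maximum of each feature in the training set.

```python
# Placeholder bounds (hypothetical, not taken from the dataset).
TRAINING_RANGES = {
    "ph": (2.0, 12.0),
    "dosage_g_L": (0.05, 10.0),
    "temperature_C": (15.0, 60.0),
}


def out_of_range_features(inputs, ranges=TRAINING_RANGES):
    """Return the features that are missing or outside the training range.

    The result maps each offending feature name to (value, low, high);
    an empty dict means every checked input is within range.
    """
    problems = {}
    for name, (low, high) in ranges.items():
        value = inputs.get(name)
        if value is None or not (low <= value <= high):
            problems[name] = (value, low, high)
    return problems


ok_inputs = {"ph": 7.0, "dosage_g_L": 0.5, "temperature_C": 25.0}
bad_inputs = {"ph": 13.5, "dosage_g_L": 0.5, "temperature_C": 25.0}

print(out_of_range_features(ok_inputs))
print(out_of_range_features(bad_inputs))
```

A per-feature check like this guards only against extrapolation in individual variables; it does not detect unusual combinations of in-range values or pollutants far from the training chemical space.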
- incomplete SMILES coverage (~76%)
- heterogeneous experimental data
- Langmuir assumption may not hold for all systems
- reduced accuracy at extreme values
- full ChemBERTa integration into model
- pKa / pHpzc incorporation
- multi-component adsorption
- improved SMILES coverage (PubChem mapping)
- integration with chemical databases
This project demonstrates:
- chemical data pipeline design
- molecular representation (RDKit + embeddings)
- physics-informed feature engineering
- hybrid ML + scientific modeling
- interpretability and domain awareness




