✨✨ A curated collection of resources on artificial intelligence for spectral data analysis, covering computational methods for mass spectrometry (MS), NMR, IR, and XRD data.
- 1. Mass Spectrometry (Small Molecules)
- 2. Mass Spectrometry (Peptides)
- 3. NMR Spectroscopy (Small Molecules)
- 4. IR Spectroscopy (Small Molecules)
- 5. Multimodal Spectroscopy (Small Molecules)
- 6. X-ray Diffraction (XRD) (Crystals)
Computational approaches for predicting mass spectra from molecular structures
AI methods for molecular identification and elucidation from mass spectra
| Paper Title & Link | Feasible scene | Venue | Code | Notes |
|---|---|---|---|---|
| matchms - processing and similarity evaluation of mass spectrometry data | raw mass spectra to pre- and post-processed spectra | The Journal of Open Source Software | python package | |
| MS2DeepScore: a novel deep learning similarity measure to compare tandem mass spectra | compare tandem mass spectra | Journal of Cheminformatics | ||
| Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships | NLP-inspired Model, Word2Vec | PLOS Computational Biology | Spec2Vec | |
| Chemically informed analyses of metabolomics mass spectrometry data with Qemistree | Machine Learning, Tree-based approach | Nature Chemical Biology | Qemistree | |
| MetaboAnalystR 4.0: a unified LC-MS workflow for global metabolomics | Software Tool | Nature Communications | MetaboAnalystR 4.0, LC-MS data processing, R package, Project Link: MetaboAnalyst | |
| Fully Automated Unconstrained Analysis of High-Resolution Mass Spectrometry Data with Machine Learning | Decision Trees, Neural Network, LSTM | Journal of the American Chemical Society | MEDUSA, mass spectrum analysis overall framework, Project Link: project_link | |
| The METLIN small molecule dataset for machine learning-based retention time prediction | Deep Learning | Nature Communications | project website, retention time prediction, R package | |
| An end-to-end deep learning method for mass spectrometry data analysis to reveal disease-specific metabolic profiles | Deep Learning | Nature Communications | DeepMSProfiler, disease-specific metabolic profiling | |
| Trackable and scalable LC-MS metabolomics data processing using asari | open-source software tool | Nature Communications | asari, LC-MS data processing, Project Link: https://pypi.org/project/asari-metabolomics/ | |
| Automatic Compound Annotation from Mass Spectrometry Data Using MAGMa | Ranking algorithm | Mass Spectrom (Tokyo) | Peak list annotation, MAGMa | |
| MIST-CF: Chemical Formula Inference from Tandem Mass Spectra | Spectrum Transformers | Journal of Chemical Information and Modeling | Mass Spectra Formula Inference | |
| Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra | Deep Neural Network | Nature Biotechnology | compound class annotation, CANOPUS | |
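
Several of the tools above (matchms, Spec2Vec, MS2DeepScore) score how similar two spectra are. For orientation, the classical baseline they are usually compared against is a cosine score over binned peaks. The snippet below is a minimal sketch of that baseline in plain Python, not the API of any package listed here; the function names `bin_peaks` and `cosine_similarity` and the 1 Da default bin width are illustrative assumptions.

```python
import math
from collections import defaultdict


def bin_peaks(peaks, bin_width=1.0):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine score between two peak lists after binning (0.0 if either is empty)."""
    a = bin_peaks(peaks_a, bin_width)
    b = bin_peaks(peaks_b, bin_width)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical peak lists, for illustration only.
query = [(50.1, 10.0), (77.0, 100.0), (105.0, 40.0)]
reference = [(51.0, 5.0), (77.0, 90.0), (105.0, 55.0)]
print(round(cosine_similarity(query, reference), 3))
```

Fixed-width binning ignores precursor mass shifts and mass-accuracy tolerances, which is one reason the learned or modified scores listed above exist.
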
| Name | Type | Venue | Website | Notes |
|---|---|---|---|---|
| MassSpecGym: A benchmark for the discovery and identification of molecules | Benchmark, Transformer, GNN | Advances in Neural Information Processing Systems | MassSpecGym | |
| Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking | Mass Spectrum Database | Nature Biotechnology | Database link | |
| A versatile toolkit for drug metabolism studies with GNPS2: from drug development to clinical monitoring | Mass Spectrum Database | Nature Protocols | Database link | |
| MassBank: a public repository for sharing mass spectral data for life sciences | Mass Spectrum Database | Journal of Mass Spectrometry | Database link-EU Database link-NA | |
| Artificial Intelligence in Spectroscopy: Advancing Chemistry from Prediction to Generation and Beyond | Neural architectures, ML-empowered solution | arXiv | review | |
| BMDMS-NP: A comprehensive ESI-MS/MS spectral library of natural compounds | Spectral Library | Phytochemistry | BMDMS-NP, ESI-MS/MS spectral library | |
| Machine Learning in Small-Molecule Mass Spectrometry | Sequence-based Model, Graph-based Model, Deep Learning Model, Siamese Neural Network, Natural Language Processing Model, Transformer, MLP, Random Forest, Bayesian Regularized Neural Network, XGBoost, Light Gradient Boosting Machine, CNN, SVR | Annual Review of Analytical Chemistry | Review Paper, SELFIES, Graph Neural Networks, MPNNs, SchNet, DimeNet++, ComENet, DeepMASS, MS2DeepScore, Spec2Vec, CLERMS, NEIMS, MassFormer, 3DMolMS, CFM-ID 4.0, SCARF, ICEBERG, Retip, METLIN-DLM, GNN-RT, DeepGCN-RT, RT-transformer, DNNpwa-TL, MetCCS, AllCCS, CCSBase, CCSP 2.0, DeepCCS, SigmaCCS, AllCCS2, SIRIUS, BUDDY, mass spectrometry analysis | |
| Critical Assessment of Small Molecule Identification | Competition, Benchmark | Phytochemistry Letters | CASMI 2012, 2013, 2014, 2016, 2017, 2022, project website | |
| Insights into predicting small molecule retention times in liquid chromatography using deep learning | SVM, Deep Learning, Transformer, GNN, RF, MLP, DNN, CNN, RNN | Journal of Cheminformatics | CSI-FingerID, SIRIUS 4, MSNovelist, MassGenie, Smiles-Bert, Smiles transformer, Chemformer, PredRet, GCN, RGCN, MPNN, GIN, DNNpwa-TL, CMM-RT, 1D CNN-TL, AWD-LSTM, TransformerXL, MDC-ANN, retention time_GNN, QGeGNN, RT-transformer, mt-QSRR, MultiConditionRT, HighResNPS, GNN-RT-TL, Retip, SMRT, GNN-TL, review paper | |
| Quantum chemical electron impact mass spectrum prediction for de novo structure elucidation: Assessment against experimental reference data and comparison to competitive fragmentation modeling | Quantum Chemical Model, Expert system, Spectrum Calculator | International Journal of Quantum Chemistry | QCEIMS, CFM-EI, comparison between first principle simulation and expert system | |
| FragHub: A Mass Spectral Library Data Integration Workflow | Workflow | Analytical Chemistry | FragHub, mass spectral lib integration | |
| Evaluation of the performance of a tandem mass spectral library with mass spectral data extracted from literature | Database | Drug Testing and Analysis | MSforID, mass spectrum analysis | |
| Comparative Evaluation of Electron Ionization Mass Spectral Prediction Methods | Quantum Chemistry, Machine Learning, Algorithm | | QCEIMS, CFM-EI, NEIMS comparison | |
| Computational mass spectrometry for small-molecule fragmentation | General models of fragmentation, Simulation Software, Fragmentation Prediction Software, Machine Learning, Heuristic Method, Combination of MetFrag and spectral library search, Classifier, Kernel-based method, Combinatorial Optimization Model | TrAC Trends in Analytical Chemistry | DENDRAL, Mass Frontier 4, Mass Frontier 6, ACD Fragmenter, ISIS, MetFrag, MetFusion, Varmuza feature-based classification approach, Heinonen et al. kernel-based approach, Fragmentation Trees, mass spectrometry fragmentation, review paper | |
| Searching molecular structure databases using tandem MS data: are we there yet? | Automated Method, Machine Learning | Current Opinion in Chemical Biology | CFM-ID, MetFrag, MAGMa, FingerID, CSI:FingerID, MetFrag2.2, MAGMa+, MSFINDER, MIDAS, IOKR version of CSI:FingerID, metabolite identification, review paper | |
| Unsupervised machine learning for exploratory data analysis in imaging mass spectrometry | Unsupervised machine learning | Mass Spectrometry Reviews | PCA, IMS data analysis, Project Link: BioMap, review paper | |
| Machine Learning in Small-Molecule Mass Spectrometry | GNN, Transformer, Machine learning | Annual Review of Analytical Chemistry | ||
| NIST20 | data library | | project website, purchase required | |
| NPLIB1 | dataset, benchmark | | data source, derived from CANOPUS by filtering to [M+H] spectra | |
| GeMS | dataset | | dataset collected by DreaMS, LC-MS, 714 million MS/MS spectra | |
| MSnLib: efficient generation of open multi-stage fragmentation mass spectral libraries | Spectral Library | Nature Methods | MSnLib | |
| Name & Link | Request Type | Description | Return information | API_link |
|---|---|---|---|---|
| GNPS | html | This gets you all spectra but without peaks. | Spectrum Metadata | json |
| MoNA | curl | Search for Mass Spectrum | | |
Computational methods for predicting peptide mass spectra
| Paper Title & Link | Method Type | Venue | Code | Notes |
|---|---|---|---|---|
| Full-Spectrum Prediction of Peptides Tandem Mass Spectra using Deep Neural Network | CNN | Analytical Chemistry 2020 | PredFull | |
| Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning | RNN (bidirectional GRU) | Nature Methods 2019 | Prosit | |
| High-quality MS/MS spectrum prediction for data-dependent and data-independent acquisition data analysis | LSTM | Nature Methods 2019 | DeepMass:Prism | |
AI approaches for peptide identification and quantification
| Paper Title & Link | Method Type | Venue | Code | Notes |
|---|---|---|---|---|
| AlphaPeptDeep: a modular deep learning framework to predict peptide properties for proteomics | Benchmark | Nature Communications | AlphaPeptDeep | |
| NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics | Benchmark | NeurIPS 2024 | NovoBench | |
| Paper Title & Link | Feasible scene | Venue | Code | Notes |
|---|---|---|---|---|
| MSBooster: improving peptide identification rates using deep learning-based features | Scoring matches between LC-MS/MS spectra and peptides | Nature Communications | python package | |
| Paper Title & Link | Method Type | Venue/Year | Data Source | Metric | Code / Data | Notes |
|---|---|---|---|---|---|---|
| Enhancing NMR Shielding Predictions of Atoms-in-Molecules Machine Learning Models with Neighborhood-Informed Representations | - | preprint | ^1H/^13C NMR spectra | - | — | |
| Toward a unified benchmark and framework for deep learning-based prediction of nuclear magnetic resonance chemical shifts | Transformer | Nature Computational Science 2025 | 3D structure to NMR | — | Zenodo | |
| Deep Learning Network for NMR Spectra Reconstruction in Time-Frequency Domain and Quality Assessment | Transformer + CNN hybrid | Nature Communications 2025 | 1D/2D NMR time-series | RMSE (dB) | SpectraNet · Zenodo | Time-frequency domain reconstruction; spectral quality scoring |
| GT-NMR: a novel graph transformer-based approach for accurate prediction of NMR chemical shifts | Graph + Transformer | — | — | — | GT-NMR | |
| PROSPRE: Solvent-aware ^1H NMR chemical shift prediction using deep learning | Deep Learning + Solvent-aware | Metabolites 2024 | Experimental NMR data | MAE < 0.10 ppm | — | Trained on large-scale solvent-specific datasets |
| Prediction of chemical shift in NMR: A review | Review | — | — | — | — | Survey of methods & datasets |
| iShiftML: Highly Accurate Prediction of NMR Chemical Shifts | Hybrid ML + QM | — | QM descriptors | < 0.2 ppm | — | Fast inference; QM features required |
| TransPeakNet: Solvent-Aware 2D NMR Prediction via Multi-Task Pre-Training and Unsupervised Learning | GNN + Multitask | Chem paper 2023 | Graph-based input | 2D spectra | Code · Data | 1D→2D solvent-aware prediction |
| NMR shift prediction from small data quantities | ML | — | NMRShiftDB2 | MAE (ppm) | — | Small-data learning |
| NMR-spectrum prediction for dynamic molecules | ML-Dynamics | — | Simulated ensembles | Time-avg ppm | — | Conformational averaging |
| Machine learning in NMR spectroscopy | Review (DL) | — | NMRShiftDB2 | — | — | Multitask trends & outlook |
| A framework for automated structure elucidation from routine NMR spectra | ML-based structure elucidation | Chem. Sci. 2021 | 1D ^1H/^13C NMR spectra | Top-10 accuracy | — | ML framework for automated structure elucidation |
| Deep learning enabled ultra-high quality NMR chemical shift prediction from spin echo spectra | Deep Learning + Signal Processing | Science 2024 | Spin echo NMR spectra | MAE (ppm) | — | High-resolution chemical shift prediction |
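
Most shift predictors in the table above report a mean absolute error (MAE) in ppm between predicted and reference chemical shifts. The snippet below is a minimal sketch of that metric, assuming the two lists are already aligned atom by atom; the function name and the example shift values are hypothetical and not taken from any listed paper.

```python
def shift_mae(predicted_ppm, reference_ppm):
    """Mean absolute error (ppm) between aligned predicted and reference shifts."""
    if len(predicted_ppm) != len(reference_ppm):
        raise ValueError("shift lists must have the same length")
    if not predicted_ppm:
        raise ValueError("shift lists must not be empty")
    total = sum(abs(p - r) for p, r in zip(predicted_ppm, reference_ppm))
    return total / len(predicted_ppm)


# Hypothetical 1H shifts for two protons, for illustration only.
print(shift_mae([7.26, 2.10], [7.30, 2.00]))
```

Typical 1H and 13C errors live on very different scales, so MAE values are usually reported per nucleus rather than pooled.
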
| Paper Title & Link | Method Type | Venue/Year | Input | Metric | Code / Data | Notes |
|---|---|---|---|---|---|---|
| Atomic Diffusion Models for Small Molecule Structure Elucidation from NMR Spectra | - | openreview | ^1H/^13C NMR spectra | - | — | DiT Based 3D molecular generation |
| NMRMind: A Transformer-Based Model Enabling the Elucidation from Multidimensional NMR to Structures | Transformer | Analytical Chemistry 2025 | ^1H/^13C NMR spectra | Top-k accuracy | — | |
| DIFFNMR: Diffusion Models for Nuclear Magnetic Resonance Spectra Elucidation | Diffusion model | arXiv 2025 | ^1H/^13C NMR spectra | Top-k accuracy | — | Diffusion-based generation |
| A Transformer Based Generative Chemical Language AI Model for Structural Elucidation of Organic Compounds | Transformer-based Generative AI | arXiv 2024 | IR, UV, ^1H NMR spectra | Top-15 accuracy | — | End-to-end structure elucidation via chemical language modeling |
| NMR-Solver: Automated Structure Elucidation via Large-Scale Spectral Matching and Physics-Guided Fragment Optimization | Hybrid Spectral Matching + Physics-guided Optimization | arXiv 2025 | ^1H/^13C NMR spectra | Top-1 accuracy | — | Combines spectral matching with physics-guided optimization |
| Accurate and Efficient Structure Elucidation from Routine One-Dimensional NMR Spectra Using Multitask Machine Learning | CNN + Transformer (multitask) | arXiv 2024 | 1D spectra | Top-1 / Top-k | — | High accuracy with 1D only |
| Learning the Language of NMR Structure Elucidation from NMR Spectra Using Transformer Models | Transformer (sequence modeling) | ChemRxiv 2023 | 1D spectra | — | — | Treats NMR as a language; structure reasoning |
| DeepSAT: Learning Molecular Structures from NMR Data | Multimodal DL | — | NMR spectra | Structure accuracy | — | Uses NPAtlas, NPASS, GNPS etc.; multimodal molecular structure learning |
| Bayesian approach to structural elucidation with crystalline-state SSNMR | Bayesian / probabilistic | arXiv 2019 | Solid-state NMR | Top-5 | — | Requires crystal info |
| Deep RL + GCN for NMR inverse problem | RL (MCTS) + GCN | J. Phys. Chem. Lett. 2022 | Shift table | Top-3 | — | Effective for small molecules |
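
The Top-1 / Top-k figures quoted in the table above count how often the correct structure appears among the first k ranked candidates. The snippet below is a minimal sketch of that metric under a simplifying assumption: structures are compared as exact strings, so the SMILES are assumed to be canonicalized beforehand (for example with a cheminformatics toolkit). The function name and the example data are hypothetical.

```python
def top_k_accuracy(ranked_candidates, targets, k):
    """Fraction of queries whose target is among the first k ranked candidates.

    ranked_candidates: one list of candidate strings per query, best first.
    targets: the correct structure string for each query.
    """
    if len(ranked_candidates) != len(targets) or not targets:
        raise ValueError("need one non-empty candidate list entry per target")
    hits = sum(
        1
        for candidates, target in zip(ranked_candidates, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)


# Hypothetical ranked predictions for three spectra, for illustration only.
ranked = [["CCO", "CCC"], ["c1ccccc1", "CC"], ["CN"]]
truth = ["CCC", "c1ccccc1", "CO"]
print(top_k_accuracy(ranked, truth, k=1), top_k_accuracy(ranked, truth, k=2))
```
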
| Name & Link | Type | Venue/Year | Size / Modality | Real / Sim | Download / Access |
|---|---|---|---|---|---|
| Multimodal Spectroscopic Dataset (includes NMR) | Benchmark (multi-spectra) | NeurIPS 2024 | 7.9e5+ synthetic spectra | Sim | Zenodo (via paper) |
| IR–NMR Multimodal Computational Spectra Dataset | Dataset (IR + NMR) | Nature Sci. Data 2025 | 177,461 spectra | Sim (MD + DFT + ML) | Zenodo |
| NMRShiftDB2 | Database (small molecules) | — | ~50k; ¹H/¹³C | Real | Open |
| BMRB | Database (bio-molecules) | — | >13k biomolecules; ¹H/¹³C/¹⁵N… | Real | FTP/STAR |
| SDBS | Database (multi-modal) | — | ~14k; ¹H/¹³C/IR/MS/UV | Real | Crawl/script |
| 2DNMRGym (HSQC) | Simulated 2D dataset | 2024 | 22k+ HSQC | Sim | HuggingFace/Zenodo |
| NMRMixDB | Mixtures | — | ~3k; ¹H | Real | Open |
| HMDB 5.0 | Database (metabolites) | — | 4,149; ¹H/¹³C | Real | Open-source |
| BMRB20 (metabolites subset) | Database (metabolites) | — | 1,200+; ¹H/¹³C | Real | Open-source |
| NMRShiftDB2 (updated) | Database (small molecules) | — | 53,954; ¹H/¹³C | Real | Open-source |
| SDBS23 | Database (natural products) | — | 15,218 (¹H) / 13,457 (¹³C); 900+ NPs | Real | Open-source |
| NP-MRD | Database (natural products) | — | 1,290; ¹H/¹³C | Real | Open-source |
| NAPROC-13 | Database (natural products) | — | 6,000+; ¹³C | Real | Open-source |
| CH-NMR-NP | Database (natural products) | — | 30,500; 926 compounds; ¹H/¹³C | Real | Open-source |
| Spektraris-NMR | Specialized subset | — | 466 samples; 250 taxanes; ¹H/¹³C | Real | Open-source |
| C6H6 | Database | — | 506 entries; ¹H/¹³C | Real | Open-source |
| Ilm-NMR-P31 | Element-specific database | — | 14,250 (31P spectra) / 13,730 entries | Real | Open-source |
| KnowItAll NMR | Commercial database | — | 1.28 M+ spectra; ¹H/¹³C | Real (Commercial) | Commercial (Wiley) |
| Micronmr | Commercial database | — | 1.0 M+ spectra; ¹³C | Real (Commercial) | Commercial |
| NMRBank | Open repository | — | 225,809 (¹H) / 149,135 (¹³C) | Real | Open-source |
| Paper Title & Link | Method | Venue/Year | Task | Notes |
|---|---|---|---|---|
| Statistical HOmogeneous Cluster SpectroscopY (SHOCSY) (Anal. Chem. 2014) | Statistical clustering | 2014 | ¹H metabolomics clustering | Classic baseline |
| Sparse Convex Wavelet Clustering (includes NMR signals) | Wavelet + convex clustering | arXiv | Signal clustering | Joint denoising + clustering |
| Deep representation learning for NMR spectral clustering | Autoencoder / DL | — | Spectral clustering | DL embedding → K-means/HC |
| Model / Paper | Family | Venue/Year | Link | Notes |
|---|---|---|---|---|
| ChemGPT | Large chemistry LM | — | — | Molecular generative & reasoning; spectra-to-structure transfer |
| DETANet | DL architecture | — | — | Chemistry perception; potential for spectra-conditioned tasks |
| DreaMS (MS foundation model) | Transformer (spectra FM) | Nat. Biotech. 2025 | GitHub | Cross-modal FM idea transferrable to NMR |
| Category | Name | What it offers | Typical Use | Link |
|---|---|---|---|---|
| Dataset & Pipeline | ARTINA – 100‑Protein NMR Dataset | 1329 2D–4D spectra + chemical shift assignments + protein structures | Train/test protein NMR auto‑assignment & structure inference | https://www.nature.com/articles/s41597-023-02879-5 |
| Database (LLM‑extracted) | NMRBank | ~225k small‑molecule records (SMILES, ¹H/¹³C shifts) | Build ML models for shift/spectrum prediction | https://pmc.ncbi.nlm.nih.gov/articles/PMC12118362/ |
| Synthetic multimodal | IR‑NMR Dataset | 177k IR spectra + 1.2k NMR shifts (DFT+ML) | Cross‑modal learning / pretraining | https://www.nature.com/articles/s41597-025-05729-8 |
| Carbohydrates | GlycoNMR | 2,609 glycans with 211k NMR shifts | Domain‑specific ML (carbohydrates) | https://arxiv.org/abs/2311.17134 |
| 2D spectra | 2DNMRGym | 22k+ HSQC spectra + SMILES (partly human‑annotated) | Train/benchmark 2D NMR predictors | https://arxiv.org/abs/2505.18181 |
| Tool | What it does | How to use | Required Inputs | Output | Link |
|---|---|---|---|---|---|
| MestReNova | Comprehensive NMR data analysis software for processing 1D/2D spectra, including peak picking, integration, and chemical shift assignment | Process NMR spectra, analyze data, and assign chemical shifts using MestReNova interface | NMR spectra: 1D, 2D, 3D; chemical shift (ppm), coupling constants (Hz), integration, peak assignment | Processed spectra, chemical shift assignments, peak integration | MestReNova Official Website |
| Spinach (Matlab) | Physics‑based simulation of 1D/2D spectra (COSY, HSQC, NOESY), relaxation, MAS NMR | Define spin system + select pulse sequence + run simulation in Matlab | Spin system: isotopes, chemical shifts (ppm), J couplings (Hz), CSA/dipolar terms; external field (B₀); pulse sequence | Time‑domain FID & frequency‑domain spectra (1D/2D) | https://en.wikipedia.org/wiki/Spinach_(software) |
| ORCA / NWChem | Quantum chemistry calculation of shielding tensors and J couplings | Run DFT/ab initio with NMR keyword | 3D geometry (XYZ/MOL/PDB), basis set, method, optional solvent model | Isotropic shieldings (σ), J‑couplings, CSA tensors | https://github.com/nwchemgit/nwchem |
| NMRDB | Online prediction of ¹H/¹³C (1D and 2D COSY/HSQC/HMBC) | Draw molecule or paste SMILES on web UI | Molecular structure (drawn or SMILES), optional solvent/field strength | Simulated 1D/2D spectra, JCAMP export | https://www.nmrdb.org/ |
| ChemAxon NMR Predictor | Predicts ¹H/¹³C chemical shifts and spectra (GUI + CLI) | Use MarvinSketch or cxcalc nmr CLI | Structure input (SMILES, MOL, SDF), optional solvent/field | Chemical shifts, spectra, JCAMP files | https://docs.chemaxon.com/display/docs/NMR%2BPredictor |
| NMRbox | VM platform bundling many NMR tools (TopSpin, Sparky, CCPN, etc.) | Launch VM, import spectra, run pipelines | Experimental or synthetic spectra, peak lists | Processed spectra, assignments, structural analysis | https://nmrbox.nmrhub.org/software |
Infrared spectrum prediction from molecular structures
Molecular characterization from infrared spectra
| Paper Title & Link | Venue | Method Type | Output | Data Source | Code | CKPT |
|---|---|---|---|---|---|---|
| Can LLMs Solve Molecule Puzzles? A Multimodal Benchmark for Molecular Structure Elucidation (2024) | NeurIPS 2024 | LLM (GPT‑4o, Claude‑3, Gemini, etc.) | SMILES | — | — | |
| Leveraging Infrared Spectroscopy for Automated Structure Elucidation (2024) | Commun. Chem. | DL (Transformer) | SMILES | Simulated: PubChem; Experimental: NIST | | |
| Transformer-Based Models for Predicting Molecular Structures from Infrared Spectra Using Patch-Based Self-Attention (2025) | J. Phys. Chem. A | DL (Transformer + Patch-based Self-Attention) | SMILES | Simulated: PubChem, QM9S; Experimental: NIST | | |
| Revolutionizing Spectroscopic Analysis Using Sequence-to-Sequence Models I: From Infrared Spectra to Molecular Structures (2025) | ChemRxiv | DL (GRU, LSTM, GPT, Transformer) | SELFIES | QM9, PC9 | — | — |
| Setting new benchmarks in AI‑driven infrared structure elucidation (2025) | Digit. Discov. | DL (Transformer + Patch‑based Self‑Attention) | SMILES | Simulated: IRtoMol, Multimodal Spectroscopic Dataset; Experimental: NIST | GitHub | |
| Dataset / Method Name & Link | Size | Data Source | Real / Simulated | Element Coverage |
|---|---|---|---|---|
| Chemprop-IR | 85,232 | PubChem (SMILES for molecular structures) | Simulated (GFN2-xTB) | C, H, O, N, S, P, Si, F, Cl, Br, I |
| CMPNN | 31,570 | GDB (SMILES for molecular structures) | Simulated (DFT) | C, H |
| Multimodal Spectroscopic Dataset | 794,403 | USPTO reaction dataset (SMILES for molecular structures) | Simulated (MD) | C, H, O, N, S, P, Si, B, F, Cl, Br, I |
| IRtoMol | 634,585 | PubChem (SMILES for molecular structures) | Simulated (MD + PCFF forcefield) | C, H, O, N, S, P, F, Cl, Br, I |
| MolPuzzle | 216 (Picture format) | — | Mixed | — |
| IR–NMR Multimodal Computational Spectra Dataset | 177,461 | USPTO (SMILES for molecular structures) | Simulated (MD + DFT + ML) | C, H, O, N, S, P, Si, B, F, Cl, Br, I |
| QM9S | 133,885 | QM9 (re-optimized geometries) | Simulated (DFT) | C, H, O, N, F |
| SRD 35 | 5,228 (gas-phase) | — | Real | — |
| NIST Chemistry WebBook | >16,000 | — | Real | — |
| Coblentz Society Spectral Database | >9,500 | — | Real | — |
| NWIR | ~1,000–1,500 (gas-phase) | — | Real | — |
| AIST SDBS | ~54,100 | — | Real | — |
| Tool & Link | Type | Main Functions / Uses |
|---|---|---|
| RDKit | Open-source library (Python/C++) | Molecular descriptors for IR–structure fusion, Morgan/ECFP fingerprints for IR-guided search, SMARTS substructure from IR functional-group hints, Conformers + MMFF/UFF for vibrational context, InChI/SMILES I/O to link spectra and structures |
| Psi4 | Open-source library (Python/C++) | High‑accuracy, open‑source ab initio quantum chemistry suite for molecular properties |
| GFN2-xTB | Open-source semiempirical QM (Fortran/Python) | Fast IR frequency estimation, IR peak-to-structure attribution, Conformer effects on IR bands, Batch geometry preparation, Solvent-adjusted IR shifts |
| cclib | Open-source library (Python) | Provides parsers for output files of computational chemistry packages |
| SELFIES | Open-source library (Python) | SMILES↔SELFIES conversion |
| Open Babel | Open-source chemistry toolbox (C++/Python) | Format conversion, 3D generation and conformer search (MMFF94/UFF), SMARTS substructure and fingerprints (FP2/FP3/MACCS) |
| Gaussian 16 | Commercial quantum chemistry software | DFT IR frequencies/intensities, Harmonic and anharmonic IR (VPT2), Normal‑mode peak assignment, Solvent models (PCM/SMD) for IR |
| ChemCraft | Commercial visualization software | Visualize vibrational spectra with Lorentzian/Gaussian broadening, Compare calculated vs. experimental spectra, View MOs/SCF graphs, Parse outputs from Gaussian/ORCA/GAMESS/NWChem etc. |
| Q‑Chem | Commercial quantum chemistry software | DFT IR frequencies/intensities, Harmonic and anharmonic IR (VPT2/VCI/TOSH), Normal‑mode and Raman analysis |
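
Several tools above turn computed vibrational frequencies and intensities (a "stick" spectrum) into a continuous IR curve by Lorentzian or Gaussian broadening. The snippet below is a minimal sketch of the Lorentzian case in plain Python. It is not the implementation used by any listed tool; the function name, the height-normalized line shape, and the 10 cm⁻¹ default width are assumptions chosen for illustration.

```python
def lorentzian_spectrum(sticks, grid, fwhm=10.0):
    """Broaden (wavenumber, intensity) sticks onto a wavenumber grid.

    Each stick contributes a Lorentzian whose peak height equals the stick
    intensity and whose full width at half maximum is `fwhm` (in cm^-1).
    """
    gamma = fwhm / 2.0  # half width at half maximum
    return [
        sum(
            intensity * gamma**2 / ((x - center) ** 2 + gamma**2)
            for center, intensity in sticks
        )
        for x in grid
    ]


# Hypothetical single carbonyl-like band at 1700 cm^-1, for illustration only.
curve = lorentzian_spectrum([(1700.0, 1.0)], [1695.0, 1700.0, 1705.0])
print(curve)  # [0.5, 1.0, 0.5]: half height at +/- fwhm / 2 from the center
```

Area-normalized line shapes are an equally common convention; with those, peak height rather than area changes as the width is varied.
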
Joint prediction of multiple spectral modalities from molecular structures
| Paper Title & Link | Modalities | Method Type | Venue/Year | Code | Notes |
|---|---|---|---|---|---|
| Unraveling Molecular Structure: A Multimodal Spectroscopic Dataset for Chemistry | IR + MS + ^1H/^13C/HSQC-NMR (simulated) | Dataset + baselines (CNN/XGBoost) | NeurIPS D&B 2024 | — | Provides ~790k multimodal spectra and forward-prediction baselines to pretrain joint predictors. |
| IR–NMR Multimodal Computational Spectra Dataset | IR + NMR (simulated) | Dataset (MD/DFT/ML) | Sci. Data 2025 | — | Curated paired IR–NMR set enabling multi-task forward models and cross-modal transfer. |
Multimodal integration for enhanced molecular identification
| Paper Title & Link | Modalities | Method Type | Venue/Year | Code | Notes |
|---|---|---|---|---|---|
| A Transformer-Based Generative Chemical Language AI Model for Structural Elucidation of Organic Compounds | IR + UV/Vis + ^1H NMR | Encoder–decoder Transformer (generative CASE) | J. Cheminformatics 2025 | — | End-to-end spectra→structure generation; strong single-paper reference for joint use of IR/UV/NMR. |
| Leveraging Infrared Spectroscopy for Automated Structure Elucidation | IR | Transformer | Commun. Chem. 2024 | — | Single-modality IR baseline frequently paired with NMR/MS to form multimodal pipelines. |
| Paper Title & Link | What it Brings | Method Type | Venue/Year | Code | Notes |
|---|---|---|---|---|---|
| Ming-Omni: A Unified Multimodal Model for Perception and Generation | Unified encoder/decoders; modality-specific MoE routers | MoE-UMM (image/text/audio/video) | arXiv 2025 | — | Useful design for preventing cross-modal conflicts when adding spectral modalities as new “experts”. |
| Boosting Multimodal Learning via Disentangled Gradient Learning (DGL) | Decouples encoder vs. fusion optimization to avoid gradient interference | Training framework | arXiv 2025 | — | Practical recipe when IR/NMR/MS encoders underperform once fused. |
| I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts | Interaction-specialized experts + sample-level explanations | MoE fusion | arXiv/ICML 2025 | — | Lets you see which modality interactions (e.g., IR×NMR) drive a prediction. |
| MM-Embed: Universal Multimodal Retrieval with MLLMs | Bi-encoder + MLLM re-ranker for flexible text↔image retrieval | Multimodal embeddings | ICLR 2025 | HF Card | Template for spectra/text retrieval (e.g., query by IR peaks + structural hints). |
| SpecEmbedding: Supervised Contrastive Learning Leads to More Reasonable Spectral Embeddings | MS spectral encoder with supervised contrastive learning | Transformer encoder + SupCon | Analytical Chemistry ASAP 2025 | — | High-quality MS embeddings usable as one branch in IR+NMR+MS fusion. |
| UAE: Can Understanding and Generation Truly Benefit Together — or Just Coexist? | Unified auto-encoder view (I2T ↔ T2I) with RL to align understanding & generation | Unified multimodal learning (UMM) | arXiv 2025 | — | Useful recipe for tying inverse (spectra→structure) with forward (structure→spectra) via reconstruction. |
| Name & Link | Modalities | Type | Size | Real / Simulated | Website / Code | Notes |
|---|---|---|---|---|---|---|
| Multimodal Spectroscopic Dataset (MSD) | IR + MS + ^1H/^13C/HSQC-NMR | Benchmark + Dataset | ~790k molecules | Simulated | Site · Zenodo · GitHub | Common pretraining/eval source for multimodal spectral models. |
| IR–NMR Multimodal Computational Spectra Dataset | IR + NMR | Dataset | 177,461 spectra | Simulated | Article · Zenodo | Paired IR–NMR for cross-modal learning and evaluation. |
| SDBS (AIST) | IR + MS + ^1H/^13C-NMR + Raman + ESR | Database | ~34k molecules | Real | SDBS | Broad real-world spectra across modalities for validation. |
| NIST Chemistry WebBook / SRD-35 | IR (gas-phase) + MS + UV/Vis | Database | 5,228 IR (SRD-35); 16k+ IR & 33k+ MS overall | Real | WebBook · SRD-35 | Standard references and downloadable IR libraries. |
| SimXRD-4M (related) | Powder XRD | Benchmark + Dataset | 4.06M patterns | Simulated | OpenReview | Not small-molecule spectra, but relevant template for large-scale physics-faithful simulation & benchmarking. |
Prediction of XRD patterns from crystal structures
Crystal structure determination from XRD patterns
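
For the forward direction (structure to pattern), peak positions follow from Bragg's law, λ = 2d·sin θ. The snippet below is a minimal sketch that computes 2θ for a given lattice-plane spacing and, as a simplifying assumption, covers only cubic cells for the d-spacing. The function names are illustrative, the default wavelength is Cu Kα1 (1.5406 Å), and intensities (structure factors) are not modeled.

```python
import math


def d_spacing_cubic(a, h, k, l):
    """Interplanar spacing (in the units of a) for plane (hkl) of a cubic cell."""
    if h == 0 and k == 0 and l == 0:
        raise ValueError("(000) is not a valid reflection")
    return a / math.sqrt(h * h + k * k + l * l)


def two_theta_deg(d, wavelength=1.5406):
    """Diffraction angle 2-theta in degrees from Bragg's law, or None if unreachable."""
    s = wavelength / (2.0 * d)
    if s > 1.0:
        return None  # this spacing cannot diffract at this wavelength
    return 2.0 * math.degrees(math.asin(s))


# Hypothetical cubic cell with a = 4.0 Angstrom, for illustration only.
for hkl in [(1, 0, 0), (1, 1, 0), (1, 1, 1), (2, 0, 0)]:
    print(hkl, two_theta_deg(d_spacing_cubic(4.0, *hkl)))
```
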
📄 This project is licensed under the MIT License — see the LICENSE file for details.
