AIMS-Lab-HKUSTGZ/Awesome-SpectraAI-Resources

Awesome SpectraAI Resources

✨✨ A curated collection of resources on artificial intelligence for spectral data analysis, covering computational methods for mass spectrometry (MS), NMR, IR, and XRD data.


1. Mass Spectrometry (Small Molecules)

1.1 Forward Task (Molecule → Spectrum)

Computational approaches for predicting mass spectra from molecular structures

Paper Title & Link Method Type Venue Code Notes
Inferring CID by Estimating Breakage Events and Reconstructing their Graphs GNN, Transformer Analytical Chemistry, NeurIPS2023 Star ICEBERG, SCARF
FIORA: Local neighborhood-based prediction of compound mass spectra from single fragmentation events GNN Nature Communications Star FIORA
Efficiently predicting high resolution mass spectra with graph neural networks GNN ICML2023 Star GRAFF-MS
CFM-ID 4.0: More Accurate ESI MS/MS Spectral Prediction and Compound Identification Machine Learning Model Analytical Chemistry CFM-ID 4.0, project website
Computational prediction of electron ionization mass spectra to assist in GC-MS compound identification Artificial Neural Network, Probabilistic Generative Model Analytical Chemistry CFM-EI, old version of CFM-ID, project website
Tandem mass spectrum prediction for small molecules using graph transformers Graph Transformer, Deep Learning Model Nature Machine Intelligence Star MassFormer
Rapid Prediction of Electron-Ionization Mass Spectrometry Using Neural Networks Neural Network, Graph-Convolutional Network ACS Central Science Star NEIMS
Mass Spectra Prediction with Structural Motif-based Graph Neural Networks GNN, MLP, Graph Transformer Scientific Reports MoMS-Net
Prediction of electron ionization mass spectra based on graph convolutional networks GCN, MLP International Journal of Mass Spectrometry Baojie Zhang mass spectra prediction
QCxMS and QCEIMS related publications Born-Oppenheimer Molecular Dynamics Star QCxMS, QCxMS2, QCEIMS, First Principles Calculation, project website
3DMolMS: prediction of tandem mass spectra from 3D molecular conformations 3D Molecular Network Bioinformatics Star 3DMolMS, project website
Rapid Approximate Subset-Based Spectra Prediction for Electron Ionization–Mass Spectrometry GNN Analytical Chemistry Star rassp
Towards First Principles Calculation of Electron Impact Mass Spectra of Molecules Mass spectrometry computational theory Angewandte Chemie International Edition theory introduction

1.2 Inverse Task (Spectrum → Molecule)

AI methods for molecular identification and elucidation from mass spectra

Paper Title & Link Method Type Venue Code Notes
An end-to-end deep learning framework for translating mass spectra to de-novo molecules GRU, CNN Communications Chemistry Star Spec2Mol
Searching molecular structure databases with tandem mass spectra using CSI:FingerID Machine Learning Proceedings of the National Academy of Sciences project website CSI:FingerID
Deep kernel learning improves molecular fingerprint prediction from tandem mass spectra DNN, Kernel-based Model, SVM Bioinformatics project website Deep kernel learning method
CSU-MS²: A Contrastive Learning Framework for Cross-Modal Compound Identification from MS/MS Spectra to Molecular Structures Contrastive Learning Framework, Transformer, GNN Analytical Chemistry Star CSU-MS²
MassGenie: A Transformer-Based Deep Learning Method for Identifying Small Molecules from Their Mass Spectra Transformer, Variational Autoencoder Biomolecules Star MassGenie, FragGenie, VAE-Sim
JESTR: Joint Embedding Space Technique for Ranking Candidate Molecules for the Annotation of Untargeted Metabolomics Data Joint Embedding Space Technique, GNN, MLP Bioinformatics Star JESTR
Metabolite Identification through Machine Learning — Tackling CASMI Challenge Using FingerID Machine Learning, Kernel-based approach, SVM Metabolites FingerID
Annotating metabolite mass spectra with domain-inspired chemical formula transformers Transformer, Neural Network, Contrastive Learning Model Nature Machine Intelligence Star MIST
Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS Transformer, Foundation Model Nature Biotechnology Star DreaMS, Foundation model
Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra DNN, SVMs Nature Biotechnology Star CANOPUS
Automatic Compound Annotation from Mass Spectrometry Data Using MAGMa Software, Substructure-based algorithm Mass Spectrometry project website MAGMa
DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra Transformer, Discrete Graph Diffusion Model arXiv Star DiffMS
MetFID: artificial neural network-based compound fingerprint prediction for metabolite annotation Artificial Neural Network Metabolomics MetFID
MADGEN - Mass-Spec Attends to De Novo Molecular Generation Attention-based generative model, LSTM, CNN arXiv Star MADGEN
An end-to-end deep learning framework for translating mass spectra to de-novo molecules Deep Learning architecture, Autoencoder Communications Chemistry Star Spec2Mol
DiffSpectra: Molecular Structure Elucidation from Spectra using Diffusion Models Diffusion Model, Transformer arXiv DiffSpectra
MS2Mol: A transformer model for illuminating dark chemical space from mass spectra Transformer, Generative Model MS2Mol
Improved metabolite identification with MIDAS and MAGMa through MS/MS spectral dataset-driven parameter optimization Machine Learning, Rule-based Metabolomics Star MIDAS, MAGMa, CSI:FingerID
Using Graph Neural Networks for Mass Spectrometry Prediction GNN arXiv GNN-based models
MSNovelist: de novo structure generation from mass spectra Encoder-decoder neural network, RNN Nature Methods Star MSNovelist
Predicting a Molecular Fingerprint from an Electron Ionization Mass Spectrum with Deep Neural Networks Deep Neural Network Analytical Chemistry Star DeepEI, FP Model, molecular fingerprint prediction
Deep MS/MS-Aided Structural-Similarity Scoring for Unknown Metabolite Identification Deep Neural Network Star DeepMASS, mass spectrum-based library match for Metabolite
Ultra-fast and accurate electron ionization mass spectrum matching for compound identification with million-scale in-silico library Word2vec, HNSW Nature Communications Star FastEI, spectrum simulation expansion, spectrum search for compound identification
In silico fragmentation for computer assisted identification of metabolite mass spectra Combinatorial Fragmenter BMC Bioinformatics MetFrag, mass spectrum-based metabolite identification, Project Link
MS2Query: reliable and scalable MS2 mass spectra-based analogue search embedding-based chemical similarity predictors Nature Communications Star MS2Query, mass spectra analogue search, Project Link: https://doi.org/10.5281/zenodo.6124553

1.3 General Tools

Paper Title & Link Applicable Scenario Venue Code Notes
matchms - processing and similarity evaluation of mass spectrometry data raw mass spectra to pre- and post-processed spectral data The Journal of Open Source Software Star Python package
MS2DeepScore: a novel deep learning similarity measure to compare tandem mass spectra compare tandem mass spectra Journal of Cheminformatics Star
Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships NLP-inspired Model, Word2Vec PLOS Computational Biology Star Spec2Vec
Chemically informed analyses of metabolomics mass spectrometry data with Qemistree Machine Learning, Tree-based approach Nature Chemical Biology Star Qemistree
MetaboAnalystR 4.0: a unified LC-MS workflow for global metabolomics Software Tool Nature Communications Star MetaboAnalystR 4.0, LC-MS data processing, R package, Project Link: MetaboAnalyst
Fully Automated Unconstrained Analysis of High-Resolution Mass Spectrometry Data with Machine Learning Decision Trees, Neural Network, LSTM Journal of the American Chemical Society Star MEDUSA, mass spectrum analysis overall framework, Project Link: project_link
The METLIN small molecule dataset for machine learning-based retention time prediction Deep Learning Nature Communications project website, retention time prediction, R package
An end-to-end deep learning method for mass spectrometry data analysis to reveal disease-specific metabolic profiles Deep Learning Nature Communications Star DeepMSProfiler, disease-specific metabolic profiling
Trackable and scalable LC-MS metabolomics data processing using asari open-source software tool Nature Communications Star asari, LC-MS data processing, Project Link: https://pypi.org/project/asari-metabolomics/
Automatic Compound Annotation from Mass Spectrometry Data Using MAGMa Ranking algorithm Mass Spectrom (Tokyo) Peak list annotation, MAGMa
MIST-CF: Chemical Formula Inference from Tandem Mass Spectra Spectrum Transformers Journal of Chemical Information and Modeling Star Mass Spectra Formula Inference
Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra Deep Neural Network Nature Biotechnology Star compound class annotation, CANOPUS
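
Several tools in this table score how similar two mass spectra are. As a point of reference, the sketch below computes a plain binned cosine similarity between two centroided peak lists in pure Python. It is a simplified illustration and not the implementation used by matchms, MS2DeepScore, or Spec2Vec; the function names and the 1.0 m/z bin width are assumptions chosen for the example.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    peaks: iterable of (mz, intensity) pairs.
    Returns a dict mapping bin index -> summed intensity.
    """
    binned = {}
    for mz, intensity in peaks:
        idx = int(mz // bin_width)
        binned[idx] = binned.get(idx, 0.0) + intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity between two binned spectra (0.0 if either is empty)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(value * b.get(idx, 0.0) for idx, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Illustrative peak lists, not real measurements.
    query = [(100.1, 50.0), (150.2, 100.0)]
    reference = [(100.3, 50.0), (200.0, 100.0)]
    print(round(cosine_similarity(query, reference), 3))  # one shared bin out of two
```

Fixed-width binning is the simplest choice; library implementations typically match peaks within an m/z tolerance instead, which avoids splitting near-identical peaks across a bin boundary.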

1.4 Datasets, Benchmark and Review

Name Type Venue Website Notes
MassSpecGym: A benchmark for the discovery and identification of molecules Benchmark, Transformer, GNN Advances in Neural Information Processing Systems Star MassSpecGym
Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking Mass Spectrum Database Nature Biotechnology Database link
A versatile toolkit for drug metabolism studies with GNPS2: from drug development to clinical monitoring Mass Spectrum Database Nature Protocols Database link
MassBank: a public repository for sharing mass spectral data for life sciences Mass Spectrum Database Journal of Mass Spectrometry Database link-EU Database link-NA
Artificial Intelligence in Spectroscopy: Advancing Chemistry from Prediction to Generation and Beyond Neural architectures, ML-empowered solution arXiv review
BMDMS-NP: A comprehensive ESI-MS/MS spectral library of natural compounds Spectral Library Phytochemistry Star BMDMS-NP, ESI-MS/MS spectral library
Machine Learning in Small-Molecule Mass Spectrometry Sequence-based Model, Graph-based Model, Deep Learning Model, Siamese Neural Network, Natural Language Processing Model, Transformer, MLP, Random Forest, Bayesian Regularized Neural Network, XGBoost, Light Gradient Boosting Machine, CNN, SVR Annual Review of Analytical Chemistry Review Paper, SELFIES, Graph Neural Networks, MPNNs, SchNet, DimeNet++, ComENet, DeepMASS, MS2DeepScore, Spec2Vec, CLERMS, NEIMS, MassFormer, 3DMolMS, CFM-ID 4.0, SCARF, ICEBERG, Retip, METLIN-DLM, GNN-RT, DeepGCN-RT, RT-transformer, DNNpwa-TL, MetCCS, AllCCS, CCSBase, CCSP 2.0, DeepCCS, SigmaCCS, AllCCS2, SIRIUS, BUDDY, mass spectrometry analysis
Critical Assessment of Small Molecule Identification Competition, Benchmark Phytochemistry Letters CASMI 2012, 2013, 2014, 2016, 2017, 2022, project website
Insights into predicting small molecule retention times in liquid chromatography using deep learning SVM, Deep Learning, Transformer, GNN, RF, MLP, DNN, CNN, RNN Journal of Cheminformatics Star CSI-FingerID, SIRIUS 4, MSNovelist, MassGenie, Smiles-Bert, Smiles transformer, Chemformer, PredRet, GCN, RGCN, MPNN, GIN, DNNpwa-TL, CMM-RT, 1D CNN-TL, AWD-LSTM, TransformerXL, MDC-ANN, retention time_GNN, QGeGNN, RT-transformer, mt-QSRR, MultiConditionRT, HighResNPS, GNN-RT-TL, Retip, SMRT, GNN-TL, review paper
Quantum chemical electron impact mass spectrum prediction for de novo structure elucidation: Assessment against experimental reference data and comparison to competitive fragmentation modeling Quantum Chemical Model, Expert system, Spectrum Calculator International Journal of Quantum Chemistry QCEIMS, CFM-EI, comparison between first principle simulation and expert system
FragHub: A Mass Spectral Library Data Integration Workflow Workflow Analytical Chemistry Star FragHub, mass spectral lib integration
Evaluation of the performance of a tandem mass spectral library with mass spectral data extracted from literature Database Drug Testing and Analysis MSforID, mass spectrum analysis
Comparative Evaluation of Electron Ionization Mass Spectral Prediction Methods Quantum Chemistry, Machine Learning, Algorithm QCEIMS, CFM-EI, NEIMS comparison
Computational mass spectrometry for small-molecule fragmentation General models of fragmentation, Simulation Software, Fragmentation Prediction Software, Machine Learning, Heuristic Method, Combination of MetFrag and spectral library search, Classifier, Kernel-based method, Combinatorial Optimization Model TrAC Trends in Analytical Chemistry DENDRAL, Mass Frontier 4, Mass Frontier 6, ACD Fragmenter, ISIS, MetFrag, MetFusion, Varmuza feature-based classification approach, Heinonen et al. kernel-based approach, Fragmentation Trees, mass spectrometry fragmentation, review paper
Searching molecular structure databases using tandem MS data: are we there yet? Automated Method, Machine Learning Current Opinion in Chemical Biology CFM-ID, MetFrag, MAGMa, FingerID, CSI:FingerID, MetFrag2.2, MAGMa+, MSFINDER, MIDAS, IOKR version of CSI:FingerID, metabolite identification, review paper
Unsupervised machine learning for exploratory data analysis in imaging mass spectrometry Unsupervised machine learning Mass Spectrometry Reviews PCA, IMS data analysis, Project Link: BioMap, review paper
Machine Learning in Small-Molecule Mass Spectrometry GNN, Transformer, Machine learning Annual Review of Analytical Chemistry
NIST20 data library project website, purchase required
NPLIB1 dataset, benchmark data source, derived from CANOPUS by filtering to [M+H] spectra
GeMS dataset dataset collected by DreaMS, LC-MS, 714 million MS/MS spectra
MSnLib: efficient generation of open multi-stage fragmentation mass spectral libraries Spectral Library Nature Methods MSnLib

1.5 Spectrum Databases (API Usage)

| Name & Link | Request Type | Description | Return information | API_link |
|---|---|---|---|---|
| GNPS | html | This gets you all spectra but without peaks. | Spectrum Metadata | json |
| MoNA | curl | Search for Mass Spectrum | | |
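
The GNPS entry above returns spectrum metadata as JSON. The sketch below shows one way such a payload could be parsed and grouped once downloaded. It runs offline on an inline sample: the sample records and the field names (`spectrum_id`, `Adduct`, `Precursor_MZ`) are assumptions for illustration, so check them against the actual response before relying on them.

```python
import json
from collections import Counter

# Hypothetical payload standing in for a downloaded metadata response.
# Record contents and field names are assumptions, not real library entries.
SAMPLE_RESPONSE = """
[
  {"spectrum_id": "EXAMPLE0001", "Adduct": "[M+H]+", "Precursor_MZ": "180.063"},
  {"spectrum_id": "EXAMPLE0002", "Adduct": "[M+H]+", "Precursor_MZ": "255.232"},
  {"spectrum_id": "EXAMPLE0003", "Adduct": "[M+Na]+", "Precursor_MZ": "203.053"}
]
"""


def load_records(payload):
    """Parse a JSON array of spectrum metadata records."""
    return json.loads(payload)


def count_by_adduct(records, key="Adduct"):
    """Count records per adduct; records without the key count as 'unknown'."""
    return Counter(record.get(key, "unknown") for record in records)


def filter_by_adduct(records, adduct, key="Adduct"):
    """Keep only the records annotated with the requested adduct."""
    return [record for record in records if record.get(key) == adduct]


if __name__ == "__main__":
    records = load_records(SAMPLE_RESPONSE)
    print(count_by_adduct(records))
    print([r["spectrum_id"] for r in filter_by_adduct(records, "[M+H]+")])
```

Because the metadata listing comes without peaks, a filter like this is mainly useful for deciding which spectra are worth fetching in full.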

2. Mass Spectrometry (Peptides)

2.1 Forward Task (Peptides → Spectrum)

Computational methods for predicting peptide mass spectra

| Paper Title & Link | Method Type | Venue | Code | Notes |
|---|---|---|---|---|
| Full-Spectrum Prediction of Peptides Tandem Mass Spectra using Deep Neural Network | CNN | Analytical Chemistry 2020 | Star | PredFull |
| Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning | RNN (GRU) | Nature Methods 2019 | Star | Prosit |
| High-quality MS/MS spectrum prediction for data-dependent and data-independent acquisition data analysis | LSTM | Nature Methods 2019 | Star | DeepMass:Prism |
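
Learned predictors such as those above mainly model fragment intensities; the m/z positions of the standard backbone fragments follow directly from the peptide sequence. As background, here is a minimal sketch that computes singly charged b- and y-ion m/z values from monoisotopic residue masses. It assumes unmodified residues and charge 1+, and the function name `by_ions` is a hypothetical helper, not part of any listed tool.

```python
# Monoisotopic residue masses (Da) for the 20 standard, unmodified amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)


def by_ions(peptide):
    """Return (b, y) lists of singly charged fragment m/z values.

    b[i-1] is the b_i ion (first i residues);
    y[i-1] is the y_i ion (last i residues).
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = by_ions("PEPTIDE")
    for i, (b_mz, y_mz) in enumerate(zip(b, y), start=1):
        print(f"b{i}: {b_mz:.4f}   y{i}: {y_mz:.4f}")
```

Modifications, higher charge states, and neutral losses are left out to keep the example short.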

2.2 Inverse Task (Spectrum → Peptides)

AI approaches for peptide identification and quantification

| Paper Title & Link | Method Type | Venue | Code | Notes |
|---|---|---|---|---|
| InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments | Transformer, diffusion model | Nature Machine Intelligence | Star | InstaNovo |
| De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model | Transformer | ICML 2022 | Star | Casanovo |
| Bridging the Gap between Database Search and De Novo Peptide Sequencing with SearchNovo | Transformer, Database | ICLR 2025 | | SearchNovo |
| De novo peptide sequencing by deep learning | CNN, LSTM | PNAS 2017 | Star | DeepNovo |
| Computationally Instrument-Resolution-independent De Novo Peptide Sequencing for High-Resolution Devices | CNN, LSTM | Nature Machine Intelligence 2021 | | PointNovo |
| AdaNovo: Adaptive De Novo Peptide Sequencing with Conditional Mutual Information | Transformer | NeurIPS 2024 | Star | AdaNovo |
| Accurate de novo peptide sequencing using fully convolutional neural networks | CNN | Nature Communications 2023 | Star | PepNet |
| A transformer model for de novo sequencing of data-independent acquisition mass spectrometry data | Transformer | Nature Methods 2025 | Star | Cascadia |
| Towards highly sensitive deep learning-based end-to-end database search for tandem mass spectrometry | Transformer | Nature Machine Intelligence 2025 | Star | DeepSearch, database search |
| π-PrimeNovo: an accurate and efficient nonautoregressive deep learning model for de novo peptide sequencing | Transformer | Nature Communications 2025 | Star | π-PrimeNovo |

2.3 Datasets, Benchmark and Review

| Paper Title & Link | Method Type | Venue | Code | Notes |
|---|---|---|---|---|
| AlphaPeptDeep: a modular deep learning framework to predict peptide properties for proteomics | Benchmark | Nature Communications | Star | AlphaPeptDeep |
| NovoBench: Benchmarking Deep Learning-based De Novo Peptide Sequencing Methods in Proteomics | Benchmark | NeurIPS 2024 | Star | NovoBench |

2.4 General Tools

| Paper Title & Link | Applicable Scenario | Venue | Code | Notes |
|---|---|---|---|---|
| MSBooster: improving peptide identification rates using deep learning-based features | Scoring matches between LC-MS/MS spectra and peptides | Nature Communications | Star | python package |

3. NMR Spectroscopy (Small Molecules)

3.1 Forward Task (Molecule → Spectrum)

Paper Title & Link Method Type Venue/Year Data Source Metric Code / Data Notes
Enhancing NMR Shielding Predictions of Atoms-in-Molecules Machine Learning Models with Neighborhood-Informed Representations - preprint ^1H/^13C NMR spectra -
Toward a unified benchmark and framework for deep learning-based prediction of nuclear magnetic resonance chemical shifts Transformer Nature Computational Science 2025 3D structure to NMR Zenodo
Deep Learning Network for NMR Spectra Reconstruction in Time-Frequency Domain and Quality Assessment Transformer + CNN hybrid Nature Communications 2025 1D/2D NMR time-series RMSE (dB) SpectraNet · Zenodo Time-frequency domain reconstruction; spectral quality scoring
GT-NMR: a novel graph transformer-based approach for accurate prediction of NMR chemical shifts Graph + Transformer GT-NMR ·
PROSPRE: Solvent-aware ^1H NMR chemical shift prediction using deep learning Deep Learning + Solvent-aware Metabolites 2024 Experimental NMR data MAE < 0.10 ppm Trained on large-scale solvent-specific datasets
Prediction of chemical shift in NMR: A review Review Survey of methods & datasets
iShiftML: Highly Accurate Prediction of NMR Chemical Shifts Hybrid ML + QM QM descriptors < 0.2 ppm Fast inference; QM features required
TransPeakNet: Solvent-Aware 2D NMR Prediction via Multi-Task Pre-Training and Unsupervised Learning GNN + Multitask Chem paper 2023 Graph-based input 2D spectra Code · Data 1D→2D solvent-aware prediction
NMR shift prediction from small data quantities ML NMRShiftDB2 MAE (ppm) Small-data learning
NMR-spectrum prediction for dynamic molecules ML-Dynamics Simulated ensembles Time-avg ppm Conformational averaging
Machine learning in NMR spectroscopy Review (DL) NMRShiftDB2 Multitask trends & outlook
A framework for automated structure elucidation from routine NMR spectra ML-based structure elucidation Chem. Sci. 2021 1D ^1H/^13C NMR spectra Top-10 accuracy ML framework for automated structure elucidation
Deep learning enabled ultra-high quality NMR chemical shift prediction from spin echo spectra Deep Learning + Signal Processing Science 2024 Spin echo NMR spectra MAE (ppm) High-resolution chemical shift prediction

3.2 Inverse Task (Spectrum → Molecule)

Paper Title & Link Method Type Venue/Year Input Metric Code / Data Notes
Atomic Diffusion Models for Small Molecule Structure Elucidation from NMR Spectra - openreview ^1H/^13C NMR spectra - DiT Based 3D molecular generation
NMRMind: A Transformer-Based Model Enabling the Elucidation from Multidimensional NMR to Structures Transformer Analytical Chemistry 2025 ^1H/^13C NMR spectra Top-k accuracy
DIFFNMR: Diffusion Models for Nuclear Magnetic Resonance Spectra Elucidation arXiv 2025 ^1H/^13C NMR spectra Top-k accuracy Diffusion Based Generation
A Transformer Based Generative Chemical Language AI Model for Structural Elucidation of Organic Compounds Transformer-based Generative AI arXiv 2024 IR, UV, ^1H NMR spectra Top-15 accuracy End-to-end structure elucidation via chemical language modeling
NMR-Solver: Automated Structure Elucidation via Large-Scale Spectral Matching and Physics-Guided Fragment Optimization Hybrid Spectral Matching + Physics-guided Optimization arXiv 2025 ^1H/^13C NMR spectra Top-1 accuracy Combines spectral matching with physics-guided optimization
Accurate and Efficient Structure Elucidation from Routine One-Dimensional NMR Spectra Using Multitask Machine Learning CNN + Transformer (multitask) arXiv 2024 1D spectra Top-1 / Top-k High accuracy with 1D only
Learning the Language of NMR Structure Elucidation from NMR Spectra Using Transformer Models Transformer (sequence modeling) ChemRxiv 2023 1D spectra Treats NMR as a language; structure reasoning
DeepSAT: Learning Molecular Structures from NMR Data Multimodal DL NMR spectra Structure accuracy Uses NPAtlas, NPASS, GNPS etc.; multimodal molecular structure learning
Bayesian approach to structural elucidation with crystalline-state SSNMR Bayesian / probabilistic arXiv 2019 Solid-state NMR Top-5 Requires crystal info
Deep RL + GCN for NMR inverse problem RL (MCTS) + GCN J. Phys. Chem. Lett. 2022 Shift table Top-3 Effective for small molecules
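
Most entries in this table report Top-k accuracy: the fraction of spectra for which the true structure appears among the model's k highest-ranked candidates. A minimal sketch of that metric follows. It compares structures as plain strings, which assumes predictions and targets were canonicalized the same way beforehand (for example as canonical SMILES); the function name and the toy data are illustrative only.

```python
def top_k_accuracy(ranked_candidates, targets, k):
    """Fraction of targets found within the first k ranked candidates.

    ranked_candidates: one list per spectrum, best candidate first.
    targets: the true structure for each spectrum, in the same order.
    """
    if len(ranked_candidates) != len(targets):
        raise ValueError("ranked_candidates and targets must have equal length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not targets:
        return 0.0
    hits = sum(
        1
        for candidates, target in zip(ranked_candidates, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)


if __name__ == "__main__":
    # Toy predictions for three spectra; strings stand in for canonical SMILES.
    predictions = [["CCO", "CCN", "CCC"], ["c1ccccc1", "CC"], ["O"]]
    truth = ["CCN", "CC(=O)O", "O"]
    for k in (1, 2):
        print(k, top_k_accuracy(predictions, truth, k))
```

Individual papers may define a match differently (exact structure, InChIKey, or a similarity threshold), so numbers reported under this metric are not always directly comparable.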

3.3 Dataset

Name & Link Type Venue/Year Size / Modality Real / Sim Download / Access
Multimodal Spectroscopic Dataset (includes NMR) Benchmark (multi-spectra) NeurIPS 2024 7.9e5+ synthetic spectra Sim Zenodo (via paper)
IR–NMR Multimodal Computational Spectra Dataset Dataset (IR + NMR) Nature Sci. Data 2025 177,461 spectra Sim (MD + DFT + ML) Zenodo
NMRShiftDB2 Database (small molecules) ~50k; ¹H/¹³C Real Open
BMRB Database (bio-molecules) >13k biomolecules; ¹H/¹³C/¹⁵N… Real FTP/STAR
SDBS Database (multi-modal) ~14k; ¹H/¹³C/IR/MS/UV Real Crawl/script
2DNMRGym (HSQC) Simulated 2D dataset 2024 22k+ HSQC Sim HuggingFace/Zenodo
NMRMixDB Mixtures ~3k; ¹H Real Open
HMDB 5.0 Database (metabolites) 4,149; ¹H/¹³C Real Open-source
BMRB20 (metabolites subset) Database (metabolites) 1,200+; ¹H/¹³C Real Open-source
NMRShiftDB2 (updated) Database (small molecules) 53,954; ¹H/¹³C Real Open-source
SDBS23 Database (natural products) 15,218 (¹H) / 13,457 (¹³C); 900+ NPs Real Open-source
NP-MRD Database (natural products) 1,290; ¹H/¹³C Real Open-source
NAPROC-13 Database (natural products) 6,000+; ¹³C Real Open-source
CH-NMR-NP Database (natural products) 30,500; 926 compounds; ¹H/¹³C Real Open-source
Spektraris-NMR Specialized subset 466 samples; 250 taxanes; ¹H/¹³C Real Open-source
C6H6 Database 506 entries; ¹H/¹³C Real Open-source
Ilm-NMR-P31 Element-specific database 14,250 (31P spectra) / 13,730 entries Real Open-source
KnowItAll NMR Commercial database 1.28 M+ spectra; ¹H/¹³C Real (Commercial) Commercial (Wiley)
Micronmr Commercial database 1.0 M+ spectra; ¹³C Real (Commercial) Commercial
NMRBank Open repository 225,809 (¹H) / 149,135 (¹³C) Real Open-source

3.4 Clustering & Representation Learning for NMR

Paper Title & Link Method Venue/Year Task Notes
Statistical HOmogeneous Cluster SpectroscopY (SHOCSY) (Anal. Chem. 2014) Statistical clustering 2014 ¹H metabolomics clustering Classic baseline
Sparse Convex Wavelet Clustering (includes NMR signals) Wavelet + convex clustering arXiv Signal clustering Joint denoising + clustering
Deep representation learning for NMR spectral clustering Autoencoder / DL Spectral clustering DL embedding → K-means/HC

3.5 Foundation Models & Chemistry LMs related to NMR

Model / Paper Family Venue/Year Link Notes
ChemGPT Large chemistry LM Molecular generative & reasoning; spectra-to-structure transfer
DETANet DL architecture Chemistry perception; potential for spectra-conditioned tasks
DreaMS (MS foundation model) Transformer (spectra FM) Nat. Biotech. 2025 GitHub Cross-modal FM idea transferrable to NMR

📚 Synthetic NMR Papers & Datasets

Category Name What it offers Typical Use Link
Dataset & Pipeline ARTINA – 100‑Protein NMR Dataset 1329 2D–4D spectra + chemical shift assignments + protein structures Train/test protein NMR auto‑assignment & structure inference https://www.nature.com/articles/s41597-023-02879-5
Database (LLM‑extracted) NMRBank ~225k small‑molecule records (SMILES, ¹H/¹³C shifts) Build ML models for shift/spectrum prediction https://pmc.ncbi.nlm.nih.gov/articles/PMC12118362/
Synthetic multimodal IR‑NMR Dataset 177k IR spectra + 1.2k NMR shifts (DFT+ML) Cross‑modal learning / pretraining https://www.nature.com/articles/s41597-025-05729-8
Carbohydrates GlycoNMR 2,609 glycans with 211k NMR shifts Domain‑specific ML (carbohydrates) https://arxiv.org/abs/2311.17134
2D spectra 2DNMRGym 22k+ HSQC spectra + SMILES (partly human‑annotated) Train/benchmark 2D NMR predictors https://arxiv.org/abs/2505.18181

🛠️ Tools for Synthetic NMR Generation

Tool What it does How to use Required Inputs Output Link
MestReNova Comprehensive NMR data analysis software for processing 1D/2D spectra, including peak picking, integration, and chemical shift assignment Process NMR spectra, analyze data, and assign chemical shifts using MestReNova interface NMR spectra: 1D, 2D, 3D; chemical shift (ppm), coupling constants (Hz), integration, peak assignment Processed spectra, chemical shift assignments, peak integration MestReNova Official Website
Spinach (Matlab) Physics‑based simulation of 1D/2D spectra (COSY, HSQC, NOESY), relaxation, MAS NMR Define spin system + select pulse sequence + run simulation in Matlab Spin system: isotopes, chemical shifts (ppm), J couplings (Hz), CSA/dipolar terms; external field (B₀); pulse sequence Time‑domain FID & frequency‑domain spectra (1D/2D) https://en.wikipedia.org/wiki/Spinach_(software)
ORCA / NWChem Quantum chemistry calculation of shielding tensors and J couplings Run DFT/ab initio with NMR keyword 3D geometry (XYZ/MOL/PDB), basis set, method, optional solvent model Isotropic shieldings (σ), J‑couplings, CSA tensors https://github.com/nwchemgit/nwchem
NMRDB Online prediction of ¹H/¹³C (1D and 2D COSY/HSQC/HMBC) Draw molecule or paste SMILES on web UI Molecular structure (drawn or SMILES), optional solvent/field strength Simulated 1D/2D spectra, JCAMP export https://www.nmrdb.org/
ChemAxon NMR Predictor Predicts ¹H/¹³C chemical shifts and spectra (GUI + CLI) Use MarvinSketch or cxcalc nmr CLI Structure input (SMILES, MOL, SDF), optional solvent/field Chemical shifts, spectra, JCAMP files https://docs.chemaxon.com/display/docs/NMR%2BPredictor
NMRbox VM platform bundling many NMR tools (TopSpin, Sparky, CCPN, etc.) Launch VM, import spectra, run pipelines Experimental or synthetic spectra, peak lists Processed spectra, assignments, structural analysis https://nmrbox.nmrhub.org/software

4. IR Spectroscopy (Small Molecules)

4.1 Forward Task (Molecule → Spectrum)

Infrared spectrum prediction from molecular structures

| Paper Title & Link | Venue | Method Type | Input | Data Source | Code | CKPT |
|---|---|---|---|---|---|---|
| Machine Learning Molecular Dynamics for the Simulation of Infrared Spectra (2017) | Chem. Sci. | ML (HDNNP + NN Dipole) | 3D coordinates | | | |
| A Machine Learning Protocol for Predicting Protein Infrared Spectra (2020) | J. Am. Chem. Soc. | ML (MLP) | 3D coordinates | | | |
| Predicting Infrared Spectra with Message Passing Neural Networks (2021) | J. Chem. Inf. Model. | DL (MPNN + FFNN) | Simulated: SMILES; Experimental: SMILES+Phase | Simulated: PubChem; Experimental: NIST, PNNL, AIST, Coblentz | Star | |
| Graphormer-IR: Graph Transformers Predict Experimental IR Spectra (2024) | J. Chem. Inf. Model. | DL (Graph Transformer + MLP, 1D-CNN) | SMILES+Phase | NIST, AIST, Coblentz | Star | |
| Neural Network Approach for Predicting Infrared Spectra from 3D Molecular Structure (2024) | Chem. Phys. Lett. | DL (MPNN) | 3D Molecular Structure | NIST | Star | |
| Infrared Spectra Prediction for Functional Group Region Utilizing a Machine Learning Approach with Structural Neighboring Mechanism (2024) | Anal. Chem. | ML | SMILES | CIAD | Star | -- |
| Infrared Spectra Prediction Using Attention-Based Graph Neural Networks (2024) | Digital Discovery | DL (GNN) | SMILES | NIST | Star | |
| Prediction of the Infrared Absorbance Intensities and Frequencies of Hydrocarbons: A Message Passing Neural Network Approach (2024) | J. Phys. Chem. A | DL (MPNN + FFNN) | SMILES | GDB | | |
| Unlocking the Potential of Machine Learning in Enhancing Quantum Chemical Calculations for Infrared Spectral Prediction (2025) | ACS Omega | ML (Multioutput Regressor + Random Forest Regressor) | 3D Molecular Structure | Gaussian 16 | Supporting Information | |

4.2 Inverse Task (Spectrum → Molecule)

Molecular characterization from infrared spectra

| Paper Title & Link | Venue | Method Type | Output | Data Source | Code | CKPT |
|---|---|---|---|---|---|---|
| Can LLMs Solve Molecule Puzzles? A Multimodal Benchmark for Molecular Structure Elucidation (2024) | NeurIPS2024 | LLM (GPT‑4o, Claude‑3, Gemini, etc.) | SMILES | | Star | |
| Leveraging Infrared Spectroscopy for Automated Structure Elucidation (2024) | Commun. Chem. | DL (Transformer) | SMILES | Simulated: PubChem; Experimental: NIST | Star | |
| Transformer-Based Models for Predicting Molecular Structures from Infrared Spectra Using Patch-Based Self-Attention (2025) | J. Phys. Chem. A | DL (Transformer + Patch-based Self-Attention) | SMILES | Simulated: PubChem, QM9S; Experimental: NIST | Star | |
| Revolutionizing Spectroscopic Analysis Using Sequence-to-Sequence Models I: From Infrared Spectra to Molecular Structures (2025) | ChemRxiv | DL (GRU, LSTM, GPT, Transformer) | SELFIES | QM9, PC9 | | |
| Setting new benchmarks in AI‑driven infrared structure elucidation (2025) | Digit. Discov. | DL (Transformer + Patch‑based Self‑Attention) | SMILES | Simulated: IRtoMol, Multimodal Spectroscopic Dataset; Experimental: NIST | Star | GitHub |

4.3 IR Datasets

Dataset / Method Name & Link Size Data Source Real / Simulated Element Coverage
Chemprop-IR 85,232 PubChem (SMILES for molecular structures) Simulated (GFN2-xTB) C, H, O, N, S, P, Si, F, Cl, Br, I
CMPNN 31,570 GDB (SMILES for molecular structures) Simulated (DFT) C, H
Multimodal Spectroscopic Dataset 794,403 USPTO reaction dataset (SMILES for molecular structures) Simulated (MD) C, H, O, N, S, P, Si, B, F, Cl, Br, I
IRtoMol 634,585 PubChem (SMILES for molecular structures) Simulated (MD + PCFF forcefield) C, H, O, N, S, P, F, Cl, Br, I
MolPuzzle 216 (Picture format) Mixed
IR–NMR Multimodal Computational Spectra Dataset 177,461 USPTO (SMILES for molecular structures) Simulated (MD + DFT + ML) C, H, O, N, S, P, Si, B, F, Cl, Br, I
QM9S 133,885 QM9 (re-optimized geometries) Simulated (DFT) C, H, O, N, F
SRD 35 5,228 (gas-phase) Real
NIST Chemistry WebBook >16,000 Real
Coblentz Society Spectral Database >9,500 Real
NWIR ~1,000–1,500 (gas-phase) Real
AIST SDBS ~54,100 Real

4.4 General Tools

Tool & Link Type Main Functions / Uses
RDKit Star Open-source library (Python/C++) Molecular descriptors for IR–structure fusion, Morgan/ECFP fingerprints for IR-guided search, SMARTS substructure from IR functional-group hints, Conformers + MMFF/UFF for vibrational context, InChI/SMILES I/O to link spectra and structures
Psi4 Star Open-source library (Python/C++) High‑accuracy, open‑source ab initio quantum chemistry suite for molecular properties
GFN2-xTB Star Open-source semiempirical QM (Fortran/Python) Fast IR frequency estimation, IR peak-to-structure attribution, Conformer effects on IR bands, Batch geometry preparation, Solvent-adjusted IR shifts
cclib Star Open-source library (Python) Provides parsers for output files of computational chemistry packages
SELFIES Star Open-source library (Python) SMILES↔SELFIES conversion
Open Babel Open-source chemistry toolbox (C++/Python) Format conversion, 3D generation and conformer search (MMFF94/UFF), SMARTS substructure and fingerprints (FP2/FP3/MACCS)
Gaussian 16 Commercial quantum chemistry software DFT IR frequencies/intensities, Harmonic and anharmonic IR (VPT2), Normal‑mode peak assignment, Solvent models (PCM/SMD) for IR
ChemCraft Commercial visualization software Visualize vibrational spectra with Lorentzian/Gaussian broadening, Compare calculated vs. experimental spectra, View MOs/SCF graphs, Parse outputs from Gaussian/ORCA/GAMESS/NWChem etc.
Q‑Chem Commercial quantum chemistry software DFT IR frequencies/intensities, Harmonic and anharmonic IR (VPT2/VCI/TOSH), Normal‑mode and Raman analysis
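
Several of the tools above produce harmonic frequencies and intensities as a stick spectrum, which is usually broadened before it is compared with an experimental IR spectrum. The sketch below applies Lorentzian broadening in pure Python. It is a generic illustration rather than the routine used by ChemCraft or any other listed package; the function name, the 10 cm⁻¹ default width, and the example lines are assumptions.

```python
def lorentzian_broaden(frequencies, intensities, grid, fwhm=10.0):
    """Broaden a stick spectrum onto a wavenumber grid.

    Each line contributes a Lorentzian whose peak height equals the line
    intensity and whose full width at half maximum is `fwhm` (cm^-1).
    Returns one absorbance value per grid point.
    """
    if len(frequencies) != len(intensities):
        raise ValueError("frequencies and intensities must have equal length")
    half_width = fwhm / 2.0
    spectrum = []
    for x in grid:
        total = 0.0
        for nu, intensity in zip(frequencies, intensities):
            total += intensity * half_width**2 / ((x - nu) ** 2 + half_width**2)
        spectrum.append(total)
    return spectrum


if __name__ == "__main__":
    # Illustrative stick spectrum (cm^-1, arbitrary intensity units).
    lines_cm1 = [1050.0, 1715.0, 2950.0]
    line_intensity = [0.6, 1.0, 0.4]
    grid = [400.0 + 2.0 * i for i in range(1801)]  # 400 to 4000 cm^-1
    broadened = lorentzian_broaden(lines_cm1, line_intensity, grid)
    strongest = max(range(len(grid)), key=lambda i: broadened[i])
    print(grid[strongest])  # grid point with the highest absorbance
```

A Gaussian or Voigt profile can be substituted by changing the line-shape expression; computed frequencies are also often scaled by an empirical factor before comparison, which is omitted here.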

5. Multimodal Spectroscopy (Small Molecules)

5.1 Forward Task (Molecule → Multiple Spectra)

Joint prediction of multiple spectral modalities from molecular structures

| Paper Title & Link | Modalities | Method Type | Venue/Year | Code | Notes |
|---|---|---|---|---|---|
| Unraveling Molecular Structure: A Multimodal Spectroscopic Dataset for Chemistry | IR + MS + ¹H/¹³C/HSQC-NMR (simulated) | Dataset + baselines (CNN/XGBoost) | NeurIPS D&B 2024 | Star | Provides simulated multimodal spectra for ~790k molecules, plus forward-prediction baselines for pretraining joint predictors. |
| IR–NMR Multimodal Computational Spectra Dataset | IR + NMR (simulated) | Dataset (MD/DFT/ML) | Sci. Data 2025 | | Curated paired IR–NMR set enabling multi-task forward models and cross-modal transfer. |

5.2 Inverse Task (Multiple Spectra → Molecule)

Multimodal integration for enhanced molecular identification

| Paper Title & Link | Modalities | Method Type | Venue/Year | Code | Notes |
|---|---|---|---|---|---|
| A Transformer-Based Generative Chemical Language AI Model for Structural Elucidation of Organic Compounds | IR + UV/Vis + ¹H NMR | Encoder–decoder Transformer (generative CASE) | J. Cheminformatics 2025 | | End-to-end spectra→structure generation; strong single-paper reference for joint use of IR/UV/NMR. |
| Leveraging Infrared Spectroscopy for Automated Structure Elucidation | IR | Transformer | Commun. Chem. 2024 | Star | Single-modality IR baseline frequently paired with NMR/MS to form multimodal pipelines. |

5.3 Representation Learning & Fusion (Model-Agnostic Methods used for Spectra)

| Paper Title & Link | What it Brings | Method Type | Venue/Year | Code | Notes |
|---|---|---|---|---|---|
| Ming-Omni: A Unified Multimodal Model for Perception and Generation | Unified encoder/decoders; modality-specific MoE routers | MoE-UMM (image/text/audio/video) | arXiv 2025 | Star | Useful design for preventing cross-modal conflicts when adding spectral modalities as new “experts”. |
| Boosting Multimodal Learning via Disentangled Gradient Learning (DGL) | Decouples encoder vs. fusion optimization to avoid gradient interference | Training framework | arXiv 2025 | Star | Practical recipe when IR/NMR/MS encoders underperform once fused. |
| I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts | Interaction-specialized experts + sample-level explanations | MoE fusion | arXiv/ICML 2025 | Star | Lets you see which modality interactions (e.g., IR×NMR) drive a prediction. |
| MM-Embed: Universal Multimodal Retrieval with MLLMs | Bi-encoder + MLLM re-ranker for flexible text↔image retrieval | Multimodal embeddings | ICLR 2025 | HF Card | Template for spectra/text retrieval (e.g., query by IR peaks + structural hints). |
| SpecEmbedding: Supervised Contrastive Learning Leads to More Reasonable Spectral Embeddings | MS spectral encoder with supervised contrastive learning | Transformer encoder + SupCon | Analytical Chemistry ASAP 2025 | Star · Demo | High-quality MS embeddings usable as one branch in IR+NMR+MS fusion. |
| UAE: Can Understanding and Generation Truly Benefit Together — or Just Coexist? | Unified auto-encoder view (I2T ↔ T2I) with RL to align understanding & generation | Unified multimodal learning (UMM) | arXiv 2025 | Star | Useful recipe for tying inverse (spectra→structure) with forward (structure→spectra) via reconstruction. |
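
The SpecEmbedding entry above names supervised contrastive learning (SupCon) as its training objective. As a rough illustration of that idea, the sketch below implements the generic SupCon loss on L2-normalized embeddings in plain Python, with spectra of the same compound sharing a label. It is not the SpecEmbedding implementation: the toy embeddings, labels, function names, and temperature are all assumptions made for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss, averaged over anchors that have a positive."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    per_anchor = []
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # anchors without a same-label partner contribute nothing
        # Cosine similarity of anchor i to every other sample, scaled by temperature
        logits = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                  for a in range(n) if a != i}
        max_logit = max(logits.values())  # subtract the max for numerical stability
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values()))
        per_anchor.append(-sum(logits[p] - log_denom for p in positives) / len(positives))
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0

# Hypothetical 2-D embeddings: two spectra each for compounds "A" and "B"
labels = ["A", "A", "B", "B"]
clustered = [[1.0, 0.1], [1.0, -0.1], [-1.0, 0.1], [-1.0, -0.1]]  # same label, close together
mixed = [[1.0, 0.1], [-1.0, -0.1], [1.0, -0.1], [-1.0, 0.1]]      # same label, far apart

loss_clustered = supcon_loss(clustered, labels)
loss_mixed = supcon_loss(mixed, labels)
print(loss_clustered < loss_mixed)
```

The loss is low when spectra of the same compound embed close together and high when they are scattered, which is the property that makes such embeddings usable as one branch of a fused model.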

5.4 Datasets & Benchmarks (Multimodal, Small Molecules)

| Name & Link | Modalities | Type | Size | Real / Simulated | Website / Code | Notes |
|---|---|---|---|---|---|---|
| Multimodal Spectroscopic Dataset (MSD) | IR + MS + ¹H/¹³C/HSQC-NMR | Benchmark + Dataset | ~790k molecules | Simulated | Site · Zenodo · GitHub | Common pretraining/eval source for multimodal spectral models. |
| IR–NMR Multimodal Computational Spectra Dataset | IR + NMR | Dataset | 177,461 molecules | Simulated | Article · Zenodo | Paired IR–NMR for cross-modal learning and evaluation. |
| SDBS (AIST) | IR + MS + ¹H/¹³C-NMR + Raman + ESR | Database | ~34k molecules | Real | SDBS | Broad real-world spectra across modalities for validation. |
| NIST Chemistry WebBook / SRD-35 | IR (gas-phase) + MS + UV/Vis | Database | 5,228 IR (SRD-35); 16k+ IR & 33k+ MS overall | Real | WebBook · SRD-35 | Standard references and downloadable IR libraries. |
| SimXRD-4M (related) | Powder XRD | Benchmark + Dataset | 4.06M patterns | Simulated | OpenReview | Not small-molecule spectra, but a relevant template for large-scale, physics-faithful simulation and benchmarking. |

6. X-ray Diffraction (XRD) (Crystals)

6.1 Forward Task (Crystal → Pattern)

Prediction of XRD patterns from crystal structures

6.2 Inverse Task (Pattern → Crystal)

Crystal structure determination from XRD patterns


License

📄 This project is licensed under the MIT License — see the LICENSE file for details.

