Research code accompanying the MSc thesis project on Gaussian-process–guided optimization over LLM embeddings for molecular design (property focus: logP). The repository implements a GOLLuM-style deep-kernel Gaussian Process (GP) coupled with LLM-based molecular embeddings and benchmarks multiple sequence decoders (MLP, GRU and Vec2Text-style) within Bayesian optimization workflows.
At a glance
- Deep-kernel GP operating on LLM embeddings (T5-family; LoRA-friendly design).
- Decoders for SELFIES/SMILES reconstruction: MLP, GRU and Vec2Text-style.
- Two BO modes:
- Iterative BO using the Deep GP as the objective in latent space.
- Decoder-driven BO treating the decoder as a black-box objective.
A clean Python environment (≥ 3.10) and a CUDA-enabled GPU are recommended.
- Create and activate an environment
# Using conda (recommended)
conda create -n gollum_env python=3.10 -y
conda activate gollum_env
- Install PyTorch
- Install a CUDA-compatible build via the official selector for your GPU.
- Example (adjust the CUDA tag to your system):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
- Install core dependencies
pip install transformers peft accelerate
pip install botorch gpytorch
pip install rdkit-pypi selfies
pip install numpy pandas scikit-learn matplotlib optuna
# optional logging
pip install wandb
Data & Assumptions
- Input molecules as SMILES (converted to SELFIES internally when decoding).
- Target property: logP (Crippen, RDKit implementation).
- Tokenization and vocabulary are expected to be consistent across training/validation.
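The assumptions above can be illustrated with a minimal sketch that computes Crippen logP with RDKit and converts SMILES to SELFIES with the selfies package. The helper names crippen_logp and smiles_to_selfies are hypothetical and are not taken from this repository; its own utilities may be organized differently.

```python
from typing import Optional


def crippen_logp(smiles: str) -> Optional[float]:
    """Return the Crippen logP of a SMILES string, or None if it cannot be parsed."""
    # Imported inside the function so this sketch still loads where RDKit is absent.
    from rdkit import Chem
    from rdkit.Chem import Crippen

    mol = Chem.MolFromSmiles(smiles)  # RDKit returns None for invalid SMILES
    if mol is None:
        return None
    return float(Crippen.MolLogP(mol))


def smiles_to_selfies(smiles: str) -> str:
    """Convert a SMILES string to its SELFIES representation."""
    # Imported lazily for the same reason as above.
    import selfies as sf

    return sf.encoder(smiles)
```

For example, smiles_to_selfies("CCO") yields "[C][C][O]", and crippen_logp returns None rather than raising when RDKit cannot parse the input, which makes it easy to filter invalid decoder outputs before scoring.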
Typical Workflow
- Embed molecules with the LLM featurizer
- Use functions/utilities in model.gollum_LLM.py and util.gollum_util.py to tokenize strings, run the LLM, and pool hidden states (e.g., CLS, mean, weighted).
- Optionally apply LoRA/parameter-efficient adaptation if configured in your workflow.
- Fit the Deep-Kernel GP
- Train the GP in latent space using embeddings as inputs and logP as targets.
- Implementation scaffold in model.gollum_DeepGP.py.
- Train decoders
- SimpleMLP decoder: model.MLP_decoder.py
- GRU decoder: model.GRU_decoder.py
- Vec2Text-style decoder: Vec2Text_decoder10.py (Base + Corrector; iterative refinement)
- Utilities for SELFIES/SMILES conversions and token metrics are in util.decoder_util.py and util.util.py.
- Run Bayesian Optimization
- Approach 1 (Iterative BO; recommended): optimization.Approach1.py. Uses the Deep GP as the objective in latent space (e.g., qEI-style acquisition).
- Approach 2 (Decoder-driven BO; experimental): optimization.Approach2_2.py. Treats the decoder as the black-box objective.
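As a rough illustration of the acquisition step in Approach 1, the sketch below scores candidate latent points by analytic expected improvement (EI) for maximization and selects the highest-scoring one. It assumes the GP has already produced a posterior mean and standard deviation for each candidate. The function names and the xi margin are illustrative, and this single-point EI is a simplified stand-in for the batch qEI-style acquisition mentioned above, not the repository's implementation.

```python
import math
from typing import Sequence


def expected_improvement(mean: float, std: float, best: float, xi: float = 0.0) -> float:
    """Analytic EI for maximization under a Gaussian posterior N(mean, std^2)."""
    improvement = mean - best - xi
    if std <= 0.0:
        # Degenerate posterior: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return improvement * cdf + std * pdf


def select_next(means: Sequence[float], stds: Sequence[float], best: float) -> int:
    """Return the index of the candidate with the largest EI."""
    scores = [expected_improvement(m, s, best) for m, s in zip(means, stds)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With means [0.0, 1.0, 0.9], standard deviations [0.1, 0.1, 1.0] and an incumbent of 1.0, select_next picks the third candidate: its mean is slightly lower, but its larger uncertainty gives it the highest expected improvement.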
Reproducibility Notes
- Fix random seeds (Python, NumPy, Torch) for controlled comparisons.
- The Vec2Text-style decoder involves iterative refinement; set a consistent max iteration budget and early stopping policy for fair benchmarking.
- GPU memory requirements may increase when using LLM + GP + decoder jointly.
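The first two notes above can be sketched as a pair of small helpers: one that seeds the available random number generators, and one that runs an iterative refinement under a fixed iteration budget with a patience-based early stop. Both names (set_seed, refine_with_budget) and the step_fn / score_fn callables are hypothetical placeholders, not functions from this repository.

```python
import random
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def set_seed(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


def refine_with_budget(
    initial: T,
    step_fn: Callable[[T], T],
    score_fn: Callable[[T], float],
    max_iters: int = 10,
    patience: int = 2,
) -> Tuple[T, float, int]:
    """Refine a candidate for at most max_iters steps.

    Stops early after `patience` consecutive non-improving steps.
    Returns (best candidate, best score, iterations used).
    """
    best, best_score = initial, score_fn(initial)
    stale = 0
    used = 0
    for used in range(1, max_iters + 1):
        candidate = step_fn(best)      # propose a refinement of the current best
        score = score_fn(candidate)
        if score > best_score:
            best, best_score, stale = candidate, score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, best_score, used
```

Reporting the number of iterations used alongside the score makes it straightforward to check that all decoders were benchmarked under the same budget.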
Citing
If you use this repository, please cite the thesis and the primary methodological references:
- Thesis
- M. Soh, “GOLLuM+: Integrated GP with LLM application in molecular design,” MSc Thesis, 2025.
- Foundations
- Gómez-Bombarelli, R., et al. “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules,” ACS Central Science, 2018.
- Morris, J. X., et al. “Text Embeddings Reveal (Almost) As Much As Text,” 2023 (Vec2Text).
- Ranković, B., et al. “GOLLuM: Gaussian Process Optimized LLMs,” 2025.