GOLLuM+ — GP–LLM Integration for Molecular Design

Research code accompanying the MSc thesis project on Gaussian-process–guided optimization over LLM embeddings for molecular design (property focus: logP). The repository implements a GOLLuM-style deep-kernel Gaussian Process (GP) coupled with LLM-based molecular embeddings and benchmarks multiple sequence decoders (MLP, GRU and Vec2Text-style) within Bayesian optimization workflows.

At a glance

Deep-kernel GP operating on LLM embeddings (T5-family; LoRA-friendly design).

Decoders for SELFIES/SMILES reconstruction: MLP, GRU and Vec2Text-style.

Two Bayesian optimization (BO) modes:

Iterative BO using the deep-kernel GP as the objective in latent space.

Decoder-driven BO treating the decoder as a black-box objective.

Installation

A clean Python environment (≥ 3.10) and a CUDA-enabled GPU are recommended.

Create and activate an environment

# Using conda (recommended)
conda create -n gollum_env python=3.10 -y
conda activate gollum_env

Install PyTorch

Install a CUDA-compatible build via the official selector for your GPU.

Example (adjust the CUDA tag to your system):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Install core dependencies

pip install transformers peft accelerate
pip install botorch gpytorch
pip install rdkit selfies
pip install numpy pandas scikit-learn matplotlib optuna
# optional logging
pip install wandb

Data & Assumptions

Input molecules as SMILES (converted to SELFIES internally when decoding).
Target property: logP (Crippen, RDKit implementation).
Tokenization and vocabulary are expected to be consistent across training/validation.

Typical Workflow

Embed molecules with the LLM featurizer
- Use functions/utilities in model.gollum_LLM.py and util.gollum_util.py to tokenize strings, run the LLM, and pool hidden states (e.g., CLS, mean, weighted).
- Optionally apply LoRA/parameter-efficient adaptation if configured in your workflow.
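
The pooling options mentioned above (CLS, mean, weighted) can be sketched as follows. This is a NumPy illustration under stated assumptions, not the code in model.gollum_LLM.py: the function name pool_hidden_states and its arguments are hypothetical, and the "cls" mode simply takes the first token, since T5 encoders have no dedicated CLS token.

```python
import numpy as np


def pool_hidden_states(hidden, mask, mode="mean", weights=None):
    """Pool token-level hidden states into one vector per sequence.

    hidden:  (batch, seq_len, dim) array of token hidden states
    mask:    (batch, seq_len) array, 1 for real tokens and 0 for padding
    mode:    "cls" (first token), "mean", or "weighted"
    weights: (batch, seq_len) per-token weights, used only for "weighted"
    """
    hidden = np.asarray(hidden, dtype=float)
    mask = np.asarray(mask, dtype=float)
    if mode == "cls":
        return hidden[:, 0, :]
    if mode == "mean":
        w = mask
    elif mode == "weighted":
        if weights is None:
            raise ValueError("weights are required for mode='weighted'")
        w = np.asarray(weights, dtype=float) * mask  # padding never contributes
    else:
        raise ValueError(f"unknown pooling mode: {mode!r}")
    total = np.clip(w.sum(axis=1, keepdims=True), 1e-9, None)  # avoid division by zero
    return (hidden * w[:, :, None]).sum(axis=1) / total
```

For a sequence with hidden states [1, 2] and [3, 4] followed by one padding token, mean pooling returns [2, 3] because the padded position is masked out.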
Fit the Deep-Kernel GP
- Train the GP in latent space using embeddings as inputs and logP as targets.
- Implementation scaffold in model.gollum_DeepGP.py.
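
To make the deep-kernel idea concrete, the sketch below computes an exact GP posterior with the kernel k(x, x') = RBF(g(x), g(x')), where g is a feature map applied to the embeddings. It is a plain NumPy stand-in, not GPyTorch and not the code in model.gollum_DeepGP.py. The random projection used for g, the toy data, and the fixed hyperparameters are all assumptions; in deep-kernel learning the feature map and kernel hyperparameters are normally trained jointly by maximizing the marginal likelihood.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0, outputscale=1.0):
    """Squared-exponential kernel between the rows of a and b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return outputscale * np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_posterior(feat_train, y_train, feat_test, noise=1e-2):
    """Exact GP posterior mean and variance with a zero prior mean.

    Inputs are features g(x), i.e. embeddings already passed through the
    feature map, so the effective kernel is RBF(g(x), g(x')).
    """
    k_tt = rbf_kernel(feat_train, feat_train) + noise * np.eye(len(feat_train))
    k_ts = rbf_kernel(feat_train, feat_test)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_ts.T @ alpha
    v = np.linalg.solve(chol, k_ts)
    var = rbf_kernel(feat_test, feat_test).diagonal() - (v**2).sum(axis=0)
    return mean, np.maximum(var, 0.0)


# Toy data: random vectors stand in for LLM embeddings, and a synthetic
# function of the features stands in for logP.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(40, 16))
projection = rng.normal(size=(16, 4)) / 4.0  # hypothetical fixed feature map
features = np.tanh(embeddings @ projection)
targets = features.sum(axis=1)

mean, var = gp_posterior(features[:30], targets[:30], features[30:])
```

The posterior variance is what the acquisition function later trades off against the posterior mean.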
Train decoders
- SimpleMLP decoder: model.MLP_decoder.py
- GRU decoder: model.GRU_decoder.py
- Vec2Text-style decoder: Vec2Text_decoder10.py (Base + Corrector; iterative refinement)
- Utilities for SELFIES/SMILES conversions and token metrics are in util.decoder_util.py and util.util.py.
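
The Base + Corrector loop of a Vec2Text-style decoder can be sketched generically as below. This is a minimal illustration, not the repository's decoder: the callables base_decode, correct, and embed are hypothetical placeholders for the base model, the corrector model, and the embedding model, and the stopping rule (squared embedding distance below tol) is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def iterative_refinement(
    target: Vector,
    base_decode: Callable[[Vector], List[str]],
    correct: Callable[[Vector, List[str], Vector], List[str]],
    embed: Callable[[List[str]], Vector],
    max_iters: int = 10,
    tol: float = 1e-6,
) -> Tuple[List[str], int]:
    """Decode `target`, then refine until the re-embedded hypothesis matches.

    Returns the final token list and the number of corrector calls used.
    """
    hypothesis = base_decode(target)        # initial guess from the base model
    for step in range(max_iters):
        hyp_embedding = embed(hypothesis)   # re-embed the current guess
        if squared_distance(target, hyp_embedding) <= tol:
            return hypothesis, step         # early stop: embeddings agree
        # The corrector sees the target, the guess, and the guess's embedding.
        hypothesis = correct(target, hypothesis, hyp_embedding)
    return hypothesis, max_iters            # iteration budget exhausted
```

Returning the number of corrector calls makes it straightforward to report how much of the iteration budget each decoded molecule consumed.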
Run Bayesian Optimization
- Approach 1 (iterative BO; recommended): optimization.Approach1.py. Uses the deep-kernel GP as the objective in latent space (e.g., qEI-style acquisition).
- Approach 2 (decoder-driven BO; experimental): optimization.Approach2_2.py. Treats the decoder as the black-box objective.
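
As a simplified illustration of the acquisition step, the sketch below uses the analytic single-point expected improvement (EI) for maximization. It is not the Monte Carlo batch qEI that the text refers to, and it is not the code in optimization.Approach1.py; the function names and the candidate format (a list of posterior mean and standard deviation pairs) are assumptions.

```python
import math


def expected_improvement(mean: float, std: float, best: float, xi: float = 0.0) -> float:
    """Analytic EI for maximization under a Gaussian posterior N(mean, std^2)."""
    if std <= 0.0:
        return max(mean - best - xi, 0.0)  # no uncertainty: plain improvement
    z = (mean - best - xi) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best - xi) * cdf + std * pdf


def select_candidate(posterior, best):
    """Return the index of the (mean, std) pair with the highest EI."""
    scores = [expected_improvement(m, s, best) for m, s in posterior]
    return max(range(len(scores)), key=scores.__getitem__)
```

A candidate with a slightly lower predicted mean but a large predictive standard deviation can score higher than a confident candidate at the incumbent value, which is how EI balances exploration against exploitation.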

Reproducibility Notes

Fix random seeds (Python, NumPy, Torch) for controlled comparisons.
The Vec2Text-style decoder involves iterative refinement; set a consistent max iteration budget and early stopping policy for fair benchmarking.
GPU memory requirements increase when the LLM, GP, and decoder are loaded jointly.
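
A seed helper along the following lines is one way to fix the Python, NumPy, and Torch seeds. The name set_seed is hypothetical and not taken from this repository; NumPy and PyTorch are imported lazily so the helper also runs where they are not installed.

```python
import random


def set_seed(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:  # NumPy is not installed in this environment
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # explicit CUDA seeding; ignored without CUDA
    except ImportError:  # PyTorch is not installed in this environment
        pass
```

Calling set_seed once at the start of each run, with the seed recorded alongside the results, makes comparisons between decoders repeatable. Fixed seeds alone do not guarantee bitwise-identical results on GPU, because some CUDA kernels are nondeterministic.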

Citing

If you use this repository, please cite the thesis and the primary methodological references:

Thesis
- M. Soh, “GOLLuM+: Integrated GP with LLM application in molecular design,” MSc Thesis, 2025.
Foundations
- Gómez-Bombarelli, R., et al. “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules,” ACS Central Science, 2018.
- Morris, J. X., et al. “Text Embeddings Reveal (Almost) As Much As Text,” 2023 (Vec2Text).
- Ranković, B., et al. “GOLLuM: Gaussian Process Optimized LLMs,” 2025.
