Biomarker Glycan Structure Terms (bGST) Workflow

LLM-powered pipeline for extracting & normalizing glycan structure terminology
Explore the docs »

View Demo · Report Bug · Contact Us

Table of Contents

About this project
Getting started
Usage
Data model
License
Acknowledgements

About This Project

Biomarker Glycan Structure Terms (bGST) is a controlled vocabulary of glycan structure terms extracted from literature and databases. It captures textual representations of glycans and glycan-related structural features, including full structures, motifs, epitopes, and substructures.

Because glycan structures are described inconsistently across sources, this project uses an LLM-assisted retrieval and entity resolution workflow to map terms to existing Glycan Structure Dictionary (GSD) entries or register new ones when needed. This helps unify heterogeneous glycan terminology into a normalized, de-duplicated reference knowledgebase.

Previous Work:

Vora J, Navelkar R, Vijay-Shanker K, Edwards N, Martinez K, Ding X, Wang T, Su P, Ross K, Lisacek F, Hayes C, Kahsay R, Ranzinger R, Tiemeyer M, Mazumder R. The Glycan Structure Dictionary-a dictionary describing commonly used glycan structure terms. Glycobiology. 2023 Jun 3;33(5):354-357. doi: 10.1093/glycob/cwad014. PMID: 36799723; PMCID: PMC10243773.

Local LLM inference
Run the pipeline entirely locally via Ollama, with configurable model selection and hardware setup.

Structured term normalization
Extract, normalize, and align glycan terminology through a state-driven workflow orchestrated by LangGraph.

Vector search + embeddings
Build and query vector stores (Chroma) using embedded representations for similarity lookup.
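
As a rough illustration of what similarity lookup over embedded terms means, the sketch below ranks stored vectors by cosine similarity to a query vector. It is plain Python with toy 3-dimensional vectors; the function names and the data are hypothetical, and this is not the project's Chroma code, where the vector store performs the ranking internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k labels in `store` most similar to `query_vec`."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in store.items()]
    scored.sort(reverse=True)
    return [label for _, label in scored[:k]]

# Toy embeddings (made up for illustration, not real model output).
toy_store = {
    "sialyl Lewis x": [0.9, 0.1, 0.0],
    "Lewis x": [0.8, 0.3, 0.1],
    "asialo-GM1": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.1, 0.0], toy_store, k=2))
```

With real embeddings the vectors have hundreds of dimensions, but the ranking idea is the same.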

back to top ▲

Getting Started

Follow these steps to get a local copy up and running.

Prerequisites

Ollama is a local LLM inference runtime and model management layer that lets you pull and serve foundation models on-device. It abstracts backend details such as model packaging and request orchestration so developers can run local models with minimal configuration across different setups.

Install Ollama from https://ollama.com/download, or alternatively:
```
curl -fsSL https://ollama.com/install.sh | sh
```
- Ollama version >=v0.15.0 is recommended
HPC Users Only:
- On servers that run on environment modules (Lmod), use the following to view pre-installed modules:
```
module avail
```
- To display default version of Ollama:
```
module -d avail ollama
```

Installation

Clone this repo:

git clone https://github.com/glygener/glycan-structure-dictionary.git
cd glycan-structure-dictionary

Pull the required Ollama models:

A thinking model and an embedding model are required. If you choose to use other models, remember to update the model names in configs/models.yaml. This pipeline was developed against a locally hosted Ollama server, where GPU acceleration is all but required. Alternatively, Ollama offers cloud models with limited free usage. To access cloud models and obtain an Ollama API key, refer to the Ollama documentation.

Start your local Ollama service in a separate terminal window (you can close this window after verifying the downloads):

Non-HPC users:
```
ollama serve
```
HPC Users Only:
- Load the Ollama module with module load ollama each time you open a new terminal window:
```
module load ollama
ollama serve
```
Back in your main terminal window, download your reasoning model and your embedding model (more models):
```
ollama pull gpt-oss:20b
ollama pull mxbai-embed-large:335m
```
Verify the downloads:
```
ollama list
```
```
# NAME                         ID              SIZE      MODIFIED
# mxbai-embed-large:335m       468836162de7    669 MB    7 weeks ago
# gpt-oss:20b                  17052f91a42e    13 GB     7 weeks ago
```
(You may now close the terminal window that runs the Ollama server)
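
If you would rather check the downloads from a script, the listing printed by `ollama list` can be parsed for the model names. The sketch below assumes the column layout shown above (a header row starting with NAME, model name in the first column); the helper names are hypothetical and not part of this repository.

```python
def installed_models(listing):
    """Extract model names (first column) from `ollama list` output."""
    names = set()
    for line in listing.splitlines():
        fields = line.split()
        if not fields or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        names.add(fields[0])
    return names

def missing_models(listing, required):
    """Return the required model names that do not appear in the listing."""
    return sorted(set(required) - installed_models(listing))

sample_listing = """NAME                         ID              SIZE      MODIFIED
mxbai-embed-large:335m       468836162de7    669 MB    7 weeks ago
gpt-oss:20b                  17052f91a42e    13 GB     7 weeks ago
"""
print(missing_models(sample_listing, ["gpt-oss:20b", "mxbai-embed-large:335m"]))
```

In practice the listing text could come from `subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout` while the server is running.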
Install Python dependencies:

(Optional) create a virtual environment with Python 3.12:
```
python3.12 -m venv .venv
source .venv/bin/activate
```
Install packages:
```
python -m pip install -r requirements.txt
```
Start Ollama server:

For Non-HPC users:

Every Python script that uses an LLM requires a running Ollama server. You can use these scripts to start, stop, or check the status of a server:
```
python scripts/ollama/start_server.py
python scripts/ollama/stop_server.py
python scripts/ollama/status_server.py
```
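
To confirm from Python that a server is reachable before launching a long job, a simple HTTP probe is usually enough. This is a minimal sketch that assumes the server listens on Ollama's default address, http://localhost:11434; the helper scripts above may be configured differently, and `ollama_is_up` is a hypothetical name.

```python
import urllib.request

def ollama_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and urllib's URLError.
        return False

if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_up())
```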
For HPC (Slurm) users only:

The Ollama server is managed by the shell script ./main_slurm.sh, which serves as a template with resource presets. To run a Python LLM script through Slurm, submit main_slurm.sh with the target script path as its argument:
```
sbatch main_slurm.sh <SCRIPT.PY_PATH>
```
Example:
```
sbatch main_slurm.sh src/gsd/part1_textbook/01_ingest.py
```
On successful job submission, you will find the logs at logs/slurm-<job-id>_output.txt and logs/slurm-<job-id>_error.txt.
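
When submitting jobs from a script, the job ID that `sbatch` prints ("Submitted batch job <id>") can be turned into the two log paths directly. The sketch below assumes the log naming pattern described above; the function names are hypothetical.

```python
import re

def parse_job_id(sbatch_stdout):
    """Extract the job id from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_stdout!r}")
    return match.group(1)

def log_paths(job_id, log_dir="logs"):
    """Build the (stdout, stderr) log paths for a given Slurm job id."""
    return (
        f"{log_dir}/slurm-{job_id}_output.txt",
        f"{log_dir}/slurm-{job_id}_error.txt",
    )

job_id = parse_job_id("Submitted batch job 123456\n")
print(log_paths(job_id))
```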

More on basic Slurm commands

back to top ▲

Usage

Workflow

Part 1: Term extraction from EoG and relations mapping

Creating ChromaDB from Essentials of Glycobiology (EoG) documents

unzip data/inputs/eog/raw_chapters/unzip_me_before_running_01_ingest.py.zip -d data/inputs/eog/raw_chapters/

python src/gsd/part1_textbook/01_ingest.py
# Or for HPC users here and thereafter:
sbatch main_slurm.sh src/.../TargetScript.py

Extract terms from EoG documents (from vectorstore)
```
python src/gsd/part1_textbook/02_extract.py
```

HPC users: The default time limit in main_slurm.sh is 24 hours. Override it at submission time if needed:
sbatch --time=7-00:00:00 main_slurm.sh src/gsd/part1_textbook/02_extract.py

Varki A, Cummings RD, Esko JD, et al., editors. Essentials of Glycobiology [Internet]. 4th edition. Cold Spring Harbor (NY): Cold Spring Harbor Laboratory Press; 2022. Available from: https://www.ncbi.nlm.nih.gov/books/NBK579918/ doi: 10.1101/9781621824213

Part 2: Incorporating heterogeneous data sources and building a deduplicated master list of terms

This part builds a master dictionary of glycan structure terms by:

Ingesting heterogeneous source term sets (Essentials of Glycobiology, legacy GSD v0, curated publications, composition lists, curator-supplied sets, etc.).
Normalizing and formatting raw term JSONL inputs into a canonical intermediate structure.
Creating a semantic vector store (Chroma + OpenAI embeddings) for retrieval-augmented AI mapping.
Running AI-assisted mapping agents to (a) map synonyms to existing concepts or (b) propose creation of new canonical terms.
Reconciling AI action logs into term-to-UUID mappings.
Post-processing: merging multiple sources into consolidated node (master_nodes.json) and edge (master_edges.json) registries with quality checks and backups.
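
The reconciliation step in the list above can be pictured as folding a log of mapping decisions into a term-to-UUID table. The sketch below is only an illustration: the action record fields (`term`, `action`, `target_uuid`), the label normalization, and the UUID namespace are all assumptions, not the schema or ID scheme used by the scripts in this repository.

```python
import uuid

# Hypothetical namespace for minting new IDs (assumption for this example only).
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/gsd")

def normalize(label):
    """Lowercase a label and collapse runs of whitespace."""
    return " ".join(label.lower().split())

def reconcile(actions):
    """Fold action records into a {normalized term: UUID string} mapping."""
    mapping = {}
    for record in actions:
        key = normalize(record["term"])
        if record["action"] == "map":
            # Synonym of an existing concept: reuse that concept's UUID.
            mapping[key] = record["target_uuid"]
        elif record["action"] == "create":
            # New canonical term: mint a deterministic UUID from the label.
            mapping[key] = "GSD:" + str(uuid.uuid5(EXAMPLE_NAMESPACE, key))
        else:
            raise ValueError(f"unknown action: {record['action']!r}")
    return mapping

example_actions = [
    {"term": "Sialyl Lewis X", "action": "map",
     "target_uuid": "GSD:32e928fb-1550-5e0a-945f-2218ac79b83c"},
    {"term": "example new term", "action": "create"},
]
for term, term_uuid in reconcile(example_actions).items():
    print(term, "->", term_uuid)
```

Deriving new IDs deterministically from the normalized label means re-running the reconciliation does not mint a second ID for the same term.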

Build embeddings

python src/gsd/part2_enrichment/1_ai-assisted_term_matching/01_create_vectordb.py

Run AI mapping for a source

python src/gsd/part2_enrichment/1_ai-assisted_term_matching/02a_ai_mapping_gsdv0.py

Reconcile mapping decisions

python src/gsd/part2_enrichment/1_ai-assisted_term_matching/02b_match_gsdv0_ai_mapping_with_uuid.py

(Repeat analogous steps for pubdictionaries)

python src/gsd/part2_enrichment/1_ai-assisted_term_matching/03a_ai_mapping_pubdictionaries.py
python src/gsd/part2_enrichment/1_ai-assisted_term_matching/03b_match_pubdict_ai_mapping_with_uuid.py

Merge into master dictionaries

python src/gsd/part2_enrichment/2_generate_mappings/postprocessing.py

Note

An OpenAI API key enables the application to access LLM services. Where to obtain an API key?

Project Structure

.
├── README.md
├── configs                       # YAML-based configuration for models, paths, and tooling
│   ├── base.yaml
│   ├── chroma.yaml               # Persist directories + retriever params
│   ├── models.yaml               # LLM labels + params
│   ├── ollama.yaml               # Ollama configs
│   ├── paths.yaml
│   ├── schemas                   # JSON/schema definitions for bGST data model
│   └── prompts                   # Collection of system prompts in markdown format
├── data
│   ├── inputs                    # Raw/normalized source data for the pipelines
│   │   ├── _resource_template    # Folder template for integrating new resources
│   │   │   ├── metadata
│   │   │   ├── normalized
│   │   │   └── raw
│   │   └── ...                   # Source data + merging audit records, grouped by folders
│   ├── outputs                   # Mapped terms (current/previous) + vectorstore snapshots (previous)
│   │   └── releases
│   └── workspace                 # Vectorstores of current release
│       └── chroma
├── docs                          # Supplementary documentation + notes
├── requirements.txt
├── scripts
│   └── ollama                    # Ollama server helpers (env var + pid management)
│       ├── start_server.py
│       ├── status_server.py
│       └── stop_server.py
├── src                           # Python library code for the GSD pipeline
│   └── gsd
│       ├── __init__.py
│       ├── adapters              # Higher level adapter tools
│       ├── part1_textbook        # EoG term extraction pipeline
│       ├── part2_enrichment      # GSD resource enrichment pipeline
│       ├── cli.py
│       ├── config.py             # Config loaders
│       ├── models.py
│       └── utils.py
└── tests                         # Unit tests

LLM Workflows

| Workflow | Description | Directory |
| --- | --- | --- |
| GST Extraction | Extracts and classifies GST from a preprocessed text document and creates sentence-level citations as supporting evidence. Identifies GST entity pairs (i.e. `has_abbr`, `has_formula`). The example parses Essentials of Glycobiology 4e as a Chroma document. | `src/gsd/part1_textbook/02_extract/` |
| RAG For Term Generation | Starts with deduplicated glycan structure terms. Retrieves the top-k document chunks from Essentials of Glycobiology 4e and synthesizes a term summary covering `definition`, `cellular component`, `molecular function`, and `biological process`. | `src/gsd/part1_textbook/04_annotate` |
| bGST Enrichment With New Datasets | Starts with a seed GST vectorstore (persist_directory = `src/data/workspace/chroma/gsd/`). Parses query GST entities one at a time: each query is searched against existing term entries in the vectorstore, and the workflow decides whether to (i) link the query to an existing entity or (ii) register a new entity. The vector store is updated dynamically during the iteration, and a list of AI term-linking audits is generated for human review before incorporation into the production GST datasets. | `src/gsd/part2_enrichment/02_link/` |
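
The link-or-register loop of the enrichment workflow can be sketched without any LLM or vector store. In the example below, `difflib.SequenceMatcher` string similarity and a fixed threshold stand in for both the embedding search and the model's decision, so this shows the control flow only, not the project's matching logic. All names and the threshold value are hypothetical.

```python
from difflib import SequenceMatcher

def best_match(query, entries):
    """Return (label, score) of the closest existing entry, or (None, 0.0)."""
    best_label, best_score = None, 0.0
    for label in entries:
        score = SequenceMatcher(None, query.lower(), label.lower()).ratio()
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

def link_or_register(queries, entries, threshold=0.9):
    """Process queries one at a time, growing `entries` and logging audits."""
    audits = []
    for query in queries:
        label, score = best_match(query, entries)
        if label is not None and score >= threshold:
            audits.append({"query": query, "action": "link", "target": label})
        else:
            entries.append(query)  # later queries can now match this term
            audits.append({"query": query, "action": "register", "target": None})
    return audits

entries = ["sialyl Lewis x"]
audits = link_or_register(["Sialyl Lewis X", "asialo-GM1", "asialo-GM1"], entries)
for audit in audits:
    print(audit)
```

Because the store grows inside the loop, the second occurrence of a new term links to the entry registered on its first occurrence, which is what keeps the result de-duplicated.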

back to top ▲

Data

Data Source

| Resource | URL | Entities | Notes |
| --- | --- | --- | --- |
| GlycoMotif | https://glycomotif.glyomics.org/ | 701 | Secondary: Glydin, UniCarbKB, GlyTouCan, CCRC, GlyGen |
| Glydin | https://glycoproteome.expasy.org/epitopes/ | | Secondary: SugarbindDB, GlycoEpitope, Cummings, BioOligo-DB |
| SugarbindDB | https://sugarbind.expasy.org/ | 204 | |
| GlycoEpitope | https://www.glycoepitope.jp/ | 173 | Also available at https://glycosmos.org/glycoepitope |
| Cummings | https://pubmed.ncbi.nlm.nih.gov/19756298/ | | |
| BioOligo-DB | https://glyco3d.cermav.cnrs.fr/search.php?type=bioligo | | |
| Monosac-DB | https://glycopedia.eu/resources/presentation/ | | |
| UniLectin3D | https://unilectin.unige.ch/unilectin3D/ | | |
| GlycoMaple | https://glycosmos.org/glycomaple/Human | | |

Data Model

Describe the core data model(s) used by this project, including how glycan structure terms are represented, stored, and linked to external resources.

Primary storage: (e.g., JSONL, SQLite)
Key entities:

Each source terms file (*terms.jsonl) after formatting should produce lines like:

{
    "lbl": "sialyl Lewis x",
    "term_uuid": "GSD:32e928fb-1550-5e0a-945f-2218ac79b83c",
    "gtc_id": [
      "G00054MO"
    ],
    "sources": [
      {
        "src_lbl": "sialyl Lewis x",
        "src": "SRC:EOG_VARKI_4E",
        "src_uuid": "SRC:66cc8ff8-5b05-4882-8c47-8ab4f036bed3"
      },
      {
        "src_lbl": "sialyl Lewis x",
        "src": "SRC:GSD_GLYGEN_V0",
        "src_uuid": "SRC:0e4ec742-01a0-4d61-b1fb-655f380ac009"
      },
      {
        "src_lbl": "sialyl Lewis x",
        "src": "SRC:PUBDICTIONARIES-GLYCAN-IMAGE",
        "src_uuid": "SRC:5c02589c-9c5e-489f-8863-e0bd2618d901"
      }
    ],
    "gsd_id": "GSD000151"
  },
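
A terms file in this shape can be loaded and indexed by `term_uuid` with the standard library alone. The sketch below assumes one JSON object per line (the JSONL convention) and treats `lbl`, `term_uuid`, and `sources` as required keys; that choice is an assumption based on the example record, not a published schema.

```python
import json

# Assumed required keys, inferred from the example record above.
REQUIRED_KEYS = {"lbl", "term_uuid", "sources"}

def load_terms(lines):
    """Parse JSONL lines into a {term_uuid: record} index."""
    index = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        index[record["term_uuid"]] = record
    return index

sample_record = {
    "lbl": "sialyl Lewis x",
    "term_uuid": "GSD:32e928fb-1550-5e0a-945f-2218ac79b83c",
    "gtc_id": ["G00054MO"],
    "sources": [{"src_lbl": "sialyl Lewis x", "src": "SRC:EOG_VARKI_4E",
                 "src_uuid": "SRC:66cc8ff8-5b05-4882-8c47-8ab4f036bed3"}],
    "gsd_id": "GSD000151",
}
terms = load_terms([json.dumps(sample_record)])
print(terms["GSD:32e928fb-1550-5e0a-945f-2218ac79b83c"]["gsd_id"])  # GSD000151
```

An open file handle can be passed in place of the list, since iterating over a text file yields its lines.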

Edges (*edges.jsonl) follow:

{
    "subj": "GSD:a7868da4-a6c2-4825-97b9-c86700b1c213",
    "pred": "is_a_related_synonym_of",
    "obj": "GSD:8ce1f4e6-8cbe-5167-8ece-a1cfc850d3a5",
    "comment": "GA1 is a related synonym of asialo-GM1"
  },
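
Edges in this subject/predicate/object form are easy to sanity-check against the node set. The sketch below groups synonym edges by their target and reports edge endpoints that do not correspond to any known node; the helper names are hypothetical, and the repository's own quality checks may differ.

```python
from collections import defaultdict

def synonyms_by_target(edges, pred="is_a_related_synonym_of"):
    """Group subject UUIDs by the object UUID they point to via `pred`."""
    grouped = defaultdict(list)
    for edge in edges:
        if edge["pred"] == pred:
            grouped[edge["obj"]].append(edge["subj"])
    return dict(grouped)

def dangling_endpoints(edges, known_uuids):
    """Return sorted UUIDs used in edges that are absent from known_uuids."""
    used = {edge["subj"] for edge in edges} | {edge["obj"] for edge in edges}
    return sorted(used - set(known_uuids))

example_edges = [
    {
        "subj": "GSD:a7868da4-a6c2-4825-97b9-c86700b1c213",
        "pred": "is_a_related_synonym_of",
        "obj": "GSD:8ce1f4e6-8cbe-5167-8ece-a1cfc850d3a5",
        "comment": "GA1 is a related synonym of asialo-GM1",
    },
]
known = {"GSD:8ce1f4e6-8cbe-5167-8ece-a1cfc850d3a5"}
print(synonyms_by_target(example_edges))
print(dangling_endpoints(example_edges, known))
```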

back to top ▲

License

See LICENSE for more details.

back to top ▲

Acknowledgements

Placeholder

Placeholder for contributor/organization 1
Placeholder for contributor/organization 2
Placeholder for contributor/organization 3

back to top ▲
