glygener/glycan-structure-dictionary

Biomarker Glycan Structure Terms (bGST) Workflow

LLM-powered pipeline for extracting & normalizing glycan structure terminology

About This Project

Biomarker Glycan Structure Terms (bGST) is a controlled vocabulary of glycan structure terms extracted from literature and databases. It captures textual representations of glycans and glycan-related structural features, including full structures, motifs, epitopes, and substructures.

Because glycan structures are described inconsistently across sources, this project uses an LLM-assisted retrieval and entity resolution workflow to map terms to existing Glycan Structure Dictionary (GSD) entries or register new ones when needed. This helps unify heterogeneous glycan terminology into a normalized, de-duplicated reference knowledgebase.

Previous Work:

Vora J, Navelkar R, Vijay-Shanker K, Edwards N, Martinez K, Ding X, Wang T, Su P, Ross K, Lisacek F, Hayes C, Kahsay R, Ranzinger R, Tiemeyer M, Mazumder R. The Glycan Structure Dictionary-a dictionary describing commonly used glycan structure terms. Glycobiology. 2023 Jun 3;33(5):354-357. doi: 10.1093/glycob/cwad014. PMID: 36799723; PMCID: PMC10243773.


  • Ollama (local LLM inference): Run the pipeline entirely locally via Ollama, with configurable model selection and hardware setup.
  • LangGraph (structured term normalization): Extract, normalize, and align glycan terminology through a state-driven workflow orchestrated by LangGraph.
  • Chroma (vector search + embeddings): Build and query Chroma vector stores using embedded representations for similarity lookup.


Getting Started

Follow these steps to get a local copy up and running.

Prerequisites

Ollama is a local LLM inference runtime and model management layer that lets you pull and serve foundation models on-device. It abstracts backend details such as model packaging and request orchestration, so developers can run local models with minimal setup across different environments.

  • Install Ollama from https://ollama.com/download, or alternatively:

    curl -fsSL https://ollama.com/install.sh | sh
    • Ollama version >=v0.15.0 is recommended
  • HPC Users Only:

    • On servers that use environment modules (Lmod), list the pre-installed modules with:

      module avail
    • To display the default version of Ollama:

      module -d avail ollama
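
The recommended minimum version can be checked programmatically. The following is a minimal sketch, not part of the repository: the helper names `parse_version`, `meets_minimum`, and `installed_ollama_ok` are hypothetical, and the code assumes only that `ollama --version` prints a MAJOR.MINOR.PATCH number somewhere in its output.

```python
import re
import subprocess

MINIMUM_VERSION = (0, 15, 0)  # recommended minimum stated in this README


def parse_version(text: str) -> tuple:
    """Extract the first MAJOR.MINOR.PATCH triple found in `text`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text: str, minimum: tuple = MINIMUM_VERSION) -> bool:
    """Return True when the version found in `text` is at least `minimum`."""
    # Tuples compare numerically, so 0.9.6 is correctly older than 0.15.0.
    return parse_version(text) >= minimum


def installed_ollama_ok() -> bool:
    """Check the local install (assumes the `ollama` binary is on PATH)."""
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=False
    )
    return meets_minimum(result.stdout or result.stderr)
```

Comparing integer tuples rather than raw strings avoids the common mistake of treating "0.9" as newer than "0.15".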

Installation

  1. Clone this repo:

    git clone https://github.com/glygener/glycan-structure-dictionary.git
    cd glycan-structure-dictionary
  2. Pull the required Ollama models:

    A thinking model and an embedding model are required. If you choose to use other models, remember to update the model names in configs/models.yaml. This pipeline was developed against a locally hosted Ollama server, where GPU acceleration is practically a necessity. Alternatively, Ollama offers cloud models with limited free usage. To access cloud models and obtain an Ollama API key, refer to the Ollama documentation.

    Start your local Ollama service in a separate terminal window (close this window after verifying the downloads):

    Non-HPC users:

    ollama serve

    HPC Users Only:

    • Load the ollama module with module load ollama each time you open a new terminal window:

      module load ollama
      ollama serve

    Back in your main terminal window, download your reasoning model and your embedding model (more models):

    ollama pull gpt-oss:20b
    ollama pull mxbai-embed-large:335m

    Verify the downloads:

    ollama list
    # NAME                         ID              SIZE      MODIFIED
    # mxbai-embed-large:335m       468836162de7    669 MB    7 weeks ago
    # gpt-oss:20b                  17052f91a42e    13 GB     7 weeks ago

    (You may now close the terminal window that runs the Ollama server)

  3. Install Python dependencies:

    (Optional) create a virtual environment with Python 3.12:

    python3.12 -m venv .venv
    source .venv/bin/activate

    Install packages:

    python -m pip install -r requirements.txt
  4. Start Ollama server:

    For Non-HPC users:

    Every Python script that uses an LLM requires a running Ollama server. Use these scripts to start, stop, or check the status of a server:

    python scripts/ollama/start_server.py
    python scripts/ollama/stop_server.py
    python scripts/ollama/status_server.py

    For HPC (Slurm) users only:

    The Ollama server is managed by the shell script ./main_slurm.sh, which serves as a template with resource presets. To run a Python LLM script through Slurm, submit main_slurm.sh with the target script path as an argument:

    sbatch main_slurm.sh <SCRIPT.PY_PATH>

    Example:

    sbatch main_slurm.sh src/gsd/part1_textbook/01_ingest.py

    On successful job submission, you will find the logs at logs/slurm-<job-id>_output.txt and logs/slurm-<job-id>_error.txt.

    More on basic Slurm commands
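
The download check in step 2 can be scripted. The sketch below is an illustration rather than repository code: it parses text in the column layout shown in the `ollama list` example above and reports which of the two models named in this README are missing. The function names are hypothetical, and the parser assumes the model name is always the first whitespace-separated column.

```python
REQUIRED_MODELS = ("gpt-oss:20b", "mxbai-embed-large:335m")


def installed_models(listing: str) -> set:
    """Return the model names from text printed by `ollama list`."""
    names = set()
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, whose first column is "NAME".
        if not fields or fields[0] == "NAME":
            continue
        names.add(fields[0])
    return names


def missing_models(listing: str, required=REQUIRED_MODELS) -> list:
    """List the required models that do not appear in the listing."""
    present = installed_models(listing)
    return [name for name in required if name not in present]
```

If you changed the model names in configs/models.yaml, pass your own names through the `required` argument.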


Usage

Workflow

Part 1: Term extraction from EoG and relations mapping

  1. Create a Chroma database from Essentials of Glycobiology (EoG) documents

    unzip data/inputs/eog/raw_chapters/unzip_me_before_running_01_ingest.py.zip -d data/inputs/eog/raw_chapters/
    python src/gsd/part1_textbook/01_ingest.py
    # Or for HPC users here and thereafter:
    sbatch main_slurm.sh src/.../TargetScript.py
  2. Extract terms from EoG documents (from vectorstore)

    python src/gsd/part1_textbook/02_extract.py

HPC users: The default time limit in main_slurm.sh is 24 hours. Override it at submission time if needed:

sbatch --time=7-00:00:00 main_slurm.sh src/gsd/part1_textbook/02_extract.py
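
For scripted submissions, the time override above can be generated instead of typed by hand. This is a sketch under stated assumptions: `slurm_time` and `sbatch_command` are hypothetical helpers, and the only Slurm behavior relied on is the D-HH:MM:SS form already used in the command above.

```python
from datetime import timedelta
from typing import List, Optional


def slurm_time(limit: timedelta) -> str:
    """Format a duration as a Slurm D-HH:MM:SS time string."""
    total = int(limit.total_seconds())
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


def sbatch_command(script: str, limit: Optional[timedelta] = None) -> List[str]:
    """Build the argument list for submitting `script` via main_slurm.sh."""
    command = ["sbatch"]
    if limit is not None:
        # Without this flag, the default limit in main_slurm.sh applies.
        command.append(f"--time={slurm_time(limit)}")
    command += ["main_slurm.sh", script]
    return command
```

The returned list can be passed directly to `subprocess.run` on a login node.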

Varki A, Cummings RD, Esko JD, et al., editors. Essentials of Glycobiology [Internet]. 4th edition. Cold Spring Harbor (NY): Cold Spring Harbor Laboratory Press; 2022. Available from: https://www.ncbi.nlm.nih.gov/books/NBK579918/ doi: 10.1101/9781621824213

Part 2: Incorporating heterogeneous data sources and building a deduplicated master list of terms

This part builds a master dictionary of glycan structure terms by:

  • Ingesting heterogeneous source term sets (Essentials of Glycobiology, legacy GSD v0, curated publications, composition lists, curator-supplied sets, etc.).
  • Normalizing and formatting raw term JSONL inputs into a canonical intermediate structure.
  • Creating a semantic vector store (Chroma + OpenAI embeddings) for retrieval-augmented AI mapping.
  • Running AI-assisted mapping agents to (a) map synonyms to existing concepts or (b) propose creation of new canonical terms.
  • Reconciling AI action logs into term-to-UUID mappings.
  • Post-processing: merging multiple sources into consolidated node (master_nodes.json) and edge (master_edges.json) registries with quality checks and backups.
  1. Build embeddings

    python src/gsd/part2_enrichment/1_ai-assisted_term_matching/01_create_vectordb.py
  2. Run AI mapping for a source

    python src/gsd/part2_enrichment/1_ai-assisted_term_matching/02a_ai_mapping_gsdv0.py
  3. Reconcile mapping decisions

    python src/gsd/part2_enrichment/1_ai-assisted_term_matching/02b_match_gsdv0_ai_mapping_with_uuid.py

    (Repeat analogous steps for pubdictionaries)

    python src/gsd/part2_enrichment/1_ai-assisted_term_matching/03a_ai_mapping_pubdictionaries.py
    python src/gsd/part2_enrichment/1_ai-assisted_term_matching/03b_match_pubdict_ai_mapping_with_uuid.py
  4. Merge into master dictionaries

    python src/gsd/part2_enrichment/2_generate_mappings/postprocessing.py
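
The Part 2 steps above are order-dependent, so it may help to drive them from a single script. The following sketch is not shipped with the repository: `part2_commands` and `run_part2` are hypothetical names, and the script paths are copied from the steps above. It assumes an Ollama server is already running and that it is launched from the repository root.

```python
import subprocess
import sys

MATCHING_DIR = "src/gsd/part2_enrichment/1_ai-assisted_term_matching"

# Script order mirrors the numbered steps in this section.
PART2_STEPS = [
    f"{MATCHING_DIR}/01_create_vectordb.py",
    f"{MATCHING_DIR}/02a_ai_mapping_gsdv0.py",
    f"{MATCHING_DIR}/02b_match_gsdv0_ai_mapping_with_uuid.py",
    f"{MATCHING_DIR}/03a_ai_mapping_pubdictionaries.py",
    f"{MATCHING_DIR}/03b_match_pubdict_ai_mapping_with_uuid.py",
    "src/gsd/part2_enrichment/2_generate_mappings/postprocessing.py",
]


def part2_commands(python: str = sys.executable) -> list:
    """Return one command (as an argument list) per Part 2 step, in order."""
    return [[python, step] for step in PART2_STEPS]


def run_part2() -> None:
    """Run every step in sequence, stopping at the first failure."""
    for command in part2_commands():
        subprocess.run(command, check=True)
```

Using `check=True` means a failed mapping step raises immediately, so the merge in postprocessing.py never runs on incomplete mappings.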

Note

An OpenAI API key enables the application to access LLM services. Where to obtain an API key?

Project Structure

.
├── README.md
├── configs                       # YAML-based configuration for models, paths, and tooling
│   ├── base.yaml
│   ├── chroma.yaml               # Persist directories + retriever params
│   ├── models.yaml               # LLM labels + params
│   ├── ollama.yaml               # Ollama configs
│   ├── paths.yaml
│   ├── schemas                       # JSON/schema definitions for bGST data model
│   └── prompts                   # Collection of system prompts in markdown format
├── data
│   ├── inputs                    # Raw/normalized source data for the pipelines
│   │   ├── _resource_template    # Folder template for integrating new resources
│   │   │   ├── metadata
│   │   │   ├── normalized
│   │   │   └── raw
│   │   └── ...                   # Source data + merging audit records, grouped by folders
│   ├── outputs                   # Mapped terms (current/previous) + vectorstore snapshots (previous)
│   │   └── releases
│   └── workspace                 # Vectorstores of current release
│       └── chroma
├── docs                          # Supplementary documentation + notes
├── requirements.txt
├── scripts
│   └── ollama                    # Ollama server helpers (env var + pid management)
│       ├── start_server.py
│       ├── status_server.py
│       └── stop_server.py
├── src                           # Python library code for the GSD pipeline
│   └── gsd
│       ├── __init__.py
│       ├── adapters              # Higher level adapter tools
│       ├── part1_textbook        # EoG term extraction pipeline
│       ├── part2_enrichment      # GSD resource enrichment pipeline
│       ├── cli.py
│       ├── config.py             # Config loaders
│       ├── models.py
│       └── utils.py
└── tests                         # Unit tests

LLM Workflows

| Workflow | Description | Directory |
|---|---|---|
| GST Extraction | Extracts and classifies GSTs from a preprocessed text document and creates sentence-level citations as supporting evidence. Identifies GST entity pairs (i.e., has_abbr, has_formula). The example parses Essentials of Glycobiology 4e as a Chroma document. | src/gsd/part1_textbook/02_extract/ |
| RAG For Term Generation | Starts with deduplicated glycan structure terms. Retrieves the top-k document chunks from Essentials of Glycobiology 4e and synthesizes a term summary covering definition, cellular component, molecular function, and biological process. | src/gsd/part1_textbook/04_annotate |
| bGST Enrichment With New Datasets | Starts with a seed GST vectorstore (persist_directory = src/data/workspace/chroma/gsd/). Parses query GST entities one at a time: searches against existing term entries in the vectorstore and decides to (i) link the query to an existing entity or (ii) register a new entity. The vector store is updated dynamically during the iteration, and a list of AI term-linking audits is generated for human review before incorporation into the production GST datasets. | src/gsd/part2_enrichment/02_link/ |
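
The link-or-register loop of the enrichment workflow can be illustrated with a toy example. This is a deliberately simplified sketch and not the repository's implementation: plain string similarity from `difflib` stands in for the embedding search and the LLM decision, and the function names, the 0.9 threshold, and the `NEW:` identifier scheme are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in for embeddings)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_or_register(query: str, registry: dict, threshold: float = 0.9) -> tuple:
    """Link `query` to its closest registry label, or register it as new.

    `registry` maps term labels to identifiers and is updated in place,
    mirroring how the vector store grows during the iteration.
    """
    best = max(registry, key=lambda label: similarity(query, label), default=None)
    if best is not None and similarity(query, best) >= threshold:
        return ("link", registry[best])
    new_id = f"NEW:{len(registry) + 1}"  # hypothetical placeholder identifier
    registry[query] = new_id
    return ("register", new_id)


def enrich(queries: list, registry: dict, threshold: float = 0.9) -> list:
    """Process queries one at a time and collect audit records for review."""
    audits = []
    for query in queries:
        action, term_id = link_or_register(query, registry, threshold)
        audits.append({"query": query, "action": action, "id": term_id})
    return audits
```

Because the registry is updated inside the loop, a term registered early can be the link target for a later query, which is why the audit list is worth reviewing in order.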


Data

Data Source

| Resource | URL | Entities | Notes |
|---|---|---|---|
| GlycoMotif | https://glycomotif.glyomics.org/ | 701 | Secondary: Glydin, UniCarbKB, GlyTouCan, CCRC, GlyGen |
| Glydin | https://glycoproteome.expasy.org/epitopes/ | | Secondary: SugarbindDB, GlycoEpitope, Cummings, BioOligo-DB |
| SugarbindDB | https://sugarbind.expasy.org/ | 204 | |
| GlycoEpitope | https://www.glycoepitope.jp/ | 173 | Also available at https://glycosmos.org/glycoepitope |
| Cummings | https://pubmed.ncbi.nlm.nih.gov/19756298/ | | |
| BioOligo-DB | https://glyco3d.cermav.cnrs.fr/search.php?type=bioligo | | |
| Monosac-DB | https://glycopedia.eu/resources/presentation/ | | |
| UniLectin3D | https://unilectin.unige.ch/unilectin3D/ | | |
| GlycoMaple | https://glycosmos.org/glycomaple/Human | | |

Data Model

Describe the core data model(s) used by this project, including how glycan structure terms are represented, stored, and linked to external resources.

  • Primary storage: (e.g., JSONL, SQLite)

  • Key entities:

Each source terms file (*terms.jsonl) after formatting should produce lines like:

{
  "lbl": "sialyl Lewis x",
  "term_uuid": "GSD:32e928fb-1550-5e0a-945f-2218ac79b83c",
  "gtc_id": [
    "G00054MO"
  ],
  "sources": [
    {
      "src_lbl": "sialyl Lewis x",
      "src": "SRC:EOG_VARKI_4E",
      "src_uuid": "SRC:66cc8ff8-5b05-4882-8c47-8ab4f036bed3"
    },
    {
      "src_lbl": "sialyl Lewis x",
      "src": "SRC:GSD_GLYGEN_V0",
      "src_uuid": "SRC:0e4ec742-01a0-4d61-b1fb-655f380ac009"
    },
    {
      "src_lbl": "sialyl Lewis x",
      "src": "SRC:PUBDICTIONARIES-GLYCAN-IMAGE",
      "src_uuid": "SRC:5c02589c-9c5e-489f-8863-e0bd2618d901"
    }
  ],
  "gsd_id": "GSD000151"
}
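
A quick structural check on a terms file can catch formatting problems before mapping. The sketch below is illustrative only: the field names come from the record above, but which fields are mandatory is an assumption (here `lbl`, `term_uuid`, and `sources`, with `gtc_id` and `gsd_id` treated as optional), and the function names are hypothetical. It also assumes the file holds one JSON object per line.

```python
import json

REQUIRED_NODE_KEYS = {"lbl", "term_uuid", "sources"}      # assumed mandatory
REQUIRED_SOURCE_KEYS = {"src_lbl", "src", "src_uuid"}     # assumed mandatory


def node_problems(record: dict) -> list:
    """Return human-readable problems found in one term record."""
    missing = sorted(REQUIRED_NODE_KEYS - set(record))
    problems = [f"missing key: {key}" for key in missing]
    if not str(record.get("term_uuid", "")).startswith("GSD:"):
        problems.append("term_uuid lacks the GSD: prefix")
    for index, source in enumerate(record.get("sources", [])):
        for key in sorted(REQUIRED_SOURCE_KEYS - set(source)):
            problems.append(f"sources[{index}] missing key: {key}")
    return problems


def check_jsonl(text: str) -> dict:
    """Map 1-based line numbers to the problems found on that line."""
    report = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        problems = node_problems(json.loads(line))
        if problems:
            report[number] = problems
    return report
```

An empty report means every record passed; any other result points to the offending line numbers.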

Edges (*edges.jsonl) follow:

{
  "subj": "GSD:a7868da4-a6c2-4825-97b9-c86700b1c213",
  "pred": "is_a_related_synonym_of",
  "obj": "GSD:8ce1f4e6-8cbe-5167-8ece-a1cfc850d3a5",
  "comment": "GA1 is a related synonym of asialo-GM1"
}
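
Edge records of this shape are easy to index for lookups. As a hedged sketch (the function name is hypothetical, and the code assumes one JSON object per line with the `subj`, `pred`, and `obj` fields shown above), the following groups synonym subjects under the term they point to:

```python
import json
from collections import defaultdict


def synonyms_by_target(jsonl_text: str,
                       predicate: str = "is_a_related_synonym_of") -> dict:
    """Group subject UUIDs by the object UUID they are linked to."""
    index = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        edge = json.loads(line)
        if edge["pred"] == predicate:
            index[edge["obj"]].append(edge["subj"])
    return dict(index)
```

Keying on the object UUID makes it cheap to answer "which terms are recorded as synonyms of this concept?" without scanning the whole edge file each time.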


License

MIT License. Copyright (c) 2025 GlyGen

See LICENSE for more details.



About

This repository maintains the most up-to-date version of the Glycan Structure Dictionary.
