feat: hackathon/implement SBML Model Annotation and Knowledge Graph Integration (Team Sanofi US)#138

Open
sahneh wants to merge 15 commits into VirtualPatientEngine:main from sahneh:sbml-annotator-us

sahneh commented Mar 8, 2025

For authors

Description

Please:

  1. Provide a summary of the modifications made and any associated issue (if applicable).
  2. Include relevant context and motivation for the changes.
  3. If this relates to a change in any website's frontend, kindly attach a screenshot of the adjustment from your localhost.
  4. List any dependencies necessary for implementing this change.

Contributors:

  • Faryad Sahneh
  • Travis Ahn-Horst
  • Mahasweta Bhattacharya

This PR adds functionality to annotate SBML models using LLMs and integrate them with Biomedical Knowledge Graphs by:

  • Developing a multi-step annotation process using OCR and LLM-based entity recognition
  • Establishing connections between model species and ontological entities via Bio-Ontology API and UMLS mappings
  • Bridging the gap between dynamic biological processes (SBML) and static knowledge repositories (BKGs)

Files added:

  • Readme file proposing a framework that treats SBML models as first-class nodes in knowledge graphs
  • Processing scripts for Bio-Ontology API and UMLS extraction
  • JSON output files containing species annotations with ontology IDs
  • Integration methodology for connecting with PrimeKG
  • CSV file mapping SBML species to nodes of PrimeKG

The main files to review are:

  • species_dict_annotated.json: Species dictionary enriched with bio-ontology annotations
  • species2primekg_map.csv: The final mapping between PrimeKG nodes and species
  • species_dict_umls.json: Species dictionary with UMLS codes
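
As a rough illustration of how these outputs might be combined, the sketch below joins an annotated species dictionary with a species-to-PrimeKG mapping and lists the species that have no mapped node. The column names (`species_id`, `primekg_node_id`), the JSON layout, and the sample values are assumptions made for this example; they are not the actual schema of the files in this PR.

```python
import csv
import io
import json


def load_species_to_primekg(csv_text):
    """Return {species_id: [primekg_node_id, ...]} from CSV text.

    The column names are hypothetical; adjust them to the real header.
    """
    mapping = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        mapping.setdefault(row["species_id"], []).append(row["primekg_node_id"])
    return mapping


def attach_primekg_nodes(species_dict, mapping):
    """Return a copy of species_dict with a 'primekg_nodes' list per species."""
    return {
        species_id: {**record, "primekg_nodes": mapping.get(species_id, [])}
        for species_id, record in species_dict.items()
    }


# Hypothetical in-memory stand-ins for species_dict_annotated.json
# and species2primekg_map.csv.
SAMPLE_JSON = (
    '{"s1": {"name": "species_A", "ontology_ids": ["EXAMPLE:0001"]},'
    ' "s2": {"name": "species_B", "ontology_ids": []}}'
)
SAMPLE_CSV = "species_id,primekg_node_id\ns1,node_42\n"

species = attach_primekg_nodes(
    json.loads(SAMPLE_JSON), load_species_to_primekg(SAMPLE_CSV)
)
# Species without any PrimeKG node are candidates for manual review.
unmapped = [sid for sid, rec in species.items() if not rec["primekg_nodes"]]
print(unmapped)
```

Keeping the join separate from file loading means the same functions can be pointed at the real files once their actual column and key names are known.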

Fixes # (issue) Mention the issue number.

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests you conducted to verify your changes. These may involve creating new test scripts or updating existing ones.

  • Added new test(s) in the tests folder
  • Added new function(s) to an existing test(s) (e.g.: tests/testX.py)
  • No new tests added (Please explain the rationale in this case)

Checklist

  • My code follows the style guidelines mentioned in the Code/DevOps guides
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (e.g. MkDocs)
  • My changes generate no new warnings
  • I have added or updated tests (in the tests folder) that prove my fix is effective or that my feature works
  • New and existing tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

For reviewers

Checklist pre-approval

  • Is there enough documentation?
  • If a new feature has been added, or a bug fixed, has a test been added to confirm good behavior?
  • Does the test(s) successfully test edge/corner cases?
  • Does the PR pass the tests? (if the repository has continuous integration)

Checklist post-approval

  • Does this PR merge develop into main? If so, please make sure to add a prefix (feat/fix/chore) and/or a suffix BREAKING CHANGE (if it's a major release) to your commit message.
  • Does this PR close an issue? If so, please make sure to descriptively close this issue when the PR is merged.

Checklist post-merge

  • When you approve of the PR, merge and close it (Read this article to know about different merge methods on GitHub)
  • Did this PR merge develop into main and is it supposed to run an automated release workflow (if applicable)? If so, please make sure to check under the "Actions" tab to see if the workflow has been initiated, and return later to verify that it has completed successfully.

gurdeep330 changed the title from "feat/Implement SBML Model Annotation and Knowledge Graph Integration" to "feat: hackathon/implement SBML Model Annotation and Knowledge Graph Integration (Team Sanofi US)" on Mar 10, 2025

dmccloskey (Contributor) left a comment

Great explanation of the problem and some useful tools to continue towards a complete solution @sahneh and the rest of Team Sanofi US 👏.

I appreciate the two approaches and their implementations:

  1. Lookup using the BioPortal API
  2. Lookup in UMLS using SciSpaCy
    After using the OCR-processed PDF article and SBML species descriptions to prompt an LLM to create a more complete description that could be used for lookup.
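
To make the first approach concrete, here is a minimal sketch of a term lookup against the BioPortal search endpoint. It only builds the request URL and parses an already-decoded response, so it runs without network access. The helper names and the mocked response are hypothetical, the parameter and field names (`q`, `ontologies`, `require_exact_match`, `apikey`, `collection`, `prefLabel`, `@id`) reflect my understanding of the public BioPortal REST API, and this is not necessarily how the scripts in the PR are written.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BIOPORTAL_SEARCH_URL = "https://data.bioontology.org/search"


def build_search_url(term, api_key, ontologies=None, exact=False):
    """Build a BioPortal search URL for a species name or description."""
    params = {"q": term, "apikey": api_key}
    if ontologies:
        # Restrict the search to selected ontologies (comma-separated acronyms).
        params["ontologies"] = ",".join(ontologies)
    if exact:
        params["require_exact_match"] = "true"
    return f"{BIOPORTAL_SEARCH_URL}?{urlencode(params)}"


def extract_candidates(response_json):
    """Return (preferred label, term IRI) pairs from a parsed search response."""
    return [
        (item.get("prefLabel"), item.get("@id"))
        for item in response_json.get("collection", [])
    ]


url = build_search_url("interleukin 6", "YOUR_API_KEY", ["GO", "CHEBI"], exact=True)
query = parse_qs(urlparse(url).query)

# Hypothetical response body, shaped like the fields used above.
mock_response = {
    "collection": [
        {"prefLabel": "example term", "@id": "http://example.org/ontology/TERM_1"}
    ]
}
candidates = extract_candidates(mock_response)
print(url)
print(candidates)
```

In a real run the URL would be fetched (for example with `urllib.request.urlopen`) and the JSON body passed to `extract_candidates`.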

I noticed that the API calls to BioPortal were not just for lookup but also for enriching the species with their ontology annotations. If you had additional time, was the idea to also run some type of semantic search between the enriched annotations (after text embedding) and the descriptions extracted from the article/SBML model?

Reply from a contributor:

Well explained👍

Reply from the author:

Thanks for the great feedback @dmccloskey. Excellent observations!

Our approach was guided by two key principles:

  1. Leverage reasoning-focused LLMs with complete context rather than PDF RAG. Our justification was that SBML annotation requires holistic understanding of biological systems rather than fragmented inferences from text chunks. This approach also eliminates many of the technical challenges associated with making a RAG pipeline work properly.

  2. Utilize established biomedical ontology tools instead of relying solely on semantic search. Biological entity mapping requires nuanced understanding that goes beyond simple text similarity, and there's a rich ecosystem of specialized technologies in this domain that provide significant advantages.

Regarding the enriched descriptions: they serve complementary purposes aligned with our goal of connecting SBML models to knowledge graphs:

  • The ontological annotations create structured connections to KGs through standardized identifiers
  • The textual descriptions make the models more accessible to LLMs

For future work, semantic search would indeed be valuable. The following use cases are particularly interesting:

  • Post-filter annotations for more precise connections
  • Enable "white space exploration" beyond the explicit SBML model boundaries (across species, pathways, or disease contexts)
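
A minimal sketch of the post-filtering idea, under stated assumptions: candidate annotations carry a textual definition, and a candidate is kept only when that definition is similar enough to the LLM-generated species description. A toy bag-of-words vector stands in for a real text-embedding model here, and the candidate records, field names, and threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def post_filter(description, candidates, threshold=0.3):
    """Keep candidates whose definition is similar to the description, best first."""
    query = embed(description)
    scored = [(cosine(query, embed(c["definition"])), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return [c for _, c in sorted(kept, key=lambda pair: pair[0], reverse=True)]


# Hypothetical species description and candidate annotations.
description = "pro-inflammatory cytokine secreted by macrophages"
candidates = [
    {"id": "EXAMPLE:2", "definition": "membrane transporter for glucose"},
    {"id": "EXAMPLE:1", "definition": "cytokine secreted by activated macrophages"},
]
filtered = post_filter(description, candidates)
print([c["id"] for c in filtered])
```

Swapping `embed` for a sentence-embedding model would leave `post_filter` unchanged as long as `cosine` is adapted to dense vectors; the threshold would need tuning against curated annotations.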

Overall, the combination of KG-friendly ontological mapping and LLM-friendly textual descriptions creates a solid bridge between computational models and broader biological knowledge.
