feat: hackathon/implement SBML Model Annotation and Knowledge Graph Integration (Team Sanofi US)#138
sahneh wants to merge 15 commits
…s4Pharma into sbml-annotator-us
dmccloskey left a comment
Great explanation of the problem and some useful tools to continue towards a complete solution @sahneh and the rest of Team Sanofi US 👏.
I appreciate the two approaches and their implementations:
- Lookup using the BioPortal API
- Lookup in UMLS using SciSpaCy
Both run after the OCR-processed PDF article and the SBML species descriptions are used to prompt an LLM for a more complete description that can be used for lookup.
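For readers unfamiliar with the first lookup path, here is a minimal sketch of a term search against BioPortal's public REST search endpoint. It is not the code from this PR: the helper names `build_search_url` and `extract_candidates` are hypothetical, and the response handling assumes the usual `collection` list of hits with `prefLabel` and `@id` fields.

```python
from urllib.parse import urlencode

# Public BioPortal REST search endpoint.
BIOPORTAL_SEARCH = "https://data.bioontology.org/search"


def build_search_url(term, api_key, ontologies=None):
    """Build a term-search URL for one (LLM-expanded) species description.

    `ontologies` is an optional list of ontology acronyms used to
    restrict the search; the helper itself is hypothetical.
    """
    params = {"q": term, "apikey": api_key}
    if ontologies:
        params["ontologies"] = ",".join(ontologies)
    return f"{BIOPORTAL_SEARCH}?{urlencode(params)}"


def extract_candidates(response_json, limit=5):
    """Return (preferred label, IRI) pairs from a decoded search response.

    Assumes hits are listed under "collection"; missing fields become None.
    """
    hits = response_json.get("collection", [])
    return [(hit.get("prefLabel"), hit.get("@id")) for hit in hits[:limit]]
```

The URL can be fetched with any HTTP client, and the decoded JSON body passed to `extract_candidates` to get a short candidate list per species.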
I noticed that the API calls to BioPortal were not just for lookup but also for enriching the species with their ontology annotations. If you had additional time, was the idea to also do some type of semantic search between the enriched annotations (after textual embedding) and the descriptions extracted from the article/SBML model?
Thanks for the great feedback @dmccloskey. Excellent observations!
Our approach was guided by two key principles:
- Leverage reasoning-focused LLMs with complete context rather than PDF RAG. Our justification was that SBML annotation requires holistic understanding of biological systems rather than fragmented inferences from text chunks. This approach also eliminates many of the technical challenges associated with making a RAG pipeline work properly.
- Utilize established biomedical ontology tools instead of relying solely on semantic search. Biological entity mapping requires nuanced understanding that goes beyond simple text similarity, and there's a rich ecosystem of specialized technologies in this domain that provide significant advantages.
Regarding the enriched descriptions: they serve complementary purposes aligned with our goal of connecting SBML models to knowledge graphs:
- The ontological annotations create structured connections to KGs through standardized identifiers
- The textual descriptions make the models more accessible to LLMs.
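To make the first point concrete, here is a rough sketch of how standardized identifiers can act as join keys between an annotated species dictionary and a knowledge graph. The record layout (`description`, `annotations`, `id`), the `kg_index` lookup table, and the placeholder identifiers are all assumptions for illustration, not the schema used in this PR.

```python
def link_species_to_kg(species, kg_index):
    """Join species to KG nodes through shared ontology identifiers.

    species:  {species_id: {"description": str, "annotations": [{"id": str}, ...]}}
    kg_index: {ontology_id: kg_node_name}
    Both layouts are hypothetical.
    """
    rows = []
    for species_id, record in species.items():
        for annotation in record.get("annotations", []):
            node = kg_index.get(annotation["id"])
            if node is not None:  # keep only identifiers the KG knows about
                rows.append((species_id, annotation["id"], node))
    return rows


# Placeholder data, not real ontology identifiers.
species = {
    "s1": {"description": "free text for an LLM", "annotations": [{"id": "ONT:0001"}]},
    "s2": {"description": "another species", "annotations": [{"id": "ONT:9999"}]},
}
kg_index = {"ONT:0001": "node_a"}
mapping = link_species_to_kg(species, kg_index)
```

The `description` field plays no role in the join; it is carried along so the same record can also be handed to an LLM as plain text.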
For future work, semantic search would indeed be valuable. The following use cases are particularly interesting:
- Post-filter annotations for more precise connections
- Enable "white space exploration" beyond the explicit SBML model boundaries (across species, pathways, or disease contexts)
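As a rough illustration of the post-filtering idea, the sketch below scores each candidate annotation against the species description and drops weak matches. To stay dependency-free it swaps in bag-of-words cosine similarity for a real text-embedding model, so it only shows the shape of the filter. The function names and the default threshold are assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a stand-in for a real text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def post_filter(description, candidates, threshold=0.2):
    """Keep candidate labels whose text is close enough to the description.

    candidates: list of (label, annotation_text) pairs.
    Returns the surviving labels, best match first.
    """
    query = bag_of_words(description)
    scored = [(cosine(query, bag_of_words(text)), label) for label, text in candidates]
    return [label for score, label in sorted(scored, reverse=True) if score >= threshold]
```

Replacing `bag_of_words` and `cosine` with embedding vectors from a sentence-embedding model would turn this into the semantic variant discussed above without changing `post_filter`'s interface.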
Overall, combining KG-friendly ontological mappings with LLM-friendly textual descriptions creates a solid bridge between computational models and broader biological knowledge.
For authors
Description
Contributors:
This PR adds functionality to annotate SBML models using LLMs and integrate them with biomedical knowledge graphs.
Files added:
The main files to look at are the following:
- species_dict_annotated.json: Species dictionary enriched with bio-ontology annotations
- species2primekg_map.csv: A final mapping between PrimeKG nodes and species
- species_dict_umls.json: Species dictionary with UMLS codes
Fixes # (issue)