feat: hackathon/kg model hack Jack Saleh Sandeep (Team Galway) by SandeepRed · Pull Request #137 · VirtualPatientEngine/AIAgents4Pharma

SandeepRed · 2025-03-07T23:24:26Z

For authors

Description

Get species descriptions (get_descriptions_fromPdf.py), save them to a JSON file (descriptions_output.json), and identify GO terms related to the article (extract_relevant_GOTerms.py).
Ideally, we would embed the whole text of the article and run a vector search for GO nodes.
Retrieve the gene/protein nodes associated with the selected GO terms in PrimeKG and embed them (embed_genes.py).
Perform semantic search using FAISS on OpenAI embeddings to find the most relevant Entrez Gene IDs based on the query descriptions (embed_descriptions_search_ncbi.py).
Final output: species_gene_matches.csv
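The GO-term-to-gene step can be illustrated with a small sketch. This is not the code in embed_genes.py. It assumes a PrimeKG-style edge table with `x_id`/`x_type`/`x_name` and `y_id`/`y_type`/`y_name` columns, and the rows, the `GO:` ID format, and the function name `genes_for_go_terms` are made up for illustration.

```python
# Toy edge rows in a PrimeKG-like layout. Column names and the GO ID format
# are assumptions for illustration, not taken from the PR's scripts.
EDGES = [
    {"x_id": "7157", "x_type": "gene/protein", "x_name": "TP53",
     "y_id": "GO:0006915", "y_type": "biological_process", "y_name": "apoptotic process"},
    {"x_id": "3569", "x_type": "gene/protein", "x_name": "IL6",
     "y_id": "GO:0006954", "y_type": "biological_process", "y_name": "inflammatory response"},
    # Same kind of link, stored in the opposite direction.
    {"x_id": "GO:0006954", "x_type": "biological_process", "x_name": "inflammatory response",
     "y_id": "7124", "y_type": "gene/protein", "y_name": "TNF"},
    # Gene-gene edge: must be ignored by the GO lookup.
    {"x_id": "7157", "x_type": "gene/protein", "x_name": "TP53",
     "y_id": "4193", "y_type": "gene/protein", "y_name": "MDM2"},
]

GO_TYPES = {"biological_process", "molecular_function", "cellular_component"}


def genes_for_go_terms(edges, selected_go_ids):
    """Return {gene_id: gene_name} for gene/protein nodes linked to the selected GO ids."""
    genes = {}
    for edge in edges:
        # Edges may be stored in either direction, so check both orientations.
        for gene_side, go_side in (("x", "y"), ("y", "x")):
            if (edge[f"{gene_side}_type"] == "gene/protein"
                    and edge[f"{go_side}_type"] in GO_TYPES
                    and edge[f"{go_side}_id"] in selected_go_ids):
                genes[edge[f"{gene_side}_id"]] = edge[f"{gene_side}_name"]
    return genes


if __name__ == "__main__":
    print(genes_for_go_terms(EDGES, {"GO:0006954"}))
```

Restricting the candidate set this way before embedding keeps the later similarity search limited to genes that are already linked to a relevant GO term.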


dmccloskey

Really cool work @SandeepRed and the rest of Team Galway!

Can you please confirm my high-level understanding of the proposed solution to map from the PDF article to PrimeKG?

1. Extract disease terms from the PDF using OpenAI Textual Embeddings.
2. Extract species descriptions from the PDF using OpenAI Textual Embeddings.
3. Extract the disease subgraph from PrimeKG by matching the disease terms from step 1 to GO terms and descriptions in PrimeKG.
4. Extract the gene/protein nodes linked to the GO terms from step 3.
5. Embed the gene/protein descriptions from step 4 using OpenAI Textual Embeddings.
6. Compare the species description embeddings from step 2 to the gene/protein description embeddings from step 5 using FAISS.
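For reference, the comparison in the last step can be sketched without any external dependencies. This is a plain-Python stand-in, not the PR's FAISS code: it ranks genes by cosine similarity, which is what an exact inner-product search over L2-normalized vectors would return. The three-dimensional vectors, the `species_A` key, and the function names are hypothetical. Real OpenAI embeddings have far more dimensions.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def top_k_matches(query_vec, gene_vecs, k=2):
    """Rank gene IDs by cosine similarity to the query embedding."""
    query = normalize(query_vec)
    scored = []
    for gene_id, vec in gene_vecs.items():
        gene = normalize(vec)
        similarity = sum(a * b for a, b in zip(query, gene))
        scored.append((similarity, gene_id))
    scored.sort(reverse=True)  # highest similarity first
    return [(gene_id, similarity) for similarity, gene_id in scored[:k]]


# Made-up embeddings keyed by gene ID (step 5) and by species name (step 2).
GENE_VECS = {
    "3569": [0.9, 0.1, 0.0],
    "7124": [0.7, 0.7, 0.0],
    "7157": [0.0, 0.1, 0.9],
}
SPECIES_VECS = {"species_A": [1.0, 0.0, 0.0]}

if __name__ == "__main__":
    for species, vec in SPECIES_VECS.items():
        print(species, top_k_matches(vec, GENE_VECS))
```

With a real index, the dictionary loop would be replaced by a batched search, but the ranking logic is the same.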

SandeepRed · 2025-03-20T16:41:57Z

Dear Douglas,

Yes, that is a perfect summary. Apologies for the delay; it was a bank holiday weekend and I was unwell.

Regarding the extraction of species descriptions from the PDF using OpenAI Textual Embeddings:
We initially thought RAG wasn't necessary due to the long context length. However, as you suggested during the meeting, if we consider cross-citations, then yes, RAG makes sense.
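If RAG is adopted, the first step would be splitting the article text into retrievable chunks. The following is a minimal sketch of overlapping word-window chunking. The function name `chunk_words` and the default sizes are assumptions, and nothing here reflects how the PR's scripts currently read the PDF.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to chunk_size words, sharing `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(12))
    for chunk in chunk_words(sample, chunk_size=5, overlap=2):
        print(chunk)
```

Each chunk would then be embedded separately, so that a query retrieves only the passages relevant to a given species instead of the whole article.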

Surprisingly, our disease-based approach didn't work as well as expected, likely due to missing mappings, so we used high-level GO terms to build the subgraph instead.

We wanted all of this to be fully automated, perhaps by merging the disease and GO subgraphs or by using a weighted scoring approach that also draws on metadata such as descriptions.
The main reason for using a subgraph-based approach was to reduce false positives while keeping sensitivity high.
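One way the weighted scoring idea could look is sketched below. This is purely illustrative: the weights, the field names (`similarity`, `in_go`, `in_disease`), and the candidate records are assumptions, and none of it is implemented in this PR.

```python
def combined_score(similarity, in_go, in_disease,
                   w_sim=0.6, w_go=0.25, w_disease=0.15):
    """Blend embedding similarity with GO and disease subgraph membership."""
    return w_sim * similarity + w_go * float(in_go) + w_disease * float(in_disease)


def rank_candidates(candidates, **weights):
    """Sort candidate genes by combined score, best first."""
    return sorted(
        candidates,
        key=lambda c: combined_score(c["similarity"], c["in_go"], c["in_disease"], **weights),
        reverse=True,
    )


# Hypothetical candidates for a single species description.
CANDIDATES = [
    {"gene": "gene_A", "similarity": 0.80, "in_go": False, "in_disease": False},
    {"gene": "gene_B", "similarity": 0.70, "in_go": True, "in_disease": False},
    {"gene": "gene_C", "similarity": 0.65, "in_go": True, "in_disease": True},
]

if __name__ == "__main__":
    for cand in rank_candidates(CANDIDATES):
        print(cand["gene"])
```

In this toy case a gene supported by both subgraphs outranks one with a higher raw similarity but no graph support, which is the false-positive reduction the subgraph approach aims for. Unlike a hard subgraph filter, the soft weighting does not discard candidates outright.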

Sandeep Chenna added 2 commits March 7, 2025 12:15

9200202 extract terms, embed, extract disease subrgaph with Gene, GO and ontologies

f9fa051 Get descriptions for species, relevant high level GO Terms and fetch gene/protein nodes related to those GO and semantic search for entrez

gurdeep330 requested review from awmulyadi, dmccloskey and lilijap March 8, 2025 06:33

gurdeep330 assigned SandeepRed Mar 8, 2025

gurdeep330 added T2B T2KG labels Mar 8, 2025

gurdeep330 changed the title ~~Feat/kg model hack Jack Saleh Sandeep~~ feat: hackathon/kg model hack Jack Saleh Sandeep (Team Galway) Mar 10, 2025

dmccloskey mentioned this pull request Mar 12, 2025

FEATURE: Require multi-PDF support in Question_and_answer.py (tool) of pdf_agent.py (agent) #131

Open

dmccloskey reviewed Mar 17, 2025

SandeepRed wants to merge 2 commits into VirtualPatientEngine:main from SandeepRed:feat/KGModelHackJSS
