feat: hackathon/kg model hack Jack Saleh Sandeep (Team Galway)#137

Open
SandeepRed wants to merge 2 commits into VirtualPatientEngine:main from SandeepRed:feat/KGModelHackJSS


@SandeepRed

For authors

Description

  1. Extract species descriptions from the PDF (get_descriptions_fromPdf.py), save them to a JSON file (**descriptions_output.json**), and identify GO terms related to the article (extract_relevant_GOTerms.py).
    Ideally, we would embed the full article text and run a vector search over the GO nodes.
  2. Retrieve the gene/protein nodes linked to the selected GO terms in PrimeKG and embed them (embed_genes.py).
  3. Run a semantic search with FAISS over OpenAI embeddings to find the most relevant Entrez Gene IDs for each query description (**embed_descriptions_search_ncbi.py**).
    Final output: species_gene_matches.csv
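
To make step 3 concrete, here is a minimal sketch of the matching logic. It is not the code from embed_descriptions_search_ncbi.py: it swaps FAISS for a brute-force cosine-similarity search in plain Python and swaps the OpenAI embeddings for toy 3-dimensional vectors, so it runs without any dependencies. The function names, vectors, and species names are hypothetical and only illustrate the idea.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_matches(query_vec, gene_vecs, k=2):
    """Return the k (gene_id, score) pairs most similar to the query.

    Brute-force stand-in for a FAISS inner-product search over
    normalized embeddings.
    """
    q = normalize(query_vec)
    scored = []
    for gene_id, vec in gene_vecs.items():
        g = normalize(vec)
        scored.append((gene_id, sum(a * b for a, b in zip(q, g))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for embeddings of gene/protein descriptions,
# keyed by illustrative Entrez Gene IDs.
gene_vecs = {
    "3569": [0.9, 0.1, 0.0],
    "7124": [0.1, 0.9, 0.0],
    "3458": [0.0, 0.2, 0.9],
}

# Toy vectors standing in for embeddings of species descriptions from the PDF.
species_vecs = {
    "IL6": [1.0, 0.2, 0.0],
    "TNF": [0.0, 1.0, 0.1],
}

matches = {
    name: top_k_matches(vec, gene_vecs, k=1)
    for name, vec in species_vecs.items()
}
for name, hits in matches.items():
    print(name, hits)
```

With real embeddings the same structure applies: normalize both sets of vectors, search the gene index with each species vector, and write the best hits to the output CSV.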

Sandeep Chenna added 2 commits March 7, 2025 12:15
@gurdeep330 changed the title from "Feat/kg model hack Jack Saleh Sandeep" to "feat: hackathon/kg model hack Jack Saleh Sandeep (Team Galway)" on Mar 10, 2025

@dmccloskey dmccloskey left a comment

Contributor


Really cool work @SandeepRed and the rest of Team Galway!

Can you please confirm my high-level understanding of the proposed solution to map from the PDF article to PrimeKG?

  1. Extract disease terms from the PDF using OpenAI Textual Embeddings
  2. Extract species descriptions from the PDF using OpenAI Textual Embeddings
  3. Extract disease subgraph from PrimeKG by matching disease terms from step 1 to GO terms and descriptions in PrimeKG
  4. Extract gene/protein nodes linked to GO terms from step 3
  5. Embed the gene/protein descriptions from step 4 using OpenAI Textual Embeddings
  6. Compare the extracted species description embeddings from step 2 to the gene/protein description embeddings from step 5 using FAISS.
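
As a rough illustration of steps 3 and 4, the sketch below filters an edge list for gene/protein nodes connected to a chosen set of GO terms. It assumes PrimeKG-style column names (`x_type`, `x_id`, `x_name`, `y_type`, `y_id`, `y_name`) and GO node types; the edge rows, the ID formats, and the function name are made up for illustration and are not taken from the PR.

```python
# Node types assumed to represent GO terms in the knowledge graph.
GO_NODE_TYPES = {"biological_process", "molecular_function", "cellular_component"}

# Toy edge list; the rows and ID formats are illustrative only.
edges = [
    {"x_type": "biological_process", "x_id": "GO:0006954", "x_name": "inflammatory response",
     "y_type": "gene/protein", "y_id": "3569", "y_name": "IL6"},
    {"x_type": "gene/protein", "x_id": "7124", "x_name": "TNF",
     "y_type": "biological_process", "y_id": "GO:0006954", "y_name": "inflammatory response"},
    {"x_type": "biological_process", "x_id": "GO:0006915", "x_name": "apoptotic process",
     "y_type": "gene/protein", "y_id": "7157", "y_name": "TP53"},
    {"x_type": "disease", "x_id": "D001", "x_name": "example disease",
     "y_type": "gene/protein", "y_id": "3458", "y_name": "IFNG"},
]


def genes_for_go_terms(edges, go_ids):
    """Collect gene/protein nodes that share an edge with any selected GO term.

    Edges are checked in both directions because a GO term may appear
    on either the x or the y side of a row.
    """
    genes = {}
    for edge in edges:
        for go_side, gene_side in (("x", "y"), ("y", "x")):
            if (
                edge[f"{go_side}_type"] in GO_NODE_TYPES
                and edge[f"{go_side}_id"] in go_ids
                and edge[f"{gene_side}_type"] == "gene/protein"
            ):
                genes[edge[f"{gene_side}_id"]] = edge[f"{gene_side}_name"]
    return genes


selected = genes_for_go_terms(edges, {"GO:0006954"})
print(selected)
```

The resulting gene/protein set is what would then be embedded in step 5 and searched in step 6.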

@SandeepRed

Author

Dear Douglas,

Yes, that is a perfect summary. Apologies for the delay; it was a bank holiday weekend and I was unwell.

Regarding the extraction of species descriptions from the PDF using OpenAI Textual Embeddings:
We initially thought RAG wasn't necessary due to the long context length. However, as you suggested during the meeting, if we consider cross-citations, then yes, RAG makes sense.

Our disease-based approach didn't work as well as expected (surprisingly, likely due to missing mappings), so we used high-level GO terms to build the subgraph instead.

We wanted all of this to be fully automated, possibly by merging the disease and GO subgraphs or by using a weighted scoring approach that draws on metadata such as descriptions.
The main reason for the subgraph-based approach was to reduce false positives while maintaining high sensitivity.
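
One possible shape for the weighted scoring idea is sketched below. This is purely hypothetical and not implemented in the PR: the weights, the candidate tuples, and the function name are assumptions chosen for illustration. It combines the embedding similarity with bonuses for membership in the GO and disease subgraphs, so that subgraph membership adjusts the ranking rather than acting as a hard filter.

```python
def combined_score(similarity, in_go_subgraph, in_disease_subgraph,
                   w_sim=0.6, w_go=0.25, w_disease=0.15):
    """Blend embedding similarity with subgraph membership (hypothetical weights)."""
    return (
        w_sim * similarity
        + w_go * float(in_go_subgraph)
        + w_disease * float(in_disease_subgraph)
    )


# (gene_id, similarity, in GO subgraph, in disease subgraph); toy values.
candidates = [
    ("7157", 0.90, False, False),
    ("7124", 0.85, True, False),
    ("3569", 0.80, True, True),
]

ranked = sorted(
    ((gene_id, combined_score(sim, in_go, in_dis))
     for gene_id, sim, in_go, in_dis in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)
for gene_id, score in ranked:
    print(gene_id, round(score, 3))
```

In this toy ranking, a candidate supported by both subgraphs outranks one with a slightly higher raw similarity but no subgraph support, which is the false-positive reduction the subgraph approach aims for.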

3 participants