🧠 SOKE Graph: A Semantic-linked Ontological Framework for Domain-Specific Knowledge Discovery in Scientific Literature
SOKE Graph is a powerful, end-to-end pipeline designed to extract structured knowledge from scientific PDFs using ontology-driven classification and AI-assisted language models. It enables automated discovery and categorisation of domain-specific information—such as catalyst types, reaction conditions, and performance metrics—by parsing research papers, classifying concepts across multiple layers (e.g., Process, Environment, Reaction), and storing the results in a knowledge graph.
This tool can be tailored for accelerating literature analysis in any domain of research; in our case, we have focused on material science fields like green hydrogen production and water electrolysis.
- 🧠 SOKE Graph: A Semantic-linked Ontological Framework for Domain-Specific Knowledge Discovery in Scientific Literature
- 🚀 How to Run This Python Project on Windows, macOS, and Linux
- Step 1: Open the Command Line / Terminal
- Step 2: Clone the Project (Download the Code)
- 🐳 Quick Start with Docker (Easiest Option!)
- Alternative: Manual Python Installation
- Step 3: Create a Virtual Environment (Conda Recommended)
- Step 4: Activate the Environment
- Step 5: Install Project Dependencies
- Step 6: Recommended Editor – Visual Studio Code (VS Code)
- Step 7: Run the Project
- 📂 Preparing Input Files for SOKEGraph
- 1) 🧭 Ontology File (Ontology.json)
- 2) 📄 Paper Query File (paper_query.txt)
- 3) 🔑 Keyword Query File (keyword_query.txt)
- 4) 🔑 LLM API Key File (apikeys_xxx.txt)
- 5) 🗝️ Neo4j Credentials File (neo4j_credentials.json)
- Point the Streamlit app / notebooks to these files when prompted.
- Step 8: Deactivate Virtual Environment (Optional)
- 🔍 Retrieve papers from Semantic Scholar or your PDF collection
- 🤖 Use AI (OpenAI, Gemini, ...) to extract ontological concepts and metadata
- 📊 Rank papers based on query relevance and extracted metadata
- 🧱 Build knowledge graphs (Neo4j supported) from structured paper data
This guide will walk you through running this project on your computer, regardless of your operating system or prior Python knowledge.
You'll be able to enter commands here.
- Windows: Press Win + R, type cmd, and press Enter to open Command Prompt. Or press Win + X, then select Windows PowerShell or Windows Terminal if installed.
- macOS: Press Cmd + Space, type Terminal, and press Enter.
- Linux: Look for the Terminal app in your applications menu, or press Ctrl + Alt + T.
In the command line window you opened, type:
git clone https://github.com/sokematgraph/SOKE-Graph.git
👉 If Git is not installed on your system, please see INSTALLATION.md for details.
After cloning, navigate into the project folder:
cd SOKE-Graph
If you have Docker installed, you can run SOKEGraph without installing Python or any dependencies.
On macOS / Linux:
chmod +x docker-run.sh
./docker-run.sh
On Windows:
Step 1: Make sure Docker Desktop is installed and running
Step 2: Open PowerShell (NOT Command Prompt)
- Press Win + X and select Windows PowerShell or Windows Terminal
- Navigate to the SOKE-Graph folder if you're not already there:
cd SOKE-Graph
Step 3: Run the PowerShell script:
.\docker-run.ps1
💡 Windows Tip: If you see an error about "execution policy", run this first:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
Alternative for Windows users with Git Bash or WSL:
bash docker-run.sh
Once the app is running, open http://localhost:8501 in your browser.
Your files will be saved to:
- ./data/outputs/ - Ranked papers and results
- ./external/output/ - Knowledge graphs and exports
To stop the application:
docker compose down
For more details, see DOCKER_README.md
We recommend using Conda (or Miniconda/Mamba) to manage dependencies for this project.
Conda makes it easy to install and manage scientific packages across platforms.
conda create -n sokegraph python=3.9
conda activate sokegraph
You’ll need to activate this environment every time before running the project.
With the environment active, install required packages:
pip install --upgrade pip
pip install -r requirements.txt
⚡ Note for advanced users:
You may also use Python’s built-in venv if you prefer, but Conda is the recommended and tested way for this project.
We recommend using Visual Studio Code (VS Code) for working with this project, whether you want to edit code, run the Streamlit app, or work in Jupyter Notebooks.
If you don’t already have VS Code installed, please see INSTALLATION.md for detailed instructions on how to download and install it.
You can open the SOKE-Graph project folder in two ways:
- Option 1: From VS Code directly
  - Open VS Code
  - Go to File > Open Folder... and select the SOKE-Graph folder
- Option 2: From the Terminal
  If VS Code is installed and added to your PATH, you can run:
  cd SOKE-Graph
  code .
After opening, use the integrated terminal (View > Terminal) to activate your virtual environment (see Step 4) and start running the project.
Recommended VS Code extensions:
- Python (Microsoft)
- Jupyter (Microsoft)
These extensions make it easier to run and edit .py or .ipynb files directly inside VS Code.
💡 Tip: You can run Jupyter notebooks inside VS Code without opening a separate browser window.
You can choose the method that best fits your skills and setup. For most users, the Streamlit app is the easiest way to get started.
The Streamlit app provides a simple graphical interface to run the entire pipeline without writing code.
From your project folder, run:
streamlit run streamlit-app.py
The app will open in your browser. You can configure the pipeline with the following inputs:
- Paper source: Choose how to retrieve papers:
  - Semantic Scholar
  - Journal API
- Number of papers: The maximum number of papers to fetch (for Semantic Scholar or Journal API).
- Upload Paper Query file (paper_query.txt): A text file with one search query per line.
- Upload Base Ontology file (Ontology.json): Defines categories, subcategories, and keywords for concept detection.
- Field of interest: Enter your research domain (e.g., materials science, biology, medicine).
- LLM: Select which LLM to use for ontology enrichment and paper analysis (OpenAI, Gemini, Llama, Ollama, or Claude).
- LLM API Key (apikeys_xxx.txt): A text file containing the API keys required for accessing AI models and/or Journal APIs.
- Keyword Query file (keyword_query.txt): A list of keywords used for ranking and filtering papers.
- Knowledge Graph backend: Choose the graph engine:
  - networkx (in-memory, default)
  - neo4j (requires credentials file)
👉 Note: If you don’t know how to create these files (Ontology.json, paper_query.txt, keyword_query.txt, apikeys_xxx.txt, or neo4j_credentials.json), see the section 📂 Preparing Input Files for SOKEGraph below.
- Compute Device:
  - By default, the program runs with GPU acceleration if your system supports it (e.g., CUDA, MPS).
  - To force the program to run on CPU only, check the option “Force CPU (ignore GPU)” in the sidebar.
Once all inputs are set, click 🚀 Run Pipeline.
The app will:
- Fetch papers from your chosen source using queries.
- Enrich the ontology with AI.
- Rank the papers using keywords.
- Build and display the knowledge graph.
- Export results (ranked papers, ontology, graph data) into the external/output/ folder.
This notebook is designed for users who are comfortable modifying code directly.
- 🔧 You define all parameters manually in a Python object called params (a SimpleNamespace in the example below).
- ✅ Once configured, you run the pipeline with a single function call.
- 📂 Best for quick experiments or automation in notebook environments.
from types import SimpleNamespace
from sokegraph.full_pipeline import full_pipeline_main
params = SimpleNamespace(
paper_source="Semantic Scholar", # Options: "Semantic Scholar", "PDF Zip", "Journal API"
number_papers=10, # Number of papers to fetch from Semantic Scholar
paper_query_file="topics.txt", # Text file with one search query per line
pdfs_file=None, # Optional: ZIP file with PDFs (for PDF source)
api_key_file="api_journal_api.txt", # API key file for Journal API source
ontology_file="base_ontology.json", # Base ontology file (JSON or OWL)
AI="openAI", # Options: "openAI", "gemini", "llama", "ollama", "claude"
API_keys="openai_keys.json", # API key file for AI tools
keyword_query_file="keywords.txt", # Text file listing keywords
model_knowledge_graph="neo4j", # Options: "neo4j", "networkx"
credentials_for_knowledge_graph="neo4j_credentials.json", # Graph DB credentials
output_dir="output/" # Output directory
)
full_pipeline_main(params)
Provide one of the following parameter combinations, depending on whether you're searching for papers or uploading PDFs:
"number_papers" + "paper_query_file"
OR
"number_papers" + "paper_query_file" + "api_key_journal_api"
OR
"pdfs_file"
💡 Make sure that all file paths in your params are valid and that services like Neo4j, Ollama, or your Journal API access are available before starting the pipeline.
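One way to catch invalid paths early is a small pre-flight check before calling the pipeline. The sketch below is not part of sokegraph: `missing_files` and `FILE_FIELDS` are hypothetical names, and the attribute names are assumed to match the params example above. It only inspects fields that are set, so optional entries such as pdfs_file=None are skipped.

```python
from pathlib import Path
from types import SimpleNamespace

# Hypothetical helper, not part of sokegraph. The field names are assumed
# to mirror the params example in this README; adjust them if yours differ.
FILE_FIELDS = (
    "paper_query_file",
    "pdfs_file",
    "api_key_file",
    "ontology_file",
    "API_keys",
    "keyword_query_file",
    "credentials_for_knowledge_graph",
)


def missing_files(params):
    """Return (field, path) pairs whose path is set but is not an existing file."""
    missing = []
    for field in FILE_FIELDS:
        value = getattr(params, field, None)
        # Unset or None fields are treated as "not used" and skipped.
        if value and not Path(value).is_file():
            missing.append((field, value))
    return missing


# Demo with a deliberately non-existent ontology path.
demo = SimpleNamespace(
    ontology_file="definitely_missing_ontology.json",
    pdfs_file=None,
)
for field, path in missing_files(demo):
    print(f"{field}: file not found: {path}")
```

Running such a check right before full_pipeline_main(params) reports every missing file at once instead of failing partway through a run. It does not verify that Neo4j, Ollama, or a Journal API is reachable.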
This notebook uses ipywidgets to provide an interactive form-like interface for running the pipeline.
It’s helpful if you want a guided, cell-by-cell execution without writing code manually.
- Allows you to select how you want to retrieve papers:
  - 📁 Upload a ZIP file of PDFs (PDF source)
  - 🔎 Search and fetch papers from Semantic Scholar using a query file
  - 🌐 Fetch papers via the Journal API using a query file and an API key
- Provides dropdowns and file pickers to easily select files like:
  - Ontology
  - Keyword queries
  - API keys
  - Output folder
- Runs each pipeline step independently, so you can see exactly what happens at every stage.
- 📄 Paper Retrieval
  - Based on your selected paper_source:
    - Semantic Scholar: Downloads papers using your paper_query_file
    - PDF Zip: Loads and processes PDFs from the uploaded ZIP file
    - Journal API: Retrieves paper metadata from the Web of Science API using query + API key
- 🧠 Ontology Enrichment
  - The chosen AI agent (openAI, gemini, llama, ollama, or claude) analyzes the papers and expands your base ontology
  - Adds new keywords, concepts, synonyms, and relationships
- 📊 Paper Ranking
  - Ranks the papers using:
    - Exact keyword matches
    - Synonyms and expanded terms
    - Opposite-term filtering to down-rank irrelevant papers
- 🕸 Knowledge Graph Construction
  - Converts enriched data into a structured graph using:
    - Neo4j (with login credentials)
    - Or NetworkX (in-memory option)
  - Graph includes:
    - Ontology categories
    - Paper-concept links
    - Metadata associations
- 💾 Output
  - Saves everything in your selected output_dir, including:
    - Enriched ontology file
    - Ranked papers (CSV/JSON)
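To make the three ranking criteria listed above concrete, here is a toy scoring sketch. It is an illustration only, not SOKEGraph's actual ranking code: the function names, the weights (+2 exact, +1 synonym, -2 opposite term), and the paper dictionaries are all assumptions chosen for the example.

```python
import re


def contains(text, term):
    """True if term occurs in text as a whole word or phrase (case-insensitive)."""
    pattern = rf"(?<!\w){re.escape(term)}(?!\w)"
    return re.search(pattern, text, re.IGNORECASE) is not None


def score_paper(text, keywords, synonyms=None, opposites=None):
    """Toy score: +2 per exact keyword, +1 if only a synonym matches, -2 per opposite term."""
    synonyms = synonyms or {}
    opposites = opposites or []
    score = 0
    for keyword in keywords:
        if contains(text, keyword):
            score += 2
        elif any(contains(text, alt) for alt in synonyms.get(keyword, [])):
            score += 1
    # Opposite terms down-rank a paper rather than removing it outright.
    score -= 2 * sum(1 for term in opposites if contains(text, term))
    return score


def rank_papers(papers, keywords, synonyms=None, opposites=None):
    """Return papers sorted from most to least relevant by the toy score."""
    return sorted(
        papers,
        key=lambda p: score_paper(p["abstract"], keywords, synonyms, opposites),
        reverse=True,
    )


papers = [
    {"title": "A", "abstract": "Alkaline OER on nickel oxide"},
    {"title": "B", "abstract": "Acidic hydrogen evolution reaction catalysts"},
]
ranked = rank_papers(
    papers,
    keywords=["acidic", "HER"],
    synonyms={"HER": ["hydrogen evolution reaction"]},
    opposites=["alkaline"],
)
print([p["title"] for p in ranked])
```

Whole-word matching is used so that a short keyword such as HER does not match inside unrelated words. A real ranker would likely weight matches by field (title, abstract, full text) and frequency.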
✅ No need to modify code manually – just fill out the form and click Run for each step.
💡 Make sure required services like Neo4j, Ollama, or your Journal API credentials are ready before starting the pipeline.
SOKEGraph uses five input files. Place them in your project (e.g., ./inputs/) and point the app/notebook to their paths.
Defines categories → subcategories → keywords/synonyms that guide concept detection and search.
Format
{
"Category": {
"Subcategory": ["keyword1", "keyword2", "keyword3"]
}
}
Example
{
"Environment": {
"Acidic": ["pH < 7", "acidic"],
"Alkaline": ["pH > 7", "alkaline", "basic"]
},
"Process": {
"Water Electrolysis": ["electrolysis of water", "splitting H2O"],
"Fuel Cells": ["fuel cell", "PEM", "proton exchange membrane"]
}
}
Tips
- Include common variants (symbols, abbreviations, spacing: pH<7 vs pH < 7).
- Validate JSON (e.g., jsonlint). Save as Ontology.json.
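Beyond checking that the file is valid JSON, it can help to confirm that it follows the category → subcategory → keyword-list shape described above. The following is a minimal sketch using only the standard library; `validate_ontology` is a hypothetical helper, not a SOKEGraph function.

```python
import json


def validate_ontology(ontology):
    """Return a list of problems; an empty list means the expected shape was found."""
    if not isinstance(ontology, dict):
        return ["top level must be a JSON object of categories"]
    problems = []
    for category, subcategories in ontology.items():
        if not isinstance(subcategories, dict):
            problems.append(f"{category}: expected an object of subcategories")
            continue
        for name, keywords in subcategories.items():
            is_str_list = isinstance(keywords, list) and all(
                isinstance(k, str) for k in keywords
            )
            if not is_str_list:
                problems.append(f"{category} > {name}: expected a list of keyword strings")
            elif not keywords:
                problems.append(f"{category} > {name}: keyword list is empty")
    return problems


# "Fuel Cells" is deliberately wrong here: a bare string instead of a list.
sample = json.loads(
    '{"Environment": {"Acidic": ["pH < 7", "acidic"]},'
    ' "Process": {"Fuel Cells": "PEM"}}'
)
for problem in validate_ontology(sample):
    print(problem)
```

To check a real file, load it with json.load on the opened Ontology.json and pass the result to the helper.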
Each line is one search query sent to Semantic Scholar / other engines.
Example
Acidic earth abundant catalysts for water splitting
Nickel-based electrocatalysts for OER
Graph neural networks for chemical reaction prediction
The keyword_query.txt file contains keywords or short phrases (e.g., acidic HER water splitting) that the system uses to rank papers during search.
Example
acidic HER water splitting
The application requires API keys to access AI agents (OpenAI, Gemini, Claude, LLaMA, etc.) and external Journal APIs.
- For each AI agent, you should create a separate text file (e.g., openai_keys.txt, gemini_keys.txt, claude_keys.txt, llama_keys.txt).
- Each file can contain multiple API keys, one per line.
- The application will automatically iterate over these keys if one is rate-limited or exhausted.
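The key rotation described above can be pictured roughly as follows. This is a sketch of the general idea, not SOKEGraph's implementation: `parse_keys`, `RateLimited`, and `call_with_key_rotation` are hypothetical names, and a real client would raise its own provider-specific error type.

```python
class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit or quota error."""


def parse_keys(text):
    """Split key-file contents into keys: one per line, blank lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def call_with_key_rotation(keys, request):
    """Try request(key) with each key in order; move on when one is rate-limited."""
    last_error = None
    for key in keys:
        try:
            return request(key)
        except RateLimited as err:
            last_error = err  # remember the failure and try the next key
    raise RuntimeError("all API keys are rate-limited or exhausted") from last_error


# Demo: the first key is exhausted, the second one works.
def fake_request(key):
    if key == "key-one":
        raise RateLimited("quota exceeded")
    return f"ok with {key}"


keys = parse_keys("key-one\n\nkey-two\n")
print(call_with_key_rotation(keys, fake_request))
```

In practice the text passed to parse_keys would come from reading a file such as openai_keys.txt.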
Example – openai_keys.txt
sk-openai-xxxxxxxxxxxxxxxxxxxxxxxx
sk-openai-yyyyyyyyyyyyyyyyyyyyyyyy
Example – gemini_keys.txt
ya29.gemini-xxxxxxxxxxxxxxxx
ya29.gemini-yyyyyyyyyyyyyyyy
Example – claude_keys.txt
claude-xxxxxxxxxxxxxxxx
claude-yyyyyyyyyyyyyyyy
Example – llama_keys.txt
llama-xxxxxxxxxxxxxxxx
llama-yyyyyyyyyyyyyyyy
Example – journal_api_keys.txt
journal-abc123456789
journal-def987654321
- OpenAI:
  - Sign up at https://platform.openai.com.
  - Go to View API keys.
  - Create a new secret key and copy it into openai_keys.txt.
- Google Gemini (Vertex AI / Google AI Studio):
  - Go to Google AI Studio or Google Cloud Console.
  - Enable Gemini API.
  - Generate an API key and add it to gemini_keys.txt.
- Anthropic Claude:
  - Sign up at https://console.anthropic.com.
  - Generate an API key.
  - Save it into claude_keys.txt.
- Meta LLaMA (Together):
  - Go to Together AI.
  - Create an API key in the console.
  - Save it into llama_keys.txt.
- Journal API (e.g., Web of Science, Scopus, or other provider):
  - Log in to the provider’s portal.
  - Request an API token.
  - Save it into journal_api_keys.txt.
- Ollama:
  Ollama runs locally on your machine and does not require an API key. You just need to have Ollama installed and running.
👉 Keep all API key files private and never commit them to GitHub.
When running the app, simply upload the relevant file(s) in the Streamlit interface.
Provide your Neo4j connection details in a small JSON file.
Example — neo4j_credentials.json
{
  "uri": "bolt://localhost:7687",
  "username": "neo4j",
  "password": "YOUR_PASSWORD"
}
Recommended Layout
inputs/
Ontology.json
paper_query.txt
keyword_query.txt
apikeys_xxx.txt
neo4j_credentials.json
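Because the Neo4j credentials file is small and hand-edited, a quick sanity check before starting the pipeline can save a failed run. The sketch below uses only the standard library; `parse_neo4j_credentials` is a hypothetical helper, and the required keys are assumed from the neo4j_credentials.json example in this README. Note that strict JSON parsers, including Python's json module, reject a trailing comma after the last entry.

```python
import json

# Keys assumed from the neo4j_credentials.json example in this README.
REQUIRED_KEYS = ("uri", "username", "password")


def parse_neo4j_credentials(text):
    """Parse credentials JSON and check that the expected keys are present and non-empty."""
    try:
        creds = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"credentials are not valid JSON: {err}") from err
    if not isinstance(creds, dict):
        raise ValueError("credentials must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if not creds.get(key)]
    if missing:
        raise ValueError(f"missing or empty keys: {', '.join(missing)}")
    return creds


good = '{"uri": "bolt://localhost:7687", "username": "neo4j", "password": "YOUR_PASSWORD"}'
print(parse_neo4j_credentials(good)["uri"])

# A trailing comma after the last entry is rejected by the JSON parser.
bad = '{"uri": "bolt://localhost:7687", "username": "neo4j", "password": "x",}'
try:
    parse_neo4j_credentials(bad)
except ValueError as err:
    print("rejected:", err)
```

This only validates the file's contents. It does not confirm that the Neo4j server is running or that the password is correct.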
When you are done working, you can leave the environment by running:
conda deactivate
👉 Whenever you want to use the tool again, just activate the environment:
conda activate sokegraph
Then run the project as shown in Step 7.