GraphRAG Pipeline for 3d Mirror Symmetry

A hybrid graph and vector retrieval-augmented generation (RAG) system built over a theoretical physics PhD thesis on 3d $\mathcal{N}\leq 2$ mirror symmetry. It extends an existing RAG pipeline, moving from hierarchical vector retrieval with reranking to a graph-structured retrieval paradigm designed for multi-hop reasoning.

Note: the thesis itself is not currently available, but the content of the relevant papers can be found here.

Motivation

Standard RAG pipelines perform well on local semantic retrieval, but struggle with:

multi-hop theoretical dependencies
cross-referenced physical constructions
structured reasoning across definitions, dualities, and mappings

In this domain (supersymmetric gauge theories and mirror symmetry), answers often require composing information across multiple linked concepts. Hierarchical RAG with reranking improves relevance, but the retrieved context remains structurally flat, which limits reasoning depth.

We introduce a GraphRAG layer over scientific corpora, where:

nodes represent theories and operators
edges encode relations (duality, RG flows, symmetry maps)
retrieval operates over graph neighborhoods rather than isolated chunks

This enables multi-hop contextual expansion aligned with the structure of theoretical physics knowledge.
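As a rough illustration of neighborhood-based retrieval, the sketch below builds a toy typed graph with NetworkX (the graph library used later in the pipeline) and expands a seed node by one and two hops. The node names and the `neighborhood` helper are hypothetical placeholders, not entries or functions from the actual thesis graph.

```python
import networkx as nx

# Toy typed graph; node names are illustrative placeholders.
G = nx.MultiDiGraph()
G.add_node("Theory A", type="Theory")
G.add_node("Theory B", type="Theory")
G.add_node("Theory C", type="Theory")
G.add_node("Monopole operator", type="Operator")
G.add_node("Meson operator", type="Operator")

G.add_edge("Theory A", "Theory B", relation="mirror_of")
G.add_edge("Theory B", "Theory C", relation="rg_flows_to")
G.add_edge("Monopole operator", "Meson operator", relation="operator_map")


def neighborhood(graph, seed, hops=1):
    """Return the subgraph within `hops` edges of `seed`, ignoring edge direction."""
    return nx.ego_graph(graph, seed, radius=hops, undirected=True)


one_hop = neighborhood(G, "Theory A", hops=1)
two_hop = neighborhood(G, "Theory A", hops=2)
print(sorted(one_hop.nodes))
print(sorted(two_hop.nodes))
```

Increasing `hops` pulls in entities that no single chunk mentions together with the seed, which is the contextual expansion a flat retriever cannot provide.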

Corpus

Source: PhD thesis "The ABCD's of Mirror Symmetry", pages 39–327 (289 pages); front matter and bibliography excluded
Rationale for exclusion: Bibliography pages produced hallucinated summaries (e.g. holomorphic 3-forms, Calabi-Yau geometry) in preliminary runs, adding noise without signal
Parser: Multimodal PDF parsing and structured extraction using Gemini 3.1 Flash-Lite (handles TikZ quiver diagrams and LaTeX equations)
Embedding Model: BAAI/bge-large-en-v1.5 (ranks highly on the MTEB leaderboard, freely available, runs on a T4 GPU)
Index: VectorStoreIndex

Architecture

Notebook 1: Proof of Concept (`GraphRAG_PoC.ipynb`)

A summary of the paper "Planar Abelian mirror duals of $\mathcal{N}=2$ SQCD$_3$" was generated by claude-sonnet-4-5 to validate schema design prior to full corpus ingestion. This served as a synthetic validation corpus for testing structured extraction and graph construction.
llama-3.3-70b-versatile was used for structured entity/relation extraction into a Pydantic schema. The goal was to validate whether a typed ontology for physics concepts could be reliably induced from LLM outputs.
A typed knowledge graph was constructed using NetworkX and validated via multi-hop traversal queries. Early experiments confirmed that graph structure improves compositional retrieval over flat chunk-based RAG.
Initial runs retrieved local context correctly but missed longer operator chains in quiver gauge theories.
These misses were fixed via:
- manual seed augmentation for operator mappings
- improved extraction prompts for relation consistency

The resulting graph correctly supports multi-hop queries over operator mappings and extended quiver structures.
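A multi-hop query over operator mappings can be sketched as a path search restricted to a single edge type. This is a minimal example, assuming edges carry a `relation` attribute; the operator labels `O_1` to `O_4` and the `follow_relation` helper are hypothetical and not taken from the notebook.

```python
import networkx as nx

G = nx.DiGraph()
# Hypothetical chain of operator maps across successive dual descriptions.
G.add_edge("O_1", "O_2", relation="operator_map")
G.add_edge("O_2", "O_3", relation="operator_map")
G.add_edge("O_3", "O_4", relation="operator_map")
# A shortcut of a different relation type, ignored when following operator maps.
G.add_edge("O_1", "O_4", relation="symmetry_map")


def follow_relation(graph, source, target, relation):
    """Shortest path from source to target using only edges of one relation type."""
    allowed = [(u, v) for u, v, d in graph.edges(data=True) if d["relation"] == relation]
    typed = graph.edge_subgraph(allowed)
    return nx.shortest_path(typed, source, target)


chain = follow_relation(G, "O_1", "O_4", "operator_map")
print(" -> ".join(chain))
```

Filtering by relation type before searching keeps the traversal from taking a shorter route through an unrelated edge, so the returned chain reflects the operator map alone.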

A lightweight synthesis layer using llama-3.3-70b-versatile generates natural language responses from retrieved subgraphs, demonstrating early graph-conditioned generation capability.

Notebook 2: Hybrid Pipeline (`GraphRAG_full_wMLOPs.ipynb`)

The system reuses an existing vector-based RAG index (VectorStoreIndex) from a prior pipeline as a semantic retrieval backbone. This component is not retrained and is used as an external retrieval module.
A prototype validation stage was first constructed using:
- claude-sonnet-4-5 for chapter/page-level summaries
- llama-3.3-70b-versatile for structured entity and relation extraction on synthetic text
This stage was used exclusively to validate schema design, test extraction robustness, and identify postprocessing issues prior to full corpus ingestion.
After validation, the full PDF corpus was processed using Gemini 3.1 Flash-Lite as the primary extraction model, which was selected for its ability to handle dense mathematical content, figures, and structured physics text.
Extracted outputs were normalized against a Pydantic-validated schema: a typed ontology, hand-designed from the physics of the domain, with 3 node types (Theory, Quiver, Operator) and 5 edge types (mirror_of, rg_flows_to, abelian_dual_of, operator_map, symmetry_map). Postprocessing then removed duplicate entities, filtered hallucinated nodes, normalized relations, and corrected inconsistencies.
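A typed ontology of this shape might be expressed in Pydantic roughly as follows. The node and edge type names come from the description above; the model names (`Entity`, `Relation`, `Extraction`) and field names (`name`, `type`, `source`, `target`) are assumptions made for illustration, not the schema used in the notebook.

```python
from typing import List, Literal

from pydantic import BaseModel, ValidationError

# The three node types and five edge types of the ontology.
NodeType = Literal["Theory", "Quiver", "Operator"]
EdgeType = Literal[
    "mirror_of", "rg_flows_to", "abelian_dual_of", "operator_map", "symmetry_map"
]


class Entity(BaseModel):
    name: str
    type: NodeType


class Relation(BaseModel):
    source: str
    target: str
    type: EdgeType


class Extraction(BaseModel):
    entities: List[Entity]
    relations: List[Relation]


# Hypothetical LLM output, already parsed from JSON.
raw = {
    "entities": [
        {"name": "Theory A", "type": "Theory"},
        {"name": "Theory B", "type": "Theory"},
    ],
    "relations": [{"source": "Theory A", "target": "Theory B", "type": "mirror_of"}],
}
parsed = Extraction(**raw)

# An edge type outside the ontology is rejected at validation time.
try:
    Relation(source="Theory A", target="Theory B", type="related_to")
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

Constraining both type fields with `Literal` means an extraction that invents a new relation name fails validation instead of silently adding an untyped edge to the graph.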
A typed knowledge graph was constructed using NetworkX, enabling multi-hop traversal over physics entities such as operators, dualities, and quiver gauge structures.
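Before records are loaded into the graph, the cleanup steps can be approximated with plain Python. The sketch below covers duplicate entity removal, relation normalization, and dropping relations whose endpoints were never extracted as entities. Treating such dangling endpoints as hallucinated is an assumption of this example, and the function names and normalization rule are hypothetical.

```python
def normalize_name(name):
    """Collapse whitespace and case so trivially different spellings match."""
    return " ".join(name.split()).lower()


def postprocess(entities, relations):
    """Deduplicate entities and keep only relations between known entities.

    entities:  list of (name, node_type) tuples
    relations: list of (source, relation, target) tuples
    """
    canonical = {}  # normalized name -> first-seen (name, node_type)
    for name, node_type in entities:
        canonical.setdefault(normalize_name(name), (name, node_type))

    cleaned, seen = [], set()
    for source, relation, target in relations:
        s, t = normalize_name(source), normalize_name(target)
        if s not in canonical or t not in canonical:
            continue  # assumption: an unknown endpoint marks a hallucinated node
        key = (s, relation.strip().lower(), t)
        if key not in seen:
            seen.add(key)
            cleaned.append((canonical[s][0], key[1], canonical[t][0]))
    return list(canonical.values()), cleaned


entities = [("Theory A", "Theory"), ("theory  a", "Theory"), ("Theory B", "Theory")]
relations = [
    ("Theory A", "mirror_of", "Theory B"),
    ("theory a", "Mirror_Of ", "Theory B"),  # duplicate after normalization
    ("Theory A", "mirror_of", "Theory Z"),   # endpoint never extracted
]
nodes, edges = postprocess(entities, relations)
print(nodes)
print(edges)
```

The cleaned `nodes` and `edges` lists can then be passed to `add_node` and `add_edge` on a NetworkX graph.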

The retrieval system is hybrid, combining:
- vector similarity search over the legacy embedding index
- graph-based retrieval
The retrieval strategy uses a soft routing mechanism (Gemini 3.1 Flash-Lite), which classifies each query into one of:
- vector-based retrieval
- graph-based reasoning
- hybrid execution (allowed when both signals are relevant)
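The routing decision can be sketched as scoring a query for each retrieval path and choosing hybrid execution when both scores pass a threshold. In the pipeline the classification is done by an LLM; here a crude keyword scorer stands in for it so the example runs offline. The keyword lists, the threshold, and the function names are all assumptions.

```python
from typing import Callable, Dict


def keyword_scorer(query: str) -> Dict[str, float]:
    """Stand-in for the LLM router: crude keyword scores in [0, 1]."""
    q = query.lower()
    graph_terms = ("dual", "mirror", "flows to", "maps to", "related")
    vector_terms = ("define", "what is", "explain", "summar")
    graph = min(1.0, sum(term in q for term in graph_terms) / 2)
    vector = min(1.0, sum(term in q for term in vector_terms) / 2)
    return {"graph": graph, "vector": vector}


def route(
    query: str,
    scorer: Callable[[str], Dict[str, float]] = keyword_scorer,
    threshold: float = 0.5,
) -> str:
    """Pick a route; choose hybrid when both signals pass the threshold."""
    scores = scorer(query)
    use_graph = scores["graph"] >= threshold
    use_vector = scores["vector"] >= threshold
    if use_graph and use_vector:
        return "hybrid"
    if use_graph:
        return "graph"
    return "vector"  # default when neither signal is strong


for q in (
    "What is a monopole operator?",
    "Which theory is the mirror dual of theory A?",
    "Explain how the mirror map acts on monopole operators",
):
    print(route(q), "<-", q)
```

Because `route` accepts any `scorer` callable, the keyword stand-in could be swapped for a function that prompts an LLM and parses its scores, without changing the routing logic.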
The final system operates at the level of retrieval + fusion: the synthesis layer uses Gemini 3.1 Flash-Lite to merge graph neighborhood context and vector-retrieved passages into a natural language response.
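One simple way to prepare the fusion step is to serialize both context sources into a single prompt before calling the synthesis model. The sketch below shows only this prompt assembly; the prompt wording, the triple format, and the `build_fusion_prompt` helper are hypothetical, and the model call itself is omitted.

```python
def build_fusion_prompt(question, triples, passages, max_passages=3):
    """Assemble one prompt from graph triples and vector-retrieved passages."""
    graph_lines = [f"- {s} --{r}--> {t}" for s, r, t in triples]
    passage_lines = [
        f"[{i}] {text.strip()}"
        for i, text in enumerate(passages[:max_passages], start=1)
    ]
    return "\n".join(
        ["Answer the question using only the context below.", "", "Graph context:"]
        + (graph_lines or ["- (none)"])
        + ["", "Retrieved passages:"]
        + (passage_lines or ["(none)"])
        + ["", f"Question: {question}"]
    )


prompt = build_fusion_prompt(
    "How are theory A and theory B related?",
    triples=[("Theory A", "mirror_of", "Theory B")],
    passages=["Passage one. ", "Passage two.", "Passage three.", "Passage four."],
)
print(prompt)
```

Keeping the graph triples and the passages in separately labeled sections lets the model tell structural facts apart from free text, and `max_passages` caps the prompt length.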
The system includes an experimental MLOps tracking layer using MLflow, used to log and compare entity/relation extraction prompts, graph construction variants, retrieval configurations (vector vs graph vs hybrid), and LLM model settings and routing behavior.
The inference layer is exposed via a Flask-based API service, which serves as the backend interface for vector retrieval queries, graph-based multi-hop queries, and hybrid retrieval requests.
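A single-endpoint Flask service covering the three request types might look roughly like this. The `/query` path, the JSON field names, and the `answer` placeholder are assumptions made for illustration; the actual API surface is not documented here.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

VALID_MODES = {"vector", "graph", "hybrid"}


def answer(query, mode):
    """Placeholder for the retrieval + synthesis pipeline."""
    return f"[{mode}] answer for: {query}"


@app.route("/query", methods=["POST"])
def handle_query():
    # Tolerate missing or malformed JSON bodies instead of raising.
    payload = request.get_json(silent=True) or {}
    text = str(payload.get("query", "")).strip()
    mode = payload.get("mode", "hybrid")
    if not text:
        return jsonify(error="missing 'query'"), 400
    if mode not in VALID_MODES:
        return jsonify(error=f"unknown mode: {mode}"), 400
    return jsonify(mode=mode, answer=answer(text, mode))
```

During development the app can be exercised without starting a server via `app.test_client()`, for example `app.test_client().post("/query", json={"query": "...", "mode": "graph"})`.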
A Gradio-based interactive interface was prototyped but not fully deployed due to dependency conflicts between Gradio, uvicorn, and Colab's Python 3.12 runtime. The Flask API serves as the stable inference interface for v1.

Future Work

Improved chunking strategies for equation-dense and LaTeX-heavy sections to better preserve mathematical structure during entity and relation extraction (e.g., equation-aware segmentation and structure-preserving parsing).
RAGAS evaluation: systematic evaluation of the full pipeline once generation is stabilized, including faithfulness, answer relevancy, context precision, and context recall for both vector and graph-based retrieval paths.
Exploration of domain-adaptive LLM fine-tuning or continued pretraining on the corpus to improve extraction quality, particularly for operator-level mappings and physics-specific entity consistency.

The fine-tuned LLM project can be found here.

Gradio UI stabilisation and deployment separation from the extraction environment.

Stack

Python 3.12, Google Colab T4
LlamaIndex (VectorStoreIndex), PyMuPDF, HuggingFace Transformers
BAAI/bge-large-en-v1.5 (embeddings)
Anthropic claude-sonnet-4-5 (summary generation)
Google Gemini 3.1 Flash-Lite (multimodal PDF parsing and schema extraction, soft router)
Meta llama-3.3-70b-versatile (schema extraction)
NetworkX (knowledge graph)
Pydantic (schema validation)
MLflow (experiment tracking)
Flask API
Gradio
