A hybrid graph- and vector-based retrieval-augmented generation (RAG) system built over a theoretical physics PhD thesis on 3d
Note: the thesis itself is not available now, but the content of the relevant papers can be found here.
Standard RAG pipelines perform well on local semantic retrieval, but struggle with:
- multi-hop theoretical dependencies
- cross-referenced physical constructions
- structured reasoning across definitions, dualities, and mappings
In this domain (supersymmetric gauge theories and mirror symmetry), answers often require composing information across multiple linked concepts. HierarchicalRAG + reranking improves relevance, but remains fundamentally flat in structure, limiting reasoning depth.
We introduce a GraphRAG layer over scientific corpora, where:
- nodes represent theories and operators
- edges encode relations (duality, RG flows, symmetry maps)
- retrieval operates over graph neighborhoods rather than isolated chunks
This enables multi-hop contextual expansion aligned with the structure of theoretical physics knowledge.
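
To make neighborhood-based retrieval concrete, here is a minimal sketch using NetworkX. The node names are illustrative placeholders rather than thesis content, `neighborhood_context` is a hypothetical helper, and treating edges as bidirectional during expansion is an assumption of this sketch.

```python
import networkx as nx

# Toy graph: node names are placeholders, edge labels come from the ontology.
G = nx.MultiDiGraph()
for name in ("Theory_A", "Theory_B", "Theory_C", "Theory_D"):
    G.add_node(name, kind="Theory")
G.add_edge("Theory_A", "Theory_B", relation="mirror_of")
G.add_edge("Theory_B", "Theory_C", relation="rg_flows_to")
G.add_edge("Theory_C", "Theory_D", relation="abelian_dual_of")


def neighborhood_context(graph, seed, hops=2):
    """Return sorted (source, relation, target) triples within `hops` of `seed`."""
    # undirected=True follows edges in both directions, so a related theory is
    # reachable whichever way the relation happened to be stored.
    sub = nx.ego_graph(graph, seed, radius=hops, undirected=True)
    return sorted((u, d["relation"], v) for u, v, d in sub.edges(data=True))


context = neighborhood_context(G, "Theory_A", hops=2)
for triple in context:
    print(triple)
```

With `hops=2`, the seed `Theory_A` pulls in the chain through `Theory_B` to `Theory_C` but stops short of `Theory_D`, which is the kind of bounded multi-hop expansion a flat chunk retriever cannot express.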
- Source: PhD thesis "The ABCD's of Mirror Symmetry", Pages 39–327 (289 pages) — front matter and bibliography excluded
- Rationale for exclusion: Bibliography pages produced hallucinated summaries (e.g. holomorphic 3-forms, Calabi-Yau geometry) in preliminary runs, adding noise without signal
- Parser: Multimodal PDF parsing and structured extraction using Gemini 3.1 Flash-Lite (handles TikZ quiver diagrams and LaTeX equations)
- Embedding Model: `BAAI/bge-large-en-v1.5` (top-ranked on the MTEB leaderboard, free, runs on T4)
- Index: `VectorStoreIndex`
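
As a quick check of the page range quoted in the list above, the sketch below maps the inclusive, 1-based thesis pages to the 0-based indices that PDF libraries such as PyMuPDF expect. The helper name `zero_based_page_indices` is hypothetical.

```python
# Thesis pages are quoted 1-based and inclusive; PDF libraries index from 0.
FIRST_PAGE, LAST_PAGE = 39, 327


def zero_based_page_indices(first, last):
    """Map an inclusive 1-based page range to 0-based page indices."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return range(first - 1, last)


indices = zero_based_page_indices(FIRST_PAGE, LAST_PAGE)
print(len(indices), indices[0], indices[-1])  # 289 38 326
```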
- A summary of the paper "Planar Abelian mirror duals of $\mathcal{N}=2$ SQCD$_3$" was generated by `claude-sonnet-4-5` to validate schema design prior to full corpus ingestion. This served as a synthetic validation corpus for testing structured extraction and graph construction.
- `llama-3.3-70b-versatile` was used for structured entity/relation extraction into a Pydantic schema. The goal was to validate whether a typed ontology for physics concepts could be reliably induced from LLM outputs.
- A typed knowledge graph was constructed using NetworkX and validated via multi-hop traversal queries. Early experiments confirmed that graph structure improves compositional retrieval over flat chunk-based RAG.
- Initial runs showed correct local retrieval but missed longer operator chains in quiver gauge theories.
- These gaps were fixed via:
  - manual seed augmentation for operator mappings
  - improved extraction prompts for relation consistency
The resulting graph correctly supports multi-hop queries over operator mappings and extended quiver structures.
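
The effect of manual seed augmentation on longer chains can be sketched without any dependencies. The operator names are placeholders, and `merge_triples` and `find_chain` are hypothetical helpers written for this illustration, not functions from the project.

```python
from collections import defaultdict, deque

# Placeholder triples; the second list plays the role of hand-written seeds.
extracted = [("Op_1", "operator_map", "Op_2")]
manual_seeds = [
    ("Op_2", "operator_map", "Op_3"),
    ("Op_1", "operator_map", "Op_2"),  # duplicate of an extracted triple
]


def merge_triples(extracted, seeds):
    """Union of extracted and seed triples, order-preserving, no duplicates."""
    return list(dict.fromkeys(extracted + seeds))


def find_chain(triples, start, goal, relation):
    """Breadth-first search for a path from start to goal along one edge type."""
    adjacency = defaultdict(list)
    for src, rel, dst in triples:
        if rel == relation:
            adjacency[src].append(dst)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adjacency[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of this edge type connects start to goal


before = find_chain(extracted, "Op_1", "Op_3", "operator_map")
after = find_chain(merge_triples(extracted, manual_seeds), "Op_1", "Op_3", "operator_map")
print(before, after)
```

Before augmentation the two-hop query returns nothing because the second link was never extracted. After merging the seeds, the full chain is recovered.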

- A lightweight synthesis layer using `llama-3.3-70b-versatile` generates natural language responses from retrieved subgraphs, demonstrating early graph-conditioned generation capability.
- The system reuses an existing vector-based RAG index (`VectorStoreIndex`) from a prior pipeline as a semantic retrieval backbone. This component is not retrained and is used as an external retrieval module.
- A prototype validation stage was first constructed using:
  - `claude-sonnet-4-5` for chapter/page-level summaries
  - `llama-3.3-70b-versatile` for structured entity and relation extraction on synthetic text

  This stage was used exclusively to validate schema design, test extraction robustness, and identify postprocessing issues prior to full corpus ingestion.
- After validation, the full PDF corpus was processed using `Gemini 3.1 Flash-Lite` as the primary extraction model, selected for its ability to handle dense mathematical content, figures, and structured physics text.
- Extracted outputs were normalized against a Pydantic-validated schema: a typed ontology, hand-designed from the physics of the domain, with 3 node types (Theory, Quiver, Operator) and 5 edge types (mirror_of, rg_flows_to, abelian_dual_of, operator_map, symmetry_map). Postprocessing then removed duplicate entities, filtered hallucinated nodes, normalized relations, and corrected inconsistencies.
- A typed knowledge graph was constructed using NetworkX, enabling multi-hop traversal over physics entities such as operators, dualities, and quiver gauge structures.
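
The schema validation and cleanup steps described above might look roughly like the following. The node and edge type names come from the ontology in this README. The field names (`node_type`, `edge_type`), the helpers `parse_records` and `clean`, and the specific cleanup rules (case-insensitive deduplication, dropping relations whose endpoints failed validation) are assumptions of this sketch, not the project's actual code.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError

NodeType = Literal["Theory", "Quiver", "Operator"]
EdgeType = Literal[
    "mirror_of", "rg_flows_to", "abelian_dual_of", "operator_map", "symmetry_map"
]


class Entity(BaseModel):
    name: str
    node_type: NodeType


class Relation(BaseModel):
    source: str
    target: str
    edge_type: EdgeType


def parse_records(model, records):
    """Validate raw dicts against `model`, skipping records that fail."""
    valid = []
    for record in records:
        try:
            valid.append(model(**record))
        except ValidationError:
            continue  # off-ontology types are treated as extraction noise
    return valid


def clean(entities, relations):
    """Deduplicate entities by case-insensitive name; drop dangling relations."""
    unique = {}
    for entity in entities:
        unique.setdefault(entity.name.strip().lower(), entity)
    known = {entity.name for entity in unique.values()}
    kept = [r for r in relations if r.source in known and r.target in known]
    return list(unique.values()), kept


# Placeholder LLM output: one duplicate, one off-ontology node, one bad edge type.
raw_entities = [
    {"name": "Theory_A", "node_type": "Theory"},
    {"name": "theory_a", "node_type": "Theory"},
    {"name": "Theory_B", "node_type": "Theory"},
    {"name": "Manifold_Y", "node_type": "Manifold"},
]
raw_relations = [
    {"source": "Theory_A", "target": "Theory_B", "edge_type": "mirror_of"},
    {"source": "Theory_A", "target": "Manifold_Y", "edge_type": "mirror_of"},
    {"source": "Theory_A", "target": "Theory_B", "edge_type": "related_to"},
]

entities, relations = clean(
    parse_records(Entity, raw_entities), parse_records(Relation, raw_relations)
)
print([e.name for e in entities], len(relations))
```

The surviving entities and relations can then be loaded into a NetworkX graph, with `node_type` and `edge_type` stored as node and edge attributes.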
- The retrieval system is hybrid, combining:
  - vector similarity search over the legacy embedding index
  - graph-based retrieval
- Retrieval strategy uses a soft routing mechanism (`Gemini 3.1 Flash-Lite`), which classifies queries into:
  - vector-based retrieval
  - graph-based reasoning
  - hybrid execution (allowed when both signals are relevant)
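
The three-way routing decision can be illustrated with a toy scorer. In the real system the classification is done by the LLM. Here a keyword heuristic stands in for it so the sketch runs offline. The cue words, the `scorer` interface, and the fallback to hybrid when no signal fires are all assumptions of this example.

```python
import re

# Hypothetical cue words standing in for the LLM classifier.
GRAPH_CUES = {"dual", "mirror", "flows", "map", "maps", "chain"}
VECTOR_CUES = {"define", "definition", "explain", "describe", "summarize"}


def keyword_scores(query):
    """Toy scorer: count cue words for each retrieval route."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return {"graph": len(words & GRAPH_CUES), "vector": len(words & VECTOR_CUES)}


def route(query, scorer=keyword_scores):
    """Return 'graph', 'vector', or 'hybrid' for a query."""
    scores = scorer(query)
    if scores["graph"] and not scores["vector"]:
        return "graph"
    if scores["vector"] and not scores["graph"]:
        return "vector"
    # Both signals fired, or neither did: run both retrievers.
    return "hybrid"


print(route("Which theory is the mirror dual of theory A?"))
print(route("Define a Coulomb branch operator"))
print(route("Explain how the operator map relates mirror theories"))
```

Swapping `keyword_scores` for a function that asks an LLM for per-route relevance keeps the same control flow while replacing the heuristic.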
- The final system operates at the level of retrieval + fusion: the synthesis layer uses Gemini 3.1 Flash-Lite to merge graph neighborhood context and vector-retrieved passages into a natural language response.
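
One plausible shape for the fusion step is a single prompt that lays out both context sources before the question. The sketch below only builds the prompt string and leaves the model call out. `build_fusion_prompt` and the prompt wording are hypothetical.

```python
def build_fusion_prompt(question, triples, passages):
    """Assemble one prompt from graph triples and vector-retrieved passages."""
    graph_lines = [f"- {s} --{r}--> {t}" for s, r, t in triples]
    passage_lines = [f"[{i}] {p.strip()}" for i, p in enumerate(passages, start=1)]
    sections = [
        "Answer the question using only the context below.",
        "Graph context:",
        "\n".join(graph_lines) or "(none)",
        "Retrieved passages:",
        "\n".join(passage_lines) or "(none)",
        f"Question: {question}",
    ]
    return "\n\n".join(sections)


prompt = build_fusion_prompt(
    "How are Theory_A and Theory_B related?",
    triples=[("Theory_A", "mirror_of", "Theory_B")],
    passages=["  Placeholder passage text from the vector index.  "],
)
print(prompt)
```

Keeping the two sources in separately labeled sections lets the model distinguish structural facts from free-text evidence, and the `(none)` marker covers queries routed to only one retriever.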
- The system includes an experimental MLOps tracking layer built on MLflow, which logs and compares entity/relation extraction prompts, graph construction variants, retrieval configurations (vector vs graph vs hybrid), LLM settings, and routing behavior.
- The inference layer is exposed via a Flask-based API service, which serves as the backend interface for vector retrieval queries, graph-based multi-hop queries, and hybrid retrieval requests.
- A Gradio-based interactive interface was prototyped but not fully deployed due to dependency conflicts between Gradio, uvicorn, and Colab's Python 3.12 runtime. The Flask API serves as the stable inference interface for v1.
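
A minimal sketch of what such a Flask endpoint could look like is shown below. The `/query` route, the `question` and `mode` payload fields, and the stub retrievers `vector_search` and `graph_search` are hypothetical. The actual API may be organized differently.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


# Hypothetical stand-ins for the real vector and graph retrievers.
def vector_search(question):
    return [f"passage for: {question}"]


def graph_search(question):
    return [("Theory_A", "mirror_of", "Theory_B")]


@app.route("/query", methods=["POST"])
def query():
    payload = request.get_json(silent=True) or {}
    question = str(payload.get("question", "")).strip()
    mode = payload.get("mode", "hybrid")
    if not question:
        return jsonify(error="'question' is required"), 400
    if mode not in {"vector", "graph", "hybrid"}:
        return jsonify(error=f"unknown mode: {mode}"), 400
    result = {"mode": mode}
    if mode in {"vector", "hybrid"}:
        result["passages"] = vector_search(question)
    if mode in {"graph", "hybrid"}:
        result["triples"] = graph_search(question)
    return jsonify(result)


if __name__ == "__main__":
    app.run(port=5000)
```

A client would POST JSON such as `{"question": "...", "mode": "graph"}` to `/query`. Omitting `mode` falls back to hybrid retrieval in this sketch.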
- Improved chunking strategies for equation-dense and LaTeX-heavy sections to better preserve mathematical structure during entity and relation extraction (e.g., equation-aware segmentation and structure-preserving parsing).
- RAGAS evaluation: systematic evaluation of the full pipeline once generation is stabilized, including faithfulness, answer relevancy, context precision, and context recall for both vector and graph-based retrieval paths.
- Exploration of domain-adaptive LLM fine-tuning or continued pretraining on the corpus to improve extraction quality, particularly for operator-level mappings and physics-specific entity consistency.
The fine-tuned LLM project can be found here.
- Gradio UI stabilisation and deployment separation from the extraction environment.
- Python 3.12, Google Colab T4
- LlamaIndex (`VectorStoreIndex`), PyMuPDF, HuggingFace Transformers
- `BAAI/bge-large-en-v1.5` (embeddings)
- Anthropic `claude-sonnet-4-5` (summary generation)
- Google `Gemini 3.1 Flash-Lite` (multimodal PDF parsing and schema extraction, soft router)
- Meta `llama-3.3-70b-versatile` (schema extraction)
- NetworkX (knowledge graph)
- Pydantic (schema validation)
- MLflow
- Flask API
- Gradio