PRAG is a research codebase that combines retrieval-augmented generation (RAG) over medical textbooks with a Paninian rule engine inspired by classical Sanskrit grammar (utsarga-apavada, anuvrtti, paribhasha, nitya-anitya, antaranga-bahiranga). Every answer includes a full auditable rule trace — the main research contribution.
Built on the MedQA USMLE-style multiple-choice dataset (Jin et al., 2020).
| Approach | What it does |
|---|---|
| Standard RAG | Retrieve textbook chunks → answer |
| PRAG | Retrieve → apply Paninian clinical rules → answer using rule-approved context only |
Rules govern drug contraindications, pregnancy safety, dosage limits, diagnostic red flags, and guideline conflicts — with explainable traces for every decision.
MedQA Question
│
▼
TextbookStore (FAISS + sentence-transformers) ──► top-k chunks
│
▼
PaniniRuleEngine (32 medical rules) ──► filter / block / warn
│
▼
MCQ Answerer (FLAN-T5 default, or BiomedBERT) ──► PRAG answer + rule trace
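
The filter / block / warn step in the diagram can be sketched roughly as follows. This is a minimal illustration and not the repository's `PaniniRuleEngine`: the `Rule` and `RuleResult` classes, the `apply_rules` function, the keyword matching, and the rule ID `DEMO_001` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical rule: fires when every trigger term occurs in the query."""
    rule_id: str
    action: str                 # "block" drops matching chunks, "warn" only logs them
    triggers: tuple[str, ...]   # terms that must all appear in the query
    banned: tuple[str, ...]     # terms that mark a retrieved chunk as unsafe


@dataclass
class RuleResult:
    approved: list[str]
    trace: list[str] = field(default_factory=list)


def apply_rules(query: str, chunks: list[str], rules: list[Rule]) -> RuleResult:
    """Filter retrieved chunks and record which rule touched which chunk."""
    q = query.lower()
    result = RuleResult(approved=list(chunks))
    for rule in rules:
        if not all(t in q for t in rule.triggers):
            continue  # rule does not apply to this question
        hits = [c for c in result.approved
                if any(b in c.lower() for b in rule.banned)]
        for chunk in hits:
            result.trace.append(f"{rule.rule_id} [{rule.action}] {chunk[:40]}")
            if rule.action == "block":
                result.approved.remove(chunk)
    return result


# Assumed demo rule and chunks, for illustration only.
rules = [Rule("DEMO_001", "block", ("pregnant",), ("warfarin",))]
chunks = [
    "Warfarin is used for anticoagulation.",
    "Heparin does not cross the placenta.",
]
out = apply_rules("Pregnant woman with DVT", chunks, rules)
print(out.approved)
print(out.trace)
```

Only the chunks left in `approved` would be passed to the answerer, and `trace` plays the role of the auditable rule trace.
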
| Module | Path | Purpose |
|---|---|---|
| Question loader | src/data/medqa_loader.py | US train/dev/test JSONL |
| Textbook store | src/knowledge/textbook_store.py | Chunk, embed, FAISS retrieve |
| Rule engine | src/rules/paninian_rule_engine.py | 32 Paninian-governed clinical rules |
| Pipeline | src/prag_pipeline.py | End-to-end PRAG vs standard RAG |
| Ablation study | src/ablation_study.py | Four-mode ablation (A–D) on 170 safety questions |
| Targeted eval | src/targeted_eval.py | Keyword-filtered safety evaluation |
| Paper numbers | src/paper_numbers.py | Export tables from experiment JSON |
git clone https://github.com/yuvrajrajput/PRAG.git
cd PRAG
git checkout development # active dev branch
pip install -r requirements.txt

Download the MedQA data from Google Drive and extract it to MedQA/data/data_clean/.
pip install gdown
mkdir -p MedQA/data
gdown "https://drive.google.com/uc?id=1ImYUSLk9JbgHXOemfvyiDiirluZHPeQw" -O "MedQA/data/medqa_data.zip"
unzip MedQA/data/medqa_data.zip -d MedQA/data/

python src/knowledge/textbook_store.py

Saves to data/vector_store/ (18 textbooks, ~51k chunks).
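
To illustrate the chunking step that precedes embedding, here is a minimal sketch of overlapping word-window chunking. The README does not state how `textbook_store.py` splits the textbooks, so the function name `chunk_words`, the window size, and the overlap are assumptions made for this example.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Tiny demo with assumed parameters so the windows are easy to inspect.
demo = chunk_words("a b c d e f g h i j", size=4, overlap=2)
print(demo)
```

Overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk, at the cost of a larger index.
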
# Single question demo
python src/prag_pipeline.py
# Benchmark PRAG vs standard RAG (50 dev questions, FLAN-T5)
python src/prag_pipeline.py --compare 50 --split dev --model flan-t5

Results saved to outputs/benchmark_results.json.
# Four-mode ablation on 170 safety-critical questions
python src/ablation_study.py
# Targeted safety evaluation
python src/targeted_eval.py --split dev --model flan-t5
# Export paper tables from results JSON
python src/paper_numbers.py

| Branch | Purpose |
|---|---|
| main | Stable releases |
| development | Active research (use this for contributions) |
| PRAG | Legacy initial branch (not updated) |
On 170 safety-critical questions (pregnancy, renal failure, NSAIDs, anticoagulants, paediatric contraindications), standard RAG retrieval actively degrades performance compared to the base model alone:
| Mode | Description | Accuracy |
|---|---|---|
| A — model only | No retrieval, no rules | 24.7% (42/170) |
| B — standard RAG | Retrieval, no rules | 17.6% (30/170) |
| C — rules only | Rules, no retrieval | 25.9% (44/170) |
| D — full PRAG | Retrieval + rules | 18.8% (32/170) |
RAG drops accuracy by 7.1 percentage points versus the base model on safety-critical cases. Full PRAG improves over standard RAG by +1.2 pp and prevents dangerous answers in clinically critical cases (see Finding 3).
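
The percentage-point figures can be checked directly from the counts in the table. The snippet below is a small sketch that only uses the numbers reported above; the helper name `pp_diff` is made up for this example.

```python
TOTAL = 170
# Correct answers per ablation mode, taken from the table above.
correct = {"A": 42, "B": 30, "C": 44, "D": 32}

accuracy = {mode: 100 * n / TOTAL for mode, n in correct.items()}


def pp_diff(a: str, b: str) -> float:
    """Percentage-point difference accuracy[a] - accuracy[b], one decimal."""
    return round(accuracy[a] - accuracy[b], 1)


for mode, acc in accuracy.items():
    print(f"Mode {mode}: {acc:.1f}%")
print("B vs A:", pp_diff("B", "A"), "pp")  # standard RAG vs model only
print("D vs B:", pp_diff("D", "B"), "pp")  # full PRAG vs standard RAG
```
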
| Metric | Safety-critical questions | General questions | Relative uplift |
|---|---|---|---|
| Rule firing rate | 57.65% | 37.75% | +52.7% |
| Context block rate | 37.65% | 17.79% | +111.6% |
Rules fire about 1.5 times as often, and block context about 2.1 times as often, on exactly the questions where mistakes are clinically dangerous, rather than uniformly across all questions.
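
The uplift column is a relative change, not a percentage-point difference. A short sketch of the arithmetic, using only the rates from the table (the function name `relative_uplift` is an assumption for this example):

```python
def relative_uplift(safety: float, general: float) -> float:
    """Relative increase of the safety-critical rate over the general rate, in %."""
    return round(100 * (safety - general) / general, 1)


firing = relative_uplift(57.65, 37.75)    # rule firing rate
blocking = relative_uplift(37.65, 17.79)  # context block rate
print(firing, blocking)
```
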
Two questions (dev_497, dev_678) were answered correctly only in
Mode D: wrong with the model alone, wrong with standard RAG, wrong with
rules alone, and correct only in full PRAG. Because rules without retrieval
(Mode C) also failed, this points to the rule hierarchy acting on retrieved
context as the contributing factor, rather than retrieval or model prior
knowledge alone.
In total, five questions show PRAG correct where standard RAG was wrong
(dev_401, dev_497, dev_678, dev_695, dev_822).
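
The two counts above answer different questions, which a set-based sketch makes explicit. The question IDs and per-mode results below are toy values, not the project's data, and the schema of `outputs/ablation_results.json` is not assumed here.

```python
def only_correct_in(target: str, correct_by_mode: dict[str, set[str]]) -> set[str]:
    """IDs answered correctly in `target` and in no other mode."""
    others = set().union(
        *(ids for mode, ids in correct_by_mode.items() if mode != target)
    )
    return correct_by_mode[target] - others


# Toy example: hypothetical question IDs per ablation mode.
toy = {
    "A": {"q1", "q2"},
    "B": {"q2"},
    "C": {"q1", "q3"},
    "D": {"q3", "q4", "q5"},
}

only_d = only_correct_in("D", toy)   # correct in D, wrong in A, B, and C
fixed_vs_rag = toy["D"] - toy["B"]   # correct in D, wrong in B
print(sorted(only_d), sorted(fixed_vs_rag))
```

Questions correct only in Mode D are always a subset of those where PRAG beats standard RAG, which is why the first count can be smaller than the second.
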
The most critical case (dev_678, eclampsia):
Query : 30-week pregnant woman, seizures, BP 170/102, hyperreflexia
RAG : Calcium gluconate ✗ (treatment for hypocalcaemia)
PRAG : Magnesium sulfate ✓ (correct first-line eclampsia treatment)
Rules fired:
RULE_P005 [block] — pregnancy-safety context filtered (Nitya, Antaranga)
Pāṇinian principle: Utsarga-Apavāda — exception overrides general rule
In a real clinical setting, the RAG answer could contribute to a patient not receiving the correct treatment for a life-threatening emergency.
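
The Utsarga-Apavāda principle named in the trace, where a specific exception overrides a general rule, can be sketched as a small precedence step. This is an illustrative assumption about how such precedence might be encoded, not the logic of RULE_P005 or of the repository's engine; `ClinicalRule`, `resolve`, and both rule IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ClinicalRule:
    rule_id: str
    applies: Callable[[dict], bool]     # predicate over the case features
    verdict: str
    exception_to: Optional[str] = None  # ID of the general rule this one overrides


def resolve(case: dict, rules: list[ClinicalRule]) -> list[ClinicalRule]:
    """Return firing rules, dropping general rules overridden by a firing exception."""
    fired = [r for r in rules if r.applies(case)]
    overridden = {r.exception_to for r in fired if r.exception_to}
    return [r for r in fired if r.rule_id not in overridden]


# Hypothetical general rule (utsarga) and its exception (apavada).
rules = [
    ClinicalRule("GEN_SEIZURE", lambda c: c.get("seizure", False),
                 "allow general seizure context"),
    ClinicalRule("EXC_PREGNANCY", lambda c: c.get("seizure", False)
                 and c.get("pregnant", False),
                 "restrict to eclampsia context", exception_to="GEN_SEIZURE"),
]

pregnant = resolve({"seizure": True, "pregnant": True}, rules)
general = resolve({"seizure": True, "pregnant": False}, rules)
print([r.rule_id for r in pregnant], [r.rule_id for r in general])
```

The exception only suppresses the general rule when its own condition holds, so the general rule still governs every case the exception does not cover.
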
Pre-computed results for all 170 questions are in outputs/ablation_results.json
and outputs/targeted_results.json. Run python src/paper_numbers.py to
regenerate copy-paste tables for the paper.
If you use PRAG in your research, please cite:
@software{rajput2026prag,
author = {Rajput, Yuvraj},
title = {{PRAG}: {P}aninian Retrieval-Augmented Generation
for Safety-Critical Medical {AI}},
year = {2026},
url = {https://github.com/yuvrajrajput/PRAG},
version = {1.0.0}
}

Also cite the MedQA dataset:
@article{jin2020disease,
title = {What Disease does this Patient Have? A Large-scale Open Domain
Question Answering Dataset from Medical Exams},
author = {Jin, Di and Pan, Eileen and Oufattole, Nassim and
Weng, Wei-Hung and Fang, Hanyi and Szolovits, Peter},
journal = {arXiv preprint arXiv:2009.13081},
year = {2020}
}

GitHub also reads citation metadata from CITATION.cff.
GPL-3.0 — see LICENSE.