- Swaminathan Chellappa
- Aaradhya Goyal
- Meeth Davda
- Aditya Sudhindra
- Karthik Venugopal
A three-level NLP pipeline for claim-level hallucination detection on the SciFact dataset. Each level progressively improves verdict accuracy by combining FActScore, uncertainty-quantified LLM scoring (uqlm), NLI, and ensemble methods to classify scientific claims as SUPPORT, CONTRADICT, or NEI (Not Enough Information).
- Project Structure
- System & Device
- Environment Setup
- Running the Code
- How Results Are Generated
- Results Summary
- Fine-tuning (TinyLlama + QLoRA)
.
├── main.py # Top-level runner: loads/reruns all levels and prints metrics
├── level1.py # Level 1: FActScore + uqlm baseline
├── level2.py # Level 2: BM25-gated FActScore + label-prompted uqlm (n=5)
├── level3/
│ ├── level3.py # Level 3: ensemble entry point
│ ├── factscore_runner.py # Level 3 FActScore module
│ ├── uqlm_runner.py # Level 3 uqlm module
│ ├── nli_runner.py # GPT-based NLI module
│ ├── tinyllama_runner.py # Fine-tuned TinyLlama NLI runner
│ ├── classifier_runner.py # RandomForest score classifier
│ ├── bm25_retriever.py # BM25 corpus retriever
│ ├── display_results.py # Result formatting utilities
│ └── finetuned_tinyllama/ # Fine-tuned adapter weights
├── finetuning/
│ └── V3_MNLI_QLoRA_Colab.ipynb # QLoRA fine-tuning notebook (Google Colab)
├── data/
│ └── scifact/ # SciFact corpus, claims, and SQLite DB
├── results/ # Saved JSON outputs and metrics per level
├── FActScore/ # Local fork of FActScore
├── requirements.txt
├── requirements_no_deps.txt
└── .env.example # Template for environment variables
| Item | Details |
|---|---|
| OS | macOS (Darwin 25.x) |
| Python | 3.13 |
| Hardware (inference) | Apple Silicon Mac (CPU/MPS) |
| Hardware (fine-tuning) | Google Colab (NVIDIA GPU with CUDA, A100/T4 recommended) |
| LLM API | OpenAI gpt-4o-mini via LangChain |
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements_no_deps.txt --no-deps

The FActScore/ directory contains a local fork of the FActScore library. Install it in editable mode without pulling in its pinned dependencies (which conflict with newer packages):

pip install -e FActScore/ --no-deps
python -m spacy download en_core_web_sm

Copy the example env file and fill in your credentials:

cp .env.example .env

Open .env and set at minimum:
OPENAI_API_KEY=sk-... # Required for FActScore (ChatGPT) and uqlm
DATA_DIR=data/scifact/data
RESULTS_DIR=results
SCIFACT_DB=data/scifact/scifact_corpus.db
SCIFACT_CACHE=data/scifact/factscore_cache
FACTSCORE_DATA=~/.cache/factscore
N_CLAIMS=10
RAND_SEED=42

All other values in .env.example are pre-configured with defaults. Adjust N_CLAIMS to change the number of claims processed.
The results/ directory contains pre-computed outputs. To print metrics without making any API calls:
python main.py

# Re-run all levels instead of loading saved results
python main.py --rerun

# Run only Level 1 and Level 2
python main.py --levels 1 2
# Run only Level 3
python main.py --levels 3

# Level 1 - load saved results
python level1.py
# Level 1 - re-run full pipeline
python level1.py --rerun
# Level 2 - load saved results
python level2.py
# Level 2 - re-run full pipeline
python level2.py --rerun
# Level 3 - must be run from the level3/ directory
cd level3
python level3.py

The pipeline evaluates scientific claims from the SciFact training set against three labels: SUPPORT, CONTRADICT, and NEI. Each level adds capabilities over the previous one.
A balanced sample of claims is drawn from data/scifact/data/claims_train.jsonl with equal representation across SUPPORT, CONTRADICT, and NEI labels (controlled by N_CLAIMS and RAND_SEED). The SciFact corpus (corpus.jsonl) is loaded into an in-memory dictionary and also indexed in a SQLite database (scifact_corpus.db) for FActScore retrieval.
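The balanced draw described above can be sketched as follows. This is a minimal illustration, not the project's sampling code: it assumes each record already carries a flat `label` field (how the label is derived from the raw claims file is not shown here), and it assumes N_CLAIMS is a total that is split evenly across the three labels. The function name `balanced_sample` is hypothetical.

```python
import random
from collections import defaultdict

LABELS = ("SUPPORT", "CONTRADICT", "NEI")


def balanced_sample(records, n_claims, seed):
    """Draw up to n_claims records with an equal share per label.

    Assumption: each record is a dict with a flat "label" key, and
    n_claims is divided evenly across the three labels.
    """
    per_label = n_claims // len(LABELS)

    # Group records by their label.
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)

    # A seeded generator makes the draw reproducible (the role of RAND_SEED).
    rng = random.Random(seed)
    sample = []
    for label in LABELS:
        pool = by_label[label]
        sample.extend(rng.sample(pool, min(per_label, len(pool))))
    rng.shuffle(sample)
    return sample


# Toy records standing in for parsed claims.
records = [{"id": i, "label": LABELS[i % 3]} for i in range(30)]
picked = balanced_sample(records, n_claims=9, seed=42)
```

Reusing the same seed returns the same claims, which keeps results comparable across levels.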
Two independent systems score each claim:
- FActScore (retrieval+ChatGPT): Decomposes the claim into atomic facts and uses BM25 retrieval over the SciFact corpus to check each atom against the cited abstract. The ratio of supported atoms is mapped to a verdict:
  - ratio >= 0.6 → SUPPORT
  - ratio <= 0.4 → CONTRADICT
  - otherwise → NEI
- uqlm (LongTextUQ with entailment scorer): Sends the claim + abstract to gpt-4o-mini with n=3 sampled responses. Each response is parsed for a leading label (SUPPORTED / CONTRADICTED / NOT_ENOUGH_INFO). The majority vote across samples becomes the final verdict.
Results are merged per claim and written to results/level1_results.json and results/level1_metrics.json.
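The two Level 1 verdict rules can be sketched in a few lines. The thresholds (0.6 and 0.4) and the three raw labels come from the description above; everything else is an assumption made for illustration: the function names are hypothetical, a claim with no atoms falls back to NEI, an unparseable response counts as NEI, and ties in the vote go to the label seen first.

```python
from collections import Counter

# Raw labels expected at the start of each sampled response,
# mapped onto the pipeline's verdict names.
LABEL_MAP = {
    "SUPPORTED": "SUPPORT",
    "CONTRADICTED": "CONTRADICT",
    "NOT_ENOUGH_INFO": "NEI",
}


def factscore_verdict(supported_atoms, total_atoms):
    """Map the ratio of supported atomic facts to a verdict."""
    if total_atoms == 0:
        return "NEI"  # assumption: nothing to score means no verdict
    ratio = supported_atoms / total_atoms
    if ratio >= 0.6:
        return "SUPPORT"
    if ratio <= 0.4:
        return "CONTRADICT"
    return "NEI"


def parse_leading_label(response):
    """Return the verdict named at the start of a sampled response."""
    text = response.strip().upper()
    for raw, verdict in LABEL_MAP.items():
        if text.startswith(raw):
            return verdict
    return "NEI"  # assumption: unparseable responses count as NEI


def majority_vote(responses):
    """Pick the most frequent parsed label across sampled responses."""
    votes = Counter(parse_leading_label(r) for r in responses)
    return votes.most_common(1)[0][0]


samples = ["SUPPORTED: the abstract states this.", "supported", "CONTRADICTED"]
verdict = majority_vote(samples)
```

With n=3 and three labels, a 1-1-1 split is possible, so a real implementation needs an explicit tie rule.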
Improvements over Level 1:
- BM25 relevance gate: Before calling the OpenAI API, a BM25 retriever scores each cited abstract against the claim. Claims with a maximum BM25 score below RELEVANCE_GATE_THRESHOLD are declared NEI immediately, saving API calls.
- Improved FActScore verdict logic: Adds a NEI-atom fraction threshold (FS_NEI_ATOM_FRAC): if most atoms are unscorable, the verdict defaults to NEI. A small-sample CONTRADICT adjustment tightens the threshold for claims with ≤ 2 definitive atoms.
- uqlm with explicit label prompting and n=5: The prompt structure is more directive, and 5 sampled responses are used for a more stable majority vote.
Results are written to results/level2_results.json and results/level2_metrics.json.
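The relevance gate can be illustrated with a small pure-Python Okapi BM25 scorer. This is a sketch under stated assumptions, not the code in bm25_retriever.py: the whitespace tokenizer, the `k1` and `b` values, and the threshold value of 1.0 are all placeholders, and the term statistics here are computed over the cited abstracts only rather than over the full corpus.

```python
import math
from collections import Counter

RELEVANCE_GATE_THRESHOLD = 1.0  # hypothetical value for illustration


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return the Okapi BM25 score of each document against the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(t) for t in tokenized) / n_docs) or 1.0
    # Number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def passes_gate(claim, abstracts):
    """True if any cited abstract is relevant enough to justify API calls."""
    if not abstracts:
        return False
    return max(bm25_scores(claim, abstracts)) >= RELEVANCE_GATE_THRESHOLD


abstracts = [
    "aspirin reduces the risk of heart attack in adults",
    "the moon orbits the earth",
]
relevant = passes_gate("aspirin lowers heart attack risk", abstracts)
irrelevant = passes_gate("quantum entanglement", abstracts)
```

A claim that fails the gate would be labeled NEI without any LLM call, which is where the API savings come from.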
Level 3 runs three sequential phases. Phases 1 and 2 each use a 3-signal ensemble of FActScore + uqlm + one NLI source; GPT and TinyLlama are alternatives and are not used simultaneously. Phase 3 trains a classifier on the numeric scores.
Phase 1 - FActScore + uqlm + GPT-4o-mini NLI
- FActScore (same gated logic as Level 2, via factscore_runner.py)
- uqlm (label-prompted, n=5, via uqlm_runner.py)
- GPT NLI (nli_runner.py): Sends claim + top-K BM25 abstract sentences to gpt-4o-mini with a strict NLI system prompt. Maps ENTAILMENT → SUPPORT, CONTRADICTION → CONTRADICT, NEUTRAL → NEI.
The three verdicts are combined: if all agree the result is used directly; if two agree the majority wins; if all three disagree, a confidence-weighted tiebreak is applied (uqlm weight 0.45, NLI fixed 0.30, FActScore proportional to evidence quality). Phase 1 results are saved to results/level3_results.json.
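The combination rule can be sketched as below. The uqlm and NLI weights (0.45 and 0.30) come from the text; the rest is assumed for illustration. In particular, `evidence_quality` is a hypothetical score in [0, 1], `FS_WEIGHT_SCALE` is a made-up scaling constant, and the sketch uses fixed weights only, without any per-signal confidence values the real tiebreak may apply.

```python
from collections import Counter

UQLM_WEIGHT = 0.45      # from the description above
NLI_WEIGHT = 0.30       # from the description above
FS_WEIGHT_SCALE = 0.50  # hypothetical: FActScore weight at full evidence quality


def combine_verdicts(factscore, uqlm, nli, evidence_quality):
    """Combine three verdicts by agreement, then by weighted tiebreak.

    evidence_quality is an assumed score in [0, 1] describing how well
    the FActScore verdict is backed by retrieved evidence.
    """
    votes = Counter([factscore, uqlm, nli])
    label, count = votes.most_common(1)[0]
    if count >= 2:
        # Unanimous or 2-of-3 majority.
        return label

    # All three disagree, so the labels are distinct and can key a dict.
    weights = {
        uqlm: UQLM_WEIGHT,
        nli: NLI_WEIGHT,
        factscore: FS_WEIGHT_SCALE * evidence_quality,
    }
    return max(weights, key=weights.get)


majority = combine_verdicts("SUPPORT", "SUPPORT", "NEI", evidence_quality=0.1)
weak_fs = combine_verdicts("SUPPORT", "CONTRADICT", "NEI", evidence_quality=0.2)
strong_fs = combine_verdicts("SUPPORT", "CONTRADICT", "NEI", evidence_quality=1.0)
```

Under these assumed constants, FActScore only wins a three-way split when its evidence quality is high enough to outweigh uqlm.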
Phase 2 - TinyLlama NLI replaces GPT NLI
Re-runs the same 3-signal ensemble but substitutes the GPT NLI call with a fine-tuned TinyLlama model (tinyllama_runner.py). The adapter is loaded from level3/finetuned_tinyllama/ on top of TinyLlama/TinyLlama-1.1B-Chat-v1.0 and generates a label token for each claim. Results are saved to results/level3_tinyllama_results.json. Phase 2 is skipped if TINYLLAMA_MODEL is not set in .env.
Phase 3 - RandomForest Score Classifier
A RandomForest classifier (classifier_runner.py) is trained on numeric features (FActScore ratio, uqlm entailment score, confidence) using 5-fold cross-validation (CV_N_SPLITS=5) with RF_N_ESTIMATORS=200 trees. It always runs once with FActScore + uqlm features only (no NLI). If Phase 2 ran, it also runs a second time with TinyLlama NLI features added.
The notebook finetuning/V3_MNLI_QLoRA_Colab.ipynb trains a QLoRA adapter on the MultiNLI dataset for claim verification. It is designed to run on Google Colab (requires a GPU).
Key settings:
- Base model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Quantization: 4-bit NF4 with double quantization (bitsandbytes)
- LoRA rank: r=16, alpha=32, targeting all attention and MLP projection layers
- Dataset: 5,000 train / 1,000 eval examples from MultiNLI
- Label mapping: entailment → SUPPORT, contradiction → CONTRADICT, neutral → NOT_ENOUGH_INFO
The trained adapter is saved and placed at level3/finetuned_tinyllama/ for use by tinyllama_runner.py at inference time.
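For orientation, the key settings above roughly correspond to the following configuration objects from the transformers and peft libraries. This is a sketch of how such a setup is commonly expressed, not a copy of the notebook: the `task_type` value and the exact module names (the standard Llama-style projection layers) are assumptions, and settings the README does not state, such as dropout and compute dtype, are left at library defaults.

```python
from peft import LoraConfig
from transformers import BitsAndBytesConfig

BASE_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# 4-bit NF4 quantization with double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# LoRA adapter on all attention and MLP projections.
# Assumption: Llama-style module names and a causal-LM task type.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
    task_type="CAUSAL_LM",
)

# MultiNLI label names mapped to the pipeline's target strings.
LABEL_MAPPING = {
    "entailment": "SUPPORT",
    "contradiction": "CONTRADICT",
    "neutral": "NOT_ENOUGH_INFO",
}
```

These objects would typically be passed to the model loader and to the adapter wrapper respectively; the notebook remains the authoritative source for the actual training setup.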