ragcitecheck is a lightweight tool for measuring citation and evidence stability in Retrieval-Augmented Generation (RAG) pipelines across repeated runs and configuration changes.
It helps answer a practical question: If I rerun the same RAG pipeline with a different retriever, chunk size, top-k, overlap, or prompt setting, do I still get the same evidence?
Research context: ragcitecheck is an open-source tool associated with peer-reviewed research accepted at the ACL 2026 EvalEval Workshop on evaluation of language models and systems.
Most RAG evaluation focuses on answer correctness, groundedness, retrieval recall, latency, or cost.
Those are useful, but they miss an important reliability question:
Is the cited evidence stable across runs?
A system may produce similar answers while silently changing:
- which documents it cites
- which spans inside documents it relies on
- how often evidence disappears entirely
ragcitecheck is designed to make that visible.
Given one or more run logs in JSONL format, ragcitecheck can:
- validate run structure with flexible key aliases
- compare document-level evidence stability
- compare document + span-level evidence stability
- compute pairwise overlap metrics across runs
- generate per-query instability summaries
- surface examples of unstable evidence
- report null-evidence patterns that can hide instability
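To make the document-level versus span-level comparison concrete, here is a minimal sketch of a pairwise overlap computation. It assumes Jaccard overlap between evidence sets and treats two empty evidence sets as identical; both are illustrative assumptions, and the metric ragcitecheck actually computes may differ. The function names (`evidence_set`, `jaccard`, `pairwise_overlap`) are hypothetical and not part of the package. The record shape follows the JSONL input format described in this README.

```python
from itertools import combinations


def evidence_set(record, evidence_key="doc"):
    """Return one record's evidence as a set of hashable keys."""
    items = record.get("retrieved", [])
    if evidence_key == "doc":
        return {item["doc_id"] for item in items}
    # doc_span: a citation matches only if document and span both match.
    return {(item["doc_id"], item.get("span_hash")) for item in items}


def jaccard(a, b):
    """Jaccard overlap; two empty sets count as identical (an assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_overlap(runs, evidence_key="doc"):
    """Mean overlap for each pair of runs, over queries present in both.

    `runs` maps run_id -> {query_id: record}.
    """
    results = {}
    for run_a, run_b in combinations(sorted(runs), 2):
        shared = sorted(set(runs[run_a]) & set(runs[run_b]))
        scores = [
            jaccard(
                evidence_set(runs[run_a][q], evidence_key),
                evidence_set(runs[run_b][q], evidence_key),
            )
            for q in shared
        ]
        results[(run_a, run_b)] = sum(scores) / len(scores) if scores else None
    return results


# Two runs cite the same documents, but one cited span differs.
runs = {
    "runA": {"q1": {"retrieved": [{"doc_id": "D1", "span_hash": "s1"},
                                   {"doc_id": "D2", "span_hash": "s2"}]}},
    "runB": {"q1": {"retrieved": [{"doc_id": "D1", "span_hash": "s1"},
                                   {"doc_id": "D2", "span_hash": "s9"}]}},
}
doc_overlap = pairwise_overlap(runs, "doc")
span_overlap = pairwise_overlap(runs, "doc_span")
```

In this toy example the document-level overlap is perfect while the span-level overlap drops to one third, which is the kind of gap that answer-level metrics would not show.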
ragcitecheck is useful for:
- RAG evaluators who want more than answer-level metrics
- framework maintainers who want evidence-level regression checks
- LLMOps / observability teams tracking provenance drift
- researchers studying retrieval jitter or citation stability
- product teams where evidence consistency matters
Install on Windows:

    py -3.10 -m venv .venv
    .venv\Scripts\activate
    python -m pip install --upgrade pip setuptools wheel
    pip install -r requirements.txt
    pip install -e .

Install on macOS / Linux:

    python3 -m venv .venv
    source .venv/bin/activate
    python -m pip install --upgrade pip setuptools wheel
    pip install -r requirements.txt
    pip install -e .

Validate a set of runs:

    python -m ragcitecheck.cli validate --runs ./tests/fixtures/runs_min --out ./out_check

Generate a document-level report:

    python -m ragcitecheck.cli report --runs ./tests/fixtures/runs_min --out ./out_report_doc --evidence-key doc

Generate a document + span-level report:

    python -m ragcitecheck.cli report --runs ./tests/fixtures/runs_min --out ./out_report_span --evidence-key doc_span

Each run is stored as JSONL with one record per query.
Document-level evidence:

    {"run_id":"runA","query_id":"q1","retrieved":[{"doc_id":"D1"},{"doc_id":"D2"}]}
    {"run_id":"runA","query_id":"q2","retrieved":[{"doc_id":"D3"}]}

Document + span-level evidence:

    {"run_id":"runA","query_id":"q1","retrieved":[{"doc_id":"D1","span_hash":"s1"},{"doc_id":"D2","span_hash":"s2"}]}
    {"run_id":"runA","query_id":"q2","retrieved":[{"doc_id":"D3","span_hash":"s3"}]}

Accepted key aliases:
- run identifier: `run_id`, `runId`, `config_id`, `runid`
- query identifier: `query_id`, `qid`, `id`
- evidence list: `cited`, `docs`, `documents`, `retrieved`, `contexts`
- document identifier: `doc_id`, `document_id`, `docid`, `id`, `source_id`
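The alias handling can be pictured as a small normalization step applied to each JSONL line. The sketch below is illustrative only: the helper names (`first_present`, `normalize_record`, `parse_jsonl`) are hypothetical, the first-match-wins lookup order is an assumption, and it assumes every evidence item is a JSON object. It is not the package's own loader.

```python
import json

# Alias lists as read from this README; the lookup order is an assumption.
RUN_KEYS = ("run_id", "runId", "config_id", "runid")
QUERY_KEYS = ("query_id", "qid", "id")
EVIDENCE_KEYS = ("cited", "docs", "documents", "retrieved", "contexts")
DOC_KEYS = ("doc_id", "document_id", "docid", "id", "source_id")


def first_present(mapping, keys):
    """Return the value of the first alias found in `mapping`."""
    for key in keys:
        if key in mapping:
            return mapping[key]
    raise KeyError(f"none of {keys} found in record")


def normalize_record(raw):
    """Map an aliased record onto one canonical shape."""
    items = first_present(raw, EVIDENCE_KEYS)
    return {
        "run_id": first_present(raw, RUN_KEYS),
        "query_id": first_present(raw, QUERY_KEYS),
        "retrieved": [
            {
                "doc_id": first_present(item, DOC_KEYS),
                # span_hash is optional; None means document-level only.
                "span_hash": item.get("span_hash"),
            }
            for item in items
        ],
    }


def parse_jsonl(lines):
    """Parse JSONL text lines, skipping blank lines."""
    return [normalize_record(json.loads(line)) for line in lines if line.strip()]


sample = [
    '{"runId": "runA", "qid": "q1", "docs": [{"document_id": "D1"}]}',
    '{"run_id": "runA", "query_id": "q2", "retrieved": [{"doc_id": "D3", "span_hash": "s3"}]}',
]
records = parse_jsonl(sample)
```

Both sample lines end up in the same canonical shape, so later comparison code only has to deal with `run_id`, `query_id`, `retrieved`, `doc_id`, and `span_hash`.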
Options:
- `--allow-missing`: evaluate only the queries shared across runs
- `--docid-map path.csv`: canonicalize raw document IDs using a `raw,canonical` mapping
- `--case-sensitive`: disable lowercasing during canonicalization
- `--collapse-internal-whitespace`: normalize repeated whitespace before comparison
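As a rough illustration of what these canonicalization options imply, the sketch below applies them to a single document ID. The order of operations (trim, collapse whitespace, map lookup, lowercase), the trimming step, and the exact-match lookup against raw IDs are all assumptions made for the example; `load_docid_map` and `canonicalize` are hypothetical names, not the package's API.

```python
import csv
import io
import re


def load_docid_map(csv_text):
    """Parse `raw,canonical` rows into a dict (no header row assumed)."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


def canonicalize(doc_id, docid_map=None, case_sensitive=False,
                 collapse_internal_whitespace=False):
    """Return a canonical form of `doc_id` under the chosen options."""
    value = doc_id.strip()  # trimming is an assumption of this sketch
    if collapse_internal_whitespace:
        value = re.sub(r"\s+", " ", value)
    if docid_map and value in docid_map:
        value = docid_map[value]
    if not case_sensitive:
        value = value.lower()
    return value


mapping = load_docid_map("Doc-001.pdf,D1\nreport v2 final,D2\n")

default_form = canonicalize("Doc-001.pdf", mapping)
case_kept = canonicalize("Doc-001.pdf", mapping, case_sensitive=True)
collapsed = canonicalize("report   v2  final", mapping,
                         collapse_internal_whitespace=True)
```

Under these assumptions, two runs that log `Doc-001.pdf` and `D1` for the same source would compare as equal after mapping, rather than being counted as an evidence change.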
Validation outputs:
- `validation_summary.json`

Report outputs may include:
- `run_quality.csv`
- `pairwise_config_stability.csv`
- `per_query_stability.csv`
- `instability_examples.md`
- `citation_overlap_hist.png`
- `report_meta.json`
See examples/golden/out for a sample generated report, including CSV summaries, instability examples, metadata, and a citation-overlap histogram.
Interpreting the results:
- Average overlap close to 1.0: evidence is relatively stable across runs
- High flip rate: queries frequently undergo major evidence changes
- High null-evidence rate: many queries return empty evidence; this can make a system look stable while reducing usefulness
- Large doc vs span gap: the same documents may be cited across runs while the cited spans still change substantially
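The four signals above can be illustrated with a small summary over per-query overlap scores. This is a sketch under stated assumptions: a "flip" is taken to mean a document-level overlap below a hypothetical `flip_threshold` of 0.5, a query counts as null-evidence when it is listed in `empty_queries`, and the doc vs span gap is the difference of the two means. The definitions used in ragcitecheck's reports may differ, and `summarize` is not a function of the package.

```python
from statistics import mean


def summarize(doc_overlap, span_overlap, empty_queries, flip_threshold=0.5):
    """Summarize per-query overlap scores (dicts keyed by query_id)."""
    queries = sorted(doc_overlap)
    mean_doc = mean([doc_overlap[q] for q in queries])
    mean_span = mean([span_overlap[q] for q in queries])
    # Assumed flip definition: overlap falls below the threshold.
    flips = [q for q in queries if doc_overlap[q] < flip_threshold]
    nulls = [q for q in queries if q in empty_queries]
    return {
        "mean_doc_overlap": mean_doc,
        "mean_span_overlap": mean_span,
        "flip_rate": len(flips) / len(queries),
        "null_evidence_rate": len(nulls) / len(queries),
        "doc_span_gap": mean_doc - mean_span,
    }


# q4 returned no evidence in either run, so it scores a "perfect" overlap.
doc_scores = {"q1": 1.0, "q2": 0.2, "q3": 1.0, "q4": 1.0}
span_scores = {"q1": 0.5, "q2": 0.0, "q3": 1.0, "q4": 1.0}
summary = summarize(doc_scores, span_scores, empty_queries={"q4"})
```

Here `q4` raises the average overlap even though it contributes no usable evidence, which is why the null-evidence rate is worth reading alongside the overlap figures.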
ragcitecheck adds capabilities that many RAG evaluation stacks do not report directly:
- evidence regression checks after retriever or chunking changes
- provenance drift diagnostics across releases
- separation of answer drift vs evidence drift
- lightweight post-hoc comparison from exported logs
ragcitecheck is best described as any of the following:
- RAG evidence stability checker
- citation stability diagnostic tool
- post-hoc provenance stability evaluator
- evidence drift analysis utility for RAG
The tool supports experimental workflows studying evidence stability, citation consistency, and retrieval jitter in RAG systems. The repository is intended to be useful both as a research artifact and as a lightweight utility for post-hoc evidence stability checks in external RAG pipelines.
This project is actively maintained, with ongoing improvements to examples, reporting, and integrations.
See the LICENSE file in this repository.