ragcitecheck

ragcitecheck is a lightweight tool for measuring citation and evidence stability in Retrieval-Augmented Generation (RAG) pipelines across repeated runs and configuration changes.

It helps answer a practical question: If I rerun the same RAG pipeline with a different retriever, chunk size, top-k, overlap, or prompt setting, do I still get the same evidence?

Research context: ragcitecheck is an open-source tool associated with peer-reviewed research accepted at the ACL 2026 EvalEval Workshop on evaluation of language models and systems.

Why this matters

Most RAG evaluation focuses on answer correctness, groundedness, retrieval recall, latency, or cost.

Those are useful, but they miss an important reliability question:

Is the cited evidence stable across runs?

A system may produce similar answers while silently changing:

which documents it cites
which spans inside documents it relies on
how often evidence disappears entirely

ragcitecheck is designed to make that visible.

What it does

Given one or more run logs in JSONL format, ragcitecheck can:

validate run structure with flexible key aliases
compare document-level evidence stability
compare document + span-level evidence stability
compute pairwise overlap metrics across runs
generate per-query instability summaries
surface examples of unstable evidence
report null-evidence patterns that can hide instability
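
As a rough illustration of what a pairwise overlap metric can look like, the sketch below computes the mean per-query Jaccard overlap between document sets for every pair of runs. Jaccard is an assumption here; this README does not state which overlap measure ragcitecheck uses, and the function names `jaccard` and `pairwise_overlap` are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two evidence sets.

    Two empty sets are treated as identical (1.0). This convention is an
    assumption, and it shows how null evidence can make runs look stable.
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_overlap(runs: dict[str, dict[str, set]]) -> dict[tuple[str, str], float]:
    """Mean per-query Jaccard for each pair of runs, over shared queries only."""
    result = {}
    for r1, r2 in combinations(sorted(runs), 2):
        shared = sorted(set(runs[r1]) & set(runs[r2]))
        scores = [jaccard(runs[r1][q], runs[r2][q]) for q in shared]
        result[(r1, r2)] = sum(scores) / len(scores) if scores else float("nan")
    return result


# Hypothetical data: run_id -> query_id -> set of cited document IDs.
runs = {
    "runA": {"q1": {"D1", "D2"}, "q2": {"D3"}},
    "runB": {"q1": {"D1", "D4"}, "q2": {"D3"}},
}
overlap = pairwise_overlap(runs)
print(overlap)
```

In this toy data q1 shares one of three distinct documents and q2 is identical, so the pair averages to two thirds.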

Who it is for

ragcitecheck is useful for:

RAG evaluators who want more than answer-level metrics
framework maintainers who want evidence-level regression checks
LLMOps / observability teams tracking provenance drift
researchers studying retrieval jitter or citation stability
product teams where evidence consistency matters

Installation

Windows

py -3.10 -m venv .venv
.venv\Scripts\activate
python -m pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
pip install -e .

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
pip install -e .

Quickstart

Validate run logs

python -m ragcitecheck.cli validate --runs ./tests/fixtures/runs_min --out ./out_check

Generate a document-level report

python -m ragcitecheck.cli report --runs ./tests/fixtures/runs_min --out ./out_report_doc --evidence-key doc

Generate a document+span-level report

python -m ragcitecheck.cli report --runs ./tests/fixtures/runs_min --out ./out_report_span --evidence-key doc_span
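
For scripted use (for example in CI), the same commands can be assembled from Python. This is a minimal sketch: `build_command` and `run_pipeline` are hypothetical helpers, only the flags shown above are used, and it assumes the CLI exits with a non-zero status on failure, which this README does not confirm.

```python
import subprocess
import sys


def build_command(subcommand, runs_dir, out_dir, evidence_key=None):
    """Build the argv list for one ragcitecheck CLI call."""
    cmd = [sys.executable, "-m", "ragcitecheck.cli", subcommand,
           "--runs", runs_dir, "--out", out_dir]
    if evidence_key is not None:
        cmd += ["--evidence-key", evidence_key]
    return cmd


def run_pipeline(runs_dir, out_root):
    """Validate the run logs, then produce doc-level and doc+span-level reports.

    check=True raises CalledProcessError if a step returns a non-zero exit
    code (assumed, not confirmed, to signal failure).
    """
    subprocess.run(build_command("validate", runs_dir, f"{out_root}/check"), check=True)
    for key in ("doc", "doc_span"):
        subprocess.run(
            build_command("report", runs_dir, f"{out_root}/report_{key}", key),
            check=True,
        )
```

Calling `run_pipeline("./tests/fixtures/runs_min", "./out")` would run the three Quickstart steps in sequence, provided ragcitecheck is installed in the active environment.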

Minimal input format

Each run is stored as JSONL with one record per query.

Document-level example

{"run_id":"runA","query_id":"q1","retrieved":[{"doc_id":"D1"},{"doc_id":"D2"}]}
{"run_id":"runA","query_id":"q2","retrieved":[{"doc_id":"D3"}]}

Document+span-level example

{"run_id":"runA","query_id":"q1","retrieved":[{"doc_id":"D1","span_hash":"s1"},{"doc_id":"D2","span_hash":"s2"}]}
{"run_id":"runA","query_id":"q2","retrieved":[{"doc_id":"D3","span_hash":"s3"}]}

Supported aliases

Run ID keys

run_id
runId
config_id
run
id

Query ID keys

query_id
qid
id

Evidence list keys

cited
docs
documents
retrieved
contexts

Document ID keys

doc_id
document_id
docid
id
source_id
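
One simple way to think about alias handling is "first matching key wins". The sketch below uses the alias lists above in the order they are listed; that precedence is an assumption, as are the helper names `first_present` and `normalize_record`. Note that `id` appears in both the run ID and query ID lists, so a record that relies on `id` alone for both would be ambiguous.

```python
RUN_ID_KEYS = ("run_id", "runId", "config_id", "run", "id")
QUERY_ID_KEYS = ("query_id", "qid", "id")
EVIDENCE_KEYS = ("cited", "docs", "documents", "retrieved", "contexts")
DOC_ID_KEYS = ("doc_id", "document_id", "docid", "id", "source_id")


def first_present(record, aliases):
    """Return the value of the first alias found in record (assumed precedence)."""
    for key in aliases:
        if key in record:
            return record[key]
    raise KeyError(f"none of {aliases} found in record keys {sorted(record)}")


def normalize_record(record):
    """Rewrite one run-log record to canonical key names."""
    items = first_present(record, EVIDENCE_KEYS)
    return {
        "run_id": first_present(record, RUN_ID_KEYS),
        "query_id": first_present(record, QUERY_ID_KEYS),
        "doc_ids": [first_present(item, DOC_ID_KEYS) for item in items],
    }


raw = {"runId": "runB", "qid": "q7", "docs": [{"docid": "D9"}, {"source_id": "D2"}]}
normalized = normalize_record(raw)
print(normalized)
```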

Common options

--allow-missing: Evaluate only the queries shared across runs
--docid-map path.csv: Canonicalize raw document IDs using a raw,canonical mapping
--case-sensitive: Disable lowercasing during canonicalization
--collapse-internal-whitespace: Normalize repeated whitespace before comparison
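
The canonicalization options can be pictured as a small normalization function applied to each document ID before comparison. This is a sketch under stated assumptions, not the tool's implementation: the order of steps (map lookup on the raw ID, then whitespace, then case), the stripping of outer whitespace, and the name `canonicalize` are all hypothetical.

```python
import re


def canonicalize(doc_id, docid_map=None, case_sensitive=False,
                 collapse_internal_whitespace=False):
    """Normalize one document ID before set comparison.

    Assumed order: strip outer whitespace, apply the raw -> canonical map,
    optionally collapse whitespace runs, then lowercase unless case_sensitive.
    """
    value = doc_id.strip()
    if docid_map:
        value = docid_map.get(value, value)  # unmapped IDs pass through
    if collapse_internal_whitespace:
        value = re.sub(r"\s+", " ", value)
    if not case_sensitive:
        value = value.lower()
    return value


print(canonicalize("Doc  A", collapse_internal_whitespace=True))
print(canonicalize("Doc  A", case_sensitive=True))
print(canonicalize("D-001", docid_map={"D-001": "Handbook_v2"}))
```

Without normalization of this kind, IDs such as "Doc A" and "doc  a" would count as different evidence and lower the measured overlap for purely cosmetic reasons.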

Outputs

Validation outputs:

validation_summary.json

Report outputs may include:

run_quality.csv
pairwise_config_stability.csv
per_query_stability.csv
instability_examples.md
citation_overlap_hist.png
report_meta.json

Sample report

See examples/golden/out for a sample generated report, including CSV summaries, instability examples, metadata, and a citation-overlap histogram.

Interpreting results

Average overlap close to 1.0: Evidence is relatively stable across runs
High flip rate: Queries frequently undergo major evidence changes
High null-evidence rate: Many queries return empty evidence; this can make a system look stable while reducing usefulness
Large doc vs span gap: The same documents may be cited across runs while cited spans still change substantially
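
A small worked example may help connect these four signals. The definitions below are assumptions made for illustration only: overlap is Jaccard, a "flip" is a query whose document-level overlap falls below a hypothetical `flip_threshold` of 0.5, and a query counts as null-evidence if either run returns nothing. The exact definitions ragcitecheck uses are not given in this README.

```python
def jaccard(a, b):
    """Jaccard overlap; two empty sets count as identical (assumed convention)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def summarize(run_a, run_b, flip_threshold=0.5):
    """Compare two runs given as query_id -> set of (doc_id, span_hash) pairs."""
    shared = sorted(set(run_a) & set(run_b))
    if not shared:
        raise ValueError("runs share no queries")
    doc_scores, span_scores, null_count = [], [], 0
    for q in shared:
        a, b = run_a[q], run_b[q]
        doc_scores.append(jaccard({d for d, _ in a}, {d for d, _ in b}))
        span_scores.append(jaccard(a, b))
        if not a or not b:
            null_count += 1
    n = len(shared)
    doc_mean = sum(doc_scores) / n
    span_mean = sum(span_scores) / n
    return {
        "doc_overlap": doc_mean,
        "span_overlap": span_mean,
        "doc_span_gap": doc_mean - span_mean,
        "flip_rate": sum(s < flip_threshold for s in doc_scores) / n,
        "null_rate": null_count / n,
    }


# Hypothetical runs: q1 keeps its documents but changes a span, q2 changes
# documents entirely, q3 is empty in both runs, q4 is unchanged.
run_a = {"q1": {("D1", "s1"), ("D2", "s2")}, "q2": {("D3", "s3")},
         "q3": set(), "q4": {("D5", "s5")}}
run_b = {"q1": {("D1", "s9"), ("D2", "s2")}, "q2": {("D4", "s4")},
         "q3": set(), "q4": {("D5", "s5")}}
summary = summarize(run_a, run_b)
print(summary)
```

Here the empty q3 contributes a perfect overlap score to both averages, which is the effect described under the null-evidence rate above, and q1 produces the doc vs span gap because its documents match while one span differs.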

Integration use cases

ragcitecheck adds capabilities that many RAG evaluation stacks do not provide directly:

evidence regression checks after retriever or chunking changes
provenance drift diagnostics across releases
separation of answer drift vs evidence drift
lightweight post-hoc comparison from exported logs

ragcitecheck is best described as any of the following:

RAG evidence stability checker
citation stability diagnostic tool
post-hoc provenance stability evaluator
evidence drift analysis utility for RAG

Citation / research context

The tool supports experimental workflows studying evidence stability, citation consistency, and retrieval jitter in RAG systems. The repository is intended to be useful both as a research artifact and as a lightweight utility for post-hoc evidence stability checks in external RAG pipelines.

This project is actively maintained, with ongoing improvements to examples, reporting, and integrations.

License

See the LICENSE file in this repository.
