A benchmark for retrieval-augmented question answering over carbon-credit project documents (Gold Standard Foundation registry).
📄 Companion paper: CarbonBench: A Knowledge-Graph Benchmark for Retrieval-Augmented QA over Carbon Credit Documents (under review, NeurIPS 2026 Datasets and Benchmarks Track)
- 3 task tiers (qa_t1.jsonl, qa_t2.jsonl, qa_t3.jsonl):
  - T1: single-document factual QA (~8k questions)
  - T2: multi-document numeric reasoning (~1.9k questions; gold-annotated subset in qa_t2_gold.jsonl)
  - T3: cross-project comparison (~4.8k questions)
- Canonical splits: carbonbench_train.jsonl / carbonbench_val.jsonl / carbonbench_test.jsonl (70/15/15)
- Baseline results for 6 reader configurations included in baseline_results/
- Headline scores (Hybrid Retrieval + Opus 4.7 + CoT, T2 metric):
  - Composite: 26.2%
  - LLM-judge (Opus 4.7): 50.0% with self-consistency-3
  - See v3_main_summary.json and v3_main_leaderboard.tex
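
As a quick sanity check on the canonical splits, the sketch below counts records per split and reports each split's share of the total. It assumes the files are standard JSON Lines (one JSON object per line) and live under data/ as in the repository layout; the helper names `load_jsonl`, `count_splits`, and `split_fractions` are hypothetical and not part of scripts/.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def count_splits(data_dir, names=("train", "val", "test")):
    """Count records in carbonbench_<name>.jsonl for each split name.

    The data_dir location is an assumption based on the repository layout.
    """
    data_dir = Path(data_dir)
    return {
        name: len(load_jsonl(data_dir / f"carbonbench_{name}.jsonl"))
        for name in names
    }


def split_fractions(counts):
    """Map each split name to its share of the total record count."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records found")
    return {name: n / total for name, n in counts.items()}
```

Calling `split_fractions(count_splits("data"))` from the repository root should give shares close to the stated 70/15/15 proportions if the files follow the layout assumed here.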
# Clone the repository (canonical URL released upon paper acceptance)
git clone <repository-url>
cd carbonbench
# Install Python dependencies
pip install -r scripts/requirements.txt
# Set API keys for the readers you want to evaluate
export ANTHROPIC_API_KEY=... # required for Opus-4.7 reader + LLM-judge
export OPENAI_API_KEY=... # required for GPT-4o baseline
export GOOGLE_API_KEY=... # optional, for Gemini baseline
# Reproduce headline numbers (Hybrid + Opus-4.7 + CoT)
python scripts/evaluate_v3.py \
--method hybrid \
--reader opus-4-7 \
--cot \
--context v3
# Score with LLM-judge
python scripts/judge_all_tasks.py --workers 12

carbonbench/
├── data/ # Canonical splits + per-task QA files
│ ├── carbonbench_full.jsonl
│ ├── carbonbench_{train,val,test}.jsonl
│ └── qa_t{1,2,2_gold,3}.jsonl
├── scripts/ # Evaluation + dataset generation
│ ├── evaluate_v3.py # Main eval (multi-reader + CoT + SC + tool-use)
│ ├── evaluate_baseline.py # Reader-only baseline (no retrieval)
│ ├── evaluate_rag.py # RAG variants
│ ├── evaluate_parser_accuracy.py
│ ├── judge_all_tasks.py # LLM-judge pipeline
│ ├── score_t2.py # T2 composite + decomposition
│ ├── generate_qa_t{1,2}.py # QA synthesis (LLM + filter)
│ └── generate_t2_gold_annotations.py
├── baseline_results/ # Pre-computed leaderboard JSON per config
├── dataset_card.json # Croissant-compatible dataset metadata
├── reproducibility_report.json
├── v3_main_summary.json + v3_main_leaderboard.tex # Headline numbers
├── t2_rescore_report.json + t2_rescore_table.tex # T2 diagnostic appendix
├── LICENSE # Apache-2.0 (code)
├── LICENSE-DATA # CC BY 4.0 (dataset artifacts)
└── NOTICE
- Code (everything under scripts/): Apache License 2.0; see LICENSE
- Dataset and results (everything under data/, baseline_results/, *.json, *.tex): Creative Commons Attribution 4.0 International; see LICENSE-DATA
The benchmark questions and gold answers are transformative derivative
works built on top of publicly available Gold Standard Foundation project
documents (© Gold Standard Foundation). See NOTICE for attribution
details.
If you use CarbonBench in your research, please cite:
@inproceedings{carbonbench2026,
title = {CarbonBench: A Knowledge-Graph Benchmark for Retrieval-Augmented QA over Carbon Credit Documents},
author = {<authors anonymized for review>},
booktitle = {Advances in Neural Information Processing Systems},
series = {Datasets and Benchmarks Track},
year = {2026}
}

(BibTeX entry will be updated with author list and DOI upon publication.)
All numbers in the companion paper can be reproduced from this repository
plus access to the listed APIs (Anthropic, OpenAI, Google Gemini). See
reproducibility_report.json for environment details (model versions,
seeds, retrieval cache hashes) and baseline_results/ for the exact
JSON outputs underlying the leaderboard.
The retrieval cache (rag_cache/, ~286 MB) is not distributed in this
repository. Re-running the retrieval scripts will regenerate
it; see scripts/evaluate_rag.py for the cache key scheme.
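
After regenerating the cache, one way to fingerprint it for comparison across runs is to hash every file under rag_cache/. The sketch below uses SHA-256 from the standard library. This is an assumption for illustration only: the hash algorithm and layout actually recorded in reproducibility_report.json may differ, and the helper names `file_sha256` and `directory_digests` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def directory_digests(root):
    """Map each file's path (relative to root, POSIX style) to its digest.

    SHA-256 is an assumed choice; check the reproducibility report for
    the scheme the authors actually used.
    """
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Two runs that produce identical `directory_digests("rag_cache")` mappings have byte-identical caches; any differing entry points to the file worth inspecting.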