CarbonBench

A benchmark for retrieval-augmented question answering over carbon-credit project documents (Gold Standard Foundation registry).

📄 Companion paper: CarbonBench: A Knowledge-Graph Benchmark for Retrieval-Augmented QA over Carbon Credit Documents (under review, NeurIPS 2026 Datasets and Benchmarks Track)

Highlights

- 3 task tiers (qa_t1.jsonl, qa_t2.jsonl, qa_t3.jsonl):
  - T1: single-document factual QA (~8k questions)
  - T2: multi-document numeric reasoning (~1.9k questions; gold-annotated subset in qa_t2_gold.jsonl)
  - T3: cross-project comparison (~4.8k questions)
- Canonical splits: carbonbench_train.jsonl / carbonbench_val.jsonl / carbonbench_test.jsonl (70/15/15)
- Baseline results for 6 reader configurations are included in baseline_results/
- Headline scores (Hybrid Retrieval + Opus 4.7 + CoT, T2 metric):
  - Composite: 26.2%
  - LLM-judge (Opus 4.7): 50.0% with self-consistency-3
  - See v3_main_summary.json and v3_main_leaderboard.tex
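
The self-consistency-3 setting in the headline scores is not spelled out in this README. The following is a minimal sketch, assuming it means sampling three reader answers per question and keeping the most frequent one. The `majority_vote` helper and the sample answers are hypothetical illustrations, not code or data from scripts/evaluate_v3.py.

```python
from collections import Counter


def majority_vote(answers):
    """Pick the most frequent answer among several sampled answers.

    Answers are compared after stripping whitespace and lowercasing.
    Ties go to the answer that was sampled first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    keys = [a.strip().lower() for a in answers]
    counts = Counter(keys)
    top = max(counts.values())
    for original, key in zip(answers, keys):
        if counts[key] == top:
            return original.strip()


# Three made-up samples for one question (illustrative only).
samples = ["12,500 tCO2e", "12,500 tCO2e ", "11,900 tCO2e"]
final_answer = majority_vote(samples)
print(final_answer)
```

With three samples, a tie only occurs when all three answers differ, in which case this sketch falls back to the first sample; the actual pipeline may resolve ties differently.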

Quick start

# Clone the repository (canonical URL released upon paper acceptance)
git clone <repository-url>
cd carbonbench

# Install Python dependencies
pip install -r scripts/requirements.txt

# Set API keys for the readers you want to evaluate
export ANTHROPIC_API_KEY=...   # required for Opus 4.7 reader + LLM-judge
export OPENAI_API_KEY=...      # required for GPT-4o baseline
export GOOGLE_API_KEY=...      # optional, for Gemini baseline

# Reproduce headline numbers (Hybrid + Opus 4.7 + CoT)
python scripts/evaluate_v3.py \
    --method hybrid \
    --reader opus-4-7 \
    --cot \
    --context v3

# Score with LLM-judge
python scripts/judge_all_tasks.py --workers 12
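
Because the evaluation calls paid APIs, it can help to confirm which keys are set before launching a run. The following is a small sketch that uses the environment variable names from the export lines above. The `missing_keys` helper is hypothetical and not part of the repository's scripts.

```python
import os

# Variable names come from the export lines above; the descriptions
# mirror the comments there.
KEYS = {
    "ANTHROPIC_API_KEY": "Opus 4.7 reader + LLM-judge",
    "OPENAI_API_KEY": "GPT-4o baseline",
    "GOOGLE_API_KEY": "Gemini baseline (optional)",
}


def missing_keys(env=None):
    """Return the names of keys that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in KEYS if not env.get(name)]


for name in missing_keys():
    print(f"not set: {name} ({KEYS[name]})")
```

Only the keys for the readers you actually intend to evaluate need to be present.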

Repository layout

carbonbench/
├── data/                    # Canonical splits + per-task QA files
│   ├── carbonbench_full.jsonl
│   ├── carbonbench_{train,val,test}.jsonl
│   └── qa_t{1,2,2_gold,3}.jsonl
├── scripts/                 # Evaluation + dataset generation
│   ├── evaluate_v3.py            # Main eval (multi-reader + CoT + SC + tool-use)
│   ├── evaluate_baseline.py      # Reader-only baseline (no retrieval)
│   ├── evaluate_rag.py           # RAG variants
│   ├── evaluate_parser_accuracy.py
│   ├── judge_all_tasks.py        # LLM-judge pipeline
│   ├── score_t2.py               # T2 composite + decomposition
│   ├── generate_qa_t{1,2}.py     # QA synthesis (LLM + filter)
│   └── generate_t2_gold_annotations.py
├── baseline_results/        # Pre-computed leaderboard JSON per config
├── dataset_card.json        # Croissant-compatible dataset metadata
├── reproducibility_report.json
├── v3_main_summary.json + v3_main_leaderboard.tex   # Headline numbers
├── t2_rescore_report.json + t2_rescore_table.tex    # T2 diagnostic appendix
├── LICENSE                  # Apache-2.0 (code)
├── LICENSE-DATA             # CC BY 4.0 (dataset artifacts)
└── NOTICE
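
To sanity-check a checkout, the split files under data/ can be read and their sizes compared against the stated 70/15/15 proportions. The following is a sketch that assumes each .jsonl file holds one JSON object per line. It makes no assumption about field names, and `read_jsonl` and `split_fractions` are hypothetical helpers rather than repository code.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def split_fractions(data_dir="data"):
    """Return each split's share of the total record count.

    Splits whose file is missing are left out; an empty dict is
    returned when no records are found.
    """
    sizes = {}
    for split in ("train", "val", "test"):
        path = Path(data_dir) / f"carbonbench_{split}.jsonl"
        if path.exists():
            sizes[split] = len(read_jsonl(path))
    total = sum(sizes.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in sizes.items()}


print(split_fractions())
```

Run from the repository root, this prints the observed fractions, which should be close to 0.70 / 0.15 / 0.15 if the splits match the description above.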

License

Code (everything under scripts/): Apache License 2.0 — see LICENSE
Dataset and results (everything under data/, baseline_results/, *.json, *.tex): Creative Commons Attribution 4.0 International — see LICENSE-DATA

The benchmark questions and gold answers are transformative derivative works built on top of publicly available Gold Standard Foundation project documents (© Gold Standard Foundation). See NOTICE for attribution details.

Citation

If you use CarbonBench in your research, please cite:

@inproceedings{carbonbench2026,
  title     = {CarbonBench: A Knowledge-Graph Benchmark for Retrieval-Augmented QA over Carbon Credit Documents},
  author    = {<authors anonymized for review>},
  booktitle = {Advances in Neural Information Processing Systems},
  series    = {Datasets and Benchmarks Track},
  year      = {2026}
}

(BibTeX entry will be updated with author list and DOI upon publication.)

Reproducibility

All numbers in the companion paper can be reproduced from this repository plus access to the listed APIs (Anthropic, OpenAI, Google Gemini). See reproducibility_report.json for environment details (model versions, seeds, retrieval cache hashes) and baseline_results/ for the exact JSON outputs underlying the leaderboard.

The retrieval cache (rag_cache/, ~286 MB) is not distributed in this repository. Re-running the retrieval scripts regenerates it; see scripts/evaluate_rag.py for the cache key scheme.
