NewTonne-AI/carbonbench

CarbonBench

A benchmark for retrieval-augmented question answering over carbon-credit project documents drawn from the Gold Standard Foundation registry.

📄 Companion paper: CarbonBench: A Knowledge-Graph Benchmark for Retrieval-Augmented QA over Carbon Credit Documents (under review, NeurIPS 2026 Datasets and Benchmarks Track)

Highlights

  • 3 task tiers (qa_t1.jsonl, qa_t2.jsonl, qa_t3.jsonl):
    • T1 — single-document factual QA (~8k questions)
    • T2 — multi-document numeric reasoning (~1.9k questions; gold-annotated subset in qa_t2_gold.jsonl)
    • T3 — cross-project comparison (~4.8k questions)
  • Canonical splits: carbonbench_train.jsonl / carbonbench_val.jsonl / carbonbench_test.jsonl (70/15/15)
  • Baseline results for 6 reader configurations are included in baseline_results/
  • Headline scores (Hybrid Retrieval + Opus 4.7 + CoT, T2 metric):
    • Composite: 26.2%
    • LLM-judge (Opus 4.7): 50.0% with self-consistency-3
    • See v3_main_summary.json and v3_main_leaderboard.tex
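
The "self-consistency-3" setting in the headline scores is not spelled out in this README. A common reading is that the reader is sampled three times and the most frequent answer is kept. The sketch below illustrates that idea only; the function names and the normalization rule are assumptions and are not taken from scripts/evaluate_v3.py.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match.

    This normalization is an assumption made for the illustration.
    """
    return " ".join(answer.strip().lower().split())


def self_consistent_answer(samples):
    """Return the majority answer among sampled reader outputs.

    Ties are broken in favor of the answer that was sampled first.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    normalized = [normalize(s) for s in samples]
    counts = Counter(normalized)
    best = max(counts.values())
    # Walk the samples in order so the earliest top-voted answer wins.
    for original, norm in zip(samples, normalized):
        if counts[norm] == best:
            return original


# Three hypothetical samples for one T2 question (values are made up).
print(self_consistent_answer(["12.5 tCO2e", "12.5  tco2e", "13 tCO2e"]))
```

With three samples, a two-to-one vote decides the answer; when all three disagree, this sketch falls back to the first sample.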

Quick start

# Clone the repository (canonical URL released upon paper acceptance)
git clone <repository-url>
cd carbonbench

# Install Python dependencies
pip install -r scripts/requirements.txt

# Set API keys for the readers you want to evaluate
export ANTHROPIC_API_KEY=...   # required for the Opus 4.7 reader and the LLM-judge
export OPENAI_API_KEY=...      # required for GPT-4o baseline
export GOOGLE_API_KEY=...      # optional, for Gemini baseline

# Reproduce headline numbers (Hybrid + Opus-4.7 + CoT)
python scripts/evaluate_v3.py \
    --method hybrid \
    --reader opus-4-7 \
    --cot \
    --context v3

# Score with LLM-judge
python scripts/judge_all_tasks.py --workers 12
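
Before starting a long evaluation run, it can help to confirm that the API keys listed above are actually set. The following is a minimal sketch, not part of the repository: the environment variable names come from the quick start, while the reader labels other than opus-4-7 and the helper name missing_keys are hypothetical.

```python
import os

# Environment variables as named in the quick start. The labels on the left
# are illustrative; only "opus-4-7" appears as a --reader value in this README.
KEYS_BY_READER = {
    "opus-4-7": "ANTHROPIC_API_KEY",
    "llm-judge": "ANTHROPIC_API_KEY",
    "gpt-4o": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}


def missing_keys(readers, environ=None):
    """Return the sorted env var names that are unset or empty for these readers."""
    env = os.environ if environ is None else environ
    needed = {KEYS_BY_READER[r] for r in readers}
    return sorted(name for name in needed if not env.get(name))


if __name__ == "__main__":
    missing = missing_keys(["opus-4-7", "llm-judge"])
    if missing:
        print("Set these before running:", ", ".join(missing))
    else:
        print("All required keys are set.")
```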

Repository layout

carbonbench/
├── data/                    # Canonical splits + per-task QA files
│   ├── carbonbench_full.jsonl
│   ├── carbonbench_{train,val,test}.jsonl
│   └── qa_t{1,2,2_gold,3}.jsonl
├── scripts/                 # Evaluation + dataset generation
│   ├── evaluate_v3.py            # Main eval (multi-reader + CoT + SC + tool-use)
│   ├── evaluate_baseline.py      # Reader-only baseline (no retrieval)
│   ├── evaluate_rag.py           # RAG variants
│   ├── evaluate_parser_accuracy.py
│   ├── judge_all_tasks.py        # LLM-judge pipeline
│   ├── score_t2.py               # T2 composite + decomposition
│   ├── generate_qa_t{1,2}.py     # QA synthesis (LLM + filter)
│   └── generate_t2_gold_annotations.py
├── baseline_results/        # Pre-computed leaderboard JSON per config
├── dataset_card.json        # Croissant-compatible dataset metadata
├── reproducibility_report.json
├── v3_main_summary.json + v3_main_leaderboard.tex   # Headline numbers
├── t2_rescore_report.json + t2_rescore_table.tex    # T2 diagnostic appendix
├── LICENSE                  # Apache-2.0 (code)
├── LICENSE-DATA             # CC BY 4.0 (dataset artifacts)
└── NOTICE
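
The files under data/ are JSON Lines, so each non-empty line should parse as one JSON object. The sketch below reads such a file and computes split proportions, which can be compared against the stated 70/15/15 ratio. It assumes nothing about the record schema, which is not documented in this README; the helper names read_jsonl and split_fractions are hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def split_fractions(sizes):
    """Convert a mapping of split name to record count into fractions of the total."""
    total = sum(sizes.values())
    if total == 0:
        raise ValueError("all splits are empty")
    return {name: count / total for name, count in sizes.items()}


def canonical_split_sizes(data_dir):
    """Count records in the three canonical split files named in the layout above."""
    data_dir = Path(data_dir)
    return {
        split: len(read_jsonl(data_dir / f"carbonbench_{split}.jsonl"))
        for split in ("train", "val", "test")
    }
```

From the repository root, `split_fractions(canonical_split_sizes("data"))` should return values close to 0.70, 0.15 and 0.15 if the files match the description.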

License

  • Code (everything under scripts/): Apache License 2.0; see LICENSE
  • Dataset and results (everything under data/ and baseline_results/, plus the top-level *.json and *.tex files): Creative Commons Attribution 4.0 International; see LICENSE-DATA

The benchmark questions and gold answers are transformative derivative works built on top of publicly available Gold Standard Foundation project documents (© Gold Standard Foundation). See NOTICE for attribution details.

Citation

If you use CarbonBench in your research, please cite:

@inproceedings{carbonbench2026,
  title     = {CarbonBench: A Knowledge-Graph Benchmark for Retrieval-Augmented QA over Carbon Credit Documents},
  author    = {<authors anonymized for review>},
  booktitle = {Advances in Neural Information Processing Systems},
  series    = {Datasets and Benchmarks Track},
  year      = {2026}
}

(BibTeX entry will be updated with author list and DOI upon publication.)

Reproducibility

All numbers in the companion paper can be reproduced from this repository plus access to the listed APIs (Anthropic, OpenAI, Google Gemini). See reproducibility_report.json for environment details (model versions, seeds, retrieval cache hashes) and baseline_results/ for the exact JSON outputs underlying the leaderboard.

The retrieval cache (rag_cache/, ~286 MB) is not distributed with this repository. Re-running the retrieval scripts regenerates it; see scripts/evaluate_rag.py for the cache key scheme.
