A small Python package that discovers, downloads, and curates IPCC PDF reports (AR5, AR6, and the three AR6-cycle Special Reports) into a clean primary-content dataset for RAG / retrieval pipelines.
From a published GitHub repo:
pip install git+https://github.com/herman181920/IPCC-Scraper.git

From a local checkout:
pip install -e .
# with test dependencies
pip install -e ".[test]"

Requires Python 3.10+. The only runtime dependency is aiohttp.
import asyncio
from pathlib import Path
from ipcc_scraper.discovery import DiscoveryOrchestrator
from ipcc_scraper.downloader import AdaptiveDownloader
from ipcc_scraper.manifest import ManifestStore

async def main():
    manifest_path = Path("dataset/dataset_manifest.csv")
    orchestrator = DiscoveryOrchestrator(manifest_path=manifest_path)
    entries = await orchestrator.discover()  # Phase 1
    downloader = AdaptiveDownloader(
        downloads_dir=Path("dataset/downloads"),
        manifest_store=ManifestStore(manifest_path),
    )
    await downloader.download(entries)  # Phase 2
    # Phase 3 (strict-scope filter) is run via the CLI:
    #   python -m ipcc_scraper build-dataset

asyncio.run(main())

Or use the CLI:
python -m ipcc_scraper discover
python -m ipcc_scraper download
python -m ipcc_scraper build-dataset

Three-phase pipeline:
- Discovery (3 layers in parallel): sitemap harvest, structured BFS crawl from /reports/, and URL-pattern probing against templates for AR5/AR6/SR15/SRCCL/SROCC. Results are merged by URL with provenance preserved (discovered_via: "sitemap+crawl").
- Download: async concurrent downloader with adaptive concurrency (2–16), HTTP 429 backoff, magic-byte PDF verification, SHA256 streaming hash, and .part resume.
- Build dataset: a URL/filename classifier extracts metadata (report, working group, document kind), then a strict-scope filter keeps only chapters/SPM/TS/annex/FAQ in English; drafts, translations, supplementary spreadsheets, and admin docs are excluded with reasons recorded.
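As a rough illustration of the "merged by URL with provenance preserved" step, the sketch below combines URL lists from several layers and joins the layer names with `+`. The function name `merge_by_url`, the layer names, and the example URLs are hypothetical; the package's real merge logic may differ.

```python
def merge_by_url(layers: dict[str, list[str]]) -> dict[str, str]:
    """Merge per-layer URL lists into {url: provenance}.

    `layers` maps a layer name (e.g. "sitemap") to the URLs it found.
    This helper is an illustration, not the package's actual API.
    """
    found: dict[str, list[str]] = {}
    for layer_name, urls in layers.items():
        for url in urls:
            sources = found.setdefault(url, [])
            if layer_name not in sources:  # ignore repeats within one layer
                sources.append(layer_name)
    # Join the layer names in the order the layers were supplied.
    return {url: "+".join(sources) for url, sources in found.items()}


merged = merge_by_url({
    "sitemap": ["https://example.org/a.pdf", "https://example.org/b.pdf"],
    "crawl": ["https://example.org/a.pdf"],
    "probe": ["https://example.org/c.pdf"],
})
print(merged["https://example.org/a.pdf"])  # sitemap+crawl
```

Keeping the provenance as a joined string makes it easy to store in a single manifest column while still showing which layers agreed on a URL.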
See docs/architecture.md for module-level detail.
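The magic-byte check and streaming hash mentioned for the download phase can be sketched with the standard library alone. This is a minimal example that re-checks a file already on disk; the name `verify_and_hash` and the chunk size are assumptions, and the package presumably performs the equivalent work while the response body is being streamed.

```python
import hashlib
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # every PDF file starts with these bytes


def verify_and_hash(path: Path, chunk_size: int = 1 << 16) -> tuple[bool, str]:
    """Return (looks_like_pdf, sha256_hex) while reading the file only once."""
    digest = hashlib.sha256()
    is_pdf = False
    first_chunk = True
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            if first_chunk:
                # Only the first bytes matter for the magic-byte check.
                is_pdf = chunk.startswith(PDF_MAGIC)
                first_chunk = False
            digest.update(chunk)  # feed the hash incrementally
    return is_pdf, digest.hexdigest()
```

Hashing chunk by chunk keeps memory use flat regardless of file size, which matters for reports that run to hundreds of megabytes. A file that fails the magic-byte check is typically an HTML error page saved under a .pdf name.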
Running the full pipeline produces:
- dataset/dataset_manifest.csv: 15-column manifest with URL, SHA256, classification, status
- dataset/dataset_manifest.jsonl: same data, streaming format
- dataset/corpus/<REPORT>/<WG>/*.pdf: curated PDFs organized by report
- dataset/excluded/: non-curated PDFs preserved with exclusion reason
- dataset/flat_view/: symlinks to corpus PDFs in one directory (loader convenience)
A typical run downloads ~500 PDFs and curates them down to ~242 primary-content PDFs (~3 GB).
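For a retrieval pipeline, one way to consume the curated output is to walk the corpus directory and group the PDFs by report and working group. The sketch below assumes every curated PDF sits exactly two levels below dataset/corpus, as the `<REPORT>/<WG>/*.pdf` layout suggests; `group_corpus` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict
from pathlib import Path


def group_corpus(corpus_dir: Path) -> dict[tuple[str, str], list[Path]]:
    """Group curated PDFs by (report, working group) directory names."""
    groups: dict[tuple[str, str], list[Path]] = defaultdict(list)
    # Assumed layout: <corpus_dir>/<REPORT>/<WG>/<file>.pdf
    for pdf in sorted(corpus_dir.glob("*/*/*.pdf")):
        report = pdf.parent.parent.name
        working_group = pdf.parent.name
        groups[(report, working_group)].append(pdf)
    return dict(groups)


if __name__ == "__main__":
    for (report, wg), pdfs in group_corpus(Path("dataset/corpus")).items():
        print(f"{report}/{wg}: {len(pdfs)} PDFs")
```

If a loader only needs a flat list of files, dataset/flat_view/ avoids the directory walk entirely.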
pytest # 74 tests, unit + integration
pytest tests/unit -v # unit only

Apache-2.0. See LICENSE.