IPCC-Scraper

A small Python package that discovers, downloads, and curates IPCC PDF reports (AR5, AR6, and the three AR6-cycle Special Reports: SR15, SRCCL, and SROCC) into a clean primary-content dataset for RAG / retrieval pipelines.

Install

From a published GitHub repo:

pip install git+https://github.com/herman181920/IPCC-Scraper.git

From a local checkout:

pip install -e .
# with test dependencies
pip install -e ".[test]"

Python 3.10+. The only runtime dependency is aiohttp.

Quick start

import asyncio
from pathlib import Path
from ipcc_scraper.discovery import DiscoveryOrchestrator
from ipcc_scraper.downloader import AdaptiveDownloader
from ipcc_scraper.manifest import ManifestStore

async def main():
    manifest_path = Path("dataset/dataset_manifest.csv")
    orchestrator = DiscoveryOrchestrator(manifest_path=manifest_path)
    entries = await orchestrator.discover()              # Phase 1
    downloader = AdaptiveDownloader(
        downloads_dir=Path("dataset/downloads"),
        manifest_store=ManifestStore(manifest_path),
    )
    await downloader.download(entries)                   # Phase 2
    # Phase 3 — strict-scope filter, run via CLI:
    #   python -m ipcc_scraper build-dataset

asyncio.run(main())

Or use the CLI:

python -m ipcc_scraper discover
python -m ipcc_scraper download
python -m ipcc_scraper build-dataset

Architecture

Three-phase pipeline:

Discovery (3 layers in parallel) — sitemap harvest, structured BFS crawl from /reports/, URL-pattern probing against templates for AR5/AR6/SR15/SRCCL/SROCC. Results merged by URL with provenance preserved (discovered_via: "sitemap+crawl").
Download — async concurrent downloader with adaptive concurrency (2–16), HTTP 429 backoff, magic-byte PDF verification, SHA256 streaming hash, .part resume.
Build dataset — URL/filename classifier extracts metadata (report, working group, document kind), then a strict-scope filter keeps only chapters/SPM/TS/annex/FAQ in English; drafts, translations, supplementary spreadsheets, and admin docs are excluded with reasons recorded.
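The URL merge in the Discovery phase can be illustrated with a short sketch. It assumes each layer returns a plain list of URLs; the function name `merge_by_url`, the layer label `probe`, and the example.org URLs are hypothetical. Only the `discovered_via: "sitemap+crawl"` format is taken from this README, and the package's own merge logic may differ.

```python
def merge_by_url(layers):
    """Merge (layer_name, urls) pairs into one entry per URL.

    `discovered_via` joins the names of every layer that found the URL
    with "+", in the order the layers are given.
    """
    found_by = {}  # url -> list of layer names
    for layer_name, urls in layers:
        for url in urls:
            names = found_by.setdefault(url, [])
            if layer_name not in names:  # skip repeats within one layer
                names.append(layer_name)
    return [
        {"url": url, "discovered_via": "+".join(names)}
        for url, names in found_by.items()
    ]


# Hypothetical layer results: a.pdf is found by two layers.
entries = merge_by_url([
    ("sitemap", ["https://example.org/a.pdf", "https://example.org/b.pdf"]),
    ("crawl", ["https://example.org/a.pdf"]),
    ("probe", ["https://example.org/c.pdf"]),
])
for entry in entries:
    print(entry["discovered_via"], entry["url"])
# sitemap+crawl https://example.org/a.pdf
# sitemap https://example.org/b.pdf
# probe https://example.org/c.pdf
```

Each URL appears once in the result, and the layer names travel with it.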
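The integrity checks named in the Download phase (magic-byte PDF verification, SHA256 computed in chunks, `.part` files) can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the package's code: the function name `finalize_part`, the 1 MiB chunk size, and hashing the file after the transfer rather than during it are all hypothetical. Adaptive concurrency, HTTP 429 backoff, and resume are not shown.

```python
import hashlib
import tempfile
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # header bytes expected at the start of a PDF


def finalize_part(part_path, chunk_size=1 << 20):
    """Check the magic bytes, hash the file in chunks, drop ".part".

    Returns the hex SHA256 digest. Raises ValueError and leaves the
    .part file untouched if the content does not look like a PDF.
    """
    part_path = Path(part_path)
    digest = hashlib.sha256()
    with part_path.open("rb") as fh:
        head = fh.read(len(PDF_MAGIC))
        if head != PDF_MAGIC:
            raise ValueError(f"not a PDF: {part_path.name}")
        digest.update(head)
        while chunk := fh.read(chunk_size):  # constant memory use
            digest.update(chunk)
    part_path.replace(part_path.with_suffix(""))  # x.pdf.part -> x.pdf
    return digest.hexdigest()


# Demo on throwaway files with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    payload = b"%PDF-1.7\nhypothetical content\n"
    good = Path(tmp) / "report.pdf.part"
    good.write_bytes(payload)
    sha = finalize_part(good)
    promoted = (Path(tmp) / "report.pdf").exists() and not good.exists()

    bad = Path(tmp) / "error.pdf.part"
    bad.write_bytes(b"<html>Too Many Requests</html>")
    try:
        finalize_part(bad)
        rejected = False
    except ValueError:
        rejected = bad.exists()  # still on disk under its .part name

print(promoted, rejected, sha == hashlib.sha256(payload).hexdigest())
# True True True
```

In this sketch a file only loses its `.part` suffix after the magic-byte check passes, so an HTML error page saved under a PDF name is never promoted.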
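The strict-scope filter in the Build dataset phase can be pictured as a small decision function. This sketch assumes the classifier has already produced a metadata record per PDF; the field names (`kind`, `language`, `is_draft`), the kind labels, and the reason strings are hypothetical. Only the kept categories (chapters, SPM, TS, annex, FAQ), the English-only rule, and the idea of recording a reason come from this README.

```python
KEPT_KINDS = {"chapter", "spm", "ts", "annex", "faq"}


def scope_decision(meta):
    """Return (keep, reason); reason is "" when the PDF is kept."""
    if meta.get("is_draft", False):
        return False, "draft"
    if meta["language"] != "en":
        return False, "translation"
    if meta["kind"] not in KEPT_KINDS:
        return False, f"out-of-scope kind: {meta['kind']}"
    return True, ""


# Hypothetical classifier output for four PDFs.
samples = [
    {"report": "AR6", "wg": "WG1", "kind": "chapter", "language": "en"},
    {"report": "AR6", "wg": "WG2", "kind": "spm", "language": "fr"},
    {"report": "AR6", "wg": "WG3", "kind": "chapter", "language": "en", "is_draft": True},
    {"report": "AR5", "wg": "WG3", "kind": "admin", "language": "en"},
]
decisions = [scope_decision(meta) for meta in samples]
for meta, (keep, reason) in zip(samples, decisions):
    print(meta["report"], meta["wg"], meta["kind"], "->", reason or "keep")
# AR6 WG1 chapter -> keep
# AR6 WG2 spm -> translation
# AR6 WG3 chapter -> draft
# AR5 WG3 admin -> out-of-scope kind: admin
```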

See docs/architecture.md for module-level detail.

Output

Running the full pipeline produces:

dataset/dataset_manifest.csv — 15-column manifest with URL, SHA256, classification, status
dataset/dataset_manifest.jsonl — same data, streaming format
dataset/corpus/<REPORT>/<WG>/*.pdf — curated PDFs organized by report
dataset/excluded/ — non-curated PDFs preserved with exclusion reason
dataset/flat_view/ — symlinks to corpus PDFs in one directory (loader convenience)
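For a retrieval pipeline, the curated PDFs can be enumerated directly from the `dataset/corpus/<REPORT>/<WG>/*.pdf` layout. A minimal sketch, assuming exactly two directory levels (report, then working group) under `corpus/`; the function name `index_corpus` is hypothetical, and the demo builds a throwaway tree with made-up file names instead of reading a real dataset.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def index_corpus(corpus_dir):
    """Group curated PDFs by (report, working group).

    Expects <corpus_dir>/<REPORT>/<WG>/<name>.pdf and returns a dict
    mapping (report, wg) to a sorted list of PDF paths.
    """
    index = defaultdict(list)
    for pdf in sorted(Path(corpus_dir).glob("*/*/*.pdf")):
        report, wg = pdf.parts[-3], pdf.parts[-2]
        index[(report, wg)].append(pdf)
    return dict(index)


# Demo on a throwaway tree with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus"
    for rel in ["AR6/WG1/chapter01.pdf", "AR6/WG1/spm.pdf", "AR5/WG2/ts.pdf"]:
        path = corpus / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"%PDF-1.7\n")
    index = index_corpus(corpus)
    summary = {key: [p.name for p in pdfs] for key, pdfs in index.items()}

print(summary)
# {('AR5', 'WG2'): ['ts.pdf'], ('AR6', 'WG1'): ['chapter01.pdf', 'spm.pdf']}
```

If grouping is not needed, `dataset/flat_view/` exposes the same PDFs in a single directory.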

A typical run downloads ~500 PDFs and curates them down to ~242 primary-content PDFs (~3 GB).

Tests

pytest                    # 74 tests, unit + integration
pytest tests/unit -v      # unit only

License

Apache-2.0. See LICENSE.
