IPCC-Scraper

A small Python package that discovers, downloads, and curates IPCC PDF reports (AR5, AR6, and the three AR6-cycle Special Reports) into a clean primary-content dataset for RAG / retrieval pipelines.

Install

From the GitHub repository:

pip install git+https://github.com/herman181920/IPCC-Scraper.git

From a local checkout:

pip install -e .
# with test dependencies
pip install -e ".[test]"

Requires Python 3.10+. The only runtime dependency is aiohttp.

Quick start

import asyncio
from pathlib import Path
from ipcc_scraper.discovery import DiscoveryOrchestrator
from ipcc_scraper.downloader import AdaptiveDownloader
from ipcc_scraper.manifest import ManifestStore

async def main():
    manifest_path = Path("dataset/dataset_manifest.csv")
    orchestrator = DiscoveryOrchestrator(manifest_path=manifest_path)
    entries = await orchestrator.discover()              # Phase 1
    downloader = AdaptiveDownloader(
        downloads_dir=Path("dataset/downloads"),
        manifest_store=ManifestStore(manifest_path),
    )
    await downloader.download(entries)                   # Phase 2
    # Phase 3 — strict-scope filter, run via CLI:
    #   python -m ipcc_scraper build-dataset

asyncio.run(main())

Or use the CLI:

python -m ipcc_scraper discover
python -m ipcc_scraper download
python -m ipcc_scraper build-dataset

Architecture

Three-phase pipeline:

  1. Discovery (3 layers in parallel): sitemap harvest, structured BFS crawl from /reports/, and URL-pattern probing against templates for AR5/AR6/SR15/SRCCL/SROCC. Results are merged by URL with provenance preserved (e.g. discovered_via: "sitemap+crawl").
  2. Download: async concurrent downloader with adaptive concurrency (2–16), HTTP 429 backoff, magic-byte PDF verification, streaming SHA256 hashing, and .part-file resume.
  3. Build dataset: a URL/filename classifier extracts metadata (report, working group, document kind), then a strict-scope filter keeps only English chapters, SPM, TS, annexes, and FAQs; drafts, translations, supplementary spreadsheets, and admin docs are excluded, with the reason recorded.
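The merge step in phase 1 can be illustrated with a minimal sketch. This is not the package's actual code: the function name merge_discoveries, the layer name "probe", and the fixed layer ordering are assumptions made for illustration. Only the "sitemap+crawl" provenance format comes from the description above.

```python
# Hypothetical layer names and order; only "sitemap" and "crawl" appear in the README.
LAYER_ORDER = ("sitemap", "crawl", "probe")


def merge_discoveries(layers: dict[str, list[str]]) -> dict[str, str]:
    """Merge per-layer URL lists into {url: discovered_via}.

    A URL found by several layers gets their names joined with "+",
    in LAYER_ORDER, e.g. "sitemap+crawl".
    """
    found: dict[str, set[str]] = {}
    for layer, urls in layers.items():
        if layer not in LAYER_ORDER:
            raise ValueError(f"unknown discovery layer: {layer!r}")
        for url in urls:
            # Record every layer that reported this URL.
            found.setdefault(url, set()).add(layer)
    return {
        url: "+".join(name for name in LAYER_ORDER if name in names)
        for url, names in found.items()
    }
```

Keeping the set of layers per URL, rather than overwriting on each hit, is what lets provenance survive deduplication.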

See docs/architecture.md for module-level detail.
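The two integrity checks named in the download phase, magic-byte PDF verification and a streaming SHA256 hash, can be sketched with the standard library alone. The function name verify_and_hash and the chunk size are assumptions for illustration, and the sketch checks a file already on disk rather than an in-flight response as the real downloader may do.

```python
import hashlib
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # every PDF file starts with these bytes


def verify_and_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA256 hex digest of a PDF, reading it in chunks.

    Raises ValueError if the file does not start with the PDF magic bytes
    (for example, an HTML error page saved under a .pdf name).
    """
    sha256 = hashlib.sha256()
    with path.open("rb") as fh:
        head = fh.read(len(PDF_MAGIC))
        if head != PDF_MAGIC:
            raise ValueError(f"{path} does not look like a PDF")
        sha256.update(head)
        # Stream the rest so large reports never sit fully in memory.
        while chunk := fh.read(chunk_size):
            sha256.update(chunk)
    return sha256.hexdigest()
```

Hashing in chunks keeps memory use bounded by chunk_size regardless of report size.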

Output

Running the full pipeline produces:

  • dataset/dataset_manifest.csv — 15-column manifest with URL, SHA256, classification, status
  • dataset/dataset_manifest.jsonl — same data, streaming format
  • dataset/corpus/<REPORT>/<WG>/*.pdf — curated PDFs organized by report
  • dataset/excluded/ — non-curated PDFs preserved with exclusion reason
  • dataset/flat_view/ — symlinks to corpus PDFs in one directory (loader convenience)
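As a rough illustration of how a flat view like dataset/flat_view/ could be derived from the corpus tree, the sketch below creates one relative symlink per PDF. The function name build_flat_view and the "__"-joined link naming are assumptions; the package may name its links differently.

```python
import os
from pathlib import Path


def build_flat_view(corpus_dir: Path, flat_dir: Path) -> int:
    """Symlink every PDF under corpus_dir into the single directory flat_dir.

    Link names join the relative path parts with "__" (a hypothetical scheme)
    so that same-named files from different reports do not collide.
    Returns the number of links created.
    """
    flat_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for pdf in sorted(corpus_dir.rglob("*.pdf")):
        rel = pdf.relative_to(corpus_dir)
        link = flat_dir / "__".join(rel.parts)
        if link.is_symlink() or link.exists():
            link.unlink()  # replace stale links from earlier runs
        # A relative target keeps the dataset directory relocatable.
        link.symlink_to(os.path.relpath(pdf, flat_dir))
        count += 1
    return count
```

A loader can then glob a single directory, for example Path("dataset/flat_view").glob("*.pdf"), without walking the report hierarchy.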

A typical run downloads ~500 PDFs and curates them down to ~242 primary-content PDFs (~3 GB).

Tests

pytest                    # 74 tests, unit + integration
pytest tests/unit -v      # unit only

License

Apache-2.0. See LICENSE.
