ccdown

A polite downloader for Common Crawl data, written in Rust.


Install

cargo install ccdown
Other methods

From source

git clone https://github.com/4thel00z/ccdown.git
cd ccdown
cargo install --path .

Pre-built binaries

Grab the latest release for your platform from the releases page.

Usage

1. Download the path manifest for a crawl

ccdown download-paths CC-MAIN-2025-08 warc ./paths

Supported subsets: segment, warc, wat, wet, robotstxt, non200responses, cc-index, cc-index-table

Crawl format: CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM
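To catch a mistyped crawl ID before making any request, the two formats above can be checked with a small helper. This is a sketch only: the function name is hypothetical, the week and month ranges (01 to 53, 01 to 12) are assumptions, and ccdown's own validation may differ.

```python
import re

# Pattern built only from the two formats listed above (assumption:
# ccdown may accept or reject other shapes).
_CRAWL_ID = re.compile(r"(CC-MAIN|CC-NEWS)-(\d{4})-(\d{2})")


def is_valid_crawl_id(crawl_id: str) -> bool:
    """Return True if crawl_id looks like CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM."""
    match = _CRAWL_ID.fullmatch(crawl_id)
    if match is None:
        return False
    kind = match.group(1)
    part = int(match.group(3))
    # Assumed ranges: ISO weeks for CC-MAIN, calendar months for CC-NEWS.
    upper = 53 if kind == "CC-MAIN" else 12
    return 1 <= part <= upper
```

For example, is_valid_crawl_id("CC-MAIN-2025-08") passes, while "CC-MAIN-2025-8" (week not zero-padded) does not.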

2. Download the actual data

ccdown download ./paths/warc.paths.gz ./data
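The manifest passed to download is a gzip-compressed text file with one relative path per line. The sketch below shows how such a file maps to download URLs. The function name is hypothetical, and the base URL is Common Crawl's public HTTP endpoint, assumed here to be what ccdown targets.

```python
import gzip
import io

# Common Crawl's public HTTP endpoint (assumption: ccdown uses the same host).
BASE_URL = "https://data.commoncrawl.org/"


def manifest_urls(raw: bytes) -> list[str]:
    """Decompress a *.paths.gz payload and turn each non-empty line into a URL."""
    with gzip.open(io.BytesIO(raw), "rt", encoding="utf-8") as fh:
        return [BASE_URL + line.strip() for line in fh if line.strip()]
```

Reading ./paths/warc.paths.gz as bytes and passing it to manifest_urls gives the list of files a download run would need to fetch.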

Options

Flag  Description                                    Default
-t    Number of concurrent downloads                 10
-r    Max retries per file                           1000
-p    Show progress bars                             off
-f    Flat file output (no directory structure)      off
-n    Numbered output (for Ungoliant Pipeline)       off
-s    Abort on unrecoverable errors (401, 403, 404)  off

Example

ccdown download -p -t 5 ./paths/warc.paths.gz ./data

Note: Keep -t at 10 or below. Too many concurrent requests will make the server respond with 403, and those errors are unrecoverable.
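The interaction between -r and -s can be pictured as a small decision function. This is an illustrative sketch, not ccdown's implementation: the function name and the "skip" outcome for non-strict runs are assumptions, and only the 401/403/404 set and the retry limit come from the options above.

```python
# Status codes the options table lists as unrecoverable.
UNRECOVERABLE = frozenset({401, 403, 404})


def next_action(status: int, attempt: int, max_retries: int, strict: bool) -> str:
    """Return 'done', 'retry', 'skip', or 'abort' for one finished request.

    Assumption: without strict mode, an unrecoverable file is skipped and the
    run continues; ccdown's real behavior may differ.
    """
    if 200 <= status < 300:
        return "done"
    if status in UNRECOVERABLE:
        # Retrying cannot help, so either stop the whole run or move on.
        return "abort" if strict else "skip"
    # Anything else (for example 500 or 503) is treated as transient.
    return "retry" if attempt < max_retries else "skip"
```

Under these assumptions, a 403 on the first attempt with -s set ends the run immediately, while a 503 is retried until the limit given by -r is reached.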

Fetch a single WARC record

Fetch one record by byte offset (e.g. a PDF pointed at by a columnar-index or FinePDFs row) without downloading the whole WARC. ccdown sends an HTTP Range request and stops reading after one gzip member:

ccdown fetch-record crawl-data/CC-MAIN-2025-08/segments/.../x.warc.gz 12345 -o out.pdf

Flag         Description                             Default
-o           Write the record body to this file      required
-r           Max retries                             10
--max-bytes  Size cap (compressed and decompressed)  104857600 (100 MiB)

The library API is ccdown::fetch_record(file_path, offset, &RecordOptions), which returns the WARC headers, the HTTP headers, and the body. It targets WARC-Type: response records. (Python bindings for this are not exposed yet.)
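The "one gzip member" idea works because each record in a .warc.gz file is compressed as its own gzip member, so decompression can begin at a record's byte offset and stop at that member's end. The sketch below illustrates this with Python's standard zlib module. It is not ccdown's code: both function names are hypothetical, and the open-ended Range header is an assumption about how such a request could be formed.

```python
import zlib


def range_header(offset: int) -> dict[str, str]:
    """Build an open-ended HTTP Range header starting at offset (assumed form)."""
    return {"Range": f"bytes={offset}-"}


def first_gzip_member(data: bytes, max_bytes: int = 100 * 1024 * 1024) -> bytes:
    """Decompress only the first gzip member at the start of data.

    Bytes belonging to later members are left untouched, and at most
    max_bytes of output are produced.
    """
    decoder = zlib.decompressobj(wbits=31)  # 31 = expect gzip header and trailer
    out = decoder.decompress(data, max_bytes)
    if not decoder.eof:
        # The member did not end: input was cut short or the cap was hit.
        raise ValueError("record truncated or not fully decoded within max_bytes")
    return out
```

In practice the bytes would come from a streamed response to the Range request, and reading could stop as soon as the decoder reports the end of the member. The output is the raw WARC record, so the WARC and HTTP headers would still need to be split from the body.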

Python bindings

Install

pip install ccdown

Usage

from ccdown import Client

client = Client(threads=10, retries=1000, progress=True)

# Download the path manifest for a crawl
client.paths("CC-MAIN-2025-08", "warc").to("./paths")

# Download the actual data
client.download("./paths/warc.paths.gz").to("./data")

# Flat file output (no directory structure)
client.download("./paths/warc.paths.gz").files_only().to("./data")

# Numbered output + strict mode (abort on 401/403/404)
client.download("./paths/warc.paths.gz").numbered().strict().to("./data")

API

Client(threads=10, retries=1000, progress=False) — Create a client with shared config.

client.paths(snapshot, data_type) — Returns a builder. Call .to(dst) to download the path manifest.

client.download(path_file) — Returns a builder with chainable options:

  • .files_only() — flatten directory structure
  • .numbered() — enumerate output files (for Ungoliant)
  • .strict() — abort on unrecoverable HTTP errors
  • .to(dst) — execute the download

License

MIT OR Apache-2.0
