A polite downloader for Common Crawl data, written in Rust.

    cargo install ccdown

Other methods

Build from source:

    git clone https://github.com/4thel00z/ccdown.git
    cd ccdown
    cargo install --path .

Or grab the latest release for your platform from the releases page.

    ccdown download-paths CC-MAIN-2025-08 warc ./paths

Supported subsets: `segment`, `warc`, `wat`, `wet`, `robotstxt`, `non200responses`, `cc-index`, `cc-index-table`

Crawl format: `CC-MAIN-YYYY-WW` or `CC-NEWS-YYYY-MM`
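To catch a mistyped crawl identifier before starting a long download, the two formats above can be checked up front. The snippet below is a minimal sketch; `parse_crawl_id` is a hypothetical helper, not part of ccdown, and the accepted ranges (weeks 1 to 53, months 1 to 12) are an assumption.

```python
import re

# Pattern derived from the two formats named above:
# CC-MAIN-YYYY-WW (year + week) and CC-NEWS-YYYY-MM (year + month).
_CRAWL_ID = re.compile(r"CC-(MAIN|NEWS)-(\d{4})-(\d{2})")


def parse_crawl_id(crawl_id: str) -> tuple[str, int, int]:
    """Split a crawl identifier into (kind, year, week_or_month).

    Hypothetical helper; raises ValueError when the string matches neither format.
    """
    match = _CRAWL_ID.fullmatch(crawl_id)
    if match is None:
        raise ValueError(f"not a valid crawl identifier: {crawl_id!r}")
    kind = match.group(1)
    year = int(match.group(2))
    number = int(match.group(3))
    # Assumed ranges: weeks run 1..53 for CC-MAIN, months 1..12 for CC-NEWS.
    upper = 53 if kind == "MAIN" else 12
    if not 1 <= number <= upper:
        raise ValueError(f"week/month out of range in {crawl_id!r}")
    return kind, year, number
```

Calling `parse_crawl_id("CC-MAIN-2025-08")` returns the kind, year, and week as separate values, which is also convenient for sorting crawls chronologically.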

    ccdown download ./paths/warc.paths.gz ./data

| Flag | Description | Default |
|---|---|---|
| `-t` | Number of concurrent downloads | 10 |
| `-r` | Max retries per file | 1000 |
| `-p` | Show progress bars | off |
| `-f` | Flat file output (no directory structure) | off |
| `-n` | Numbered output (for Ungoliant Pipeline) | off |
| `-s` | Abort on unrecoverable errors (401, 403, 404) | off |
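For a quick look at what a manifest contains before downloading it, the file can be read directly: a Common Crawl `*.paths.gz` file is gzip-compressed text with one relative path per line. The sketch below expands those lines into full URLs. `manifest_urls` is a hypothetical helper, and using `https://data.commoncrawl.org/` as the prefix reflects Common Crawl's public HTTPS endpoint; whether ccdown uses exactly this base URL internally is an assumption.

```python
import gzip

# Public HTTPS prefix for Common Crawl data; manifest paths are relative to it.
# Assumption: ccdown may be configured differently internally.
BASE_URL = "https://data.commoncrawl.org/"


def manifest_urls(manifest_path: str) -> list[str]:
    """Return the full URL for every non-empty line of a *.paths.gz manifest."""
    urls = []
    with gzip.open(manifest_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            relative = line.strip()
            if relative:  # skip blank lines
                urls.append(BASE_URL + relative)
    return urls
```

The length of `manifest_urls("./paths/warc.paths.gz")` gives the number of files a full download would fetch, which helps when estimating disk space.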

    ccdown download -p -t 5 ./paths/warc.paths.gz ./data

Note: Keep threads at 10 or below. Too many concurrent requests will get you 403'd by the server, and those errors are unrecoverable.
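The difference between retryable and unrecoverable errors can be illustrated with a small retry loop. This is a sketch under stated assumptions, not ccdown's implementation: `fetch_with_retries`, `UnrecoverableError`, and the capped exponential backoff are all hypothetical; only the status codes 401, 403, and 404 and the default of 1000 retries come from the flag table above.

```python
import time

# Status codes the flag table lists as unrecoverable.
UNRECOVERABLE = {401, 403, 404}


class UnrecoverableError(Exception):
    """Raised when retrying cannot help (hypothetical error type)."""


def fetch_with_retries(fetch, max_retries=1000, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call fetch() until it succeeds, retrying only recoverable failures.

    fetch is any callable returning (status_code, body).
    The capped exponential backoff is an assumption for illustration.
    """
    attempt = 0
    while True:
        status, body = fetch()
        if 200 <= status < 300:
            return body
        if status in UNRECOVERABLE:
            # Retrying a 401/403/404 only adds load on the server.
            raise UnrecoverableError(f"HTTP {status}")
        if attempt >= max_retries:
            raise RuntimeError(f"gave up after {max_retries} retries (HTTP {status})")
        # Double the wait each time, capped at max_delay.
        sleep(min(max_delay, base_delay * 2 ** min(attempt, 16)))
        attempt += 1
```

Passing `sleep` as a parameter keeps the loop easy to test without real waiting; in normal use the default `time.sleep` applies.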

Fetch one record by byte offset (e.g. a PDF pointed at by a columnar-index or FinePDFs row) without downloading the whole WARC. Sends an HTTP Range request and stops after one gzip member:

    ccdown fetch-record crawl-data/CC-MAIN-2025-08/segments/.../x.warc.gz 12345 -o out.pdf

| Flag | Description | Default |
|---|---|---|
| `-o` | Write the record body to this file | required |
| `-r` | Max retries | 10 |
| `--max-bytes` | Size cap (compressed and decompressed) | 104857600 (100 MiB) |
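The technique behind `fetch-record` (a Range request starting at the record's offset, then decompressing until the first gzip member ends) can be sketched with the Python standard library. This is an illustration under assumptions, not ccdown's code: `read_one_member` and `fetch_member` are hypothetical names, and the result here is the whole decompressed WARC record (WARC headers, HTTP headers, and body) rather than just the body.

```python
import urllib.request
import zlib

DEFAULT_MAX_BYTES = 100 * 1024 * 1024  # 100 MiB, matching the --max-bytes default


def read_one_member(chunks, max_bytes=DEFAULT_MAX_BYTES):
    """Decompress exactly one gzip member from an iterable of byte chunks."""
    decoder = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header and trailer
    out = bytearray()
    consumed = 0
    for chunk in chunks:
        consumed += len(chunk)
        out += decoder.decompress(chunk)
        if len(out) > max_bytes:
            raise ValueError("decompressed record exceeds max_bytes")
        if decoder.eof:
            # End of the first member: ignore everything after it.
            return bytes(out)
        if consumed > max_bytes:
            raise ValueError("compressed record exceeds max_bytes")
    raise ValueError("stream ended before the gzip member was complete")


def fetch_member(url, offset, chunk_size=65536, max_bytes=DEFAULT_MAX_BYTES):
    """Request bytes from `offset` onward and stop reading after one member."""
    request = urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})
    with urllib.request.urlopen(request) as response:
        chunks = iter(lambda: response.read(chunk_size), b"")
        return read_one_member(chunks, max_bytes)
```

Because WARC files from Common Crawl store each record as its own gzip member, stopping at `decoder.eof` yields one record, and leaving the `with` block closes the connection so the rest of the file is not read.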

The library API is `ccdown::fetch_record(file_path, offset, &RecordOptions)`,
returning WARC headers, HTTP headers, and the body. Targets `WARC-Type: response`
records. (Python bindings for this are not exposed yet.)
Python bindings

    pip install ccdown

Usage:

    from ccdown import Client

    client = Client(threads=10, retries=1000, progress=True)

    # Download the path manifest for a crawl
    client.paths("CC-MAIN-2025-08", "warc").to("./paths")

    # Download the actual data
    client.download("./paths/warc.paths.gz").to("./data")

    # Flat file output (no directory structure)
    client.download("./paths/warc.paths.gz").files_only().to("./data")

    # Numbered output + strict mode (abort on 401/403/404)
    client.download("./paths/warc.paths.gz").numbered().strict().to("./data")

`Client(threads=10, retries=1000, progress=False)`: Create a client with shared config.

`client.paths(snapshot, data_type)`: Returns a builder. Call `.to(dst)` to download the path manifest.

`client.download(path_file)`: Returns a builder with chainable options:

- `.files_only()`: flatten directory structure
- `.numbered()`: enumerate output files (for Ungoliant)
- `.strict()`: abort on unrecoverable HTTP errors
- `.to(dst)`: execute the download

MIT OR Apache-2.0
