4thel00z/ccdown: A Rust-based, resumable downloader CLI and Python library for Common Crawl data

A polite downloader for Common Crawl data, written in Rust.

Install

cargo install ccdown

Other methods

From source

git clone https://github.com/4thel00z/ccdown.git
cd ccdown
cargo install --path .

Pre-built binaries

Grab the latest release for your platform from the releases page.

Usage

1. Download the path manifest for a crawl

ccdown download-paths CC-MAIN-2025-08 warc ./paths

Supported subsets: segment, warc, wat, wet, robotstxt, non200responses, cc-index, cc-index-table

Crawl format: CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM
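To make the two ID shapes concrete, here is a minimal Python sketch that checks a crawl ID before it is passed to `download-paths`. The helper `is_valid_crawl_id` is hypothetical and not part of ccdown; the range checks (weeks 01 to 53, months 01 to 12) are an assumption about what counts as a sensible ID, not a description of what ccdown itself validates.

```python
import re

# Hypothetical helper, not part of ccdown: mirrors the two formats named above.
_MAIN = re.compile(r"CC-MAIN-([0-9]{4})-([0-9]{2})")  # CC-MAIN-YYYY-WW
_NEWS = re.compile(r"CC-NEWS-([0-9]{4})-([0-9]{2})")  # CC-NEWS-YYYY-MM


def is_valid_crawl_id(crawl_id: str) -> bool:
    """Return True if crawl_id looks like CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM."""
    match = _MAIN.fullmatch(crawl_id)
    if match:
        # Assumed range: ISO week numbers run from 01 to 53.
        return 1 <= int(match.group(2)) <= 53
    match = _NEWS.fullmatch(crawl_id)
    if match:
        # Assumed range: calendar months run from 01 to 12.
        return 1 <= int(match.group(2)) <= 12
    return False


print(is_valid_crawl_id("CC-MAIN-2025-08"))
print(is_valid_crawl_id("CC-NEWS-2025-13"))
```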

2. Download the actual data

ccdown download ./paths/warc.paths.gz ./data
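For orientation, a Common Crawl path manifest is a gzip-compressed text file with one relative path per line, and each path becomes a download URL once a base prefix is prepended. The sketch below shows that mapping in Python. The function name `manifest_urls` is hypothetical, the sample paths are made-up placeholders, and how ccdown builds its URLs internally is an assumption; the prefix `https://data.commoncrawl.org/` is the public HTTP endpoint Common Crawl documents for its data.

```python
import gzip
import io

# Public HTTP prefix that Common Crawl documents for its data paths.
BASE_URL = "https://data.commoncrawl.org/"


def manifest_urls(manifest_bytes: bytes) -> list[str]:
    """Decompress a *.paths.gz manifest and turn each relative path into a URL."""
    with gzip.open(io.BytesIO(manifest_bytes), "rt", encoding="utf-8") as fh:
        # Skip blank lines; every other line is one relative path.
        return [BASE_URL + line.strip() for line in fh if line.strip()]


# Tiny in-memory manifest so the example runs without network access.
# The two paths are placeholders, not real Common Crawl files.
sample = gzip.compress(
    b"crawl-data/EXAMPLE/a.warc.gz\ncrawl-data/EXAMPLE/b.warc.gz\n"
)
urls = manifest_urls(sample)
for url in urls:
    print(url)
```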

Options

| Flag | Description | Default |
|------|-------------|---------|
| `-t` | Number of concurrent downloads | `10` |
| `-r` | Max retries per file | `1000` |
| `-p` | Show progress bars | off |
| `-f` | Flat file output (no directory structure) | off |
| `-n` | Numbered output (for Ungoliant Pipeline) | off |
| `-s` | Abort on unrecoverable errors (401, 403, 404) | off |

Example

ccdown download -p -t 5 ./paths/warc.paths.gz ./data

Note: Keep threads at 10 or below. Too many concurrent requests will get you a 403 from the server, and those errors are unrecoverable.
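The interplay of the thread limit, the retry count, and the unrecoverable status codes can be sketched in Python as follows. This is an illustration of the general technique (a bounded worker pool plus a retry loop that gives up early on 401, 403, and 404), not ccdown's actual implementation. All names here (`fetch_with_retries`, `download_all`, `UnrecoverableError`, and the `fetch` callable returning a `(status, body)` pair) are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Status codes the options table lists as unrecoverable.
UNRECOVERABLE = {401, 403, 404}


class UnrecoverableError(Exception):
    """Raised when the server answers with a status that retrying cannot fix."""


def fetch_with_retries(fetch, url, max_retries=1000, delay=0.0):
    """Call fetch(url) until it succeeds, hits an unrecoverable status,
    or has failed max_retries + 1 times."""
    for _ in range(max_retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status in UNRECOVERABLE:
            raise UnrecoverableError(f"{url}: HTTP {status}")
        time.sleep(delay)  # a real client would back off here
    raise RuntimeError(f"{url}: still failing after {max_retries} retries")


def download_all(fetch, urls, threads=10, max_retries=1000):
    """Fetch every URL, with at most `threads` requests in flight at once."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(
            pool.map(lambda u: fetch_with_retries(fetch, u, max_retries), urls)
        )


# Fake server for the demo: each URL fails twice with 503, then succeeds.
_attempts = {}
_lock = threading.Lock()


def flaky_fetch(url):
    with _lock:
        _attempts[url] = _attempts.get(url, 0) + 1
        count = _attempts[url]
    return (200, url.encode()) if count >= 3 else (503, b"")


results = download_all(flaky_fetch, ["a", "b"], threads=2, max_retries=5)
print(results)
```

Transient failures such as the 503 in the demo are retried, while a 403 ends the attempt immediately, which is why staying under the server's concurrency tolerance matters more than a high retry count.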

Fetch a single WARC record

Fetch one record by byte offset (e.g. a PDF referenced by a columnar-index or FinePDFs row) without downloading the whole WARC file. ccdown sends an HTTP Range request and stops after one gzip member:

ccdown fetch-record crawl-data/CC-MAIN-2025-08/segments/.../x.warc.gz 12345 -o out.pdf

| Flag | Description | Default |
|------|-------------|---------|
| `-o` | Write the record body to this file | required |
| `-r` | Max retries | `10` |
| `--max-bytes` | Size cap (compressed and decompressed) | `104857600` (100 MiB) |

The library API is ccdown::fetch_record(file_path, offset, &RecordOptions), which returns the WARC headers, the HTTP headers, and the body. It targets records with WARC-Type: response. (Python bindings for this function are not exposed yet.)
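The "Range request, then stop after one gzip member" idea works because Common Crawl WARC files store each record as its own gzip member, so decompression can start at a record's byte offset. The Python sketch below illustrates the technique with the standard library only. It is not ccdown's code: `range_header` and `first_gzip_member` are hypothetical names, the size check is a simplified stand-in for `--max-bytes` that only looks at decompressed output, and the demo uses an in-memory stream instead of a real HTTP response.

```python
import gzip
import zlib


def range_header(offset: int) -> dict[str, str]:
    """Open-ended Range header asking for bytes from `offset` onward."""
    return {"Range": f"bytes={offset}-"}


def first_gzip_member(chunks, max_bytes=100 * 1024 * 1024) -> bytes:
    """Decompress exactly one gzip member from an iterable of byte chunks."""
    decoder = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header and trailer
    out = bytearray()
    for chunk in chunks:
        out += decoder.decompress(chunk)
        # Simplified cap: checked on decompressed size after each chunk.
        if len(out) > max_bytes:
            raise ValueError("record exceeds the size cap")
        if decoder.eof:
            # End of the first member: ignore whatever bytes follow.
            return bytes(out)
    raise ValueError("stream ended before the gzip member was complete")


# Demo: two concatenated gzip members, delivered in small chunks the way
# a streaming HTTP body would arrive.
stream = gzip.compress(b"record one") + gzip.compress(b"record two")
chunks = [stream[i:i + 8] for i in range(0, len(stream), 8)]

print(range_header(12345))
print(first_gzip_member(chunks))
```

Because the reader stops as soon as the first member ends, the connection can be closed without transferring the rest of the file.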

Python bindings

Install

pip install ccdown

Usage

from ccdown import Client

client = Client(threads=10, retries=1000, progress=True)

# Download the path manifest for a crawl
client.paths("CC-MAIN-2025-08", "warc").to("./paths")

# Download the actual data
client.download("./paths/warc.paths.gz").to("./data")

# Flat file output (no directory structure)
client.download("./paths/warc.paths.gz").files_only().to("./data")

# Numbered output + strict mode (abort on 401/403/404)
client.download("./paths/warc.paths.gz").numbered().strict().to("./data")

API

Client(threads=10, retries=1000, progress=False) — Create a client with shared config.

client.paths(snapshot, data_type) — Returns a builder. Call .to(dst) to download the path manifest.

client.download(path_file) — Returns a builder with chainable options:

.files_only() — flatten directory structure
.numbered() — enumerate output files (for Ungoliant)
.strict() — abort on unrecoverable HTTP errors
.to(dst) — execute the download

License

MIT OR Apache-2.0
