xberg-io/crawlberg

Crawlberg

High-performance Rust web crawling engine for structured data extraction. Scrape, crawl, and map websites with native bindings for 14 languages — same engine, identical results across every runtime.

What and Why?

Crawlberg is the crawling substrate: everything you need to scrape and crawl a site end-to-end from a single Rust core — HTML→Markdown, headless-Chrome fallback, robots/sitemap parsing, per-domain throttling, and an SSRF-safe policy — with identical results across 14 language bindings.

Productization concerns (managed proxy pools, tuned WAF fingerprints, authenticated-session injection, scheduling, billing) live in xberg-enterprise, the reference operational implementation. Every extension point (Frontier, RateLimiter, CrawlStore, EventEmitter, ContentFilter, WafClassifier, …) is a trait you inject via CrawlEngineBuilder::with_<trait>(...).
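As a rough illustration of the injection pattern described above, here is a minimal, self-contained sketch. It does not use the Crawlberg crate: `EngineBuilder`, `Engine`, `FixedDelay`, `PerDomainDelay`, and the `delay_for` method are hypothetical stand-ins, and the real `RateLimiter` trait and `CrawlEngineBuilder` may have different signatures. Only the general shape (a default implementation replaced through a `with_<trait>` builder method) is taken from the text.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical stand-in for an injectable extension point.
/// The real `RateLimiter` trait in Crawlberg may look different.
trait RateLimiter {
    /// How long to wait before the next request to `domain`.
    fn delay_for(&self, domain: &str) -> Duration;
}

/// Default policy: the same delay for every domain.
struct FixedDelay(Duration);

impl RateLimiter for FixedDelay {
    fn delay_for(&self, _domain: &str) -> Duration {
        self.0
    }
}

/// Custom policy: per-domain overrides with a fallback delay.
struct PerDomainDelay {
    overrides: HashMap<String, Duration>,
    fallback: Duration,
}

impl RateLimiter for PerDomainDelay {
    fn delay_for(&self, domain: &str) -> Duration {
        self.overrides.get(domain).copied().unwrap_or(self.fallback)
    }
}

/// Illustrative engine that holds its rate limiter as a trait object.
struct Engine {
    rate_limiter: Box<dyn RateLimiter>,
}

/// Illustrative builder; not the real `CrawlEngineBuilder`.
struct EngineBuilder {
    rate_limiter: Box<dyn RateLimiter>,
}

impl EngineBuilder {
    /// Starts with a default implementation so injection is optional.
    fn new() -> Self {
        Self {
            rate_limiter: Box::new(FixedDelay(Duration::from_millis(500))),
        }
    }

    /// Replaces the default with any caller-supplied implementation.
    fn with_rate_limiter(mut self, limiter: impl RateLimiter + 'static) -> Self {
        self.rate_limiter = Box::new(limiter);
        self
    }

    fn build(self) -> Engine {
        Engine {
            rate_limiter: self.rate_limiter,
        }
    }
}

fn main() {
    let mut overrides = HashMap::new();
    overrides.insert("example.com".to_string(), Duration::from_secs(2));

    let engine = EngineBuilder::new()
        .with_rate_limiter(PerDomainDelay {
            overrides,
            fallback: Duration::from_millis(250),
        })
        .build();

    println!("{:?}", engine.rate_limiter.delay_for("example.com"));
    println!("{:?}", engine.rate_limiter.delay_for("example.org"));
}
```

Because the engine only sees the trait, an operational layer can swap in its own policy without touching the core. See the Rust README for the actual trait definitions.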

Features

| Feature | Description |
| --- | --- |
| Structured extraction | Text, metadata, links, images, assets, JSON-LD, Open Graph, hreflang, favicons, headings, response headers |
| Markdown conversion | Clean Markdown with citations, document structure, and fit-content mode |
| Concurrent crawling | Depth-first, breadth-first, or best-first traversal with configurable depth, page limits, and concurrency |
| 14 language bindings | Rust, Python, Node.js, Ruby, Go, Java, Kotlin (Android), C#, PHP, Elixir, Dart, Swift, Zig, and WebAssembly |
| Smart filtering | BM25 relevance scoring, URL include/exclude patterns, robots.txt compliance, sitemap discovery |
| Browser rendering | Optional headless browser for JavaScript-heavy SPAs with WAF detection and bypass |
| Batch & streaming | Scrape or crawl hundreds of URLs concurrently; real-time crawl events via async streams |
| SSRF-safe by default | Refuses loopback, private, link-local, and cloud-metadata addresses; opt out via env var or CrawlConfig |
| Auth & rate limiting | HTTP Basic, Bearer, and custom-header auth with cookie jars; per-domain request throttling |
| MCP server & REST API | Model Context Protocol integration for AI agents plus an HTTP server with OpenAPI spec |
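To make the "SSRF-safe by default" row concrete, the following is a minimal sketch of the kind of address check such a policy implies, written against the Rust standard library only. It is not Crawlberg's implementation: the function names (`is_blocked`, `is_blocked_v4`, `is_blocked_v6`) are hypothetical, and the real policy may cover more ranges and must also be applied to the addresses a hostname resolves to, not just to literal IPs.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Returns true for IPv4 addresses a crawler should refuse by default.
fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    ip.is_loopback()          // 127.0.0.0/8
        || ip.is_private()    // 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
        || ip.is_link_local() // 169.254.0.0/16, which contains the common
                              // cloud-metadata address 169.254.169.254
}

/// Returns true for IPv6 addresses a crawler should refuse by default.
fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    // IPv4-mapped addresses (::ffff:a.b.c.d) are judged by the IPv4 rules,
    // so they cannot be used to sidestep the check.
    if let Some(v4) = ip.to_ipv4_mapped() {
        return is_blocked_v4(v4);
    }
    let first = ip.segments()[0];
    ip.is_loopback()                  // ::1
        || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
        || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
}

/// Hypothetical entry point: decide whether a resolved address is off-limits.
fn is_blocked(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_blocked_v4(v4),
        IpAddr::V6(v6) => is_blocked_v6(v6),
    }
}

fn main() {
    let candidates = [
        "127.0.0.1",
        "10.1.2.3",
        "169.254.169.254",
        "::ffff:192.168.0.1",
        "93.184.216.34",
    ];
    for text in candidates {
        let ip: IpAddr = text.parse().expect("valid IP literal");
        println!("{text}: blocked = {}", is_blocked(ip));
    }
}
```

Checking the resolved address rather than the URL string matters because a public hostname can point at an internal IP. How Crawlberg exposes the opt-out (the env var and the CrawlConfig field) is documented in the per-language READMEs.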

Supported Platforms

Precompiled binaries for Linux (x86_64/aarch64), macOS (ARM64), and Windows (x64) across every binding. See the platform support reference for the full matrix.

⭐ Star this repo to show your support — it helps others discover Crawlberg.

Quick Start

Language Packages

Python
pip install crawlberg

See Python README for full documentation.

Node.js
npm install @kreuzberg/crawlberg

See Node.js README for full documentation.

Rust
cargo add crawlberg

See Rust README for full documentation.

Go
go get github.com/xberg-io/crawlberg/packages/go

See Go README for full documentation.

Java

Available on Maven Central as dev.kreuzberg.crawlberg:crawlberg. See Java README for the dependency snippet and current version.

C#
dotnet add package Crawlberg

See C# README for full documentation.

Ruby
gem install crawlberg

See Ruby README for full documentation.

PHP
composer require xberg-io/crawlberg

See PHP README for full documentation.

Elixir

Add {:crawlberg, "~> 0.3"} to your mix.exs dependencies. See Elixir README for full documentation.

Dart / Flutter
dart pub add crawlberg

See Dart README for full documentation.

Kotlin (Android)

Available on Maven Central as dev.kreuzberg.crawlberg.android:crawlberg-android. See Kotlin README for the dependency snippet and current version.

Swift

Add via Swift Package Manager. See Swift README for full documentation.

Zig

See Zig README for installation and usage.

WebAssembly
npm install @kreuzberg/crawlberg-wasm

See WebAssembly README for full documentation.

C/C++ (FFI)

C header + shared library from GitHub Releases. See FFI crate for full documentation.

CLI
cargo install crawlberg-cli
brew install xberg-io/tap/crawlberg

See CLI README for full documentation.

AI Coding Assistants

Install the Crawlberg plugin from the xberg-io/plugins marketplace. It ships the Crawlberg agent skills (site crawling, HTML→Markdown scraping, headless-Chrome fallback) plus the crawlberg MCP server, and works with every major coding agent. Instructions for each harness follow.

Claude Code
/plugin marketplace add xberg-io/plugins
/plugin install crawlberg@kreuzberg

Codex CLI
/plugins add https://github.com/xberg-io/plugins

Then search for crawlberg and select Install Plugin.

Cursor

Settings → Plugins → Add from URL → https://github.com/xberg-io/plugins, then select crawlberg.

Gemini CLI
gemini extensions install https://github.com/xberg-io/plugins

Factory Droid
droid plugin marketplace add https://github.com/xberg-io/plugins
droid plugin install crawlberg@kreuzberg

GitHub Copilot CLI
copilot plugin marketplace add https://github.com/xberg-io/plugins
copilot plugin install crawlberg@kreuzberg

opencode

Add the package to opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "plugin": ["@kreuzberg/opencode-crawlberg"]
}

Documentation

Full guides, per-language API references, the substrate/operational model, antibot strategy, and observability live at docs.crawlberg.xberg.io.

Contributing

Contributions are welcome! See our Contributing Guide.

Part of Kreuzberg.dev

  • Kreuzberg — document intelligence: text, tables, metadata from 91+ formats with optional OCR.
  • Xberg Enterprise — managed extraction API with SDKs, dashboards, and observability.
  • crawlberg — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
  • html-to-markdown — fast, lossless HTML→Markdown engine.
  • liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
  • tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
  • alef — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.

License

MIT License
