MarkCrawl is designed as a core engine + optional layers + agentic integrations. You can use just the crawler, or add extraction, storage, and agent interfaces on top.
┌─────────────────────────────────────────────────────────┐
│  markcrawl CLI                                          │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  CORE (no API keys, no optional deps)                   │
│  ┌───────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │ Discover  │→ │ Fetch &      │→ │ Transform to     │  │
│  │ URLs      │  │ Clean HTML   │  │ Markdown / Text  │  │
│  └───────────┘  └──────────────┘  └──────────────────┘  │
│        │               │                   │            │
│  sitemap.xml    strip nav/footer  .md files +           │
│  or link-follow strip scripts     pages.jsonl +         │
│                 extract <main>    auto-citation         │
│                                                         │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  OPTIONAL LAYERS (install separately)                   │
│                                                         │
│  ┌──────────────┐  pip install markcrawl[extract]       │
│  │ LLM Extract  │  OpenAI / Claude / Gemini / Grok      │
│  │ (extract.py) │  → extracted.jsonl + LLM attribution  │
│  └──────────────┘                                       │
│                                                         │
│  ┌──────────────┐  pip install markcrawl[upload]        │
│  │ RAG Upload   │  Chunk → Embed → Supabase/pgvector    │
│  │ (upload.py)  │  → vector search                      │
│  └──────────────┘                                       │
│                                                         │
│  ┌──────────────┐  pip install markcrawl[js]            │
│  │ JS Rendering │  Playwright / headless Chromium       │
│  │ (core.py)    │  → render SPAs before extraction      │
│  └──────────────┘                                       │
│                                                         │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  AGENTIC INTEGRATIONS                                   │
│                                                         │
│  ┌──────────────┐  pip install markcrawl[mcp]           │
│  │ MCP Server   │  Expose tools to AI agents            │
│  │ (mcp_server) │  Claude Desktop, Cursor, Windsurf     │
│  └──────────────┘                                       │
│                                                         │
│  ┌──────────────┐  pip install markcrawl[langchain]     │
│  │ LangChain    │  StructuredTool wrappers              │
│  │ (langchain)  │  Custom RAG agents and chains         │
│  └──────────────┘                                       │
│                                                         │
│  ┌──────────────┐  npx clawhub install markcrawl-skill  │
│  │ OpenClaw     │  Autonomous agent skill               │
│  │ (clawhub)    │  WhatsApp / Telegram / Slack          │
│  └──────────────┘                                       │
│                                                         │
└─────────────────────────────────────────────────────────┘
The crawl() function in core.py runs a 3-stage pipeline:
robots.txt → parse sitemap URLs → filter by scope → queue
↓ (if no sitemap)
use base URL as seed → follow <a> links
- Checks `robots.txt` first (respects disallow rules)
- Tries to find sitemap URLs from `robots.txt` (`Sitemap:` directive)
- Parses sitemap XML (handles nested sitemap indexes)
- Filters URLs by scope (`same_scope()`: same domain, optional subdomains)
- Falls back to link-following from the base URL if no sitemap is found
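
The discovery steps above can be sketched with the standard library. This is an illustration under stated assumptions, not MarkCrawl's implementation: `in_scope` is a hypothetical stand-in for `same_scope()`, the robots.txt body and candidate URLs are invented, and `RobotFileParser.site_maps()` needs Python 3.8 or newer.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Invented robots.txt body, parsed offline so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""


def in_scope(url, base_url, include_subdomains=False):
    """Hypothetical stand-in for same_scope(): same host, optionally subdomains."""
    host = urlparse(url).hostname or ""
    base_host = urlparse(base_url).hostname or ""
    if host == base_host:
        return True
    return include_subdomains and host.endswith("." + base_host)


rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Sitemap URLs declared through the "Sitemap:" directive (None if absent).
sitemaps = rp.site_maps() or []

# Invented candidates; in a real crawl these would come from the sitemap XML.
candidates = [
    "https://example.com/docs",
    "https://example.com/private/admin",
    "https://blog.example.com/post",
    "https://other.org/",
]

# Keep only URLs that are in scope and allowed by robots.txt.
queue = [
    url for url in candidates
    if in_scope(url, "https://example.com") and rp.can_fetch("*", url)
]
```

If `sitemaps` comes back empty, the fallback described above would seed the queue with the base URL and follow links from there.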
For each URL in the queue:
- Fetch: via `requests` (default) or Playwright (`--render-js`)
- Validate: check HTTP status, confirm `Content-Type: text/html`
- Clean DOM: remove `<script>`, `<style>`, `<nav>`, `<header>`, `<footer>`, `<aside>`, cookie banners, and `sr-only` elements
- Find main content: look for `<main>`, `role="main"`, or fall back to `<body>`
- Deduplicate: hash content, skip pages with identical extracted text
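
The clean and deduplicate steps can be illustrated without any dependencies. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup, purely to stay self-contained, and it omits the `<main>` lookup and cookie-banner handling. `VisibleText`, `extract_text`, and `is_new_content` are hypothetical names, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from html.parser import HTMLParser

# Tags whose contents are dropped, mirroring the list above.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class VisibleText(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "\n".join(parser.parts)


def is_new_content(text, seen_hashes):
    """Return True the first time a given text is seen, False for repeats."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True


# Two pages that differ only in boilerplate (nav vs. footer).
page_a = "<body><nav>Home</nav><main><h1>Pricing</h1><p>From $29.</p></main></body>"
page_b = "<body><main><h1>Pricing</h1><p>From $29.</p></main><footer>(c) Acme</footer></body>"

seen = set()
text_a = extract_text(page_a)
text_b = extract_text(page_b)
kept = [is_new_content(text_a, seen), is_new_content(text_b, seen)]
```

Because hashing happens after boilerplate is removed, the second page is recognised as a duplicate even though its raw HTML differs.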

Each cleaned page is then transformed and written out:

- Markdown mode: uses `markdownify` to convert cleaned HTML to Markdown with ATX headings
- Text mode: uses BeautifulSoup `get_text()` with line deduplication
- Output: writes one `.md` or `.txt` file per page and appends a row to `pages.jsonl`

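
A minimal sketch of the output step, assuming a filename built from a URL-path slug plus a 10-character hash of the URL. The sample path `page__a1b2c3d4e5.md` suggests something along these lines, but the exact scheme is an assumption, and `page_filename` and `save_page` are hypothetical helpers rather than MarkCrawl functions.

```python
import hashlib
import json
import re
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def page_filename(url, ext="md"):
    """Assumed naming scheme: URL-path slug, two underscores, short URL hash."""
    path = urlparse(url).path.strip("/") or "index"
    slug = re.sub(r"[^A-Za-z0-9]+", "-", path).strip("-").lower() or "index"
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:10]
    return f"{slug}__{digest}.{ext}"


def save_page(out_dir, url, title, text, ext="md"):
    """Write the page file and append its record to pages.jsonl."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    name = page_filename(url, ext)
    (out_dir / name).write_text(text, encoding="utf-8")
    record = {"url": url, "title": title, "path": name, "text": text}
    # One JSON object per line, so the index can be streamed later.
    with open(out_dir / "pages.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


with tempfile.TemporaryDirectory() as tmp:
    save_page(tmp, "https://example.com/page", "Page Title", "# Page Title\n\nBody")
    save_page(tmp, "https://example.com/docs/intro", "Intro", "# Intro")
    with open(Path(tmp) / "pages.jsonl", encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    files = sorted(p.name for p in Path(tmp).glob("*.md"))
```

Appending to `pages.jsonl` rather than rewriting it keeps the index valid if a crawl is interrupted partway through.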
{
"url": "https://example.com/page",
"title": "Page Title",
"path": "page__a1b2c3d4e5.md",
"text": "Extracted content as markdown or plain text..."
}

{
"url": "https://example.com/page",
"title": "Page Title",
"field_name": "extracted value",
"another_field": "another value",
"source_file": "./comp1/pages.jsonl"
}

{
"seen_urls": ["https://example.com/", "https://example.com/about"],
"seen_content": ["a1b2c3...", "d4e5f6..."],
"to_visit": ["https://example.com/pricing", "https://example.com/docs"],
"saved_count": 42,
"seeds": []
}

| Module | Role | Dependencies |
|---|---|---|
| `core.py` | Core crawl engine: URL discovery, fetch, HTML cleaning, transform | requests, beautifulsoup4, markdownify |
| `cli.py` | CLI entry point for the `markcrawl` command | core.py |
| `chunker.py` | Text chunking with word-based overlap | None (stdlib only) |
| `extract.py` | LLM-powered field extraction (multi-provider) | openai / anthropic / google-genai |
| `extract_cli.py` | CLI entry point for `markcrawl-extract` | extract.py |
| `upload.py` | Chunk + embed + Supabase upload | openai, supabase |
| `upload_cli.py` | CLI entry point for `markcrawl-upload` | upload.py |
| `mcp_server.py` | MCP server exposing tools for AI agents | mcp, core.py, extract.py |
| `langchain.py` | LangChain StructuredTool wrappers | langchain-core, core.py, extract.py |

You don't have to use the CLI. Import and call directly:
from markcrawl.core import crawl

result = crawl(
    base_url="https://example.com",
    out_dir="./output",
    fmt="markdown",
    max_pages=50,
    show_progress=True,
)
print(f"Saved {result.pages_saved} pages")

import json

with open("./output/pages.jsonl") as f:
    for line in f:
        page = json.loads(line)
        # page["url"], page["title"], page["text"]
        # Feed to your own embedding pipeline, database, etc.

from markcrawl.chunker import chunk_text

chunks = chunk_text(
    "Your long text here...",
    max_words=400,
    overlap_words=50,
)
for chunk in chunks:
    print(f"Chunk {chunk.index}/{chunk.total}: {chunk.text[:80]}...")

from markcrawl.extract import LLMClient, extract_fields

client = LLMClient(provider="anthropic")
result = extract_fields(
    text="Page content here...",
    fields=["company_name", "pricing"],
    client=client,
)
print(result)  # {"company_name": "Acme", "pricing": "$29/mo"}

The core crawler writes Markdown or plain text. If you need a different format (HTML, JSON, custom), process the `pages.jsonl` output: each row contains the full extracted text, which you can transform however you need.
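
As one illustration of post-processing, the sketch below turns `pages.jsonl` rows into a CSV summary. The sample rows are invented, `jsonl_to_csv` is a hypothetical helper, and the column choice (URL, title, word count) is arbitrary.

```python
import csv
import io
import json

# Two invented rows in the pages.jsonl shape.
SAMPLE_JSONL = "\n".join([
    json.dumps({"url": "https://example.com/", "title": "Home",
                "path": "index.md", "text": "Welcome to Acme"}),
    json.dumps({"url": "https://example.com/pricing", "title": "Pricing",
                "path": "pricing.md", "text": "Plans start at $29 per month"}),
])


def jsonl_to_csv(jsonl_lines, out):
    """Write one CSV row per page: url, title, and a word count of the text."""
    writer = csv.writer(out)
    writer.writerow(["url", "title", "words"])
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        page = json.loads(line)
        writer.writerow([page["url"], page["title"], len(page["text"].split())])


buffer = io.StringIO()
jsonl_to_csv(io.StringIO(SAMPLE_JSONL), buffer)
csv_text = buffer.getvalue()
```

To run it against a real crawl, pass an open `./output/pages.jsonl` file handle in place of the in-memory sample.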
The pages.jsonl output is a standard newline-delimited JSON file. You can write a simple script to load it into any database, vector store, or search engine:
import json

with open("./output/pages.jsonl") as f:
    for line in f:
        page = json.loads(line)
        # Insert into Pinecone, Weaviate, Elasticsearch, PostgreSQL, etc.
        # `your_db` is a placeholder for your own database client.
        your_db.insert(page)

The built-in Supabase upload (`upload.py`) is one example of this pattern. Use it as a reference for writing your own storage adapter.
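
For a concrete version of the same pattern, here is a small adapter that loads rows into SQLite using only the standard library. It is a sketch, not part of MarkCrawl: `load_pages`, the `pages` table schema, and the sample rows are all assumptions made for illustration.

```python
import json
import sqlite3


def load_pages(conn, jsonl_lines):
    """Insert pages.jsonl rows into a `pages` table; returns the rows read."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(url TEXT PRIMARY KEY, title TEXT, text TEXT)"
    )
    rows = [json.loads(line) for line in jsonl_lines if line.strip()]
    # INSERT OR REPLACE keeps one row per URL, so re-running is safe.
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title, text) "
        "VALUES (:url, :title, :text)",
        rows,
    )
    conn.commit()
    return len(rows)


# Invented sample rows; a real run would iterate over ./output/pages.jsonl.
sample = [
    json.dumps({"url": "https://example.com/", "title": "Home",
                "path": "index.md", "text": "Welcome"}),
    json.dumps({"url": "https://example.com/docs", "title": "Docs",
                "path": "docs.md", "text": "Read me"}),
]

conn = sqlite3.connect(":memory:")
count = load_pages(conn, sample)
titles = [row[0] for row in conn.execute("SELECT title FROM pages ORDER BY url")]
```

Keying the table on `url` means a repeated crawl updates existing rows instead of duplicating them.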