Runnable examples for common crawl + extract + RAG tasks. All commands use
the public markcrawl CLI installed by pip install markcrawl. See
../README.md for installation and core concepts.
- Single-page scrapes
- Crawling a whole site
- Filtering URLs
- Extraction and dedup
- Binary downloads and assets
- Multi-site workflows
- End-to-end use cases
markcrawl --base https://example.com/pricing --no-sitemap --max-pages 1For React, Vue, YouTube, and similar JS-rendered pages:
markcrawl --base "https://www.youtube.com/@channel/videos" \
--no-sitemap --max-pages 1 --render-js
# → outputs one .md file with video titles, view counts, and datesFor infinite-scroll pages like YouTube, this captures the first ~28 videos from the initial render.
markcrawl --base https://docs.example.com --max-pages 500 --concurrency 5 --show-progressLarge sites (YouTube, GitHub, etc.) have sitemaps with thousands of unrelated pages. Use --no-sitemap to crawl only from your target URL:
markcrawl --base https://docs.example.com/guides \
--no-sitemap --max-pages 50 --show-progressmarkcrawl --base https://engineering.example.com/blog \
--no-sitemap --max-pages 1000 --concurrency 5 --out ./blog-archive --show-progress
# → every post saved as clean Markdown with citations and access datesmarkcrawl --base https://docs.example.com --out ./docs --resume --show-progressExclude job listings, login walls, SEO spam:
markcrawl --base https://example.com \
--exclude-path "/job/*" --exclude-path "/careers/*" --exclude-path "/login" \
--max-pages 500 --out ./output --show-progressBlog + pricing, ignore everything else:
markcrawl --base https://example.com \
--include-path "/blog/*" --include-path "/pricing" \
--max-pages 200 --out ./output --show-progressmarkcrawl --base https://example.com --dry-run
# → prints every URL that would be crawled (from sitemap), then exits
# Pipe to wc -l to get a count, or grep to check for junk patterns
markcrawl --base https://example.com --dry-run | wc -l
markcrawl --base https://example.com --dry-run | grep "/job/"Dry-run first, then exclude:
# Step 1: see what you'd get
markcrawl --base https://tealhq.com --dry-run | head -50
# Step 2: exclude the job listings, crawl just the content pages
markcrawl --base https://tealhq.com \
--exclude-path "/job/*" --exclude-path "/resume-examples/*" \
--max-pages 200 --out ./tealhq --show-progressFor e-commerce, job boards, real estate — sample N pages per templated cluster instead of crawling thousands:
# Preview the pattern clusters first
markcrawl --base https://bigsite.com --dry-run --smart-sample --show-progress
# Crawl with sampling — 5 pages per templated cluster instead of thousands
markcrawl --base https://bigsite.com --out ./bigsite \
--smart-sample --sample-size 5 --sample-threshold 20 --show-progress# Default (BS4 + markdownify) — fastest, good for most sites
markcrawl --base https://docs.example.com --out ./output --show-progress
# Ensemble — runs default + trafilatura, picks best per page
markcrawl --base https://docs.example.com --out ./output --extractor ensemble --show-progress
# ReaderLM-v2 — ML-based extraction (uses the bundled torch + transformers stack since v0.10.1)
markcrawl --base https://docs.example.com --out ./output --extractor readerlm --show-progressLink prioritization scores discovered links by predicted content value:
markcrawl --base https://docs.example.com --out ./docs \
--prioritize-links --max-pages 100 --show-progress
# Prioritizes content-rich pages (guides, docs) over low-value ones (legal, login)Cross-crawl dedup — only fetches new or changed pages:
# First crawl
markcrawl --base https://docs.example.com --out ./docs --show-progress
# Later — only fetches new/changed pages
markcrawl --base https://docs.example.com --out ./docs --cross-dedup --show-progressPhotography blogs, product pages:
# Crawl a photography blog and save images from the content area
markcrawl --base https://photography-blog.example.com --out ./photos \
--download-images --max-pages 50 --show-progress
# Output:
# ./photos/assets/mountain-abc123.jpg
# ./photos/assets/sunset-def456.png
# ./photos/post-1__a1b2c3.md ← Markdown with  refs
# ./photos/pages.jsonl ← index includes "images" array per page
# Adjust minimum image size to skip thumbnails (default: 5000 bytes)
markcrawl --base https://example.com/gallery --out ./gallery \
--download-images --min-image-size 20000 --show-progressNew in v0.11.0. Stream-download referenced files with size + content-type guards and a pre-fetch filter callback:
# Crawl an aggregator site and harvest only the resume templates
# (skips privacy policies, ToS, marketing PDFs by anchor + URL signal).
from markcrawl import crawl
from markcrawl.filters import is_likely_resume
result = crawl(
base_url="https://example.com/templates",
out_dir="./resumes",
download_types=["pdf", "docx"], # opt-in; default None = no downloads
download_filter=is_likely_resume, # pre-fetch — rejected URLs never fetched
download_max_files=200, # cap per crawl
download_max_size_mb=25, # per-file cap (streaming)
)
print(f"Saved {result.downloads_count} files, {result.downloads_bytes/1e6:.1f} MB")
# Output:
# ./resumes/downloads/cv-template-1__a1b2c3.pdf
# ./resumes/downloads/cover-letter-2__d4e5f6.docx
# ./resumes/pages.jsonl ← each page's row gets "downloads": [{url, path, ...}]Filters are pure functions of DownloadCandidate(url, anchor_text, parent_url, parent_title, extension) — compose your own with the bundled starters:
from markcrawl.filters import is_likely_resume, exclude_legal_boilerplate
# Combine a positive selector with the negative one:
def my_filter(c):
return is_likely_resume(c) and exclude_legal_boilerplate(c) and "spam" not in c.url
crawl(..., download_types=["pdf"], download_filter=my_filter)Bundled filters (markcrawl.filters): is_likely_resume, is_likely_paper, exclude_legal_boilerplate. Best-effort heuristics, not classifiers — test against your real corpus before relying on them.
Dashboards, data visualisations, JS-rendered charts:
# Full-page screenshot of every crawled page (auto-enables --render-js)
markcrawl --base https://steamcharts.com/top --out ./dash \
--screenshot --max-pages 5 --show-progress
# Output:
# ./dash/screenshots/top-abc123def456.png ← 1920-wide full-page PNG
# ./dash/pages.jsonl ← each row gets "screenshot": "screenshots/..."
# Crop to just the dashboard region, JPEG for smaller files, longer wait for slow charts
markcrawl --base https://example.com/dashboards --out ./dash \
--screenshot --screenshot-selector ".dashboard-main" \
--screenshot-format jpeg --screenshot-wait-ms 3000 --show-progressThe screenshot path loads with wait_until="load" and then pauses --screenshot-wait-ms (default 1500 ms) before capturing, so canvas/SVG charts have time to render. (networkidle is deliberately avoided — many real sites never idle due to analytics pings.) Failures are recorded in the JSONL row as screenshot_error rather than aborting the crawl.
Use a bundled curated seed pack, then crawl every site:
# Use a bundled curated seed pack, then crawl every site with screenshots
markcrawl discover --pack game-dashboards | \
markcrawl --seed-file - --out ./dashboards \
--screenshot --max-pages-per-site 5 --show-progress
# List available packs
markcrawl discover --list-packsOutput is organised per-site: ./dashboards/<netloc>/pages.jsonl plus screenshots/ under each. See the full recipe (including a YouTube frame-extraction path using yt-dlp + ffmpeg) at recipes/game-dashboards.md.
markcrawl --base https://competitor-one.com/pricing --no-sitemap --max-pages 1 --out ./comp1
markcrawl --base https://competitor-two.com/pricing --no-sitemap --max-pages 1 --out ./comp2
markcrawl --base https://competitor-three.com/pricing --no-sitemap --max-pages 1 --out ./comp3
markcrawl-extract \
--jsonl ./comp1/pages.jsonl ./comp2/pages.jsonl ./comp3/pages.jsonl \
--fields pricing_tiers features free_trial --show-progress
# → extracted.jsonl with structured pricing data across all threeFull pipeline: crawl, embed, query — $0 in API charges thanks to the local embedder default since v0.10.1:
pip install markcrawl markcrawl[upload] # base install bundles the local embedder
markcrawl --base https://docs.example.com --out ./docs --max-pages 500 --concurrency 5 --show-progress
markcrawl-upload --jsonl ./docs/pages.jsonl --show-progress
# → pages chunked + embedded locally with mxbai-embed-large-v1, uploaded to Supabase/pgvector
# Wire your chatbot to query the vector table — see docs/SUPABASE.mdTo use OpenAI embeddings instead (e.g. for parity with an existing index), set MARKCRAWL_EMBEDDER=text-embedding-3-small or pass embedding_model=... to upload(...) / markcrawl-upload.
markcrawl --base https://api.example.com/docs --out ./api-docs --max-pages 200 --show-progress
# Feed the output to an LLM:
# "Using the API documentation in ./api-docs/pages.jsonl, generate a
# typed Python client with methods for each endpoint."