18 changes: 9 additions & 9 deletions README.md
@@ -239,20 +239,20 @@ Each tool has strengths: FireCrawl excels as a hosted API, Crawl4AI has deep bro

### Benchmark results (6 tools, April 2026)

-**Speed:** markcrawl is fastest (14.0 pages/sec), scrapy+md second (9.3). Playwright-based tools average 1.4-2.1 pages/sec.
+**Speed:** markcrawl is fastest (12.1 pages/sec), scrapy+md second (9.5). Playwright-based tools average 1.4-2.1 pages/sec.

-**Output cleanliness:** markcrawl has the lowest nav pollution (15 words vs 133+ for others) — less junk in your embeddings.
+**Output cleanliness:** markcrawl has the lowest nav pollution (14 words vs 208+ for others) — less junk in your embeddings.

-**RAG answer quality:** markcrawl scores 4.30/5 on answer quality with the fewest chunks (22,132 total, 2.1x fewer than the most), keeping embedding costs low.
+**RAG answer quality:** markcrawl scores 4.52/5 on answer quality with the fewest chunks (27,051 total, 3.0x fewer than the most), keeping embedding costs low.

| Tool | Chunks/page | Answer Quality (/5) | Annual cost (100K pages, 1K queries/day) |
|---|---|---|---|
-| **markcrawl** | **15.2** | **4.30** | **$4,505** |
-| scrapy+md | 16.4 | 4.41 | $5,464 |
-| crawl4ai | 22.5 | 4.26 | $6,960 |
-| colly+md | 29.5 | 4.29 | $7,213 |
-| playwright | 31.9 | 4.38 | $7,320 |
-| crawlee | 32.7 | 4.33 | $7,467 |
+| **markcrawl** | **18.6** | **4.52** | **$4,505** |
+| scrapy+md | 29.0 | 4.03 | $5,464 |
+| crawl4ai | 33.2 | 4.43 | $6,960 |
+| colly+md | 55.3 | 4.53 | $7,213 |
+| playwright | 50.6 | 4.42 | $7,320 |
+| crawlee | 51.0 | 4.52 | $7,467 |

Full benchmark data: [docs/BENCHMARKS.md](docs/BENCHMARKS.md) | Methodology: [llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks)
</details>
44 changes: 22 additions & 22 deletions docs/BENCHMARKS.md
@@ -1,9 +1,9 @@
<!-- AUTO-GENERATED by sync_markcrawl.py — do not edit manually -->
# MarkCrawl Benchmarks

-> **Summary:** Across 6 open-source crawlers tested on 8 sites, MarkCrawl is the fastest (14.0 pages/sec), produces the cleanest output (15 words of nav pollution vs 133+ for others), generates the fewest chunks (22,132 total, 2.1x fewer than crawlee), and has the lowest total RAG pipeline cost at every scale tested.
+> **Summary:** Across 6 open-source crawlers tested on 8 sites, MarkCrawl is the fastest (12.1 pages/sec), produces the cleanest output (14 words of nav pollution vs 208+ for others), generates the fewest chunks (27,051 total, 3.0x fewer than colly+md), and has the lowest total RAG pipeline cost at every scale tested.
 >
-> **Where MarkCrawl is not first:** Answer quality is 5th (4.30/5, scrapy+md leads at 4.41). Retrieval Hit@5 is 6th (86% vs 91% for scrapy+md). Content recall is 7th (64% vs 97% for playwright).
+> **Where MarkCrawl is not first:** Answer quality is 2nd (4.52/5, colly+md leads at 4.53). Retrieval Hit@5 is 3rd (83% vs 87% for crawlee). Content recall is 6th (30% vs 84% for crawlee).

*Last run: April 2026. Reproducible via [llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks).*

@@ -13,40 +13,40 @@

| Tool | Pages/sec |
|---|---|
-| **markcrawl** | **14.0** |
-| scrapy+md | 9.3 |
-| colly+md | 6.6 |
-| playwright | 2.1 |
-| crawl4ai | 2.0 |
-| crawlee | 1.8 |
+| **markcrawl** | **12.1** |
+| scrapy+md | 9.5 |
+| colly+md | 4.2 |
+| playwright | 2.2 |
+| crawlee | 1.7 |
+| crawl4ai | 1.5 |

MarkCrawl uses native async I/O (httpx) with concurrent fetching and process-pool HTML extraction. Playwright-based tools (crawl4ai, crawlee) are inherently slower due to full browser rendering per page.

## Output cleanliness

| Tool | Nav pollution (words) | Recall |
|---|---|---|
-| **markcrawl** | **15** | **64%** |
-| scrapy+md | 133 | 68% |
-| crawl4ai | 311 | 66% |
-| colly+md | 1953 | 96% |
-| playwright | 2037 | 97% |
-| crawlee | 2207 | 97% |
+| **markcrawl** | **14** | **30%** |
+| scrapy+md | 208 | 21% |
+| crawl4ai | 418 | 71% |
+| playwright | 2710 | 82% |
+| colly+md | 2733 | 54% |
+| crawlee | 2839 | 84% |

Nav pollution = boilerplate words (navigation, footer, cookie banners) that leak into extracted content. Lower is better — less junk means cleaner embeddings and fewer wasted tokens.

-The tradeoff: playwright captures 97% of page content but includes ~2,037 words of boilerplate per page. MarkCrawl captures 64% with 15 words of pollution. For RAG pipelines, the cleaner output produces better embeddings despite the lower recall.
+The tradeoff: crawlee captures 84% of page content but includes ~2,839 words of boilerplate per page. MarkCrawl captures 30% with 14 words of pollution. For RAG pipelines, the cleaner output produces better embeddings despite the lower recall.

## RAG answer quality

| Tool | Chunks | Answer Quality (/5) | Hit@5 | Hit@20 |
|---|---|---|---|---|
-| **markcrawl** | **22,132** | **4.30** | **86%** | **91%** |
-| scrapy+md | 23,854 | 4.41 | 91% | 94% |
-| crawl4ai | 32,735 | 4.26 | 89% | 93% |
-| colly+md | 42,934 | 4.29 | 86% | 92% |
-| playwright | 46,439 | 4.38 | 90% | 94% |
-| crawlee | 47,560 | 4.33 | 88% | 93% |
+| **markcrawl** | **27,051** | **4.52** | **83%** | **88%** |
+| scrapy+md | 42,234 | 4.03 | 56% | 65% |
+| crawl4ai | 48,332 | 4.43 | 82% | 92% |
+| playwright | 73,656 | 4.42 | 86% | 93% |
+| crawlee | 74,281 | 4.52 | 87% | 93% |
+| colly+md | 80,550 | 4.53 | 77% | 88% |

*FireCrawl's self-hosted version did not complete crawls on all sites across multiple attempts. Its scores are on a reduced set and are not directly comparable to tools that completed all sites.

@@ -55,7 +55,7 @@ The tradeoff: playwright captures 97% of page content but includes ~2,037 words
- **Answer Quality** — LLM-judged score for answers generated from retrieved chunks.
- **Hit@5 / Hit@20** — what percentage of queries find a relevant chunk in the top 5 or 20 results.
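
As an illustration of the Hit@k definition above, here is a minimal sketch, not the benchmark's actual implementation. It assumes each query's retrieved chunks have already been judged relevant or not, in ranked order. The function name `hit_at_k` and the toy data are hypothetical.

```python
from typing import Sequence


def hit_at_k(relevance_per_query: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k results.

    relevance_per_query[i][j] is True if the j-th ranked chunk for query i
    was judged relevant (hypothetical input format, assumed for this sketch).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevance_per_query:
        return 0.0
    hits = sum(1 for ranked in relevance_per_query if any(ranked[:k]))
    return hits / len(relevance_per_query)


# Toy data: three queries with ranked relevance judgments.
example = [
    [False, True, False],                        # relevant chunk at rank 2
    [False, False, False, False, False, True],   # relevant chunk at rank 6
    [False] * 10,                                # no relevant chunk retrieved
]

hit5 = hit_at_k(example, 5)    # only the first query hits within the top 5
hit20 = hit_at_k(example, 20)  # the first two queries hit within the top 20
```

Hit@20 can never be lower than Hit@5 for the same run, which matches every row of the table above.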

-**Fewer chunks = lower cost.** Each chunk requires an embedding call and vector storage. MarkCrawl produces 2.1x fewer chunks than crawlee for the same content, cutting embedding and storage costs significantly.
+**Fewer chunks = lower cost.** Each chunk requires an embedding call and vector storage. MarkCrawl produces 3.0x fewer chunks than colly+md for the same content, cutting embedding and storage costs significantly.
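
The "Nx fewer chunks" figures can be checked against the chunk totals in the RAG answer quality table. The sketch below assumes the ratio is simply the largest chunk count divided by MarkCrawl's. The helper name `fewer_chunks_ratio` is hypothetical and not part of the benchmark code.

```python
def fewer_chunks_ratio(baseline_chunks: int, tool_chunks: int) -> float:
    """How many times more chunks the baseline produces than the tool."""
    if tool_chunks <= 0:
        raise ValueError("tool_chunks must be positive")
    return baseline_chunks / tool_chunks


# Chunk totals copied from the tables in this diff.
old_ratio = fewer_chunks_ratio(47_560, 22_132)  # crawlee vs markcrawl (previous run)
new_ratio = fewer_chunks_ratio(80_550, 27_051)  # colly+md vs markcrawl (this run)

print(f"previous run: {old_ratio:.1f}x, this run: {new_ratio:.1f}x")
```

Rounded to one decimal place, these reproduce the 2.1x and 3.0x figures quoted in the text.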

## Total cost of ownership
