From f4fc60385de68b0c3e096a820a9ceb94ea79f86b Mon Sep 17 00:00:00 2001
From: benchmark-bot
Date: Thu, 16 Apr 2026 05:51:20 +0000
Subject: [PATCH] Update benchmark numbers from llm-crawler-benchmarks

Source: AIMLPM/llm-crawler-benchmarks@5b5cebc
---
 README.md          | 18 +++++++++---------
 docs/BENCHMARKS.md | 44 ++++++++++++++++++++++----------------------
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/README.md b/README.md
index b218f03..ed86b43 100644
--- a/README.md
+++ b/README.md
@@ -239,20 +239,20 @@ Each tool has strengths: FireCrawl excels as a hosted API, Crawl4AI has deep bro
 
 ### Benchmark results (6 tools, April 2026)
 
-**Speed:** markcrawl is fastest (14.0 pages/sec), scrapy+md second (9.3). Playwright-based tools average 1.4-2.1 pages/sec.
+**Speed:** markcrawl is fastest (12.1 pages/sec), scrapy+md second (9.5). Playwright-based tools average 1.4-2.1 pages/sec.
 
-**Output cleanliness:** markcrawl has the lowest nav pollution (15 words vs 133+ for others) — less junk in your embeddings.
+**Output cleanliness:** markcrawl has the lowest nav pollution (14 words vs 208+ for others) — less junk in your embeddings.
 
-**RAG answer quality:** markcrawl scores 4.30/5 on answer quality with the fewest chunks (22,132 total, 2.1x fewer than the most), keeping embedding costs low.
+**RAG answer quality:** markcrawl scores 4.52/5 on answer quality with the fewest chunks (27,051 total, 3.0x fewer than the most), keeping embedding costs low.
 
 | Tool | Chunks/page | Answer Quality (/5) | Annual cost (100K pages, 1K queries/day) |
 |---|---|---|---|
-| **markcrawl** | **15.2** | **4.30** | **$4,505** |
-| scrapy+md | 16.4 | 4.41 | $5,464 |
-| crawl4ai | 22.5 | 4.26 | $6,960 |
-| colly+md | 29.5 | 4.29 | $7,213 |
-| playwright | 31.9 | 4.38 | $7,320 |
-| crawlee | 32.7 | 4.33 | $7,467 |
+| **markcrawl** | **18.6** | **4.52** | **$4,505** |
+| scrapy+md | 29.0 | 4.03 | $5,464 |
+| crawl4ai | 33.2 | 4.43 | $6,960 |
+| colly+md | 55.3 | 4.53 | $7,213 |
+| playwright | 50.6 | 4.42 | $7,320 |
+| crawlee | 51.0 | 4.52 | $7,467 |
 
 Full benchmark data: [docs/BENCHMARKS.md](docs/BENCHMARKS.md) | Methodology: [llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks)
 
diff --git a/docs/BENCHMARKS.md b/docs/BENCHMARKS.md
index c247b2a..690b242 100644
--- a/docs/BENCHMARKS.md
+++ b/docs/BENCHMARKS.md
@@ -1,9 +1,9 @@
 # MarkCrawl Benchmarks
 
-> **Summary:** Across 6 open-source crawlers tested on 8 sites, MarkCrawl is the fastest (14.0 pages/sec), produces the cleanest output (15 words of nav pollution vs 133+ for others), generates the fewest chunks (22,132 total, 2.1x fewer than crawlee), the lowest total RAG pipeline cost at every scale tested.
+> **Summary:** Across 6 open-source crawlers tested on 8 sites, MarkCrawl is the fastest (12.1 pages/sec), produces the cleanest output (14 words of nav pollution vs 208+ for others), generates the fewest chunks (27,051 total, 3.0x fewer than colly+md), the lowest total RAG pipeline cost at every scale tested.
 >
-> **Where MarkCrawl is not first:** Answer quality is 5th (4.30/5, scrapy+md leads at 4.41). Retrieval Hit@5 is 6th (86% vs 91% for scrapy+md). Content recall is 7th (64% vs 97% for playwright).
+> **Where MarkCrawl is not first:** Answer quality is 2nd (4.52/5, colly+md leads at 4.53). Retrieval Hit@5 is 3rd (83% vs 87% for crawlee). Content recall is 6th (30% vs 84% for crawlee).
 
 *Last run: April 2026. Reproducible via [llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks).*
 
@@ -13,12 +13,12 @@
 
 | Tool | Pages/sec |
 |---|---|
-| **markcrawl** | **14.0** |
-| scrapy+md | 9.3 |
-| colly+md | 6.6 |
-| playwright | 2.1 |
-| crawl4ai | 2.0 |
-| crawlee | 1.8 |
+| **markcrawl** | **12.1** |
+| scrapy+md | 9.5 |
+| colly+md | 4.2 |
+| playwright | 2.2 |
+| crawlee | 1.7 |
+| crawl4ai | 1.5 |
 
 MarkCrawl uses native async I/O (httpx) with concurrent fetching and process-pool HTML extraction. Playwright-based tools (crawl4ai, crawlee) are inherently slower due to full browser rendering per page.
 
@@ -26,27 +26,27 @@ MarkCrawl uses native async I/O (httpx) with concurrent fetching and process-poo
 
 | Tool | Nav pollution (words) | Recall |
 |---|---|---|
-| **markcrawl** | **15** | **64%** |
-| scrapy+md | 133 | 68% |
-| crawl4ai | 311 | 66% |
-| colly+md | 1953 | 96% |
-| playwright | 2037 | 97% |
-| crawlee | 2207 | 97% |
+| **markcrawl** | **14** | **30%** |
+| scrapy+md | 208 | 21% |
+| crawl4ai | 418 | 71% |
+| playwright | 2710 | 82% |
+| colly+md | 2733 | 54% |
+| crawlee | 2839 | 84% |
 
 Nav pollution = boilerplate words (navigation, footer, cookie banners) that leak into extracted content. Lower is better — less junk means cleaner embeddings and fewer wasted tokens.
 
-The tradeoff: playwright captures 97% of page content but includes ~2,037 words of boilerplate per page. MarkCrawl captures 64% with 15 words of pollution. For RAG pipelines, the cleaner output produces better embeddings despite the lower recall.
+The tradeoff: crawlee captures 84% of page content but includes ~2,839 words of boilerplate per page. MarkCrawl captures 30% with 14 words of pollution. For RAG pipelines, the cleaner output produces better embeddings despite the lower recall.
 
 ## RAG answer quality
 
 | Tool | Chunks | Answer Quality (/5) | Hit@5 | Hit@20 |
 |---|---|---|---|---|
-| **markcrawl** | **22,132** | **4.30** | **86%** | **91%** |
-| scrapy+md | 23,854 | 4.41 | 91% | 94% |
-| crawl4ai | 32,735 | 4.26 | 89% | 93% |
-| colly+md | 42,934 | 4.29 | 86% | 92% |
-| playwright | 46,439 | 4.38 | 90% | 94% |
-| crawlee | 47,560 | 4.33 | 88% | 93% |
+| **markcrawl** | **27,051** | **4.52** | **83%** | **88%** |
+| scrapy+md | 42,234 | 4.03 | 56% | 65% |
+| crawl4ai | 48,332 | 4.43 | 82% | 92% |
+| playwright | 73,656 | 4.42 | 86% | 93% |
+| crawlee | 74,281 | 4.52 | 87% | 93% |
+| colly+md | 80,550 | 4.53 | 77% | 88% |
 
 *FireCrawl's self-hosted version did not complete crawls on all sites across multiple attempts. Its scores are on a reduced set and are not directly comparable to tools that completed all sites.
 
@@ -55,7 +55,7 @@ The tradeoff: playwright captures 97% of page content but includes ~2,037 words
 - **Answer Quality** — LLM-judged score for answers generated from retrieved chunks.
 - **Hit@5 / Hit@20** — what percentage of queries find a relevant chunk in the top 5 or 20 results.
 
-**Fewer chunks = lower cost.** Each chunk requires an embedding call and vector storage. MarkCrawl produces 2.1x fewer chunks than crawlee for the same content, cutting embedding and storage costs significantly.
+**Fewer chunks = lower cost.** Each chunk requires an embedding call and vector storage. MarkCrawl produces 3.0x fewer chunks than colly+md for the same content, cutting embedding and storage costs significantly.
 
 ## Total cost of ownership