Skip to content

Thibaultjaigu/websearch-benchmark

Repository files navigation

websearch-benchmark (wsbench)

An open, extensible benchmark that compares web-search engines (Tavily, Exa, Firecrawl, …) against LLM-native web search (via the Requesty gateway, or any OpenAI-compatible gateway) on three axes that actually matter:

  • Quality - scored by an LLM-as-judge (multiple judges averaged)
  • Cost - real USD per query (uses the gateway's authoritative cost where available)
  • Latency - wall-clock per query (mean / p50 / max)

Everything is logged. Every raw request and response, every judge call, and a machine-readable summary are written to runs/<timestamp>/.

Results

260 queries · 7 providers · 2 judges (bedrock/claude-opus-4-8 + openai/gpt-5.5, averaged). Quality is a 1-5 LLM-as-judge score (mean ± 95% CI). Full report: sample_results/example_report.txt.

Web search quality leaderboard

A dedicated search API (Tavily) topped the board on quality and win-rate — and the integrity controls (green oracle / red null) cleanly bracket every real engine, so the scores span the full scale and mean something.

The recency gap

The headline finding: LLM-native search collapses on recency. Every native model scores well on stable facts but drops to ~2.7–3.4 on time-sensitive queries — because the models frequently answer from training data instead of searching (Gemini searched on 0% of queries; Claude 42%, GPT 28%).

Cost vs latency

…and "native" is not free: dedicated APIs sit cheap-and-fast in the bottom-left; native search is 20–40× slower (p50) and up to 6× the cost per query.

Charts regenerate from any run: python -m wsbench.make_charts <run>/summary.json docs/img. These are point-in-time results from one dataset and two judges. Model behavior, vendor pricing, and search quality all move — re-run before citing. That's the whole point of the harness.

Full numbers (the same data, as a table)
Provider searched lat p50 $/query quality (±95% CI) win-rate recency
Tavily (advanced) 0.29s $0.016 4.78 ± 0.06 33% 4.10
control oracle (pos) 1.00 0.00s $0.000 4.66 ± 0.12 37% 2.85
Native GPT 5.5 0.28 5.25s $0.029 4.45 ± 0.09 14% 3.36
Native Gemini 2.5 Pro 0.00 7.07s $0.007 4.31 ± 0.10 4% 2.76
Native Claude Sonnet 4.6 0.42 8.11s $0.041 4.27 ± 0.11 5% 2.71
Exa (auto) 0.23s $0.007 3.98 ± 0.12 7% 3.97
control null (neg) 0.00 0.00s $0.000 1.00 ± 0.00 0% 1.00

Why

"Native" web search (an LLM with a web_search tool) and a dedicated search API (Tavily/Exa/Firecrawl) are different products. This benchmark measures them head-to-head on identical queries so you can decide which to use - and surfaces a non-obvious truth: not every model actually searches when you enable the tool. The report includes a searched rate so you can see who really hit the web vs. who answered from training data.

Quick start

python3 -m venv .venv && source .venv/bin/activate
pip install -e .              # or: pip install -r requirements.txt

cp .env.example .env   # then fill in your keys (lowercase or UPPERCASE both work)

# Smoke test: 2 queries, all providers, both judges
python -m wsbench.run --limit 2

# Full run
python -m wsbench.run

# Ad-hoc: specific providers, skip judging
python -m wsbench.run --providers tavily,exa --no-judge

Keys (.env)

tavily_api_key=...
exa_api_key=...
firecrawl_api_key=...
requesty_api_key=...     # used for native web search AND the LLM judges

.env is gitignored. Never commit it.

Native search and the judges are routed through Requesty, an OpenAI-compatible gateway — that's just where these results were produced. It's swappable: point REQUESTY_BASE_URL at any compatible endpoint (or a provider's own API) and supply the matching key. The benchmark only needs an endpoint that speaks the OpenAI chat/responses format and returns usage/cost.

What gets saved

runs/<timestamp>/
  raw/<provider>__<query>.json     full request + response + parsed results
  judge/<provider>__<query>.json   each judge's raw I/O and scores
  records.jsonl                    one row per (provider, query) incl. scores
  summary.json                     aggregate metrics + per-category breakdown
  report.txt                       human-readable tables

Adding a new search engine (the whole contract)

Adding an engine is one file. Create wsbench/providers/<name>.py:

from .base import SearchProvider, register
from ..schema import ProviderResponse, SearchResultItem, Timer

@register("myengine")                       # registry key
class MyEngine(SearchProvider):
    requires_env = ("MYENGINE_API_KEY",)    # runner skips cleanly if missing

    def search(self, query: str) -> ProviderResponse:
        resp = ProviderResponse(provider=self.name, query=query)
        with Timer() as t:
            data = call_my_api(query)        # your HTTP call
        resp.latency_s = t.elapsed
        resp.answer = data.get("answer", "")           # if it synthesizes one
        for item in data["results"]:
            resp.results.append(SearchResultItem(
                url=item["url"], title=item["title"], snippet=item["text"]))
        resp.cost_usd = ...                  # price it (see wsbench/pricing.py)
        return resp

Then import it in wsbench/__init__.py (so the decorator runs) and list it in config.yaml. Done - it now participates in every run, gets judged, costed, and reported alongside everyone else.

Configuration (config.yaml)

  • providers - which engines to run, with per-engine config (depth, limits, model)
  • judges - LLM-as-judge models (averaged), routed through Requesty
  • dataset - path to a JSONL of queries
  • max_workers - concurrency for provider calls

The same provider class can appear multiple times with different config (e.g. native search across Claude / Gemini / GPT) - just give each a distinct label.

Dataset format (datasets/queries.jsonl)

{"id": "q01", "category": "recency", "query": "...", "reference": "ground-truth answer (optional)"}

reference is optional; when present, judges use it as ground truth for the correctness dimension. Categories let the report break quality down by query type (recency, factual, multi-hop, niche, comparison).

The judge

Each judge model scores a candidate answer 1-5 on relevance, correctness, completeness, groundedness, plus a holistic overall, returning strict JSON at temperature 0. Multiple judges are averaged so no single model's bias dominates. Default judges: bedrock/claude-opus-4-8 and openai/gpt-5.5.

Benchmark integrity (why you can trust the numbers)

A score is only meaningful if the judge actually discriminates quality. wsbench proves this every run with two control providers (wsbench/providers/controls.py):

  • control-null (negative control) returns irrelevant garbage. It must score near the floor (~1/5).
  • control-oracle (positive control) returns the dataset's ground-truth reference. It must score near the ceiling (~5/5).

If these two don't cleanly separate, the judge or rubric is broken and the real scores are meaningless. In practice they land at 1.0 and 5.0 respectively, with real engines in between - the rubric spans the full range.

Statistics

The report includes more than means:

  • mean ± 95% CI on quality (so you can see whether two providers actually differ or are within noise)
  • win-rate - head-to-head, the fraction of queries where each provider was the top scorer (ties split). Complements the mean: a provider can have a good average but rarely win, or vice versa.
  • per-category breakdown (recency / factual / multi-hop / niche / comparison / howto / numeric / entity) - recency is where native vs engine differences show up most.

Dataset generation

datasets/queries_large.jsonl (260 queries) is generated by python -m wsbench.gen_dataset --target 260, which prompts a strong model for diverse, reference-backed queries across all categories, dedupes by normalized text, and assigns stable content-hash IDs. Regenerate or extend it freely; the references are the judge's ground truth, so review them if you add your own.

Robustness

  • Provider calls retry transient failures (429 / timeout / 5xx) with backoff.
  • Judging runs concurrently (the bottleneck - opus is slow).
  • Every raw call and judge response is written to disk as it completes, so a crashed run still leaves all partial evidence behind.

Cost accounting

See wsbench/pricing.py - a single, auditable table. Native-search costs prefer the gateway's returned usage.cost (authoritative) and add an estimated server-tool surcharge for the search itself. Search-engine costs are estimated from each vendor's published credit pricing and flagged cost_is_estimate. Verify against current vendor pricing pages before quoting numbers.

License

MIT (intended as an open-source community benchmark - contributions of new engine adapters welcome). See CONTRIBUTING.md for how to add an adapter, extend the dataset, or submit a reproduction.

About

Open benchmark comparing web-search engines (Tavily, Exa, Firecrawl) vs LLM-native web search on quality (LLM-as-judge), cost, and latency. Reproducible, fully logged, one-file adapters.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages