An open, extensible benchmark that compares web-search engines (Tavily, Exa, Firecrawl, …) against LLM-native web search (via the Requesty gateway, or any OpenAI-compatible gateway) on three axes that actually matter:
- Quality - scored by an LLM-as-judge (multiple judges averaged)
- Cost - real USD per query (uses the gateway's authoritative cost where available)
- Latency - wall-clock per query (mean / p50 / max)
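As a rough illustration of the latency axis, the sketch below reduces a list of per-query wall-clock timings to mean, p50, and max. The name `latency_summary` is hypothetical and not part of wsbench; the harness's own aggregation may differ.

```python
import statistics


def latency_summary(latencies_s: list[float]) -> dict[str, float]:
    """Summarize per-query wall-clock latencies, given in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    return {
        "mean": statistics.fmean(latencies_s),
        "p50": statistics.median(latencies_s),  # median = 50th percentile
        "max": max(latencies_s),
    }


# One slow outlier pulls the mean up while p50 stays close to the typical query.
summary = latency_summary([0.21, 0.29, 0.35, 1.40])
print(summary)
```

Reporting p50 next to the mean is useful here because a single slow query inflates the mean but barely moves the median.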
Everything is logged. Every raw request and response, every judge call, and a
machine-readable summary are written to runs/<timestamp>/.
260 queries · 7 providers · 2 judges (bedrock/claude-opus-4-8 + openai/gpt-5.5, averaged).
Quality is a 1-5 LLM-as-judge score (mean ± 95% CI). Full report:
sample_results/example_report.txt.
A dedicated search API (Tavily) topped the real engines on quality and win-rate. The integrity controls (green oracle / red null) land at 4.66 and 1.00, near the ceiling and the floor, so the scores span almost the full scale and mean something; only Tavily edges past the oracle.
The headline finding: LLM-native search collapses on recency. Every native model scores well on stable facts but drops to ~2.7–3.4 on time-sensitive queries — because the models frequently answer from training data instead of searching (Gemini searched on 0% of queries; Claude 42%, GPT 28%).
…and "native" is not free: dedicated APIs sit cheap-and-fast in the bottom-left; native search is 20–40× slower (p50) and up to 6× the cost per query.
Charts regenerate from any run:

```bash
python -m wsbench.make_charts <run>/summary.json docs/img
```

These are point-in-time results from one dataset and two judges. Model behavior, vendor pricing, and search quality all move, so re-run before citing. That's the whole point of the harness.
Full numbers (the same data, as a table)
| Provider | searched | lat p50 | $/query | quality (±95% CI) | win-rate | recency |
|---|---|---|---|---|---|---|
| Tavily (advanced) | — | 0.29s | $0.016 | 4.78 ± 0.06 | 33% | 4.10 |
| control oracle (pos) | 1.00 | 0.00s | $0.000 | 4.66 ± 0.12 | 37% | 2.85 |
| Native GPT 5.5 | 0.28 | 5.25s | $0.029 | 4.45 ± 0.09 | 14% | 3.36 |
| Native Gemini 2.5 Pro | 0.00 | 7.07s | $0.007 | 4.31 ± 0.10 | 4% | 2.76 |
| Native Claude Sonnet 4.6 | 0.42 | 8.11s | $0.041 | 4.27 ± 0.11 | 5% | 2.71 |
| Exa (auto) | — | 0.23s | $0.007 | 3.98 ± 0.12 | 7% | 3.97 |
| control null (neg) | 0.00 | 0.00s | $0.000 | 1.00 ± 0.00 | 0% | 1.00 |
"Native" web search (an LLM with a web_search tool) and a dedicated search API
(Tavily/Exa/Firecrawl) are different products. This benchmark measures them
head-to-head on identical queries so you can decide which to use - and surfaces
a non-obvious truth: not every model actually searches when you enable the
tool. The report includes a searched rate so you can see who really hit the
web vs. who answered from training data.
```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -e .          # or: pip install -r requirements.txt
cp .env.example .env      # then fill in your keys (lowercase or UPPERCASE both work)

# Smoke test: 2 queries, all providers, both judges
python -m wsbench.run --limit 2

# Full run
python -m wsbench.run

# Ad-hoc: specific providers, skip judging
python -m wsbench.run --providers tavily,exa --no-judge
```

```
tavily_api_key=...
exa_api_key=...
firecrawl_api_key=...
requesty_api_key=...      # used for native web search AND the LLM judges
```

.env is gitignored. Never commit it.
Native search and the judges are routed through Requesty,
an OpenAI-compatible gateway — that's just where these results were produced.
It's swappable: point REQUESTY_BASE_URL at any compatible endpoint (or a
provider's own API) and supply the matching key. The benchmark only needs an
endpoint that speaks the OpenAI chat/responses format and returns usage/cost.
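A minimal sketch of how the gateway settings could be resolved from the environment, accepting either lowercase or UPPERCASE variable names. The helpers `env_lookup` and `resolve_gateway` are hypothetical, not wsbench functions, and the sketch assumes both the base URL and the key must be set explicitly; wsbench itself may ship a default base URL.

```python
import os
from typing import Mapping, Optional


def env_lookup(name: str, env: Mapping[str, str]) -> Optional[str]:
    """Return the value for `name`, trying the UPPERCASE then lowercase spelling."""
    for candidate in (name.upper(), name.lower()):
        value = env.get(candidate)
        if value:
            return value
    return None


def resolve_gateway(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Return (base_url, api_key) for an OpenAI-compatible gateway."""
    base_url = env_lookup("REQUESTY_BASE_URL", env)
    api_key = env_lookup("REQUESTY_API_KEY", env)
    if not base_url or not api_key:
        raise RuntimeError("set REQUESTY_BASE_URL and REQUESTY_API_KEY")
    # Strip a trailing slash so later path joins do not produce "//".
    return base_url.rstrip("/"), api_key


# Placeholder endpoint and key, for illustration only.
demo_env = {"requesty_base_url": "https://gateway.example/v1/", "REQUESTY_API_KEY": "k"}
print(resolve_gateway(demo_env))
```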
```
runs/<timestamp>/
  raw/<provider>__<query>.json     full request + response + parsed results
  judge/<provider>__<query>.json   each judge's raw I/O and scores
  records.jsonl                    one row per (provider, query) incl. scores
  summary.json                     aggregate metrics + per-category breakdown
  report.txt                       human-readable tables
```

Adding an engine is one file. Create wsbench/providers/<name>.py:

```python
from .base import SearchProvider, register
from ..schema import ProviderResponse, SearchResultItem, Timer


@register("myengine")                         # registry key
class MyEngine(SearchProvider):
    requires_env = ("MYENGINE_API_KEY",)      # runner skips cleanly if missing

    def search(self, query: str) -> ProviderResponse:
        resp = ProviderResponse(provider=self.name, query=query)
        with Timer() as t:
            data = call_my_api(query)         # your HTTP call
        resp.latency_s = t.elapsed
        resp.answer = data.get("answer", "")  # if it synthesizes one
        for item in data["results"]:
            resp.results.append(SearchResultItem(
                url=item["url"], title=item["title"], snippet=item["text"]))
        resp.cost_usd = ...                   # price it (see wsbench/pricing.py)
        return resp
```

Then import it in wsbench/__init__.py (so the decorator runs) and list it in
config.yaml. Done - it now participates in every run, gets judged, costed, and
reported alongside everyone else.
- providers - which engines to run, with per-engine config (depth, limits, model)
- judges - LLM-as-judge models (averaged), routed through Requesty
- dataset - path to a JSONL of queries
- max_workers - concurrency for provider calls
The same provider class can appear multiple times with different config
(e.g. native search across Claude / Gemini / GPT) - just give each a distinct
label.
Each dataset line is one JSON object:

```json
{"id": "q01", "category": "recency", "query": "...", "reference": "ground-truth answer (optional)"}
```

reference is optional; when present, judges use it as ground truth for the
correctness dimension. Categories let the report break quality down by query type
(e.g. recency, factual, multi-hop, niche, comparison).
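A small sketch of validating one dataset row in the format above. It assumes that id, category, and query are required, which the format implies but does not state; only reference is documented as optional. `parse_query_row` is a hypothetical helper, not part of wsbench.

```python
import json

# Assumption: these three fields are mandatory; "reference" is optional.
REQUIRED_FIELDS = ("id", "category", "query")


def parse_query_row(line: str) -> dict:
    """Parse one JSONL line and check that the required fields are non-empty."""
    row = json.loads(line)
    missing = [field for field in REQUIRED_FIELDS if not row.get(field)]
    if missing:
        raise ValueError(f"row is missing required fields: {missing}")
    row.setdefault("reference", None)  # make the optional field explicit
    return row


row = parse_query_row('{"id": "q01", "category": "recency", "query": "example query"}')
print(row)
```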
Each judge model scores a candidate answer 1-5 on relevance, correctness,
completeness, groundedness, plus a holistic overall, returning strict JSON
at temperature 0. Multiple judges are averaged so no single model's bias
dominates. Default judges: bedrock/claude-opus-4-8 and openai/gpt-5.5.
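The averaging step can be sketched as follows, assuming each judge returns one numeric score per dimension. `average_judges` is a hypothetical name; the actual aggregation code in wsbench may be organized differently.

```python
from statistics import fmean

DIMENSIONS = ("relevance", "correctness", "completeness", "groundedness", "overall")


def average_judges(judge_scores: list[dict[str, float]]) -> dict[str, float]:
    """Average each rubric dimension across judges, rejecting out-of-range scores."""
    if not judge_scores:
        raise ValueError("need at least one judge")
    averaged = {}
    for dim in DIMENSIONS:
        values = [scores[dim] for scores in judge_scores]
        if any(not 1 <= v <= 5 for v in values):
            raise ValueError(f"{dim} score outside the 1-5 rubric: {values}")
        averaged[dim] = fmean(values)
    return averaged


judge_a = {"relevance": 5, "correctness": 4, "completeness": 4, "groundedness": 5, "overall": 4}
judge_b = {"relevance": 4, "correctness": 4, "completeness": 3, "groundedness": 5, "overall": 5}
combined = average_judges([judge_a, judge_b])
print(combined)
```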
A score is only meaningful if the judge actually discriminates quality. wsbench
proves this every run with two control providers (wsbench/providers/controls.py):
- control-null (negative control) returns irrelevant garbage. It must score near the floor (~1/5).
- control-oracle (positive control) returns the dataset's ground-truth reference. It must score near the ceiling (~5/5).
If these two don't cleanly separate, the judge or rubric is broken and the real scores are meaningless. In the run reported above they landed at 1.00 and 4.66 respectively, with every real engine except Tavily (4.78) in between, so the rubric spans nearly the full range.
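The separation check can be expressed as a simple guard. The thresholds below (1.5 and 4.5) are illustrative assumptions, not values taken from wsbench, and `controls_separate` is a hypothetical helper.

```python
def controls_separate(
    null_mean: float,
    oracle_mean: float,
    floor_max: float = 1.5,    # assumed cutoff for "near the floor"
    ceiling_min: float = 4.5,  # assumed cutoff for "near the ceiling"
) -> bool:
    """True if the negative control sits near 1 and the positive control near 5."""
    return null_mean <= floor_max and oracle_mean >= ceiling_min


# Control means from the results table: null = 1.00, oracle = 4.66.
print(controls_separate(1.00, 4.66))
```

A run where this returns False is a signal to inspect the judge prompt or rubric before trusting any engine score.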
The report includes more than means:
- mean ± 95% CI on quality (so you can see whether two providers actually differ or are within noise)
- win-rate - head-to-head, the fraction of queries where each provider was the top scorer (ties split). Complements the mean: a provider can have a good average but rarely win, or vice versa.
- per-category breakdown (recency / factual / multi-hop / niche / comparison / howto / numeric / entity) - recency is where native vs engine differences show up most.
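The first two statistics can be sketched as below. The interval uses a normal approximation (1.96 times the standard error), which is an assumption; the README does not say how wsbench computes its CI. Both function names are hypothetical.

```python
import math
from statistics import fmean, stdev


def mean_ci95(scores: list[float]) -> tuple[float, float]:
    """Return (mean, half-width) of an approximate 95% confidence interval."""
    mean = fmean(scores)
    if len(scores) < 2:
        return mean, 0.0
    half_width = 1.96 * stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


def win_rates(scores_by_provider: dict[str, list[float]]) -> dict[str, float]:
    """Fraction of queries where each provider is the top scorer; ties are split."""
    lengths = {len(scores) for scores in scores_by_provider.values()}
    if len(lengths) != 1:
        raise ValueError("every provider needs a score for every query")
    n_queries = lengths.pop()
    wins = dict.fromkeys(scores_by_provider, 0.0)
    for i in range(n_queries):
        best = max(scores[i] for scores in scores_by_provider.values())
        winners = [p for p, scores in scores_by_provider.items() if scores[i] == best]
        for provider in winners:
            wins[provider] += 1.0 / len(winners)  # a two-way tie gives 0.5 each
    return {provider: total / n_queries for provider, total in wins.items()}


rates = win_rates({"a": [5, 4, 3], "b": [4, 4, 5]})
print(rates, mean_ci95([3, 5]))
```

Because ties are split, the win-rates across providers always sum to 1.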
datasets/queries_large.jsonl (260 queries) is generated by
python -m wsbench.gen_dataset --target 260, which prompts a strong model for
diverse, reference-backed queries across all categories, dedupes by normalized
text, and assigns stable content-hash IDs. Regenerate or extend it freely; the
references are the judge's ground truth, so review them if you add your own.
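A minimal sketch of deduplicating by normalized text and deriving stable content-hash IDs. The normalization rules (case folding, punctuation stripping, whitespace collapsing) and the ID format are assumptions; wsbench.gen_dataset may normalize and hash differently.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Case-fold, drop punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def content_id(text: str, length: int = 10) -> str:
    """Derive an ID from the normalized text, so it is stable across regenerations."""
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    return "q" + digest[:length]


def dedupe(queries: list[str]) -> list[str]:
    """Keep the first occurrence of each query, comparing normalized text."""
    seen: set[str] = set()
    unique = []
    for query in queries:
        key = normalize(query)
        if key and key not in seen:
            seen.add(key)
            unique.append(query)
    return unique


unique = dedupe(["Who won the 2022 World Cup?", "who won the  2022 world cup", "What is HTTP/3?"])
print(unique)
```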
- Provider calls retry transient failures (429 / timeout / 5xx) with backoff.
- Judging runs concurrently (the bottleneck - opus is slow).
- Every raw call and judge response is written to disk as it completes, so a crashed run still leaves all partial evidence behind.
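The retry behavior in the first bullet can be sketched as exponential backoff with jitter. `TransientError`, `is_transient_status`, and `call_with_retry` are hypothetical names, and the attempt count and delays are illustrative, not wsbench's actual settings.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (429, timeout, 5xx)."""


def is_transient_status(status: int) -> bool:
    """HTTP statuses worth retrying: rate limiting and server errors."""
    return status == 429 or 500 <= status <= 599


def call_with_retry(fn, max_attempts: int = 4, base_delay_s: float = 0.5, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            delay = base_delay_s * 2 ** (attempt - 1)
            # Jitter spreads out retries from concurrent workers.
            sleep(delay + random.uniform(0, delay / 2))
```

Passing `sleep` as a parameter keeps the function testable without real waiting.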
See wsbench/pricing.py - a single, auditable table. Native-search costs prefer
the gateway's returned usage.cost (authoritative) and add an estimated
server-tool surcharge for the search itself. Search-engine costs are estimated
from each vendor's published credit pricing and flagged cost_is_estimate.
Verify against current vendor pricing pages before quoting numbers.
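A rough sketch of the native-search cost rule described above: prefer the gateway's usage.cost, fall back to an estimate when it is absent, and add a per-search surcharge. The function name, the fallback parameter, and the dollar figures in the example are all illustrative assumptions, not entries from wsbench/pricing.py.

```python
from typing import Mapping, Optional


def native_search_cost(
    usage: Mapping[str, float],
    searches: int,
    surcharge_per_search_usd: float,
    fallback_usd: float = 0.0,
) -> float:
    """Gateway-reported cost (or a fallback estimate) plus the search surcharge."""
    base: Optional[float] = usage.get("cost")
    if base is None:
        base = fallback_usd  # no authoritative cost returned by the gateway
    return base + searches * surcharge_per_search_usd


# Made-up numbers: $0.019 reported by the gateway plus one search at $0.01.
total = native_search_cost({"cost": 0.019}, searches=1, surcharge_per_search_usd=0.01)
print(round(total, 3))
```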
MIT (intended as an open-source community benchmark - contributions of new engine adapters welcome). See CONTRIBUTING.md for how to add an adapter, extend the dataset, or submit a reproduction.


