Wikipedia & MediaWiki Search and Page Summaries in DuckDB

A Query.Farm VGI worker for DuckDB.

Wikipedia & MediaWiki Search and Page Summaries in DuckDB

vgi-wikipedia · a Query.Farm VGI worker

A VGI worker that brings Wikipedia / MediaWiki full-text search and page retrieval into DuckDB/SQL — search articles and fetch page summaries as SQL functions, over the free, official MediaWiki API. No API key, no scraping. Great for RAG / knowledge grounding; pairs naturally with vgi-embed and vgi-rerank.

INSTALL vgi FROM community; LOAD vgi;
ATTACH 'wiki' (TYPE vgi, LOCATION 'uv run wikipedia_worker.py');

-- Full-text search (table function; named args)
SELECT title, snippet, url
FROM wiki.wiki_search('apache arrow', lang := 'en', count := 10);

-- A page's summary extract (scalar; positional only)
SELECT wiki.wiki_page('DuckDB');             -- English
SELECT wiki.wiki_page('DuckDB', 'de');       -- German Wikipedia

-- The rich, multi-column page summary (table function)
SELECT title, extract, url, thumbnail_url, pageid
FROM wiki.wiki_page_summary('DuckDB', lang := 'en');

⚠️ Content licensing — read this first

The worker code in this repository is MIT (see LICENSE).

The CONTENT it retrieves is not. Text from Wikipedia and other Wikimedia projects is licensed CC-BY-SA (article text is CC-BY-SA 4.0; some media differ). That means if you store, redisplay, or redistribute the snippets / extracts this worker returns — including feeding them into a product or a published dataset — attribution and share-alike are your responsibility. This worker does not and cannot grant you any rights to the content; it only fetches it. When in doubt, link back to the source article (the url column gives you the canonical URL) and credit Wikipedia.

Please also respect the Wikimedia API etiquette: the worker sends a descriptive User-Agent (required), uses per-call timeouts, and retries politely with backoff. For heavy/automated use, set your own User-Agent (see Configuration) and consider the rate limits.

How it maps Wikipedia onto SQL

Task	SQL surface	VGI primitive
Full-text search	`wiki_search('query', lang := 'en', count := 10)`	table function
A page's summary text	`wiki_page('DuckDB'[, 'de'])`	scalar function
A page's full summary row	`wiki_page_summary('DuckDB', lang := 'en')`	table function

Conventions

VGI scalars are positional-only (DuckDB's name := value is a table-function feature). So wiki_page is two arity overloads: wiki_page(title) (English) and wiki_page(title, lang). The table functions wiki_search / wiki_page_summary take named args.
Search snippets are stripped to plain text — the MediaWiki API returns HTML (<span class="searchmatch">…</span> and entities); this worker removes the markup so the snippet column is clean text.
Missing fields become NULL (e.g. a hit with no timestamp, a summary with no thumbnail).

Function catalog

`wiki_search(query, lang := 'en', count := 10, max_pages := 1, api_url := '')` → table

Full-text search via the Action API (action=query&list=search). Returns:

column	type	notes
`title`	VARCHAR	article title
`snippet`	VARCHAR	excerpt, HTML stripped to plain text
`pageid`	BIGINT	MediaWiki page id
`wordcount`	INTEGER	article word count
`url`	VARCHAR	canonical article URL
`lang`	VARCHAR	language the result came from
`extra`	VARCHAR (JSON)	`{timestamp, size}` when present, else NULL

count — results per API page (1–50, default 10).
max_pages — how many sroffset pages to follow (1–20, default 1: a single top-N call). The MediaWiki continuation token (sroffset) is carried as externalized scan state so paging survives DuckDB's batch boundaries. This is deliberately the easy (non-defensible) kind of scan state — a plain int — and is documented as such.
api_url — point at any MediaWiki wiki (e.g. https://www.wikidata.org/w/api.php) instead of Wikipedia. May contain a {lang} placeholder.

-- Follow up to 3 pages of German results
SELECT title FROM wiki.wiki_search('datenbank', lang := 'de', count := 5, max_pages := 3);

`wiki_page(title[, lang])` → VARCHAR (scalar)

The plain-text summary extract (lead paragraph) of a page, or NULL for a missing page / NULL input. Never errors a scan.

SELECT wiki_page('DuckDB');          -- English
SELECT wiki_page('DuckDB', 'fr');    -- French Wikipedia

`wiki_page_summary(title, lang := 'en', api_url := '')` → table

The rich, multi-column summary row (the multi-column companion to the scalar): title, extract, url, thumbnail_url, pageid. A missing page yields zero rows (not an error).

RAG composition

-- search Wikipedia, then embed the snippets for retrieval (with vgi-embed)
WITH hits AS (
  SELECT title, url, snippet FROM wiki.wiki_search('vector database', count := 10)
)
SELECT title, url, embed.embed(snippet) AS vec FROM hits;

Configuration

env var	purpose	default
`VGI_WIKIPEDIA_API_URL`	override the `api.php` base (another wiki, or a mock server). May contain `{lang}`.	`https://{lang}.wikipedia.org/w/api.php`
`VGI_WIKIPEDIA_USER_AGENT`	override the (required) descriptive User-Agent.	`vgi-wikipedia/<version> (…)`

No API key is ever needed or accepted — the MediaWiki API is free and open.

Endpoints used

Action API api.php?action=query&list=search — full-text search, with srprop=snippet|wordcount|timestamp and sroffset paging.
REST summary /api/rest_v1/page/summary/{title} — the rich page summary (extract + thumbnail + canonical URL), with an Action-API prop=extracts fallback when REST is unavailable (e.g. a non-Wikimedia wiki).

Development & testing

make test         # unit/fixture tests + mock-server E2E + SQL (haybarn) E2E
make test-unit    # pytest: fixture parsers, scan-state round-trip, mock-server E2E
make test-sql     # DuckDB sqllogictest E2E via haybarn-unittest, mock-server-driven
make test-live    # OPTIONAL gated smoke against the real Wikipedia API (needs network)
make lint         # ruff + mypy

The default test gate never depends on live Wikipedia: every networked path is exercised against a local mock HTTP server serving captured fixtures (deterministic, no keys, no cost). The SQL E2E (make test-sql) launches that mock, points the worker's VGI_WIKIPEDIA_API_URL at it, and runs the worker as a real DuckDB subprocess — asserting the columns, the sroffset scan-state round-trip across a batch boundary, plain-text snippets, and a clean error path.

Licensing summary

Worker code: MIT (this repository).
Retrieved content: CC-BY-SA (Wikipedia/Wikimedia) — attribution & share-alike are the user's responsibility (see the warning at the top).
Dependencies: permissive (httpx, pyarrow, vgi-python).

Authorship & License

Written by Query.Farm.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.github/workflows		.github/workflows
ci		ci
docs		docs
scripts		scripts
test/sql		test/sql
tests		tests
vgi_wikipedia		vgi_wikipedia
.gitignore		.gitignore
.python-version		.python-version
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
conftest.py		conftest.py
pyproject.toml		pyproject.toml
serve.py		serve.py
uv.lock		uv.lock
wikipedia_worker.py		wikipedia_worker.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Wikipedia & MediaWiki Search and Page Summaries in DuckDB

⚠️ Content licensing — read this first

How it maps Wikipedia onto SQL

Function catalog

`wiki_search(query, lang := 'en', count := 10, max_pages := 1, api_url := '')` → table

`wiki_page(title[, lang])` → VARCHAR (scalar)

`wiki_page_summary(title, lang := 'en', api_url := '')` → table

RAG composition

Configuration

Endpoints used

Development & testing

Licensing summary

Authorship & License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Wikipedia & MediaWiki Search and Page Summaries in DuckDB

⚠️ Content licensing — read this first

How it maps Wikipedia onto SQL

Function catalog

wiki_search(query, lang := 'en', count := 10, max_pages := 1, api_url := '') → table

wiki_page(title[, lang]) → VARCHAR (scalar)

wiki_page_summary(title, lang := 'en', api_url := '') → table

RAG composition

Configuration

Endpoints used

Development & testing

Licensing summary

Authorship & License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`wiki_search(query, lang := 'en', count := 10, max_pages := 1, api_url := '')` → table

`wiki_page(title[, lang])` → VARCHAR (scalar)

`wiki_page_summary(title, lang := 'en', api_url := '')` → table

Packages