A Query.Farm VGI worker for DuckDB.
vgi-wikipedia · a Query.Farm VGI worker
A VGI worker that brings
Wikipedia / MediaWiki full-text search and page retrieval into DuckDB/SQL —
search articles and fetch page summaries as SQL functions, over the free,
official MediaWiki API. No API key, no scraping. Great for RAG / knowledge
grounding; pairs naturally with vgi-embed
and vgi-rerank.
INSTALL vgi FROM community; LOAD vgi;
ATTACH 'wiki' (TYPE vgi, LOCATION 'uv run wikipedia_worker.py');
-- Full-text search (table function; named args)
SELECT title, snippet, url
FROM wiki.wiki_search('apache arrow', lang := 'en', count := 10);
-- A page's summary extract (scalar; positional only)
SELECT wiki.wiki_page('DuckDB'); -- English
SELECT wiki.wiki_page('DuckDB', 'de'); -- German Wikipedia
-- The rich, multi-column page summary (table function)
SELECT title, extract, url, thumbnail_url, pageid
FROM wiki.wiki_page_summary('DuckDB', lang := 'en');The worker code in this repository is MIT (see LICENSE).
The CONTENT it retrieves is not. Text from Wikipedia and other Wikimedia
projects is licensed CC-BY-SA
(article text is CC-BY-SA 4.0; some media differ). That means if you store,
redisplay, or redistribute the snippets / extracts this worker returns —
including feeding them into a product or a published dataset — attribution and
share-alike are your responsibility. This worker does not and cannot grant you
any rights to the content; it only fetches it. When in doubt, link back to the
source article (the url column gives you the canonical URL) and credit
Wikipedia.
Please also respect the Wikimedia API etiquette:
the worker sends a descriptive User-Agent (required), uses per-call timeouts,
and retries politely with backoff. For heavy/automated use, set your own
User-Agent (see Configuration) and consider the rate limits.
| Task | SQL surface | VGI primitive |
|---|---|---|
| Full-text search | wiki_search('query', lang := 'en', count := 10) |
table function |
| A page's summary text | wiki_page('DuckDB'[, 'de']) |
scalar function |
| A page's full summary row | wiki_page_summary('DuckDB', lang := 'en') |
table function |
Conventions
- VGI scalars are positional-only (DuckDB's
name := valueis a table-function feature). Sowiki_pageis two arity overloads:wiki_page(title)(English) andwiki_page(title, lang). The table functionswiki_search/wiki_page_summarytake named args. - Search snippets are stripped to plain text — the MediaWiki API returns
HTML (
<span class="searchmatch">…</span>and entities); this worker removes the markup so thesnippetcolumn is clean text. - Missing fields become NULL (e.g. a hit with no timestamp, a summary with no thumbnail).
Full-text search via the Action API (action=query&list=search). Returns:
| column | type | notes |
|---|---|---|
title |
VARCHAR | article title |
snippet |
VARCHAR | excerpt, HTML stripped to plain text |
pageid |
BIGINT | MediaWiki page id |
wordcount |
INTEGER | article word count |
url |
VARCHAR | canonical article URL |
lang |
VARCHAR | language the result came from |
extra |
VARCHAR (JSON) | {timestamp, size} when present, else NULL |
count— results per API page (1–50, default 10).max_pages— how manysroffsetpages to follow (1–20, default 1: a single top-N call). The MediaWiki continuation token (sroffset) is carried as externalized scan state so paging survives DuckDB's batch boundaries. This is deliberately the easy (non-defensible) kind of scan state — a plain int — and is documented as such.api_url— point at any MediaWiki wiki (e.g.https://www.wikidata.org/w/api.php) instead of Wikipedia. May contain a{lang}placeholder.
-- Follow up to 3 pages of German results
SELECT title FROM wiki.wiki_search('datenbank', lang := 'de', count := 5, max_pages := 3);The plain-text summary extract (lead paragraph) of a page, or NULL for a missing page / NULL input. Never errors a scan.
SELECT wiki_page('DuckDB'); -- English
SELECT wiki_page('DuckDB', 'fr'); -- French WikipediaThe rich, multi-column summary row (the multi-column companion to the scalar):
title, extract, url, thumbnail_url, pageid. A missing page yields zero
rows (not an error).
-- search Wikipedia, then embed the snippets for retrieval (with vgi-embed)
WITH hits AS (
SELECT title, url, snippet FROM wiki.wiki_search('vector database', count := 10)
)
SELECT title, url, embed.embed(snippet) AS vec FROM hits;| env var | purpose | default |
|---|---|---|
VGI_WIKIPEDIA_API_URL |
override the api.php base (another wiki, or a mock server). May contain {lang}. |
https://{lang}.wikipedia.org/w/api.php |
VGI_WIKIPEDIA_USER_AGENT |
override the (required) descriptive User-Agent. | vgi-wikipedia/<version> (…) |
No API key is ever needed or accepted — the MediaWiki API is free and open.
- Action API
api.php?action=query&list=search— full-text search, withsrprop=snippet|wordcount|timestampandsroffsetpaging. - REST summary
/api/rest_v1/page/summary/{title}— the rich page summary (extract + thumbnail + canonical URL), with an Action-APIprop=extractsfallback when REST is unavailable (e.g. a non-Wikimedia wiki).
make test # unit/fixture tests + mock-server E2E + SQL (haybarn) E2E
make test-unit # pytest: fixture parsers, scan-state round-trip, mock-server E2E
make test-sql # DuckDB sqllogictest E2E via haybarn-unittest, mock-server-driven
make test-live # OPTIONAL gated smoke against the real Wikipedia API (needs network)
make lint # ruff + mypyThe default test gate never depends on live Wikipedia: every networked path
is exercised against a local mock HTTP server serving captured fixtures
(deterministic, no keys, no cost). The SQL E2E (make test-sql) launches that
mock, points the worker's VGI_WIKIPEDIA_API_URL at it, and runs the worker as a
real DuckDB subprocess — asserting the columns, the sroffset scan-state
round-trip across a batch boundary, plain-text snippets, and a clean error path.
- Worker code: MIT (this repository).
- Retrieved content: CC-BY-SA (Wikipedia/Wikimedia) — attribution & share-alike are the user's responsibility (see the warning at the top).
- Dependencies: permissive (
httpx,pyarrow,vgi-python).
Written by Query.Farm.
Copyright 2026 Query Farm LLC - https://query.farm
