A Query.Farm VGI worker for DuckDB.
vgi-rerank · a Query.Farm VGI worker · powered by fastembed (ONNX)
Local cross-encoder reranking as a DuckDB SQL function — the precision second stage of retrieval-augmented generation (RAG), entirely on your machine.
A VGI worker that scores how relevant a document is to a
query with a cross-encoder, so a SQL query can do precise second-stage
reranking in-engine over a top-K candidate set. It runs MS MARCO / bge reranker
models through fastembed (Qdrant,
Apache-2.0), which uses ONNX Runtime — no torch, so it installs light and
starts fast. No API keys, no network at query time.
vgi-rerank is the companion to vgi-embed: together with
DuckDB VSS (ANN recall) and BM25 (keyword recall), they complete a fully local
RAG retrieval stack —
vgi-embed vectors → DuckDB VSS ANN + BM25 keyword → top-K candidates → vgi-rerank precision
INSTALL vgi FROM community; LOAD vgi;
ATTACH 'rerank' (TYPE vgi, LOCATION 'uv run rerank_worker.py');
-- Score one query/document pair (higher = more relevant).
SELECT rerank.rerank_score('how do I reset my password',
'Click the forgot password link to reset it.');
-- THE pattern: rerank a top-K candidate set (e.g. from VSS / BM25) by relevance.
SELECT id, chunk
FROM candidates -- the ~100 rows recall handed you
ORDER BY rerank.rerank_score('how do I reset my password', chunk) DESC
LIMIT 10;
-- What models are available?
SELECT * FROM rerank.supported_models();

A bi-encoder (vgi-embed) maps a query and a document to independent vectors. Those vectors can be precomputed once, indexed (DuckDB VSS / HNSW), and reused for every future query. That is what makes embedding-based recall over a whole corpus cheap.
A cross-encoder (this worker) instead feeds the query and the document
together through one transformer and emits a single relevance score. That
joint attention is markedly more accurate — but it cannot be precomputed or
cached: every (query, document) pair is a fresh forward pass at query time.
The load-bearing caveat: because each pair is scored live, a cross-encoder is only viable as a second-stage reranker over a top-K candidate set (rerank the ~100 candidates recall handed you, never the whole corpus). Embeddings you compute once; rerank scores you pay for on every query. Run recall first (VSS ANN and/or BM25), LIMIT to a candidate set, then ORDER BY rerank_score(...).
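The recall-then-rerank split can be sketched in plain Python. This is a toy illustration, not the worker's code: `cheap_recall_score` and `expensive_pair_score` are hypothetical stand-ins (simple word overlap) for a precomputed-vector lookup and a cross-encoder forward pass, and the corpus is made up. The point is only the call count: the expensive scorer runs once per candidate, not once per corpus row.

```python
from typing import Callable


def two_stage_search(
    query: str,
    corpus: list[str],
    recall_score: Callable[[str, str], float],
    pair_score: Callable[[str, str], float],
    top_k: int,
    keep: int,
) -> list[str]:
    """Recall top_k cheaply over the whole corpus, then rerank only those."""
    # Stage 1: cheap recall over every document (stands in for VSS ANN / BM25).
    candidates = sorted(corpus, key=lambda d: recall_score(query, d), reverse=True)[:top_k]
    # Stage 2: expensive pairwise scoring, restricted to the candidates.
    reranked = sorted(candidates, key=lambda d: pair_score(query, d), reverse=True)
    return reranked[:keep]


def cheap_recall_score(query: str, doc: str) -> float:
    # Hypothetical stand-in for recall: number of shared words.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


expensive_calls = 0


def expensive_pair_score(query: str, doc: str) -> float:
    # Hypothetical stand-in for a cross-encoder forward pass.
    global expensive_calls
    expensive_calls += 1  # count the simulated forward passes
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q | d)  # Jaccard overlap as a toy relevance signal


corpus = [
    "click the forgot password link to reset it",
    "reset your router by holding the power button",
    "our password policy requires twelve characters",
    "shipping takes three to five business days",
    "contact support to change your billing address",
]
top = two_stage_search(
    "reset my password", corpus, cheap_recall_score, expensive_pair_score, top_k=3, keep=1
)
```

With five documents and `top_k=3`, the expensive scorer is called three times. Growing the corpus leaves that count unchanged; only `top_k` moves it.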
| Default model | Xenova/ms-marco-MiniLM-L-6-v2 |
| Size | ~80 MB (quantised ONNX) |
| Model license | Apache-2.0 (commercial use permitted) — see the model card |
| Runtime | fastembed (Apache-2.0) on ONNX Runtime — no torch |
The model is downloaded on first use and cached on disk; later runs load it
locally. The cache directory is gitignored. Override it with VGI_RERANK_CACHE_DIR
(or fastembed's own FASTEMBED_CACHE_PATH). Pre-warm it once with make models.
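A minimal sketch of how the two cache variables might be resolved. The precedence shown (the worker-specific variable wins over fastembed's) is an assumption made for illustration, since the text above does not state which one takes priority, and `resolve_cache_dir` is a hypothetical helper, not a function of this worker.

```python
import os
from typing import Mapping, Optional


def resolve_cache_dir(env: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty cache override, or None if neither is set."""
    # ASSUMPTION: VGI_RERANK_CACHE_DIR is checked before FASTEMBED_CACHE_PATH.
    # The actual precedence inside the worker is not documented here.
    for name in ("VGI_RERANK_CACHE_DIR", "FASTEMBED_CACHE_PATH"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None  # caller falls back to whatever default location applies


# Usage: inspect the current process environment.
cache_dir = resolve_cache_dir(os.environ)
```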
Other supported models (pass as the third argument to rerank_score, or query
supported_models()) — all permissively licensed:
| model | license | size |
|---|---|---|
| Xenova/ms-marco-MiniLM-L-6-v2 (default) | Apache-2.0 | ~80 MB |
| Xenova/ms-marco-MiniLM-L-12-v2 | Apache-2.0 | ~120 MB |
| BAAI/bge-reranker-base | MIT | ~1 GB |
| jinaai/jina-reranker-v1-tiny-en | Apache-2.0 | ~130 MB |
| jinaai/jina-reranker-v1-turbo-en | Apache-2.0 | ~150 MB |
(fastembed also ships jina-reranker-v2-base-multilingual, but it is CC-BY-NC
— non-commercial — so this worker deliberately does not advertise it.)
| function | signature | notes |
|---|---|---|
| rerank_score(query, document) | (VARCHAR, VARCHAR) → DOUBLE | cross-encoder relevance score, default model |
| rerank_score(query, document, model) | (VARCHAR, VARCHAR, VARCHAR) → DOUBLE | explicit model (arity overload) |
| rerank_version() | → VARCHAR | worker + backend + default-model identity |

| function | columns |
|---|---|
| supported_models() | (model VARCHAR, license VARCHAR) |
rerank_score exposes its optional model via an arity overload because VGI
scalar functions are positional-only (name := value is a table-function feature).
The score is the model's raw relevance logit: higher means more relevant. It
is not normalised to a fixed range (it can be negative), and a value is only
meaningful relative to other documents for the same query — which is exactly
what ORDER BY rerank_score(:q, doc) DESC LIMIT k needs. Do not threshold on
an absolute value across different queries or models. (If you need a 0–1 score,
apply a sigmoid in SQL: 1 / (1 + exp(-rerank_score(...))).)
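The sigmoid mapping can be checked outside SQL with a short Python sketch. The logits below are made up for illustration. The function uses a numerically stable form of the same formula so that very negative logits do not overflow `exp`. Because the sigmoid is monotonic, the ranking for a given query is identical before and after squashing.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit into [0, 1] without overflowing for large magnitudes."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)  # safe: logit is negative, so z is in (0, 1)
    return z / (1.0 + z)


# Made-up logits for three documents scored against the same query.
raw = {"doc_a": 4.2, "doc_b": -1.3, "doc_c": 0.0}
squashed = {doc: sigmoid(score) for doc, score in raw.items()}

# Rank by raw logit and by squashed score; the order is the same.
by_raw = sorted(raw, key=raw.get, reverse=True)
by_squashed = sorted(squashed, key=squashed.get, reverse=True)
```

Squashing changes the scale, not the meaning: a 0 to 1 value is still only comparable among documents scored for the same query with the same model.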
A NULL or empty/whitespace-only query or document → a NULL score. Nothing
crashes on odd input. NULL scores sort last under ORDER BY ... DESC.
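The NULL behaviour described above can be mirrored in a small Python sketch. It is an illustration of the contract, not the worker's implementation: `null_safe_score` and `toy_scorer` are hypothetical names, and the scorer is a word-overlap stand-in for a cross-encoder.

```python
from typing import Callable, Optional


def null_safe_score(
    query: Optional[str],
    document: Optional[str],
    scorer: Callable[[str, str], float],
) -> Optional[float]:
    """Return None instead of scoring when either input is missing or blank."""
    if query is None or document is None:
        return None
    if not query.strip() or not document.strip():
        return None
    return scorer(query, document)


def toy_scorer(query: str, document: str) -> float:
    # Hypothetical stand-in for a cross-encoder: counts shared words.
    return float(len(set(query.lower().split()) & set(document.lower().split())))


docs = ["reset your password here", None, "   ", "password rules"]
scored = [(d, null_safe_score("reset my password", d, toy_scorer)) for d in docs]

# Descending by score with None placed last, mirroring ORDER BY ... DESC.
ranked = sorted(scored, key=lambda pair: (pair[1] is None, -(pair[1] or 0.0)))
```

Rows with a missing or blank document stay in the result but fall to the bottom, so a `LIMIT` after the sort drops them first.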
Recall first (cheap, over the whole corpus), then rerank the candidates (precise, top-K only):
INSTALL vss; LOAD vss;
-- Stage 1 — recall: ANN over precomputed vgi-embed vectors (top 100 candidates).
WITH candidates AS (
SELECT id, body
FROM docs
ORDER BY array_cosine_distance(v, embed.embed_query('reset my password')::FLOAT[384])
LIMIT 100
)
-- Stage 2 — precision: cross-encoder rerank the 100, keep the best 10.
SELECT id, body
FROM candidates
ORDER BY rerank.rerank_score('reset my password', body) DESC
LIMIT 10;

- CPU latency is real. Each row is a transformer forward pass. The MiniLM-L-6 default is the cheapest; the bge-reranker is more accurate but ~1 GB and slower. Keep candidate sets in the tens-to-low-hundreds.
- GPU is a deployment upsell. ONNX Runtime can use a GPU execution provider; that is an environment/packaging concern outside this worker's defaults.
- The value here is the in-engine, offline packaging and the local RAG suite — the cross-encoder models themselves are commodity, permissively licensed checkpoints.
uv sync --extra dev
make models # pre-warm the fastembed cache (downloads the default model)
uv run --no-sync pytest -q # unit (model-gated tests self-skip on a cold/offline checkout)
make test-sql # E2E: haybarn-unittest over test/sql/* (authoritative)
make test # both
uv run --no-sync ruff check . && uv run --no-sync mypy vgi_rerank/

make test-sql exports VGI_RERANK_WORKER="uv run --python 3.13 rerank_worker.py"
and runs haybarn-unittest --test-dir . "test/sql/*" (install once:
uv tool install haybarn-unittest, then put ~/.local/bin on PATH).
Worker code: MIT (see LICENSE). The default model
Xenova/ms-marco-MiniLM-L-6-v2 is Apache-2.0-licensed; BAAI/bge-reranker-base
is MIT; fastembed is Apache-2.0; ONNX Runtime is MIT. The vgi DuckDB extension
and vgi-python are licensed separately by Query Farm.
Written by Query.Farm.
Copyright 2026 Query Farm LLC - https://query.farm
