Query Elasticsearch & OpenSearch as Tables in DuckDB

A Query.Farm VGI worker for DuckDB.

Query Elasticsearch & OpenSearch as Tables in DuckDB

vgi-elasticsearch · a Query.Farm VGI worker

A VGI worker (Go) that queries an Elasticsearch / OpenSearch index as a SQL table from DuckDB. It uses Point-In-Time (PIT) + search_after for consistent, resumable deep pagination — the deprecated stateful scroll API is not used. The pagination cursor (the PIT id + the last hit's sort values) is externalized into the worker's gob-encoded scan state, so a scan resumes statelessly across VGI batch boundaries with no duplicates and no drops.

LOAD vgi;
ATTACH 'elasticsearch' AS es (TYPE vgi, LOCATION '/path/to/vgi-elasticsearch-worker');

-- Page through an entire index as rows (columns derived from the mapping):
SELECT _id, name, n, created
FROM es.es_search('http://localhost:9200', 'my_index');

-- Predicates push to the cluster as a query DSL; only requested fields leave it:
SELECT _id, n
FROM es.es_search('http://localhost:9200', 'my_index')
WHERE n >= 100 AND status = 'active';

The `es_search` table function

es_search(endpoint, index, ...) -> rows

Argument	Kind	Default	Description
`endpoint`	positional	—	Cluster base URL, e.g. `http://localhost:9200`
`index`	positional	—	Index, alias, or wildcard to search
`fields`	named	(introspect)	`name:estype,...` to declare columns explicitly; empty = read the index mapping
`query`	named	—	Raw query-DSL JSON escape hatch (merged under `bool/filter` with pushed predicates)
`sort`	named	`_id`	`field[:desc],...`; an `_id` tiebreaker is always appended
`flavor`	named	`opensearch`	PIT dialect: `opensearch` or `elasticsearch`
`keep_alive`	named	`1m`	PIT `keep_alive` duration
`page_size`	named	`1000`	Hits fetched per scan tick (`size`)
`username`/`password`	named	—	HTTP basic auth
`apikey`	named	—	`Authorization: ApiKey <value>`
`insecuretls`	named	`false`	Skip TLS verification (self-signed test clusters)

Every row carries _id VARCHAR and _score DOUBLE plus the projected source fields.

Resumable deep pagination (the differentiator)

Each Process tick does exactly one unit of I/O:

open a PIT on the first tick (reuse it thereafter);
run one _search with search_after set to the last hit's sort values from the previous tick (stored in scan state);
emit that page's hits and advance the cursor in state;
when a page is short, close the PIT and finish.

The scan state holds only plain exported, gob-encodable fields — the PIT id, the last sort values (raw JSON), and the column/sort/query plan — so the VGI runtime can serialize it between ticks and rehydrate the cursor on the next tick. The HTTP client and the Arrow batch are rebuilt from those fields every tick; nothing non-serializable (no connection, no arrow.Record) ever lives in state.

The sort always ends in an _id tiebreaker so the ordering is total and search_after never skips or repeats a document at a page boundary. (We deliberately avoid the _shard_doc tiebreaker: Elasticsearch 8 supports it but OpenSearch 2.x rejects it with "No mapping found for [_shard_doc]".)

Projection pushdown

The columns DuckDB actually requests are translated into a _source include list, so bytes the query does not need never leave the cluster. Selecting only _id skips _source entirely.

Predicate pushdown

DuckDB WHERE predicates are mapped onto the ES query DSL:

SQL	ES query DSL
`col = v`	`term`
`col IN (a, b)`	`terms`
`col >/>=/</<= v`	`range`
`col != v`	`bool.must_not term`
`col IS NULL`	`bool.must_not exists`
`col IS NOT NULL`	`exists`
`AND` of the above	combined `bool.filter`

Predicates that cannot be pushed (OR, struct/nested, expression, join-key filters) fall back to DuckDB post-filtering. Pushdown is always a performance choice, never a correctness one: DuckDB re-applies every predicate client-side, so an imperfect pushdown can only return a superset, which DuckDB then trims. A raw query argument is AND-merged with the pushed predicates.

Type mapping

Elasticsearch type	Arrow / DuckDB type
`keyword`, `text`, `ip`, `version`, …	`VARCHAR`
`long`, `integer`, `short`, `byte`, `unsigned_long`	`BIGINT`
`double`, `float`, `half_float`, `scaled_float`	`DOUBLE`
`boolean`	`BOOLEAN`
`date`, `date_nanos`	`TIMESTAMPTZ` (UTC; explicit `arrow_type`)
`object`, `nested`, `geo_*`, …	`VARCHAR` holding the raw JSON

date values are parsed from RFC3339 strings or epoch-millis. Object/nested fields are surfaced as JSON text — query them with DuckDB's json functions (meta->>'$.idx'). STRUCT/LIST/TIMESTAMPTZ returns require an explicit arrow_type, which the worker always supplies.

Building and testing

make build      # build the worker + seeder binaries
make test-unit  # pure-Go unit tests (no cluster needed)
make test-live  # Go live tests against a dockerized OpenSearch
make test-sql   # haybarn SQL E2E against a dockerized OpenSearch
make test       # all of the above

test-live and test-sql require Docker; the Makefile boots a single-node opensearchproject/opensearch:2.17.0 (Apache-2.0) with the security plugin disabled on host port 9209, seeds a fixed index, runs the suite, and tears the container down. test-sql also needs haybarn-unittest on PATH (uv tool install haybarn-unittest).

The centerpiece live test indexes 25 documents and pages through them one batch per tick at page_size := 4 (7 ticks: 6×4 + 1 short), gob-round-tripping the scan state between every tick, and asserts every document is returned exactly once — proving the search_after cursor survives a batch boundary.

Licensing

This worker is MIT-licensed (see LICENSE). Its dependency surface is Apache-2.0 and permissive only — no GPL/AGPL:

github.com/apache/arrow-go/v18 — Apache-2.0
github.com/Query-farm/vgi-go, github.com/Query-farm/vgi-rpc-go — Query Farm SDK
the cluster client is implemented directly on the Go standard library (net/http + encoding/json), so no Elasticsearch/OpenSearch client library is vendored at all. The official Apache-2.0 clients (github.com/opensearch-project/opensearch-go, github.com/elastic/go-elasticsearch) are drop-in compatible if a future version wants their connection pooling/sniffing.

Server-license note: the Elasticsearch server is dual-licensed under SSPL / the Elastic License (not OSS). This worker is only a client of an HTTP API, which those licenses do not restrict. For the test container and CI we deliberately use OpenSearch (Apache-2.0) to keep the license surface clean; the worker speaks to both flavors at runtime (flavor := 'elasticsearch').

Caveats and gaps (v1)

Object/nested fields are surfaced as JSON-in-VARCHAR, not as Arrow STRUCTs. Full recursive mapping resolution into nested STRUCT/LIST types is future work; JSON keeps every field usable today via DuckDB's json functions.
Predicate pushdown covers term/terms/range/exists and their AND composition. OR, struct/nested-field, expression, and join-key filters are not pushed (they post-filter in DuckDB). Text fields are matched with term (exact); if you index analyzed text you may want a .keyword sub-field or the raw query escape hatch for full-text match.
PIT keep-alive / expiry: each search refreshes the PIT's keep_alive, but a scan that stalls longer than keep_alive between ticks can see the PIT expire; raise keep_alive for very slow consumers. The PIT is closed on scan completion; on an abandoned scan it expires on its own.
Aggregations are not exposed — es_search returns documents, not aggs. Use DuckDB's GROUP BY over the returned rows, or the raw query hatch for bespoke document queries.
_shard_doc is intentionally unused (OpenSearch 2.x incompatibility); the _id tiebreaker is universal but slightly less efficient on very large shards than a numeric doc-order sort would be.

Authorship & License

Written by Query.Farm.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.github/workflows		.github/workflows
ci		ci
cmd		cmd
docs		docs
internal/esworker		internal/esworker
test/sql		test/sql
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
go.mod		go.mod
go.sum		go.sum

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Query Elasticsearch & OpenSearch as Tables in DuckDB

The `es_search` table function

Resumable deep pagination (the differentiator)

Projection pushdown

Predicate pushdown

Type mapping

Building and testing

Licensing

Caveats and gaps (v1)

Authorship & License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Query Elasticsearch & OpenSearch as Tables in DuckDB

The es_search table function

Resumable deep pagination (the differentiator)

Projection pushdown

Predicate pushdown

Type mapping

Building and testing

Licensing

Caveats and gaps (v1)

Authorship & License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

The `es_search` table function

Packages