feat(search): /api/search sitewide search over shared site-content field (score, KWIC, facets, paging) + producer contract#52

Merged: joewiz merged 4 commits into eXist-db:develop from joewiz:feat/search-field-facets on Jun 8, 2026.

joewiz (Member) commented Jun 7, 2026

[This PR was co-authored with Claude Code. -Joe]

Summary

Builds out GET /api/search into a sitewide search beachhead: one relevance-ranked Lucene query over the shared site-content field that content apps contribute to, returning scored hits with KWIC highlighting, pagination, facet counts, and per-result links — and documents the producer contract that makes apps searchable.

What's in it

  • Documented response shape: { query, total, offset, limit, facets, results: [{ uri, path, title, app, url, score, snippet, highlights }] }, with score, <mark> KWIC snippets, and offset/limit paging.
  • Shared site-content field query + facets — queries the shared field (body + boosted site-title), aggregates site-app/site-section facet counts, and supports app/section facet drill-down. Results are matched with a single-step axis (collection("/db/apps")/*), which is element-name-independent and — unlike a //* descendant wildcard — preserves ft:score for field queries.
  • Producer display fields — returns title/url from the producer's site-title/site-url fields (falling back to a <title> child / computed link).
  • README "Sitewide search" section — documents the shared site-content producer contract: the required/recommended fields and facets, the identical-names-across-apps rule, and the site-url convention (root-relative, points at the rendered page, no scheme/host/port; absolutize at the edge from a single configured base URL). Producers own site-url correctness — /api/search surfaces the field as stored and does not normalize it.
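For consumers, the documented response shape can be mirrored in a small typed model. The following is a minimal sketch, not the project's client code: the field names come from the shape listed above, while the Python types, the class names (`SearchHit`, `SearchResponse`), and the sample payload are assumptions made for illustration. The inner layout of `facets` is not spelled out here, so it is kept as an opaque dictionary.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SearchHit:
    # Field names follow the documented shape; the types are assumptions.
    uri: str
    path: str
    title: str
    app: str
    url: str
    score: float
    snippet: str
    highlights: List[str]


@dataclass
class SearchResponse:
    query: str
    total: int
    offset: int
    limit: int
    facets: Dict[str, Any]  # inner layout not specified in this description
    results: List[SearchHit]


def parse_search_response(body: str) -> SearchResponse:
    """Parse a JSON body with the documented top-level keys."""
    data = json.loads(body)
    hits = [SearchHit(**hit) for hit in data["results"]]
    return SearchResponse(
        query=data["query"],
        total=data["total"],
        offset=data["offset"],
        limit=data["limit"],
        facets=data["facets"],
        results=hits,
    )


# Made-up payload, for illustration only.
SAMPLE_BODY = json.dumps({
    "query": "xquery",
    "total": 1,
    "offset": 0,
    "limit": 10,
    "facets": {},
    "results": [{
        "uri": "/db/apps/docs/data/intro.xml",
        "path": "/db/apps/docs/data/intro.xml",
        "title": "Introduction",
        "app": "docs",
        "url": "/docs/intro",
        "score": 1.5,
        "snippet": "<span>Learn <mark>xquery</mark> basics</span>",
        "highlights": ["<span>Learn <mark>xquery</mark> basics</span>"],
    }],
})

response = parse_search_response(SAMPLE_BODY)
```

A strict `SearchHit(**hit)` raises on unexpected keys, which is convenient for catching contract drift in a test but may be too rigid for a production consumer.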

Context

This is the existdb-openapi half of a cross-repo search-convergence effort: exist-site-shell and the eXist Oxygen plugin become thin consumers of /api/search instead of each carrying its own dispatch, and a native eXist ft:search-scope function (in progress, separate branch) will become the engine behind this endpoint. Producer apps (documentation-next, exist-blog, notebook) contribute to the shared field per the documented contract.

Notes for review

  • Lucene query metacharacters in the user's q are escaped server-side, so identifier/qualified-name queries (map:merge, util:eval) work without the client escaping; the colon is not treated as a field selector.
  • The current matched-unit is the document root; relevance, dedup-per-document, facets, and KWIC windowing are computed in XQuery and documented in the response shape.
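Because escaping happens server-side, a client should only need ordinary URL encoding for `q`. A minimal sketch of building a request URL, assuming a hypothetical base URL and hypothetical default values for `offset` and `limit`; only the parameter names (`q`, `offset`, `limit`, `app`, `section`) come from this PR:

```python
from urllib.parse import urlencode


def build_search_url(base_url, q, offset=0, limit=10, app=None, section=None):
    """Build a GET /api/search URL; q is URL-encoded but not Lucene-escaped."""
    params = {"q": q, "offset": offset, "limit": limit}
    if app is not None:
        params["app"] = app
    if section is not None:
        params["section"] = section
    return base_url.rstrip("/") + "/api/search?" + urlencode(params)


# Hypothetical base URL; the colon is percent-encoded for transport only.
url = build_search_url("http://localhost:8080/exist/apps/openapi", "map:merge")
```

The colon in `map:merge` is percent-encoded as `%3A` on the wire and arrives at the server as a literal colon, which the endpoint then escapes for Lucene.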

joewiz and others added 4 commits June 6, 2026 19:12
…ing, paging

Replace the ad-hoc /api/search response with an ES/Lucene-inspired shape, per
the existdb-oxygen-plugin tasking.

- KWIC highlighting: snippets centered on a match (not the document's leading
  200 chars), matched terms wrapped in <mark>. Field-agnostic — expands the
  matched nodes (util:expand) and summarizes with the eXist kwic module, so it
  works with whatever text index an app defines. Each fragment is a well-formed
  XML string: a single <span> root, so a client can fn:parse-xml it directly
  (no parse-xml-fragment) and it is also valid HTML for innerHTML. Each hit
  carries a marked `snippet` (first fragment) and a `highlights` array.
- Relevance score: ft:score per hit, results ordered by it, descending.
- Pagination: new `offset` parameter; `offset`/`limit` echoed.
- One result per document: ft:query matches nested elements, so a document
  yielded many hits; results are deduplicated by document URI keeping the
  highest-scoring hit, and `total` is document-level.
- Canonical identifier: both `uri` and `path` (alias) are returned and
  documented as the id a client opens over exist:.
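The dedup, ranking, and paging steps above can be sketched outside XQuery as follows. This is an illustrative Python analogue under stated assumptions, not the endpoint's implementation: hits are modelled as dictionaries with `uri` and `score` keys, and the function name `rank_and_page` is hypothetical.

```python
def rank_and_page(hits, offset, limit):
    """Keep the best-scoring hit per document, rank by score, then page."""
    best = {}
    for hit in hits:
        current = best.get(hit["uri"])
        # One result per document: retain only the highest-scoring hit.
        if current is None or hit["score"] > current["score"]:
            best[hit["uri"]] = hit
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return {
        "total": len(ranked),  # document-level count, not raw hit count
        "offset": offset,      # echoed back to the caller
        "limit": limit,
        "results": ranked[offset:offset + limit],
    }


# Made-up hits: two nested matches in one document, one match in another.
hits = [
    {"uri": "/db/apps/a/x.xml", "score": 0.4},
    {"uri": "/db/apps/a/x.xml", "score": 1.2},
    {"uri": "/db/apps/b/y.xml", "score": 0.9},
]
page = rank_and_page(hits, offset=0, limit=1)
```

Computing `total` after deduplication is what makes it document-level: three raw hits here yield a total of two.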

api.json documents the new parameter and response shape. Cypress: search.cy.js
asserts <mark> in snippet/highlights, well-formed single-rooted <span>
fragments, descending non-zero scores, dedup, offset/limit paging, and the
app filter.
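A consumer-side check equivalent to the fragment assertions might look like the sketch below. It assumes the fragment is a non-namespaced `<span>` containing `<mark>` children, as described above; the exact attributes and surrounding text in real responses are not shown in this PR, so the sample string is made up.

```python
import xml.etree.ElementTree as ET


def marked_terms(fragment):
    """Parse a highlight fragment and return the text of each <mark>."""
    # ET.fromstring raises ParseError for malformed or multi-rooted input.
    root = ET.fromstring(fragment)
    if root.tag != "span":
        raise ValueError("expected a single <span> root, got <%s>" % root.tag)
    return ["".join(mark.itertext()) for mark in root.iter("mark")]


# Made-up fragment for illustration.
fragment = "<span>... the <mark>xquery</mark> function library ...</span>"
terms = marked_terms(fragment)
```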

Verified on a clean eXist 7.0.0-beta3 against the doc app (all 9 specs pass).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Evolve /api/search from a generic //*[ft:query] scan to the validated
field-first recipe (see the search-convergence spike): query the shared
site-content Lucene field that content apps contribute to, with facet counts
and drill-down filtering — so site-wide and per-app search are one endpoint.

- Field-scoped query string `site-content:(…) OR site-title:(…)^3`, user terms
  escaped for Lucene metacharacters (so `array:count`'s colon is literal, not a
  field selector — the QueryParser bug). default-operator=and, filter-rewrite=yes.
- Matches at the document-root level (collection("/db/apps")/*), not //*:
  ft:score is lost for field queries under the //* descendant axis but preserved
  on a single-step axis (spike finding; core fix proposed separately).
- Facet counts (site-app, site-section) via ft:facets, returned for a filter UI;
  app/section params narrow via Lucene facet drill-down.
- KWIC <mark> highlights now come from the FIELD via ft:highlight-field-matches
  (+ kwic), not util:expand — field matches don't map to the node's element text.
- Keeps score ranking, per-document dedup, paging, and the well-formed <span>
  highlight strings from the prior endpoint.
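The field-scoped query string can be illustrated with a short Python sketch. The two field names and the `^3` boost are taken from the first bullet above; the set of escaped characters follows the special characters listed in Lucene's classic query parser documentation, and whether the endpoint escapes exactly this set is an assumption. The function names are hypothetical.

```python
import re

# Characters the Lucene classic query parser treats as syntax.
_LUCENE_SPECIAL = re.compile(r'([\\+\-!():^\[\]"{}~*?|&/])')


def escape_lucene(term):
    """Backslash-escape Lucene query metacharacters in user input."""
    return _LUCENE_SPECIAL.sub(r"\\\1", term)


def build_site_query(user_query):
    """Compose the body query OR'ed with the boosted title query."""
    escaped = escape_lucene(user_query)
    return "site-content:(%s) OR site-title:(%s)^3" % (escaped, escaped)


query = build_site_query("array:count")
```

With the colon escaped, `array:count` is parsed as a literal term rather than as a `count` lookup in a field named `array`.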

api.json documents the new `section` parameter and the `facets` response object.
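To make the facet behaviour concrete, here is a rough Python analogue of counting and drill-down. It is a sketch only: the real counts come from the Lucene index via ft:facets, the layout of the `facets` object shown here (dimension name mapped to value counts) is an assumption rather than the documented api.json schema, and the sample documents are made up.

```python
from collections import Counter


def facet_counts(docs, dimensions=("site-app", "site-section")):
    """Count documents per facet value for each dimension."""
    return {
        dim: dict(Counter(doc[dim] for doc in docs if dim in doc))
        for dim in dimensions
    }


def drill_down(docs, app=None, section=None):
    """Narrow the result set the way the app/section parameters do."""
    return [
        doc for doc in docs
        if (app is None or doc.get("site-app") == app)
        and (section is None or doc.get("site-section") == section)
    ]


# Made-up two-app corpus.
docs = [
    {"uri": "a", "site-app": "docs", "site-section": "guide"},
    {"uri": "b", "site-app": "docs", "site-section": "reference"},
    {"uri": "c", "site-app": "blog", "site-section": "guide"},
]
counts = facet_counts(docs)
narrowed = drill_down(docs, app="docs", section="guide")
```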

Verified on a clean eXist 7.0.0-beta3 with a synthetic two-app site-content
corpus: cross-app ranked results with non-zero scores, facet counts spanning
apps, app/section drill-down, <mark> KWIC highlights, colon-escaped queries
(no QueryParser crash), and the required-q 400.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Retrieve a hit's display title and URL from the producer's site-title/site-url
fields (via the ft:query "fields" option + ft:field) instead of assuming a
<title> child element — so results from documentation-next, exist-blog, and
notebook show real titles and app-relative URLs. Falls back to a <title> child
/ computed link when the fields are absent.

Verified against the real three-producer corpus (docs/blog/notebook): titles,
urls, KWIC highlights, and per-app/per-section facet counts all populate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document the shared site-content producer contract in the README (new
"Sitewide search" section): the required/recommended fields and facets
(site-content, site-title, site-url, site-app, site-section), the
identical-names-across-apps rule, the score-preserving single-step axis,
and the site-url convention — root-relative, points at the rendered page,
no scheme/host/port; absolutize at the edge from a single configured base
URL.

Producers own site-url correctness: /api/search surfaces the field as
stored and does not normalize it, so a producer that omits or malforms
site-url shows a broken link until its own config is fixed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
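Since /api/search returns site-url as stored, absolutizing is left to the consumer. One way a consumer might do that at the edge is sketched below; the function name and the base URL are hypothetical, and rejecting values that break the root-relative convention is a design choice of this sketch, not behaviour of the endpoint.

```python
from urllib.parse import urljoin, urlsplit


def absolutize(base_url, site_url):
    """Resolve a root-relative site-url against one configured base URL."""
    parts = urlsplit(site_url)
    # The convention: root-relative, with no scheme, host, or port.
    if parts.scheme or parts.netloc or not site_url.startswith("/"):
        raise ValueError("site-url must be root-relative: %r" % site_url)
    return urljoin(base_url, site_url)


# Hypothetical base URL for illustration.
link = absolutize("https://example.org", "/docs/intro")
```

Note that `urljoin` resolves a root-relative path against the host root, so any path component in the base URL is discarded; a deployment served under a path prefix would need a different joining rule.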
joewiz merged commit 73fbb80 into eXist-db:develop on Jun 8, 2026
1 check failed
joewiz deleted the feat/search-field-facets branch on June 8, 2026 01:52
line-o pushed a commit that referenced this pull request Jun 8, 2026
The search.cy.js suite assumed the built-in `doc` app would be findable
via /api/search (?q=xquery, &app=doc). Since /api/search now queries the
shared `site-content` Lucene field — which the `doc` app does not populate
— a bare existdb-openapi install (the CI image) returns no results, so
three assertions failed (`expected 0 to be above 0`, then the <mark> and
uri assertions cascaded off the empty result set). This is what turned
develop red after #52 merged.

Seed a self-contained fixture instead: a `before()` hook stores a small
collection under /db/apps/cypress-search-test with a `site-content`
index (6 <page> docs containing "xquery" across two sections) and
reindexes, via admin XQuery through /api/query; `after()` removes it. The
suite now searches its own corpus rather than depending on the image's
`doc` app, and exercises the real shared-field contract (relevance,
<mark> highlighting, dedup, offset/limit paging, and the app facet).

Verified end-to-end on a live existdb-openapi instance: the setup runs via
/api/query, ?q=xquery returns the 6 fixture docs with <mark> snippets and
highlights, and ?app=cypress-search-test restricts to the fixture.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
line-o added a commit that referenced this pull request Jun 8, 2026
…fixture

test(search): make /api/search e2e tests self-contained (fix red develop after #52)