feat(search): /api/search sitewide search over shared site-content field (score, KWIC, facets, paging) + producer contract #52
Merged
…ing, paging
Replace the ad-hoc /api/search response with an ES/Lucene-inspired shape, per the existdb-oxygen-plugin tasking.
- KWIC highlighting: snippets centered on a match (not the document's leading 200 chars), matched terms wrapped in <mark>. Field-agnostic: expands the matched nodes (util:expand) and summarizes with the eXist kwic module, so it works with whatever text index an app defines. Each fragment is a well-formed XML string with a single <span> root, so a client can fn:parse-xml it directly (no parse-xml-fragment) and it is also valid HTML for innerHTML. Each hit carries a marked `snippet` (first fragment) and a `highlights` array.
- Relevance score: ft:score per hit, results ordered by it, descending.
- Pagination: new `offset` parameter; `offset`/`limit` echoed.
- One result per document: ft:query matches nested elements, so a document yielded many hits; results are deduplicated by document URI keeping the highest-scoring hit, and `total` is document-level.
- Canonical identifier: both `uri` and `path` (alias) are returned and documented as the id a client opens over exist:.
api.json documents the new parameter and response shape.
Cypress: search.cy.js asserts <mark> in snippet/highlights, well-formed single-rooted <span> fragments, descending non-zero scores, dedup, offset/limit paging, and the app filter.
Verified on a clean eXist 7.0.0-beta3 against the doc app (all 9 specs pass).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
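For reviewers who want the dedup and paging semantics in concrete form, here is a minimal JavaScript sketch. It is not the endpoint's code (that lives server-side in XQuery); the helper names `dedupeByUri` and `page` are hypothetical, and a zero-based `offset` is an assumption the commit message does not state.

```javascript
// Hypothetical helpers illustrating "one result per document" and paging.
// Assumption: offset is zero-based; the real endpoint may differ.

// Keep only the highest-scoring hit per document URI, ordered by score descending.
function dedupeByUri(hits) {
  const best = new Map();
  for (const hit of hits) {
    const seen = best.get(hit.uri);
    if (seen === undefined || hit.score > seen.score) {
      best.set(hit.uri, hit);
    }
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}

// Slice one page out of the deduplicated list; total counts documents, not raw hits.
function page(hits, offset, limit) {
  const documents = dedupeByUri(hits);
  return {
    total: documents.length,
    offset,
    limit,
    results: documents.slice(offset, offset + limit),
  };
}
```

Deduplicating before slicing is what keeps `total` document-level and stops a page from being filled with several hits from the same document.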
Evolve /api/search from a generic //*[ft:query] scan to the validated
field-first recipe (see the search-convergence spike): query the shared
site-content Lucene field that content apps contribute to, with facet counts
and drill-down filtering — so site-wide and per-app search are one endpoint.
- Field-scoped query string `site-content:(…) OR site-title:(…)^3`, user terms
escaped for Lucene metacharacters (so `array:count`'s colon is literal, not a
field selector — the QueryParser bug). default-operator=and, filter-rewrite=yes.
- Matches at the document-root level (collection("/db/apps")/*), not //*:
ft:score is lost for field queries under the //* descendant axis but preserved
on a single-step axis (spike finding; core fix proposed separately).
- Facet counts (site-app, site-section) via ft:facets, returned for a filter UI;
app/section params narrow via Lucene facet drill-down.
- KWIC <mark> highlights now come from the FIELD via ft:highlight-field-matches
(+ kwic), not util:expand — field matches don't map to the node's element text.
- Keeps score ranking, per-document dedup, paging, and the well-formed <span>
highlight strings from the prior endpoint.
api.json documents the new `section` parameter and the `facets` response object.
Verified on a clean eXist 7.0.0-beta3 with a synthetic two-app site-content
corpus: cross-app ranked results with non-zero scores, facet counts spanning
apps, app/section drill-down, <mark> KWIC highlights, colon-escaped queries
(no QueryParser crash), and the required-q 400.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
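The escaping and field-scoped query string described in this commit can be sketched in JavaScript as follows. This is an illustration, not the XQuery in the PR: `escapeLucene` and `buildFieldQuery` are hypothetical names, and the escaped character set is assumed to mirror the one handled by Lucene's classic `QueryParser.escape`; the endpoint may escape a different set.

```javascript
// Hypothetical sketch of the server-side escaping step.
// Assumption: the same character set as Lucene's QueryParser.escape.
function escapeLucene(term) {
  // Prefix each Lucene metacharacter with a backslash so it is matched literally.
  return term.replace(/[\\+\-!():^\[\]"{}~*?|&\/]/g, "\\$&");
}

// Build the field-scoped query string quoted in the commit message,
// boosting title matches by 3.
function buildFieldQuery(userTerms) {
  const escaped = escapeLucene(userTerms);
  return `site-content:(${escaped}) OR site-title:(${escaped})^3`;
}
```

With this, a query for `array:count` reaches the parser with the colon escaped, so it is read as a literal character rather than as a field selector.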
Retrieve a hit's display title and URL from the producer's site-title/site-url fields (via the ft:query "fields" option + ft:field) instead of assuming a <title> child element, so results from documentation-next, exist-blog, and notebook show real titles and app-relative URLs. Falls back to a <title> child / computed link when the fields are absent.
Verified against the real three-producer corpus (docs/blog/notebook): titles, urls, KWIC highlights, and per-app/per-section facet counts all populate.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document the shared site-content producer contract in the README (new "Sitewide search" section): the required/recommended fields and facets (site-content, site-title, site-url, site-app, site-section), the identical-names-across-apps rule, the score-preserving single-step axis, and the site-url convention (root-relative, points at the rendered page, no scheme/host/port; absolutize at the edge from a single configured base URL).
Producers own site-url correctness: /api/search surfaces the field as stored and does not normalize it, so a producer that omits or malforms site-url shows a broken link until its own config is fixed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
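As an illustration of "absolutize at the edge", a consumer might resolve the stored root-relative `site-url` against its one configured base URL. The sketch below assumes a hypothetical `absolutize` helper and an example base of `https://example.org`; neither comes from the PR.

```javascript
// Hypothetical consumer-side helper: turn a stored root-relative site-url
// into an absolute link using a single configured base URL.
function absolutize(siteUrl, baseUrl) {
  // The contract says site-url is root-relative with no scheme/host/port,
  // so reject anything that is not a single-slash-rooted path.
  if (!siteUrl.startsWith("/") || siteUrl.startsWith("//")) {
    throw new Error(`site-url must be root-relative: ${siteUrl}`);
  }
  return new URL(siteUrl, baseUrl).href;
}

// Example: absolutize("/docs/xquery.html", "https://example.org")
// returns "https://example.org/docs/xquery.html".
```

Note that resolving a root-relative path replaces any path on the base URL, so a deployment mounted under a path prefix would need to handle that prefix separately.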
line-o pushed a commit that referenced this pull request on Jun 8, 2026
The search.cy.js suite assumed the built-in `doc` app would be findable via /api/search (?q=xquery, &app=doc). Since /api/search now queries the shared `site-content` Lucene field, which the `doc` app does not populate, a bare existdb-openapi install (the CI image) returns no results, so three assertions failed (`expected 0 to be above 0`, then the <mark> and uri assertions cascaded off the empty result set). This is what turned develop red after #52 merged.
Seed a self-contained fixture instead: a `before()` hook stores a small collection under /db/apps/cypress-search-test with a `site-content` index (6 <page> docs containing "xquery" across two sections) and reindexes, via admin XQuery through /api/query; `after()` removes it. The suite now searches its own corpus rather than depending on the image's `doc` app, and exercises the real shared-field contract (relevance, <mark> highlighting, dedup, offset/limit paging, and the app facet).
Verified end-to-end on a live existdb-openapi instance: the setup runs via /api/query, ?q=xquery returns the 6 fixture docs with <mark> snippets and highlights, and ?app=cypress-search-test restricts to the fixture.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
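The shared-field contract the suite exercises can also be expressed as a plain function over a parsed /api/search response, independent of Cypress. The sketch below is not taken from search.cy.js; `checkSearchResponse` is a hypothetical name, and it only covers the invariants named above (positive descending scores, dedup by `uri`, `<mark>` in the snippet, and the `limit` cap).

```javascript
// Hypothetical checker for a parsed /api/search response body.
// Returns a list of human-readable problems; an empty list means all checks passed.
function checkSearchResponse(body) {
  const problems = [];
  const seenUris = new Set();
  body.results.forEach((hit, i) => {
    if (!(hit.score > 0)) {
      problems.push(`result ${i}: score is not positive`);
    }
    if (i > 0 && hit.score > body.results[i - 1].score) {
      problems.push(`result ${i}: scores are not descending`);
    }
    if (seenUris.has(hit.uri)) {
      problems.push(`result ${i}: duplicate uri ${hit.uri}`);
    }
    seenUris.add(hit.uri);
    if (typeof hit.snippet !== "string" || !hit.snippet.includes("<mark>")) {
      problems.push(`result ${i}: snippet has no <mark>`);
    }
  });
  if (body.results.length > body.limit) {
    problems.push("more results than limit");
  }
  return problems;
}
```

A test runner can then assert that the returned list is empty, which reports every violated invariant at once instead of stopping at the first failed assertion.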
line-o added a commit that referenced this pull request on Jun 8, 2026
…fixture test(search): make /api/search e2e tests self-contained (fix red develop after #52)
[This PR was co-authored with Claude Code. -Joe]
Summary
Builds out `GET /api/search` into a sitewide search beachhead: one relevance-ranked Lucene query over the shared `site-content` field that content apps contribute to, returning scored hits with KWIC highlighting, pagination, facet counts, and per-result links. It also documents the producer contract that makes apps searchable.
What's in it
- `{ query, total, offset, limit, facets, results: [{ uri, path, title, app, url, score, snippet, highlights }] }`, with `score`, `<mark>` KWIC snippets, and offset/limit paging.
- `site-content` field query + facets: queries the shared field (body + boosted `site-title`), aggregates `site-app`/`site-section` facet counts, and supports `app`/`section` facet drill-down. Results are matched with a single-step axis (`collection("/db/apps")/*`), which is element-name-independent and, unlike a `//*` descendant wildcard, preserves `ft:score` for field queries.
- `title`/`url` from the producer's `site-title`/`site-url` fields (falling back to a `<title>` child / computed link).
- `site-content` producer contract: the required/recommended fields and facets, the identical-names-across-apps rule, and the `site-url` convention (root-relative, points at the rendered page, no scheme/host/port; absolutize at the edge from a single configured base URL). Producers own `site-url` correctness: `/api/search` surfaces the field as stored and does not normalize it.
Context
This is the existdb-openapi half of a cross-repo search-convergence effort: exist-site-shell and the eXist Oxygen plugin become thin consumers of `/api/search` instead of each carrying its own dispatch, and a native eXist `ft:search-scope` function (in progress, separate branch) will become the engine behind this endpoint. Producer apps (documentation-next, exist-blog, notebook) contribute to the shared field per the documented contract.
Notes for review
- Lucene metacharacters in `q` are escaped server-side, so identifier/qualified-name queries (`map:merge`, `util:eval`) work without the client escaping; the colon is not treated as a field selector.
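From the consumer side, that means a client only needs ordinary URL encoding. A minimal sketch, assuming a hypothetical `searchUrl` helper and a caller-supplied endpoint URL (the host and mount path are deployment-specific and not given here); the parameter names `q`, `app`, `section`, `offset`, and `limit` are the ones documented in this PR.

```javascript
// Hypothetical client helper: build a request URL for the search endpoint.
// The client applies URL encoding only; Lucene escaping is left to the server.
function searchUrl(endpoint, { q, app, section, offset, limit }) {
  const url = new URL(endpoint);
  url.searchParams.set("q", q);
  // Optional facet drill-down and paging parameters are sent only when provided.
  if (app !== undefined) url.searchParams.set("app", app);
  if (section !== undefined) url.searchParams.set("section", section);
  if (offset !== undefined) url.searchParams.set("offset", String(offset));
  if (limit !== undefined) url.searchParams.set("limit", String(limit));
  return url.href;
}
```

For example, `searchUrl("http://localhost:8080/api/search", { q: "map:merge", limit: 10 })` sends the qualified name unchanged apart from percent-encoding, so the server receives `map:merge` as typed.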