Search: KWIC highlighting, relevance score, and pagination #51
Merged
…ing, paging

Replace the ad-hoc /api/search response with an ES/Lucene-inspired shape, per the existdb-oxygen-plugin tasking.

- KWIC highlighting: snippets centered on a match (not the document's leading 200 chars), matched terms wrapped in <mark>. Field-agnostic: expands the matched nodes (util:expand) and summarizes with the eXist kwic module, so it works with whatever text index an app defines. Each fragment is a well-formed XML string: a single <span> root, so a client can fn:parse-xml it directly (no parse-xml-fragment) and it is also valid HTML for innerHTML. Each hit carries a marked `snippet` (first fragment) and a `highlights` array.
- Relevance score: ft:score per hit, results ordered by it, descending.
- Pagination: new `offset` parameter; `offset`/`limit` echoed.
- One result per document: ft:query matches nested elements, so a document yielded many hits; results are deduplicated by document URI keeping the highest-scoring hit, and `total` is document-level.
- Canonical identifier: both `uri` and `path` (alias) are returned and documented as the id a client opens over exist:.

api.json documents the new parameter and response shape.

Cypress: search.cy.js asserts <mark> in snippet/highlights, well-formed single-rooted <span> fragments, descending non-zero scores, dedup, offset/limit paging, and the app filter.

Verified on a clean eXist 7.0.0-beta3 against the doc app (all 9 specs pass).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
c67717c to 5cf0281
[This PR was co-authored with Claude Code. -Joe]
A documented `/api/search` shape with score + KWIC highlighting + paging

From the existdb-oxygen-plugin Search-dialog tasking.
`GET /api/search` had an ad-hoc shape and returned un-highlighted snippets cut from each document's first 200 characters (which often didn't even contain the query terms). This gives it an ES/Lucene-inspired shape.

Changes

- KWIC highlighting: snippets are centered on a match (not the document's leading 200 characters), with matched terms wrapped in `<mark>`. Field-agnostic: it expands the matched nodes (`util:expand`) and summarizes them with the eXist `kwic` module, so it works with whatever text index an app defines. Each hit carries a marked `snippet` (the first fragment, which the Oxygen plugin already renders) and a `highlights` array of fragments (ES convention).
- Well-formed fragments: each `snippet`/`highlights` value is a single-rooted `<span>` XML string, so an eXist client can parse it with plain `fn:parse-xml` (no `parse-xml-fragment` needed); `<span>`/`<mark>` is also valid HTML for direct `innerHTML`.
- Relevance score: each hit carries its `ft:score`, and results are ordered by it, descending.
- Pagination: new `offset` query parameter; both `offset` and `limit` are echoed.
- One result per document: `ft:query` matches nested elements, so a document yielded many duplicate hits. Results are deduplicated by document URI (keeping the highest-scoring hit), and `total` is document-level.
- Canonical identifier: both `uri` and `path` (alias) are returned and documented as the id a client opens over `exist:`. `path` is preserved for the Oxygen plugin, which opens hits by it.
- `api.json` documents the new parameter and the full response shape.

Final response shape
Actual output (`GET /api/search?q=xquery&app=doc&limit=2`), abbreviated:

    {
      "query": "xquery",
      "total": 54,
      "offset": 0,
      "limit": 2,
      "results": [
        {
          "uri": "/db/apps/doc/data/xquery/xquery.xml",
          "path": "/db/apps/doc/data/xquery/xquery.xml",
          "title": "XQuery in eXist-db",
          "app": "doc",
          "url": "/exist/apps/doc/data/xquery/xquery.xml",
          "score": 39.39,
          "snippet": "<span>...recommendations for the <mark>XQuery</mark> language. eXist-db also supp...</span>",
          "highlights": [
            "<span>...recommendations for the <mark>XQuery</mark> language. eXist-db also supp...</span>",
            "<span>...enabling <mark>XQuery</mark> developers to create powerful applic...</span>"
          ]
        }
      ]
    }

Field notes:

- `total` is the number of matching documents (after dedup), not raw element hits.
- `score` is the Lucene relevance score; `results` are sorted by it, descending.
- `snippet` is `highlights[0]`; both are well-formed `<span>…<mark>…</mark>…</span>` strings.
- `uri` === `path` (`path` retained for existing clients).

Verification
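As a rough illustration of the response invariants checked in this section, here is a plain JavaScript sketch. It is not the actual `search.cy.js` spec, which is not reproduced here. The function names `checkSearchResponse` and `isMarkedSpan` are hypothetical, and the string-based fragment check is only a stand-in for a real XML parse such as `fn:parse-xml`.

```javascript
// Hypothetical helper (not part of the PR): true when `s` looks like a
// single-rooted <span> fragment containing at least one <mark>.
// String checks only; a real client would parse the XML instead.
function isMarkedSpan(s) {
  if (typeof s !== "string") return false;
  if (!s.startsWith("<span>") || !s.endsWith("</span>")) return false;
  const inner = s.slice("<span>".length, s.length - "</span>".length);
  // A nested or second <span> would mean the fragment is not the simple
  // single-root shape described in the field notes.
  if (inner.includes("<span") || inner.includes("</span>")) return false;
  return /<mark>[^<]+<\/mark>/.test(inner);
}

// Hypothetical helper: returns a list of violations, empty when the
// response satisfies every invariant checked here.
function checkSearchResponse(resp) {
  const problems = [];
  const results = Array.isArray(resp.results) ? resp.results : [];

  // Paging: a page never holds more hits than the echoed `limit`.
  if (results.length > resp.limit) {
    problems.push("more results than limit");
  }

  // Dedup: one result per document URI.
  const uris = results.map((r) => r.uri);
  if (new Set(uris).size !== uris.length) {
    problems.push("duplicate document uri");
  }

  results.forEach((r, i) => {
    const highlights = Array.isArray(r.highlights) ? r.highlights : [];
    if (r.uri !== r.path) {
      problems.push(`results[${i}]: uri and path differ`);
    }
    if (!(r.score > 0)) {
      problems.push(`results[${i}]: score is not positive`);
    }
    if (i > 0 && r.score > results[i - 1].score) {
      problems.push(`results[${i}]: scores are not descending`);
    }
    if (r.snippet !== highlights[0]) {
      problems.push(`results[${i}]: snippet is not highlights[0]`);
    }
    highlights.forEach((h, j) => {
      if (!isMarkedSpan(h)) {
        problems.push(`results[${i}].highlights[${j}]: not a marked <span>`);
      }
    });
  });

  return problems;
}
```

Returning a list of violations rather than throwing on the first one makes it easier to see every broken invariant in a single run.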
On a clean eXist 7.0.0-beta3 against the indexed `doc` app: snippets/highlights wrap matches in `<mark>` centered on the match and parse as XML; results are ordered by descending non-zero `score`; `offset`/`limit` page correctly; results are deduplicated per document. Cypress: `search.cy.js` (9 specs) asserts all of the above plus well-formed single-rooted `<span>` fragments, the `app` filter, and the required-`q` 400. All passing.

Downstream
Lights up the Oxygen plugin's latent `<mark>` snippet rendering, and gives eXide, vscode-existdb, and the notebook kernel a documented shape with score + highlights + paging over the shared HTTP contract.
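For a sense of how a downstream client might consume this contract, here is a minimal JavaScript sketch. It assumes only the documented `q`, `app`, `offset`, and `limit` parameters and the `total`/`offset`/`results` fields. The helper names (`searchUrl`, `nextOffset`, `markedTerms`) and the `base` server root are hypothetical and do not come from any of the clients named above.

```javascript
// Hypothetical helper: builds the request URL for one page of results.
// `base` is an assumed server root; the PR does not say where the API is mounted.
function searchUrl(base, { q, app, offset, limit }) {
  const params = new URLSearchParams({
    q,
    offset: String(offset),
    limit: String(limit),
  });
  // `app` is an optional filter, so only send it when provided.
  if (app) params.set("app", app);
  return `${base}/api/search?${params}`;
}

// Hypothetical helper: offset of the next page, or null when there is nothing
// left. Relies on `total` being document-level, so it counts the same units
// as the entries in `results`.
function nextOffset(resp) {
  const seen = resp.offset + resp.results.length;
  return resp.results.length > 0 && seen < resp.total ? seen : null;
}

// Hypothetical helper: pulls the matched terms out of a snippet or highlight
// fragment. Regex-based for brevity; XML entities are not decoded.
function markedTerms(fragment) {
  return [...fragment.matchAll(/<mark>([^<]*)<\/mark>/g)].map((m) => m[1]);
}
```

A client would call `searchUrl` for the first page, fetch it, then keep requesting with `nextOffset(resp)` until it returns `null`. Because each fragment is a single-rooted `<span>`, a browser client could also assign it to `innerHTML` directly instead of extracting terms.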