Search: KWIC highlighting, relevance score, and pagination#51

Merged
joewiz merged 1 commit into eXist-db:develop from joewiz:feat/search-kwic-highlighting
Jun 8, 2026

@joewiz joewiz commented Jun 6, 2026

[This PR was co-authored with Claude Code. -Joe]

A documented /api/search shape with score + KWIC highlighting + paging

This comes out of the existdb-oxygen-plugin Search-dialog tasking. GET /api/search had an ad-hoc response shape and returned unhighlighted snippets cut from each document's first 200 characters, which often did not contain the query terms at all. This PR gives it an Elasticsearch/Lucene-inspired shape.

Changes

  • KWIC highlighting. Snippets are centered on a match, with matched terms wrapped in <mark>. Field-agnostic: it expands the matched nodes (util:expand) and summarizes them with the eXist kwic module, so it works with whatever text index an app defines. Each hit carries a marked snippet (the first fragment, which the Oxygen plugin already renders) and a highlights array of fragments (ES convention).
  • Well-formed fragments. Every snippet/highlights value is a single-rooted <span> XML string, so an eXist client can parse it with plain fn:parse-xml (no parse-xml-fragment needed); <span>/<mark> is also valid HTML, so a browser client can assign it directly to innerHTML.
  • Relevance score. Each hit carries ft:score, and results are ordered by it, descending.
  • Pagination. New offset query parameter; both offset and limit are echoed.
  • One result per document. ft:query matches nested elements, so a document yielded many duplicate hits. Results are deduplicated by document URI (keeping the highest-scoring hit), and total is document-level.
  • Canonical identifier. Both uri and path (alias) are returned and documented as the id a client opens over exist:. path is preserved for the Oxygen plugin, which opens hits by it.
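The dedup, ordering, and paging rules above can be modeled in a few lines. This is a minimal Python sketch for client authors and reviewers, not the XQuery in this PR: the function name `page_hits`, the input hit dicts, and the sample URIs are hypothetical, and it assumes that `total` is counted after deduplication and before paging, as the bullets describe.

```python
from typing import Any


def page_hits(hits: list[dict[str, Any]], offset: int, limit: int) -> dict[str, Any]:
    """Deduplicate raw hits per document, rank by score, and return one page."""
    best: dict[str, dict[str, Any]] = {}
    for hit in hits:
        current = best.get(hit["uri"])
        # Keep only the highest-scoring hit for each document URI.
        if current is None or hit["score"] > current["score"]:
            best[hit["uri"]] = hit
    # Order the surviving hits by score, descending.
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return {
        "total": len(ranked),  # document-level count, not raw element hits
        "offset": offset,      # echoed back, as in the response shape
        "limit": limit,
        "results": ranked[offset:offset + limit],
    }


# Hypothetical raw hits: two nested-element matches in a.xml, one in b.xml.
raw_hits = [
    {"uri": "/db/apps/doc/a.xml", "score": 1.5},
    {"uri": "/db/apps/doc/a.xml", "score": 3.0},
    {"uri": "/db/apps/doc/b.xml", "score": 2.0},
]
page = page_hits(raw_hits, offset=0, limit=1)
```

Here `page["total"]` is 2 even though three raw hits came in, and the single result on the first page is the 3.0-scoring hit for a.xml.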

api.json documents the new parameter and the full response shape.

Final response shape

Actual output (GET /api/search?q=xquery&app=doc&limit=2), abbreviated:

{
  "query": "xquery",
  "total": 54,
  "offset": 0,
  "limit": 2,
  "results": [
    {
      "uri": "/db/apps/doc/data/xquery/xquery.xml",
      "path": "/db/apps/doc/data/xquery/xquery.xml",
      "title": "XQuery in eXist-db",
      "app": "doc",
      "url": "/exist/apps/doc/data/xquery/xquery.xml",
      "score": 39.39,
      "snippet": "<span>...recommendations for the <mark>XQuery</mark> language. eXist-db also supp...</span>",
      "highlights": [
        "<span>...recommendations for the <mark>XQuery</mark> language. eXist-db also supp...</span>",
        "<span>...enabling <mark>XQuery</mark> developers to create powerful applic...</span>"
      ]
    }
  ]
}

Field notes:

  • total is the number of matching documents (after dedup), not raw element hits.
  • score is the Lucene relevance score; results are sorted by it, descending.
  • snippet is highlights[0]; both are well-formed <span>…<mark>…</mark>…</span> strings.
  • uri === path (path retained for existing clients).
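Because each fragment is a single-rooted <span>, any XML parser can consume it without wrapping it first. The sketch below uses Python's standard `xml.etree.ElementTree` on the first snippet from the sample output above; the helper names `marked_terms` and `plain_text` are hypothetical, and a non-Python client would use its own parser in the same way.

```python
import xml.etree.ElementTree as ET


def marked_terms(fragment: str) -> list[str]:
    """Return the text of every <mark> element in a snippet fragment."""
    root = ET.fromstring(fragment)  # works because the fragment has one root
    return [mark.text or "" for mark in root.iter("mark")]


def plain_text(fragment: str) -> str:
    """Return the fragment's text with the <span>/<mark> markup removed."""
    return "".join(ET.fromstring(fragment).itertext())


# First fragment from the sample response in this PR description.
snippet = (
    "<span>...recommendations for the <mark>XQuery</mark> language. "
    "eXist-db also supp...</span>"
)
terms = marked_terms(snippet)
text = plain_text(snippet)
```

A client that only needs to display the fragment can skip parsing and render the string as HTML; parsing is useful when it needs the matched terms or a plain-text preview.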

Verification

On a clean eXist 7.0.0-beta3 against the indexed doc app: snippets/highlights wrap matches in <mark>, are centered on the match, and parse as XML; results are ordered by descending non-zero score; offset/limit page correctly; results are deduplicated per document. Cypress: search.cy.js (9 specs) asserts all of the above plus well-formed single-rooted <span> fragments, the app filter, and the 400 response when the required q parameter is missing. All 9 specs pass.
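For a client that pages through results, the request URLs can be derived from `total` and `limit`. This is a rough sketch using only the query parameters named in this PR (q, app, offset, limit); the endpoint URL in the example is a placeholder, since the host and mount point depend on the deployment, and the helper names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def search_url(endpoint: str, q: str, offset: int, limit: int,
               app: Optional[str] = None) -> str:
    """Build a GET URL for one page of search results."""
    params: dict[str, object] = {"q": q, "offset": offset, "limit": limit}
    if app is not None:
        params["app"] = app  # optional app filter
    return f"{endpoint}?{urlencode(params)}"


def page_offsets(total: int, limit: int) -> list[int]:
    """Return the offset of every page needed to cover `total` documents."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return list(range(0, total, limit))


# Placeholder endpoint: adjust to wherever /api/search is actually served.
ENDPOINT = "http://localhost:8080/exist/apps/example/api/search"
first_page = search_url(ENDPOINT, "xquery", offset=0, limit=2, app="doc")
offsets = page_offsets(total=54, limit=25)
```

With the sample response's total of 54 and a page size of 25, `offsets` is `[0, 25, 50]`, so three requests cover every matching document.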

Downstream

Lights up the Oxygen plugin's latent <mark> snippet rendering, and gives eXide, vscode-existdb, and the notebook kernel a documented shape with score + highlights + paging over the shared HTTP contract.

…ing, paging

Replace the ad-hoc /api/search response with an ES/Lucene-inspired shape, per
the existdb-oxygen-plugin tasking.

- KWIC highlighting: snippets centered on a match (not the document's leading
  200 chars), matched terms wrapped in <mark>. Field-agnostic — expands the
  matched nodes (util:expand) and summarizes with the eXist kwic module, so it
  works with whatever text index an app defines. Each fragment is a well-formed
  XML string: a single <span> root, so a client can fn:parse-xml it directly
  (no parse-xml-fragment) and it is also valid HTML for innerHTML. Each hit
  carries a marked `snippet` (first fragment) and a `highlights` array.
- Relevance score: ft:score per hit, results ordered by it, descending.
- Pagination: new `offset` parameter; `offset`/`limit` echoed.
- One result per document: ft:query matches nested elements, so a document
  yielded many hits; results are deduplicated by document URI keeping the
  highest-scoring hit, and `total` is document-level.
- Canonical identifier: both `uri` and `path` (alias) are returned and
  documented as the id a client opens over exist:.

api.json documents the new parameter and response shape. Cypress: search.cy.js
asserts <mark> in snippet/highlights, well-formed single-rooted <span>
fragments, descending non-zero scores, dedup, offset/limit paging, and the
app filter.

Verified on a clean eXist 7.0.0-beta3 against the doc app (all 9 specs pass).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@joewiz joewiz force-pushed the feat/search-kwic-highlighting branch from c67717c to 5cf0281 Compare June 6, 2026 23:12
@joewiz joewiz merged commit 5cf0281 into eXist-db:develop Jun 8, 2026
1 check passed