feat(search): dual-substrate retrieval levers (corpus PPMI for code, OEWN for prose), auto-gated #38

Merged
RioPlay merged 4 commits into main from feat/lexical-overlay on Jun 19, 2026

@RioPlay (Owner) commented Jun 19, 2026

What

Two opt-in, text-routed retrieval levers in query_index (the single search/ask funnel), all off by default so routing is unchanged unless enabled:

  • Code lever ADEN_PPMI_RERANK: corpus-derived PPMI rerank of the top window (no external dictionary). Index::ppmi_rerank.
  • Prose lever ADEN_LEXICON_EXPAND: grounded OEWN synonym expansion before retrieval. Grounded to corpus vocab via Index::knows_term, so it is a no-op on vocabulary the corpus lacks.
  • Auto-gate ADEN_LEXICON_AUTO: routes by detected query shape + corpus substrate (Index::code_anchor_fraction), so NL-over-code queries still get the rerank.
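
To make the code lever in the list above concrete, here is a minimal sketch of a corpus-derived PPMI rerank. It is not the body of `Index::ppmi_rerank`, which this PR page does not show. The `Cooc` type, the document-level co-occurrence counting, and the "sum of PPMI between query terms and candidate terms" score are all assumptions chosen for illustration. The only fixed part is the standard definition PPMI(a, b) = max(0, log2(P(a, b) / (P(a) P(b)))).

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical co-occurrence table built from the corpus itself
/// (no external dictionary). Counts are document frequencies.
pub struct Cooc {
    n_docs: f64,
    term_docs: HashMap<String, f64>,
    pair_docs: HashMap<(String, String), f64>,
}

impl Cooc {
    /// Count, per document, which lowercase terms and unordered term pairs occur.
    pub fn from_docs(docs: &[&str]) -> Self {
        let mut term_docs = HashMap::new();
        let mut pair_docs = HashMap::new();
        for doc in docs {
            // Sorted, de-duplicated terms so each pair is stored once as (small, large).
            let terms: Vec<String> = doc
                .split_whitespace()
                .map(str::to_lowercase)
                .collect::<BTreeSet<_>>()
                .into_iter()
                .collect();
            for (i, a) in terms.iter().enumerate() {
                *term_docs.entry(a.clone()).or_insert(0.0) += 1.0;
                for b in &terms[i + 1..] {
                    *pair_docs.entry((a.clone(), b.clone())).or_insert(0.0) += 1.0;
                }
            }
        }
        Cooc { n_docs: docs.len() as f64, term_docs, pair_docs }
    }

    /// Positive PMI of two lowercase terms; 0.0 if they never co-occur.
    pub fn ppmi(&self, a: &str, b: &str) -> f64 {
        let (a, b) = if a <= b { (a, b) } else { (b, a) };
        let joint = match self.pair_docs.get(&(a.to_string(), b.to_string())) {
            Some(count) => *count,
            None => return 0.0,
        };
        // P(a,b) / (P(a) P(b)) simplifies to joint * N / (count_a * count_b).
        let ratio = (joint * self.n_docs) / (self.term_docs[a] * self.term_docs[b]);
        ratio.log2().max(0.0)
    }
}

/// Reorder a top window of candidates by summed PPMI against the query terms.
/// The sort is stable, so ties keep their original retrieval order.
pub fn rerank<'a>(cooc: &Cooc, query: &str, window: &[&'a str]) -> Vec<&'a str> {
    let q: Vec<String> = query.split_whitespace().map(str::to_lowercase).collect();
    let score = |doc: &str| -> f64 {
        doc.split_whitespace()
            .map(str::to_lowercase)
            .map(|d| q.iter().map(|t| cooc.ppmi(t, &d)).sum::<f64>())
            .sum()
    };
    let mut scored: Vec<(&'a str, f64)> = window.iter().map(|d| (*d, score(*d))).collect();
    scored.sort_by(|x, y| y.1.total_cmp(&x.1));
    scored.into_iter().map(|(d, _)| d).collect()
}
```

Because the statistics come only from the indexed corpus, a rerank of this shape needs no dictionary at query time, which matches the "no external dictionary" property claimed for the code lever.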

Why (ablation)

Measured by the A/B harnesses in this PR (all #[ignore]d):

| Domain | Lever          | Result              |
|--------|----------------|---------------------|
| Code   | PPMI rerank    | MRR 0.216 -> 0.289  |
| Prose  | OEWN expansion | R@1 1/42 -> 41/42   |

Dictionaries dilute code (PPMI-only 0.289 > PPMI+OEWN 0.247); dense embeddings capture only half of prose synonymy (20/42 vs 41/42), so OEWN is complementary, not redundant. Merging multiple dictionaries adds nothing over OEWN alone, so the shipped prose lever is OEWN.
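
For readers unfamiliar with the two metrics quoted above, the following sketch shows how MRR and R@1 are conventionally computed. The function names and the `(ranked ids, relevant id)` input shape are assumptions for illustration; the PR's own harnesses may define their inputs differently.

```rust
/// Reciprocal rank of the single relevant id in a ranked list (0.0 if absent).
pub fn reciprocal_rank(ranked: &[&str], relevant: &str) -> f64 {
    ranked
        .iter()
        .position(|id| *id == relevant)
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Mean reciprocal rank over a set of queries; 0.0 for an empty set.
pub fn mean_reciprocal_rank(runs: &[(Vec<&str>, &str)]) -> f64 {
    if runs.is_empty() {
        return 0.0;
    }
    let total: f64 = runs.iter().map(|(ranked, rel)| reciprocal_rank(ranked, rel)).sum();
    total / runs.len() as f64
}

/// Number of queries whose top result is the relevant id.
/// R@1 is this count over the number of queries (for example 41/42).
pub fn hits_at_1(runs: &[(Vec<&str>, &str)]) -> usize {
    runs.iter()
        .filter(|(ranked, rel)| ranked.first() == Some(rel))
        .count()
}
```

MRR rewards moving the right answer up even when it does not reach rank 1, which suits a rerank; R@1 counts only exact top hits, which suits the synonym-bridging test on prose.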

Scope / safety

Docs: docs/retrieval-levers.adoc (+ index, README, CHANGELOG).

RioPlay (Ernest Hamblen) and others added 4 commits June 19, 2026 11:05
… text

Two opt-in levers in query_index (the single search/ask funnel): a corpus-derived PPMI rerank
for code (Index::ppmi_rerank; MRR 0.216->0.289) and grounded OEWN synonym expansion for prose
(R@1 1/42->41/42). ADEN_LEXICON_AUTO routes by query shape + corpus substrate (code_anchor_fraction);
ADEN_LEXICON_EXPAND/ADEN_PPMI_RERANK force one on. Grounded via Index::knows_term, so a no-op on
vocab the corpus lacks. All off by default; routing unchanged unless set.
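
The gating described in this commit message can be sketched as a small pure function. Only the three environment variable names and the idea of `code_anchor_fraction` come from the PR; the `Flags` and `Plan` types, the convention that `"1"` means on, the `looks_like_code` heuristic, and the 0.5 threshold are all hypothetical.

```rust
use std::env;

/// Hypothetical snapshot of the three switches; all false by default.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Flags {
    pub ppmi_rerank: bool,
    pub lexicon_expand: bool,
    pub lexicon_auto: bool,
}

impl Flags {
    /// Reads the variables named in the PR. Treating "1" as on is an assumption.
    pub fn from_env() -> Self {
        let on = |name: &str| env::var(name).map(|v| v == "1").unwrap_or(false);
        Flags {
            ppmi_rerank: on("ADEN_PPMI_RERANK"),
            lexicon_expand: on("ADEN_LEXICON_EXPAND"),
            lexicon_auto: on("ADEN_LEXICON_AUTO"),
        }
    }
}

/// Which levers to apply to one query.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Plan {
    pub expand: bool,
    pub rerank: bool,
}

/// Assumed cut-off above which the corpus is treated as code.
const CODE_CORPUS_THRESHOLD: f64 = 0.5;

/// Crude, assumed query-shape test: identifiers and paths hint at code.
fn looks_like_code(query: &str) -> bool {
    query.contains("::") || query.contains('_') || query.contains('(')
}

/// Forced flags apply as-is; the auto flag adds exactly one lever,
/// chosen from query shape and corpus substrate.
pub fn route(flags: Flags, query: &str, code_anchor_fraction: f64) -> Plan {
    let mut plan = Plan { expand: flags.lexicon_expand, rerank: flags.ppmi_rerank };
    if flags.lexicon_auto {
        if looks_like_code(query) || code_anchor_fraction >= CODE_CORPUS_THRESHOLD {
            plan.rerank = true; // natural language over a code corpus lands here too
        } else {
            plan.expand = true;
        }
    }
    plan
}
```

With every flag unset, `route` returns an empty plan, which corresponds to the "routing unchanged unless set" guarantee. Checking the corpus substrate as well as the query shape is what lets a plain-English question over a code corpus still receive the rerank.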
…nifest scripts

A/B harnesses behind #[ignore]: compound_ab (code MRR), lexicon_routing_ab + prose_lexicon_ab
(prose R@1), build_lexicon_store/build_moby_store (OEWN/Moby overlay stores), lexicon_firewall
(provenance allowlist gate). Scripts convert OEWN/Moby to triples, merge with provenance, render
the source manifest, and fetch a neutral prose corpus. Validates: dictionaries dilute code,
OEWN bridges prose synonyms BM25/dense miss; merge adds nothing over OEWN alone.
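
The synonym bridging validated here depends on grounding: a synonym is added only if the corpus already contains it. A minimal sketch of that idea follows. The `expand_grounded` function, the in-memory synonym map, and the closure standing in for `Index::knows_term` are illustrative assumptions, not the PR's code.

```rust
use std::collections::{HashMap, HashSet};

/// Append synonyms of each query term, keeping only those the corpus knows.
/// `knows_term` stands in for a vocabulary check such as `Index::knows_term`.
pub fn expand_grounded<F>(
    query: &str,
    synonyms: &HashMap<&str, Vec<&str>>,
    knows_term: F,
) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    let originals: Vec<String> = query.split_whitespace().map(str::to_lowercase).collect();
    let mut seen: HashSet<String> = originals.iter().cloned().collect();
    let mut out = originals.clone();
    for term in &originals {
        if let Some(syns) = synonyms.get(term.as_str()) {
            for &syn in syns {
                // Skip synonyms the corpus lacks, and avoid duplicates.
                if knows_term(syn) && seen.insert(syn.to_string()) {
                    out.push(syn.to_string());
                }
            }
        }
    }
    out
}
```

When the corpus knows none of the candidate synonyms, the function returns the original terms unchanged, which is the no-op behaviour the PR describes for vocabulary the corpus lacks.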
@RioPlay RioPlay merged commit 542939b into main Jun 19, 2026
6 checks passed
@RioPlay RioPlay deleted the feat/lexical-overlay branch June 19, 2026 16:34