feat(search): dual-substrate retrieval levers (corpus PPMI for code, OEWN for prose), auto-gated #38
Merged
… text Two opt-in levers in query_index (the single search/ask funnel): a corpus-derived PPMI rerank for code (Index::ppmi_rerank; MRR 0.216->0.289) and grounded OEWN synonym expansion for prose (R@1 1/42->41/42). ADEN_LEXICON_AUTO routes by query shape + corpus substrate (code_anchor_fraction); ADEN_LEXICON_EXPAND/ADEN_PPMI_RERANK force one on. Grounded via Index::knows_term, so a no-op on vocab the corpus lacks. All off by default; routing unchanged unless set.
…nifest scripts A/B harnesses behind #[ignore]: compound_ab (code MRR), lexicon_routing_ab + prose_lexicon_ab (prose R@1), build_lexicon_store/build_moby_store (OEWN/Moby overlay stores), lexicon_firewall (provenance allowlist gate). Scripts convert OEWN/Moby to triples, merge with provenance, render the source manifest, and fetch a neutral prose corpus. Validates: dictionaries dilute code, OEWN bridges prose synonyms BM25/dense miss; merge adds nothing over OEWN alone.
This was referenced Jun 19, 2026
What
Two opt-in, text-routed retrieval levers in query_index (the single search/ask funnel), all off by default so routing is unchanged unless enabled:
- ADEN_PPMI_RERANK: corpus-derived PPMI rerank of the top window (no external dictionary). Index::ppmi_rerank.
- ADEN_LEXICON_EXPAND: grounded OEWN synonym expansion before retrieval. Grounded to corpus vocab via Index::knows_term, so it is a no-op on vocabulary the corpus lacks.
- ADEN_LEXICON_AUTO: routes by detected query shape + corpus substrate (Index::code_anchor_fraction), so NL-over-code queries still get the rerank.
Why (ablation)
Measured by the A/B harnesses in this PR (all #[ignore]d):
- Dictionaries dilute code (MRR: PPMI-only 0.289 > PPMI+OEWN 0.247).
- Dense embeddings capture only about half of prose synonymy (R@1 20/42 vs 41/42), so OEWN is complementary, not redundant.
- Merging multiple dictionaries adds nothing over OEWN alone, so the shipped prose lever is OEWN.
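For readers unfamiliar with the code-side lever, here is a minimal sketch of what a corpus-derived PPMI rerank could look like. It is not the PR's Index::ppmi_rerank: the Cooc table, the Hit struct, the mean-PPMI boost, and the weight parameter are all hypothetical names and choices made for illustration only.

```rust
use std::collections::HashMap;

/// PPMI = max(0, log2(p(a,b) / (p(a) * p(b)))), estimated from raw counts.
fn ppmi(pair: u64, a: u64, b: u64, total: u64) -> f64 {
    if pair == 0 || a == 0 || b == 0 || total == 0 {
        return 0.0;
    }
    let n = total as f64;
    let ratio = (pair as f64 / n) / ((a as f64 / n) * (b as f64 / n));
    ratio.log2().max(0.0)
}

/// Hypothetical co-occurrence table built from the indexed corpus itself,
/// so no external dictionary is involved.
struct Cooc {
    pairs: HashMap<(String, String), u64>,
    terms: HashMap<String, u64>,
    total: u64,
}

impl Cooc {
    /// PPMI of an ordered (query term, hit term) pair; 0.0 for unseen pairs.
    fn ppmi(&self, a: &str, b: &str) -> f64 {
        let key = (a.to_string(), b.to_string());
        let pair = self.pairs.get(&key).copied().unwrap_or(0);
        let ca = self.terms.get(a).copied().unwrap_or(0);
        let cb = self.terms.get(b).copied().unwrap_or(0);
        ppmi(pair, ca, cb, self.total)
    }
}

/// A retrieved candidate (hypothetical shape): id, first-stage score, terms.
struct Hit {
    id: u32,
    score: f64,
    terms: Vec<String>,
}

/// Add `weight` times the mean query-term/hit-term PPMI to each hit's score,
/// then sort the window by descending score.
fn rerank(query: &[&str], hits: &mut [Hit], cooc: &Cooc, weight: f64) {
    for hit in hits.iter_mut() {
        let mut sum = 0.0;
        let mut n = 0usize;
        for q in query {
            for t in &hit.terms {
                sum += cooc.ppmi(q, t);
                n += 1;
            }
        }
        if n > 0 {
            hit.score += weight * sum / n as f64;
        }
    }
    hits.sort_by(|x, y| y.score.total_cmp(&x.score));
}
```

Because the association scores come only from the corpus's own co-occurrence counts, a sketch like this cannot pull in out-of-domain senses, which is one plausible reading of why adding a dictionary on top diluted the code results.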
Scope / safety
- ask/search inherit the levers via the CLI subprocess (set the env var in the server env).
- cargo deny ok.
- aden check all green.
- Docs: docs/retrieval-levers.adoc (+ index, README, CHANGELOG).
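Since the levers are driven entirely by environment variables, a rough sketch of how the three flags might resolve to active levers may help reviewers. Only the variable names come from this PR. The Levers struct, the resolve function, the "non-empty and not 0" flag parsing, and the 0.5 substrate threshold are assumptions for illustration, not the shipped logic.

```rust
/// Which levers end up active for one query (hypothetical shape).
#[derive(Debug, Default, PartialEq)]
struct Levers {
    ppmi_rerank: bool,
    lexicon_expand: bool,
}

/// Resolve the flags for one query.
/// `get` looks up a variable by name, so the logic can be exercised without
/// touching the real process environment.
fn resolve(
    get: impl Fn(&str) -> Option<String>,
    query_is_code_shaped: bool,
    code_anchor_fraction: f64,
) -> Levers {
    // Assumption: a flag counts as "on" when set to a non-empty value other than "0".
    let on = |name: &str| matches!(get(name), Some(v) if !v.is_empty() && v != "0");

    // The two force flags switch their lever on regardless of routing.
    let mut levers = Levers {
        ppmi_rerank: on("ADEN_PPMI_RERANK"),
        lexicon_expand: on("ADEN_LEXICON_EXPAND"),
    };

    if on("ADEN_LEXICON_AUTO") {
        // Hypothetical threshold: treat the corpus as code when at least
        // half of it is code-anchored.
        const CODE_SUBSTRATE_MIN: f64 = 0.5;
        if query_is_code_shaped || code_anchor_fraction >= CODE_SUBSTRATE_MIN {
            // A natural-language query over a code corpus still gets the rerank.
            levers.ppmi_rerank = true;
        } else {
            levers.lexicon_expand = true;
        }
    }
    levers
}
```

In a real process the lookup would be `|k| std::env::var(k).ok()`. With nothing set, the result is `Levers::default()`, matching the "all off by default" guarantee above.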