fix(search): guard dual-substrate lexicon levers (opt-in) + index .asc AsciiDoc by RioPlay · Pull Request #40 · RioPlay/aden

RioPlay · 2026-06-20T04:51:49Z

Summary

The dual-substrate lexicon levers (PR #38/#39) shipped on by default, but external benchmarking showed they regress retrieval on every external repo with natural queries — e.g. prose MRR 0.336→0.166, tanstack 0.104→0.007 on the same base. The original wins were real but confined to engineered zero-overlap probes and a dense base, which the product path doesn't match. This PR makes the levers safe + opt-in, and fixes a related indexing bug found while building the prose benchmark.

fix(search) — guard the levers, flip to opt-in. query_index now always computes the baseline ranking as a safety floor; expansion is DF-gated (drops common-word noise) and fused base-weighted (can supplement but never evict a confident base hit); the PPMI rerank is gated to code-shaped queries. Levers are now opt-in via ADEN_LEXICON_ON — the default is the plain baseline (the best of everything measured). Guarded, enabling them reaches parity instead of cratering.
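The guard logic described above can be sketched roughly as follows. This is an illustrative Python model, not the actual `query_index` code: the function names, the DF ratio, the fusion weight, the confidence threshold, and the way `ADEN_LEXICON_ON` is parsed are all assumptions, and the code-shaped rerank gate is left out.

```python
import os

def lexicon_enabled(env):
    # Assumption: any value other than empty or "0" opts in.
    # The PR names the variable but not its accepted values.
    return env.get("ADEN_LEXICON_ON", "") not in ("", "0")

def df_gate(expansion_terms, doc_freq, n_docs, max_df_ratio=0.2):
    """Drop expansion terms that occur in too many documents (common-word noise)."""
    return [t for t in expansion_terms
            if doc_freq.get(t, 0) / n_docs <= max_df_ratio]

def fuse_base_weighted(base, expanded, base_weight=0.7, confident=0.5):
    """Fuse two {doc_id: score} maps so expansion supplements the base ranking.

    Base hits scoring at least `confident` are pinned ahead of everything
    else in base order, so an expansion hit can never evict them.
    """
    fused = {
        doc: base_weight * base.get(doc, 0.0)
             + (1 - base_weight) * expanded.get(doc, 0.0)
        for doc in set(base) | set(expanded)
    }
    pinned = sorted((d for d in base if base[d] >= confident),
                    key=lambda d: (-base[d], d))
    pinned_set = set(pinned)
    rest = sorted((d for d in fused if d not in pinned_set),
                  key=lambda d: (-fused[d], d))
    return pinned + rest

def query(base, expanded, env=None):
    env = os.environ if env is None else env
    # The baseline ranking is always computed and serves as the safety floor.
    baseline = sorted(base, key=lambda d: (-base[d], d))
    if not lexicon_enabled(env):
        return baseline
    return fuse_base_weighted(base, expanded)
```

With the variable unset, `query` returns the plain baseline. With it set, a strong expansion-only hit can rise above weak base hits but stays below any confident base hit.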

fix(index) — index .asc AsciiDoc. .asc was in SECRET_EXTS (PGP-armor collision), so AsciiDoc books that use it (Pro Git, many docs repos) indexed empty with a green health score. It's dual-use, so it's now content-gated: a real armored PRIVATE key in a .asc is still caught by content_has_high_confidence_secret; prose indexes normally.
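A minimal sketch of judging `.asc` by content rather than by name, written in Python for illustration. The skip-list contents and the helper `content_has_pgp_private_key` are hypothetical stand-ins; the real check lives in `content_has_high_confidence_secret`, whose implementation this PR page does not show. The marker string is the standard OpenPGP armor header for private keys.

```python
from pathlib import Path

# Hypothetical skip-list for illustration; "asc" is deliberately absent.
SECRET_EXTS = {"pem", "key", "p12"}

# Standard OpenPGP ASCII-armor header line for a private key block.
PGP_PRIVATE_MARKER = "-----BEGIN PGP PRIVATE KEY BLOCK-----"

def content_has_pgp_private_key(text: str) -> bool:
    """Hypothetical content check: true if the text holds an armored private key."""
    return PGP_PRIVATE_MARKER in text

def should_index(path: str, text: str) -> bool:
    ext = Path(path).suffix.lstrip(".").lower()
    if ext in SECRET_EXTS:
        return False  # extension alone is enough to skip these
    # Dual-use extensions such as .asc fall through to the content check.
    return not content_has_pgp_private_key(text)
```

Under these assumptions an AsciiDoc chapter named `ch1.asc` indexes normally, while an armored private key with the same extension is still skipped.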

test(bench) — adds scripts/lexicon_ab_bench.py (external ON/OFF A/B harness) + an external prose query set, so any future "on by default" decision is gated on real repos, not in-tree fixtures.
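For readers unfamiliar with the metric, an ON/OFF comparison of MRR (mean reciprocal rank) can be computed along these lines. This is a generic sketch, not the contents of `scripts/lexicon_ab_bench.py`; the function names and the one-gold-document-per-query shape are assumptions.

```python
def mrr(rankings, gold):
    """Mean reciprocal rank.

    rankings: {query: ranked list of doc ids}
    gold:     {query: the single relevant doc id}
    A query whose gold document is missing from its ranking contributes 0.
    """
    total = 0.0
    for q, ranked in rankings.items():
        if gold[q] in ranked:
            total += 1.0 / (ranked.index(gold[q]) + 1)
    return total / len(rankings)

def ab_delta(off, on, gold):
    """MRR change from turning the levers on; negative means a regression."""
    return mrr(on, gold) - mrr(off, gold)
```

A negative `ab_delta` on an external repo is the kind of signal that motivated flipping the default to off.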

Test plan

cargo test --workspace green (via the pre-push ci-check gate)
cargo clippy --workspace clean
cargo fmt --all applied

License checklist

No dependency changes — license checklist not applicable

The lexicon levers (PR #38/#39) shipped on-by-default but regress retrieval on external repos with natural queries: neutral-to-negative on rustfmt/Go/flask/TS/prose (e.g. prose MRR 0.336->0.166, tanstack 0.104->0.007 on the same base). The wins were real but confined to engineered zero-overlap probes and a dense base.

- query_index always computes the baseline ranking as a safety floor; expansion is DF-gated (drops common-word noise) and fused base-weighted (supplements but never evicts a confident base hit); rerank is gated to code-shaped queries. The levers reach parity instead of cratering.
- levers are now opt-in (ADEN_LEXICON_ON); default is the baseline.
- add scripts/lexicon_ab_bench.py (external A/B harness) and an external prose query set so re-enabling by default is gated on real repos.

filter.rs SECRET_EXTS listed "asc", so AsciiDoc books using the .asc extension (Pro Git and many doc repos) indexed empty with a green health score. .asc is dual-use (AsciiDoc + PGP armor); judge by content, not name: drop it from the skip-list and add a PGP-private-key content check so a real armored key in a .asc is still caught while prose indexes.

The blast-radius eval (graph caller-edges vs a text-scan ground truth of `NAME(` call sites, file-level) existed only as a stale .pyc. Reconstruct it as a committed, reusable harness that points at any external repo and separates two concerns:

- understand name-resolution accuracy (did `understand NAME` land on the exact symbol, or fuzzy-match a substring superset), and
- blast-radius precision/recall on the correctly-resolved symbols.

Auto-discovers curated callable gold symbols (no dunders/generic names, called-but-not-ubiquitous) and excludes test dirs to match aden's production extraction scope. On flask: resolution 0.95, blast-radius P0.44/R0.61; the method-call-resolution gap depresses recall on OO/Python vs the 0.99 on free-function Rust code.
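File-level precision and recall for one symbol can be computed as in the sketch below. This is a generic illustration of the metric, not the committed harness; the function name and the example file sets are made up.

```python
def precision_recall(predicted, truth):
    """File-level precision/recall for one symbol's blast radius.

    predicted: set of files the graph's caller-edges report
    truth:     set of files where a text scan finds `NAME(` call sites
    """
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Made-up example: 4 files predicted, 3 true call-site files, 2 in common.
p, r = precision_recall({"a.py", "b.py", "c.py", "d.py"},
                        {"a.py", "b.py", "e.py"})
```

Here `p` is 2/4 and `r` is 2/3. Scoring only correctly-resolved symbols keeps a name-resolution miss from being counted as a blast-radius miss.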

commands.adoc:
- asm --depth default: 3 → 2 (matches code)
- mcp install: add missing --surface <essential|standard|full> flag
- viz: add missing --scope and --resolution flags; remove bogus -j alias
- heal: add structured flag table (--propose/--fix/--gc/--since/--apply/--watch)
- status: document savings estimate output block

ai-integration.adoc:
- Replace two-tier Core/Extended model with three-tier Essential/Standard/Full
- Fix tool assignments: search/list/communities/impact-diff were in wrong tier
- Fix env var: ADEN_MCP_FULL=1 → ADEN_MCP_SURFACE=standard|full (legacy alias noted)
- Document --surface flag at install time

retrieval-levers.adoc:
- Fix polarity: auto-gating is OFF by default, opt-in via ADEN_LEXICON_ON
- Reframe ADEN_LEXICON_OFF as kill switch, not primary disable mechanism
- Fix rerank trigger: code_anchor_fraction → query_looks_codey (query text only)
- Fix NL-over-code behavior: expands only, does not rerank
- Document PR #39/PR #40 revert history in status section

architecture.adoc:
- Add aden-paths node to Mermaid crate diagram
- Remove duplicate aden-mcp Phase 2 row (already shipped as Phase 0)

security-model.adoc:
- Replace CanPerform (not a valid EdgeType) with Invokes in semantics example
- Update malicious-contract-injection threat: moot since store-first (ADR-003)

.agent/quick-ref.adoc:
- Add --features watch caveat to aden watch entry

crates/aden-cli/src/commands/init.rs:
- Fix misleading "Knowledge graph built in .aden/store." message; store is in per-user data dir since ADR-003

docs/adr-008-current-implementation-state.adoc (new):
- Document current scope extensions beyond original ADRs: Wave 3 edges (Supersedes/Justifies/AssociatedWith), GEN_LOGIC_VERSION=4, store-first architecture, MCP three-tier surface, dual-substrate retrieval (opt-in)

Co-authored-by: RioPlay (Ernest Hamblen) <rioplay@rioplay.dev>

RioPlay (Ernest Hamblen) added 3 commits June 19, 2026 16:14

RioPlay merged commit 15339eb into main Jun 20, 2026
6 checks passed

RioPlay mentioned this pull request Jun 20, 2026

docs: fix 20 documentation accuracy issues + add ADR-008 #43

Merged

4 tasks

RioPlay deleted the fix/lexicon-additive-guard branch June 21, 2026 04:08

RioPlay merged 3 commits into main from fix/lexicon-additive-guard
