
feat(cli): cross-platform portability, opt-in model fetch, 512-token embedding window#53

Merged: RioPlay merged 1 commit into main from feat/prose-retrieval-portability on Jun 24, 2026

Conversation

RioPlay (Owner) commented Jun 23, 2026

Summary

Cross-platform portability fixes, an opt-in aden model fetch, and a small set of
additive/feature-gated index + provenance changes — including raising the dense embedding
window to 512 tokens, which a Pro Git A/B shows modestly improves prose retrieval. Everything
new is gated or additive, so the default build and behavior are unchanged. Two features that
an earlier revision of this branch carried were tested and removed (see Notes): prose↔prose
SimilarTo edges (produced no traversable edges) and whole-doc mean-pooling (net-negative for
retrieval).

Portability (Windows / macOS / Linux)

  • Every std::env::var("HOME") → dirs::home_dir() (shipping + tests); $HOME is unset on
    Windows, which silently disabled dense + the lexicon store there.
  • Per-user caches via dirs::cache_dir() (%LOCALAPPDATA% / ~/Library/Caches / ~/.cache)
    with a non-destructive legacy ~/.cache fallback (prefer_native).
  • java_resolver infers packages via Path::components() instead of matching /java/ or
    \java\ in the stringified path (the old Windows branch matched a literal that never occurs).
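
To illustrate the component-based approach, here is a minimal std-only sketch. The function body, its signature, and the rule "the package starts after the last java directory" are assumptions made for illustration; the PR names infer_java_package but does not show its implementation.

```rust
use std::path::{Component, Path};

/// Hypothetical sketch: infer a dotted Java package from a source path by
/// walking path components instead of searching the path string for "/java/".
fn infer_java_package(path: &Path) -> Option<String> {
    // Keep only normal components; this skips drive prefixes, the root, `.` and `..`.
    let parts: Vec<&str> = path
        .components()
        .filter_map(|c| match c {
            Component::Normal(s) => s.to_str(),
            _ => None,
        })
        .collect();

    // Assumption: the package root is the directory after the LAST `java`
    // component, as in `src/main/java/...`.
    let root = parts.iter().rposition(|p| *p == "java")?;

    // Everything between the `java` directory and the file name is the package.
    let last = parts.len().checked_sub(1)?;
    let dirs = parts.get(root + 1..last)?;
    if dirs.is_empty() {
        return None; // file sits directly under `java/`, so no named package
    }
    Some(dirs.join("."))
}
```

Because the separator is handled by the platform's path parser, the same function body works for both `/` and `\` layouts when compiled for the respective target; for example, `infer_java_package(Path::new("src/main/java/com/example/Foo.java"))` yields `Some("com.example")`.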

Tooling

  • aden model fetch (opt-in model-fetch feature) streams the bge model to disk through a
    bounded reader with https_only enforced, hashing incrementally with sha256; the checksum
    must match before the atomic rename. Compiled out by default; the core stays network-free.
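
The download pattern described above (size cap, incremental hashing, checksum gate before rename) can be sketched as follows. This is not the PR's code: the function name, the `.partial` temp suffix, and the error messages are hypothetical, and a toy FNV-1a checksum stands in for sha256 so the sketch needs only the standard library. Any `Read` source plays the role of the network stream.

```rust
use std::fs::{self, File};
use std::io::{self, Read, Write};
use std::path::Path;

/// Toy 64-bit FNV-1a checksum: a std-only STAND-IN for sha256 in this sketch.
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Fnv1a(0xcbf29ce484222325)
    }
    fn update(&mut self, bytes: &[u8]) {
        for b in bytes {
            self.0 ^= u64::from(*b);
            self.0 = self.0.wrapping_mul(0x100000001b3);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

/// Stream `src` into a temp file next to `dest`, reject bodies larger than
/// `max_bytes`, and rename into place only after the checksum matches.
fn fetch_verified<R: Read>(src: R, dest: &Path, max_bytes: u64, expected: u64) -> io::Result<()> {
    let tmp = dest.with_extension("partial"); // hypothetical temp name
    let result = (|| -> io::Result<()> {
        let mut out = File::create(&tmp)?;
        // Allow one byte past the cap so an oversized body is detectable.
        let mut bounded = src.take(max_bytes.saturating_add(1));
        let mut hasher = Fnv1a::new();
        let mut buf = [0u8; 8192];
        let mut total: u64 = 0;
        loop {
            let n = bounded.read(&mut buf)?;
            if n == 0 {
                break;
            }
            total += n as u64;
            if total > max_bytes {
                return Err(io::Error::new(io::ErrorKind::InvalidData, "body exceeds size cap"));
            }
            hasher.update(&buf[..n]); // hash chunk by chunk, never the whole file at once
            out.write_all(&buf[..n])?;
        }
        out.sync_all()?;
        if hasher.finish() != expected {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "checksum mismatch"));
        }
        Ok(())
    })();
    match result {
        // The destination only ever appears fully written and verified.
        Ok(()) => fs::rename(&tmp, dest),
        Err(e) => {
            let _ = fs::remove_file(&tmp); // best-effort cleanup of the partial file
            Err(e)
        }
    }
}
```

Keeping the temp file in the same directory as the destination is deliberate: a rename within one filesystem is what lets readers see either no file or the complete, verified one.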

Dense embedding window: MAX_SEQ 128 → 512 (validated)

  • The old 128-token cap truncated long prose to its first ~80 words. 512 lets a full section
    embed in one forward pass (CLS pooling). EMBED_PARAM_VERSION and the index version are
    bumped so stale vectors re-embed.

  • Benchmark (Pro Git, 20 labeled prose queries, hybrid):

    config                   R@1    R@5    R@10   R@20   MRR@20
    BM25 floor               0.15   0.45   0.50   0.60   0.293
    MAX_SEQ 128 (≈ prior)    0.30   0.50   0.65   0.85   0.410
    MAX_SEQ 512 (this PR)    0.30   0.55   0.75   0.85   0.435

    Modest lift over 128 on R@5 (+0.05), R@10 (+0.10), and MRR@20 (+0.025); R@1 and R@20 are
    unchanged. Both dense configurations beat the BM25 floor on every metric. With only 20
    queries, individual deltas are within noise, but no metric regresses.
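
For readers unfamiliar with the column headers, the sketch below shows how R@k and MRR@k are conventionally computed. It assumes one labeled target per query (so R@k is a hit rate), which the PR does not state, and the function names are illustrative rather than taken from the eval harness.

```rust
/// `ranks[i]` is the 1-based rank of the labeled target for query i,
/// or `None` if the target was not returned at all.
/// Assumption: exactly one relevant item per query.
fn recall_at_k(ranks: &[Option<usize>], k: usize) -> f64 {
    if ranks.is_empty() {
        return 0.0;
    }
    // Count queries whose target appears within the top k.
    let hits = ranks
        .iter()
        .flatten()
        .filter(|&&rank| rank >= 1 && rank <= k)
        .count();
    hits as f64 / ranks.len() as f64
}

/// Mean reciprocal rank, counting only targets found within the top k.
fn mrr_at_k(ranks: &[Option<usize>], k: usize) -> f64 {
    if ranks.is_empty() {
        return 0.0;
    }
    let sum: f64 = ranks
        .iter()
        .flatten()
        .filter(|&&rank| rank >= 1 && rank <= k)
        .map(|&rank| 1.0 / rank as f64)
        .sum();
    // Queries with a missing or too-deep target contribute 0 to the mean.
    sum / ranks.len() as f64
}
```

With ranks `[Some(1), Some(3), None, Some(12)]`, `recall_at_k(.., 5)` is 0.5 and `mrr_at_k(.., 20)` is (1 + 1/3 + 1/12) / 4. R@k only asks whether the target made the cut, while MRR rewards placing it higher.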

Additive provenance (additive JSON; no default-behavior change)

  • query output carries via_edge_types, inferred (embedding-derived edges only, via
    EdgeType::is_inferred), and node confidence.
  • Confidence-gated ask: on ambiguous routing, pull the budget back toward base and print an
    honest note (now distinguishes a genuine near-tie from a deliberate overview pick).
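
One way to picture the confidence gate is a linear pull of the routed budget toward the base budget when the top two routing scores are close. Everything in this sketch is an assumption: the score margin test, the `pull` factor, the parameter names, and the note text are hypothetical, and the near-tie versus overview-pick distinction mentioned above is not modeled.

```rust
/// Hypothetical sketch of a confidence-gated budget.
/// Returns the budget to use plus an optional note for the user.
fn gated_budget(
    base: u32,       // default budget
    routed: u32,     // budget the router asked for
    top: f64,        // score of the chosen route
    runner_up: f64,  // score of the next-best route
    min_margin: f64, // below this gap, routing counts as ambiguous
    pull: f64,       // 0.0 keeps the routed budget, 1.0 falls back to base
) -> (u32, Option<&'static str>) {
    if top - runner_up >= min_margin {
        return (routed, None); // confident routing: use the routed budget as-is
    }
    let pull = pull.clamp(0.0, 1.0);
    // Interpolate between the routed budget and the base budget.
    let blended = f64::from(base) + (1.0 - pull) * (f64::from(routed) - f64::from(base));
    (
        blended.round() as u32,
        Some("routing was ambiguous; budget pulled toward base"),
    )
}
```

For instance, with base 1000, routed 4000, scores 0.52 and 0.50, a margin of 0.10 and a pull of 0.5, the sketch returns a budget of 2500 together with the note.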

Testing

  • cargo test --workspace green; cargo clippy --workspace and --features dense,model-fetch
    clean (no new warnings); aden ci-check . passes.
  • Unit tests: prefer_native, infer_java_package (5 cases), model hashing. mcp_flag_parity
    exempts the model command under --features model-fetch.

Notes — what was tested and removed

  • whole-doc mean-pooling: a Pro Git A/B at both 128 and 512 caps showed it was
    net-negative on every metric (e.g. MRR 0.385 with pooling vs 0.435 without, at 512) —
    mean-pooling a long chapter dilutes the answering section. Reverted; only the cap ships.
  • prose↔prose SimilarTo edges: produced zero traversable edges on a real corpus. They were
    derived over the search-index anchor namespace, which doesn't match the graph's, so they
    dangled as orphans. They also used an absolute cosine gate rather than the mutual-kNN +
    whitening the design specifies. Removed; they return as a sequenced follow-up (index-split →
    whitening → mutual-kNN), gated on the eval harness + .aden/store sign-off. The similar_pairs
    primitive (with tests) is retained for that work.
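
To make the difference from an absolute cosine gate concrete, here is a minimal mutual-kNN sketch: a pair is kept only when each item appears in the other's k nearest neighbors, whatever the raw similarity value. The function names are illustrative and not the retained similar_pairs primitive; whitening is omitted, and the brute-force O(n²) scan is for clarity only.

```rust
/// Cosine similarity; returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Indices of the k most similar OTHER vectors for item `i`, best first.
fn top_k(vectors: &[Vec<f32>], i: usize, k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = (0..vectors.len())
        .filter(|&j| j != i)
        .map(|j| (j, cosine(&vectors[i], &vectors[j])))
        .collect();
    // Descending by similarity; the stable sort keeps index order on ties.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(k).map(|(j, _)| j).collect()
}

/// Pairs (i, j) with i < j where each item is in the other's top-k list.
fn mutual_knn_pairs(vectors: &[Vec<f32>], k: usize) -> Vec<(usize, usize)> {
    let neighbors: Vec<Vec<usize>> = (0..vectors.len())
        .map(|i| top_k(vectors, i, k))
        .collect();
    let mut pairs = Vec::new();
    for i in 0..vectors.len() {
        for &j in &neighbors[i] {
            // Keep the edge only if the relationship is reciprocal.
            if i < j && neighbors[j].contains(&i) {
                pairs.push((i, j));
            }
        }
    }
    pairs
}
```

With the four vectors `[1,0]`, `[0.9,0.1]`, `[0,1]`, `[0.7,0.5]` and k = 1, only the pair (0, 1) survives: item 2's nearest neighbor is item 3, but item 3 prefers item 1, so that one-directional link is dropped. A fixed cosine threshold would instead keep or drop pairs based on the raw score alone.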

License checklist (please confirm)

  • license-check / cargo deny CI green
  • New deps: dirs (MIT OR Apache-2.0), sha2 (MIT OR Apache-2.0, already default via
    aden-paths) — AGPL-compatible
  • NOTICE.md updated (model-fetch feature paragraph)
  • No new GPL/LGPL/MPL dependencies

Maintainer (BDFL) change: the commit carries the DCO Signed-off-by; the contributor CLA
acknowledgement does not apply to the Maintainer (CLA Section 1).

RioPlay force-pushed the feat/prose-retrieval-portability branch from 5a1d0fd to 9274362 on June 23, 2026 12:28
…embedding window

Portability (Windows / macOS / Linux):
- HOME -> dirs::home_dir() everywhere (shipping + tests); $HOME is unset on Windows,
  which silently disabled dense + the lexicon store there.
- Per-user caches via dirs::cache_dir() (%LOCALAPPDATA% / ~/Library/Caches / ~/.cache)
  with a non-destructive legacy ~/.cache fallback (prefer_native).
- java_resolver infers packages via Path::components() instead of matching "/java/"
  in the stringified path (the old Windows branch matched a literal that never occurs).

Tooling:
- Opt-in `aden model fetch` (model-fetch feature): streamed, size-capped, sha256-verified,
  https_only download of the bge model; compiled out by default (network-free core).

Dense / retrieval:
- MAX_SEQ 128 -> 512 so full prose sections embed (CLS pooling, single forward). A Pro Git
  A/B showed a modest prose-retrieval lift over 128; a whole-doc mean-pool variant trialed
  on top was net-negative and reverted (only the cap ships).
- query provenance: via_edge_types, inferred (embedding-derived edges only), node confidence.
- Confidence-gated ask: pull budget toward base on ambiguous routing, with an honest note.
- Index/embedding versions bumped so stale vectors re-embed.

Tests:
- cargo test --workspace and --features dense,model-fetch green; clippy clean (no new warnings).
- Unit tests: prefer_native, infer_java_package (5 cases), model hashing; mcp_flag_parity
  exempts the `model` command under model-fetch.

Signed-off-by: RioPlay (Ernest Hamblen) <rioplay@rioplay.dev>
RioPlay force-pushed the feat/prose-retrieval-portability branch from 9274362 to e881ce5 on June 24, 2026 12:14
RioPlay changed the title from "feat(retrieval): stronger prose retrieval, portability, and opt-in model fetch" to "feat(cli): cross-platform portability, opt-in model fetch, 512-token embedding window" on Jun 24, 2026
RioPlay merged commit a972c58 into main on Jun 24, 2026
6 checks passed
RioPlay deleted the feat/prose-retrieval-portability branch on June 24, 2026 23:20