Retrievo

Hybrid search for .NET — BM25 + vectors + RRF fusion, zero infrastructure

Retrievo is an open-source, in-process, in-memory search library for .NET that combines BM25 lexical matching with vector similarity search. Results are merged via Reciprocal Rank Fusion (RRF) into a single ranked list — no external servers, no databases, no infrastructure. Designed for corpora up to ~10k documents: local agent memory, small RAG pipelines, developer tools, and offline/edge scenarios.

Quick Start

using Retrievo;
using Retrievo.Models;

// Build an in-memory index; the using declaration disposes it at end of scope
using var index = new HybridSearchIndexBuilder()
    .AddDocument(new Document { Id = "1", Body = "Neural networks learn complex patterns." })
    .AddDocument(new Document { Id = "2", Body = "Kubernetes orchestrates container deployments." })
    .Build();

var response = index.Search(new HybridQuery { Text = "neural network training", TopK = 5 });

foreach (var r in response.Results)
    Console.WriteLine($"  {r.Id}: {r.Score:F4}");

Benchmarks

Retrieval Quality (NDCG@10)

Validated against BEIR with 245-configuration parameter sweeps per dataset:

Dataset	BM25	Vector-only	Hybrid (default)	Hybrid (tuned)	Anserini BM25
NFCorpus	0.330	0.384	0.391	0.392	0.325
SciFact	0.685	0.731	0.756	0.757	0.679

The default parameters (LexicalWeight=0.5, VectorWeight=1.0, RrfK=20, TitleBoost=0.5) were tuned via cross-dataset harmonic mean optimization.
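For readers unfamiliar with the metric, NDCG@10 compares the discounted gain of the top 10 results with that of an ideal ordering. One common formulation is given below. The exact gain function used by the benchmark harness is not stated in this README, so the linear gain here is an assumption.

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```

Here rel_i is the graded relevance of the result at rank i, and IDCG@k is the DCG@k of the ideal ordering for that query. A score of 1.0 means the top k results are in the best possible order.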

Query Latency

Measured on BEIR datasets with text-embedding-3-small (1536-dim) embeddings:

Operation	NFCorpus (3.6k docs)	SciFact (5.2k docs)
Lexical-only (BM25)	0.1 ms	0.2 ms
Vector-only	1.0 ms	1.4 ms
Hybrid (BM25 + vector + RRF)	1.1 ms	1.6 ms

Details: benchmarks/

With --verify-snapshot-roundtrip, the benchmark harness can also export a temporary snapshot, re-import it, and assert identical ranked outputs per query. Micro-benchmark timings are hardware-dependent; the current vector math measurements live in benchmarks/README.md.

Key Features

Core Search

Hybrid Retrieval: Combine BM25 and cosine similarity using RRF fusion.
Standalone Modes: Use lexical-only or vector-only search when needed.
Explain Mode: Detailed score breakdown for every search result.
Fielded Search: Title and body fields with independent boost weights.
Metadata Filters: Exact-match, range, and contains filtering post-fusion.
Field Definitions: Declare field types (String, StringArray) at index time for automatic filter semantics.
Finite Vector Validation: Rejects NaN/Infinity embeddings and query vectors with clear exceptions.

Index Management

Fluent Builder: Clean API for batch construction and folder ingestion.
Mutable Index: Incremental upserts and deletes with thread-safe commits.
Snapshot Export/Import: Persist versioned JSON snapshots and reload indexes without rescanning source files.
Unique Document IDs: Builders reject duplicate document IDs within a single build.
Zero Infrastructure: Runs entirely in-process with no external service dependencies.
Auto-Embedding: Transparently embed documents at index time.

Developer Experience

Shared Deterministic Ranking Paths: Immutable and mutable vector search share the same bounded heap selector, and RRF fusion uses the same deterministic top-K tie-breaking strategy.
Deterministic Vector Math: Memory-layout-independent cosine scoring with SIMD-aware accumulation and bounded top-K selection.
Query Diagnostics: Detailed timing breakdown for every pipeline stage.
Pluggable Providers: Easy integration with any embedding model or API.
CLI Tool: Terminal interface for indexing and querying.

Packages

Package	Description
`Retrievo`	Core library — BM25 lexical search, brute-force vector search, RRF fusion, builder, mutable index. Zero external service dependencies.
`Retrievo.AzureOpenAI`	Azure OpenAI embedding provider. Install this if you want automatic document/query embedding via Azure OpenAI. Adds a dependency on `Azure.AI.OpenAI`.

dotnet add package Retrievo --prerelease
dotnet add package Retrievo.AzureOpenAI --prerelease  # optional, for Azure OpenAI embeddings

Azure OpenAI Embeddings

Plug in an embedding provider and Retrievo handles the rest — documents are embedded at build time, queries at search time.

using Retrievo.AzureOpenAI;

var provider = new AzureOpenAIEmbeddingProvider(
    new Uri("https://your-resource.openai.azure.com/"),
    "your-api-key",
    "text-embedding-3-small");

// Documents are auto-embedded during build
using var index = await new HybridSearchIndexBuilder()
    .AddFolder("./docs")  // loads *.md and *.txt recursively
    .WithEmbeddingProvider(provider)
    .BuildAsync();

// Query text is automatically converted to a vector
var response = await index.SearchAsync(new HybridQuery { Text = "how to deploy", TopK = 5 });

Field Definitions

Declare field types at index time so filters automatically use the right matching strategy:

using var index = new HybridSearchIndexBuilder()
    .DefineField("tags", FieldType.StringArray)         // pipe-delimited by default
    .DefineField("categories", FieldType.StringArray, delimiter: ',')
    .AddDocument(new Document
    {
        Id = "1",
        Body = "Deep learning fundamentals",
        Metadata = new Dictionary<string, string>
        {
            ["tags"] = "ml|deep-learning|neural-nets",
            ["categories"] = "ai,education"
        }
    })
    .Build();

// StringArray fields auto-split and do contains-match; undeclared fields use exact-match
var response = index.Search(new HybridQuery
{
    Text = "deep learning",
    MetadataFilters = new Dictionary<string, string> { ["tags"] = "ml" }
});
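
The difference between the two matching strategies can be illustrated with a small standalone sketch. Matches is a hypothetical helper, not part of Retrievo, and it assumes ordinal, case-sensitive comparison with no trimming; the library's actual rules may differ.

```csharp
using System;
using System.Linq;

// Hypothetical helper mirroring the semantics described above; not Retrievo's code.
// A null delimiter stands in for an undeclared field, which is matched exactly.
static bool Matches(string stored, string wanted, char? delimiter) =>
    delimiter is char d
        ? stored.Split(d).Contains(wanted, StringComparer.Ordinal)   // StringArray: any element equals the filter value
        : string.Equals(stored, wanted, StringComparison.Ordinal);   // otherwise: the whole value must be equal

Console.WriteLine(Matches("ml|deep-learning|neural-nets", "ml", '|'));   // True
Console.WriteLine(Matches("ml|deep-learning|neural-nets", "ml", null));  // False
Console.WriteLine(Matches("ai,education", "education", ','));            // True
```

The second call shows why declaring the field matters: without a declared delimiter, the filter value "ml" is compared against the whole stored string and does not match.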

Snapshot Export and Import

Persist a built index to a versioned JSON snapshot and reload it later without rescanning files or regenerating stored embeddings:

using var index = await new HybridSearchIndexBuilder()
    .AddFolder("./docs")
    .WithEmbeddingProvider(provider)
    .BuildAsync();

await index.ExportSnapshotAsync("./docs.retrievo.json");

using var restored = await HybridSearchIndex.ImportSnapshotAsync(
    "./docs.retrievo.json",
    embeddingProvider: provider);
var response = await restored.SearchAsync(new HybridQuery { Text = "how to deploy", TopK = 5 });

Mutable indexes export only the last committed snapshot, so call Commit() first if pending changes should be included. Snapshots persist the live normalized vector state, so round-tripped vector rankings stay identical. Stored document embeddings survive snapshot export/import, but query text is still embedded at search time: supply an embedding provider at import time if you want vector or hybrid retrieval after restore.

Built-in RrfFuser snapshots import directly. If you use a custom IFuser, supply the same fuser again during import so ranking semantics are preserved:

using var restored = HybridSearchIndex.ImportSnapshot(
    "./docs.retrievo.json",
    fuser: new MyCustomFuser());

The CLI supports built-in RrfFuser snapshots directly.

retrievo export ./docs --output ./docs.retrievo.json
retrievo query ./docs.retrievo.json --text "neural network"

For end-to-end snapshot parity checks against the checked-in BEIR benchmark fixtures:

dotnet run --project benchmarks/Retrievo.Benchmarks -- --dataset nfcorpus --embeddings benchmarks/fixtures/embeddings/nfcorpus.text-embedding-3-small.cache --verify-snapshot-roundtrip

Architecture

HybridQuery
    |
    v
+-------------------+
|  HybridSearchIndex |  (orchestrator)
+-------------------+
    |           |
    v           v
+--------+  +----------+
| Lucene |  | Brute-   |
| BM25   |  | Force    |
| Search |  | Cosine   |
+--------+  +----------+
    |           |
    v           v
  ranked      ranked
  list        list
    \         /
     v       v
  +----------+
  | RRF      |
  | Fusion   |
  +----------+
       |
       v
  SearchResponse

Reciprocal Rank Fusion merges multiple ranked lists without score normalization: score(doc) = Σ_i weight_i / (k + rank_i(doc)), summed over every ranked list i that contains the document, where rank_i(doc) is the document's position in list i. Documents that rank high on both the lexical and vector lists get the biggest boost, which surfaces results that are semantically relevant and contain the right keywords.
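
The fusion formula above can be sketched in a few lines of standalone C#. Fuse is a hypothetical helper written for illustration, not Retrievo's RrfFuser. It assumes ranks start at 1 and breaks ties by ordinal document ID, which may differ from the library's tie-breaking strategy.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; this is not Retrievo's RrfFuser.
// Each input list is ordered best-first, and ranks are assumed to start at 1.
static List<(string Id, double Score)> Fuse(
    IReadOnlyList<(string[] Ranked, double Weight)> lists, int k, int topK)
{
    var scores = new Dictionary<string, double>(StringComparer.Ordinal);
    foreach (var (ranked, weight) in lists)
    {
        for (int i = 0; i < ranked.Length; i++)
        {
            // A document absent from a list gets no contribution from that list.
            scores.TryGetValue(ranked[i], out double current);
            scores[ranked[i]] = current + weight / (k + i + 1);
        }
    }

    // Highest fused score first; ties broken by ordinal ID so output is stable.
    return scores
        .OrderByDescending(kv => kv.Value)
        .ThenBy(kv => kv.Key, StringComparer.Ordinal)
        .Take(topK)
        .Select(kv => (Id: kv.Key, Score: kv.Value))
        .ToList();
}

var lexical = new[] { "a", "b", "c" };   // BM25 order, best first
var vector = new[] { "b", "d", "a" };    // cosine order, best first
var inputs = new (string[] Ranked, double Weight)[] { (lexical, 0.5), (vector, 1.0) };

foreach (var (id, score) in Fuse(inputs, k: 20, topK: 3))
    Console.WriteLine($"{id}: {score:F4}");
```

With the default weights and k = 20, the fused order is b, a, d. Document b wins because it ranks near the top of both lists, while d appears only in the vector list and c only in the lexical list.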

Roadmap

Phase	Status	Description
Phase 1	Done	MVP hybrid retrieval, CLI, Azure OpenAI provider
Phase 2	Done	Mutable index, fielded search, filters (exact, range, contains), field definitions, diagnostics
Phase 3	Done	Snapshot export and import
Phase 4	Planned	ANN support for larger corpora

Build & Test

Requires .NET 8 SDK or later.

dotnet build
dotnet test

263 tests covering retrieval, vector math, fusion, mutable index, snapshot persistence, filters, field definitions, cancellation, and CLI integration — 0 warnings. CLI integration tests build the CLI project and execute the matching built artifact for the active configuration/TFM, so dotnet test tests/Retrievo.IntegrationTests works from a clean checkout.

Known Limitations

Lexical (BM25) search is English-only: The lexical retrieval pipeline uses EnglishStemAnalyzer (StandardTokenizer → EnglishPossessiveFilter → LowerCaseFilter → English StopWords → PorterStemmer). Non-English text will not be properly tokenized or stemmed for BM25 matching.
Vector search is language-agnostic: Semantic search works with any language supported by your embedding model (e.g., multilingual embeddings). Hybrid search inherits the English-only limitation for its lexical component.
Brute-force vector search is O(n) per query: Designed for corpora up to ~10k documents. For larger corpora, consider ANN-based solutions (planned for Phase 4).
Manual persistence: Indexes stay in-memory at runtime. Durability is explicit via snapshot export/import; there is no automatic crash recovery or write-ahead log.
No concurrent writers on HybridSearchIndex: The immutable index is built once via the builder. Use MutableHybridSearchIndex for incremental upserts and deletes.
Single-process: No distributed or shared index support. The index lives in a single process's memory.
Workaround for non-English corpora: Use vector-only search by omitting lexical configuration, or configure a custom analyzer for your language in a fork.
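
To make the O(n) cost of brute-force vector search concrete, here is a minimal standalone sketch of cosine scoring with a bounded heap for top-K selection. Normalize and TopK are hypothetical helpers, not Retrievo's implementation. The sketch assumes all vectors share one dimension, and it does not reproduce the library's SIMD accumulation or deterministic tie-breaking.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helpers for illustration; not Retrievo's implementation.
static float[] Normalize(float[] v)
{
    double norm = Math.Sqrt(v.Sum(x => (double)x * x));
    if (norm == 0 || double.IsNaN(norm) || double.IsInfinity(norm))
        throw new ArgumentException("Vector must be finite and non-zero.");
    return v.Select(x => (float)(x / norm)).ToArray();
}

static List<(string Id, double Score)> TopK(
    IReadOnlyList<(string Id, float[] Unit)> docs, float[] query, int k)
{
    float[] q = Normalize(query);
    // Min-heap keyed by score: the root is the weakest of the current top k.
    var heap = new PriorityQueue<string, double>();
    foreach (var (docId, unit) in docs)   // one pass over every document: O(n)
    {
        double dot = 0;
        for (int i = 0; i < q.Length; i++) dot += (double)unit[i] * q[i];
        heap.Enqueue(docId, dot);
        if (heap.Count > k) heap.Dequeue();   // evict the weakest candidate
    }

    // The heap yields weakest-first, so reverse to get best-first order.
    var result = new List<(string Id, double Score)>();
    while (heap.TryDequeue(out var bestId, out var bestScore)) result.Add((bestId, bestScore));
    result.Reverse();
    return result;
}

var corpus = new (string Id, float[] Unit)[]
{
    ("x", Normalize(new[] { 1f, 0f })),
    ("y", Normalize(new[] { 0f, 1f })),
    ("z", Normalize(new[] { 1f, 1f })),
};

foreach (var (id, score) in TopK(corpus, new[] { 1f, 0.2f }, k: 2))
    Console.WriteLine($"{id}: {score:F4}");
```

Every query touches every stored vector, so latency grows linearly with corpus size. The heap keeps memory bounded at k candidates but does not reduce the number of dot products, which is why an ANN index is the usual next step for larger corpora.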

License

MIT
