feat(serve): hybrid text+visual search with BM25 and Reciprocal Rank Fusion by aafaq-rashid-comprinno · Pull Request #97 · StarTrail-org/PixelRAG

aafaq-rashid-comprinno · 2026-06-24T13:34:03Z

Summary

Adds opt-in hybrid search that fuses FAISS visual results with BM25 text results using Reciprocal Rank Fusion (RRF, Cormack et al. 2009). This improves precision on text-heavy queries (entity names, specific facts) while retaining visual retrieval for tables/charts/diagrams.

Problem

Pure visual search struggles with text-only queries like "Albert Einstein" when the visual embedding doesn't strongly match a specific article. Adding a lightweight text signal improves recall for named entities and factual lookups without degrading visual results.

Implementation

New module: `serve/src/pixelrag_serve/hybrid.py`

- `BM25Index`: in-memory inverted index with standard BM25 scoring (k1=1.2, b=0.75)
- `reciprocal_rank_fusion()`: merges multiple ranked lists (constant k=60)
- `load_or_build_bm25()`: auto-builds from `articles.json` titles/URLs, caches as `bm25_index.json`
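
For readers unfamiliar with BM25, here is a minimal sketch of an in-memory inverted index using the parameters quoted above (k1=1.2, b=0.75). It is not the code in `hybrid.py`: the class name `TinyBM25`, the `tokenize` helper, the regex tokenizer, and the smoothed IDF variant are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBM25:
    """Illustrative inverted index; hypothetical, not the PR's BM25Index."""

    def __init__(self, docs: dict[int, str], k1: float = 1.2, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_len: dict[int, int] = {}
        # term -> {doc_id: term frequency in that document}
        self.postings: dict[str, dict[int, int]] = defaultdict(dict)
        for doc_id, text in docs.items():
            tokens = tokenize(text)
            self.doc_len[doc_id] = len(tokens)
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.avgdl = sum(self.doc_len.values()) / max(len(self.doc_len), 1)

    def search(self, query: str, top_k: int = 5) -> list[tuple[int, float]]:
        n_docs = len(self.doc_len)
        scores: dict[int, float] = defaultdict(float)
        for term in set(tokenize(query)):
            posting = self.postings.get(term)
            if not posting:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            df = len(posting)
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            for doc_id, tf in posting.items():
                # Length normalization: longer-than-average docs are penalized.
                norm = 1.0 - self.b + self.b * self.doc_len[doc_id] / self.avgdl
                scores[doc_id] += idf * tf * (self.k1 + 1.0) / (tf + self.k1 * norm)
        # Highest score first; ties broken by doc id for determinism.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


titles = {0: "Albert Einstein", 1: "Albert Camus", 2: "Einstein field equations"}
hits = TinyBM25(titles).search("Albert Einstein")
```

With titles as the only text, a query that matches every token of one title ranks that title first, which is the disambiguation behavior the tests below are described as covering.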

API change

`SearchRequest` gains a `hybrid: bool` field (default `false`, so the change is fully backward compatible).

When hybrid=true:

1. FAISS visual search runs as before
2. BM25 text search runs on the query text
3. The two ranked lists are fused via RRF
4. The fused ranking is returned to the client
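
The fusion step can be sketched as follows. RRF scores each document as the sum of `1 / (k + rank)` over every list it appears in, with 1-based ranks and k=60. The function name `rrf_fuse` and the use of integer document ids are assumptions for illustration; the signature of `reciprocal_rank_fusion()` in the PR may differ.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[int]], k: int = 60) -> list[tuple[int, float]]:
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion.

    Each input list is ordered best-first. A document missing from a
    list simply contributes nothing for that list.
    """
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


visual = [3, 1, 2]  # hypothetical FAISS ranking
text = [1, 4]       # hypothetical BM25 ranking
fused = rrf_fuse([visual, text])
print([doc_id for doc_id, _ in fused])  # prints [1, 3, 4, 2]
```

Document 1 wins because it appears in both lists, even though it is not first in the visual ranking. Because RRF uses only ranks, FAISS distances and BM25 scores never need to be put on a common scale.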

Usage

curl -X POST http://localhost:30001/search \
  -d '{ "queries": [{"text": "Albert Einstein"}], "n_docs": 5, "hybrid": true }'
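
The same request can be built from Python with only the standard library. This is a sketch: `build_search_request` and `send` are hypothetical helper names, the `Content-Type` header is an assumption about what the server expects (the curl example does not set one), and the response is returned as parsed JSON without assuming its shape.

```python
import json
import urllib.request


def build_search_request(
    text: str,
    n_docs: int = 5,
    hybrid: bool = True,
    base_url: str = "http://localhost:30001",
) -> urllib.request.Request:
    """Build a POST /search request mirroring the curl example."""
    payload = {"queries": [{"text": text}], "n_docs": n_docs, "hybrid": hybrid}
    return urllib.request.Request(
        f"{base_url}/search",
        data=json.dumps(payload).encode("utf-8"),
        # Assumed header; the curl example omits it.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


req = build_search_request("Albert Einstein")
# send(req) would issue the call against a running server.
```

Passing `hybrid=False` (or omitting the field from the payload) should fall back to the visual-only path, since the field defaults to false.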

Testing

- 8 new tests: BM25 scoring, disambiguation, save/load persistence, RRF fusion correctness, `articles.json` integration
- All 31 tests pass (23 existing + 8 new)
- `ruff check` clean

Future work

- Richer text corpus: OCR from tiles or original page text (currently uses article titles only)
- Per-chunk text indexing for finer-grained matching
- Configurable RRF k and alpha weighting between visual/text signals
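
One possible shape for the configurable weighting, purely as a sketch: scale each list's RRF contribution by `alpha` (visual) and `1 - alpha` (text). The function name `weighted_rrf` and this particular reading of "alpha weighting" are assumptions; the PR does not specify how the weighting would be defined.

```python
def weighted_rrf(
    visual: list[int],
    text: list[int],
    k: int = 60,
    alpha: float = 0.5,
) -> list[int]:
    """Fuse two best-first rankings, weighting visual by alpha and text by 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    scores: dict[int, float] = {}
    for weight, ranking in ((alpha, visual), (1.0 - alpha, text)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


visual_only = weighted_rrf([1, 2], [2, 1], alpha=1.0)  # text list ignored
text_only = weighted_rrf([1, 2], [2, 1], alpha=0.0)    # visual list ignored
```

At `alpha=1.0` the output follows the visual ranking and at `alpha=0.0` it follows the text ranking, so the parameter interpolates between the two existing behaviors.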

…Fusion

Add an opt-in hybrid search mode that fuses FAISS visual results with BM25 text results using Reciprocal Rank Fusion (RRF). This improves precision on text-heavy queries (entity names, specific facts) while retaining visual retrieval for tables/charts/diagrams.

Implementation:
- New module: serve/src/pixelrag_serve/hybrid.py
- BM25Index: in-memory inverted index with standard BM25 scoring (k1=1.2, b=0.75)
- reciprocal_rank_fusion(): merges multiple ranked lists (k=60, per Cormack et al.)
- Auto-builds from articles.json titles/URLs, caches as bm25_index.json
- API change: SearchRequest gains `hybrid: bool` field (default false)
- When hybrid=true, BM25 text results are fused with FAISS visual results per-query before result assembly

Usage:
curl -X POST http://localhost:30001/search \
  -d '{"queries": [{"text": "Albert Einstein"}], "n_docs": 5, "hybrid": true}'

8 new tests covering BM25 scoring, disambiguation, save/load, RRF fusion, and articles.json integration.

vercel · 2026-06-24T13:34:07Z

@aafaq-rashid-comprinno is attempting to deploy a commit to the andylizf's projects Team on Vercel.

A member of the Team first needs to authorize it.

aafaq-rashid-comprinno wants to merge 1 commit into StarTrail-org:main from aafaq-rashid-comprinno:feat/hybrid-search
