
feat(serve): hybrid text+visual search with BM25 and Reciprocal Rank Fusion #97

Open
aafaq-rashid-comprinno wants to merge 1 commit into StarTrail-org:main from aafaq-rashid-comprinno:feat/hybrid-search

Conversation

@aafaq-rashid-comprinno


Summary

Adds opt-in hybrid search that fuses FAISS visual results with BM25 text results using Reciprocal Rank Fusion (RRF, Cormack et al. 2009). This improves precision on text-heavy queries (entity names, specific facts) while retaining visual retrieval for tables/charts/diagrams.

Problem

Pure visual search struggles with text-only queries like "Albert Einstein" when the visual embedding doesn't strongly match a specific article. Adding a lightweight text signal improves recall for named entities and factual lookups without degrading visual results.

Implementation

New module: serve/src/pixelrag_serve/hybrid.py

  • BM25Index — in-memory inverted index with standard BM25 scoring (k1=1.2, b=0.75)
  • reciprocal_rank_fusion() — merges multiple ranked lists (k=60 constant)
  • load_or_build_bm25() — auto-builds from articles.json titles/URLs, caches as bm25_index.json
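
For readers unfamiliar with BM25, the scoring behind an index like this can be sketched as follows. This is a minimal illustration, not the code in hybrid.py: the names `TinyBM25` and `tokenize`, the regex tokenizer, and the non-negative IDF variant are all assumptions made for the example. Only the k1=1.2 and b=0.75 defaults come from the description above.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBM25:
    """Hypothetical in-memory inverted index with Okapi BM25 scoring."""

    def __init__(self, docs: dict[str, str], k1: float = 1.2, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_len: dict[str, int] = {}
        # term -> {doc_id: term frequency in that document}
        self.postings: dict[str, dict[str, int]] = defaultdict(dict)
        for doc_id, text in docs.items():
            tokens = tokenize(text)
            self.doc_len[doc_id] = len(tokens)
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.avg_len = sum(self.doc_len.values()) / max(len(self.doc_len), 1)

    def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        n_docs = len(self.doc_len)
        scores: dict[str, float] = defaultdict(float)
        for term in set(tokenize(query)):
            posting = self.postings.get(term)
            if not posting:
                continue  # term does not occur in any document
            df = len(posting)
            # Non-negative IDF variant (assumed; the PR may use another form).
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            for doc_id, tf in posting.items():
                # Length normalisation: longer-than-average docs are penalised.
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Titles standing in for articles.json entries (made-up data).
index = TinyBM25({
    "a": "Albert Einstein",
    "b": "Einstein field equations",
    "c": "Marie Curie",
})
results = index.search("Albert Einstein")
```

Here the document titled "Albert Einstein" matches both query terms and is also shorter, so it ranks ahead of "Einstein field equations"; "Marie Curie" shares no terms and is not returned.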

API change

SearchRequest gains a hybrid: bool field (default false, so the change is fully backward compatible).

When hybrid=true:

  1. FAISS visual search runs as before
  2. BM25 text search runs on the query text
  3. Results are fused via RRF
  4. Fused ranking is returned to the client
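
The fusion step above can be sketched in a few lines. This is an illustrative version under stated assumptions, not the PR's reciprocal_rank_fusion(): the function name `rrf_fuse`, the tie-break by document id, and the page ids are made up for the example. Each document receives 1 / (k + rank) from every list it appears in, with ranks starting at 1 and k=60 as stated in the description.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document (rank is 1-based);
    contributions are summed and documents are sorted by the total.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by doc id for determinism (assumed).
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical result lists: one from visual search, one from BM25.
visual = ["page_7", "page_2", "page_9"]
text = ["page_2", "page_4", "page_7"]
fused = rrf_fuse([visual, text])
fused_ids = [doc_id for doc_id, _ in fused]
```

With these inputs, page_2 (ranks 2 and 1) edges out page_7 (ranks 1 and 3), and documents found by only one retriever fall to the bottom. Because RRF uses ranks rather than raw scores, the FAISS and BM25 scores never need to be put on a common scale.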

Usage

curl -X POST http://localhost:30001/search \
  -H 'Content-Type: application/json' \
  -d '{"queries": [{"text": "Albert Einstein"}], "n_docs": 5, "hybrid": true}'

Testing

  • 8 new tests: BM25 scoring, disambiguation, save/load persistence, RRF fusion correctness, articles.json integration
  • All 31 tests pass (23 existing + 8 new)
  • ruff check clean

Future work

  • Richer text corpus: OCR from tiles or original page text (currently uses article titles only)
  • Per-chunk text indexing for finer-grained matching
  • Configurable RRF k and alpha weighting between visual/text signals
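
One possible shape for the last item is a weighted RRF variant. This is purely a sketch of an idea that is not implemented in this PR: the function name `weighted_rrf`, the convention that `alpha` weights the visual list and `1 - alpha` the text list, and the tie-break by id are all assumptions.

```python
def weighted_rrf(
    visual: list[str],
    text: list[str],
    k: int = 60,
    alpha: float = 0.5,
) -> list[str]:
    """Weighted RRF sketch: alpha scales visual, (1 - alpha) scales text."""
    scores: dict[str, float] = {}
    for weight, ranking in ((alpha, visual), (1.0 - alpha, text)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by doc id (assumed).
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ordered]


# Hypothetical result lists for illustration.
visual_hits = ["v1", "v2"]
text_hits = ["t1", "v2"]
balanced = weighted_rrf(visual_hits, text_hits, alpha=0.5)
visual_only = weighted_rrf(visual_hits, text_hits, alpha=1.0)
text_only = weighted_rrf(visual_hits, text_hits, alpha=0.0)
```

Setting alpha to 1.0 or 0.0 makes the fused order follow the visual or text ranking respectively, so a knob like this would let the current behaviour and a text-only mode fall out as special cases.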

…Fusion

Add an opt-in hybrid search mode that fuses FAISS visual results with
BM25 text results using Reciprocal Rank Fusion (RRF). This improves
precision on text-heavy queries (entity names, specific facts) while
retaining visual retrieval for tables/charts/diagrams.

Implementation:
- New module: serve/src/pixelrag_serve/hybrid.py
  - BM25Index: in-memory inverted index with standard BM25 scoring (k1=1.2, b=0.75)
  - reciprocal_rank_fusion(): merges multiple ranked lists (k=60, per Cormack et al.)
  - Auto-builds from articles.json titles/URLs, caches as bm25_index.json
- API change: SearchRequest gains `hybrid: bool` field (default false)
- When hybrid=true, BM25 text results are fused with FAISS visual results
  per-query before result assembly

Usage:
  curl -X POST http://localhost:30001/search \
    -d '{"queries": [{"text": "Albert Einstein"}], "n_docs": 5, "hybrid": true}'

8 new tests covering BM25 scoring, disambiguation, save/load, RRF fusion,
and articles.json integration.
@vercel

vercel Bot commented Jun 24, 2026


@aafaq-rashid-comprinno is attempting to deploy a commit to the andylizf's projects Team on Vercel.

A member of the Team first needs to authorize it.
