An AI-powered semantic and keyword search engine for Islamic Hadiths, built with FastAPI, Qdrant, and hybrid retrieval techniques.
Hadith is an end-to-end pipeline that scrapes, processes, indexes, and searches Islamic Hadith texts across multiple major books (Sahih al-Bukhari, Sahih Muslim, and more). It combines semantic vector search (via multilingual embeddings stored in Qdrant) with keyword-based BM25 search (via Whoosh) to deliver highly relevant results in both Arabic and English.
┌─────────────────────────────────────────────────────────────┐
│ Data Pipeline │
│ │
│ sunnah.com ──► Scraper ──► Repository (DB / Excel) │
│ │ │
│ ▼ │
│ Chunker ──► Embedder ──► Qdrant │
│ │ │
│ ▼ │
│ Whoosh Index │
└─────────────────────────────────────────────────────────────┘
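The Chunker stage can be illustrated with a minimal sketch. It assumes a word-window strategy with overlap; the repository's actual chunking rule is not described here, and the function name `chunk_words` and its default sizes are hypothetical.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of at most `max_words` words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears intact in one of the chunks.
    NOTE: window size and overlap are illustrative assumptions.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("require max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    if not words:
        return []

    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk would then be passed to the embedder and stored in Qdrant alongside the ID of the hadith it came from, so search hits can be mapped back to full records.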
┌─────────────────────────────────────────────────────────────┐
│ Search Pipeline │
│ │
│ User Query │
│ │ │
│ ├──► Semantic Search (Qdrant) ──► [(id, score)] │
│ │ │ │
│ └──► Keyword Search (Whoosh) ──► [(id, score)] │
│ │ │
│ Normalize + Fuse │
│ (0.7 semantic + 0.3 BM25) │
│ │ │
│ Fetch from PostgreSQL │
│ │ │
│ Ranked Results │
└─────────────────────────────────────────────────────────────┘
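The "Normalize + Fuse" step can be sketched as follows. This is a minimal illustration, assuming min-max normalization (the normalization method is not stated here) and treating an ID that is missing from one result list as scoring 0 in that list. The names `min_max` and `fuse` are hypothetical and not taken from the repository; only the 0.7 / 0.3 weights come from the diagram.

```python
def min_max(scores: dict[int, float]) -> dict[int, float]:
    """Rescale scores to [0, 1] so cosine and BM25 values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every hit as a full match.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(
    semantic: list[tuple[int, float]],
    keyword: list[tuple[int, float]],
    w_semantic: float = 0.7,
    w_keyword: float = 0.3,
    top_k: int = 5,
) -> list[tuple[int, float]]:
    """Weighted sum of normalized scores, highest first."""
    sem = min_max(dict(semantic))
    kw = min_max(dict(keyword))
    fused = {
        doc_id: w_semantic * sem.get(doc_id, 0.0) + w_keyword * kw.get(doc_id, 0.0)
        for doc_id in set(sem) | set(kw)
    }
    # Sort by score descending, then by ID so ties are deterministic.
    ranked = sorted(fused.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

The returned IDs would then be used to fetch the full hadith records from the relational database, as shown in the last step of the diagram.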
src/
├── api/
│ ├── routes/ # FastAPI route definitions
│ ├── schemas/ # Request / response Pydantic models
│ ├── dependencies.py # Dependency injection
│ └── mapper.py # Domain → response DTO mapping
├── application/
│ ├── interfaces/ # Abstract base classes
│ ├── services/ # SemanticSearch, KeywordSearch, HybridSearch
│ ├── embedder.py # Embedding model wrapper
│ └── factories/ # VectorDB and DataLoader factories
├── domain/
│ ├── models/ # ScrapedHadith domain model
│ └── enums/ # EmbeddingType, HadithRepositoryType
├── infrastructure/
│ ├── scrapper/ # HTTP client and sunnah.com scraper
│ ├── repositories/ # DbHadithRepository, ExcelHadithRepository
│ ├── vectoreDb/ # QdrantDB wrapper
│ ├── KeywordDataStore/ # WhooshIndex wrapper
│ └── db/ # DB client and models (SQLAlchemy)
├── scripts/
│ ├── scrape.py # Scraping pipeline
│ ├── vectorize.py # Chunking + embedding + indexing pipeline
│ └── dbToWhoosh.py # Populates the keyword store (Whoosh)
├── config.py # App configuration via pydantic-settings
└── main.py # FastAPI app entrypoint
1. Clone the repository
git clone https://github.com/alikashlan10/Hadith.git
cd Hadith
2. Create and activate a virtual environment
conda create -n hadith python=3.11
conda activate hadith
3. Install dependencies
pip install -r requirements.txt
4. Set up environment variables
Copy the provided example file and fill in your own values:
cp .env.example .env
Then edit .env with your configuration. See .env.example for all required variables and their descriptions.
Scrapes hadith texts from sunnah.com and saves them to your configured repository (database or Excel).
python -m src.scripts.scrape
Loads hadiths from the repository, chunks them, generates embeddings, and indexes them into Qdrant (vector search).
python -m src.scripts.vectorize
Loads hadiths from the repository and inserts them into the Whoosh index (keyword search).
python -m src.scripts.dbToWhoosh
Starts the API server:
uvicorn src.main:app --reload
The API will be available at http://127.0.0.1:8000
Search for hadiths using hybrid semantic + keyword retrieval.
Request:
{
"query" : "fasting while travelling",
"top_k" : 5
}
Response:
{
"query": "fasting while travelling",
"top_k": 5,
"returned": 5,
"results": [
{
"db_id": 9878,
"text_ar": "...",
"text_en": "...",
"book_name_ar": "صحيح مسلم",
"book_name_en": "Sahih Muslim",
"chapter_name_ar": "كتاب الصيام",
"chapter_name_en": "The Book of Fasting",
"reference": "Sahih Muslim 1121b",
"score": 0.7
}
]
}
| Layer | Technology |
|---|---|
| API | FastAPI |
| Vector Search | Qdrant |
| Keyword Search | Whoosh (BM25) |
| Embeddings | multilingual-e5-base |
| Relational DB | PostgreSQL (via SQLAlchemy) |
| Scraping | requests + BeautifulSoup |
| Config | pydantic-settings |
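As a usage sketch for the search endpoint, the snippet below builds the documented request body and extracts references and scores from a response of the documented shape, using only the Python standard library. The route path and the HTTP method are not stated in this README, so `/search` and `POST` are assumptions; check the route definitions in src/api/routes/ for the real values. The function names are hypothetical.

```python
import json
import urllib.request


def build_search_request(
    base_url: str, path: str, query: str, top_k: int = 5
) -> urllib.request.Request:
    """Build a JSON request carrying the documented `query` / `top_k` body.

    ASSUMPTION: the endpoint accepts POST; `path` must be supplied by the
    caller because the actual route is not documented here.
    """
    body = json.dumps({"query": query, "top_k": top_k}, ensure_ascii=False)
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response_json: dict) -> list[tuple[str, float]]:
    """Return (reference, score) pairs from a response of the documented shape."""
    return [(r["reference"], r["score"]) for r in response_json.get("results", [])]


def search(base_url: str, path: str, query: str, top_k: int = 5) -> list[tuple[str, float]]:
    """Send the request to a running server and summarize the results."""
    request = build_search_request(base_url, path, query, top_k)
    with urllib.request.urlopen(request) as response:
        return summarize(json.loads(response.read().decode("utf-8")))
```

With the server running locally, a call such as `search("http://127.0.0.1:8000", "/search", "fasting while travelling")` would return the references and fused scores, provided the assumed path and method match the actual route.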