
Hadith - حديث

An AI-powered semantic and keyword search engine for Islamic Hadiths, built with FastAPI, Qdrant, and hybrid retrieval techniques.


📖 Description

Hadith is an end-to-end pipeline that scrapes, processes, indexes, and searches Islamic Hadith texts across multiple major books (Sahih al-Bukhari, Sahih Muslim, and more). It combines semantic vector search (via multilingual embeddings stored in Qdrant) with keyword-based BM25 search (via Whoosh) to deliver highly relevant results in both Arabic and English.


🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Data Pipeline                        │
│                                                             │
│  sunnah.com ──► Scraper ──► Repository (DB / Excel)         │
│                                   │                         │
│                                   ▼                         │
│                    Chunker ──► Embedder ──► Qdrant          │
│                                   │                         │
│                                   ▼                         │
│                              Whoosh Index                   │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                       Search Pipeline                       │
│                                                             │
│  User Query                                                 │
│      │                                                      │
│      ├──► Semantic Search (Qdrant)  ──► [(id, score)]       │
│      │                                        │             │
│      └──► Keyword Search  (Whoosh)  ──► [(id, score)]       │
│                                               │             │
│                                    Normalize + Fuse         │
│                               (0.7 semantic + 0.3 BM25)     │
│                                               │             │
│                                    Fetch from PostgreSQL    │
│                                               │             │
│                                        Ranked Results       │
└─────────────────────────────────────────────────────────────┘
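The "Normalize + Fuse" step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the diagram only says "Normalize", so min-max scaling is an assumption here, and the names `min_max_normalize` and `fuse` are hypothetical. The 0.7 / 0.3 weights are the ones shown in the diagram.

```python
from typing import Dict, List, Tuple

Hits = List[Tuple[int, float]]  # (hadith id, raw score) pairs


def min_max_normalize(hits: Hits) -> Dict[int, float]:
    """Rescale raw scores to [0, 1] (assumed normalization scheme)."""
    if not hits:
        return {}
    scores = [score for _, score in hits]
    low, high = min(scores), max(scores)
    if high == low:
        # All scores are equal, so treat every hit as a full match.
        return {doc_id: 1.0 for doc_id, _ in hits}
    return {doc_id: (score - low) / (high - low) for doc_id, score in hits}


def fuse(
    semantic: Hits,
    keyword: Hits,
    w_semantic: float = 0.7,
    w_keyword: float = 0.3,
    top_k: int = 5,
) -> Hits:
    """Weighted sum of normalized scores; an id missing from one list scores 0 there."""
    sem = min_max_normalize(semantic)
    kw = min_max_normalize(keyword)
    fused = {
        doc_id: w_semantic * sem.get(doc_id, 0.0) + w_keyword * kw.get(doc_id, 0.0)
        for doc_id in sem.keys() | kw.keys()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:top_k]
```

Normalizing before fusing matters because cosine similarities and BM25 scores live on different scales; without it, the larger BM25 values would likely dominate the weighted sum regardless of the weights.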

📁 Project Structure

src/
├── api/
│   ├── routes/          # FastAPI route definitions
│   ├── schemas/         # Request / response Pydantic models
│   ├── dependencies.py  # Dependency injection
│   └── mapper.py        # Domain → response DTO mapping
├── application/
│   ├── interfaces/      # Abstract base classes
│   ├── services/        # SemanticSearch, KeywordSearch, HybridSearch
│   ├── embedder.py      # Embedding model wrapper
│   └── factories/       # VectorDB and DataLoader factories
├── domain/
│   ├── models/          # ScrapedHadith domain model
│   └── enums/           # EmbeddingType, HadithRepositoryType
├── infrastructure/
│   ├── scrapper/        # HTTP client and sunnah.com scraper
│   ├── repositories/    # DbHadithRepository, ExcelHadithRepository
│   ├── vectoreDb/       # QdrantDB wrapper
│   ├── KeywordDataStore/ # WhooshIndex wrapper
│   └── db/              # DB client and models (SQLAlchemy)
├── scripts/
│   ├── scrape.py        # Scraping pipeline
│   ├── vectorize.py     # Chunking + embedding + indexing pipeline
│   └── dbToWhoosh.py    # Populates the keyword store (Whoosh)
├── config.py            # App configuration via pydantic-settings
└── main.py              # FastAPI app entrypoint

⚙️ Installation

1. Clone the repository

git clone https://github.com/alikashlan10/Hadith.git
cd Hadith

2. Create and activate a virtual environment

conda create -n hadith python=3.11
conda activate hadith

3. Install dependencies

pip install -r requirements.txt

4. Set up environment variables

Copy the provided example file and fill in your own values:

cp .env.example .env

Then edit .env with your configuration. See .env.example for all required variables and their descriptions.


🚀 Running the Pipeline

Step 1 — Scrape Hadiths

Scrapes hadith texts from sunnah.com and saves them to your configured repository (database or Excel).

python -m src.scripts.scrape

Step 2 — Vectorize

Loads hadiths from the repository, chunks them, generates embeddings, and indexes them into Qdrant (vector search).

python -m src.scripts.vectorize
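
As a rough idea of what the chunking stage might look like, here is a small sketch that splits text into overlapping word windows. The function names, the window size, and the overlap are all hypothetical; the real chunker in this repository may work differently. The `passage: ` prefix follows the usage notes on the E5 model card, which recommend prefixing indexed texts with `passage: ` and search queries with `query: `; whether this project applies those prefixes is an assumption.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word windows of at most max_words, sharing `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the current window already reaches the end of the text
    return chunks


def as_passage(chunk: str) -> str:
    """Add the prefix that E5-style models expect for indexed documents."""
    return f"passage: {chunk}"
```

Each chunk would then be embedded and stored in Qdrant together with the id of the hadith it came from, so that a hit on any chunk can be traced back to the full record.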

Step 3 — Build the Keyword Index

Loads hadiths from the repository and inserts them into the Whoosh index (keyword search).

python -m src.scripts.dbToWhoosh
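
To give a feel for what a BM25 keyword score measures, the sketch below computes Okapi BM25 in plain Python over a handful of documents. It is only an illustration: it is not Whoosh's implementation, it uses naive whitespace tokenization (no Arabic normalization or stemming), and the function name `bm25_scores` and the `k1` / `b` values are assumptions chosen as commonly used defaults.

```python
import math
from collections import Counter
from typing import Dict


def bm25_scores(query: str, docs: Dict[int, str], k1: float = 1.2, b: float = 0.75) -> Dict[int, float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    if n_docs == 0:
        return {}
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores: Dict[int, float] = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Longer-than-average documents are penalized through b.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[doc_id] = score
    return scores
```

Unlike the semantic scores, these values are unbounded and depend on the corpus statistics, which is one reason the two score lists are normalized before they are combined.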

Step 4 — Start the API Server

uvicorn src.main:app --reload

The API will be available at http://127.0.0.1:8000


🔍 API

POST /search

Search for hadiths using hybrid semantic + keyword retrieval.

Request:

{
  "query": "fasting while travelling",
  "top_k": 5
}

Response:

{
  "query": "fasting while travelling",
  "top_k": 5,
  "returned": 5,
  "results": [
    {
      "db_id": 9878,
      "text_ar": "...",
      "text_en": "...",
      "book_name_ar": "صحيح مسلم",
      "book_name_en": "Sahih Muslim",
      "chapter_name_ar": "كتاب الصيام",
      "chapter_name_en": "The Book of Fasting",
      "reference": "Sahih Muslim 1121b",
      "score": 0.7
    }
  ]
}
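
A minimal client for this endpoint could look like the sketch below, which uses only the Python standard library. The helper names `build_search_request` and `search` are hypothetical, and the base URL assumes the server is running locally as started in Step 4.

```python
import json
import urllib.request
from typing import Any, Dict


def build_search_request(
    query: str,
    top_k: int = 5,
    base_url: str = "http://127.0.0.1:8000",  # assumed local dev server
) -> urllib.request.Request:
    """Build the POST /search request without sending it."""
    body = json.dumps({"query": query, "top_k": top_k}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, top_k: int = 5) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    request = build_search_request(query, top_k)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `search("fasting while travelling")["results"]` would return the list of hits, each carrying the fields shown in the response above (`reference`, `score`, `text_ar`, `text_en`, and so on).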

🛠️ Tech Stack

Layer            Technology
---------------  -------------------------
API              FastAPI
Vector Search    Qdrant
Keyword Search   Whoosh (BM25)
Embeddings       multilingual-e5-base
Relational DB    SQLAlchemy
Scraping         requests + BeautifulSoup
Config           pydantic-settings
