Multi-modal RAG with Hexagonal Architecture

🚀 A modular Multi-modal Retrieval-Augmented Generation (RAG) system. Index video, audio, and image files, ask questions about their content, and swap AI services (OpenAI, Gemini, Whisper, etc.) by changing an adapter instead of the core logic.

🏗️ Architecture: Hexagonal (Ports & Adapters)

This project follows a Hexagonal Architecture (also known as Ports and Adapters). The core business logic is completely decoupled from external services.

```mermaid
graph TD
    subgraph Domain
        Models[Models: VideoSegment, Dataset]
    end

    subgraph Application
        VP[VideoProcessor]
        RO[RAGOrchestrator]
    end

    subgraph Ports
        ITrans[ITranscriptionService]
        IVisual[IVisualDescriptionService]
        IOCR[IOCRService]
        IEmb[IEmbeddingService]
        IVec[IVectorStore]
        IChat[IChatService]
    end

    subgraph Adapters
        OpenAI[OpenAI Adapters: Whisper, GPT-4, Embeddings]
        Gemini[Gemini Adapters: 2.0 Flash, Embeddings]
        Local[Local Adapters: Pytesseract OCR, FAISS]
        Future[Potential: Claude, Pinecone...]
    end

    Application --> Domain
    Application --> Ports
    Adapters -.-> Ports
```

Why this architecture?

Provider Agnostic: Switch from OpenAI to Gemini or local models by simply adding a new adapter.
Testable: Easily mock external services to test core logic.
Maintainable: Clear separation of concerns makes the codebase easier to understand and evolve.
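
The points above can be illustrated with a minimal sketch, assuming ports are written as `typing.Protocol` classes. `ITranscriptionService` and `VideoProcessor` are names from the diagram; the `transcribe` method, its signature, `transcript_word_count`, and `FakeTranscriptionAdapter` are hypothetical and only show why the core logic can be tested without a real provider.

```python
from typing import Protocol


class ITranscriptionService(Protocol):
    """Port: the only thing the application layer knows about transcription.

    The method name and signature are assumptions made for this sketch.
    """

    def transcribe(self, media_path: str) -> str:
        ...


class VideoProcessor:
    """Application service that depends on the port, never on a provider SDK."""

    def __init__(self, transcription: ITranscriptionService) -> None:
        self._transcription = transcription

    def transcript_word_count(self, media_path: str) -> int:
        # Hypothetical piece of core logic: count the words in the transcript.
        return len(self._transcription.transcribe(media_path).split())


class FakeTranscriptionAdapter:
    """Test double: satisfies the port without network access or API keys."""

    def transcribe(self, media_path: str) -> str:
        return "hello world"


processor = VideoProcessor(FakeTranscriptionAdapter())
print(processor.transcript_word_count("my_video.mp4"))  # 2
```

Because `VideoProcessor` only sees the port, a real OpenAI, Gemini, or local adapter can replace the fake without touching the class.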

✨ Features

Video Transcription: Supports OpenAI Whisper (API), local Whisper (no API key needed), and Deepgram (accurate timestamps).
Scene Detection: Automatically detect scene changes for precise visual analysis.
Multi-modal Indexing: Combines Audio, OCR, and Visual Descriptions into a unified RAG index.
Smart Segmenting: Index by fixed time intervals (e.g., every 1s) with automatic overrides on scene changes.
Timestamped Answers: Get answers directly tied to specific video segments.
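
As a rough illustration of Smart Segmenting, the sketch below computes segment boundaries from a fixed interval plus a list of scene-change timestamps. It is one plausible reading of "automatic overrides", not the project's actual algorithm: here a scene change closes the current segment early and the interval grid restarts from that point. The function name and parameters are hypothetical.

```python
from typing import Iterable, List, Tuple


def segment_boundaries(
    start: float,
    end: float,
    interval: float,
    scene_changes: Iterable[float] = (),
) -> List[Tuple[float, float]]:
    """Split [start, end] into segments no longer than `interval` seconds.

    Assumption: each scene change inside the range forces a cut, and the
    fixed-interval grid restarts at that cut.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    if end <= start:
        return []

    # Keep only scene changes strictly inside the range, sorted and deduplicated.
    cuts = sorted({t for t in scene_changes if start < t < end})
    cuts.append(end)  # the end of the range is always the final cut

    segments: List[Tuple[float, float]] = []
    seg_start = start
    for cut in cuts:
        # Emit full-length segments until the next forced cut.
        while seg_start + interval < cut:
            segments.append((seg_start, seg_start + interval))
            seg_start += interval
        # Close the (possibly shorter) segment at the cut itself.
        segments.append((seg_start, cut))
        seg_start = cut
    return segments


print(segment_boundaries(0, 5, 2, scene_changes=[3]))
# [(0, 2), (2, 3), (3, 5)]
```

Each resulting `(start, end)` pair is what a timestamped answer would point back to.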

🛠️ Installation

Clone the repository:

```
git clone https://github.com/x-eight/rag-multimodal.git
cd rag-multimodal
```

Install dependencies:

```
pip install -e .
```

Note: Ensure you have ffmpeg and tesseract installed on your system.

Set up environment variables:
```
export OPENAI_API_KEY="your_api_key"
```
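
A missing key is easier to diagnose when it is checked at startup rather than on the first API call. The helper below is a hypothetical sketch of such a check (`require_env` is not part of this project); only `OPENAI_API_KEY` comes from the instructions above.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the indexer")
    return value


# Typical use at program start (commented out so the sketch runs without a key):
# api_key = require_env("OPENAI_API_KEY")
```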

🚀 Usage

1. Index a media file

You can index a video, audio, or image file. The system will automatically detect the file type based on the extension.
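
Extension-based detection could look roughly like the following. This is a sketch under assumptions: the function name and the extension sets are hypothetical, and the project's actual lists may differ.

```python
from pathlib import Path

# Hypothetical extension sets, chosen for illustration only.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".avi", ".webm"}
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def detect_media_type(file_path: str) -> str:
    """Classify a file as 'video', 'audio', or 'image' from its extension."""
    suffix = Path(file_path).suffix.lower()  # case-insensitive match
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    raise ValueError(f"Unsupported file extension: {suffix or '(none)'}")


print(detect_media_type("my_video.mp4"))  # video
```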

Video (Full):

```
python src/main.py index --file my_video.mp4
```

Video (Specific range, e.g., from 10s to 30s) with 2s interval:

```
python src/main.py index --file my_video.mp4 --start 10 --end 30 --interval 2
```

Audio:

```
python src/main.py index --file my_audio.mp3
```

Audio (Specific range with 5s interval):

```
python src/main.py index --file my_audio.mp3 --start 5 --end 15 --interval 5
```

Image:

```
python src/main.py index --file my_image.jpg
```

Indexing returns a unique Index ID (e.g., my_video_a1b2c3d4).
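
An ID of that shape can be built from the file name plus a short hex suffix. The sketch below assumes the suffix is random; the README does not say how the project derives it, so `make_index_id` and its behavior are hypothetical.

```python
import re
import uuid
from pathlib import Path


def make_index_id(file_path: str) -> str:
    """Build an ID shaped like `my_video_a1b2c3d4` from a file path."""
    stem = Path(file_path).stem
    # Replace anything other than letters, digits, and underscores.
    safe_stem = re.sub(r"[^A-Za-z0-9_]+", "_", stem).strip("_") or "index"
    suffix = uuid.uuid4().hex[:8]  # 8 random hex characters (assumption)
    return f"{safe_stem}_{suffix}"


print(make_index_id("clips/my_video.mp4"))
```

With a random suffix, indexing the same file twice yields two different IDs; a suffix derived from a content hash would make the ID stable instead.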

2. Query an Index

Use the Index ID generated above to ask questions.

```
python src/main.py query --id my_video_a1b2c3d4 --question "What happens in the first 5 seconds?"
```
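
The `index` and `query` commands shown in this section could be wired up with `argparse` along these lines. The flag names come from the examples above; the argument types, which flags are required, and the default interval of 1 second are assumptions, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with `index` and `query` subcommands."""
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    index = subparsers.add_parser("index", help="Index a media file")
    index.add_argument("--file", required=True, help="Path to a video, audio, or image file")
    index.add_argument("--start", type=float, default=None, help="Range start in seconds")
    index.add_argument("--end", type=float, default=None, help="Range end in seconds")
    # Default of 1 second is an assumption for this sketch.
    index.add_argument("--interval", type=float, default=1.0, help="Segment length in seconds")

    query = subparsers.add_parser("query", help="Ask a question about an index")
    query.add_argument("--id", required=True, help="Index ID returned by `index`")
    query.add_argument("--question", required=True, help="Question to answer")
    return parser


args = build_parser().parse_args(
    ["index", "--file", "my_video.mp4", "--start", "10", "--end", "30", "--interval", "2"]
)
print(args.command, args.file, args.start, args.end, args.interval)
# index my_video.mp4 10.0 30.0 2.0
```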

🔄 Swapping Services

To swap a service (e.g., using a Gemini adapter for transcription instead of the OpenAI one):

Create an adapter in src/adapters/ that implements the corresponding Port (Interface) from src/ports/.
Inject the new adapter in src/main.py.

```
# In src/main.py
# transcription_service = OpenAITranscriptionAdapter()
transcription_service = MyNewGeminiAdapter()  # Swapped in one line!
```
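
If the choice of provider should come from configuration rather than an edited line, the injection step can be moved into a small factory. This is a sketch, not the project's code: the `transcribe` signature, the registry, and `build_transcription_service` are hypothetical, and the two adapter classes are stand-ins that return placeholder strings instead of calling a provider.

```python
from typing import Callable, Dict, Protocol


class ITranscriptionService(Protocol):
    """Port with a hypothetical signature."""

    def transcribe(self, media_path: str) -> str:
        ...


class OpenAITranscriptionAdapter:
    """Stand-in adapter; a real one would call the OpenAI API."""

    def transcribe(self, media_path: str) -> str:
        return f"[openai transcript of {media_path}]"


class MyNewGeminiAdapter:
    """Stand-in adapter; a real one would call the Gemini API."""

    def transcribe(self, media_path: str) -> str:
        return f"[gemini transcript of {media_path}]"


# Registry mapping a provider name to a zero-argument adapter factory.
TRANSCRIPTION_ADAPTERS: Dict[str, Callable[[], ITranscriptionService]] = {
    "openai": OpenAITranscriptionAdapter,
    "gemini": MyNewGeminiAdapter,
}


def build_transcription_service(provider: str) -> ITranscriptionService:
    """Return the adapter registered under `provider`."""
    try:
        factory = TRANSCRIPTION_ADAPTERS[provider]
    except KeyError:
        raise ValueError(f"Unknown transcription provider: {provider!r}") from None
    return factory()


transcription_service = build_transcription_service("gemini")
print(transcription_service.transcribe("my_video.mp4"))
# [gemini transcript of my_video.mp4]
```

Adding a provider then means writing one adapter class and one registry entry; the application layer stays unchanged.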

📦 Requirements

openai
opencv-python
faiss-cpu
numpy
pillow
pytesseract

⚖️ License

This project is licensed under the MIT License.

Developed for the AI community.