🚀 A modular multi-modal Retrieval-Augmented Generation (RAG) system designed for flexibility. Process videos, extract intelligence, and swap AI services (OpenAI, Gemini, Whisper, etc.) by changing a single adapter.
This project follows a Hexagonal Architecture (also known as Ports and Adapters). The core business logic is completely decoupled from external services.

    graph TD
        subgraph Domain
            Models[Models: VideoSegment, Dataset]
        end
        subgraph Application
            VP[VideoProcessor]
            RO[RAGOrchestrator]
        end
        subgraph Ports
            ITrans[ITranscriptionService]
            IVisual[IVisualDescriptionService]
            IOCR[IOCRService]
            IEmb[IEmbeddingService]
            IVec[IVectorStore]
            IChat[IChatService]
        end
        subgraph Adapters
            OpenAI[OpenAI Adapters: Whisper, GPT-4, Embeddings]
            Gemini[Gemini Adapters: 2.0 Flash, Embeddings]
            Local[Local Adapters: Pytesseract OCR, FAISS]
            Future[Potential: Claude, Pinecone...]
        end
        Application --> Domain
        Application --> Ports
        Adapters -.-> Ports

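
To make the dependency direction in the diagram concrete, here is a minimal sketch of one port, one adapter, and an application service that depends only on the port. The names `ITranscriptionService` and `VideoProcessor` come from the diagram; the `transcribe` method, its signature, the `process` method, and `FakeTranscriptionAdapter` are assumptions made for illustration and may not match the project's actual code.

```python
from typing import Protocol


class ITranscriptionService(Protocol):
    """Port: what the application layer expects from any transcription backend."""

    def transcribe(self, audio_path: str) -> str:  # hypothetical signature
        ...


class FakeTranscriptionAdapter:
    """Adapter for tests: returns canned text without calling any external API."""

    def __init__(self, canned_text: str) -> None:
        self.canned_text = canned_text

    def transcribe(self, audio_path: str) -> str:
        return self.canned_text


class VideoProcessor:
    """Application service: knows the port, never a concrete provider."""

    def __init__(self, transcription: ITranscriptionService) -> None:
        self.transcription = transcription

    def process(self, audio_path: str) -> dict:  # hypothetical method
        text = self.transcription.transcribe(audio_path)
        return {"source": audio_path, "transcript": text}


# Any object with a matching transcribe() method can be injected.
processor = VideoProcessor(FakeTranscriptionAdapter("hello world"))
result = processor.process("clip.wav")
```

Because `VideoProcessor` only sees the port, a real provider adapter and the fake above are interchangeable, which is what makes the core logic easy to test in isolation.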
- Provider Agnostic: Switch from OpenAI to Gemini or local models by simply adding a new adapter.
- Testable: Easily mock external services to test core logic.
- Maintainable: Clear separation of concerns makes the codebase easier to understand and evolve.
- Video Transcription: Support for OpenAI Whisper (API), Local Whisper (runs locally, no API token needed), and Deepgram (accurate timestamps).
- Scene Detection: Automatically detect scene changes for precise visual analysis.
- Multi-modal Indexing: Combines Audio, OCR, and Visual Descriptions into a unified RAG index.
- Smart Segmenting: Index by fixed time intervals (e.g., every 1s) with automatic overrides on scene changes.
- Timestamped Answers: Get answers directly tied to specific video segments.
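
As a rough illustration of the "Smart Segmenting" idea, the sketch below cuts a timeline at fixed intervals and adds an extra cut wherever a scene change occurs. This is only one plausible reading of "automatic overrides on scene changes"; the function name `segment_boundaries` and its parameters are hypothetical and are not taken from the project's code.

```python
def segment_boundaries(duration, interval, scene_changes):
    """Return (start, end) segments covering [0, duration].

    Cuts are placed at every multiple of `interval` and, in addition,
    at every scene-change timestamp that falls inside the timeline.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")

    cuts = {0.0, float(duration)}

    # Fixed-interval cuts: 0, interval, 2 * interval, ...
    i = 1
    while i * interval < duration:
        cuts.add(float(i * interval))
        i += 1

    # Scene changes add extra cuts strictly inside the timeline.
    for t in scene_changes:
        if 0 < t < duration:
            cuts.add(float(t))

    ordered = sorted(cuts)
    # Pair consecutive cuts into (start, end) segments.
    return list(zip(ordered, ordered[1:]))


segments = segment_boundaries(duration=5.0, interval=2.0, scene_changes=[2.5])
```

With these inputs the scene change at 2.5s splits the 2s to 4s interval in two, so each resulting segment contains a single scene.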
- Clone the repository:

      git clone https://github.com/x-eight/rag-multimodal.git
      cd rag-multimodal

- Install dependencies:

      pip install -e .

  Note: Ensure you have `ffmpeg` and `tesseract` installed on your system.

- Set up environment variables:

      export OPENAI_API_KEY="your_api_key"

You can index a video, audio, or image file. The system will automatically detect the file type based on the extension.
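
Extension-based detection of this kind can be sketched as a simple lookup table. The mapping below is an assumption for illustration: only `.mp4`, `.mp3`, and `.jpg` appear in this README's examples, and the other extensions, the `MEDIA_TYPES` table, and the `detect_media_type` function are hypothetical.

```python
from pathlib import Path

# Hypothetical mapping; the project's real list of extensions may differ.
MEDIA_TYPES = {
    ".mp4": "video",
    ".mov": "video",
    ".mkv": "video",
    ".mp3": "audio",
    ".wav": "audio",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
}


def detect_media_type(file_path: str) -> str:
    """Classify a file as video, audio, or image from its extension."""
    suffix = Path(file_path).suffix.lower()  # case-insensitive match
    if suffix not in MEDIA_TYPES:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return MEDIA_TYPES[suffix]


kinds = [detect_media_type(p) for p in ("my_video.mp4", "my_audio.MP3", "my_image.jpg")]
```
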

Video (Full):

    python src/main.py index --file my_video.mp4

Video (Specific range, e.g., from 10s to 30s) with 2s interval:

    python src/main.py index --file my_video.mp4 --start 10 --end 30 --interval 2

Audio:

    python src/main.py index --file my_audio.mp3

Audio (Specific range with 5s interval):

    python src/main.py index --file my_audio.mp3 --start 5 --end 15 --interval 5

Image:

    python src/main.py index --file my_image.jpg

Indexing returns a unique Index ID (e.g., my_video_a1b2c3d4).
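
The example ID looks like the file's base name followed by eight hexadecimal characters. One way such an ID could be produced is sketched below. This is a guess based only on the shape of the example: the `make_index_id` helper is hypothetical, and the project may derive the suffix differently (for example, from a content hash rather than a random UUID).

```python
import uuid
from pathlib import Path


def make_index_id(file_path: str) -> str:
    """Build an ID like 'my_video_a1b2c3d4' from a file path (illustrative only)."""
    stem = Path(file_path).stem      # 'my_video' for 'videos/my_video.mp4'
    suffix = uuid.uuid4().hex[:8]    # 8 random lowercase hex characters
    return f"{stem}_{suffix}"


index_id = make_index_id("videos/my_video.mp4")
```
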
Use the Index ID generated above to ask questions.

    python src/main.py query --id my_video_a1b2c3d4 --question "What happens in the first 5 seconds?"

To swap a service (e.g., using Gemini for visual descriptions):
- Create an adapter in `src/adapters/` that implements the corresponding Port (Interface) from `src/ports/`.
- Inject the new adapter in `src/main.py`:

      # In src/main.py
      # transcription_service = OpenAITranscriptionAdapter()
      transcription_service = MyNewGeminiAdapter()  # Swapped in one line!

Dependencies:

- `openai`
- `opencv-python`
- `faiss-cpu`
- `numpy`
- `pillow`
- `pytesseract`

This project is licensed under the MIT License.
Developed for the AI community.