A RAG-based AI consultant tool that processes videos end-to-end and answers questions strictly from video content with full source citations.
Run locally:

    streamlit run app/streamlit_app.py

| Component | Tool |
|---|---|
| Video download | yt-dlp + Playwright |
| Audio extraction | ffmpeg |
| Transcription | Groq Whisper large-v3 |
| Keyframe extraction | OpenCV + PySceneDetect |
| OCR | Groq Vision (llama-4-scout) |
| Visual analysis | Groq Vision (llama-4-scout) |
| Embeddings | sentence-transformers all-MiniLM-L6-v2 |
| Vector store | ChromaDB (local) |
| Q&A answers | Groq LLaMA 3.3 70b |
| Demo UI | Streamlit |

Project structure:

    vidiq-rag/
    ├── pipeline/
    │   ├── downloader.py      # Video download
    │   ├── extractor.py       # Audio extraction
    │   ├── transcriber.py     # Whisper transcription
    │   ├── keyframer.py       # Keyframe extraction
    │   ├── ocr.py             # OCR on frames
    │   ├── vision.py          # Visual analysis
    │   ├── chunker.py         # Chunk builder
    │   ├── embedder.py        # Embeddings + ChromaDB
    │   └── qa.py              # Retrieval + Q&A
    ├── app/
    │   └── streamlit_app.py   # Demo UI
    ├── data/
    │   └── videos.json        # Video URLs
    ├── transcripts/           # Whisper transcripts
    ├── keyframes/             # Extracted frames
    ├── ocr_output/            # OCR results
    ├── visual_analysis/       # Vision descriptions
    ├── chunks/                # RAG chunks
    ├── vectorstore/           # ChromaDB storage
    └── outputs/               # Sample Q&A results

Clone the repository:

    git clone https://github.com/YOUR_USERNAME/vidiq-rag.git
    cd vidiq-rag

Create and activate a virtual environment:

    uv venv
    # Windows:
    .venv\Scripts\activate
    # Mac/Linux:
    source .venv/bin/activate

Install dependencies:

    uv add yt-dlp ffmpeg-python scenedetect pytesseract pillow chromadb \
      google-genai python-dotenv groq opencv-python sentence-transformers \
      streamlit playwright

Set your Groq API key (for example in a `.env` file):

    GROQ_API_KEY=your_groq_key_here

Name the video files `video_01.mp4` through `video_10.mp4`.
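
A quick way to confirm the naming convention before running the pipeline is sketched below. The helper names and the `video_dir` argument are hypothetical, since the directory the pipeline reads videos from is not stated here; pass whichever folder holds your files.

```python
from pathlib import Path


def expected_video_names(count: int = 10) -> list[str]:
    """Return video_01.mp4 ... video_NN.mp4, zero-padded to two digits."""
    return [f"video_{i:02d}.mp4" for i in range(1, count + 1)]


def missing_videos(video_dir: Path, count: int = 10) -> list[str]:
    """List the expected file names that are not present in video_dir.

    video_dir is an assumption: point it at the folder your videos live in.
    """
    return [
        name
        for name in expected_video_names(count)
        if not (video_dir / name).is_file()
    ]
```

An empty list from `missing_videos(...)` means all expected files are in place.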

Run the pipeline stages in order:

    python pipeline/extractor.py
    python pipeline/transcriber.py
    python pipeline/keyframer.py
    python pipeline/ocr.py
    python pipeline/vision.py
    python pipeline/chunker.py
    python pipeline/embedder.py

Launch the demo UI:

    streamlit run app/streamlit_app.py

Pipeline flow:

    Video files
      ↓
    Audio extraction (ffmpeg)
      ↓
    Transcription with timestamps (Groq Whisper)
      ↓
    Keyframe extraction (OpenCV + PySceneDetect)
      ↓
    OCR on frames (Groq Vision)
      ↓
    Visual analysis (Groq Vision)
      ↓
    Chunk builder (transcript + OCR + visual combined)
      ↓
    Embeddings + ChromaDB storage
      ↓
    RAG retrieval + LLM answer generation
      ↓
    Streamlit demo UI
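
The chunk builder step, which combines transcript, OCR, and visual output, could look roughly like the sketch below. It is not the actual `chunker.py`: the `Segment`, `Frame`, and `Chunk` types, their field names, and the rule of attaching a frame to the transcript segment whose time range contains it are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One timestamped transcript segment (hypothetical shape)."""
    start: float
    end: float
    text: str


@dataclass
class Frame:
    """OCR and visual-analysis output for one keyframe (hypothetical shape)."""
    timestamp: float
    ocr_text: str
    visual_description: str


@dataclass
class Chunk:
    video_id: str
    start: float
    end: float
    text: str


def build_chunks(video_id: str, segments: list[Segment], frames: list[Frame]) -> list[Chunk]:
    """Merge each transcript segment with the frames that fall inside it."""
    chunks = []
    for seg in segments:
        parts = [f"[transcript] {seg.text}"]
        for frame in frames:
            # Assumed rule: a frame belongs to the segment covering its timestamp.
            if seg.start <= frame.timestamp < seg.end:
                if frame.ocr_text:
                    parts.append(f"[on-screen text] {frame.ocr_text}")
                if frame.visual_description:
                    parts.append(f"[visual] {frame.visual_description}")
        chunks.append(Chunk(video_id, seg.start, seg.end, "\n".join(parts)))
    return chunks
```

Keeping `video_id`, `start`, and `end` on every chunk is what makes timestamped source citations possible at answer time.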

The system answers only from retrieved video chunks. When the answer is not found in the videos, the LLM is instructed to respond with: "I don't know based on the provided video content."
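
One way to enforce this behaviour is sketched below. The prompt wording, the chunk dictionary keys (`video_id`, `start`, `end`, `text`), and the `llm` callable are assumptions for illustration and not the actual `qa.py`; only the refusal sentence is taken from this README.

```python
from typing import Callable

REFUSAL = "I don't know based on the provided video content."


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS for citations."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Build a prompt that restricts the model to the retrieved excerpts."""
    sources = "\n\n".join(
        f"[{c['video_id']} {format_timestamp(c['start'])}-{format_timestamp(c['end'])}]\n{c['text']}"
        for c in chunks
    )
    return (
        "Answer the question using ONLY the video excerpts below. "
        "Cite the video and timestamp for every claim. "
        f'If the excerpts do not contain the answer, reply exactly: "{REFUSAL}"\n\n'
        f"Excerpts:\n{sources}\n\nQuestion: {question}"
    )


def answer_or_refuse(question: str, chunks: list[dict], llm: Callable[[str], str]) -> str:
    """Refuse without calling the model when retrieval returned nothing."""
    if not chunks:
        return REFUSAL
    return llm(build_prompt(question, chunks))
```

Passing the model as a plain callable keeps the sketch independent of any particular client library; in practice `llm` would wrap the Groq chat call.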
| Scale | Approach |
|---|---|
| 10 videos | Local machine, current setup |
| 100 videos | Parallel workers, Qdrant Cloud |
| 1000 videos | Celery + Redis task queue, cloud storage |
| 14000 videos | Distributed pipeline, Prefect orchestration |
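
For the "parallel workers" tier, a minimal sketch using only the standard library might look like this. `process_video` is a hypothetical per-video function that runs all stages for one file, and the worker count is an arbitrary default. Threads are assumed to be adequate here because most stages wait on API calls or external processes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable


def process_all(
    video_ids: list[str],
    process_video: Callable[[str], Any],  # hypothetical: runs the pipeline for one video
    max_workers: int = 4,
) -> tuple[dict[str, Any], dict[str, str]]:
    """Process videos concurrently; collect results and failures separately."""
    results: dict[str, Any] = {}
    failures: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_video, vid): vid for vid in video_ids}
        for future in as_completed(futures):
            vid = futures[future]
            try:
                results[vid] = future.result()
            except Exception as exc:
                # One failed video should not abort the whole batch.
                failures[vid] = str(exc)
    return results, failures
```

Returning failures separately makes it easy to retry only the videos that broke, which is the same idea a task queue such as Celery formalizes at larger scale.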

See `outputs/sample_qa_readable.txt` for 10 answered questions with source citations and 2 no-answer refusal tests.