A production-grade Retrieval-Augmented Generation (RAG) system that lets you ask questions about any GitHub repository and get accurate answers with file and line number citations.
- Paste a GitHub repo URL — the app fetches all code files via the GitHub API
- Files are parsed using tree-sitter (AST-aware chunking by function/class boundaries)
- Each chunk is embedded using both dense (semantic) and sparse (BM42 keyword) vectors
- Vectors are stored in Qdrant Cloud with repo-level tenant isolation
- When you ask a question, hybrid search (RRF fusion) retrieves the most relevant chunks
- A cross-encoder re-ranker re-scores results for precision
- Gemini generates an answer grounded strictly in retrieved chunks, with file + line citations
Client (curl / API)
│
▼
┌───────────────────┐
│ FastAPI Server │ handles HTTP, query, evaluate
│ (api service) │
└────────┬──────────┘
│ enqueue job
▼
┌───────────────────┐
│ Redis │ job queue + job state storage
└────────┬──────────┘
│ poll + dequeue
▼
┌───────────────────┐
│ ARQ Worker │ runs indexing pipeline
│ (worker service) │
└────────┬──────────┘
│ HTTP (embed/rerank)
▼
┌───────────────────┐
│ Embedding Service │ dense + sparse embeddings, reranking
│ (embedding svc) │ models loaded once, shared by all
└───────────────────┘
│
▼
┌───────────────────┐
│ Qdrant Cloud │ vector storage, hybrid search
└───────────────────┘
AST-aware chunking Uses tree-sitter to split code at function and class boundaries. Supports TypeScript, TSX, JavaScript, Python, Go, Rust, Java, C, and C++. Falls back to line-based chunking for unsupported types.
Hybrid search Combines dense vector search (semantic similarity) with BM42 sparse keyword search, fused via Reciprocal Rank Fusion (RRF). Catches both meaning-level queries and exact matches like function names.
Cross-encoder re-ranking
After retrieval, a cross-encoder/ms-marco-MiniLM-L-6-v2 model re-scores candidates by reading the question and chunk together — significantly improving result precision over vector similarity alone.
Multi-repo support
All repos share one Qdrant collection, isolated by a repo payload field with a keyword tenant index. Each query is scoped to the repo you specify.
Idempotent re-indexing File hashes are stored in Redis per repo. On re-index, only changed or new files are re-embedded. Deleted files are removed from Qdrant. Unchanged files are skipped entirely.
Microservices architecture Three independent services — API server, ARQ worker, and embedding service. Embedding models are loaded once in the embedding service and shared by both the API (for query-time embedding) and the worker (for indexing). No model duplication across processes.
Async pipeline
Built on FastAPI with full async support. CPU-bound work (embedding, reranking) runs in threadpool executors via asyncio.to_thread. Network I/O (GitHub API, Qdrant, Gemini, Redis) is fully async.
Job queue with retries
POST /api/index returns immediately. Indexing runs as an ARQ background job. Failed jobs retry up to 3 times with exponential backoff. Job state (status, progress) stored in Redis with 24hr TTL.
Indexing progress tracking
Poll GET /api/index/{job_id} to get real-time progress — total files, fetched, chunked, embedded, and stored counts.
RAG evaluation endpoint
Dedicated /api/evaluate scores pipeline quality using DeepEval — faithfulness, answer relevancy, contextual precision, and contextual recall. Separate from the query hot path.
Tracing
Every query request is traced end-to-end using DeepEval's @observe decorator — agent → retriever → reranker → LLM — visible in the Confident AI dashboard.
| Layer | Technology |
|---|---|
| API Server | FastAPI (Python) |
| Job Queue | ARQ + Redis |
| Vector DB | Qdrant Cloud |
| Dense Embeddings | all-MiniLM-L6-v2 (sentence-transformers) |
| Sparse Embeddings | Qdrant/bm42-all-minilm-l6-v2-attentions (fastembed) |
| Re-ranking | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| AST Parsing | tree-sitter + tree-sitter-language-pack |
| LLM | Gemini 2.5 Flash (Google AI Studio) |
| Tracing + Evaluation | DeepEval + Confident AI |
| Async HTTP | httpx |
| Containerization | Docker + Docker Compose |
gh-rag-app/
├── docker-compose.yml
├── README.md
│
├── server/
│ ├── main.py # FastAPI app, lifespan, routes
│ ├── constants.py # Extensions, node types, blocked dirs
│ ├── worker.py # ARQ worker settings + task
│ ├── requirements.txt
│ ├── Dockerfile
│ ├── .env
│ ├── core/
│ │ ├── clients.py # Qdrant, Gemini, Redis, ARQ pool
│ │ ├── database.py # Redis job state operations
│ │ ├── fetcher.py # GitHub API — fetch repo file tree
│ │ ├── chunker.py # tree-sitter AST chunking
│ │ ├── embedder.py # Calls embedding service, batched
│ │ ├── embedding_client.py # HTTP client for embedding service
│ │ ├── retriever.py # Hybrid search, reranking, ask()
│ │ ├── generator.py # Prompt building + Gemini call
│ │ └── indexer.py # Full indexing pipeline
│ └── api/
│ ├── models.py # Pydantic request/response models
│ └── routes/
│ ├── index.py # POST /api/index, GET, DELETE
│ ├── query.py # POST /api/query
│ └── evaluate.py # POST /api/evaluate
│
└── embedding_service/
├── main.py # FastAPI embedding + rerank endpoints
├── requirements.txt
└── Dockerfile
- Python 3.11
- Docker Desktop (for Docker setup)
- Qdrant Cloud free cluster
- Google AI Studio API key
- Upstash Redis free instance (or local Redis)
- Confident AI account (optional, for tracing)
- GitHub Personal Access Token (optional, for higher rate limits)
1. Clone the repo
git clone https://github.com/kVarunkk/GitHub-Codebase-RAG.git
cd GitHub-Codebase-RAG2. Set up server
cd server
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # Mac/Linux
pip install -r requirements.txt3. Set up embedding service
cd ../embedding_service
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt4. Configure environment variables
Create server/.env:
QDRANT_URL=https://your-cluster.qdrant.io
QDRANT_API_KEY=your-qdrant-api-key
GOOGLE_API_KEY=your-ai-studio-api-key
GITHUB_TOKEN=your-github-pat
REDIS_URL=redis://localhost:6379
EMBEDDING_SERVICE_URL=http://localhost:8001
CONFIDENT_API_KEY=your-confident-ai-key5. Run all three services in separate terminals
# Terminal 1 — embedding service
cd embedding_service && uvicorn main:app --port 8001 --reload
# Terminal 2 — API server
cd server && uvicorn main:app --port 8000 --reload
# Terminal 3 — ARQ worker
cd server && python -u -m arq worker.WorkerSettings1. Clone the repo
git clone https://github.com/kVarunkk/GitHub-Codebase-RAG.git
cd GitHub-Codebase-RAG2. Configure environment variables
Create server/.env with the same variables as above, but update:
REDIS_URL=redis://redis:6379
EMBEDDING_SERVICE_URL=http://embedding:80013. Build and run
docker compose up --build4. Subsequent runs (no code changes)
docker compose up5. Stop
docker compose downView logs:
docker compose logs -f # all services
docker compose logs -f worker # worker only
docker compose logs -f api # api only{ "status": "ok" }Start indexing a GitHub repository. Returns immediately — indexing runs in background.
Request:
{
"repo_url": "https://github.com/owner/repo",
"branch": "main"
}Response:
{
"job_id": "uuid",
"status": "pending",
"message": "Indexing queued for owner/repo"
}Poll indexing status and progress.
Response:
{
"job_id": "uuid",
"status": "running",
"message": "Attempt 1",
"progress": {
"total_files": 333,
"fetched_files": 210,
"chunked_files": 210,
"embedded_chunks": 800,
"stored_chunks": 700
}
}Status values: pending | running | done | failed
Returns a list of all unique repository identifiers that have been successfully chunked and stored inside the vector database.
Response:
{
"success": "true",
"repos": ["repo1", "repo2", ...],
"count": 5
}Retires failed job.
Response:
{
"success": "true",
"message": "Retired job 12fewgrg32-gerg32-gre"
}Delete all indexed vectors for a specific repo.
Example:
curl -X DELETE "http://localhost:8000/api/index?repo=owner/repo"Response:
{
"success": true,
"message": "Deleted all points for owner/repo"
}Ask a question about an indexed repository.
Request:
{
"question": "How does authentication work?",
"repo": "owner/repo",
"candidate_k": 20,
"final_k": 5
}Response:
{
"answer": "Authentication uses bearer tokens validated by requireBearerAuth middleware...",
"citations": [
{
"path": "src/auth/middleware.ts",
"start_line": 1,
"end_line": 24
}
]
}Run RAG evaluation metrics on a question. Slower than /api/query — intended for pipeline quality testing only.
Request:
{
"question": "How does authentication work?",
"repo": "owner/repo",
"expected_answer": "optional ground truth for precision/recall",
"candidate_k": 20,
"final_k": 5
}Response:
{
"question": "How does authentication work?",
"answer": "Authentication uses bearer tokens...",
"passed": true,
"metrics": [
{
"name": "Answer Relevancy",
"score": 0.95,
"threshold": 0.7,
"passed": true,
"reason": "The answer directly addresses the question."
},
{
"name": "Faithfulness",
"score": 1.0,
"threshold": 0.7,
"passed": true,
"reason": "All claims are grounded in the retrieved chunks."
}
],
"confident_link": "https://app.confident-ai.com/..."
}
ContextualPrecisionandContextualRecallare only evaluated whenexpected_answeris provided.
| Variable | Required | Description |
|---|---|---|
QDRANT_URL |
Yes | Qdrant Cloud cluster URL |
QDRANT_API_KEY |
Yes | Qdrant Cloud API key |
GOOGLE_API_KEY |
Yes | Google AI Studio key for Gemini |
REDIS_URL |
Yes | Redis connection URL |
EMBEDDING_SERVICE_URL |
Yes | URL of the embedding service |
GITHUB_TOKEN |
Recommended | GitHub PAT — raises rate limit from 60 to 5000 req/hr |
CONFIDENT_API_KEY |
Optional | Confident AI key for tracing dashboard |
- Streaming LLM responses (Server-Sent Events)
- Metadata filtering by file path or language at query time
- Support for private repositories
- Switch to
BAAI/bge-base-en-v1.5for better code embedding quality - tree-sitter upgrade to v0.23+ with individual language packages