A full-stack AI-powered codebase chatbot application that allows users to chat with their GitHub repositories using Retrieval Augmented Generation (RAG). The system indexes code repositories and provides intelligent answers based on the actual codebase content.
This project consists of three main components working together:
- Frontend - React-based user interface
- Main Server - Express.js API server handling requests and business logic
- Private Server - Background worker processing repository indexing jobs
The system uses a microservices architecture with:
- PostgreSQL - Relational database for user data, repositories, chats, and prompts
- Weaviate - Vector database for semantic search of code chunks
- BullMQ + Redis (Upstash) - Job queue system for asynchronous processing
- JinaAI - Embedding service for vectorizing code chunks
- Google Gemini - LLM for generating responses
The Main Server (Main server/) is the central API server that handles all HTTP requests, authentication, and coordinates between the frontend, database, and background workers.
- User Registration: Creates new user accounts with bcrypt-hashed passwords
- User Login: Validates credentials and issues JWT tokens (7-day expiration)
- Protected Routes: Middleware validates JWT tokens on protected endpoints
- User Management: Stores user data (name, email, password) in PostgreSQL
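The token lifecycle above (issue on login, validate on protected routes, expire after 7 days) can be sketched with Node's built-in `crypto` module. This is only an illustration of an HS256 JWT, not the project's code: the JWT library the server actually uses is not shown here, and `signToken`, `verifyToken`, and the payload shape are hypothetical names chosen for this example.

```javascript
const crypto = require("node:crypto");

const SEVEN_DAYS_SECONDS = 7 * 24 * 60 * 60;

// Encode a string as unpadded base64url, as JWT segments require.
function base64url(input) {
  return Buffer.from(input).toString("base64url");
}

// Hypothetical helper: issue an HS256 token that expires after 7 days.
function signToken(payload, secret, nowSeconds = Math.floor(Date.now() / 1000)) {
  const header = base64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const claims = { ...payload, iat: nowSeconds, exp: nowSeconds + SEVEN_DAYS_SECONDS };
  const body = base64url(JSON.stringify(claims));
  const signature = crypto
    .createHmac("sha256", secret)
    .update(`${header}.${body}`)
    .digest("base64url");
  return `${header}.${body}.${signature}`;
}

// Hypothetical helper: return the claims, or null if the token is
// malformed, carries a bad signature, or has expired.
function verifyToken(token, secret, nowSeconds = Math.floor(Date.now() / 1000)) {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header, body, signature] = parts;
  const expected = crypto.createHmac("sha256", secret).update(`${header}.${body}`).digest();
  const given = Buffer.from(signature, "base64url");
  // timingSafeEqual requires equal lengths, so check that first.
  if (given.length !== expected.length || !crypto.timingSafeEqual(given, expected)) return null;
  const claims = JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
  return claims.exp > nowSeconds ? claims : null;
}
```

An auth middleware would call something like `verifyToken` on the bearer token and reject the request when it returns `null`.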
When a user submits a GitHub repository URL:
- Stores Repository Metadata: Saves the repository URL and associated user email to PostgreSQL
- Queues Indexing Job: Immediately adds a job to the BullMQ queue (`repo-index-queue`) with the repository URL and user email
- Returns Immediately: The API responds quickly while indexing happens in the background
- Repository CRUD: Users can view all their repositories, get specific repository details, and delete repositories
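The save-then-queue flow for a submitted repository might look like the sketch below. The `repoStore` and `queue` objects are injected stand-ins (assumptions) so the flow can run without PostgreSQL or Redis; a BullMQ `Queue` exposes an `add(name, data)` method of this shape. The job name `index-repo` and the job data field names are hypothetical.

```javascript
// Hypothetical handler core: persist metadata, enqueue the job, return at once.
async function saveRepo({ repoStore, queue }, { repoUrl, email }) {
  // 1. Save the repository URL and owner email.
  const repo = await repoStore.create({ repo_url: repoUrl, email });

  // 2. Hand the heavy work to the background worker via the queue.
  await queue.add("index-repo", { repoUrl, email });

  // 3. Respond without waiting for indexing to finish.
  return repo;
}
```

Because the handler only awaits a database write and a queue write, its latency does not depend on the size of the repository.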
- Chat Creation: Creates new chat sessions associated with a repository
- Prompt Handling: When a user sends a question:
  - Creates a prompt record in the database with status "processing"
  - Immediately responds to the frontend with the prompt ID
  - Asynchronously processes the query using RAG:
    - Retrieves relevant code chunks from Weaviate using semantic search
    - Filters results by user email and repository URL
    - Builds an augmented prompt with context
    - Sends to Gemini LLM for answer generation
    - Updates the prompt record with the generated response
- Chat History: Retrieves all prompts and responses for a chat session
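The respond-first, answer-later pattern used for prompts can be sketched as follows. `promptStore` and `answerWithRag` are injected, hypothetical stand-ins for the database model and the RAG pipeline, and the `"completed"` and `"failed"` status values are assumptions; only `"processing"` is named in this document.

```javascript
// Hypothetical sketch: create the record, start background work, return the ID.
function submitPrompt({ promptStore, answerWithRag }, { chatId, question }) {
  const record = promptStore.create({
    chat_id: chatId,
    prompt: question,
    status: "processing",
  });

  // Started but not awaited on the response path. Promise.resolve().then(...)
  // also turns a synchronous throw inside answerWithRag into a rejection.
  const done = Promise.resolve()
    .then(() => answerWithRag(question))
    .then((response) => promptStore.update(record.prompt_id, { response, status: "completed" }))
    .catch((err) => promptStore.update(record.prompt_id, { status: "failed", error: String(err) }));

  // `done` is returned only so callers (and tests) can observe completion.
  return { promptId: record.prompt_id, done };
}
```

An HTTP handler would send `promptId` back to the client and ignore `done`.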
The Main Server performs intelligent code search and answer generation:
Semantic Search Process:
- Takes user's natural language query
- Searches the Weaviate vector database using a `nearText` query
- Retrieves top 5 most semantically similar code chunks
- Filters results to only include chunks from the user's specific repository
- Each chunk includes metadata: file path, repository URL, user ID, and similarity score
Answer Generation:
- Builds an augmented prompt that includes:
  - User's original question
  - Retrieved code snippets with their source files
  - Similarity scores for each snippet
  - Instructions to answer using only the provided code context
- Sends the augmented prompt to Google Gemini 2.5 Flash model
- Returns the generated answer with references to source code
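As an illustration of the augmentation step, the function below assembles a prompt from the question and the retrieved chunks. The chunk shape (`source`, `text`, `score`), the function name, and the exact instruction wording are assumptions made for this sketch; the project's `augmenter.js` may differ.

```javascript
// Hypothetical sketch: build the augmented prompt sent to the LLM.
function buildAugmentedPrompt(question, chunks) {
  // One labelled section per snippet: source file, similarity score, code.
  const context = chunks
    .map((chunk, i) =>
      `[Snippet ${i + 1}] ${chunk.source} (similarity: ${chunk.score.toFixed(2)})\n${chunk.text}`)
    .join("\n\n");

  return [
    "Answer the question using only the code context below.",
    "If the context is insufficient, say so.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

Keeping the source path next to each snippet is what lets the model cite files in its answer.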
- Express.js - Web framework
- Sequelize - PostgreSQL ORM
- BullMQ - Job queue client (adds jobs to queue)
- Weaviate Client - Vector database queries
- Google Gemini API - LLM for answer generation
- JWT - Authentication tokens
- bcrypt - Password hashing
The Private Server (Private server/) is a dedicated background worker that processes repository indexing jobs asynchronously. It runs independently from the Main Server and handles the computationally intensive task of indexing codebases.
- Connects to Redis/Upstash using BullMQ configuration
- Creates a Worker instance listening to the `repo-index-queue` queue
- Processes up to 5 jobs concurrently (configurable)
- Runs continuously, waiting for new indexing jobs
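A worker configured this way could derive its options from the environment, as in the sketch below. Only `UPSTASH_PASSWORD` appears in this document's environment files; `UPSTASH_HOST`, `UPSTASH_PORT`, and `WORKER_CONCURRENCY` are hypothetical variables introduced for illustration, as is the `indexRepository` function named in the usage comment.

```javascript
// Hypothetical sketch: options object for a BullMQ Worker.
function buildWorkerOptions(env = process.env) {
  return {
    connection: {
      host: env.UPSTASH_HOST,                  // hypothetical variable
      port: Number(env.UPSTASH_PORT ?? 6379),  // hypothetical variable
      password: env.UPSTASH_PASSWORD,
      tls: {},                                 // assumption: the hosted Redis endpoint expects TLS
    },
    // Upper bound on jobs processed in parallel by this worker instance.
    concurrency: Number(env.WORKER_CONCURRENCY ?? 5),
  };
}

// Usage (requires the bullmq package, not loaded here):
// const { Worker } = require("bullmq");
// const worker = new Worker(
//   "repo-index-queue",
//   async (job) => indexRepository(job.data),
//   buildWorkerOptions()
// );
```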
When a job is received from the queue, the worker performs the following steps:
Step 1: Repository Cloning
- Creates a temporary directory in the system's temp folder
- Clones the GitHub repository using `git clone --depth 1` (shallow clone for efficiency)
- Uses the repository URL and user email from the job data
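A shallow clone can be run from Node roughly as follows. The helper names are hypothetical; passing arguments as an array to `execFile` avoids shell interpolation of the user-supplied URL, and the `--` separator is an extra precaution added in this sketch rather than something the document states.

```javascript
const { execFile } = require("node:child_process");
const { promisify } = require("node:util");

const execFileAsync = promisify(execFile);

// Build the argument list for `git clone --depth 1`.
function buildCloneArgs(repoUrl, targetDir) {
  // "--" ends option parsing, so a URL starting with "-" is not read as a flag.
  return ["clone", "--depth", "1", "--", repoUrl, targetDir];
}

// Clone only the latest commit of repoUrl into targetDir.
async function shallowClone(repoUrl, targetDir) {
  await execFileAsync("git", buildCloneArgs(repoUrl, targetDir));
  return targetDir;
}
```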
Step 2: File Discovery
- Scans the cloned repository using glob patterns
- Filters out unnecessary files:
  - `node_modules/` directories
  - `.git/` directories
  - Markdown files (`*.md`)
  - JSON files (`*.json`)
  - Lock files (`*.lock`)
- Collects all code files from the repository
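The exclusion rules above can be expressed as a small predicate. This is a dependency-free sketch; the worker itself applies glob patterns, and `isIndexable` is a hypothetical name.

```javascript
const path = require("node:path");

const EXCLUDED_DIRS = new Set(["node_modules", ".git"]);
const EXCLUDED_EXTENSIONS = new Set([".md", ".json", ".lock"]);

// Return true if a repository-relative path should be indexed.
function isIndexable(relativePath) {
  // Reject anything located inside an excluded directory at any depth.
  const segments = relativePath.split(/[\\/]/);
  if (segments.some((segment) => EXCLUDED_DIRS.has(segment))) return false;

  // Reject documentation, data, and lock files by extension.
  return !EXCLUDED_EXTENSIONS.has(path.extname(relativePath).toLowerCase());
}
```

For example, `["src/index.js", "README.md", "node_modules/a/b.js"].filter(isIndexable)` keeps only `src/index.js`.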
Step 3: Code Processing & Chunking
For each code file:
- Loads the file content using LangChain's TextLoader
- Determines the programming language based on file extension:
  - JavaScript/TypeScript (`.js`, `.jsx`, `.ts`, `.tsx`)
  - Python (`.py`)
  - Java (`.java`)
  - Go (`.go`)
  - C++ (`.cpp`)
  - PHP (`.php`)
- Uses language-specific text splitters:
  - Language-aware splitters for recognized languages (800-character chunks, 100-character overlap)
  - Generic splitter for unknown languages (1000-character chunks, 100-character overlap)
- Creates multiple chunks from each file to handle large files
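To make the chunk sizes and overlap concrete, the sketch below maps extensions to splitter settings and applies a naive fixed-width splitter. The naive splitter is a deliberate stand-in: LangChain's language-aware splitters prefer to break at syntactic boundaries, which this example does not attempt. The language labels and function names are illustrative assumptions.

```javascript
const path = require("node:path");

// Extension-to-language table taken from the list above; labels are illustrative.
const LANGUAGE_BY_EXTENSION = new Map([
  [".js", "js"], [".jsx", "js"], [".ts", "js"], [".tsx", "js"],
  [".py", "python"], [".java", "java"], [".go", "go"],
  [".cpp", "cpp"], [".php", "php"],
]);

// Pick chunking parameters for a file based on its extension.
function splitterConfigFor(filePath) {
  const language = LANGUAGE_BY_EXTENSION.get(path.extname(filePath).toLowerCase()) ?? null;
  return language
    ? { language, chunkSize: 800, chunkOverlap: 100 }
    : { language: null, chunkSize: 1000, chunkOverlap: 100 };
}

// Naive stand-in splitter: fixed-size windows that share `chunkOverlap` characters.
function chunkText(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) throw new RangeError("chunkOverlap must be smaller than chunkSize");
  const chunks = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

With an 800-character window and 100-character overlap, each new chunk starts 700 characters after the previous one, so a 2000-character file yields three chunks.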
Step 4: Metadata Enrichment
For each code chunk:
- Adds metadata:
  - `source`: Original file path
  - `repo`: Repository URL
  - `userid`: User's email address
- Prepends file name to chunk content for better context
- Formats chunk as: `File: filename.js\n\n[code content]`
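The enrichment step amounts to a small transformation per chunk. In this sketch, `pageContent` and `metadata` follow LangChain's document shape; the function name and the use of the base file name (rather than the full path) in the `File:` header are assumptions.

```javascript
const path = require("node:path");

// Hypothetical sketch: attach metadata and prepend the file name to one chunk.
function enrichChunk(chunkText, { filePath, repoUrl, userEmail }) {
  return {
    // The header gives the embedding model and the LLM the file context.
    pageContent: `File: ${path.basename(filePath)}\n\n${chunkText}`,
    metadata: {
      source: filePath,  // original file path
      repo: repoUrl,     // repository URL
      userid: userEmail, // owner's email address
    },
  };
}
```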
Step 5: Vector Database Storage
- Connects to Weaviate cloud instance
- Uses the `RepoCodeChunk` collection
- Transforms chunks into Weaviate data objects:
  - `text`: The code chunk content
  - `repourl`: Repository URL
  - `userid`: User email
- Bulk inserts all chunks into Weaviate
- Weaviate automatically:
  - Generates embeddings using JinaAI (configured via headers)
  - Stores vectors for semantic search
  - Indexes the data for fast retrieval
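Mapping enriched chunks to the `RepoCodeChunk` properties might look like this. The input shape (`pageContent`, `metadata.repo`, `metadata.userid`) is assumed from the enrichment step, and splitting the insert into batches is an optional precaution added here, with a hypothetical batch size; the document itself only says that all chunks are bulk inserted.

```javascript
// Hypothetical sketch: convert enriched chunks to Weaviate property objects.
function toWeaviateObjects(chunks) {
  return chunks.map((chunk) => ({
    text: chunk.pageContent,
    repourl: chunk.metadata.repo,
    userid: chunk.metadata.userid,
  }));
}

// Split a list into consecutive batches of at most batchSize items.
function inBatches(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

Each batch would then be passed to the Weaviate client's bulk insert call for the collection; which client version the project uses is not stated here, so that call is left out.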
Step 6: Cleanup
- Removes the temporary cloned repository directory
- Logs completion or errors
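Cleanup that runs even when indexing fails is naturally written with `try`/`finally`. The helper below is a sketch with a hypothetical name; it creates the temporary directory, hands it to the caller's work function, and removes it afterwards whether or not that function throws.

```javascript
const fs = require("node:fs/promises");
const os = require("node:os");
const path = require("node:path");

// Run `work(dir)` inside a fresh temp directory that is always removed afterwards.
async function withTempDir(prefix, work) {
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), prefix));
  try {
    return await work(dir);
  } finally {
    // force: true suppresses the error if the directory is already gone.
    await fs.rm(dir, { recursive: true, force: true });
  }
}
```

A job processor could wrap its clone, chunk, and upload steps in a single `withTempDir("repo-index-", async (dir) => { ... })` call.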
Separation of Concerns:
- Main Server stays responsive for API requests
- Heavy processing happens asynchronously
- Can scale workers independently
Scalability:
- Multiple Private Server instances can process jobs in parallel
- Queue system handles job distribution
- Concurrency control (5 jobs at once) prevents resource exhaustion
Reliability:
- Jobs are persisted in Redis
- Failed jobs can be retried
- Worker failures don't affect the API server
- BullMQ Worker - Job queue worker
- LangChain - Document loading and text splitting
- Weaviate Client - Vector database operations
- JinaAI - Embedding generation (via Weaviate)
- Node.js fs & child_process - File system and git operations
- User registers with name, email, password
- Password is hashed with bcrypt and stored
- User logs in and receives JWT token
- Token is used for all subsequent authenticated requests
- User submits GitHub repository URL via frontend
- Main Server saves repository metadata to PostgreSQL
- Main Server adds indexing job to BullMQ queue
- API responds immediately with repository record
- Private Server worker picks up the job
- Worker clones, processes, and indexes the repository
- Code chunks are stored in Weaviate with embeddings
- User selects a repository and asks a question
- Main Server creates a prompt record (status: "processing")
- API responds immediately with prompt ID
- Background process:
  - Queries Weaviate for semantically similar code chunks
  - Filters by user email and repository URL
  - Builds augmented prompt with context
  - Sends to Gemini LLM
  - Updates prompt record with answer
- Frontend polls or receives update with the answer
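If the frontend polls for the answer, a client-side loop along these lines would work. `fetchPrompt` is an injected, hypothetical function that returns the current prompt record, and the interval and attempt limits are illustrative defaults, not values taken from the project.

```javascript
// Hypothetical sketch: poll until the prompt leaves the "processing" state.
async function pollPrompt(fetchPrompt, promptId, { intervalMs = 2000, maxAttempts = 30 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const prompt = await fetchPrompt(promptId);
    if (prompt.status !== "processing") return prompt;

    // Wait before the next attempt, but not after the final one.
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Prompt ${promptId} still processing after ${maxAttempts} attempts`);
}
```

Bounding the number of attempts keeps a failed background job from leaving the UI polling forever.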
Codebase-Chatbot/
├── Frontend/                          # React frontend application
│   ├── src/
│   │   ├── pages/                     # Chat, Login, Register, Repos pages
│   │   ├── components/                # Reusable UI components
│   │   ├── context/                   # Auth context for state management
│   │   └── services/                  # API service layer
│
├── Main server/                       # Express.js API server
│   ├── controllers/                   # Request handlers
│   │   ├── authController.js          # Registration & login
│   │   ├── repoInputController.js     # Repository management
│   │   ├── promptController.js        # Prompt & RAG handling
│   │   └── chatController.js          # Chat management
│   ├── models/                        # Sequelize database models
│   ├── routes/                        # API route definitions
│   ├── middleware/                    # Auth middleware
│   ├── util/                          # RAG utilities
│   │   ├── ragOutput.js               # RAG query & generation
│   │   └── augmenter.js               # Prompt augmentation
│   ├── client/                        # Weaviate client connection
│   ├── config/                        # BullMQ queue configuration
│   └── db/                            # Database connection setup
│
├── Private server/                    # Background worker
│   └── src/
│       ├── handler/
│       │   └── indexer.js             # Repository indexing logic
│       ├── config/
│       │   ├── bullmq.config.js       # Redis connection
│       │   └── vectordb.config.js     # Weaviate connection
│       └── index.js                   # Worker initialization
│
└── docker-compose.yml                 # Multi-container orchestration
AUTH Table
- `user_id` (Primary Key)
- `name`
- `email` (Unique)
- `password` (Hashed)
REPO_INPUT Table
- `repo_id` (Primary Key)
- `email` (Foreign Key to AUTH)
- `repo_url`
- `created_at`
CHAT Table
- `chat_id` (Primary Key)
- `repo_id` (Foreign Key to REPO_INPUT)
- `created_at`
PROMPT Table
- `prompt_id` (Primary Key)
- `chat_id` (Foreign Key to CHAT)
- `prompt` (User's question)
- `response` (Generated answer)
- `created_at`
RepoCodeChunk Collection
- `text` - Code chunk content
- `repourl` - Repository URL
- `userid` - User email
- Vector embeddings (auto-generated by JinaAI)
- Node.js
- Docker & Docker Compose
- PostgreSQL (or use Docker)
- Redis/Upstash account
- Weaviate Cloud account
- JinaAI API key
- Google Gemini API key
Main Server (Main server/.env):
PORT=5000
JWT_SECRET=your-secret-key
POSTGRES_DB_NAME=postgres
POSTGRES_USERNAME=postgres
POSTGRES_PASSWORD=your-password
POSTGRES_HOST=localhost
FRONTEND_URL=http://localhost:3000
WEAVIATE_URL=your-weaviate-url
WEAVIATE_API_KEY=your-weaviate-key
JINA_API_KEY=your-jina-key
GEMINI_API_KEY=your-gemini-key
UPSTASH_PASSWORD=your-upstash-password
Private Server (Private server/.env):
WEAVIATE_URL=your-weaviate-url
WEAVIATE_API_KEY=your-weaviate-key
JINA_API_KEY=your-jina-key
UPSTASH_PASSWORD=your-upstash-password
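A startup check can catch a missing variable before the first request or job fails. The variable names below are copied from the two `.env` listings above; the check itself and the `missingEnvVars` helper are a suggested sketch, not part of the project.

```javascript
const MAIN_SERVER_VARS = [
  "PORT", "JWT_SECRET", "POSTGRES_DB_NAME", "POSTGRES_USERNAME", "POSTGRES_PASSWORD",
  "POSTGRES_HOST", "FRONTEND_URL", "WEAVIATE_URL", "WEAVIATE_API_KEY",
  "JINA_API_KEY", "GEMINI_API_KEY", "UPSTASH_PASSWORD",
];

const PRIVATE_SERVER_VARS = [
  "WEAVIATE_URL", "WEAVIATE_API_KEY", "JINA_API_KEY", "UPSTASH_PASSWORD",
];

// Return the names of required variables that are unset or blank.
function missingEnvVars(required, env = process.env) {
  return required.filter((name) => !env[name] || env[name].trim() === "");
}

// Usage at startup:
// const missing = missingEnvVars(PRIVATE_SERVER_VARS);
// if (missing.length > 0) {
//   throw new Error(`Missing environment variables: ${missing.join(", ")}`);
// }
```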
docker-compose up -d
This starts:
- PostgreSQL database
- Main Server (port 5000)
- Frontend (port 3000)
- Private Server worker
Main Server:
cd "Main server"
npm install
npm run dev
Private Server:
cd "Private server"
npm install
npm run dev
Frontend:
cd Frontend
npm install
npm run dev
- User authentication with JWT
- GitHub repository indexing
- Semantic code search using vector embeddings
- RAG-based question answering
- Multi-repository support per user
- Chat history and conversation management
- Asynchronous job processing
- Language-aware code chunking
- Filtered search (user-specific, repo-specific)
- Frontend: React, Vite, TailwindCSS
- Backend: Node.js, Express.js
- Database: PostgreSQL (Sequelize ORM)
- Vector DB: Weaviate Cloud
- Queue: BullMQ + Redis (Upstash)
- Embeddings: JinaAI
- LLM: Google Gemini 2.5 Flash
- Authentication: JWT, bcrypt
- Containerization: Docker, Docker Compose
- `POST /api/auth/register` - Register new user
- `POST /api/auth/login` - Log in user
- `POST /api/repo/save-repo` - Add repository (queues indexing)
- `GET /api/repo/get-repo/:repo_id` - Get repository details
- `GET /api/repo/get-all-repos/:email` - Get all user repositories
- `DELETE /api/repo/delete-repo/:repo_id` - Delete repository
- `POST /api/prompt/save-prompt` - Submit question (triggers RAG)
- `GET /api/prompt/get-prompts/:chat_id` - Get chat history
- `GET /api/chat/get-chat/:chat_id` - Get chat details
- `GET /api/chat/get-all-chats/:repo_id` - Get all chats for repository
- Password hashing with bcrypt (10 rounds)
- JWT token-based authentication
- CORS protection
- SQL injection prevention (Sequelize ORM)
- Environment variable configuration
- Token expiration (7 days)
- Asynchronous Processing: Heavy indexing jobs don't block API responses
- Concurrent Workers: Each Private Server instance processes up to 5 jobs concurrently
- Shallow Git Clones: Only clones latest commit (`--depth 1`)
- Chunked Processing: Large files are split into manageable chunks
- Vector Search: Fast semantic search using Weaviate's optimized indexes
- Connection Pooling: Database connections are managed efficiently
- Try-catch blocks in all async operations
- Graceful error responses with appropriate HTTP status codes
- Detailed error messages in development mode
- Queue job retry mechanisms (via BullMQ)
- Database transaction rollbacks on failures
- Temporary file cleanup even on errors
Potential improvements:
- Real-time WebSocket updates for prompt processing
- Support for private repositories (GitHub tokens)
- Code syntax highlighting in responses
- Multi-file context in answers
- Repository indexing status tracking
- Incremental updates (only index changed files)
- Support for more programming languages
- Code snippet references with line numbers
[Add your license here]
[Add contributors here]