
██████╗  ██╗████████╗██╗  ██╗██╗   ██╗██████╗ ██████╗  ██████╗ ████████╗
██╔════╝ ██║╚══██╔══╝██║  ██║██║   ██║██╔══██╗██╔══██╗██╔═══██╗╚══██╔══╝
██║  ███╗██║   ██║   ███████║██║   ██║██████╔╝██████╔╝██║   ██║   ██║   
██║   ██║██║   ██║   ██╔══██║██║   ██║██╔══██╗██╔══██╗██║   ██║   ██║   
██████╔╝ ██║   ██║   ██║  ██║╚██████╔╝██████╔╝██████╔╝╚██████╔╝   ██║   
╚═════╝  ╚═╝   ╚═╝   ╚═╝  ╚═╝ ╚═════╝ ╚═════╝ ╚═════╝  ╚═════╝    ╚═╝

An open-source, LLM-based intelligent analysis bot for GitHub repositories

Chat with your codebase, gain deep insights, and automate code understanding

Note: This project is currently under active development and is not yet ready for production use.

GithubBot is an AI framework for exploring codebases in natural language. It automatically "learns" an entire GitHub repository, including all its code and documentation, and answers questions about it through a chatbot, from "What does this function do?" to "How do I implement a new feature?".

🚀 Core Features

🤖 Intelligent Code Q&A: Provides precise, context-aware code explanations and suggestions based on Retrieval-Augmented Generation (RAG).
⚡️ Fully Automated Processing: Simply provide a GitHub repository URL to automatically clone, parse, chunk, vectorize, and index the code.
🔌 Highly Extensible: Easily swap or extend LLMs, embedding models, and vector databases. Supports model providers such as OpenAI, Azure, Cohere, and HuggingFace.
🔍 Hybrid Search: Combines vector search with BM25 keyword search to ensure optimal context retrieval for various types of queries.
⚙️ Asynchronous Task Handling: Uses Celery and Redis to manage time-consuming repository indexing tasks, ensuring API responsiveness and stability.
🐳 One-Click Deployment: Comes with a complete Docker Compose setup, allowing you to launch all services (API, Worker, databases, etc.) with a single command.

🏗️ Architecture Overview

GithubBot uses a modern microservices architecture to ensure system scalability and maintainability. The core process is divided into two stages: "Data Ingestion" and "Query Answering".

📥 Data Ingestion Flow

1. User submits repo URL via API
2. API service creates a Celery async task
3. Task enters Redis message queue
4. Celery Worker executes `ingestion_service`
   Processing Steps:
   • Git Helper: Clone repository
   • File Parser: Parse & chunk files
   • Embedding Manager: Generate vectors
5. Store in ChromaDB (vectors) & PostgreSQL (metadata)

💬 Query Answering Flow

1. User asks a question via API
2. API service calls `query_service`
   Hybrid Search:
   • Vector search from ChromaDB
   • Keyword search with BM25
3. Fuse and rerank retrieved results
4. LLM Manager builds prompt & calls LLM
5. Return final answer via API
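
The "fuse and rerank" step of the query flow can be illustrated with Reciprocal Rank Fusion (RRF), a common way to merge a vector ranking with a BM25 ranking. This README does not say which fusion method GithubBot uses, so treat the sketch below as one possible approach rather than the project's implementation. The function name, the constant `k=60`, and the chunk IDs are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of chunk IDs into one list.

    Each chunk earns 1 / (k + rank) for every list it appears in
    (rank starts at 1); the contributions are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on chunk ID for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


# Hypothetical chunk IDs returned by the two retrievers.
vector_hits = ["cors.md#2", "middleware.py#0", "routing.py#4"]
bm25_hits = ["middleware.py#0", "settings.py#1", "cors.md#2"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits], top_n=3)
print(fused)
```

Chunks that rank well in both lists rise to the top, while a chunk found by only one retriever can still make the cut. RRF works on ranks alone, so the raw vector and BM25 scores never need to be put on the same scale.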

🛠️ Tech Stack

Backend: FastAPI, Python 3.10+
AI / RAG: LangChain, OpenAI, Cohere, HuggingFace (extendable)
Database: PostgreSQL (metadata), ChromaDB (vector storage)
Task Queue: Celery, Redis
Containerization: Docker, Docker Compose
Data Validation: Pydantic

🚀 Quick Start

You can get GithubBot up and running in minutes with Docker.

1. Prerequisites

Docker: Install Docker
Docker Compose: Usually included with Docker Desktop.
Git: To clone this project.

2. Clone the Project

```
git clone https://github.com/oGYCo/GithubBot.git
cd GithubBot
```

3. Configure Environment

The project reads API keys and other settings from a `.env` file. Create your own by copying the included `.env.example` file:

```
cp .env.example .env
```

Then, edit the .env file and add at least your OpenAI API key:

```
# .env

# --- LLM and Embedding Model API Keys ---
# At least one model key is required
OPENAI_API_KEY="sk-..."
# AZURE_OPENAI_API_KEY=
# ANTHROPIC_API_KEY=
# ... other API keys
```
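
Before launching the containers, it can help to confirm that at least one model key is actually set. The following is a minimal sketch of such a check, not part of GithubBot. The parser handles only simple `KEY=VALUE` lines, the key list is a subset of the names shown in this README, and `sk-example` is a placeholder value.

```python
# Subset of the key names listed in this README.
MODEL_KEY_NAMES = ("OPENAI_API_KEY", "AZURE_OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def configured_model_keys(env):
    """Return the model key names that have a non-empty value."""
    return [name for name in MODEL_KEY_NAMES if env.get(name)]


sample = '''
# --- LLM and Embedding Model API Keys ---
OPENAI_API_KEY="sk-example"
# AZURE_OPENAI_API_KEY=
ANTHROPIC_API_KEY=
'''

found = configured_model_keys(parse_env(sample))
print(found or "No model API key configured")
```

To check a real file, pass its contents instead of `sample`, for example `parse_env(open(".env").read())`.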

4. Launch Services

Option A: One-Click Start (Recommended)

For Linux/macOS:

```
chmod +x start.sh
./start.sh
```

For Windows:

Method 1 (Batch file): Double-click start.bat or run in Command Prompt:
```
start.bat
```
Method 2 (PowerShell): Right-click start.ps1 → "Run with PowerShell" or run in PowerShell:
```
.\start.ps1
```

Option B: Manual Docker Compose

Build and start all services manually:

```
docker compose up --build -d
```

This command will start the API service, Celery worker, PostgreSQL, Redis, and ChromaDB.

5. Check Status

Wait a moment for the services to initialize, then check if all containers are running correctly:

```
docker compose ps
```

You should see the status of all services as running or healthy.

6. Access the Services

Once all services are running, you can access:

API Documentation: http://localhost:8000/docs
API Root: http://localhost:8000
Flower (Task Monitor): http://localhost:5555
Health Check: http://localhost:8000/health
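
Scripts that depend on the API can wait for the health endpoint instead of sleeping for a fixed time. The sketch below polls a URL until it answers with HTTP 200. That `/health` returns 200 once the service is ready is an assumption, and the function names, attempt count, and delay are hypothetical. The demonstration uses a stand-in fetch function so that it runs without the services.

```python
import time
import urllib.error
import urllib.request


def fetch_status_code(url, timeout=5):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # The server answered, but with an error status.
    except (urllib.error.URLError, OSError):
        return None  # Connection refused, DNS failure, timeout, ...


def wait_until_healthy(url, fetch=fetch_status_code, attempts=30,
                       delay=2.0, sleep=time.sleep):
    """Poll url until it returns 200; give up after `attempts` tries."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Demonstration with canned responses: unreachable, then 503, then 200.
responses = iter([None, 503, 200])
waits = []
healthy = wait_until_healthy(
    "http://localhost:8000/health",
    fetch=lambda url: next(responses),
    sleep=waits.append,
)
print(healthy, waits)
```

Against a running stack, call `wait_until_healthy("http://localhost:8000/health")` with the default arguments.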

Project Status

📊 Service Monitor

| Service | Port | Monitor URL | Description |
| --- | --- | --- | --- |
| API Service | 8000 | http://localhost:8000/health | Main API interface |
| API Documentation | 8000 | http://localhost:8000/docs | Swagger documentation |
| Flower | 5555 | http://localhost:5555 | Task queue monitoring |
| PostgreSQL | 5432 | - | Database service |
| Redis | 6380 | - | Cache and message queue |
| ChromaDB | 8001 | - | Vector database (host port, container internal 8000) |

🛑 Stop Services

```
docker compose down
```

🔄 Restart Services

```
docker compose restart
```

📝 View Logs

```
# View all service logs
docker compose logs -f

# View specific service logs
docker compose logs -f api
docker compose logs -f worker
```

🔧 Troubleshooting

Common Issues

API keys not set
- Ensure at least one LLM API key is set in the .env file
- Recommended: Set OPENAI_API_KEY
Port conflicts
- Check if ports 8000, 5555, 5432, 6380, 8001 are occupied
- Use `netstat -an | grep :8000` to check port status (on Windows: `netstat -an | findstr :8000`)
Docker not running
- Ensure Docker Desktop is running
- Check Docker system tray icon
Memory issues
- Ensure system has enough memory to run all containers
- Recommended: At least 4GB available memory
Network connection issues
- Ensure access to Docker Hub
- May need Docker registry mirror configuration in China
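
The port check above can also be done for all five ports at once. The following is a minimal, platform-independent sketch, not part of the project's tooling. It treats a port as occupied when something accepts a TCP connection on `127.0.0.1`; the function names are hypothetical.

```python
import socket

# Host ports listed in this README.
DEFAULT_PORTS = (8000, 5555, 5432, 6380, 8001)


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


def occupied_ports(ports=DEFAULT_PORTS):
    """Return the subset of ports that already have a listener."""
    return [port for port in ports if port_in_use(port)]
```

Run `occupied_ports()` before starting the stack. An empty list suggests that none of the default ports are taken; any port it returns is held by another process.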

Windows Specific Issues

Docker Desktop not started
- Ensure Docker Desktop is running
- Check Docker icon in system tray
WSL2 not enabled
- Docker Desktop requires WSL2 support
- Refer to WSL2 installation guide
Firewall blocking
- Ensure Windows Firewall allows Docker network access

📖 API Usage Example

Once the services are running, the API will be available at http://localhost:8000. You can access the interactive API documentation (Swagger UI) at http://localhost:8000/docs.

1. Index a New Repository

Send a POST request to the following endpoint to start analyzing a repository. This is an asynchronous operation: the API returns immediately with an identifier for the task (the `session_id` used in the next steps) instead of waiting for indexing to finish.

URL: /api/v1/repos/analyze
Method: POST
Body:

```
{
  "repo_url": "https://github.com/tiangolo/fastapi",
  "embedding_config": {
    "provider": "openai",
    "model_name": "text-embedding-3-small",
    "api_key": "your-openai-api-key"
  }
}
```

Example (using cURL):

```
curl -X 'POST' \
  'http://localhost:8000/api/v1/repos/analyze' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "repo_url": "https://github.com/tiangolo/fastapi",
  "embedding_config": {
    "provider": "openai",
    "model_name": "text-embedding-3-small",
    "api_key": "your-openai-api-key"
  }
}'
```
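
The same request can be built from Python using only the standard library. This sketch mirrors the cURL call above: the endpoint, headers, and body fields come from this README, while the helper name `build_analyze_request` is hypothetical. The request is only constructed here; sending it requires the services to be running.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"


def build_analyze_request(repo_url, api_key, base_url=API_BASE):
    """Build the POST request that starts indexing a repository."""
    payload = {
        "repo_url": repo_url,
        "embedding_config": {
            "provider": "openai",
            "model_name": "text-embedding-3-small",
            "api_key": api_key,
        },
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/repos/analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request(
    "https://github.com/tiangolo/fastapi", "your-openai-api-key"
)

# To send it once the API is up:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```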

2. Check Analysis Status

Use the session_id returned from the previous step to check the analysis progress.

URL: /api/v1/repos/status/{session_id}
Method: GET
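
Because indexing is asynchronous, clients usually poll this endpoint until the analysis finishes. The sketch below shows one way to do that. The response schema is not documented in this README, so the `status` field name and the `PENDING`, `STARTED`, and `FAILURE` values are assumptions; only `SUCCESS` is named in this document. The function names and timing defaults are hypothetical.

```python
import json
import time
import urllib.request


def fetch_status_http(session_id, base_url="http://localhost:8000"):
    """GET the status endpoint and return the parsed JSON body."""
    url = f"{base_url}/api/v1/repos/status/{session_id}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_analysis(session_id, fetch_status=fetch_status_http,
                      poll_interval=5.0, max_polls=120, sleep=time.sleep):
    """Poll until the analysis reports SUCCESS; raise on failure or timeout."""
    for _ in range(max_polls):
        status = fetch_status(session_id).get("status")  # assumed field name
        if status == "SUCCESS":
            return status
        if status == "FAILURE":  # assumed failure value
            raise RuntimeError(f"analysis failed for session {session_id}")
        sleep(poll_interval)
    raise TimeoutError(f"analysis still running after {max_polls} polls")


# Demonstration with canned replies instead of a live API.
replies = iter([{"status": "PENDING"}, {"status": "STARTED"}, {"status": "SUCCESS"}])
final = wait_for_analysis(
    "your-session-id",
    fetch_status=lambda sid: next(replies),
    sleep=lambda seconds: None,
)
print(final)
```

With the services running, `wait_for_analysis("your-session-id")` uses the HTTP fetcher by default. Check the Swagger UI for the actual response fields before relying on the names assumed here.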

3. Chat with the Repository

Once the repository status changes to SUCCESS, you can start asking questions.

URL: /api/v1/repos/query
Method: POST
Body:

```
{
  "session_id": "your-session-id",
  "question": "How to handle CORS in FastAPI?",
  "generation_mode": "service",
  "llm_config": {
    "provider": "openai",
    "model_name": "gpt-4",
    "api_key": "your-openai-api-key",
    "temperature": 0.7,
    "max_tokens": 1000
  }
}
```

Example (using cURL):

```
curl -X 'POST' \
  'http://localhost:8000/api/v1/repos/query' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "session_id": "your-session-id",
  "question": "How to handle CORS in FastAPI?",
  "generation_mode": "service"
}'
```

⚙️ Environment Configuration Details

You can customize almost every aspect of the application in the .env file.

Core Configuration

| Variable Name | Description | Default Value |
| --- | --- | --- |
| `APP_NAME` | Application name | `"GithubBot"` |
| `APP_VERSION` | Application version | `"0.1.0"` |
| `DEBUG` | Debug mode | `False` |
| `LOG_LEVEL` | Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL) | `"INFO"` |
| `API_KEY` | API access key (optional) | `""` |
| `CORS_ORIGINS` | Allowed CORS origins (comma-separated) | `"http://localhost:3000,http://127.0.0.1:3000"` |

Service Ports

| Variable Name | Description | Default Value |
| --- | --- | --- |
| `API_HOST` | API host address | `"0.0.0.0"` |
| `API_PORT` | Port for the API service to listen on | `8000` |

Database Configuration (PostgreSQL)

| Variable Name | Description | Default Value |
| --- | --- | --- |
| `DATABASE_URL` | Complete PostgreSQL connection URL | `"postgresql+psycopg2://user:password@postgres:5432/repoinsight"` |
| `POSTGRES_USER` | PostgreSQL username | `"user"` |
| `POSTGRES_PASSWORD` | PostgreSQL password | `"password"` |
| `POSTGRES_DB` | PostgreSQL database name | `"repoinsight"` |
| `POSTGRES_HOST` | PostgreSQL host | `"postgres"` |
| `POSTGRES_PORT` | PostgreSQL port | `5432` |
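
The default `DATABASE_URL` is made up of the individual `POSTGRES_*` values, so changing one of them without updating the URL can leave the two out of sync. The sketch below shows how the parts combine into the URL. Whether GithubBot derives the URL this way or reads `DATABASE_URL` directly is not stated here, and `build_database_url` is a hypothetical helper.

```python
import os
from urllib.parse import quote


def build_database_url(env=os.environ):
    """Assemble a SQLAlchemy-style PostgreSQL URL from POSTGRES_* settings."""
    user = env.get("POSTGRES_USER", "user")
    password = env.get("POSTGRES_PASSWORD", "password")
    host = env.get("POSTGRES_HOST", "postgres")
    port = env.get("POSTGRES_PORT", "5432")
    database = env.get("POSTGRES_DB", "repoinsight")
    # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"postgresql+psycopg2://{credentials}@{host}:{port}/{database}"


# With no overrides, the result matches the documented default.
default_url = build_database_url({})
print(default_url)

# A password containing '@' is escaped as %40.
print(build_database_url({"POSTGRES_PASSWORD": "p@ss"}))
```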

Redis Configuration

| Variable Name | Description | Default Value |
| --- | --- | --- |
| `REDIS_URL` | Complete Redis connection URL | `"redis://redis:6379/0"` |
| `REDIS_HOST` | Redis service address | `"redis"` |
| `REDIS_PORT` | Redis port | `6379` |

ChromaDB Configuration

| Variable Name | Description | Default Value |
| --- | --- | --- |
| `CHROMADB_HOST` | ChromaDB host | `"chromadb"` |
| `CHROMADB_PORT` | ChromaDB port | `8000` |
| `CHROMADB_CLIENT_TIMEOUT` | ChromaDB client timeout (seconds) | `120` |
| `CHROMADB_SERVER_TIMEOUT` | ChromaDB server timeout (seconds) | `120` |
| `CHROMADB_MAX_RETRIES` | ChromaDB connection max retries | `5` |
| `CHROMADB_RETRY_DELAY` | ChromaDB connection retry delay (seconds) | `3` |
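
To illustrate how `CHROMADB_MAX_RETRIES` and `CHROMADB_RETRY_DELAY` could interact, here is a generic retry loop. It is a sketch, not the project's connection code. It assumes that "max retries" means the total number of attempts and that the delay between attempts is fixed; `connect_with_retry` and `flaky_connect` are hypothetical names.

```python
import time


def connect_with_retry(connect, max_retries=5, retry_delay=3, sleep=time.sleep):
    """Call connect() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_retries:
                sleep(retry_delay)  # Wait before the next attempt.
    raise ConnectionError(f"gave up after {max_retries} attempts") from last_error


# Stand-in for a database that becomes reachable on the third attempt.
attempts = {"count": 0}


def flaky_connect():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("ChromaDB not ready yet")
    return "client"


delays = []
client = connect_with_retry(flaky_connect, sleep=delays.append)
print(client, attempts["count"], delays)
```

With the defaults, a service that never comes up is abandoned after five attempts and about 12 seconds of waiting (four delays of 3 seconds).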

LLM and Embedding Model API Keys

| Variable Name | Description |
| --- | --- |
| `OPENAI_API_KEY` | OpenAI API key |
| `AZURE_OPENAI_API_KEY` | Azure OpenAI API key |
| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint |
| `ANTHROPIC_API_KEY` | Anthropic API key |
| `COHERE_API_KEY` | Cohere API key |
| `GOOGLE_API_KEY` | Google API key |
| `HUGGINGFACE_HUB_API_TOKEN` | HuggingFace API token |
| `MISTRAL_API_KEY` | Mistral API key |
| `QWEN_API_KEY` | Qwen API key |
| `DASHSCOPE_API_KEY` | DashScope API key |

Processing Configuration

| Variable Name | Description | Default Value |
| --- | --- | --- |
| `GIT_CLONE_DIR` | Directory for Git repository clones | `"/repo_clones"` |
| `CHUNK_SIZE` | Maximum size of text chunks | `1000` |
| `CHUNK_OVERLAP` | Overlap size between text chunks | `200` |
| `EMBEDDING_BATCH_SIZE` | Batch size for embedding processing | `32` |
| `VECTOR_SEARCH_TOP_K` | Number of documents from vector search | `10` |
| `BM25_SEARCH_TOP_K` | Number of documents from BM25 search | `10` |
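
The relationship between `CHUNK_SIZE` and `CHUNK_OVERLAP` is easiest to see with a small example. The sketch below splits text into fixed-size windows that share an overlap with their neighbor. It assumes sizes are measured in characters and ignores code structure; the project's real splitter may measure and split differently. `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # Distance between window starts.
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reached the end of the text.
    return chunks


# 2500 characters of sample text with a repeating digit pattern.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(chunk) for chunk in chunks])
```

With the defaults, each new chunk starts 800 characters after the previous one, so the last 200 characters of one chunk are repeated at the start of the next. The overlap keeps a function or sentence that straddles a boundary fully visible in at least one chunk, at the cost of embedding some text twice.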

File Processing

| Variable Name | Description | Default Value |
| --- | --- | --- |
| `ALLOWED_FILE_EXTENSIONS` | List of allowed file extensions (JSON array) | `[".py", ".js", ".jsx", ".ts", ".tsx", ".java", ".cpp", ".c", ".h", ".hpp", ".cs", ".php", ".rb", ".go", ".rs", ".swift", ".kt", ".scala", ".md", ".txt", ".rst", ".json", ".yaml", ".yml", ".toml", ".ini", ".cfg", ".sh", ".sql", ".html", ".css", ".vue", "dockerfile", "makefile", "readme", "license", "changelog"]` |
| `EXCLUDED_DIRECTORIES` | List of directories to exclude (JSON array) | `[".git", "node_modules", "dist", "build", "venv", ".venv", "target"]` |
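
The default allow list mixes real extensions (such as `.py`) with bare file names (such as `dockerfile`). One plausible way to apply both lists is sketched below. The matching rules are assumptions: names and extensions are compared case-insensitively, and a file is skipped if any parent directory is excluded. `is_indexable` is a hypothetical helper, and `ALLOWED` is a shortened version of the default list.

```python
from pathlib import PurePosixPath

# Shortened version of the default ALLOWED_FILE_EXTENSIONS list.
ALLOWED = {".py", ".md", ".yaml", "dockerfile", "makefile", "readme"}
EXCLUDED_DIRS = {".git", "node_modules", "dist", "build", "venv", ".venv", "target"}


def is_indexable(path, allowed=ALLOWED, excluded_dirs=EXCLUDED_DIRS):
    """Decide whether a repository-relative path should be parsed."""
    parsed = PurePosixPath(path)
    # Skip the file if any parent directory is on the exclusion list.
    if any(part in excluded_dirs for part in parsed.parts[:-1]):
        return False
    # Match either the extension (".py") or the whole file name ("dockerfile").
    return parsed.suffix.lower() in allowed or parsed.name.lower() in allowed


for candidate in ["src/app/main.py", "Dockerfile", "node_modules/pkg/index.py",
                  "docs/build/notes.md", "assets/logo.png"]:
    print(candidate, is_indexable(candidate))
```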

Celery Configuration

| Variable Name | Description | Default Value |
| --- | --- | --- |
| `CELERY_BROKER_URL` | Celery broker URL | `"redis://redis:6379/0"` |

🤝 Contributing

Contributions of all kinds are welcome! Whether it's reporting a bug, submitting a feature request, or contributing code.

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.
