⚡ LLM Cascade Router

A production-style intelligent prompt router that dynamically decides whether a request should be handled:

  • locally using Ollama + Qwen
  • or escalated to a cloud LLM (Gemini)

It includes:

  • complexity-based routing
  • semantic caching
  • live dashboard
  • OpenAI-compatible API
  • cost optimization
  • latency tracking
  • routing observability

🚀 Why This Exists

Most AI applications either:

  • send everything to expensive cloud models
  • or force everything through weaker local models

This project solves that.

The router checks the semantic cache first. On a miss, it analyzes the complexity of the prompt, then decides:

| Prompt Type | Route |
|---|---|
| Simple / factual / coding help | Local Qwen |
| Complex reasoning / architecture / deep analysis | Gemini |
| Repeated prompts | Semantic cache |

This reduces:

  • cloud API cost
  • latency
  • unnecessary escalations

while still preserving high-quality answers for difficult prompts.


🧠 Architecture

                    ┌─────────────────┐
                    │ Incoming Prompt │
                    └────────┬────────┘
                             │
                             ▼
                 ┌────────────────────┐
                 │ Semantic Cache     │
                 │ (SQLite / vector)  │
                 └────────┬───────────┘
                          │ hit
                          ▼
                    Cached Response

                          │ miss
                          ▼

              ┌────────────────────────┐
              │ Complexity Analyzer    │
              │ (Local Qwen via Ollama)│
              └──────────┬─────────────┘
                         │
         ┌───────────────┴────────────────┐
         │                                │
         ▼                                ▼

┌──────────────────┐          ┌────────────────────┐
│ Local Model      │          │ Cloud Escalation   │
│ Qwen via Ollama  │          │ Gemini Flash Lite  │
└────────┬─────────┘          └─────────┬──────────┘
         │                               │
         └──────────────┬────────────────┘
                        ▼
                 Final Response
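
The flow in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the code in router.py: the function name `route_prompt`, the `RouteResult` type, and the injected callables are all hypothetical, and writing fresh responses back to the cache is an assumption the diagram does not show.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RouteResult:
    route: str      # "cache", "local", or "cloud"
    response: str


def route_prompt(
    prompt: str,
    cache_get: Callable[[str], Optional[str]],
    cache_put: Callable[[str, str], None],
    score: Callable[[str], float],
    ask_local: Callable[[str], str],
    ask_cloud: Callable[[str], str],
    threshold: float = 65.0,
) -> RouteResult:
    """Follow the diagram: cache lookup, then complexity check, then a model call."""
    cached = cache_get(prompt)
    if cached is not None:
        return RouteResult("cache", cached)

    # Assumption: a score at or above the threshold escalates to the cloud.
    if score(prompt) >= threshold:
        route, response = "cloud", ask_cloud(prompt)
    else:
        route, response = "local", ask_local(prompt)

    # Assumption: new answers are stored so a repeat prompt hits the cache.
    cache_put(prompt, response)
    return RouteResult(route, response)
```

Passing the cache, scorer, and model clients in as callables keeps the routing decision testable without a running Ollama server or a Gemini key.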

✨ Features

Intelligent Prompt Routing

A complexity classifier determines whether a prompt should remain local or go to the cloud.

Local-First Inference

Simple prompts are handled completely offline using:

  • Ollama
  • Qwen2.5-Coder

Cloud Escalation

Complex prompts automatically route to Gemini for stronger reasoning.

Semantic Cache

Repeated prompts are served instantly from cache.
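
As a rough sketch of how a SQLite-backed cache like this could work, the example below keys each response on a hash of the normalized prompt. This is an assumption for illustration only: the class `PromptCache`, the table layout, and the normalization rule are hypothetical, and exact matching after normalization is a simplification of real semantic matching, which would compare embeddings.

```python
import hashlib
import sqlite3
from typing import Optional


def _key(prompt: str) -> str:
    # Collapse case and whitespace so trivially different prompts share a key.
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class PromptCache:
    """Hypothetical cache: one row per normalized prompt."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "key TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )

    def get(self, prompt: str) -> Optional[str]:
        row = self.conn.execute(
            "SELECT response FROM cache WHERE key = ?", (_key(prompt),)
        ).fetchone()
        return row[0] if row else None

    def put(self, prompt: str, response: str) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, response) VALUES (?, ?)",
            (_key(prompt), response),
        )
        self.conn.commit()
```

Passing a file path such as "cache.db" instead of the in-memory default would persist entries across restarts.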

OpenAI-Compatible API

Works with:

  • Continue.dev
  • OpenWebUI
  • VSCode extensions
  • custom agents
  • OpenAI SDKs

Live Dashboard

Real-time observability dashboard showing:

  • local vs cloud routing
  • cache hits
  • complexity scores
  • latency
  • request logs

Cost Optimization

Designed to minimize paid token usage.


🖥 Dashboard

Open:

http://localhost:8000/dashboard

You’ll see:

  • live request stream
  • complexity scoring
  • local/cloud/cache routing
  • latency metrics
  • routing percentages
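
The routing percentages and latency metrics could be derived from the request log along these lines. This is only a sketch: the `RequestLog` record and the `summarize` helper are hypothetical names, and the actual dashboard.py may aggregate differently.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class RequestLog:
    route: str          # "local", "cloud", or "cache"
    latency_ms: float


def summarize(logs: Iterable[RequestLog]) -> Dict[str, Dict[str, float]]:
    """Return the share of traffic and the mean latency for each route."""
    entries = list(logs)
    counts = Counter(entry.route for entry in entries)
    summary: Dict[str, Dict[str, float]] = {}
    for route, n in counts.items():
        total_latency = sum(e.latency_ms for e in entries if e.route == route)
        summary[route] = {
            "percent": round(100 * n / len(entries), 1),
            "avg_latency_ms": round(total_latency / n, 1),
        }
    return summary
```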

📦 Tech Stack

| Component | Tech |
|---|---|
| API | FastAPI |
| Local LLM | Ollama |
| Local Model | Qwen2.5-Coder |
| Cloud Model | Gemini Flash Lite |
| Cache | SQLite |
| HTTP Client | httpx |
| Dashboard | Vanilla HTML/CSS/JS |

📂 Project Structure

.
├── main.py           # FastAPI server
├── router.py         # Complexity analysis + routing logic
├── dashboard.py      # Live monitoring dashboard
├── cache.py          # Semantic caching layer
├── cache.db          # SQLite cache database
├── .env              # Environment variables
└── README.md

⚙️ Installation

1. Clone the repo

git clone https://github.com/YOUR_USERNAME/llm-cascade-router.git
cd llm-cascade-router

2. Create virtual environment

python -m venv .venv

Activate:

macOS/Linux

source .venv/bin/activate

Windows

.venv\Scripts\activate

3. Install dependencies

pip install -r requirements.txt

4. Install Ollama

Download it from the official Ollama website: https://ollama.com


5. Pull Qwen model

ollama pull qwen2.5-coder:7b

6. Configure environment variables

Create a .env file in the project root:

OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen2.5-coder:7b

GEMINI_API_KEY=your_key_here

COMPLEXITY_THRESHOLD=65
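
One way the server could read these values is sketched below. The `Settings` class and `load_settings` function are hypothetical, the defaults simply mirror the sample values above, and the 0 to 100 range check assumes that is the scale of the complexity score. Note that `os.environ` does not read .env on its own; something like python-dotenv's `load_dotenv()` would need to run first.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    ollama_host: str
    ollama_model: str
    gemini_api_key: str
    complexity_threshold: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables listed above; defaults mirror the sample .env."""
    threshold = float(env.get("COMPLEXITY_THRESHOLD", "65"))
    # Assumption: scores run from 0 to 100, so the threshold should too.
    if not 0 <= threshold <= 100:
        raise ValueError(f"COMPLEXITY_THRESHOLD out of range: {threshold}")
    return Settings(
        # Strip a trailing slash so URL paths can be appended safely.
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/"),
        ollama_model=env.get("OLLAMA_MODEL", "qwen2.5-coder:7b"),
        gemini_api_key=env.get("GEMINI_API_KEY", ""),
        complexity_threshold=threshold,
    )
```

Lowering the threshold sends more prompts to the cloud; raising it keeps more of them local.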

▶️ Running The Project

Start Ollama:

ollama serve

Then run FastAPI:

uvicorn main:app --reload

🔌 API Usage

Endpoint:

POST /v1/chat/completions

Example:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Design a scalable notification system"
      }
    ]
  }'
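
The same request can be made from Python with only the standard library. This is a sketch under two assumptions: the server is listening on localhost:8000, and the reply follows the usual OpenAI chat completion shape (`choices[0].message.content`). The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """Send the prompt to the router and return the assistant's text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=120) as resp:
        body = json.load(resp)
    # Assumption: the response uses the OpenAI chat completion layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Design a scalable notification system")` would return the routed answer as a string.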

📊 Example Routing Decisions

| Prompt | Route |
|---|---|
| "Reverse a linked list" | Local |
| "Fix this Python syntax error" | Local |
| "Design a distributed event streaming platform" | Cloud |
| "Compare CQRS vs Event Sourcing tradeoffs" | Cloud |

🧠 Complexity Signals

The classifier considers:

  • deep reasoning
  • ambiguity
  • generative requirements
  • domain breadth
  • architectural complexity

These are combined into a final complexity_score.
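
The source does not say how the signals are combined, so the sketch below uses a weighted sum purely as an illustration. The weights, the 0.0 to 1.0 rating scale per signal, the 0 to 100 output range, and the `should_escalate` helper are all assumptions.

```python
from typing import Dict

# Hypothetical weights: the signals are named above, but their weighting is not.
WEIGHTS: Dict[str, float] = {
    "deep_reasoning": 0.30,
    "ambiguity": 0.15,
    "generative_requirements": 0.15,
    "domain_breadth": 0.15,
    "architectural_complexity": 0.25,
}


def complexity_score(signals: Dict[str, float]) -> float:
    """Combine per-signal ratings (each 0.0 to 1.0) into a 0 to 100 score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Missing signals count as 0; out-of-range ratings are clamped.
        rating = min(max(signals.get(name, 0.0), 0.0), 1.0)
        total += weight * rating
    return round(100 * total, 1)


def should_escalate(signals: Dict[str, float], threshold: float = 65.0) -> bool:
    """Route to the cloud when the score reaches the threshold."""
    return complexity_score(signals) >= threshold
```

In the project itself the ratings would presumably come from the local Qwen classifier; here they are plain numbers so the arithmetic is easy to follow.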


📈 Future Improvements

  • vector embeddings cache
  • Redis cache backend
  • streaming responses
  • async queueing
  • Prometheus metrics
  • Docker support
  • Kubernetes deployment
  • adaptive thresholds
  • multi-model routing
  • token usage tracking
  • reinforcement learning for routing

🔥 Example Use Cases

AI IDE Backend

Reduce API costs for coding assistants.

Enterprise Gateways

Keep sensitive prompts local.

Multi-LLM Agents

Route tasks intelligently.

Edge AI Systems

Run hybrid local/cloud inference.


🛡 Disclaimer

This project is experimental and intended for learning and research purposes. It is not production-hardened yet.


⭐ If You Like This Project

Star the repo and feel free to fork/build on top of it.


👨‍💻 Author

Built by Rohith.

Focused on:

  • AI infrastructure
  • intelligent orchestration
  • developer tooling
  • cost-efficient LLM systems