A production-style intelligent prompt router that dynamically decides whether a request should be handled:
- locally using Ollama + Qwen
- or escalated to a cloud LLM (Gemini)
It includes:
- complexity-based routing
- semantic caching
- live dashboard
- OpenAI-compatible API
- cost optimization
- latency tracking
- routing observability
Most AI applications either:
- send everything to expensive cloud models
- or force everything through weaker local models
This project sits between the two and routes each prompt to the cheapest tier that can answer it well.
The router first analyzes the complexity of a prompt, then decides:
| Prompt Type | Route |
|---|---|
| Simple / factual / coding help | Local Qwen |
| Complex reasoning / architecture / deep analysis | Gemini |
| Repeated prompts | Semantic cache |
This is designed to reduce:
- cloud API cost
- latency
- unnecessary escalations
while still preserving high-quality answers for difficult prompts.
                    ┌─────────────────┐
                    │ Incoming Prompt │
                    └────────┬────────┘
                             │
                             ▼
                    ┌────────────────────┐
                    │   Semantic Cache   │
                    │ (SQLite / vector)  │
                    └────────┬───────────┘
                             │ hit
                             ▼
                      Cached Response
                             │ miss
                             ▼
                  ┌────────────────────────┐
                  │  Complexity Analyzer   │
                  │ (Local Qwen via Ollama)│
                  └──────────┬─────────────┘
                             │
             ┌───────────────┴────────────────┐
             │                                │
             ▼                                ▼
    ┌──────────────────┐            ┌────────────────────┐
    │   Local Model    │            │  Cloud Escalation  │
    │ Qwen via Ollama  │            │ Gemini Flash Lite  │
    └────────┬─────────┘            └─────────┬──────────┘
             │                                │
             └───────────────┬────────────────┘
                             ▼
                      Final Response
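
The flow in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the code in `router.py`: the function name `route_prompt`, the `RoutedResponse` type, and the four injected callables are hypothetical, and treating a score at or above the threshold as "escalate" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RoutedResponse:
    text: str
    route: str  # "cache", "local", or "cloud"
    complexity_score: Optional[int] = None


def route_prompt(
    prompt: str,
    cache_lookup: Callable[[str], Optional[str]],
    score_complexity: Callable[[str], int],
    call_local: Callable[[str], str],
    call_cloud: Callable[[str], str],
    threshold: int = 65,  # mirrors COMPLEXITY_THRESHOLD in .env
) -> RoutedResponse:
    """Cache first, then score the prompt, then pick local or cloud."""
    cached = cache_lookup(prompt)
    if cached is not None:
        # Cache hit: no model is called and no score is computed.
        return RoutedResponse(cached, "cache")

    score = score_complexity(prompt)
    if score >= threshold:  # assumption: scores at the threshold escalate
        return RoutedResponse(call_cloud(prompt), "cloud", score)
    return RoutedResponse(call_local(prompt), "local", score)
```

Passing the model calls in as plain callables keeps the routing decision testable without a running Ollama server or a Gemini key.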
A complexity classifier determines whether a prompt should remain local or go to the cloud.
Simple prompts are handled completely offline using:
- Ollama
- Qwen2.5-Coder
Complex prompts automatically route to Gemini for stronger reasoning.
Repeated prompts are served instantly from cache.
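
As a rough idea of how a SQLite-backed cache could serve repeated prompts, here is a minimal sketch. It is an assumption-heavy stand-in for `cache.py`: the `PromptCache` class, the table layout, and the matching strategy (exact match after lowercasing and whitespace normalization) are all hypothetical. True semantic matching would need embeddings, which this sketch does not attempt.

```python
import hashlib
import sqlite3
from typing import Optional


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivial variations share a key."""
    return " ".join(prompt.lower().split())


def cache_key(prompt: str) -> str:
    return hashlib.sha256(normalize(prompt).encode("utf-8")).hexdigest()


class PromptCache:
    """Hypothetical cache: one row per normalized prompt."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "key TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )

    def get(self, prompt: str) -> Optional[str]:
        row = self.conn.execute(
            "SELECT response FROM cache WHERE key = ?", (cache_key(prompt),)
        ).fetchone()
        return row[0] if row else None

    def put(self, prompt: str, response: str) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, response) VALUES (?, ?)",
            (cache_key(prompt), response),
        )
        self.conn.commit()
```

With this scheme, `"Reverse a linked list"` and `"  reverse a   LINKED list "` resolve to the same entry, while a paraphrase such as `"How do I reverse a linked list?"` would still miss.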
The OpenAI-compatible API works with:
- Continue.dev
- OpenWebUI
- VSCode extensions
- custom agents
- OpenAI SDKs
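
For clients that do not use an SDK, a request can be built with the Python standard library alone. The sketch below assumes the server is running on `http://localhost:8000` and accepts the same body as the curl example in this README; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the response is returned as raw JSON because its exact shape is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed default uvicorn address


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST to /v1/chat/completions with a single user message."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(prompt: str, base_url: str = BASE_URL) -> dict:
    """Send the request and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        return json.load(resp)
```

Tools such as Continue.dev or the OpenAI SDKs would typically be pointed at the same base URL through their own configuration.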
Real-time observability dashboard showing:
- local vs cloud routing
- cache hits
- complexity scores
- latency
- request logs
Designed to minimize paid token usage.
Open:

http://localhost:8000/dashboard

You’ll see:
- live request stream
- complexity scoring
- local/cloud/cache routing
- latency metrics
- routing percentages
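
The routing percentages and latency figures could be derived from the request log along these lines. This is only a sketch: the log entry fields `route` and `latency_ms` and the `summarize` helper are hypothetical and may not match what `dashboard.py` stores.

```python
from collections import Counter
from statistics import mean
from typing import Any, Dict, List


def summarize(logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate per-route percentages and average latency from log entries."""
    if not logs:
        return {"total": 0, "route_pct": {}, "avg_latency_ms": None}

    counts = Counter(entry["route"] for entry in logs)
    total = len(logs)
    return {
        "total": total,
        # Share of requests per route, as a percentage with one decimal.
        "route_pct": {
            route: round(100 * n / total, 1) for route, n in counts.items()
        },
        "avg_latency_ms": round(mean(e["latency_ms"] for e in logs), 1),
    }
```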
| Component | Tech |
|---|---|
| API | FastAPI |
| Local LLM | Ollama |
| Local Model | Qwen2.5-Coder |
| Cloud Model | Gemini Flash Lite |
| Cache | SQLite |
| HTTP Client | httpx |
| Dashboard | Vanilla HTML/CSS/JS |
.
├── main.py # FastAPI server
├── router.py # Complexity analysis + routing logic
├── dashboard.py # Live monitoring dashboard
├── cache.py # Semantic caching layer
├── cache.db # SQLite cache database
├── .env # Environment variables
└── README.md
    git clone https://github.com/YOUR_USERNAME/llm-cascade-router.git
    cd llm-cascade-router
    python -m venv .venv

Activate (macOS/Linux):

    source .venv/bin/activate

Activate (Windows):

    .venv\Scripts\activate

Install dependencies:

    pip install -r requirements.txt

Pull the local model:

    ollama pull qwen2.5-coder:7b

Create .env:

    OLLAMA_HOST=http://localhost:11434
    OLLAMA_MODEL=qwen2.5-coder:7b
    GEMINI_API_KEY=your_key_here
    COMPLEXITY_THRESHOLD=65

Start Ollama:

    ollama serve

Then run FastAPI:

    uvicorn main:app --reload

Endpoint:

    POST /v1/chat/completions

Example:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [
          {
            "role": "user",
            "content": "Design a scalable notification system"
          }
        ]
      }'

| Prompt | Route |
|---|---|
| "Reverse a linked list" | Local |
| "Fix this Python syntax error" | Local |
| "Design a distributed event streaming platform" | Cloud |
| "Compare CQRS vs Event Sourcing tradeoffs" | Cloud |
The classifier considers:
- deep reasoning
- ambiguity
- generative requirements
- domain breadth
- architectural complexity
These are combined into a final complexity_score.
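
One plausible way to combine these signals is a weighted sum compared against `COMPLEXITY_THRESHOLD`. The sketch below is illustrative only: the weights, the 0 to 100 scale for each signal, and the function names are assumptions, and the README does not say how the router actually produces or combines the signals.

```python
from typing import Mapping

# Hypothetical weights; they sum to 1.0 so the score stays in 0..100.
WEIGHTS = {
    "deep_reasoning": 0.30,
    "ambiguity": 0.15,
    "generative_requirements": 0.15,
    "domain_breadth": 0.15,
    "architectural_complexity": 0.25,
}


def complexity_score(signals: Mapping[str, float]) -> int:
    """Weighted sum of per-signal ratings, each clamped to 0..100."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        rating = min(max(signals.get(name, 0.0), 0.0), 100.0)
        total += weight * rating
    return round(total)


def should_escalate(score: int, threshold: int = 65) -> bool:
    """Assumption: a score at or above the threshold goes to the cloud."""
    return score >= threshold
```

For example, ratings of 80 for deep reasoning, 90 for architectural complexity, and 50 for the other three signals give a score of 69, which would escalate with the default threshold of 65.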
- vector embeddings cache
- Redis cache backend
- streaming responses
- async queueing
- Prometheus metrics
- Docker support
- Kubernetes deployment
- adaptive thresholds
- multi-model routing
- token usage tracking
- reinforcement learning for routing
Reduce API costs for coding assistants.
Keep sensitive prompts local.
Route tasks intelligently.
Run hybrid local/cloud inference.
This project is experimental and intended for learning and research purposes.
It is not production-hardened yet.
Star the repo and feel free to fork/build on top of it.
Built by Rohith.
Focused on:
- AI infrastructure
- intelligent orchestration
- developer tooling
- cost-efficient LLM systems