LLM API Gateway with Intelligent Cache Optimization
LLM providers charge 10x more for cache misses vs cache hits. TokenRouter transforms your LLM infrastructure:
┌──────────────┐ ┌─────────────────────────────────────────────────────────┐ ┌─────────────┐
│ Client A │────▶│ │────▶│ DeepSeek │
├──────────────┤ │ TokenRouter Gateway │ ├─────────────┤
│ Client B │────▶│ Cache Optimization • Deduplication • Cost Tracking │────▶│ OpenAI │
├──────────────┤ │ │ ├─────────────┤
│ Client C │────▶│ │────▶│ Anthropic │
└──────────────┘ └─────────────────────────────────────────────────────────┘ └─────────────┘
| Problem | TokenRouter Solution | Impact |
|---|---|---|
| Low cache hit rate (<30%) | Structural convergence via Chunker + Arranger + Canonicalizer | Cache hits >70% |
| Inconsistent tool ordering | Alphabetical normalization for cross-user cache sharing | Cross-user cache sharing |
| Duplicate concurrent requests | In-memory deduplication (zero upstream calls) | Eliminate redundant calls |
| No cost visibility | Real-time Prometheus metrics (cache savings, dedup savings) | Track every dollar saved |
Result: Cache hit rates >70%, cost reduction up to 90%
┌──────────────────────────────────────────────────────────────────────────┐
│ TokenRouter Performance Dashboard │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ Throughput P99 Latency Cache Hit Rate Cost Savings │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ 10,000 │ │ <50ms │ │ >70% │ │ Up to │ │
│ │ req/s │ │ │ │ │ │ 90% │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ ████████████████████████████████████████████████████████████████ 95% │
│ │
└──────────────────────────────────────────────────────────────────────────┘
Based on load testing with 10,000 concurrent requests:
| Metric | Value | Baseline | Improvement |
|---|---|---|---|
| Throughput | 10,000 req/s | 1,000 req/s | 10x |
| P99 Latency | <50ms | 200ms | 75%↓ |
| Cache Hit Rate | >70% | <30% | 2.3x |
| Cost Savings | Up to 90% | 0% | 90%↓ |
| Dedup Rate | >5% | 0% | New |
Every incoming request flows through this pipeline:
┌─────────┐ ┌─────────┐ ┌──────────┐ ┌───────────────┐ ┌─────────────┐ ┌───────┐ ┌──────┐ ┌─────────┐ ┌───────┐
│Inbound │──▶│Chunker │──▶│Arranger │──▶│Canonicalizer │──▶│CacheInjector│──▶│Hasher │──▶│Dedup │──▶│Outbound │──▶│Proxy │
│Adapter │ │ │ │ │ │ │ │ │ │ │ │ │ │Adapter │ │ │
└─────────┘ └─────────┘ └──────────┘ └───────────────┘ └─────────────┘ └───────┘ └──────┘ └─────────┘ └───────┘
│ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │
Parse to Split into Order blocks: Deterministic Inject vendor- Compute Check Build Forward
Envelope Block types System→Tool→ JSON serialization specific cache hashes for vendor- to upstream
History→Query directives duplicates specific format
| Component | Function | Impact | Performance |
|---|---|---|---|
| Chunker | Splits messages into System/Tool/History/Query blocks | Structured processing | <1ms |
| Arranger | Orders blocks: System → Tool (sorted) → History → Query | Cache prefix alignment | <1ms |
| Canonicalizer | Deterministic JSON serialization | Byte-perfect hash stability | <2ms |
| CacheInjector | Vendor-specific cache directives | Maximize vendor KV cache | <1ms |
| Hasher | PrefixHash (cache) + FullHash (dedup) | Intelligent routing | <1ms |
| Dedup | In-flight request deduplication | Zero redundant calls | <1ms |
Total Pipeline Overhead: <10ms (P99)
| Feature | TokenRouter | Cloudflare AI Gateway | LiteLLM |
|---|---|---|---|
| KV Cache Optimization | ✅ Structural convergence | ❌ Passthrough only | ❌ Passthrough only |
| Request Deduplication | ✅ In-memory | ❌ No | ❌ No |
| Tool Normalization | ✅ Alphabetical sort | ❌ No | ❌ No |
| Cost Tracking | ✅ Real-time Prometheus | ||
| Open Source | ✅ Full | ❌ Proprietary | ✅ Full |
| Self-Hosted | ✅ Yes | ❌ Cloud only | ✅ Yes |
| Streaming Support | ✅ Full | ✅ Limited | ✅ Full |
| Multi-Provider | ✅ DeepSeek/OpenAI/Anthropic | ✅ Multiple | ✅ Multiple |
┌─────────────────────────────────────────────────────────────────┐
│ Cost per 1M Tokens (USD) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Direct API Call │████████████████████████████████│ $1.00 │
│ (no optimization) │ │ │
│ │ │ │
│ With TokenRouter │█████ │ $0.10 │
│ (70% cache hit) │ │ │
│ │ │ │
│ Savings │████████████████████████████ │ 90% ↓ │
│ │ │ │
└─────────────────────────────────────────────────────────────────┘
# Clone repository
git clone https://github.com/GouBuliya/TokenRouter.git
cd TokenRouter/deployments
# Start all services
docker compose up -d
# View logs
docker compose logs -fAccess:
- TokenRouter API: http://localhost:8080
- Grafana Dashboard: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
# Clone repository
git clone https://github.com/GouBuliya/TokenRouter.git
cd TokenRouter
# Build
make build
# Run tests
make test
# Run locally (requires Postgres and Redis)
cp .env.example .env
# Edit .env with your API keys
make devcurl -X POST http://localhost:8080/admin/api-keys \
-H "Content-Type: application/json" \
-d '{
"name": "my-key",
"quota_usd": 100
}'Response:
{
"id": "uuid-here",
"key": "sk-tr-abc123...",
"quota_usd": 100
}
⚠️ Save the key immediately - it's only shown once!
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-tr-abc123..." \
-d '{
"model": "deepseek-v4-flash",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
]
}'curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-tr-abc123..." \
-d '{
"model": "deepseek-v4-flash",
"messages": [
{"role": "user", "content": "What is the weather in Beijing?"}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string"}
},
"required": ["city"]
}
}
}
]
}'| Variable | Description | Default | Required |
|---|---|---|---|
PORT |
HTTP server port | 8080 |
❌ |
DATABASE_URL |
Postgres connection string | - | ✅ |
REDIS_URL |
Redis connection string | - | ✅ |
DEEPSEEK_API_KEY |
DeepSeek API key | - | ✅ |
CACHE_INJECT_ENABLED |
Enable cache injection | true |
❌ |
DEDUP_ENABLED |
Enable request deduplication | true |
❌ |
TOOL_SORT_ENABLED |
Enable tool alphabetical sorting | true |
❌ |
DEDUP_TTL |
Deduplication TTL | 2m |
❌ |
LOG_LEVEL |
Log level (debug/info/warn/error) | info |
❌ |
See .env.example for full list.
Development Environment
PORT=8080
LOG_LEVEL=debug
DATABASE_URL=postgres://tokenrouter:tokenrouter@localhost:5432/tokenrouter?sslmode=disable
REDIS_URL=redis://localhost:6379/0
DEEPSEEK_API_KEY=sk-xxx
DEDUP_ENABLED=true
CACHE_INJECT_ENABLED=true
RATE_LIMIT_ENABLED=false # Disable for developmentProduction Environment (Small Scale)
PORT=8080
LOG_LEVEL=warn
DATABASE_URL=postgres://user:pass@db.example.com:5432/tokenrouter?sslmode=require
REDIS_URL=redis://redis.example.com:6379/0
DEEPSEEK_API_KEY=sk-xxx
DB_MAX_OPEN_CONNS=50
DB_MAX_IDLE_CONNS=10
DB_CONN_MAX_LIFETIME=30m
AUTH_CACHE_TTL=5mProduction Environment (High Concurrency)
PORT=8080
LOG_LEVEL=error
DATABASE_URL=postgres://user:pass@db.example.com:5432/tokenrouter?sslmode=require
REDIS_URL=redis://redis-cluster.example.com:6379/0
# High concurrency settings
GLOBAL_CONCURRENT_LIMIT=10000
STREAM_CONCURRENT_LIMIT=6000
NON_STREAM_CONCURRENT_LIMIT=4000
PROVIDER_CONCURRENT_LIMIT=1000
DB_MAX_OPEN_CONNS=100
DB_MAX_IDLE_CONNS=25
DB_CONN_MAX_LIFETIME=1h
# Connection pool optimization
PROXY_MAX_IDLE_CONNS=10000
PROXY_MAX_IDLE_CONNS_PER_HOST=1000
PROXY_MAX_CONNS_PER_HOST=10000
PROXY_IDLE_CONN_TIMEOUT=90s- 📖 Installation Guide - Complete setup instructions
- ⚙️ Configuration Guide - Environment variables and tuning
- 🚀 Quick Start - Development environment setup
- 💡 Usage Examples - API call examples
- 🏗 System Architecture - Core architecture and module design
- 🔌 Adapter Design - Inbound/Outbound adapter patterns
- 💾 Cache Intelligence - Cache optimization strategies
- 📡 Chat Completions API - API endpoint specifications
- 🔧 Admin API - Management endpoints
- 🤝 Contributing Guide - How to contribute
- 🧪 Testing Guide - End-to-end testing
- 🔌 Adapter Development - Building new provider adapters
We welcome contributions! Please see our Contributing Guide for details.
# Fork and clone
git clone https://github.com/YOUR_USERNAME/TokenRouter.git
cd TokenRouter
# Create branch
git checkout -b feature/your-feature
# Make changes and test
make test
make lint
# Commit and push
git commit -am "feat: add your feature"
git push origin feature/your-feature
# Open Pull RequestLook for issues labeled good first issue to get started.
This project is licensed under the Apache License 2.0.
- Inspired by Cloudflare AI Gateway
- Cache optimization concepts from Anthropic
- Built with Gin and GORM
- GitHub Issues: Report bugs or request features
- Discussions: Join the conversation
- Email: Contact maintainers
- Twitter: @TokenRouter (coming soon)
- Discord: Join our community (coming soon)