TokenRouter

LLM API Gateway with Intelligent Cache Optimization


🎯 Why TokenRouter?

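Many LLM providers bill input tokens that miss the prompt cache at up to 10x the rate of tokens that hit it. TokenRouter sits between your clients and the providers and reshapes requests so that more of them hit that cache: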

┌──────────────┐     ┌─────────────────────────────────────────────────────────┐     ┌─────────────┐
│   Client A   │────▶│                                                         │────▶│   DeepSeek  │
├──────────────┤     │   TokenRouter Gateway                                    │     ├─────────────┤
│   Client B   │────▶│   Cache Optimization • Deduplication • Cost Tracking    │────▶│   OpenAI    │
├──────────────┤     │                                                         │     ├─────────────┤
│   Client C   │────▶│                                                         │────▶│  Anthropic  │
└──────────────┘     └─────────────────────────────────────────────────────────┘     └─────────────┘

| Problem | TokenRouter Solution | Impact |
|---|---|---|
| Low cache hit rate (<30%) | Structural convergence via Chunker + Arranger + Canonicalizer | Cache hits >70% |
| Inconsistent tool ordering | Alphabetical normalization for cross-user cache sharing | Cross-user cache sharing |
| Duplicate concurrent requests | In-memory deduplication (zero upstream calls) | Eliminate redundant calls |
| No cost visibility | Real-time Prometheus metrics (cache savings, dedup savings) | Track every dollar saved |

Result: Cache hit rates >70%, cost reduction up to 90%
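
The in-memory deduplication idea from the table above can be illustrated with a small Go sketch. It is not TokenRouter's actual code: the `Deduper` type, its methods, and the `"full-hash"` key are hypothetical names, and the sketch assumes duplicates are detected by an exact key match while the first request is still in flight.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// call tracks one in-flight upstream request and its eventual result.
type call struct {
	done    chan struct{}
	waiters int
	body    string
	err     error
}

// Deduper (hypothetical) coalesces concurrent requests that share a key.
type Deduper struct {
	mu       sync.Mutex
	inflight map[string]*call
}

func NewDeduper() *Deduper { return &Deduper{inflight: make(map[string]*call)} }

// Do runs fn once per key among concurrent callers. Followers block until the
// leader finishes and reuse its result; shared reports whether that happened.
func (d *Deduper) Do(key string, fn func() (string, error)) (body string, shared bool, err error) {
	d.mu.Lock()
	if c, ok := d.inflight[key]; ok {
		c.waiters++
		d.mu.Unlock()
		<-c.done
		return c.body, true, c.err
	}
	c := &call{done: make(chan struct{})}
	d.inflight[key] = c
	d.mu.Unlock()

	c.body, c.err = fn() // the single upstream call

	d.mu.Lock()
	delete(d.inflight, key)
	d.mu.Unlock()
	close(c.done)
	return c.body, false, c.err
}

// Waiters reports how many followers are waiting on key (used by the demo only).
func (d *Deduper) Waiters(key string) int {
	d.mu.Lock()
	defer d.mu.Unlock()
	if c, ok := d.inflight[key]; ok {
		return c.waiters
	}
	return 0
}

// demo issues three concurrent requests with the same key and returns the
// number of upstream calls plus each caller's shared flag.
func demo() (int, []bool) {
	d := NewDeduper()
	upstreamCalls := 0
	started := make(chan struct{})
	release := make(chan struct{})
	fn := func() (string, error) {
		upstreamCalls++
		close(started)
		<-release // hold the leader until both followers have joined
		return "response", nil
	}
	shared := make([]bool, 3)
	var wg sync.WaitGroup
	run := func(i int) {
		defer wg.Done()
		_, shared[i], _ = d.Do("full-hash", fn)
	}
	wg.Add(1)
	go run(0)
	<-started
	for i := 1; i < 3; i++ {
		wg.Add(1)
		go run(i)
	}
	for d.Waiters("full-hash") < 2 {
		runtime.Gosched()
	}
	close(release)
	wg.Wait()
	return upstreamCalls, shared
}

func main() {
	calls, shared := demo()
	fmt.Println(calls, shared) // 1 [false true true]
}
```

The `golang.org/x/sync/singleflight` package offers the same pattern off the shelf; the hand-rolled version here only exists to make the mechanics visible.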


📊 Performance Metrics

┌──────────────────────────────────────────────────────────────────────────┐
│                    TokenRouter Performance Dashboard                      │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Throughput        P99 Latency      Cache Hit Rate      Cost Savings    │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐       ┌──────────┐     │
│  │ 10,000   │     │  <50ms   │     │   >70%   │       │  Up to   │     │
│  │  req/s   │     │          │     │          │       │   90%    │     │
│  └──────────┘     └──────────┘     └──────────┘       └──────────┘     │
│                                                                          │
│  ████████████████████████████████████████████████████████████████ 95%   │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘

Based on load testing with 10,000 concurrent requests:

| Metric | Value | Baseline | Improvement |
|---|---|---|---|
| Throughput | 10,000 req/s | 1,000 req/s | 10x |
| P99 Latency | <50ms | 200ms | 75%↓ |
| Cache Hit Rate | >70% | <30% | 2.3x |
| Cost Savings | Up to 90% | 0% | 90%↓ |
| Dedup Rate | >5% | 0% | New |


🏗 Architecture

Every incoming request flows through this pipeline:

┌─────────┐   ┌─────────┐   ┌──────────┐   ┌───────────────┐   ┌─────────────┐   ┌───────┐   ┌──────┐   ┌─────────┐   ┌───────┐
│Inbound  │──▶│Chunker  │──▶│Arranger  │──▶│Canonicalizer  │──▶│CacheInjector│──▶│Hasher │──▶│Dedup │──▶│Outbound │──▶│Proxy  │
│Adapter  │   │         │   │          │   │               │   │             │   │       │   │      │   │Adapter  │   │       │
└─────────┘   └─────────┘   └──────────┘   └───────────────┘   └─────────────┘   └───────┘   └──────┘   └─────────┘   └───────┘
     │              │              │                │                  │                │           │            │
     │              │              │                │                  │                │           │            │
  Parse to      Split into    Order blocks:    Deterministic     Inject vendor-   Compute     Check     Build      Forward
  Envelope      Block types   System→Tool→     JSON serialization  specific cache  hashes     for       vendor-    to upstream
                              History→Query                         directives                 duplicates specific  format

Core Components

| Component | Function | Impact | Performance |
|---|---|---|---|
| Chunker | Splits messages into System/Tool/History/Query blocks | Structured processing | <1ms |
| Arranger | Orders blocks: System → Tool (sorted) → History → Query | Cache prefix alignment | <1ms |
| Canonicalizer | Deterministic JSON serialization | Byte-perfect hash stability | <2ms |
| CacheInjector | Vendor-specific cache directives | Maximize vendor KV cache | <1ms |
| Hasher | PrefixHash (cache) + FullHash (dedup) | Intelligent routing | <1ms |
| Dedup | In-flight request deduplication | Zero redundant calls | <1ms |

Total Pipeline Overhead: <10ms (P99)
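
To make the Arranger, Canonicalizer, and Hasher steps concrete, here is a minimal Go sketch. It is an illustration under assumptions, not the project's implementation: the `Envelope` and `Tool` types, the `Hashes` function, the choice of SHA-256, and the decision to hash System + Tools as the prefix are all hypothetical. It relies on two documented `encoding/json` behaviors: struct fields are emitted in declaration order and map keys are sorted.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// Tool and Envelope are hypothetical stand-ins for the parsed request.
type Tool struct {
	Name       string         `json:"name"`
	Parameters map[string]any `json:"parameters,omitempty"`
}

type Envelope struct {
	System  string   `json:"system"`
	Tools   []Tool   `json:"tools"`
	History []string `json:"history"`
	Query   string   `json:"query"`
}

// prefixBlock is the part assumed to be shared across users and requests.
type prefixBlock struct {
	System string `json:"system"`
	Tools  []Tool `json:"tools"`
}

// hashJSON serializes v deterministically and returns its SHA-256 in hex.
func hashJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err) // the types above always marshal; kept simple for the sketch
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// Hashes sorts tools by name, then returns a prefix hash (System + Tools)
// and a full hash (the whole envelope).
func Hashes(e Envelope) (prefix, full string) {
	tools := append([]Tool(nil), e.Tools...) // copy so the caller's slice keeps its order
	sort.Slice(tools, func(i, j int) bool { return tools[i].Name < tools[j].Name })
	e.Tools = tools
	prefix = hashJSON(prefixBlock{System: e.System, Tools: e.Tools})
	full = hashJSON(e)
	return prefix, full
}

func main() {
	a := Envelope{
		System: "You are helpful.",
		Tools:  []Tool{{Name: "get_weather"}, {Name: "get_time"}},
		Query:  "What is the weather in Beijing?",
	}
	b := Envelope{
		System: "You are helpful.",
		Tools:  []Tool{{Name: "get_time"}, {Name: "get_weather"}},
		Query:  "What time is it in Beijing?",
	}
	pa, fa := Hashes(a)
	pb, fb := Hashes(b)
	fmt.Println(pa == pb, fa == fb) // true false
}
```

Two requests that differ only in tool order and final query end up with the same prefix hash but different full hashes, which is the property that lets a prefix be shared for caching while the full hash still distinguishes requests for deduplication.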


📈 Comparison

Feature Comparison

| Feature | TokenRouter | Cloudflare AI Gateway | LiteLLM |
|---|---|---|---|
| KV Cache Optimization | ✅ Structural convergence | ❌ Passthrough only | ❌ Passthrough only |
| Request Deduplication | ✅ In-memory | ❌ No | ❌ No |
| Tool Normalization | ✅ Alphabetical sort | ❌ No | ❌ No |
| Cost Tracking | ✅ Real-time Prometheus | ⚠️ Paid feature | ⚠️ Basic |
| Open Source | ✅ Full | ❌ Proprietary | ✅ Full |
| Self-Hosted | ✅ Yes | ❌ Cloud only | ✅ Yes |
| Streaming Support | ✅ Full | ✅ Limited | ✅ Full |
| Multi-Provider | ✅ DeepSeek/OpenAI/Anthropic | ✅ Multiple | ✅ Multiple |

Cost Comparison (1M tokens)

┌─────────────────────────────────────────────────────────────────┐
│                    Cost per 1M Tokens (USD)                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Direct API Call    │████████████████████████████████│ $1.00   │
│  (no optimization)  │                                │         │
│                     │                                │         │
│  With TokenRouter   │█████                           │ $0.10   │
│  (near-100% hit)    │                                │         │
│                     │                                │         │
│  Savings            │████████████████████████████    │ 90% ↓   │
│                     │                                │         │
└─────────────────────────────────────────────────────────────────┘
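
How savings scale with the cache hit rate can be worked out with a short calculation. The sketch below assumes a simple linear pricing model in which cached input tokens cost one tenth of uncached ones ($0.10 vs $1.00 per 1M tokens); the `blendedCost` function is a hypothetical helper, and real provider pricing, output tokens, and deduplication savings are not modeled.

```go
package main

import "fmt"

// blendedCost returns the cost per 1M input tokens for a given cache hit
// rate in [0, 1], a per-1M price for misses, and a per-1M price for hits.
func blendedCost(missPrice, hitPrice, hitRate float64) float64 {
	return (1-hitRate)*missPrice + hitRate*hitPrice
}

func main() {
	for _, rate := range []float64{0.3, 0.7, 1.0} {
		cost := blendedCost(1.00, 0.10, rate)
		fmt.Printf("hit rate %.0f%%: $%.2f per 1M tokens (%.0f%% saved)\n",
			rate*100, cost, (1-cost)*100)
	}
}
```

Under these assumptions a 30% hit rate costs about $0.73, a 70% hit rate about $0.37 (63% saved), and the 90% saving is the upper bound reached as the hit rate approaches 100%.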

🚀 Quick Start

Docker (Recommended)

# Clone repository
git clone https://github.com/GouBuliya/TokenRouter.git
cd TokenRouter/deployments

# Start all services
docker compose up -d

# View logs
docker compose logs -f

Access:

Source Build

# Clone repository
git clone https://github.com/GouBuliya/TokenRouter.git
cd TokenRouter

# Build
make build

# Run tests
make test

# Run locally (requires Postgres and Redis)
cp .env.example .env
# Edit .env with your API keys
make dev

💡 Usage Examples

1. Create API Key

curl -X POST http://localhost:8080/admin/api-keys \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-key",
    "quota_usd": 100
  }'

Response:

{
  "id": "uuid-here",
  "key": "sk-tr-abc123...",
  "quota_usd": 100
}

⚠️ Save the key immediately - it's only shown once!

2. Chat Completion

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-tr-abc123..." \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
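
The same request can be built from Go with only the standard library. This is a sketch that mirrors the curl call above; the `newChatRequest` helper and its types are hypothetical and not part of any TokenRouter SDK, and the response format is not shown because the source does not document it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

// newChatRequest builds a POST to /v1/chat/completions with a single user message.
func newChatRequest(baseURL, apiKey, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []message{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+apiKey)
	return req, nil
}

func main() {
	req, err := newChatRequest("http://localhost:8080", "sk-tr-abc123...", "deepseek-v4-flash", "Hello, how are you?")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	// To send it against a running gateway: resp, err := http.DefaultClient.Do(req)
}
```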

3. With Tools

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-tr-abc123..." \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [
      {"role": "user", "content": "What is the weather in Beijing?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string"}
            },
            "required": ["city"]
          }
        }
      }
    ]
  }'

🔧 Configuration

Environment Variables

| Variable | Description | Default | Required |
|---|---|---|---|
| PORT | HTTP server port | 8080 | |
| DATABASE_URL | Postgres connection string | - | |
| REDIS_URL | Redis connection string | - | |
| DEEPSEEK_API_KEY | DeepSeek API key | - | |
| CACHE_INJECT_ENABLED | Enable cache injection | true | |
| DEDUP_ENABLED | Enable request deduplication | true | |
| TOOL_SORT_ENABLED | Enable tool alphabetical sorting | true | |
| DEDUP_TTL | Deduplication TTL | 2m | |
| LOG_LEVEL | Log level (debug/info/warn/error) | info | |

See .env.example for full list.

Configuration Templates

Development Environment

PORT=8080
LOG_LEVEL=debug
DATABASE_URL=postgres://tokenrouter:tokenrouter@localhost:5432/tokenrouter?sslmode=disable
REDIS_URL=redis://localhost:6379/0
DEEPSEEK_API_KEY=sk-xxx
DEDUP_ENABLED=true
CACHE_INJECT_ENABLED=true
RATE_LIMIT_ENABLED=false  # Disable for development

Production Environment (Small Scale)

PORT=8080
LOG_LEVEL=warn
DATABASE_URL=postgres://user:pass@db.example.com:5432/tokenrouter?sslmode=require
REDIS_URL=redis://redis.example.com:6379/0
DEEPSEEK_API_KEY=sk-xxx
DB_MAX_OPEN_CONNS=50
DB_MAX_IDLE_CONNS=10
DB_CONN_MAX_LIFETIME=30m
AUTH_CACHE_TTL=5m

Production Environment (High Concurrency)

PORT=8080
LOG_LEVEL=error
DATABASE_URL=postgres://user:pass@db.example.com:5432/tokenrouter?sslmode=require
REDIS_URL=redis://redis-cluster.example.com:6379/0

# High concurrency settings
GLOBAL_CONCURRENT_LIMIT=10000
STREAM_CONCURRENT_LIMIT=6000
NON_STREAM_CONCURRENT_LIMIT=4000
PROVIDER_CONCURRENT_LIMIT=1000

DB_MAX_OPEN_CONNS=100
DB_MAX_IDLE_CONNS=25
DB_CONN_MAX_LIFETIME=1h

# Connection pool optimization
PROXY_MAX_IDLE_CONNS=10000
PROXY_MAX_IDLE_CONNS_PER_HOST=1000
PROXY_MAX_CONNS_PER_HOST=10000
PROXY_IDLE_CONN_TIMEOUT=90s

📚 Documentation

Getting Started

Architecture

API Reference

Development


🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Workflow

# Fork and clone
git clone https://github.com/YOUR_USERNAME/TokenRouter.git
cd TokenRouter

# Create branch
git checkout -b feature/your-feature

# Make changes and test
make test
make lint

# Commit and push
git commit -am "feat: add your feature"
git push origin feature/your-feature

# Open Pull Request

Good First Issues

Look for issues labeled good first issue to get started.

📄 License

This project is licensed under the Apache License 2.0.


Made with ❤️ for the AI community
