LLM smart router. Classifies prompts by complexity and routes to the optimal model.
OpenAI API-compatible proxy — drop in as a base URL and let kani pick the right model automatically.
```
Request → Distilled Feature Classifier (15 dimensions) → Tier + Agentic Score → Capability Filter → Model Selection (round-robin) → Upstream Provider
                    │
                    └─ model unavailable → conservative default
```
Classification pipeline:

- Distilled feature classifier — deterministic `tokenCount` + 14 learned semantic dimensions
- Weighted synthesis — one unified score drives tier selection (`SIMPLE`/`MEDIUM`/`COMPLEX`/`REASONING`)
- Unified agentic score — the `agenticTask` dimension is mapped directly to `agentic_score`
- Conservative default — fall back to `MEDIUM` when the feature model is unavailable
- Capability filter — auto-detects vision/tools/json_mode from the request and escalates to a capable model
Every request is logged to `$XDG_STATE_HOME/kani/log/` (default: `~/.local/state/kani/log/`) as training data for future model improvement.
kani no longer relies on hand-maintained keyword lists or a runtime LLM fallback for routing. The scorer is now distilled-feature-first:

- compute `tokenCount` deterministically
- infer 14 semantic dimensions (`low`/`medium`/`high`) using a learned multi-output classifier
- combine all 15 dimensions with explicit weights to determine tier and confidence
- derive `agentic_score` from the `agenticTask` dimension in the same pipeline
- return a conservative default tier only when the feature model is unavailable
This makes routing behavior easier to improve with data, because changes come from retraining and calibration rather than runtime prompt engineering.
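The weighted-synthesis step is easiest to see in miniature. Below is a sketch under assumed values: the weights, thresholds, and tie-breaking rule are illustrative placeholders, not kani's trained parameters.

```python
# Illustrative sketch of weighted tier synthesis. Weights and thresholds
# are hypothetical; kani's real values live in the trained model bundle.
LABEL_VALUES = {"low": 0.0, "medium": 0.5, "high": 1.0}

def synthesize(token_count: int, labels: dict[str, str],
               weights: dict[str, float]) -> tuple[str, float, float]:
    # Normalize tokenCount into [0, 1] against an assumed 4k budget.
    features = {"tokenCount": min(token_count / 4096, 1.0)}
    features.update({dim: LABEL_VALUES[lab] for dim, lab in labels.items()})

    # One unified score: weighted sum over all 15 dimensions.
    score = sum(weights.get(dim, 0.0) * val for dim, val in features.items())

    # Tier from score thresholds; strong reasoning markers force REASONING.
    if labels.get("reasoningMarkers") == "high":
        tier = "REASONING"
    elif score >= 0.65:
        tier = "COMPLEX"
    elif score >= 0.35:
        tier = "MEDIUM"
    else:
        tier = "SIMPLE"

    # agentic_score is derived directly from the agenticTask dimension.
    agentic_score = LABEL_VALUES[labels.get("agenticTask", "low")]
    return tier, score, agentic_score
```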
Run directly with uvx:

```bash
# Classify a prompt
uvx --from git+https://github.com/tumf/kani kani route "hello world"

# Start the proxy server
uvx --from git+https://github.com/tumf/kani kani serve
```

Or run from a local checkout:

```bash
git clone https://github.com/tumf/kani.git && cd kani
uv sync
uv run kani route "hello world"
uv run kani serve
```

kani speaks the OpenAI API. Change `base_url` and `model`; everything else stays the same.
With the official OpenAI client:

```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",  # OpenAI key
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "explain quicksort"}],
)
```

With OpenRouter:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter
    api_key="sk-or-...",                      # OpenRouter key
)
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",
    messages=[{"role": "user", "content": "explain quicksort"}],
)
```

With kani:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18420/v1",  # ← kani
    api_key="anything",                    # kani handles upstream auth
)

# kani picks the best model based on prompt complexity
response = client.chat.completions.create(
    model="kani/auto",
    messages=[{"role": "user", "content": "explain quicksort"}],
)

# Or pin a routing profile
response = client.chat.completions.create(
    model="kani/premium",  # always use best-quality models
    messages=[{"role": "user", "content": "prove P != NP"}],
)
```

Or with plain curl:

```bash
curl http://localhost:18420/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "kani/auto", "messages": [{"role": "user", "content": "hello"}]}'
```

That's it. Any tool or library that supports the OpenAI API works with kani — LangChain, LlamaIndex, Cursor, Continue, etc. Just point `base_url` at kani.
Note: The routing profiles below are sample/reference defaults. Treat them as examples — you should tune the actual profile names, strategies, and model mappings to match your own workload and cost/quality goals.
| Profile | Strategy | Best for |
|---|---|---|
| `kani/auto` | Balanced cost/quality (default) | General use |
| `kani/eco` | Cheapest viable models | High volume, low stakes |
| `kani/premium` | Best-quality models | Critical tasks |
| `kani/agentic` | Tool-use optimized | Agent workflows |
kani automatically detects required capabilities from the request and routes to a model that supports them. If no model in the scored tier has the required capabilities, kani escalates to higher tiers.
Detected capabilities:
| Capability | Trigger |
|---|---|
| `vision` | `image_url` content block in messages |
| `tools` | `tools` or `functions` field in request |
| `json_mode` | `response_format.type` is `json_object` or `json_schema` |
Configuration: declare model capabilities via prefix matching in `config.yaml`:

```yaml
model_capabilities:
  - prefix: "anthropic/claude-"
    capabilities: [vision, tools, json_mode]
  - prefix: "google/gemini-"
    capabilities: [vision, tools, json_mode]
  - prefix: "openai/gpt-4"
    capabilities: [vision, tools, json_mode]
```

When no `model_capabilities` are configured, capability filtering is skipped and routing works as before.
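Conceptually, the filter looks something like the sketch below. The function names are illustrative, but the trigger rules mirror the table above and the first-matching-prefix behavior described here.

```python
# Sketch of request capability detection and prefix-based model filtering.
def detect_capabilities(request: dict) -> set[str]:
    caps: set[str] = set()
    for msg in request.get("messages", []):
        content = msg.get("content")
        if isinstance(content, list) and any(
            part.get("type") == "image_url" for part in content
        ):
            caps.add("vision")
    if request.get("tools") or request.get("functions"):
        caps.add("tools")
    if request.get("response_format", {}).get("type") in ("json_object", "json_schema"):
        caps.add("json_mode")
    return caps

def supports(model: str, required: set[str], rules: list[dict]) -> bool:
    # No rules configured -> filtering is skipped entirely.
    if not rules:
        return True
    for rule in rules:  # first matching prefix wins
        if model.startswith(rule["prefix"]):
            return required <= set(rule["capabilities"])
    return not required  # unknown model: only acceptable if nothing is required
```

If no candidate in the scored tier passes `supports`, kani escalates to a higher tier, as described above.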
| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Main proxy (OpenAI-compatible) |
| `/v1/models` | GET | List available models |
| `/v1/route` | POST | Debug — returns routing decision without proxying |
| `/admin/reload-config` | POST | Admin-only safe config hot reload |
| `/health` | GET | Health + active config version metadata |
Routed responses include extra headers: `X-Kani-Tier`, `X-Kani-Model`, `X-Kani-Score`, `X-Kani-Signals`.
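To read the routing decision programmatically, the OpenAI Python client's standard `with_raw_response` wrapper exposes HTTP headers alongside the parsed body; shown here against the kani endpoint:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:18420/v1", api_key="anything")

# with_raw_response returns the raw HTTP response plus a .parse() helper
raw = client.chat.completions.with_raw_response.create(
    model="kani/auto",
    messages=[{"role": "user", "content": "explain quicksort"}],
)
print(raw.headers.get("X-Kani-Tier"), raw.headers.get("X-Kani-Model"))
completion = raw.parse()  # the usual ChatCompletion object
```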
`config.yaml`:

```yaml
host: "0.0.0.0"
port: 18420
default_provider: openrouter
default_profile: auto

providers:
  openrouter:
    name: openrouter
    base_url: "https://openrouter.ai/api/v1"
    api_key: "${OPENROUTER_API_KEY}"
  cliproxy:
    name: cliproxy
    base_url: "http://127.0.0.1:8317/v1"
    api_key: "local-test-key"

profiles:
  auto:
    tiers:
      SIMPLE:
        # primary can be a single model or an ordered list for round-robin
        primary: ["google/gemini-2.5-flash", "google/gemini-2.5-flash-lite"]
        fallback: ["nvidia/gpt-oss-120b"]
      MEDIUM:
        primary: "moonshotai/kimi-k2.5"
        fallback: null  # allowed; normalized to []
      COMPLEX:
        primary: "google/gemini-3.1-pro"
        fallback: ["anthropic/claude-sonnet-4.6"]
      REASONING:
        primary: "x-ai/grok-4-1-fast-reasoning"
        fallback: ["anthropic/claude-sonnet-4.6"]
    # provider: per-tier override (optional)

smart_proxy:
  fallback_backoff:
    enabled: true
    initial_delay_seconds: 5
    multiplier: 2
    max_delay_seconds: 300
```
- `${VAR}` syntax resolves environment variables
- Each tier can specify its own `provider` or inherit `default_provider`
- `primary` accepts a string, a `{model, provider}` object, or a list of those; list entries are selected round-robin per `profile+tier` combination
- `fallback: null` is accepted only at `profiles.*.tiers.*.fallback` and normalized to `[]`
- When primary fails, fallback attempts skip the failed primary candidate and deduplicate repeated `model+provider` entries
- `smart_proxy.fallback_backoff` enables process-local exponential cooldowns for retryable non-streaming `429`/`5xx` failures, keyed by `model+provider`
- Cooled-down `model+provider` pairs are skipped during both primary selection and fallback execution; the same model on a different provider remains eligible
- Successful recovery resets the failure streak for that exact `model+provider` pair, and restarting kani clears the in-memory cooldown registry
- Config path: `--config` flag > `$KANI_CONFIG` env var > `./config.yaml` > `$XDG_CONFIG_HOME/kani/config.yaml` > `/etc/kani/config.yaml`
- Set `KANI_ADMIN_TOKEN` to enable `POST /admin/reload-config` (admin-only, separate from regular API keys)
- Hot reload validates with `strict=True` and rejects non-reloadable field changes (`host`, `port`) with `409`
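The cooldown mechanics can be pictured with a small sketch. The class and field names are illustrative, not kani's actual code, but the delay schedule follows the `fallback_backoff` settings above.

```python
import time

# Process-local cooldown registry keyed by (model, provider).
class CooldownRegistry:
    def __init__(self, initial=5.0, multiplier=2.0, max_delay=300.0):
        self.initial, self.multiplier, self.max_delay = initial, multiplier, max_delay
        # key -> (failure streak, cooldown expiry on the monotonic clock)
        self._state: dict[tuple[str, str], tuple[int, float]] = {}

    def is_cooling(self, model: str, provider: str) -> bool:
        _, until = self._state.get((model, provider), (0, 0.0))
        return time.monotonic() < until

    def record_failure(self, model: str, provider: str) -> None:
        streak, _ = self._state.get((model, provider), (0, 0.0))
        delay = min(self.initial * self.multiplier ** streak, self.max_delay)
        self._state[(model, provider)] = (streak + 1, time.monotonic() + delay)

    def record_success(self, model: str, provider: str) -> None:
        # Recovery resets the streak for this exact model+provider pair.
        self._state.pop((model, provider), None)
```

Candidates for which `is_cooling()` is true are skipped during both primary selection and fallback; since the registry lives in memory, restarting the process clears it.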
kani can optionally reduce context pressure for long-running conversations by compacting oversized message histories before proxying upstream (Phase A) and by pre-computing summaries in the background for reuse on later requests (Phase B).
All compaction behavior is opt-in and disabled by default. When disabled or when compaction fails, kani routes and proxies requests unchanged.
Add a `smart_proxy` section to your `config.yaml`:

```yaml
smart_proxy:
  context_compaction:
    enabled: true                 # master switch
    sync_compaction:
      enabled: true               # Phase A: compact inline before proxying
      threshold_percent: 80.0     # compact when prompt ≥ 80% of context window
      protect_first_n: 1          # turns to keep at head of conversation
      protect_last_n: 2           # turns to keep at tail
      summary_model: ""           # empty = use 'compress' profile primary model
    background_precompaction:
      enabled: true               # Phase B: pre-compute summaries async
      trigger_percent: 70.0       # start background job at 70% usage
      max_concurrency: 2          # max parallel background jobs
      summary_ttl_seconds: 3600
    session:
      header_name: X-Kani-Session-Id  # client header for explicit session binding
    context_window_tokens: 128000     # assumed context window for threshold math
```

A `compress` routing profile (see `config.example.yaml`) is used as the default summarisation model when `summary_model` is empty.
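The Phase A trigger math is simple. In this sketch, `estimate_tokens` is a stand-in for whatever tokenizer kani actually uses (an assumption), and the split helper assumes the history is longer than the protected window:

```python
# Sketch of the Phase A trigger and protected-turn split under the settings above.
def should_compact(messages, estimate_tokens, context_window_tokens=128_000,
                   threshold_percent=80.0) -> bool:
    prompt_tokens = sum(estimate_tokens(m["content"]) for m in messages)
    return prompt_tokens >= context_window_tokens * threshold_percent / 100

def split_for_compaction(messages, protect_first_n=1, protect_last_n=2):
    # Protected head/tail turns pass through unchanged; only the middle
    # of the conversation is summarized.
    head = messages[:protect_first_n]
    middle = messages[protect_first_n:len(messages) - protect_last_n]
    tail = messages[len(messages) - protect_last_n:]
    return head, middle, tail
```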
kani resolves a stable session key in this order:

- Explicit header — value of `session.header_name` (preferred; required for Phase B cache hits)
- Derived — deterministic hash of model + first/last message content

The resolution mode is surfaced in the `X-Kani-Compaction-Session` response header.
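The derived fallback can be sketched as below; the exact hash recipe is an assumption, but it shows why explicit headers are preferred: the derived key changes whenever the last message changes, which defeats Phase B cache reuse.

```python
import hashlib

# Sketch: deterministic session key from model + first/last message content.
def derive_session_key(model: str, messages: list[dict]) -> str:
    first = str(messages[0].get("content", "")) if messages else ""
    last = str(messages[-1].get("content", "")) if messages else ""
    digest = hashlib.sha256(f"{model}\x00{first}\x00{last}".encode()).hexdigest()
    return digest[:16]
```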
Each routed response includes compaction headers:

| Header | Values | Meaning |
|---|---|---|
| `X-Kani-Compaction` | `off` \| `skipped` \| `inline` \| `cached` \| `failed` | What compaction did |
| `X-Kani-Compaction-Session` | `explicit` \| `derived` | How session was resolved |
| `X-Kani-Compaction-Saved-Tokens` | integer | Estimated tokens saved |
Structured log fields are emitted at INFO level on every compaction decision. Failures are logged at WARNING level and never propagate to the client.
Use admin-only config hot reload without restarting the proxy:

```bash
# 1) set an admin token (separate from normal API keys)
export KANI_ADMIN_TOKEN="your-admin-token"

# 2) trigger reload after editing config.yaml
curl -X POST http://localhost:18420/admin/reload-config \
  -H "Authorization: Bearer ${KANI_ADMIN_TOKEN}"
```

Behavior:

- Reload is applied only when strict config validation succeeds.
- In-flight requests keep the state snapshot captured at request start.
- Changes to `host`/`port` are rejected as non-reloadable with `409` and require a process restart.
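One way to picture the snapshot semantics, as a sketch rather than kani's actual code (strict validation would run before `reload` is called):

```python
import threading

class NonReloadableChange(Exception):
    """Mapped to HTTP 409 by the reload endpoint."""

# Requests capture the config reference once at start; reload swaps the
# reference atomically, so in-flight requests never see a mixed config.
class ConfigHolder:
    def __init__(self, config: dict):
        self._lock = threading.Lock()
        self._config = config

    def snapshot(self) -> dict:
        # Captured at request start; later reloads leave this request untouched.
        return self._config

    def reload(self, new_config: dict) -> None:
        if (new_config["host"], new_config["port"]) != (
            self._config["host"], self._config["port"]
        ):
            raise NonReloadableChange("host/port require a restart")
        with self._lock:
            self._config = new_config  # atomic reference swap
```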
No additional services are required. Compaction state is persisted in SQLite under `$XDG_DATA_HOME/kani/compaction.db` (default: `~/.local/share/kani/compaction.db`). Override with `KANI_DATA_DIR`.
```bash
# Verify compaction is active after startup:
curl -s http://localhost:18420/health | jq .

# Inspect a routed request's compaction outcome:
curl -v -X POST http://localhost:18420/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Kani-Session-Id: my-session-1" \
  -d '{"model":"kani/auto","messages":[{"role":"user","content":"hello"}]}' \
  2>&1 | grep -i "x-kani-compaction"
```

Runtime routing does not call an LLM. LLM usage is limited to offline dataset generation when logs are missing semantic labels.
Optional annotator configuration (for `scripts/build_agentic_dataset.py --annotate-missing`) can be set in `config.yaml` under `feature_annotator`, or overridden with env vars:

```yaml
feature_annotator:
  model: "gemini-2.5-flash-lite"
  provider: "cliproxy"  # optional; defaults to default_provider
```

`feature_annotator` and `llm_classifier` connection details are provider-resolved. In `config.yaml`, set `model` + optional `provider`; do not set `base_url` or `api_key` directly in these sections.
| Env var | Default | Description |
|---|---|---|
| `KANI_LLM_ANNOTATOR_MODEL` | `google/gemini-2.5-flash-lite` | Annotation model |
| `KANI_LLM_ANNOTATOR_BASE_URL` | `https://openrouter.ai/api/v1` | API endpoint |
| `KANI_LLM_ANNOTATOR_API_KEY` | `$OPENROUTER_API_KEY` | API key |
Priority is: CLI flags > env vars > `config.yaml` `feature_annotator` > built-in defaults.
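That precedence chain amounts to "first non-empty source wins"; a sketch, with hypothetical function and argument names:

```python
import os

# Illustrative resolution of the annotator model; not kani's actual API.
def resolve_annotator_model(cli_flag: str | None, config: dict) -> str:
    for candidate in (
        cli_flag,                                              # 1. CLI flag
        os.environ.get("KANI_LLM_ANNOTATOR_MODEL"),            # 2. env var
        (config.get("feature_annotator") or {}).get("model"),  # 3. config.yaml
    ):
        if candidate:
            return candidate
    return "google/gemini-2.5-flash-lite"                      # 4. built-in default
```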
All decisions are logged to `$XDG_STATE_HOME/kani/log/routing-YYYY-MM-DD.jsonl` (default: `~/.local/state/kani/log/`):

```json
{"timestamp":"2025-03-21T19:50:00","prompt_preview":"prove the Riemann...","tier":"REASONING","score":0.82,"confidence":0.87,"method":"distilled-features","agentic_score":1.0,"signals":{"tokenCount":38,"semanticLabels":{"reasoningMarkers":"high","agenticTask":"high"},"featureVersion":"v1"}}
```

Use these logs to build distilled feature training data:
```bash
uv run python scripts/build_agentic_dataset.py \
  --output data/distilled_feature_dataset.json
```

When existing logs do not yet include semantic labels, you can annotate missing examples offline:

```bash
uv run python scripts/build_agentic_dataset.py \
  --annotate-missing \
  --output data/distilled_feature_dataset.json
```

Then train the multi-output feature classifier bundle:

```bash
uv run python scripts/train_classifier.py \
  --data data/distilled_feature_dataset.json \
  --output models
```

This writes `models/feature_classifier.pkl` with the sklearn multi-output classifier, per-dimension label encoders, weights, thresholds, and embedding metadata.
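For offline inspection, the bundle can be loaded like any pickle. The key names below are assumptions about the bundle layout, not a documented schema:

```python
import pickle

# Hypothetical sketch of peeking into the trained bundle.
with open("models/feature_classifier.pkl", "rb") as f:
    bundle = pickle.load(f)

clf = bundle["classifier"]           # sklearn multi-output classifier
encoders = bundle["label_encoders"]  # one label encoder per semantic dimension
weights = bundle["weights"]          # per-dimension synthesis weights
print(sorted(encoders))              # lists the 14 semantic dimension names
```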
kani supports API key authentication to restrict proxy access. Keys are managed via the CLI and stored in `$XDG_DATA_HOME/kani/api_keys.json`.

When no keys are configured, all requests pass through without authentication (backward-compatible). As soon as one key is added, every API request must include a valid `Authorization: Bearer <key>` header.
```bash
# Create a key (auto-generated, shown once)
kani keys add hermes
# kani-aBcDeFgH...  ← save this

# List keys (prefix only, secrets are not stored in plaintext)
kani keys list

# Remove a key by name or prefix
kani keys remove hermes
```

Using the key:

```bash
curl http://localhost:18420/v1/chat/completions \
  -H "Authorization: Bearer kani-aBcDeFgH..." \
  -H "Content-Type: application/json" \
  -d '{"model": "kani/auto", "messages": [{"role": "user", "content": "hello"}]}'
```

```python
client = OpenAI(
    base_url="http://localhost:18420/v1",
    api_key="kani-aBcDeFgH...",  # kani API key
)
```

`/health` and `/docs` are exempt from authentication. No server restart required — keys take effect immediately.
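Since secrets are not stored in plaintext, verification presumably compares hashes. A sketch of that idea; the JSON layout of `api_keys.json` and the hash choice are assumptions:

```python
import hashlib
import hmac
import json
from pathlib import Path

# Sketch of bearer-key checking against hashed storage.
def is_authorized(bearer: str, store: Path) -> bool:
    records = json.loads(store.read_text()) if store.exists() else []
    if not records:  # no keys configured -> open passthrough
        return True
    digest = hashlib.sha256(bearer.encode()).hexdigest()
    # Constant-time comparison against each stored hash.
    return any(hmac.compare_digest(digest, r["sha256"]) for r in records)
```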
```bash
kani serve [--config path] [--host 0.0.0.0] [--port 18420]
kani route "your prompt here" [--config path]
kani config [--config path]
kani keys add <name>
kani keys list
kani keys remove <name|prefix>
```

```
src/kani/
├── scorer.py   # distilled feature scoring (15-dimensional classifier)
├── router.py   # Tier → model+provider mapping
├── proxy.py    # FastAPI OpenAI-compatible server
├── config.py   # YAML config loading, env var resolution
├── dirs.py     # XDG-compliant directory paths (config, data, logs)
├── logger.py   # JSONL routing log
└── cli.py      # Click CLI
```
```bash
uv sync
uv run pytest tests/ -q   # 176 tests
uv run ruff check src/
uv run pyright src/
```

Scoring logic ported from ClawRouter (MIT license).
MIT
