Drop-in prompt compression for production LLM applications. Cut your input-token bill by 40–60% — without changing your code.
```python
# before
from openai import OpenAI

# after
from leanctx import OpenAI  # same interface, compressed requests
```

On the public LongBench v2 leaderboard's short subset, leanctx-Lingua doubles accuracy versus naive head+tail truncation (40 % vs 20 %) while removing 57 % of tokens. Open-source models, runs locally, MIT-licensed. Your prompts and user data never leave your infrastructure by default.
```bash
pip install 'leanctx[openai,lingua]'   # or [anthropic], [gemini]
```

```python
from leanctx import OpenAI

client = OpenAI(
    leanctx_config={
        "mode": "on",
        "trigger": {"threshold_tokens": 2000},
        "routing": {"prose": "lingua"},  # route prose through LLMLingua-2
    },
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=512,
    messages=[{"role": "user", "content": LONG_DOCUMENT}],
)

print(response.usage.leanctx_tokens_saved)  # e.g. 1841
print(response.usage.leanctx_ratio)         # e.g. 0.49
```

The first Lingua call loads ~1.2 GB of model weights to `~/.cache/huggingface/`; subsequent calls reuse the cache. Add `pip install 'leanctx[lingua]'` to opt in; without it, leanctx falls back to passthrough.
Verify the install with no API key needed:
```bash
leanctx bench list                                    # 7 registered scenarios
leanctx bench run agent-structural --workload agent   # 5 invariants enforced, exit 0 = pass
```

You're building a production LLM app and your token bill is a line item:
- RAG apps with large retrieved documents
- Long-running conversational agents (LangChain / LangGraph / CrewAI)
- Document-processing pipelines
- Coding agents — Cursor-like / Claude-Code-like, with growing tool-call histories
Existing options have gaps:
- Provider prompt caching (Anthropic / OpenAI / Gemini) wins on stable prefixes — system prompts, tool definitions, retrieved-document pools. It doesn't help with dynamic per-query content (chat history, freshly retrieved docs, tool outputs). Compose with leanctx, don't choose between them.
- Naive truncation drops the middle of the document, exactly where many answers live. The LongBench v2 numbers above show this concretely.
- Hosted compression APIs (Compresr, Token Company) require sending your context to their servers. Closed-source models. leanctx is MIT-licensed, runs the model locally, and never makes outbound calls except to your existing provider.
A 15-item short-subset ablation: Claude Haiku 4.5 as the eval model, 20K-token head+tail truncation cap (rate-limit-friendly). Same model, same questions, same truncation across all three conditions — so the comparison is apples-to-apples. The full 503-item sweep is on the v0.3.x roadmap.
| Method | Accuracy | Tokens kept | Reproduce |
|---|---|---|---|
| Baseline (head+tail truncation only) | 20.0 % (3/15) | 100 % of 20K cap | `leanctx bench run longbench-v2` |
| leanctx Lingua (ratio=0.5) | 40.0 % (6/15) | 43 % | `LEANCTX_LBV2_COMPRESSOR=lingua leanctx bench run longbench-v2` |
| leanctx SelfLLM (Haiku, ratio=0.3) | 26.7 % (4/15) | 1.4 % | `LEANCTX_LBV2_COMPRESSOR=selfllm leanctx bench run longbench-v2` |
Lingua doubles the baseline accuracy while removing 57 % of tokens. Naive head+tail truncation drops the middle; Lingua's extractive token classifier keeps answer-bearing tokens distributed across the full document. Per-question records: docs/blog/data/lbv2-2026-05-03/.
A realistic 9-message agent transcript — user question, file reads, grep, log dumps, failed edit, error trace — totaling ~2.1K tokens, run through `leanctx.Anthropic` with content-aware routing (code → verbatim, errors → verbatim, prose → Lingua):
| Metric | Before | After | Reduction |
|---|---|---|---|
| Tokens | 2148 | 1384 | 35.6 % |

Tokens saved per request: 768.
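The routing used in this demo corresponds to the config keys documented below; a rough sketch (not the exact benchmark harness):

```python
import leanctx

# Content-aware routing from the demo: code and errors pass through verbatim,
# prose is compressed locally by LLMLingua-2.
client = leanctx.Anthropic(
    leanctx_config={
        "mode": "on",
        "routing": {
            "code": "verbatim",
            "error": "verbatim",
            "prose": "lingua",
        },
    },
)
```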
What got preserved verbatim (asserted programmatically by the agent-structural bench scenario):
- A 2 KB Python source file inside a `tool_result` block — byte-identical
- A Python traceback in an `is_error` tool result — byte-identical
- Every `tool_use_id` and the `name`/`input` of every `tool_use` block — tool linkage and tool-call payloads untouched
- `edit_file`'s `new_str` argument — the actual code edit isn't rewritten
What actually compressed:
- A 3.4 KB log dump shrank to 1.9 KB (45 % reduction) — the legitimate compression target
- Grep results and prose reasoning blocks shrank by 30–50 %
Reproduce: `leanctx bench run agent-structural --workload agent` — runs the real LLMLingua-2 model, ~30 s on Apple Silicon, no API key required. Status flips to failure with named invariants if any regress; CI-gateable.
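As one way to gate CI, a minimal wrapper script can shell out to that command and propagate the exit code (a sketch; it uses only the documented CLI invocation, and where you put the script is up to you):

```python
import subprocess
import sys

# Run the structural-invariant scenario; the CLI exits non-zero when any
# preserved-verbatim invariant regresses (exit 0 = all invariants held).
result = subprocess.run(
    ["leanctx", "bench", "run", "agent-structural", "--workload", "agent"]
)
sys.exit(result.returncode)
```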
Same 1.7 KB SRE-incident document through SelfLLM against each provider's cheapest tier:
| Provider | Model | Compression | Latency | Cost per call |
|---|---|---|---|---|
| Anthropic | `claude-haiku-4-5` | 41.6 % | 3.05 s | ~$0.0016 |
| OpenAI | `gpt-4o-mini` | 49.1 % | 6.42 s | ~$0.0003 |
| Gemini | `gemini-2.5-flash` | 48.7 % | 2.25 s ⚡ | ~$0.0001 |
All three preserved every timestamp, metric value, and action item with no hallucination. Combined with Lingua (LLMLingua-2 local) hitting 44.7 % char reduction on the same document at zero marginal cost, leanctx covers the full speed/cost/quality trade-off space.
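If the speed/cost column above points you at Gemini, here is a sketch of routing long-but-important content to a Gemini-backed SelfLLM (this assumes the `selfllm` block accepts a Gemini model id and key the same way the OpenAI example further down does; treat that provider-selection detail as an assumption, not documented behavior):

```python
import leanctx

# Sketch: LLMLingua-2 locally for ordinary prose, Gemini 2.5 Flash as the
# SelfLLM summarizer for long_important content. The "model"/"api_key" keys
# mirror the documented OpenAI selfllm example and are assumptions here.
client = leanctx.Gemini(leanctx_config={
    "mode": "on",
    "routing": {
        "prose": "lingua",
        "long_important": "selfllm",
    },
    "selfllm": {"model": "gemini-2.5-flash", "api_key": "...", "ratio": 0.3},
})
```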
Full methodology, per-provider output samples, cost analysis, bugs found in flight: docs/benchmarks/.
leanctx wraps your existing SDK call and applies a configurable compression pipeline before the request hits the wire.
```
your code
  ↓
leanctx.Anthropic / OpenAI / Gemini      ← drop-in wrapper
  ↓
Middleware (mode=on/off, threshold)
  ↓
Per-message pipeline:
  classify (code | error | prose | …)
    ↓
  route to compressor:
    Verbatim — never touch (code, errors, tool calls)
    Lingua   — LLMLingua-2 local, free marginal cost
    SelfLLM  — your configured LLM (Anthropic/OpenAI/Gemini), highest quality
  ↓
real Anthropic / OpenAI / Gemini SDK → API
```
Two layers of config:
mode—"on"to compress,"off"to passthrough. Off is safe to leave deployed.routing— maps content types (code / error / prose / unknown / long_important) to compressors (verbatim / lingua / selfllm).
A fully-loaded production config:
```python
from leanctx import OpenAI

client = OpenAI(leanctx_config={
    "mode": "on",
    "trigger": {"threshold_tokens": 2000},       # don't bother below this
    "routing": {
        "code": "verbatim",                      # never touch code
        "error": "verbatim",                     # never touch stack traces
        "prose": "lingua",                       # local LLMLingua-2
        "long_important": "selfllm",             # cheap LLM summarization
    },
    "lingua": {"ratio": 0.5, "device": "cpu"},
    "selfllm": {"model": "gpt-4o-mini", "api_key": "sk-...", "ratio": 0.3},
    "observability": {"otel": True},             # opt-in OpenTelemetry
})
```

leanctx is complementary to Anthropic / OpenAI / Gemini prompt caching, not competitive:
- Provider caching wins on stable prefixes: system prompts, tool definitions, retrieved-document pools that don't change across requests. Up to 90 % discount on cached reads.
- leanctx wins on dynamic per-query content: chat history, freshly retrieved docs, tool outputs, log dumps that vary every call.
- They compose. Mark your stable prefix with `cache_control` (provider-specific) and let leanctx compress the variable suffix. Both savings stack (sketched below).
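A minimal composition sketch, assuming the Anthropic wrapper shown elsewhere in this README and Anthropic's standard prompt-caching block syntax (STABLE_SYSTEM_PROMPT and FRESHLY_RETRIEVED_DOCS are placeholders):

```python
import leanctx

client = leanctx.Anthropic(
    leanctx_config={"mode": "on", "routing": {"prose": "lingua"}},
)

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,            # stable prefix: provider cache handles it
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        # variable suffix: leanctx compresses it before the request hits the wire
        {"role": "user", "content": FRESHLY_RETRIEVED_DOCS},
    ],
)
```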
The OTel telemetry leanctx emits includes a provider label that you can correlate with provider-side cache-hit metrics in the same dashboard.
leanctx emits OpenTelemetry spans + metrics for every compression call, opt-in via leanctx_config["observability"]["otel"]. The library is API-only: it never owns the OTel SDK or registers providers. The application configures OTel; leanctx emits.
```python
import leanctx

client = leanctx.Anthropic(
    leanctx_config={
        "mode": "on",
        "observability": {"otel": True},
    },
)
```

Each wrapper-routed call produces one root `leanctx.compress` span (provider, method, input_tokens, output_tokens, cost_usd, duration_ms) plus per-compressor child spans. Five metrics — 4 counters + 1 histogram — labeled by provider/method/status. Closed `leanctx.method` taxonomy: `passthrough | below-threshold | empty | opaque-bailout | verbatim | lingua | selfllm | hybrid`.
See docs/observability.md for the full attribute reference, stream-lifetime contract, app-side OTel SDK setup, and cardinality guidance.
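As a sketch of the app-side wiring (leanctx only talks to the OTel API; the tracer provider, processor, and console exporter below are illustrative choices rather than anything the library requires, and metrics wiring is omitted for brevity):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

import leanctx

# The application owns the OTel SDK: register a tracer provider and exporter once.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# leanctx then emits leanctx.compress spans through the OTel API once opted in.
client = leanctx.Anthropic(
    leanctx_config={"mode": "on", "observability": {"otel": True}},
)
```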
The leanctx bench CLI ships seven named scenarios with versioned JSON output (schema_version: "1"):
```bash
leanctx bench list                                    # show registered scenarios
leanctx bench run lingua-local --workload rag         # offline, no API key
leanctx bench run agent-structural --workload agent   # 5 invariants enforced
leanctx bench run anthropic-e2e --workload chat       # full stack, respx-mocked
leanctx bench run selfllm-anthropic --workload rag    # live API, set ANTHROPIC_API_KEY
leanctx bench run longbench-v2 --workload rag         # public LongBench v2 ablation
```

Versioned schema, multi-run isolation (`--runs N` constructs a fresh client/middleware each run), clean diagnostics for missing extras / API keys (exit 3, no traceback). Built so downstream tooling can consume the JSON without breaking on schema changes.
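A downstream-consumer sketch: guard on the documented `schema_version` before reading anything else. The report path and the result field names below are hypothetical placeholders; check the scenario's actual JSON for the real keys.

```python
import json
from pathlib import Path

# Hypothetical path to a captured bench report; the real location depends on
# how you invoke the CLI and where you save its JSON output.
report = json.loads(Path("bench-report.json").read_text())

# schema_version is the stability contract: refuse unknown versions instead of
# silently misreading renamed fields.
if report.get("schema_version") != "1":
    raise RuntimeError(f"unsupported bench schema: {report.get('schema_version')!r}")

# Field names below are illustrative only.
print(report.get("scenario"), report.get("status"))
```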
```bash
pip install leanctx                              # core (passthrough only — useful for testing the wrapper)
pip install 'leanctx[anthropic,openai,gemini]'   # provider SDKs
pip install 'leanctx[lingua]'                    # + LLMLingua-2 local compression (~1.2 GB on first call)
pip install 'leanctx[otel]'                      # + OpenTelemetry API/SDK
pip install 'leanctx[bench]'                     # + respx for offline scenarios
pip install 'leanctx[longbench]'                 # + HuggingFace datasets for LongBench v2
pip install 'leanctx[all]'                       # everything
```

Docker:
```bash
docker build -t leanctx:slim .                             # 341 MB, all provider SDKs
docker build -t leanctx:lingua --build-arg LINGUA=true .   # + LLMLingua-2, ~3 GB
```

| Provider | Drop-in client | Streaming | Compression | SelfLLM target |
|---|---|---|---|---|
| Anthropic | `leanctx.Anthropic` / `AsyncAnthropic` | ✅ | ✅ | ✅ |
| OpenAI | `leanctx.OpenAI` / `AsyncOpenAI` | ✅ | ✅ | ✅ |
| Gemini | `leanctx.Gemini` (`.models` + `.aio.models`) | ✅ | ✅ * | ✅ |
\* Gemini text-only requests compress fully. Requests that include `function_call`, `function_response`, or multimodal (`inline_data`) parts automatically bail out to passthrough — leanctx never rewrites tool-call payloads (that would change tool semantics) and doesn't touch images. Multimodal + function-call compression is on the v0.3.x roadmap. Spans for these calls carry `leanctx.method = opaque-bailout` so you can monitor the share.
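A text-only Gemini sketch (this assumes the wrapper mirrors the google-genai client surface implied by the `.models` / `.aio.models` entry above, and LONG_DOCUMENT is a placeholder):

```python
import leanctx

client = leanctx.Gemini(
    leanctx_config={"mode": "on", "routing": {"prose": "lingua"}},
)

# Text-only requests compress fully; tool-call or multimodal parts would bail
# out to passthrough and the span would carry leanctx.method = opaque-bailout.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=LONG_DOCUMENT,
)
print(response.text)
```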
12 wrapper request paths instrumented (sync + async × stream + non-stream × 3 providers). Stream-path span lifetime closes at the first of: iterator exhaustion, explicit `.close()`, or the `__del__` GC backstop — `duration_ms` covers the full stream lifetime.
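For instance, with the OpenAI wrapper and standard OpenAI streaming (the span described above closes once this loop exhausts the iterator):

```python
from leanctx import OpenAI

client = OpenAI(leanctx_config={"mode": "on"})

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=512,
    messages=[{"role": "user", "content": LONG_DOCUMENT}],
    stream=True,
)

# Iterating to exhaustion ends the stream lifetime, which closes the
# leanctx.compress span and finalizes duration_ms.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")
```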
v0.3.1 is on PyPI. Built across a 5-round Codex-reviewed RLCR loop; 257 tests passing, ruff + mypy --strict clean across 40 source files.
- v0.1 — Python SDK, drop-in wrappers, LLMLingua-2 + SelfLLM (Anthropic), classifier, router, dedup + purge-errors strategies, LangChain helpers, Docker
- v0.2 — SelfLLM on OpenAI + Gemini, block-aware compression (tool_use / tool_result preserved), Gemini contents normalization, LCEL `compress_runnable`
- v0.3 — OpenTelemetry observability across 12 wrapper paths, `leanctx bench` CLI (6 scenarios + versioned schema), `agent-structural` invariant enforcement, public release `v0.3.1` — 2026-04-26
- v0.3.x — full 503-item LongBench v2 sweep, ghcr.io Docker publish, OpenAI Responses-API intercept, multimodal + function-call compression for Gemini, LlamaIndex helpers, TypeScript SDK compression port
- v0.4 — per-tenant attribution (with cardinality cap), Helm chart / K8s sidecar, stateful session dedup with explicit session IDs
MIT. See LICENSE.