Generate optimal LLM context from pandas DataFrames within a token budget.
You have a 100K-row DataFrame. Your LLM has a context window.
df.to_string()gives you millions of tokensdf.head()gives you 5 rows with no statistical context
dfcontext gives you the sweet spot — intelligent, column-type-aware summarization that fits within your token budget. No LLM calls required.
pip install dfcontextOptional dependencies for accurate token counting and YAML output:
pip install dfcontext[all] # tiktoken + pyyaml
pip install dfcontext[tiktoken] # accurate token counting only
pip install dfcontext[yaml] # YAML format output onlyimport pandas as pd
from dfcontext import to_context
df = pd.read_csv("sales.csv") # 100K rows
ctx = to_context(df, token_budget=2000)
print(ctx)Output:
## Dataset overview
- 100,000 rows × 5 columns
## Schema
| Column | Type | Non-null |
|--------|------|----------|
| region | object | 100% |
| sales | float64 | 100% |
| quantity | int64 | 100% |
| date | datetime64[ns] | 100% |
| is_return | bool | 100% |
## Column statistics
### region (categorical, 4 unique)
Top values: East (28.0%), West (25.8%), North (23.2%), South (23.0%)
### sales (numeric)
Range: 4.64 — 8,172.45 | Mean: 1,010.55 | Std: 1,030.04
Distribution: [█▃▁▁▁▁▁▁]
### date (datetime)
Range: 2024-01-01 — 2024-02-11 | Granularity: hourly
### is_return (boolean)
True: 6.0% | False: 94.0%
## Sample rows (diverse selection)
| region | sales | quantity | date | is_return |
|---|---|---|---|---|
| East | 4.64 | 32 | 2024-01-14 | False |
| South | 697.55 | 50 | 2024-01-15 | False |
| West | 8172.45 | 68 | 2024-01-02 | False |- Column-type-aware analysis — different strategies for numeric, categorical, text, datetime, and boolean columns
- Token budget management — output always fits within your specified token limit
- Adaptive detail — higher budgets produce richer stats (percentiles, skewness, outlier rates)
- Query hints — tell it what you're analyzing, and it prioritizes relevant columns
- Correlation detection — find relationships between numeric columns
- Outlier indicators — flag columns with potential outliers (IQR method)
- Multiple formats — Markdown, plain text, or YAML output
- Zero LLM dependency — pure data processing, works with any LLM provider
- Fast — handles 100K rows in under a second
Provide a hint to allocate more token budget to relevant columns:
ctx = to_context(df, token_budget=2000, hint="regional sales trends")
# "region" and "sales" columns get more detailed analysisctx_md = to_context(df, format="markdown") # default
ctx_plain = to_context(df, format="plain") # no markdown syntax
ctx_yaml = to_context(df, format="yaml") # requires pyyamlFor full control, use ContextConfig:
from dfcontext import ContextConfig, to_context
config = ContextConfig(
token_budget=3000,
format="markdown",
hint="churn analysis",
include_schema=True,
include_stats=True,
include_samples=True,
max_sample_rows=5,
)
ctx = to_context(df, config=config)Find relationships between numeric columns:
ctx = to_context(df, token_budget=2000, include_correlations=True)
# Output includes: "sales ↔ quantity: r=+0.823 (strong positive)"Get structured analysis results as Python objects:
from dfcontext import ColumnSummary, analyze_columns
summaries = analyze_columns(df)
for name, s in summaries.items():
print(f"{name}: {s.column_type}, {s.unique_count} unique")
if s.distribution_sketch:
print(f" histogram: [{s.distribution_sketch}]")
if "outlier_rate" in s.stats:
print(f" outliers: {s.stats['outlier_rate'] * 100:.1f}%")ColumnSummary fields: name, dtype, column_type, non_null_rate, unique_count, stats (dict), sample_values (list), distribution_sketch (str | None).
from dfcontext import count_tokens
tokens = count_tokens("some text")import anthropic
from dfcontext import to_context
df = pd.read_csv("sales.csv")
ctx = to_context(df, token_budget=2000, hint="sales trends")
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{
"role": "user",
"content": f"{ctx}\n\nWhat are the key sales trends?",
}],
)| Function | Description |
|---|---|
to_context(df, ...) |
Generate LLM context string from a DataFrame |
analyze_columns(df) |
Get structured column analysis results |
count_tokens(text) |
Count tokens in text |
| Class | Description |
|---|---|
ContextConfig |
Configuration dataclass for to_context() |
ColumnSummary |
Structured result from column analysis |
BudgetPlan |
Token budget allocation plan |
See the examples/ directory for runnable scripts:
- with_claude.py — Analyze a DataFrame with Anthropic Claude
- with_openai.py — Analyze a DataFrame with OpenAI GPT
- compare_dataframes.py — Year-over-year comparison
- budget_tuning.py — See how budget affects output (no API key needed)
- mcp_server.py — Build an MCP tool that summarizes CSV files
MIT