dfcontext

Generate optimal LLM context from pandas DataFrames within a token budget.

Why?

You have a 100K-row DataFrame. Your LLM has a context window.

df.to_string() gives you millions of tokens
df.head() gives you 5 rows with no statistical context

dfcontext sits between the two: column-type-aware summarization (schema, per-column statistics, and a few representative rows) that fits within your token budget. No LLM calls required.

Install

pip install dfcontext

Optional dependencies for accurate token counting and YAML output:

pip install dfcontext[all]       # tiktoken + pyyaml
pip install dfcontext[tiktoken]  # accurate token counting only
pip install dfcontext[yaml]      # YAML format output only

Quick Start

import pandas as pd
from dfcontext import to_context

df = pd.read_csv("sales.csv")  # 100K rows
ctx = to_context(df, token_budget=2000)
print(ctx)

Output:

## Dataset overview
- 100,000 rows × 5 columns

## Schema
| Column | Type | Non-null |
|--------|------|----------|
| region | object | 100% |
| sales | float64 | 100% |
| quantity | int64 | 100% |
| date | datetime64[ns] | 100% |
| is_return | bool | 100% |

## Column statistics
### region (categorical, 4 unique)
Top values: East (28.0%), West (25.8%), North (23.2%), South (23.0%)

### sales (numeric)
Range: 4.64 — 8,172.45 | Mean: 1,010.55 | Std: 1,030.04
Distribution: [█▃▁▁▁▁▁▁]

### date (datetime)
Range: 2024-01-01 — 2024-02-11 | Granularity: hourly

### is_return (boolean)
True: 6.0% | False: 94.0%

## Sample rows (diverse selection)
| region | sales | quantity | date | is_return |
|---|---|---|---|---|
| East | 4.64 | 32 | 2024-01-14 | False |
| South | 697.55 | 50 | 2024-01-15 | False |
| West | 8172.45 | 68 | 2024-01-02 | False |
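
The `Distribution: [█▃▁▁▁▁▁▁]` line above is a compact histogram. As a rough illustration of the idea, here is a minimal sketch that buckets values into equal-width bins and maps each bin count to a block character. The function name `sparkline`, the eight-bin default, and the scaling rule are assumptions made for this example; dfcontext's own rendering may differ.

```python
from typing import Sequence

BLOCKS = "▁▂▃▄▅▆▇█"  # eight levels, lowest to highest


def sparkline(values: Sequence[float], bins: int = 8) -> str:
    """Bucket values into equal-width bins and render the counts as blocks.

    Hypothetical helper for illustration, not part of dfcontext.
    """
    if not values:
        return ""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # If all values are equal, everything lands in the first bin.
        # The maximum value is clamped into the last bin.
        idx = 0 if width == 0 else min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    # Scale each count relative to the fullest bin, which gets the tallest block.
    return "".join(BLOCKS[round(c / peak * (len(BLOCKS) - 1))] for c in counts)


# A right-skewed sample: most values are small, one is very large.
right_skewed = [1, 1, 1, 1, 2, 2, 3, 5, 9, 40]
print(f"[{sparkline(right_skewed)}]")
```

With a pandas column, the input could be something like `df["sales"].dropna().tolist()`.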

Features

Column-type-aware analysis — different strategies for numeric, categorical, text, datetime, and boolean columns
Token budget management — output always fits within your specified token limit
Adaptive detail — higher budgets produce richer stats (percentiles, skewness, outlier rates)
Query hints — tell it what you're analyzing, and it prioritizes relevant columns
Correlation detection — find relationships between numeric columns
Outlier indicators — flag columns with potential outliers (IQR method)
Multiple formats — Markdown, plain text, or YAML output
Zero LLM dependency — pure data processing, works with any LLM provider
Fast — handles 100K rows in under a second
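
For readers unfamiliar with the IQR method mentioned under "Outlier indicators", the sketch below shows one common form of it: a value counts as an outlier if it falls more than `k` times the interquartile range below Q1 or above Q3. The multiplier `k=1.5`, the quantile interpolation method, and the helper name `iqr_outlier_rate` are assumptions for this example; the README does not state which thresholds dfcontext uses.

```python
import statistics
from typing import Sequence


def iqr_outlier_rate(values: Sequence[float], k: float = 1.5) -> float:
    """Return the fraction of values outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration, not part of dfcontext.
    """
    if len(values) < 2:
        return 0.0  # not enough data to estimate quartiles
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = sum(1 for v in values if v < lower or v > upper)
    return outliers / len(values)


# Nine ordinary values and one extreme one.
sales = [10, 12, 11, 13, 12, 11, 10, 14, 13, 250]
rate = iqr_outlier_rate(sales)
print(f"outliers: {rate * 100:.1f}%")
```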

Advanced Usage

Query Hints

Provide a hint to allocate more token budget to relevant columns:

ctx = to_context(df, token_budget=2000, hint="regional sales trends")
# "region" and "sales" columns get more detailed analysis
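
The README does not describe how a hint is matched to columns. Purely as an illustration of the general idea, the sketch below gives a larger share of a token budget to columns whose names overlap with words in the hint. Everything here (the `allocate_budget` name, the prefix-matching rule, the `boost` factor) is hypothetical and should not be read as dfcontext's actual algorithm.

```python
import re
from typing import Dict, List


def allocate_budget(
    columns: List[str], hint: str, budget: int, boost: float = 2.0
) -> Dict[str, int]:
    """Split a token budget across columns, favoring hint-related ones.

    Hypothetical helper for illustration, not part of dfcontext.
    """
    hint_words = set(re.findall(r"[a-z0-9]+", hint.lower()))
    weights: Dict[str, float] = {}
    for col in columns:
        # Split names like "is_return" into parts: {"is", "return"}.
        col_words = set(re.findall(r"[a-z0-9]+", col.lower()))
        # Crude match: a hint word and a name part share a prefix,
        # so "regional" matches "region".
        matched = any(
            h.startswith(c) or c.startswith(h)
            for h in hint_words
            for c in col_words
        )
        weights[col] = boost if matched else 1.0
    total = sum(weights.values())
    # Each column gets a share proportional to its weight, rounded down.
    return {col: int(budget * w / total) for col, w in weights.items()}


columns = ["region", "sales", "quantity", "date", "is_return"]
plan = allocate_budget(columns, hint="regional sales trends", budget=1400)
for col, tokens in plan.items():
    print(f"{col}: {tokens} tokens")
```

Prefix matching is deliberately simple; a real implementation would likely need something more robust for short or ambiguous column names.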

Output Formats

ctx_md = to_context(df, format="markdown")   # default
ctx_plain = to_context(df, format="plain")   # no markdown syntax
ctx_yaml = to_context(df, format="yaml")     # requires pyyaml

Configuration Object

For full control, use ContextConfig:

from dfcontext import ContextConfig, to_context

config = ContextConfig(
    token_budget=3000,
    format="markdown",
    hint="churn analysis",
    include_schema=True,
    include_stats=True,
    include_samples=True,
    max_sample_rows=5,
)
ctx = to_context(df, config=config)

Correlation Detection

Find relationships between numeric columns:

ctx = to_context(df, token_budget=2000, include_correlations=True)
# Output includes: "sales ↔ quantity: r=+0.823 (strong positive)"
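
To make the output line above concrete, here is a minimal sketch of how such a line could be produced: compute the Pearson correlation for each pair of numeric columns, keep pairs above a threshold, and attach a verbal label. The cutoffs (0.7 for "strong", 0.4 for "moderate"), the `min_abs_r` filter, and the function names are assumptions for this example; dfcontext's own thresholds are not documented here.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, report none
    return cov / math.sqrt(var_x * var_y)


def describe(r: float) -> str:
    """Verbal label for a coefficient (cutoffs are assumed, not dfcontext's)."""
    strength = "strong" if abs(r) >= 0.7 else "moderate" if abs(r) >= 0.4 else "weak"
    direction = "positive" if r >= 0 else "negative"
    return f"{strength} {direction}"


def correlation_lines(
    columns: Dict[str, Sequence[float]], min_abs_r: float = 0.4
) -> List[str]:
    """One summary line per column pair whose |r| reaches min_abs_r."""
    lines = []
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= min_abs_r:
            lines.append(f"{a} ↔ {b}: r={r:+.3f} ({describe(r)})")
    return lines


columns = {
    "sales": [100.0, 200.0, 300.0, 400.0, 500.0],
    "quantity": [1, 2, 3, 4, 5],
    "returns": [3, 1, 4, 1, 5],  # only weakly related to the other two
}
lines = correlation_lines(columns)
print("\n".join(lines))
```

Filtering weak pairs keeps the summary short, which matters when the whole context has to fit a token budget.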

Column Analysis

Get structured analysis results as Python objects:

from dfcontext import ColumnSummary, analyze_columns

summaries = analyze_columns(df)
for name, s in summaries.items():
    print(f"{name}: {s.column_type}, {s.unique_count} unique")
    if s.distribution_sketch:
        print(f"  histogram: [{s.distribution_sketch}]")
    if "outlier_rate" in s.stats:
        print(f"  outliers: {s.stats['outlier_rate'] * 100:.1f}%")

ColumnSummary fields: name, dtype, column_type, non_null_rate, unique_count, stats (dict), sample_values (list), distribution_sketch (str | None).

Token Counting

from dfcontext import count_tokens

tokens = count_tokens("some text")
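
Since `tiktoken` is an optional dependency, token counting presumably falls back to an approximation when it is not installed. The sketch below shows one way such a fallback could look: use `tiktoken` when it is importable and usable, otherwise estimate roughly four characters per token. The `estimate_tokens` name, the `cl100k_base` default, and the four-characters heuristic are assumptions for this example, not a description of how `count_tokens` is implemented.

```python
def estimate_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken if available, else approximate.

    Hypothetical helper for illustration, not part of dfcontext.
    """
    try:
        import tiktoken  # optional dependency

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # tiktoken is missing or its encoding data could not be loaded.
        # Rough heuristic: about 4 characters per token for English text.
        if not text:
            return 0
        return max(1, len(text) // 4)


n = estimate_tokens("some text")
print(f"about {n} tokens")
```

The heuristic tends to be less reliable for code, non-English text, and tables, which is one reason to install `tiktoken` when exact budgets matter.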

Use with Claude

import anthropic
import pandas as pd
from dfcontext import to_context

df = pd.read_csv("sales.csv")
ctx = to_context(df, token_budget=2000, hint="sales trends")

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"{ctx}\n\nWhat are the key sales trends?",
    }],
)

API Reference

| Function | Description |
|----------|-------------|
| `to_context(df, ...)` | Generate LLM context string from a DataFrame |
| `analyze_columns(df)` | Get structured column analysis results |
| `count_tokens(text)` | Count tokens in text |

| Class | Description |
|-------|-------------|
| `ContextConfig` | Configuration dataclass for `to_context()` |
| `ColumnSummary` | Structured result from column analysis |
| `BudgetPlan` | Token budget allocation plan |

Examples

See the examples/ directory for runnable scripts:

with_claude.py — Analyze a DataFrame with Anthropic Claude
with_openai.py — Analyze a DataFrame with OpenAI GPT
compare_dataframes.py — Year-over-year comparison
budget_tuning.py — See how budget affects output (no API key needed)
mcp_server.py — Build an MCP tool that summarizes CSV files

License

MIT
