sserada/dfcontext


dfcontext

Generate optimal LLM context from pandas DataFrames within a token budget.


Why?

You have a 100K-row DataFrame. Your LLM has a limited context window.

  • df.to_string() gives you millions of tokens
  • df.head() gives you 5 rows with no statistical context

dfcontext gives you the sweet spot — intelligent, column-type-aware summarization that fits within your token budget. No LLM calls required.
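
To get a feel for the gap between the two options above, the sketch below builds a synthetic 100K-row frame and compares rough size estimates. It does not use dfcontext: the 4-characters-per-token ratio is only a common rule of thumb, and the column names and distributions are invented for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: roughly 4 characters per token. This is a rule of thumb,
# not the counter that dfcontext or any specific tokenizer uses.
CHARS_PER_TOKEN = 4


def rough_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


# Synthetic data; columns and distributions are made up for this sketch.
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame(
    {
        "region": rng.choice(["East", "West", "North", "South"], size=n_rows),
        "sales": rng.exponential(1000.0, size=n_rows).round(2),
        "quantity": rng.integers(1, 100, size=n_rows),
    }
)

full_dump = rough_tokens(df.to_string())
head_only = rough_tokens(df.head().to_string())
print(f"df.to_string(): ~{full_dump:,} tokens")
print(f"df.head():      ~{head_only:,} tokens")
```

The full dump grows linearly with row count and column width, while the head stays tiny but says nothing about ranges or distributions.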

Install

pip install dfcontext

Optional dependencies for accurate token counting and YAML output:

pip install dfcontext[all]       # tiktoken + pyyaml
pip install dfcontext[tiktoken]  # accurate token counting only
pip install dfcontext[yaml]      # YAML format output only

Quick Start

import pandas as pd
from dfcontext import to_context

df = pd.read_csv("sales.csv", parse_dates=["date"])  # 100K rows
ctx = to_context(df, token_budget=2000)
print(ctx)

Output:

## Dataset overview
- 100,000 rows × 5 columns

## Schema
| Column | Type | Non-null |
|--------|------|----------|
| region | object | 100% |
| sales | float64 | 100% |
| quantity | int64 | 100% |
| date | datetime64[ns] | 100% |
| is_return | bool | 100% |

## Column statistics
### region (categorical, 4 unique)
Top values: East (28.0%), West (25.8%), North (23.2%), South (23.0%)

### sales (numeric)
Range: 4.64 — 8,172.45 | Mean: 1,010.55 | Std: 1,030.04
Distribution: [█▃▁▁▁▁▁▁]

### date (datetime)
Range: 2024-01-01 — 2024-02-11 | Granularity: hourly

### is_return (boolean)
True: 6.0% | False: 94.0%

## Sample rows (diverse selection)
| region | sales | quantity | date | is_return |
|---|---|---|---|---|
| East | 4.64 | 32 | 2024-01-14 | False |
| South | 697.55 | 50 | 2024-01-15 | False |
| West | 8172.45 | 68 | 2024-01-02 | False |
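
The type labels in the sample output (categorical, numeric, datetime, boolean) suggest a per-column classification step. Here is a minimal sketch of one way to do that with pandas. The `classify_column` helper and its `max_categories` threshold are hypothetical and are not taken from dfcontext's source.

```python
import pandas as pd
from pandas.api import types as ptypes


def classify_column(series: pd.Series, max_categories: int = 20) -> str:
    """Hypothetical classifier; the threshold is an illustrative assumption."""
    # Check bool before numeric: pandas treats bool as a numeric dtype.
    if ptypes.is_bool_dtype(series):
        return "boolean"
    if ptypes.is_datetime64_any_dtype(series):
        return "datetime"
    if ptypes.is_numeric_dtype(series):
        return "numeric"
    # Few distinct values -> categorical, otherwise treat as free text.
    if series.nunique(dropna=True) <= max_categories:
        return "categorical"
    return "text"


df = pd.DataFrame(
    {
        "region": ["East", "West", "East", "South"],
        "sales": [4.64, 697.55, 8172.45, 120.0],
        "date": pd.to_datetime(
            ["2024-01-14", "2024-01-15", "2024-01-02", "2024-01-03"]
        ),
        "is_return": [False, False, True, False],
    }
)

labels = {name: classify_column(df[name]) for name in df.columns}
print(labels)
```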

Features

  • Column-type-aware analysis — different strategies for numeric, categorical, text, datetime, and boolean columns
  • Token budget management — output always fits within your specified token limit
  • Adaptive detail — higher budgets produce richer stats (percentiles, skewness, outlier rates)
  • Query hints — tell it what you're analyzing, and it prioritizes relevant columns
  • Correlation detection — find relationships between numeric columns
  • Outlier indicators — flag columns with potential outliers (IQR method)
  • Multiple formats — Markdown, plain text, or YAML output
  • Zero LLM dependency — pure data processing, works with any LLM provider
  • Fast — handles 100K rows in under a second
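
The IQR method named in the outlier bullet above can be sketched in a few lines of pandas. This is the textbook rule with the conventional multiplier of 1.5, which is an assumption here; the README does not state which multiplier dfcontext uses, and `iqr_outlier_rate` is a hypothetical helper name.

```python
import pandas as pd


def iqr_outlier_rate(series: pd.Series, k: float = 1.5) -> float:
    """Fraction of non-null values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = series.dropna()
    if values.empty:
        return 0.0
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Boolean mask of outliers; its mean is the outlier rate.
    return float(((values < lower) | (values > upper)).mean())


sales = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 500])
rate = iqr_outlier_rate(sales)
print(f"outliers: {rate * 100:.1f}%")
```

Only the value 500 falls outside the fences in this toy series, so the rate is one in nine.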

Advanced Usage

Query Hints

Provide a hint to allocate more token budget to relevant columns:

ctx = to_context(df, token_budget=2000, hint="regional sales trends")
# "region" and "sales" columns get more detailed analysis
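
One way to picture hint-driven allocation is a weighted split of the budget across columns. The sketch below is purely illustrative: the prefix-matching rule, the `boost` factor, and the `allocate_budget` name are assumptions, and dfcontext's actual matching and planning logic may differ.

```python
def allocate_budget(columns, total_tokens, hint="", boost=3.0):
    """Split total_tokens across columns, favoring names matched by the hint.

    Hypothetical heuristic: a column is "relevant" if any hint word starts
    with the column name (so "regional" matches "region").
    """
    hint_words = hint.lower().split()
    weights = {}
    for col in columns:
        relevant = any(word.startswith(col.lower()) for word in hint_words)
        weights[col] = boost if relevant else 1.0
    total_weight = sum(weights.values())
    # Truncating to int keeps the sum at or below the budget.
    return {col: int(total_tokens * w / total_weight) for col, w in weights.items()}


plan = allocate_budget(
    ["region", "sales", "quantity", "date", "is_return"],
    total_tokens=2000,
    hint="regional sales trends",
)
for col, tokens in plan.items():
    print(f"{col}: {tokens} tokens")
```

With these assumed weights, "region" and "sales" each receive three times the share of the other columns.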

Output Formats

ctx_md = to_context(df, format="markdown")   # default
ctx_plain = to_context(df, format="plain")   # no markdown syntax
ctx_yaml = to_context(df, format="yaml")     # requires pyyaml

Configuration Object

For full control, use ContextConfig:

from dfcontext import ContextConfig, to_context

config = ContextConfig(
    token_budget=3000,
    format="markdown",
    hint="churn analysis",
    include_schema=True,
    include_stats=True,
    include_samples=True,
    max_sample_rows=5,
)
ctx = to_context(df, config=config)

Correlation Detection

Find relationships between numeric columns:

ctx = to_context(df, token_budget=2000, include_correlations=True)
# Output includes: "sales ↔ quantity: r=+0.823 (strong positive)"
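
For intuition, a line like the one in the comment above can be produced with plain pandas. This sketch uses Pearson correlation via `DataFrame.corr`; the 0.5 and 0.7 cutoffs for "moderate" and "strong" are assumptions, as is the `describe_correlations` helper, since the README does not document dfcontext's thresholds.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def describe_correlations(df: pd.DataFrame, min_abs_r: float = 0.5) -> list:
    """Describe numeric column pairs with |r| >= min_abs_r (assumed cutoff)."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()  # Pearson correlation by default
    lines = []
    for a, b in combinations(numeric.columns, 2):
        r = corr.loc[a, b]
        if pd.isna(r) or abs(r) < min_abs_r:
            continue
        strength = "strong" if abs(r) >= 0.7 else "moderate"
        direction = "positive" if r > 0 else "negative"
        lines.append(f"{a} ↔ {b}: r={r:+.3f} ({strength} {direction})")
    return lines


# Synthetic data: sales depends on quantity, "unrelated" is independent noise.
rng = np.random.default_rng(42)
quantity = rng.integers(1, 100, size=1000)
df = pd.DataFrame(
    {
        "sales": quantity * 20 + rng.normal(0, 100, size=1000),
        "quantity": quantity,
        "unrelated": rng.normal(0, 1, size=1000),
    }
)

lines = describe_correlations(df)
for line in lines:
    print(line)
```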

Column Analysis

Get structured analysis results as Python objects:

from dfcontext import ColumnSummary, analyze_columns

summaries = analyze_columns(df)
for name, s in summaries.items():
    print(f"{name}: {s.column_type}, {s.unique_count} unique")
    if s.distribution_sketch:
        print(f"  histogram: [{s.distribution_sketch}]")
    if "outlier_rate" in s.stats:
        print(f"  outliers: {s.stats['outlier_rate'] * 100:.1f}%")

ColumnSummary fields: name, dtype, column_type, non_null_rate, unique_count, stats (dict), sample_values (list), distribution_sketch (str | None).
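
A `distribution_sketch` string such as `█▃▁▁▁▁▁▁` can be read as a histogram drawn with Unicode block characters. The sketch below shows one plausible way to build such a string with NumPy; the bin count, the scaling rule, and the `sparkline` name are assumptions rather than dfcontext's implementation.

```python
import numpy as np

# Eight block characters, from lowest to highest bar.
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values, bins: int = 8) -> str:
    """Render a histogram of values as a string of block characters."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]  # ignore missing values
    counts, _ = np.histogram(arr, bins=bins)
    peak = counts.max()
    if peak == 0:
        return BLOCKS[0] * bins
    # Scale each bin count relative to the tallest bin.
    levels = np.round(counts / peak * (len(BLOCKS) - 1)).astype(int)
    return "".join(BLOCKS[i] for i in levels)


rng = np.random.default_rng(0)
skewed = rng.exponential(1000.0, size=10_000)
sketch = sparkline(skewed)
print(f"[{sketch}]")
```

A right-skewed column puts most of its mass in the first bin, which is why the leading character is the tallest bar.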

Token Counting

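Count the tokens in any string, for example to check a generated context against your budget. Counts are accurate when the optional tiktoken dependency is installed.

from dfcontext import count_tokens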

tokens = count_tokens("some text")

Use with Claude

import anthropic
import pandas as pd
from dfcontext import to_context

df = pd.read_csv("sales.csv", parse_dates=["date"])
ctx = to_context(df, token_budget=2000, hint="sales trends")

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"{ctx}\n\nWhat are the key sales trends?",
    }],
)
# The reply text is in the first content block of the response.
print(response.content[0].text)

API Reference

| Function | Description |
|----------|-------------|
| to_context(df, ...) | Generate LLM context string from a DataFrame |
| analyze_columns(df) | Get structured column analysis results |
| count_tokens(text) | Count tokens in text |

| Class | Description |
|-------|-------------|
| ContextConfig | Configuration dataclass for to_context() |
| ColumnSummary | Structured result from column analysis |
| BudgetPlan | Token budget allocation plan |

Examples

See the examples/ directory for runnable scripts.

License

MIT
