API Reference

Complete API documentation for BenchBox classes and methods. This page highlights the most common entry points; the full autodoc catalog lives under the python-api/ section of the Sphinx build.

Core Classes

BaseBenchmark

Base abstract class that all benchmarks inherit from.

from benchbox.base import BaseBenchmark

class BaseBenchmark(ABC):
    def __init__(self, scale_factor: float = 1.0, output_dir: Optional[Path] = None):
        """Initialize benchmark with scale factor and output directory."""

Methods

`generate_data() -> List[Path]`

Generate benchmark data files.

Returns: List of Path objects pointing to generated data files.

Example:

from benchbox import TPCH

tpch = TPCH(scale_factor=0.1)
data_files = tpch.generate_data()
# Returns: [Path("customer.tbl"), Path("orders.tbl"), Path("lineitem.tbl"), ...]
for file_path in data_files:
    table_name = file_path.stem
    print(f"Generated {table_name} at {file_path}")

`get_queries() -> Dict[Union[int, str], str]`

Get all queries for this benchmark.

Returns: Dictionary mapping query IDs to SQL query strings.

Example:

queries = tpch.get_queries()
# Returns: {1: "SELECT l_returnflag...", 2: "SELECT s_acctbal...", ...}

`get_query(query_id: Union[int, str]) -> str`

Get a specific query by ID.

Parameters:

query_id: Query identifier (integer or string)

Returns: SQL query string.

Example:

# Basic query
query_1 = tpch.get_query(1)

# String-based query ID (for some benchmarks)
primitives_query = primitives.get_query("aggregation_basic")

`translate_query(query_id: Union[int, str], dialect: str) -> str`

Translate query to specific SQL dialect.

Parameters:

query_id: Query identifier
dialect: Target SQL dialect ("postgres", "mysql", "sqlite", "duckdb", etc.)

Returns: Translated SQL query string.

Example:

postgres_query = tpch.translate_query(1, "postgres")
mysql_query = tpch.translate_query(1, "mysql")
duckdb_query = tpch.translate_query(1, "duckdb")  # Recommended default

`get_create_tables_sql() -> str`

Get DDL statements to create benchmark tables.

Returns: DDL statements as string in standard SQL format.

Example:

ddl = tpch.get_create_tables_sql()
# Use with DuckDB (recommended)
import duckdb
conn = duckdb.connect(":memory:")
conn.execute(ddl)

`get_schema() -> List[Dict[str, Any]]`

Get benchmark schema information.

Returns: List of dictionaries describing tables and columns.

Example:

schema = tpch.get_schema()
for table in schema:
    print(f"Table: {table['name']}")
    for column in table['columns']:
        print(f"  {column['name']}: {column['type']}")

Benchmark Classes

TPCH

TPC-H Decision Support Benchmark implementation.

from benchbox import TPCH

tpch = TPCH(scale_factor=1.0, output_dir=None)

Properties

Query Count: 22 analytical queries (Q1-Q22)
Tables: 8 tables (customer, orders, lineitem, part, partsupp, supplier, nation, region)
Scale Factors: 0.001 to 1000+
Recommended Database: DuckDB

Example Usage

# Recommended DuckDB integration
import duckdb
from benchbox import TPCH

conn = duckdb.connect(":memory:")
tpch = TPCH(scale_factor=0.1)

# Generate and load data
data_files = tpch.generate_data()
for file_path in data_files:
    table_name = file_path.stem
    conn.execute(f"""
        CREATE TABLE {table_name} AS
        SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
    """)

# Run queries
result = conn.execute(tpch.get_query(1)).fetchall()

TPCDS

TPC-DS Decision Support Benchmark implementation.

from benchbox import TPCDS

tpcds = TPCDS(scale_factor=1.0, output_dir=None)

Properties

Query Count: 99 complex analytical queries
Tables: 24 tables (retail data warehouse schema)
Features: Window functions, CTEs, complex joins
Recommended Database: DuckDB

Primitives

Database Primitives Benchmark for focused operation testing.

from benchbox import ReadPrimitives

read_primitives = ReadPrimitives(scale_factor=1.0, output_dir=None)

Additional Methods

`get_query_categories() -> List[str]`

Get list of available query categories.

Returns: List of category names.

Example:

categories = primitives.get_query_categories()
# Returns: ["aggregation", "join", "filter", "sort", ...]

`get_queries_by_category(category: str) -> Dict[str, str]`

Get queries filtered by category.

Parameters:

category: Category name ("aggregation", "join", "filter", etc.)

Returns: Dictionary mapping query IDs to SQL strings.

Example:

agg_queries = primitives.get_queries_by_category("aggregation")
join_queries = primitives.get_queries_by_category("join")

Other Benchmarks

SSB: Star Schema Benchmark
AMPLab: Big data benchmark
H2ODB: Data science benchmark
ClickBench: Analytical benchmark
JoinOrder: Canonical IMDb 2013 JOB benchmark for join-order optimization, fixed at scale_factor=1.0
TPCDI: Data integration benchmark
WritePrimitives: Write operation benchmark (INSERT, UPDATE, DELETE, MERGE, etc.)
TPCHavoc: TPC-H syntax variants

All benchmarks follow the same basic interface as shown above.

Configuration

Constructor Parameters

All benchmarks accept these parameters:

Parameter	Type	Default	Description
`scale_factor`	`float`	`1.0`	Dataset size multiplier
`output_dir`	`Optional[Path]`	`None`	Data output directory

JoinOrder is the fixed-size exception: the public benchmark accepts only scale_factor=1.0 because it uses the canonical IMDb 2013 JOB dataset.

Scale Factor Guidelines

Scale Factor	Data Size	Use Case
0.001	~1MB	Unit tests
0.01	~10MB	Development
0.1	~100MB	Integration tests
1.0	~1GB	Benchmarking

Example Configuration

from pathlib import Path
from benchbox import TPCH

# Development configuration
tpch_dev = TPCH(
    scale_factor=0.01,
    output_dir=Path("./benchmark_data")
)

# Production configuration
tpch_prod = TPCH(
    scale_factor=1.0,
    output_dir=Path("/var/lib/benchbox")
)

Database Integration

DuckDB (Recommended)

DuckDB is the recommended database for BenchBox:

import duckdb
from benchbox import TPCH

# Setup
conn = duckdb.connect(":memory:")
tpch = TPCH(scale_factor=0.1)

# Load data
data_files = tpch.generate_data()
ddl = tpch.get_create_tables_sql()
conn.execute(ddl)

for file_path in data_files:
    table_name = file_path.stem
    conn.execute(f"""
        INSERT INTO {table_name}
        SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
    """)

# Run queries (DuckDB uses ANSI SQL by default)
for query_id in range(1, 6):  # First 5 queries
    result = conn.execute(tpch.get_query(query_id)).fetchall()
    print(f"Query {query_id}: {len(result)} rows")

Other Databases

For other databases, use SQL dialect translation:

# PostgreSQL (dialect translation only - adapter not yet available)
postgres_query = tpch.translate_query(1, "postgres")

# MySQL (dialect translation only - adapter not yet available)
mysql_query = tpch.translate_query(1, "mysql")

# SQLite (fully supported)
sqlite_query = tpch.translate_query(1, "sqlite")

Note on Dialect Translation: BenchBox can translate queries to many SQL dialects via SQLGlot, but this doesn't mean platform adapters exist for connecting to those databases. Currently supported platforms include: DuckDB, SQLite, PostgreSQL, ClickHouse, Databricks SQL, BigQuery, Redshift, Snowflake, Trino, Presto, Amazon Athena, Firebolt, Azure Synapse Analytics, Microsoft Fabric, and more. See the Platform Documentation for the full list.

Common Patterns

Basic Benchmark Execution

from benchbox import TPCH
import time

# Initialize
tpch = TPCH(scale_factor=0.1)

# Generate data
data_files = tpch.generate_data()

# Get and run queries
queries = tpch.get_queries()
for query_id, query_sql in list(queries.items())[:3]:  # First 3 queries
    start_time = time.time()
    # Execute query with your database connection
    execution_time = time.time() - start_time
    print(f"Query {query_id}: {execution_time:.3f}s")

Multi-Database Testing

from benchbox import TPCH

tpch = TPCH(scale_factor=0.01)
data_files = tpch.generate_data()

# Test different SQL dialects (note: translation != platform adapter)
dialects = ["duckdb", "clickhouse", "sqlite"]  # Use supported platforms
for dialect in dialects:
    translated_query = tpch.translate_query(1, dialect)
    print(f"{dialect}: {len(translated_query)} chars")

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API Reference

Core Classes

BaseBenchmark

Methods

`generate_data() -> List[Path]`

`get_queries() -> Dict[Union[int, str], str]`

`get_query(query_id: Union[int, str]) -> str`

`translate_query(query_id: Union[int, str], dialect: str) -> str`

`get_create_tables_sql() -> str`

`get_schema() -> List[Dict[str, Any]]`

Benchmark Classes

TPCH

Properties

Example Usage

TPCDS

Properties

Primitives

Additional Methods

`get_query_categories() -> List[str]`

`get_queries_by_category(category: str) -> Dict[str, str]`

Other Benchmarks

Configuration

Constructor Parameters

Scale Factor Guidelines

Example Configuration

Database Integration

DuckDB (Recommended)

Other Databases

Common Patterns

Basic Benchmark Execution

Multi-Database Testing

See Also

FilesExpand file tree

api-reference.md

Latest commit

History

api-reference.md

File metadata and controls

API Reference

Core Classes

BaseBenchmark

Methods

generate_data() -> List[Path]

get_queries() -> Dict[Union[int, str], str]

get_query(query_id: Union[int, str]) -> str

translate_query(query_id: Union[int, str], dialect: str) -> str

get_create_tables_sql() -> str

get_schema() -> List[Dict[str, Any]]

Benchmark Classes

TPCH

Properties

Example Usage

TPCDS

Properties

Primitives

Additional Methods

get_query_categories() -> List[str]

get_queries_by_category(category: str) -> Dict[str, str]

Other Benchmarks

Configuration

Constructor Parameters

Scale Factor Guidelines

Example Configuration

Database Integration

DuckDB (Recommended)

Other Databases

Common Patterns

Basic Benchmark Execution

Multi-Database Testing

See Also

`generate_data() -> List[Path]`

`get_queries() -> Dict[Union[int, str], str]`

`get_query(query_id: Union[int, str]) -> str`

`translate_query(query_id: Union[int, str], dialect: str) -> str`

`get_create_tables_sql() -> str`

`get_schema() -> List[Dict[str, Any]]`

`get_query_categories() -> List[str]`

`get_queries_by_category(category: str) -> Dict[str, str]`