Complete API documentation for BenchBox classes and methods. This page highlights the most common entry points; the full autodoc catalog lives under the python-api/ section of the Sphinx build.
Base abstract class that all benchmarks inherit from.
from benchbox.base import BaseBenchmark
class BaseBenchmark(ABC):
def __init__(self, scale_factor: float = 1.0, output_dir: Optional[Path] = None):
"""Initialize benchmark with scale factor and output directory."""Generate benchmark data files.
Returns: List of Path objects pointing to generated data files.
Example:
from benchbox import TPCH
tpch = TPCH(scale_factor=0.1)
data_files = tpch.generate_data()
# Returns: [Path("customer.tbl"), Path("orders.tbl"), Path("lineitem.tbl"), ...]
for file_path in data_files:
table_name = file_path.stem
print(f"Generated {table_name} at {file_path}")Get all queries for this benchmark.
Returns: Dictionary mapping query IDs to SQL query strings.
Example:
queries = tpch.get_queries()
# Returns: {1: "SELECT l_returnflag...", 2: "SELECT s_acctbal...", ...}Get a specific query by ID.
Parameters:
query_id: Query identifier (integer or string)
Returns: SQL query string.
Example:
# Basic query
query_1 = tpch.get_query(1)
# String-based query ID (for some benchmarks)
primitives_query = primitives.get_query("aggregation_basic")Translate query to specific SQL dialect.
Parameters:
query_id: Query identifierdialect: Target SQL dialect ("postgres", "mysql", "sqlite", "duckdb", etc.)
Returns: Translated SQL query string.
Example:
postgres_query = tpch.translate_query(1, "postgres")
mysql_query = tpch.translate_query(1, "mysql")
duckdb_query = tpch.translate_query(1, "duckdb") # Recommended defaultGet DDL statements to create benchmark tables.
Returns: DDL statements as string in standard SQL format.
Example:
ddl = tpch.get_create_tables_sql()
# Use with DuckDB (recommended)
import duckdb
conn = duckdb.connect(":memory:")
conn.execute(ddl)Get benchmark schema information.
Returns: List of dictionaries describing tables and columns.
Example:
schema = tpch.get_schema()
for table in schema:
print(f"Table: {table['name']}")
for column in table['columns']:
print(f" {column['name']}: {column['type']}")TPC-H Decision Support Benchmark implementation.
from benchbox import TPCH
tpch = TPCH(scale_factor=1.0, output_dir=None)- Query Count: 22 analytical queries (Q1-Q22)
- Tables: 8 tables (customer, orders, lineitem, part, partsupp, supplier, nation, region)
- Scale Factors: 0.001 to 1000+
- Recommended Database: DuckDB
# Recommended DuckDB integration
import duckdb
from benchbox import TPCH
conn = duckdb.connect(":memory:")
tpch = TPCH(scale_factor=0.1)
# Generate and load data
data_files = tpch.generate_data()
for file_path in data_files:
table_name = file_path.stem
conn.execute(f"""
CREATE TABLE {table_name} AS
SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
""")
# Run queries
result = conn.execute(tpch.get_query(1)).fetchall()TPC-DS Decision Support Benchmark implementation.
from benchbox import TPCDS
tpcds = TPCDS(scale_factor=1.0, output_dir=None)- Query Count: 99 complex analytical queries
- Tables: 24 tables (retail data warehouse schema)
- Features: Window functions, CTEs, complex joins
- Recommended Database: DuckDB
Database Primitives Benchmark for focused operation testing.
from benchbox import ReadPrimitives
read_primitives = ReadPrimitives(scale_factor=1.0, output_dir=None)Get list of available query categories.
Returns: List of category names.
Example:
categories = primitives.get_query_categories()
# Returns: ["aggregation", "join", "filter", "sort", ...]Get queries filtered by category.
Parameters:
category: Category name ("aggregation", "join", "filter", etc.)
Returns: Dictionary mapping query IDs to SQL strings.
Example:
agg_queries = primitives.get_queries_by_category("aggregation")
join_queries = primitives.get_queries_by_category("join")- SSB: Star Schema Benchmark
- AMPLab: Big data benchmark
- H2ODB: Data science benchmark
- ClickBench: Analytical benchmark
- JoinOrder: Canonical IMDb 2013 JOB benchmark for join-order optimization, fixed at
scale_factor=1.0 - TPCDI: Data integration benchmark
- WritePrimitives: Write operation benchmark (INSERT, UPDATE, DELETE, MERGE, etc.)
- TPCHavoc: TPC-H syntax variants
All benchmarks follow the same basic interface as shown above.
All benchmarks accept these parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
scale_factor |
float |
1.0 |
Dataset size multiplier |
output_dir |
Optional[Path] |
None |
Data output directory |
JoinOrder is the fixed-size exception: the public benchmark accepts only
scale_factor=1.0 because it uses the canonical IMDb 2013 JOB dataset.
| Scale Factor | Data Size | Use Case |
|---|---|---|
| 0.001 | ~1MB | Unit tests |
| 0.01 | ~10MB | Development |
| 0.1 | ~100MB | Integration tests |
| 1.0 | ~1GB | Benchmarking |
from pathlib import Path
from benchbox import TPCH
# Development configuration
tpch_dev = TPCH(
scale_factor=0.01,
output_dir=Path("./benchmark_data")
)
# Production configuration
tpch_prod = TPCH(
scale_factor=1.0,
output_dir=Path("/var/lib/benchbox")
)DuckDB is the recommended database for BenchBox:
import duckdb
from benchbox import TPCH
# Setup
conn = duckdb.connect(":memory:")
tpch = TPCH(scale_factor=0.1)
# Load data
data_files = tpch.generate_data()
ddl = tpch.get_create_tables_sql()
conn.execute(ddl)
for file_path in data_files:
table_name = file_path.stem
conn.execute(f"""
INSERT INTO {table_name}
SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
""")
# Run queries (DuckDB uses ANSI SQL by default)
for query_id in range(1, 6): # First 5 queries
result = conn.execute(tpch.get_query(query_id)).fetchall()
print(f"Query {query_id}: {len(result)} rows")For other databases, use SQL dialect translation:
# PostgreSQL (dialect translation only - adapter not yet available)
postgres_query = tpch.translate_query(1, "postgres")
# MySQL (dialect translation only - adapter not yet available)
mysql_query = tpch.translate_query(1, "mysql")
# SQLite (fully supported)
sqlite_query = tpch.translate_query(1, "sqlite")Note on Dialect Translation: BenchBox can translate queries to many SQL dialects via SQLGlot, but this doesn't mean platform adapters exist for connecting to those databases. Currently supported platforms include: DuckDB, SQLite, PostgreSQL, ClickHouse, Databricks SQL, BigQuery, Redshift, Snowflake, Trino, Presto, Amazon Athena, Firebolt, Azure Synapse Analytics, Microsoft Fabric, and more. See the Platform Documentation for the full list.
from benchbox import TPCH
import time
# Initialize
tpch = TPCH(scale_factor=0.1)
# Generate data
data_files = tpch.generate_data()
# Get and run queries
queries = tpch.get_queries()
for query_id, query_sql in list(queries.items())[:3]: # First 3 queries
start_time = time.time()
# Execute query with your database connection
execution_time = time.time() - start_time
print(f"Query {query_id}: {execution_time:.3f}s")from benchbox import TPCH
tpch = TPCH(scale_factor=0.01)
data_files = tpch.generate_data()
# Test different SQL dialects (note: translation != platform adapter)
dialects = ["duckdb", "clickhouse", "sqlite"] # Use supported platforms
for dialect in dialects:
translated_query = tpch.translate_query(1, dialect)
print(f"{dialect}: {len(translated_query)} chars")- Getting Started - Basic usage tutorial
- Examples - Practical code examples
- Configuration - Configuration guide
- Benchmarks - Individual benchmark documentation
This API reference covers the core BenchBox functionality. For implementation details, see the source code documentation.